LLM backends¶
axiom.backends.base¶
llm_engine/base.py
Abstract base class and shared data types for all Axiom AI LLM backends.
Both the Arbitrator (narrative agent) and the Chronicler (world simulation agent) are decoupled from any concrete LLM provider through this interface. Swapping between a local Ollama model and a remote Gemini model requires only changing which concrete subclass is instantiated.
Tool Call Protocol: the LLM is instructed to wrap any structured
state-change JSON inside a fenced block delimited by ~~~json / ~~~
markers. This delimiter was chosen deliberately to avoid ambiguity with
standard markdown triple-backtick code fences that may appear legitimately
in narrative prose.
Example LLM output:
The dragon breathes fire. The knight loses his shield.
~~~json
{
"state_changes": [
{"entity_id": "knight", "stat_key": "Shield", "delta": -1}
],
"narrative_events": ["dragon_attack"]
}
~~~
- class axiom.backends.base.LLMResponse(narrative_text, tool_call, finish_reason)[source]¶
Parsed response from any LLM backend.
- narrative_text¶
The prose portion of the response, with the ~~~json … ~~~ block stripped out.
- Type:
- tool_call¶
The parsed JSON object or list from the fenced block, or None if the LLM produced no tool call.
- exception axiom.backends.base.LLMConnectionError[source]¶
Raised when the LLM backend is unreachable.
This covers connection refused, DNS failures, timeouts, and HTTP 5xx responses that indicate the server is down.
- exception axiom.backends.base.LLMParseError[source]¶
Raised when the LLM response cannot be parsed into the expected structure.
This covers malformed JSON inside the ~~~json block, missing required fields, or an entirely unexpected response format.
- exception axiom.backends.base.GenerationCancelled[source]¶
Voluntary cancellation of a generation (TICKET-033).
Raised when LLMBackend.cancel_event is set during a wait (429 retry, pacing) or at a cooperative boundary (between Populate chunks/targets). This is NOT an error: callers translate it into a “cancelled” signal, never into an error popup.
- class axiom.backends.base.LLMBackend[source]¶
Abstract interface for all Axiom AI LLM provider clients.
Concrete subclasses must implement complete(), stream_tokens(), and is_available(). The parse_tool_call() helper is provided here and is shared by all subclasses.
Optional hooks, set by the caller after construction — zero Qt, a backend that never consults them stays valid:
on_status: progress callback(str) (e.g. retry countdown);
cancel_event: a threading.Event set to request a stop — cooperative backends/callers then raise GenerationCancelled.
- abstractmethod complete(messages, stream=False, temperature=0.7, top_p=1.0, response_format=None, stop_sequences=None, max_tokens=None)[source]¶
Send a list of messages and return a fully assembled LLMResponse.
- Parameters:
messages (list[LLMMessage]) – Conversation history including the system prompt.
stream (bool) – If True, the implementation may still return a complete LLMResponse (assembled from the stream internally); for token-by-token streaming use stream_tokens().
temperature (float) – Sampling temperature (0.0 to 1.0).
top_p (float) – Nucleus sampling parameter (0.0 to 1.0).
response_format (str | None) – Optional format constraint (e.g. “json”).
stop_sequences (list[str] | None) – Optional list of strings that trigger generation stop.
max_tokens (int | None) – Optional limit on the number of tokens to generate.
- Returns:
Parsed LLMResponse with narrative_text, optional tool_call, and finish_reason.
- Raises:
LLMConnectionError – If the backend is unreachable.
LLMParseError – If the response structure is unrecognisable.
- Return type:
- abstractmethod stream_tokens(messages, temperature=0.7, top_p=1.0, response_format=None, stop_sequences=None, max_tokens=None)[source]¶
Yield individual tokens as they arrive from the LLM backend.
Intended for the PySide6 typewriter UI effect (Phase 3). The caller is responsible for accumulating tokens and calling parse_tool_call() on the assembled string when the stream ends.
- Parameters:
messages (list[LLMMessage]) – Conversation history including the system prompt.
temperature (float) – Sampling temperature (0.0 to 1.0).
top_p (float) – Nucleus sampling parameter (0.0 to 1.0).
response_format (str | None) – Optional format constraint (e.g. “json”).
stop_sequences (list[str] | None) – Optional list of strings that trigger generation stop.
max_tokens (int | None) – Optional limit on the number of tokens to generate.
- Yields:
Individual token strings in the order they are produced.
- Raises:
LLMConnectionError – If the backend becomes unreachable mid-stream.
- Return type:
- abstractmethod is_available()[source]¶
Perform a lightweight health check against the backend.
Must never raise; any failure must be caught and returned as False.
- Returns:
True if the backend is reachable and ready, False otherwise.
- Return type:
- classmethod parse_tool_call(raw_response)[source]¶
Extract narrative text and tool-call JSON from a raw LLM response.
Resilient parsing:
Checks for common markdown fences (
~~~json, triple-backtick json, etc).Fallback: heuristic search for JSON objects or arrays.
Normalizes minor schema deviations (e.g., missing ‘stats’ key or flat params).
- Parameters:
raw_response (str) – The complete raw string returned by the LLM.
- Returns:
A (narrative_text, tool_call) tuple — narrative_text is the response with the JSON block removed, and tool_call is the parsed dict/list, or None if no valid JSON was found.
- Return type:
axiom.backends.gemini¶
llm_engine/gemini_client.py
LLM backend client for Google Gemini models (remote / cloud fallback).
Uses the google-genai SDK (google.genai). The client translates Axiom AI’s internal list[LLMMessage] format to Gemini’s Content objects: the first system-role message becomes the model’s system_instruction, and the remaining turns become the contents list.
Typical usage:
from axiom.backends.gemini import GeminiClient
llm = GeminiClient(api_key="YOUR_KEY", model_name="gemini-2.0-flash")
if llm.is_available():
response = llm.complete(messages)
print(response.narrative_text)
- class axiom.backends.gemini.GeminiClient(api_key, model_name='gemini-2.0-flash', requests_per_minute=0, fallback_model='')[source]¶
LLM backend targeting Google Gemini via the google-genai SDK.
- Parameters:
api_key (str) – Google Generative AI API key.
model_name (str) – Gemini model identifier. Defaults to “gemini-2.0-flash”.
requests_per_minute (int) – Soft rate limiter (TICKET-031). 0 = unlimited.
fallback_model (str) – Model tried when the primary model’s quota is still exhausted after the retries (quotas are per-model). “” = none.
- is_available()[source]¶
Check whether the Gemini API is reachable with the configured key.
Attempts to list available models. Returns True on success, False on any exception. Never raises.
- Returns:
True if the API responds without error, False otherwise.
- Return type:
- list_models()[source]¶
Return the generation-capable model names, without the “models/” prefix.
Empty list on any error — best-effort, used by the settings dialog’s model picker (TICKET-062).
- complete(messages, stream=False, temperature=0.7, top_p=1.0, response_format=None, stop_sequences=None, max_tokens=None)[source]¶
Send messages to Gemini and return a parsed LLMResponse.
Translates list[LLMMessage] into Gemini format: - The first message with role=”system” becomes system_instruction. - Remaining messages map user/assistant → user/model roles.
- Parameters:
messages (list[LLMMessage]) – Conversation turns (system, user, assistant).
stream (bool) – Ignored; use stream_tokens() for token streaming.
temperature (float) – Sampling temperature (0.0 to 1.0).
top_p (float) – Nucleus sampling parameter (0.0 to 1.0).
response_format (str | None) – Currently unused for Gemini.
stop_sequences (list[str] | None) – Custom strings to trigger generation stop.
max_tokens (int | None) – Optional limit on the number of tokens to generate.
- Returns:
LLMResponse with narrative_text, optional tool_call, finish_reason.
- Raises:
LLMConnectionError – On network failure or API error.
LLMParseError – On unexpected response structure or bad tool-call JSON.
- Return type:
- generate_image_bytes(prompt, aspect_ratio=None)[source]¶
Generate an image from a text prompt and return the raw bytes.
Used by the “gemini” image backend (axiom/image_generator.py) with an image-capable model (e.g. “gemini-2.5-flash-image”). Goes through the same quota-resilience path as text calls (TICKET-031 pacing/429 retry, TICKET-033 status/cancellation hooks).
- Parameters:
- Returns:
The image bytes (PNG/JPEG as returned by the API), or None if the response contains no image part.
- Raises:
LLMConnectionError – On network failure, API error or exhausted quota.
GenerationCancelled – If cancel_event is set during a retry wait.
- Return type:
bytes | None
- stream_tokens(messages, temperature=0.7, top_p=1.0, response_format=None, stop_sequences=None, max_tokens=None)[source]¶
Yield tokens from a streaming Gemini response.
- Parameters:
messages (list[LLMMessage]) – Conversation turns (system, user, assistant).
temperature (float) – Sampling temperature (0.0 to 1.0).
top_p (float) – Nucleus sampling parameter (0.0 to 1.0).
response_format (str | None) – Currently unused for Gemini.
stop_sequences (list[str] | None) – Custom strings to trigger generation stop.
max_tokens (int | None) – Optional limit on the number of tokens to generate.
- Yields:
Individual token strings in arrival order.
- Raises:
LLMConnectionError – On network failure.
- Return type:
axiom.backends.universal¶
llm_engine/universal_client.py
Universal OpenAI-compatible API client for Axiom AI. Supports any local/remote backend that implements the OpenAI /v1/chat/completions API (e.g., LM Studio, KoboldCPP, Ollama, standard OpenAI, etc.).
- class axiom.backends.universal.UniversalClient(base_url, api_key, model_name, extra_headers=None, max_stop_sequences=None, fallback_api_keys=None)[source]¶
OpenAI-compatible LLM client using httpx.
- Parameters:
base_url (str) – The base URL (e.g., http://localhost:1234/v1).
api_key (str) – Optional API key for authorization.
model_name (str) – The model identifier to request.
extra_headers (dict[str, str] | None) – Optional headers merged into every request. Lets a provider use a non-Bearer auth scheme (e.g. Anthropic’s x-api-key + anthropic-version) — pass api_key=”” then.
max_stop_sequences (int | None) – Optional cap on the number of stop sequences sent (OpenAI rejects more than 4; most providers have no limit).
fallback_api_keys (list[str] | None) – Optional pool of spare Bearer keys. When a request fails with an auth/quota status (401/402/403/429), the client switches to the next key — stickily — and retries, until the pool is exhausted (TICKET-062: shared beta keys). Only meaningful with Bearer auth (api_key).
- complete(messages, stream=False, temperature=0.7, top_p=1.0, response_format=None, stop_sequences=None, max_tokens=None)[source]¶
Send a list of messages and return a fully assembled LLMResponse.
- Parameters:
- Return type:
axiom.backends.ollama¶
llm_engine/ollama_client.py
LLM backend client for locally running Ollama instances.
Ollama exposes an OpenAI-compatible REST API. This client targets the /api/chat endpoint for multi-turn conversation and /api/tags for the health check. Both streaming and non-streaming modes are supported.
Typical usage:
from axiom.backends.ollama import OllamaClient
llm = OllamaClient(base_url="http://localhost:11434", model_name="llama3.2")
if llm.is_available():
response = llm.complete(messages)
print(response.narrative_text)
- class axiom.backends.ollama.OllamaClient(model_name, base_url='http://localhost:11434')[source]¶
LLM backend targeting a locally running Ollama server.
- Parameters:
base_url (str) – Base URL of the Ollama HTTP API. Defaults to “http://localhost:11434”.
model_name (str) – Name of the Ollama model to use (e.g. “llama3.2”).
- is_available()[source]¶
Check whether the Ollama server is running and reachable.
GETs /api/tags. Returns True on HTTP 200, False on any exception. Never raises.
- Returns:
True if the server responds with HTTP 200, False otherwise.
- Return type:
- complete(messages, stream=False, temperature=0.7, top_p=1.0, response_format=None, stop_sequences=None, max_tokens=None)[source]¶
Send messages to Ollama’s /api/chat and return a parsed LLMResponse.
- Parameters:
messages (list[LLMMessage]) – Conversation turns (system, user, assistant).
stream (bool) – Ignored here; use stream_tokens() for token streaming.
temperature (float) – Sampling temperature (0.0 to 1.0).
top_p (float) – Nucleus sampling parameter (0.0 to 1.0).
response_format (str | None) – If “json”, forces Ollama to return a JSON object.
stop_sequences (list[str] | None) – Custom strings to trigger generation stop.
max_tokens (int | None) – Optional limit on the number of tokens to generate.
- Returns:
LLMResponse with narrative_text, optional tool_call, finish_reason.
- Raises:
LLMConnectionError – On connection refused, timeout, or HTTP 5xx.
LLMParseError – On malformed response JSON or invalid tool-call block.
- Return type:
- stream_tokens(messages, temperature=0.7, top_p=1.0, response_format=None, stop_sequences=None, max_tokens=None)[source]¶
Yield tokens from Ollama’s streaming NDJSON response.
POSTs to /api/chat with stream=true and yields each content token as it arrives.
- Parameters:
messages (list[LLMMessage]) – Conversation turns (system, user, assistant).
temperature (float) – Sampling temperature (0.0 to 1.0).
top_p (float) – Nucleus sampling parameter (0.0 to 1.0).
response_format (str | None) – If “json”, forces Ollama to return a JSON object.
stop_sequences (list[str] | None) – Custom strings to trigger generation stop.
max_tokens (int | None) – Optional limit on the number of tokens to generate.
- Yields:
Individual token strings in arrival order.
- Raises:
LLMConnectionError – On connection failure or HTTP 5xx.
- Return type: