Gian McCoy
AI Architecture · May 2026 · 9 min read

How I Architected a Multi-Provider LLM Abstraction Layer

The Lead Enrichment app calls four LLM providers: Anthropic Claude, OpenAI, Gemini, and local Ollama. At no point does a task-level agent know which provider is handling its request. That decoupling is not an accident — it is the most important architectural decision in the system’s LLM layer, and this is how it was designed and why.

The Problem With Direct-to-Provider Calls

The naive approach to integrating LLMs into a production application is to call the provider API directly from the task that needs it. A classification agent imports the Anthropic SDK and calls anthropic.messages.create(). A summarization agent imports the OpenAI SDK and calls openai.chat.completions.create(). This works fine for a prototype, but creates several problems as the system grows.

First, provider coupling. If Anthropic changes its SDK interface — or if you decide to switch a task from Claude to GPT for cost or capability reasons — you have to find and update every call site that references the Anthropic SDK. In a system with 20+ agents across multiple modules, that is a maintenance problem.

Second, inconsistent cost control. When every agent calls its preferred provider directly, there is no single place to enforce cost-control policies — input hash deduplication, model tier ceilings, temperature discipline. Each agent has to implement these independently, or they do not get implemented at all.

Third, error handling fragmentation. The Anthropic SDK raises anthropic.RateLimitError. The OpenAI SDK raises openai.RateLimitError. These are different exception classes with different attribute shapes. Call sites end up with try/except blocks that only catch one provider’s error class, which means the other provider’s rate limit errors propagate as unhandled exceptions.

The abstraction layer solves all three by introducing a single interface between application code and provider SDKs.

The BaseLLMClient ABC

The abstraction layer is structured around an abstract base class (ABC) that defines the interface every provider client must implement. The Lead Enrichment app’s version lives in app/llm/base.py. The interface is minimal by design — two abstract methods:

  • complete(prompt: str, tier: ModelTier, **kwargs) -> str — takes a prompt string and a tier designation and returns a completion string. The caller does not specify a model name. The tier determines the model.
  • embed(text: str) -> list[float] — takes a text string and returns a vector embedding. Used by the pgvector RAG subsystem; isolated here so the embedding model can be swapped independently from the completion model.
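A minimal sketch of what this interface could look like, assuming Python's standard abc module. The two method signatures follow the bullets above. The EchoClient subclass and the enum's string values are hypothetical, included only so the example runs without a provider SDK.

```python
from abc import ABC, abstractmethod
from enum import Enum


class ModelTier(Enum):
    # Tier names come from the article; the string values are placeholders.
    SIMPLE = "simple"
    MEDIUM = "medium"
    COMPLEX = "complex"


class BaseLLMClient(ABC):
    """Interface every provider client implements."""

    @abstractmethod
    def complete(self, prompt: str, tier: ModelTier, **kwargs) -> str:
        """Return a completion; the tier, not the caller, picks the model."""

    @abstractmethod
    def embed(self, text: str) -> list[float]:
        """Return a vector embedding for the given text."""


class EchoClient(BaseLLMClient):
    """Hypothetical stand-in for a real provider client (no network calls)."""

    def complete(self, prompt: str, tier: ModelTier, **kwargs) -> str:
        return f"[{tier.value}] {prompt}"

    def embed(self, text: str) -> list[float]:
        # Toy embedding: one float per character, only to satisfy the interface.
        return [float(ord(ch)) for ch in text]


client = EchoClient()
reply = client.complete("Is this voicemail a lead?", ModelTier.SIMPLE)
```

Because both methods are abstract, instantiating BaseLLMClient directly raises TypeError, so a provider client that forgets to implement embed() fails when it is constructed rather than at its first call.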

The ABC also defines the domain exception hierarchy: LLMRateLimitError, LLMContextWindowError, LLMAuthError, and LLMProviderError (the catch-all). Every concrete client maps its provider’s native exceptions to these classes in its error handling layer.
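One way the hierarchy and the mapping step might be sketched. The four exception names are from the article. Making LLMProviderError the base class of the other three is an assumption, as are the ERROR_MAP table and the normalize() and call_provider() helpers. The FakeSDK classes are stand-ins for a provider SDK's native exceptions so the example runs without any SDK installed.

```python
class LLMProviderError(Exception):
    """Catch-all domain error; assumed here to be the base of the other three."""


class LLMRateLimitError(LLMProviderError):
    pass


class LLMContextWindowError(LLMProviderError):
    pass


class LLMAuthError(LLMProviderError):
    pass


# Hypothetical stand-ins for a provider SDK's native exceptions.
class FakeSDKRateLimit(Exception):
    pass


class FakeSDKAuth(Exception):
    pass


# Native exception class -> domain exception class (one table per client).
ERROR_MAP = {
    FakeSDKRateLimit: LLMRateLimitError,
    FakeSDKAuth: LLMAuthError,
}


def normalize(exc: Exception) -> LLMProviderError:
    """Translate a native exception into the matching domain exception."""
    for native, domain in ERROR_MAP.items():
        if isinstance(exc, native):
            return domain(str(exc))
    # Anything unrecognized falls through to the catch-all.
    return LLMProviderError(str(exc))


def call_provider(fn):
    """Run a provider call, re-raising any failure as a domain exception."""
    try:
        return fn()
    except LLMProviderError:
        raise  # already normalized, pass through unchanged
    except Exception as exc:
        raise normalize(exc) from exc
```

With this shape, a call site can catch LLMRateLimitError once and handle rate limits from any provider, and the original SDK exception stays reachable through __cause__ for debugging.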

Each concrete client (AnthropicClient, OpenAIClient, GeminiClient, OllamaClient) inherits from BaseLLMClient and implements complete() and embed() for its provider’s SDK. The client is also responsible for managing its own retry logic, backoff configuration, and request timeout handling.

The ModelTier Enum

The ModelTier enum is the mechanism that makes tier routing work. The Lead Enrichment app uses three tiers, defined in app/llm/tiers.py:

  • SIMPLE — classification tasks, short structured extractions, binary decisions. Tasks where speed and cost matter more than reasoning depth. Mapped to: Claude Haiku 4.5 (Anthropic), GPT-4.1-mini (OpenAI), Gemini Flash, Ollama Mistral (local).
  • MEDIUM — multi-step reasoning, prose generation, structured analysis requiring moderate context. Mapped to: Claude Sonnet 4.6 (Anthropic), GPT-4.1 (OpenAI), Gemini Pro.
  • COMPLEX — tasks requiring deep reasoning, long context synthesis, or high-stakes evaluation. Mapped to: Claude Opus 4.6 (Anthropic), GPT-5 (OpenAI). Used sparingly — the cost ceiling is meaningful.

The tier-to-model mapping is configuration, not code. A dictionary in app/llm/config.py maps each tier to the preferred provider and model for that tier. Changing the MEDIUM-tier model from Sonnet 4.6 to a future Anthropic release requires a one-line change in configuration. No agent code changes.
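A sketch of what such a dictionary might contain. The TIER_MODELS name and the resolve() helper are hypothetical. The model strings are the display names used in this article, not verified API model identifiers, and treating Anthropic as the preferred provider for the COMPLEX tier is an assumption.

```python
from enum import Enum


class ModelTier(Enum):
    SIMPLE = "simple"
    MEDIUM = "medium"
    COMPLEX = "complex"


# tier -> (preferred provider, model). Display names only; a real config
# would hold the provider's actual model identifier strings.
TIER_MODELS: dict[ModelTier, tuple[str, str]] = {
    ModelTier.SIMPLE: ("anthropic", "Claude Haiku 4.5"),
    ModelTier.MEDIUM: ("anthropic", "Claude Sonnet 4.6"),  # the one line to edit
    ModelTier.COMPLEX: ("anthropic", "Claude Opus 4.6"),   # provider choice assumed
}


def resolve(tier: ModelTier) -> tuple[str, str]:
    """Look up the preferred provider and model for a tier."""
    return TIER_MODELS[tier]


provider, model = resolve(ModelTier.MEDIUM)
```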

Every agent and every task-level function in the Lead Enrichment app is assigned a tier at the call site:

  • Voicemail classification: SIMPLE (binary accept/reject decision)
  • Prospect enrichment summarization: MEDIUM (structured multi-field synthesis)
  • SEO article brief generation: MEDIUM (requires context from RAG + keyword data)
  • Resume tailoring in Job Agent: MEDIUM (structured generation against job description)
  • Job evaluation against override history: COMPLEX (judgment call requiring calibration context)

The tier assignment is explicit and visible in the call site. This makes the cost profile of the system readable from the code.

The Router: Selecting Provider and Model

The router is the component that takes a tier designation and returns a concrete client instance and model identifier. The Lead Enrichment app’s router in app/llm/router.py follows a priority order per tier:

  1. Check whether the preferred provider for this tier is available (API key present, not in a rate-limit backoff window).
  2. If the preferred provider is unavailable, fall back to the secondary provider for this tier.
  3. If all cloud providers are unavailable, fall back to Ollama (local inference) for SIMPLE and MEDIUM tiers — not COMPLEX, because local models are not yet competitive at that tier.

The fallback chain is configuration, not hardcoded logic. For each tier, the config specifies an ordered list of providers. The router iterates the list until it finds one that is available. This makes the failover behavior explicit and auditable.

Provider availability is tracked in a small in-memory state object (not a database) that records the last rate-limit error timestamp per provider. If the error occurred within the backoff window (configurable per provider), the router skips that provider in the current request cycle.
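The selection logic described above could be sketched roughly as follows. The provider order within each chain, the backoff durations, and the names FALLBACK_CHAIN, ProviderState, and select_provider are all assumptions. For brevity the sketch returns a provider name rather than a client instance and model identifier, and the clock is injectable so the backoff window can be exercised without sleeping.

```python
import time
from enum import Enum


class ModelTier(Enum):
    SIMPLE = "simple"
    MEDIUM = "medium"
    COMPLEX = "complex"


# Ordered fallback chain per tier (order assumed). Ollama is last for
# SIMPLE and MEDIUM and absent from COMPLEX, as the article describes.
FALLBACK_CHAIN = {
    ModelTier.SIMPLE: ["anthropic", "openai", "gemini", "ollama"],
    ModelTier.MEDIUM: ["anthropic", "openai", "gemini", "ollama"],
    ModelTier.COMPLEX: ["anthropic", "openai"],
}

# Backoff window per provider, in seconds (illustrative values).
BACKOFF_SECONDS = {"anthropic": 30.0, "openai": 30.0, "gemini": 30.0, "ollama": 0.0}


class NoProviderAvailable(Exception):
    pass


class ProviderState:
    """In-memory record of the last rate-limit error per provider."""

    def __init__(self, configured, clock=time.monotonic):
        self._configured = set(configured)  # providers with an API key present
        self._clock = clock
        self._last_rate_limit = {}

    def record_rate_limit(self, provider: str) -> None:
        self._last_rate_limit[provider] = self._clock()

    def is_available(self, provider: str) -> bool:
        if provider not in self._configured:
            return False
        last = self._last_rate_limit.get(provider)
        if last is None:
            return True
        return self._clock() - last >= BACKOFF_SECONDS[provider]


def select_provider(tier: ModelTier, state: ProviderState) -> str:
    """Return the first available provider in the tier's fallback chain."""
    for provider in FALLBACK_CHAIN[tier]:
        if state.is_available(provider):
            return provider
    raise NoProviderAvailable(tier.name)
```

Keeping the chain as data means the failover order can be read, and changed, without touching select_provider().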

Cost Controls Built Into the Layer

The abstraction layer is also where the system’s cost-control mechanisms live. There are four:

Input hash deduplication. For each completion request, the layer computes a hash of the normalized prompt and the tier designation. If a completion for that hash already exists in the cache (a simple dict with a TTL), the cached response is returned without an API call. This prevents duplicate calls for identical inputs — a common source of wasted spend in systems where the same enrichment prompt fires for a prospect that was processed in a previous run.
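A minimal sketch of the deduplication path, under stated assumptions: "normalized" is taken to mean collapsing whitespace, the hash is SHA-256, and the tier is passed as a plain string. The TTLCache class and the cached_complete() helper are hypothetical names; call_model stands in for whatever function performs the real API call.

```python
import hashlib
import time


def normalize_prompt(prompt: str) -> str:
    # Assumed normalization: collapse runs of whitespace and trim the ends.
    return " ".join(prompt.split())


def cache_key(prompt: str, tier: str) -> str:
    """Hash the normalized prompt together with the tier designation."""
    payload = f"{tier}\x00{normalize_prompt(prompt)}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class TTLCache:
    """A dict whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def put(self, key: str, value: str) -> None:
        self._store[key] = (self._clock(), value)


def cached_complete(prompt: str, tier: str, cache: TTLCache, call_model) -> str:
    """Return a cached completion if one exists, else call the model once."""
    key = cache_key(prompt, tier)
    hit = cache.get(key)
    if hit is not None:
        return hit
    result = call_model(prompt)
    cache.put(key, result)
    return result
```

Including the tier in the key matters: the same prompt sent at a different tier is answered by a different model, so it should not share a cache entry.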

Tier ceilings. The tier designation is enforced at the router level. There is no mechanism for application code to call a COMPLEX-tier model for a SIMPLE-tier task — the task’s tier determines the model ceiling. If an agent needs a higher-capability model for a specific input, it must explicitly declare a higher tier at the call site. The declaration is visible in code review.

Deterministic temperature. Classification tasks, structured extractions, and any task where response consistency matters are called with temperature=0.0. The abstraction layer enforces this for SIMPLE-tier calls by default; MEDIUM and COMPLEX tiers default to 0.2 but can be overridden via kwargs. The convention is that a call site that sets a non-default temperature must include a comment explaining why.
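The temperature rule can be sketched as a small resolver. The default values (0.0 and 0.2) are from the paragraph above. The resolve_temperature() name is hypothetical, and treating the SIMPLE-tier value as non-overridable is one reading of "enforces"; a layer that merely defaults to 0.0 would look slightly different.

```python
from enum import Enum


class ModelTier(Enum):
    SIMPLE = "simple"
    MEDIUM = "medium"
    COMPLEX = "complex"


DEFAULT_TEMPERATURE = {
    ModelTier.SIMPLE: 0.0,
    ModelTier.MEDIUM: 0.2,
    ModelTier.COMPLEX: 0.2,
}


def resolve_temperature(tier: ModelTier, **kwargs) -> float:
    """Pick the temperature for a call, honoring overrides where allowed."""
    if tier is ModelTier.SIMPLE:
        # Assumption: SIMPLE-tier calls are pinned and ignore any override.
        return DEFAULT_TEMPERATURE[ModelTier.SIMPLE]
    return kwargs.get("temperature", DEFAULT_TEMPERATURE[tier])
```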

Token budget logging. Every completion request logs the prompt token count and response token count to a structured log. This feeds a cost monitoring dashboard that shows per-task and per-tier token consumption. The data is not used for real-time rate limiting — it is observability infrastructure that makes cost anomalies visible within minutes of occurrence.
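A sketch of what one structured log entry might look like, using only the standard logging and json modules. The logger name, the field names, and the log_token_usage() helper are assumptions; the article specifies only that prompt and response token counts are logged per request.

```python
import json
import logging

logger = logging.getLogger("llm.tokens")  # hypothetical logger name
logger.setLevel(logging.INFO)


def log_token_usage(task: str, tier: str, provider: str,
                    prompt_tokens: int, response_tokens: int) -> dict:
    """Emit one JSON log line per completion and return the record."""
    record = {
        "task": task,
        "tier": tier,
        "provider": provider,
        "prompt_tokens": prompt_tokens,
        "response_tokens": response_tokens,
        "total_tokens": prompt_tokens + response_tokens,
    }
    # One JSON object per line keeps the log easy to parse downstream.
    logger.info(json.dumps(record, sort_keys=True))
    return record
```

A dashboard can then aggregate total_tokens by task or tier straight from the log stream, without the layer keeping any counters of its own.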

Provider-Specific Considerations

Each provider has quirks that the concrete client implementations handle transparently.

Anthropic Claude — the Messages API is the standard interface. The system prompt is a separate parameter (not embedded in the messages array), which the AnthropicClient handles by extracting the first system-role message from the prompt structure if one is present. Claude’s context window (200K tokens for Sonnet and Opus) makes it practical for long-context SEO analysis tasks without chunking.

OpenAI (Responses API) — the Lead Enrichment app uses the Responses API rather than the older Chat Completions API for OpenAI calls, which provides a cleaner interface for structured output and tool use. GPT-4.1-mini handles the first stage of Job Agent’s two-stage evaluation pipeline (structured JD parsing and candidate extraction); GPT-5 handles the second stage (quality evaluation against override history). The two-stage pattern keeps COMPLEX-tier spend minimal by routing only the final evaluation step there.

Gemini — used for multimodal tasks where image context is relevant (e.g. screenshot-based competitive audit in the SEO module). The GeminiClient handles the file upload and inline data encoding that the Gemini API requires for image inputs, exposing a consistent interface to the caller.

Ollama — local inference via Ollama is the fallback for SIMPLE and MEDIUM tiers when cloud providers are unavailable. The OllamaClient wraps the Ollama REST API (running on localhost) in the same interface. Latency is higher than cloud providers for most tasks; this is acceptable for fallback use but not for primary routing.
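As an illustration of the Ollama case, a rough sketch of a completion call against Ollama's /api/generate endpoint using only the standard library. The default port 11434 and the "mistral" model tag are assumptions about the local setup, and the function names are hypothetical. This is not the app's OllamaClient, which would also implement embed(), retries, and error normalization.

```python
import json
import urllib.request

# Assumes Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt: str, model: str = "mistral",
                  temperature: float = 0.0) -> bytes:
    """Encode a non-streaming generate request as a JSON body."""
    body = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a token stream
        "options": {"temperature": temperature},
    }
    return json.dumps(body).encode("utf-8")


def parse_response(raw: bytes) -> str:
    """Extract the completion text from a non-streaming response body."""
    return json.loads(raw)["response"]


def ollama_complete(prompt: str, model: str = "mistral",
                    timeout: float = 60.0) -> str:
    """Send one completion request to the local Ollama server."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt, model),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return parse_response(resp.read())
```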

What the Abstraction Does Not Do

Clarity about scope matters. The LLM abstraction layer in the Lead Enrichment app is not an agent framework. It does not manage tool call loops, multi-step chains, or conversation state. Those concerns live in the agent layer above the abstraction — individual agent classes that hold conversation history, manage tool registrations, and implement their own retry and escalation logic.

The abstraction layer is a transport layer: it takes a prompt and a tier, selects a provider and model, makes one API call, normalizes the response, and returns a string. The agent is responsible for everything above that — prompt construction, response parsing, tool dispatch, and state management. This narrow scope is intentional. A broader abstraction that tried to handle agent orchestration would be a framework, and the Lead Enrichment app deliberately avoids framework dependencies for its core AI infrastructure.

The test surface is correspondingly narrow. The abstraction layer has 200+ pytest tests covering: correct model selection per tier, correct provider fallback sequencing, exception normalization for each provider’s error classes, cache hit/miss behavior, and token logging correctness. Agent-level behavior is tested at the agent layer with mocked clients.
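An agent-layer test with a mocked client might look something like this. The summarize_prospect() function is a hypothetical agent-level helper, and the tier is passed as a plain string to keep the sketch short. The mock comes from the standard unittest.mock module, and pytest would collect the test function by its test_ prefix.

```python
from unittest.mock import Mock


def summarize_prospect(client, notes: str) -> str:
    """Hypothetical agent function: builds the prompt, parses the reply."""
    reply = client.complete(f"Summarize this prospect:\n{notes}", tier="MEDIUM")
    return reply.strip()


def test_summarize_uses_medium_tier():
    # The mocked client stands in for the whole abstraction layer.
    client = Mock()
    client.complete.return_value = "  Acme Corp, 50 employees.  "

    assert summarize_prospect(client, "raw notes") == "Acme Corp, 50 employees."

    # The agent should make exactly one call, at the tier it declares.
    client.complete.assert_called_once()
    assert client.complete.call_args.kwargs["tier"] == "MEDIUM"
```

Because the agent only ever sees the client interface, this test needs no API keys and no network, and it exercises prompt construction and response parsing in isolation.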

When to Build This Pattern and When Not To

The multi-provider abstraction layer is justified when: you are using more than one provider, you expect to add providers as the market evolves, you need systematic cost controls, or you have enough agents that inconsistent error handling would become a maintenance problem. All four conditions apply to the Lead Enrichment app.

It is over-engineering for a project that calls one provider in one place. If you are building a single-purpose tool that calls Claude and has no plans to add other providers, calling the Anthropic SDK directly is the right choice. Add the abstraction when it earns its keep: when the cost of building and maintaining it is less than the cost of maintaining direct-to-provider calls across a growing agent surface.

Frequently Asked Questions

What is an LLM abstraction layer?

An LLM abstraction layer sits between your application logic and provider APIs (Anthropic, OpenAI, Gemini, Ollama). Application code calls a single interface with a prompt and a tier designation; the abstraction layer handles provider selection, API key management, request formatting, error handling, and response normalization. The application does not know which provider handled any given request.

Why route by tier instead of by model name?

Routing by model name hardcodes provider assumptions into your business logic. When a provider releases a new model or changes pricing, you update every call site. Routing by tier (SIMPLE / MEDIUM / COMPLEX) decouples task complexity from the specific model. The tier-to-model mapping lives in one configuration location — updating it changes behavior everywhere at once.

How does the abstraction layer affect cost?

The layer makes cost-control mechanisms structurally enforceable. Tier routing prevents accidentally calling Opus-class models for simple tasks. Input hash deduplication prevents duplicate API calls for identical prompts. Deterministic temperature settings cut down on retries caused by inconsistent output. Together these reduce both token consumption and call volume systematically.

Can you switch LLM providers without rewriting agents?

Yes — that is the primary design goal. Agents call the abstraction interface with a prompt and a tier. The model mapping (SIMPLE → Haiku 4.5, MEDIUM → Sonnet 4.6, etc.) lives in configuration. Changing the MEDIUM-tier model is a one-line config change. Agent code does not change.

What is the most common bug in an LLM abstraction layer?

Inconsistent error surface: different providers raise different exception types for similar failures (rate limiting, context overflow, auth failure). If the abstraction does not normalize these into a consistent exception hierarchy, call sites end up with try/except blocks that only catch one provider's error class. The fix: define domain-specific exceptions (LLMRateLimitError, LLMContextWindowError, LLMAuthError) and map every provider's native exceptions to these in the client layer.

Tags: AI Architecture · LLM · Anthropic Claude · OpenAI · Multi-Provider · Cost Controls · Python · Production AI

Gian McCoy

AI Solutions Architect and Marketing Technology professional based in Los Angeles. The Lead Enrichment app’s LLM abstraction layer routes across Anthropic Claude (Haiku 4.5 / Sonnet 4.6 / Opus 4.6), OpenAI (GPT-4.1-mini / GPT-5), Gemini, and local Ollama. See the Apps page for full architecture details or the Expertise page for the full AI engineering capability profile.