Gian McCoy
AI Architecture · May 2026 · 9 min read

RAG Pipeline Against Personal Career Experience: pgvector, Cosine Threshold, Graceful Fallback

The SEO intelligence module in the Lead Enrichment app generates article briefs for contractor prospects. The problem: LLMs hallucinate specific experience claims. A brief that says “drawing on 20+ years of contractor marketing expertise” is worthless if the model invented the expertise. The solution was a RAG pipeline that retrieves actual documented experience before generating each brief — grounding the output in a structured knowledge base of real career history. Here is how it was built and what the design decisions cost and gained.

Why RAG for an SEO Article Brief Generator

The ArticleBriefAgent is the component in the Lead Enrichment app’s SEO intelligence module responsible for generating article briefs for contractor business clients. A brief includes: the target keyword, suggested headline, content outline, key claims to make, recommended word count, and suggested internal links. The brief is meant to be passed to a human writer or used as a structured prompt for content generation.

The first version of the ArticleBriefAgent called Claude directly with the keyword and SEO data (Google Search Console metrics, competitor gap analysis, search intent classification) and asked it to write a brief. The quality was reasonable for generic content — but it produced briefs that made specific experience claims (“Gian has helped dozens of contractors improve their online visibility”) that were plausible-sounding and entirely fabricated.

The fix was not a better prompt. A better prompt would reduce hallucination frequency but not eliminate it, and any hallucinated experience claim in a published article is a credibility problem. The correct fix was to give the model access to actual documented experience as retrieval context — so that any experience claim in the brief is traceable to a source chunk, not inferred from training weights.

That is the use case RAG was designed for: grounding LLM generation in a specific, verifiable knowledge corpus.

The Knowledge Base: What Gets Embedded

The knowledge base for the Lead Enrichment app’s RAG pipeline is a structured corpus of career experience. It is not a dump of unstructured text — it is a set of deliberate documents organized around specific expertise domains.

The documents embedded into the knowledge base fall into four categories:

  • Engagement summaries. Each significant client engagement documented at the level of specificity available: industry, problem type, technologies used, measurable outcomes. Not narratives — structured factual summaries.
  • Skill and capability records. Specific technical capabilities with evidence — “Configured GoHighLevel white-label with custom CNAME and per-sub-account authenticated sending domains (SPF, DKIM, DMARC)” rather than “GoHighLevel experience.”
  • Domain expertise statements. Structured claims about areas of deep knowledge, organized by domain: HubSpot CRM architecture, event marketing technology, Google Analytics 4, etc. Each statement includes the specific context (client type, duration, scope) that makes it credible.
  • Published frameworks and documented practice. The articles on this site — the Digital Operations Stack Model, the Marketing Automation Workflow Architecture — are themselves chunks in the knowledge base. A brief that references those frameworks can cite them accurately.

Each document is chunked before embedding. The chunking strategy is paragraph-level with a 200-token overlap between adjacent chunks, using a fixed tokenizer consistent with the embedding model. Chunk size matters: too large and the retrieval returns diffuse context; too small and the retrieval misses the surrounding context that gives a claim meaning.
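A minimal sketch of paragraph-level chunking with overlap, under stated assumptions: the function name chunk_with_overlap is hypothetical, and whitespace splitting stands in for the real tokenizer, which should be the one matched to the embedding model. Each chunk is one paragraph prefixed with the trailing tokens of the paragraph before it.

```python
import re


def tokenize(text: str) -> list[str]:
    # Assumption: whitespace tokens stand in for the embedding model's tokenizer.
    return text.split()


def chunk_with_overlap(text: str, overlap_tokens: int = 200) -> list[str]:
    """Split text into paragraph-level chunks.

    Each chunk after the first is prefixed with up to `overlap_tokens`
    tokens taken from the end of the previous paragraph.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    prev_tokens: list[str] = []
    for paragraph in paragraphs:
        tokens = tokenize(paragraph)
        # Guard against overlap_tokens == 0, since list[-0:] returns the whole list.
        carried = prev_tokens[-overlap_tokens:] if overlap_tokens > 0 else []
        chunks.append(" ".join(carried + tokens))
        prev_tokens = tokens
    return chunks
```

With a real tokenizer the overlap would be counted in model tokens rather than words, so chunk boundaries would shift somewhat; the structure of the loop stays the same.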

The Embedding Model and pgvector Setup

The Lead Enrichment app uses the OpenAI text-embedding-3-small model for embeddings. The model produces 1536-dimensional vectors — a reasonable balance between embedding quality and storage cost. The embedding model is isolated behind the LLM abstraction layer’s embed() method, which means it can be swapped without changing retrieval code.
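One way such an abstraction could look, as a sketch rather than the app's actual code: the Embedder protocol, FakeEmbedder, and embed_query names are hypothetical. Only the embed() method name and the 1536-dimension figure come from the description above. Retrieval code depends on the protocol, so a different model only needs a new class that satisfies it.

```python
from typing import Protocol, Sequence


class Embedder(Protocol):
    """Anything with a dimension count and an embed() method."""

    dimensions: int

    def embed(self, text: str) -> Sequence[float]: ...


class FakeEmbedder:
    """Deterministic stand-in for tests. Not a real embedding model."""

    dimensions = 1536

    def embed(self, text: str) -> Sequence[float]:
        vector = [0.0] * self.dimensions
        for i, byte in enumerate(text.encode("utf-8")):
            vector[i % self.dimensions] += byte / 255.0
        return vector


def embed_query(embedder: Embedder, query: str) -> list[float]:
    """Embed a query and check the vector length against the declared dimension."""
    vector = list(embedder.embed(query))
    if len(vector) != embedder.dimensions:
        raise ValueError(
            f"expected {embedder.dimensions} dimensions, got {len(vector)}"
        )
    return vector
```

The dimension check matters because the column type is vector(1536): swapping to a model with a different output size would also require a schema change and a full re-embedding of the corpus.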

The pgvector setup in Postgres is straightforward:

  1. Run CREATE EXTENSION vector; to enable the pgvector extension. This requires pgvector to be installed on the Postgres instance (it is available on most managed Postgres providers and in the pgvector/pgvector Docker image).
  2. An experience_chunks table with columns: id (uuid), content (text, the raw chunk), metadata (jsonb: source document, chunk index, domain tags), embedding (vector(1536)).
  3. An IVFFlat index on the embedding column: CREATE INDEX ON experience_chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100); IVFFlat is an approximate nearest-neighbor index: it trades a small amount of recall accuracy for significantly faster query performance at scale. For a corpus of a few thousand chunks, the performance difference from exact search is negligible, but the index is the right pattern to establish from the start.
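The three setup steps can be collected into a small migration helper. This is a sketch under assumptions: the IF NOT EXISTS clauses, the PRIMARY KEY and NOT NULL constraints, the index name, and the create_schema function are additions for illustration, not details from the app. The function accepts any DB-API 2.0 connection, such as one returned by psycopg2.connect.

```python
SCHEMA_STATEMENTS = [
    # Step 1: enable pgvector (the extension must already be installed on the server).
    "CREATE EXTENSION IF NOT EXISTS vector",
    # Step 2: the chunk table. Constraints here are assumptions for the sketch.
    """
    CREATE TABLE IF NOT EXISTS experience_chunks (
        id uuid PRIMARY KEY,
        content text NOT NULL,
        metadata jsonb NOT NULL DEFAULT '{}'::jsonb,
        embedding vector(1536) NOT NULL
    )
    """,
    # Step 3: approximate nearest-neighbor index using cosine distance.
    """
    CREATE INDEX IF NOT EXISTS experience_chunks_embedding_idx
    ON experience_chunks
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100)
    """,
]


def create_schema(conn) -> None:
    """Run the DDL statements in order on a DB-API 2.0 connection and commit."""
    cursor = conn.cursor()
    try:
        for statement in SCHEMA_STATEMENTS:
            cursor.execute(statement)
        conn.commit()
    finally:
        cursor.close()
```

One caveat from the pgvector documentation: IVFFlat builds its lists from the rows present at index creation, so the index is best created (or rebuilt) after the table has been loaded with data.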

The retrieval query in Python via psycopg2 uses the <=> (cosine distance) operator provided by pgvector. The query selects chunks where the cosine distance is below 1 - 0.45 = 0.55 (the operator returns distance, not similarity, so the similarity threshold is converted with distance = 1 - similarity), ordered by distance ascending, with a LIMIT of 8 chunks.
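A sketch of what that query could look like. The retrieve_chunks and to_vector_literal names and the returned dict shape are hypothetical; the <=> operator, the 0.45 threshold, and the limit of 8 come from the description above. The embedding is passed as a text literal and cast with ::vector, which avoids needing a registered adapter. The cursor can be any DB-API cursor using the %s parameter style, such as a psycopg2 cursor.

```python
SIMILARITY_THRESHOLD = 0.45
MAX_CHUNKS = 8

RETRIEVAL_SQL = """
SELECT id, content, metadata, embedding <=> %s::vector AS distance
FROM experience_chunks
WHERE embedding <=> %s::vector < %s
ORDER BY embedding <=> %s::vector
LIMIT %s
"""


def to_vector_literal(embedding) -> str:
    """Format a sequence of floats as a pgvector text literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def retrieve_chunks(cursor, query_embedding,
                    threshold: float = SIMILARITY_THRESHOLD,
                    limit: int = MAX_CHUNKS) -> list[dict]:
    """Return up to `limit` chunks whose cosine similarity exceeds `threshold`."""
    max_distance = 1.0 - threshold  # similarity threshold -> distance threshold
    vector = to_vector_literal(query_embedding)
    cursor.execute(RETRIEVAL_SQL, (vector, vector, max_distance, vector, limit))
    return [
        {
            "id": row[0],
            "content": row[1],
            "metadata": row[2],
            "similarity": 1.0 - row[3],  # convert distance back for reporting
        }
        for row in cursor.fetchall()
    ]
```

An empty return value is a normal outcome here, not an error: it simply means no chunk cleared the threshold for that query.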

Calibrating the Cosine Similarity Threshold: 0.45

The 0.45 cosine similarity threshold (equivalently, cosine distance < 0.55) was calibrated by running retrieval against a sample of 50 representative article brief topics and manually inspecting the returned chunks.

The calibration process:

  1. Generate 50 test queries representing the range of topics the ArticleBriefAgent would be asked to brief — contractor marketing, HubSpot CRM for tradespeople, Google Business Profile optimization for plumbers, etc.
  2. Run retrieval at several thresholds (0.3, 0.4, 0.45, 0.5, 0.6, 0.7) and record the chunks returned at each level.
  3. For each threshold level, assess: (a) Are the returned chunks actually relevant to the query? (b) Are any returned chunks misleading — tangentially related in a way that could introduce incorrect claims? (c) How many queries return zero chunks (triggering graceful fallback)?

At 0.3, retrieval returned too many tangentially related chunks — the model would receive context about pharmacy work when asked to brief a plumbing marketing article, because both involve client-facing service. At 0.6, too many valid-but-specific chunks were excluded, triggering fallback on topics where real experience existed. At 0.45, the returned chunks were consistently on-topic and the false positive rate was low enough that manual spot-checking did not surface any misleading inclusions.
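The counting half of that calibration is easy to script; the relevance judgments in (a) and (b) still need a human reading the chunks. The sketch below assumes the similarity scores for each test query have already been collected, and the sweep_thresholds name and report shape are hypothetical.

```python
def sweep_thresholds(scores_by_query: dict[str, list[float]],
                     thresholds: list[float],
                     limit: int = 8) -> list[dict]:
    """For each threshold, report average chunks returned and zero-result queries.

    `scores_by_query` maps each test query to the cosine similarities of its
    candidate chunks. A chunk is kept when similarity > threshold, which is
    equivalent to cosine distance < 1 - threshold.
    """
    if not scores_by_query:
        raise ValueError("need at least one test query")
    report = []
    for threshold in thresholds:
        kept_counts = [
            len(sorted((s for s in scores if s > threshold), reverse=True)[:limit])
            for scores in scores_by_query.values()
        ]
        report.append({
            "threshold": threshold,
            "avg_chunks": sum(kept_counts) / len(kept_counts),
            "fallback_queries": sum(1 for n in kept_counts if n == 0),
        })
    return report


# Synthetic scores for illustration only; these are not measurements from the app.
example_scores = {
    "hubspot crm for tradespeople": [0.72, 0.50, 0.35],
    "high-voltage industrial installs": [0.41, 0.20],
}
example_report = sweep_thresholds(example_scores, [0.3, 0.45, 0.6])
```

The fallback_queries column corresponds to question (c) in the list: as the threshold rises, more queries return nothing and drop to the fallback path.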

The right threshold will be different for every knowledge base and embedding model pair. There is no shortcut to the calibration step.

Graceful Fallback When No Chunks Meet the Threshold

Not every article brief topic maps to documented experience. A prospect who runs an electrical contracting business in a specialty the knowledge base does not cover — say, high-voltage industrial installations — may trigger a topic query that returns zero chunks above 0.45.

The graceful fallback behavior:

  1. The retrieval step returns an empty list of chunks.
  2. The ArticleBriefAgent detects the empty list and proceeds to generate the brief without RAG context — using only the structured SEO inputs (keyword, search intent, GSC data, competitor gap analysis) and the model’s own knowledge.
  3. The response payload includes experience_context_used: false and experience_chunks_retrieved: 0.
  4. The SEO module UI renders a notice alongside the brief: “This brief was generated without grounded career experience context. Review experience claims before use.”
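The four steps above can be sketched as a single function. The two experience_* payload fields and the notice text come from the steps; the generate_brief name, the generate callable, the "brief" key, and carrying the notice inside the payload are assumptions made for illustration.

```python
FALLBACK_NOTICE = (
    "This brief was generated without grounded career experience context. "
    "Review experience claims before use."
)


def generate_brief(seo_inputs: dict, chunks: list, generate) -> dict:
    """Generate a brief with RAG context when available, otherwise fall back.

    `generate` is a hypothetical callable (seo_inputs, chunks_or_none) -> brief
    that wraps the actual model call.
    """
    context_used = len(chunks) > 0
    # An empty retrieval result is not an error: generate from SEO inputs alone.
    brief = generate(seo_inputs, chunks if context_used else None)
    payload = {
        "brief": brief,
        "experience_context_used": context_used,
        "experience_chunks_retrieved": len(chunks),
    }
    if not context_used:
        # Assumption: the UI reads this field to render the warning.
        payload["notice"] = FALLBACK_NOTICE
    return payload
```

Both paths return the same payload shape, so downstream code only needs to branch on experience_context_used.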

Fallback does not mean failure — it means the brief is usable but requires more editorial judgment before publication. The operator can review the brief, remove any ungrounded experience claims, and either use the structural outline (keyword, angle, competitor gaps) or flag the brief for a topic they do not want to publish without grounding.

The alternative — raising an error when no context is found — was explicitly rejected. An error stops the pipeline. A graceful fallback with a clear flag keeps the pipeline running and gives the operator useful partial output.

The ArticleBriefAgent Prompt Structure

When retrieval returns chunks, the ArticleBriefAgent constructs a prompt with a specific structure. The structure matters because Claude’s attention is not uniform across a long prompt — placement of the retrieved context affects how prominently the model weighs it.

The prompt structure in order:

  1. System prompt (role and constraints). The system prompt establishes the agent’s role (“You are an SEO content strategist generating article briefs”), the key constraint (“All experience claims in the brief must be drawn from the provided career context — do not invent experience”), and the output format (structured JSON with specific fields).
  2. Retrieved experience chunks. Each chunk is formatted with its source metadata and content. The model is instructed to cite the chunk source when using it to support a claim in the brief.
  3. SEO inputs. The target keyword, search intent classification, GSC performance data (impressions, clicks, position), competitor content gap analysis, and suggested article angle.
  4. Task instruction. A short, specific instruction: “Generate an article brief for [keyword] following the output format. Draw experience claims from the provided career context only.”

The placement of retrieved context before the SEO inputs — rather than after — was a deliberate choice. Preliminary testing showed that the model more consistently integrated the experience context when it appeared earlier in the prompt, before the task-specific data it was likely to weight most heavily.
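A sketch of assembling the prompt in that order. The build_prompt and format_chunk names, the section labels, the "source" metadata key, and the returned dict shape are hypothetical, and the system prompt wording is adapted from the quoted constraints rather than copied from the app.

```python
import json

SYSTEM_PROMPT = (
    "You are an SEO content strategist generating article briefs. "
    "All experience claims in the brief must be drawn from the provided "
    "career context. Do not invent experience. "
    "Respond with structured JSON in the agreed output format."
)


def format_chunk(chunk: dict) -> str:
    """Render one retrieved chunk with its source so the model can cite it."""
    source = chunk.get("metadata", {}).get("source", "unknown")
    return f"[source: {source}]\n{chunk['content']}"


def build_prompt(keyword: str, seo_inputs: dict, chunks: list[dict]) -> dict:
    """Assemble the prompt: career context first, then SEO inputs, then the task."""
    context = "\n\n".join(format_chunk(c) for c in chunks)
    user_message = "\n\n".join([
        "CAREER CONTEXT:\n" + context,
        "SEO INPUTS:\n" + json.dumps(seo_inputs, indent=2, sort_keys=True),
        f"TASK: Generate an article brief for {keyword} following the output "
        "format. Draw experience claims from the provided career context only.",
    ])
    return {"system": SYSTEM_PROMPT, "user": user_message}
```

Keeping the ordering in one function makes it straightforward to rerun the placement comparison later by reordering the three sections and comparing outputs.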

What This Pattern Is and Is Not

This RAG pipeline is not a general-purpose document retrieval system. It is a narrow, purpose-built retrieval layer for one task: grounding article brief generation in documented career experience. The knowledge base is curated, not scraped — every chunk was deliberately created and reviewed.

This specificity is part of the design. A broader knowledge base with less curation would produce more retrievals but also more noise. The 0.45 threshold was calibrated against this specific corpus; it would need recalibration for a different one.

The pattern is generalizable to other narrow use cases: grounding customer support responses in a product knowledge base, grounding contract clause generation in a clause library, grounding medical documentation in clinical guidelines. In each case, the specific values — chunk size, embedding model, similarity threshold, fallback behavior — need to be calibrated for the domain rather than copied from another implementation.

The persistent lesson from building this: RAG is not a toggle you flip to prevent hallucination. It is a retrieval engineering problem that requires deliberate knowledge base curation, embedding model selection, threshold calibration, and fallback design. Each of those steps has design decisions with real consequences for output quality — and none of them can be defaulted.

Frequently Asked Questions

What is a RAG pipeline?

RAG (Retrieval-Augmented Generation) retrieves relevant documents from a knowledge store before generating a response, passing the retrieved chunks to the LLM as context. The result is generation grounded in specific, verifiable source material rather than the model's training weights — reducing hallucination for factual tasks and enabling reasoning over private or domain-specific knowledge.

Why use pgvector instead of a managed vector database?

pgvector is a Postgres extension that adds vector similarity search to an existing Postgres database. For a system already running Postgres, pgvector means zero additional infrastructure — no Pinecone, Weaviate, or Qdrant. Backup, access control, and query patterns stay unified. The tradeoff is that pgvector is not optimized for billion-scale collections, but for a knowledge base of a few thousand chunks, it is more than adequate.

What is a good cosine similarity threshold for RAG retrieval?

There is no universal answer — the correct threshold depends on the embedding model, domain specificity, and the tradeoff between false positives and false negatives in your use case. The Lead Enrichment app uses 0.45, calibrated by running retrieval against 50 representative queries at multiple thresholds and inspecting the returned chunks manually. Thresholds below 0.3 typically return too much noise; thresholds above 0.7 too few results. Calibrate against your actual corpus — don't copy someone else's threshold.

What happens when no chunks meet the cosine similarity threshold?

Graceful fallback: the ArticleBriefAgent generates the brief without RAG context, using only structured SEO inputs and the model's training knowledge. The response is flagged with experience_context_used: false and a UI notice warns the operator that experience claims should be reviewed. This is preferable to raising an error — the structural SEO outline (keyword angle, competitor gaps) is still useful even without grounded experience context.

Is RAG just for chatbots?

No. RAG is a general pattern for grounding LLM generation in a specific knowledge corpus. The Lead Enrichment app uses it for SEO article brief generation, not a chatbot. Other non-chatbot use cases: contract clause analysis against a clause library, customer support responses grounded in a product knowledge base, competitive analysis grounded in scraped competitor content, and medical documentation grounded in clinical guidelines.

RAG · pgvector · Vector Embeddings · Anthropic Claude · Cosine Similarity · AI Grounding · PostgreSQL · Production AI

Gian McCoy

AI Solutions Architect and Marketing Technology professional based in Los Angeles. The Lead Enrichment app’s RAG pipeline uses pgvector with a 0.45 cosine similarity threshold to ground Claude-generated SEO article briefs in documented career experience. See the Apps page for full architecture details or the Expertise page for the full AI engineering capability profile.