Why Agents Hallucinate
Large language models are trained to produce fluent, plausible text. When asked a question outside their training data, they do not say "I don't know" — they generate a confident-sounding answer that might be completely fabricated.
Retrieval-Augmented Generation (RAG) mitigates this by retrieving relevant passages from a trusted knowledge base and injecting them into the model's context before it generates a response.
Chunking: The Critical Step Most Get Wrong
The quality of your RAG system is determined primarily by chunking strategy, not model choice.
Our benchmarks on support documentation:
| Chunk size (chars) | Retrieval precision | Answer quality |
|---|---|---|
| 200 | 42% | Poor |
| 500 | 71% | Good |
| 800 | 78% | Very Good |
| 1200 | 73% | Good |
| 2000 | 61% | Fair |
Precision peaks around 800 characters and falls off on either side. The sweet spot is 500–900 characters, with a 100-character overlap between adjacent chunks.
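As a minimal sketch of overlapping chunks, the function below splits on raw character offsets. The name `chunk_text` and its defaults (800 characters, 100 overlap) are assumptions chosen to match the table, not the pipeline used for the benchmarks; a production chunker would likely also respect sentence or heading boundaries.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper: each chunk starts (chunk_size - overlap) characters
    after the previous one, so neighbours share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text, so we do
        # not emit a trailing chunk that only repeats the overlap.
        if start + chunk_size >= len(text):
            break
    return chunks


# Illustrative input: 2,000 characters of filler text.
sample = "".join(chr(97 + i % 26) for i in range(2000))
pieces = chunk_text(sample)
```

The last chunk may be shorter than `chunk_size`. Whether to keep, pad, or merge it with the previous chunk is a design choice this sketch leaves open.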
Embedding Model Choice
We use OpenAI text-embedding-3-small for all knowledge base embeddings.
At our scale:
- text-embedding-3-small: $0.02 / 1M tokens
- text-embedding-3-large: $0.13 / 1M tokens

For most RAG use cases, the precision improvement of 3-large does not justify 6.5x the cost. We validated this against a 5,000-question benchmark: 3-small achieves 94% of the answer quality at 15% of the cost.
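The cost ratio can be checked with a few lines of arithmetic. The prices are the ones quoted above; the 50-million-token corpus size is a made-up figure for illustration only.

```python
# Prices in USD per one million tokens, as quoted in this article.
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}


def embedding_cost(model: str, total_tokens: int) -> float:
    """Return the cost in USD of embedding `total_tokens` with `model`."""
    return PRICE_PER_MILLION_TOKENS[model] * total_tokens / 1_000_000


# Hypothetical corpus size, not a figure from the benchmark.
corpus_tokens = 50_000_000

small = embedding_cost("text-embedding-3-small", corpus_tokens)
large = embedding_cost("text-embedding-3-large", corpus_tokens)

print(f"small: ${small:.2f}, large: ${large:.2f}")
print(f"large / small = {large / small:.1f}x")
print(f"small / large = {small / large:.0%}")
```

The ratio does not depend on corpus size, so the 6.5x and 15% figures hold at any scale as long as both models are billed per token at these rates.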
The pgvector Query
Once chunks are embedded, retrieval is a single SQL query:
SELECT
    c.id,
    d.title AS document_title,
    c.content,
    (1 - (c.embedding <=> $1))::float AS similarity
FROM rag_chunks c
JOIN rag_documents d ON d.id = c.document_id
WHERE c.knowledge_base_id = $2
  AND (1 - (c.embedding <=> $1)) > 0.65
ORDER BY c.embedding <=> $1
LIMIT 5;
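pgvector's <=> operator returns cosine distance, so 1 - distance gives cosine similarity. The threshold of 0.65 (65% cosine similarity) filters out semantically unrelated chunks. We use an IVFFlat index (lists = 100) for ~10x faster search; note that IVFFlat is an approximate index, so it trades a small amount of recall for that speed.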
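To make the query's logic concrete, here is an in-memory sketch of the same three steps: score by cosine similarity, drop anything at or below the threshold, and keep the five closest. It is not how pgvector executes the query (there is no index here), and the function names and the dict layout are assumptions for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Choice made by this sketch: treat zero vectors as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, chunks, threshold=0.65, limit=5):
    """Mirror of the SQL: filter by similarity, sort best-first, take `limit`."""
    scored = []
    for chunk in chunks:
        similarity = cosine_similarity(query_embedding, chunk["embedding"])
        if similarity > threshold:  # same strict comparison as the WHERE clause
            scored.append({**chunk, "similarity": similarity})
    # Highest similarity first, equivalent to ORDER BY distance ascending.
    scored.sort(key=lambda c: c["similarity"], reverse=True)
    return scored[:limit]


# Toy two-dimensional embeddings; real embeddings have many more dimensions.
toy_chunks = [
    {"id": 1, "embedding": [1.0, 0.0]},
    {"id": 2, "embedding": [1.0, 1.0]},
    {"id": 3, "embedding": [0.0, 1.0]},
    {"id": 4, "embedding": [1.0, 2.0]},
]
results = retrieve([1.0, 0.0], toy_chunks)
```

In this toy data only chunks 1 and 2 clear the 0.65 threshold, so the result can contain fewer than five rows, just as the SQL query can.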
Context Injection
Retrieved chunks are injected into the agent's system prompt in a structured block. The citation instruction is critical — without it, models paraphrase context without indicating which source they used.
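One way such a structured block might look is sketched below. The `<knowledge_base>` delimiter, the numbered source format, and the wording of the citation instruction are all assumptions for illustration, not the exact prompt used in production. The `document_title` and `content` keys match the columns returned by the retrieval query.

```python
# Hypothetical citation instruction; tune the wording for your own agent.
CITATION_INSTRUCTION = (
    "Answer using only the sources in the knowledge base block. "
    "After each claim, cite the source number in square brackets, e.g. [1]. "
    "If the sources do not contain the answer, say you do not know."
)


def build_system_prompt(base_instructions: str, chunks: list[dict]) -> str:
    """Append numbered retrieved chunks and a citation rule to the prompt."""
    lines = [base_instructions, "", "<knowledge_base>"]
    for number, chunk in enumerate(chunks, start=1):
        # Number each source so the model has something concrete to cite.
        lines.append(f"[{number}] {chunk['document_title']}")
        lines.append(chunk["content"].strip())
        lines.append("")
    lines.append("</knowledge_base>")
    lines.append("")
    lines.append(CITATION_INSTRUCTION)
    return "\n".join(lines)


prompt = build_system_prompt(
    "You are a support agent.",
    [{"document_title": "Refund Policy", "content": " Refunds take 5 days. "}],
)
```

Numbering the sources keeps citations short and lets the application map each `[n]` back to a document after the response is generated.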
Evaluating Your RAG System
Before going to production, run these three checks: