RAG Without the Hallucinations: Building Grounded Agents

Why Agents Hallucinate

Large language models are trained to produce fluent, plausible text. When asked a question outside their training data, they rarely say "I don't know". Instead, they generate a confident-sounding answer that might be completely fabricated.

RAG (Retrieval-Augmented Generation) addresses this by retrieving relevant passages from a knowledge base and injecting them into the model's context before it generates a response.

Chunking: The Critical Step Most Get Wrong

The quality of your RAG system is determined primarily by chunking strategy, not model choice.

Our benchmarks on support documentation:

Chunk size (chars)	Retrieval precision	Answer quality
200	42%	Poor
500	71%	Good
800	78%	Very Good
1200	73%	Good
2000	61%	Fair

The sweet spot is 500–900 characters with 100-character overlaps between chunks.
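A minimal sketch of fixed-size character chunking with overlap, assuming plain-text input. The function name `chunk_text` and its defaults (800 characters, 100-character overlap) are illustrative choices taken from the range above, not the exact splitter used for the benchmark. A production splitter would usually also try to break on sentence or paragraph boundaries.

```python
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into character windows of `size` that overlap by `overlap`."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "x" * 2000  # stand-in for a real document
    for i, chunk in enumerate(chunk_text(sample)):
        print(i, len(chunk))
```

Each chunk repeats the last 100 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.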

Embedding Model Choice

We use OpenAI text-embedding-3-small for all knowledge base embeddings.

At our scale:

• text-embedding-3-small: $0.02 / 1M tokens

• text-embedding-3-large: $0.13 / 1M tokens

For most RAG use cases, the precision improvement of 3-large does not justify 6.5x the cost. We validated this against a 5,000-question benchmark — 3-small achieves 94% of the answer quality at 15% of the cost.
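The cost arithmetic can be checked with a few lines of Python. The prices are the ones listed above. The corpus size of 50 million tokens is a hypothetical figure chosen only to make the numbers concrete.

```python
# Prices per 1M tokens, as listed in this article.
PRICE_PER_M_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}


def embedding_cost(model: str, total_tokens: int) -> float:
    """Return the cost in USD of embedding `total_tokens` with `model`."""
    return PRICE_PER_M_TOKENS[model] * total_tokens / 1_000_000


if __name__ == "__main__":
    corpus_tokens = 50_000_000  # hypothetical knowledge base size
    small = embedding_cost("text-embedding-3-small", corpus_tokens)
    large = embedding_cost("text-embedding-3-large", corpus_tokens)
    print(f"small: ${small:.2f}  large: ${large:.2f}")
    print(f"large/small: {large / small:.1f}x, small/large: {small / large:.0%}")
```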

The pgvector Query

Once chunks are embedded, retrieval is a single SQL query:

SELECT
  c.id,
  d.title AS document_title,
  c.content,
  (1 - (c.embedding <=> $1))::float AS similarity
FROM rag_chunks c
JOIN rag_documents d ON d.id = c.document_id
WHERE c.knowledge_base_id = $2
  AND (1 - (c.embedding <=> $1)) > 0.65
ORDER BY c.embedding <=> $1
LIMIT 5;

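The threshold of 0.65 (65% cosine similarity) filters out semantically unrelated chunks. In pgvector, <=> is the cosine distance operator, so 1 minus the distance gives the similarity. We use an IVFFlat index (lists = 100) for ~10x faster search. IVFFlat is an approximate index, so that speed can come at the cost of some recall.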
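To make the query's semantics concrete, here is a pure-Python sketch of what it computes: cosine similarity against every chunk, a 0.65 cutoff, and the top 5 by similarity. This is an exact brute-force scan for illustration only, not how pgvector executes the query. The dictionary shape of `chunks` is an assumption, and treating zero vectors as unrelated is a simplification.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # simplification: treat zero vectors as unrelated
    return dot / (norm_a * norm_b)


def retrieve(query_vec, chunks, threshold=0.65, limit=5):
    """Mirror the SQL: filter by similarity > threshold, order by closeness, limit."""
    scored = [(cosine_similarity(query_vec, c["embedding"]), c) for c in chunks]
    kept = [(s, c) for s, c in scored if s > threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    return [
        {"id": c["id"], "content": c["content"], "similarity": s}
        for s, c in kept[:limit]
    ]


if __name__ == "__main__":
    chunks = [
        {"id": 1, "content": "refund policy", "embedding": [1.0, 0.0]},
        {"id": 2, "content": "office hours", "embedding": [0.0, 1.0]},
    ]
    print(retrieve([1.0, 0.1], chunks))
```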

Context Injection

Retrieved chunks are injected into the agent's system prompt in a structured block, together with an instruction to cite sources. The citation instruction is critical: without it, models paraphrase context without indicating which source they used.
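One way to build such a block is sketched below. The `<context>` delimiters, the numbered `[n]` source labels, the wording of the citation instruction, and the function name `build_system_prompt` are all illustrative assumptions, not the exact prompt used in production. The chunk fields mirror the columns returned by the retrieval query.

```python
def build_system_prompt(base_prompt: str, chunks: list[dict]) -> str:
    """Append retrieved chunks and a citation instruction to a system prompt."""
    if not chunks:
        # Nothing retrieved: tell the model to admit it rather than guess.
        return (
            f"{base_prompt}\n\n"
            "No relevant context was retrieved. "
            "If you cannot answer reliably, say you don't know."
        )

    sources = []
    for number, chunk in enumerate(chunks, start=1):
        sources.append(f"[{number}] {chunk['document_title']}\n{chunk['content']}")
    context_block = "<context>\n" + "\n\n".join(sources) + "\n</context>"

    instruction = (
        "Answer using only the context above. "
        "Cite the source number, for example [1], after each claim. "
        "If the context does not contain the answer, say you don't know."
    )
    return f"{base_prompt}\n\n{context_block}\n\n{instruction}"


if __name__ == "__main__":
    retrieved = [
        {"document_title": "Refund Policy", "content": "Refunds take 5 days."},
    ]
    print(build_system_prompt("You are a support agent.", retrieved))
```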

Evaluating Your RAG System

Before going to production, run these three checks:

1. Retrieval recall: For 50 hand-picked questions, does the correct chunk appear in the top 5? Target: >85%.

2. Answer faithfulness: Are claims in the answer supported by retrieved context? Target: >90%.

3. Out-of-scope detection: For questions your KB cannot answer, does the agent correctly say it doesn't know? Target: >80%.
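The first check is the easiest to automate. The sketch below computes retrieval recall at k over a labeled question set. The `(question, expected_chunk_id)` format and the `retrieve_ids` callable are hypothetical stand-ins for your own test set and retriever. Faithfulness and out-of-scope detection need human or model-based grading and are not covered here.

```python
from typing import Callable, Hashable, Sequence


def recall_at_k(
    examples: Sequence[tuple[str, Hashable]],
    retrieve_ids: Callable[[str], Sequence[Hashable]],
    k: int = 5,
) -> float:
    """Fraction of questions whose expected chunk id is in the top-k results."""
    if not examples:
        raise ValueError("examples must not be empty")
    hits = 0
    for question, expected_id in examples:
        top_ids = list(retrieve_ids(question))[:k]  # ranked ids, best first
        if expected_id in top_ids:
            hits += 1
    return hits / len(examples)


if __name__ == "__main__":
    # Hypothetical canned retriever standing in for the real one.
    canned = {"How do refunds work?": [7, 3, 9], "Where is my order?": [2, 4]}
    examples = [("How do refunds work?", 3), ("Where is my order?", 8)]
    score = recall_at_k(examples, lambda q: canned.get(q, []))
    print(f"recall@5 = {score:.0%} (target: >85%)")
```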
