The RAG Step Most Teams Skip: Parse Questions First
Enterprise RAG systems fail before they even search. Six contrarian lessons on why question parsing determines whether your document AI can actually be trusted.
The previous issue in this series established that good RAG starts before retrieval: structured question parsing routes queries to the right strategy before a single embedding is computed. This issue lives inside the retrieval brick itself — the moment your system decides which document chunks to surface in response to an enterprise query.
The industry has settled on a reflex: embed the query, compute cosine similarity, return the top-k chunks. That reflex is reasonable for demos. It is insufficient for enterprise document intelligence. Here are six positions that challenge it.
Cosine similarity tells you that two vectors point in similar directions in high-dimensional space. It does not tell you that a document is the right answer to a business question.
Consider a contract management system at a Colombian logistics company. The query is: "What are the force majeure clauses that apply to port delays?" An embedding model will return chunks that are semantically adjacent to force majeure — general liability clauses, exclusion clauses, event-of-default definitions. They score well on cosine. They are not the answer.
Relevance is a function of context, intent, and specificity — none of which cosine alone encodes. It captures what a chunk is about, not whether it answers this question for this user in this situation.
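To make the point concrete, here is a minimal sketch of what cosine similarity actually computes. The three-dimensional vectors are invented for illustration; real embeddings have hundreds or thousands of dimensions and come from a trained model.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors: direction only, no notion of intent."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (assumed values, not produced by any real model).
query = [0.9, 0.1, 0.3]               # "force majeure clauses that apply to port delays"
general_liability = [0.8, 0.2, 0.3]   # topically adjacent, but not the answer
port_delay_clause = [0.7, 0.1, 0.6]   # the clause the user actually needs

print(cosine(query, general_liability))
print(cosine(query, port_delay_clause))
```

With these assumed vectors the adjacent clause scores higher than the correct one. Nothing in the formula knows which chunk answers the question; it only measures how closely two directions align.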
The business implication: a retrieval layer built exclusively on cosine will produce responses that sound plausible but are drawn from the wrong sections of the wrong documents. Legal, finance, and compliance use cases have low tolerance for that failure mode.
Enterprise documents are full of identifiers: contract numbers, product SKUs, regulatory article references, employee IDs, jurisdiction codes. Embedding models are trained to capture the meaning of language, not to tell one near-identical identifier from another.
BM25, a classic keyword-based ranking function, handles exact-match retrieval with a precision that no embedding model reliably replicates. When a procurement director at a Mexican manufacturer asks "What does clause 14.3 of contract MX-2024-0091 say about delivery penalties?", BM25 finds clause 14.3 of that contract. A pure embedding search might return similar delivery penalty language from five different contracts.
The position is not that lexical retrieval replaces semantic retrieval. It is that removing lexical retrieval from the pipeline is a silent precision failure that only surfaces in production.
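A minimal sketch of lexical scoring, assuming the standard Okapi BM25 formula with commonly used default parameters (k1 = 1.5, b = 0.75). The tokenizer, the helper names, and the three sample chunks are hypothetical; a production system would normally rely on a search engine's built-in BM25 rather than hand-rolled code.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Keep identifiers such as "mx-2024-0091" and "14.3" as single tokens.
    return re.findall(r"[a-z0-9]+(?:[.-][a-z0-9]+)*", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = tokenize(query)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Invented sample chunks; only contract MX-2024-0091 comes from the example above.
docs = [
    "Contract MX-2024-0091 clause 14.3: delivery penalties accrue at 0.5% per day of delay.",
    "Contract MX-2023-0417 clause 9.1: delivery penalties are capped at 10% of order value.",
    "Contract MX-2024-0102 clause 14.3: warranty claims must be filed within 30 days.",
]
query = "What does clause 14.3 of contract MX-2024-0091 say about delivery penalties?"
scores = bm25_scores(query, docs)
best = docs[scores.index(max(scores))]
```

Because the contract number appears in only one chunk, it carries a high inverse document frequency and pulls that chunk to the top, even though the other two share most of the remaining vocabulary.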
Most RAG implementations treat metadata filtering as a post-processing step or a UI convenience. In enterprise document intelligence, it is a first-class retrieval gate.
Before computing any embedding distance, the retrieval system should filter by:
An operations leader in Buenos Aires asking about HR policy updates does not need the system searching across financial reports and supplier agreements. Metadata filtering narrows the candidate set before vector search runs — improving precision and reducing compute cost simultaneously.
The question-parsing layer is responsible for extracting these filters from the user's query. The retrieval brick must be built to receive and apply them.
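As a rough sketch of that hand-off, the retrieval brick can apply parsed filters as an exact-match gate before any vector search runs. The `Chunk` type and the `doc_type` and `department` fields below are hypothetical stand-ins chosen for illustration, not a prescribed schema.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def apply_metadata_filters(chunks, filters):
    """Keep only chunks whose metadata matches every parsed filter exactly."""
    return [
        chunk for chunk in chunks
        if all(chunk.metadata.get(key) == value for key, value in filters.items())
    ]

# Hypothetical corpus and filter fields (assumptions for this example).
corpus = [
    Chunk("Remote work policy, updated version.", {"doc_type": "policy", "department": "hr"}),
    Chunk("Q3 consolidated financial report.", {"doc_type": "report", "department": "finance"}),
    Chunk("Supplier agreement, delivery terms.", {"doc_type": "contract", "department": "procurement"}),
]

# Filters the question-parsing layer might extract from an HR policy question.
parsed_filters = {"doc_type": "policy", "department": "hr"}
candidates = apply_metadata_filters(corpus, parsed_filters)
```

Only the surviving candidates go on to lexical and vector search. Most vector databases can apply this kind of filter natively, which is usually preferable to filtering in application code.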
Published retrieval benchmarks and production experience from enterprise deployments point the same way: combining sparse retrieval (BM25) with dense retrieval (embeddings) and merging the results with Reciprocal Rank Fusion (RRF) generally outperforms either method alone on both precision and recall.
RRF scores each document by its rank position in each retrieval list, not by its raw score. That makes it robust to the scale mismatch between BM25 scores, which are unbounded, and cosine similarities, which always fall between -1 and 1.
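A minimal sketch of the fusion step, assuming the usual RRF formulation in which each list contributes 1 / (k + rank) per document and k = 60 is the conventional default. The chunk IDs are placeholders.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs using rank positions only."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Placeholder chunk IDs, best first, as each retriever returned them.
bm25_ranking = ["c3", "c1", "c7"]
dense_ranking = ["c1", "c5", "c3"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Here `fused` comes out as c1, c3, c5, c7: the two chunks that both retrievers agree on rise above chunks that only one of them found, and no raw BM25 or cosine score ever enters the calculation.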
The practical consequence for a LATAM company evaluating RAG platforms: any vendor or internal build that defaults to embeddings-only retrieval is leaving retrievable precision on the table. The architecture question is not "vector database or keyword search?" It is "how does the system fuse both — and which method gets priority when they disagree?"
Retrieval is a candidate generation step. Ranking is the relevance decision.
The top-10 chunks returned by hybrid retrieval are not necessarily the right 3 to pass to the LLM for generation. A cross-encoder re-ranker — a model that scores query-document pairs jointly, rather than comparing independent embeddings — assigns final relevance scores with significantly higher accuracy than cosine similarity.
Think of it in two stages: the retrieval layer casts a wide net efficiently; the re-ranker makes the precision call. In enterprise contexts where the LLM's context window is finite and every irrelevant chunk increases hallucination risk, this distinction is not academic. It is the difference between a system that answers correctly 70% of the time and one that answers correctly above 90% on documents that matter to the business.
Cross-encoders are computationally heavier than embedding comparisons, which is precisely why they run on a shortlist — not the full corpus.
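The shortlist pattern can be sketched as follows. The `score_pair` parameter is a hypothetical hook where a real cross-encoder would be called; the word-overlap scorer is only a stand-in so the example runs without a model, and it is not a cross-encoder.

```python
def rerank(query, candidates, score_pair, top_n=3):
    """Score each (query, chunk) pair jointly and keep the best top_n chunks."""
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]

def toy_overlap_scorer(query, chunk):
    # Stand-in for a cross-encoder: the fraction of query words found in the chunk.
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0

# Shortlist as it might come out of hybrid retrieval (invented text).
shortlist = [
    "general liability limits",
    "force majeure applies to port delays",
    "event of default definitions",
]
top_chunks = rerank("force majeure port delays", shortlist, toy_overlap_scorer, top_n=1)
```

The cost profile is visible in the shape of the code: `score_pair` runs once per candidate, so the expensive model only ever sees the shortlist, never the full corpus.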
Every retrieval system should output not just ranked chunks, but a confidence signal: how certain is the system that the retrieved content is relevant to this query?
When confidence is low — because no chunk scores above a meaningful threshold, because the query doesn't match the indexed document types, or because the question falls outside the corpus — the system has two honest options: escalate to a human or return a structured "no confident match" response.
What it should not do is pass low-confidence chunks to the generation layer and let the LLM fill the gaps. That is where hallucinations originate in enterprise RAG. Not from the LLM. From a retrieval layer that does not know how to say "I don't have enough."
A CFO reviewing an AI-generated regulatory compliance summary needs to trust the answer comes from the document — not from the model interpolating low-ranked tangential chunks.
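One way to make "I don't have enough" an explicit outcome is a gate between ranking and generation. This is a sketch under assumptions: the `RetrievalResult` shape, the status strings, and the 0.5 threshold are all hypothetical, and a real threshold would need calibration against labeled queries for the specific re-ranker in use.

```python
from dataclasses import dataclass

@dataclass
class RetrievalResult:
    status: str       # "ok" or "no_confident_match"
    chunks: list      # chunks cleared for the generation layer
    top_score: float  # best relevance score seen, kept for auditing

def gate_by_confidence(scored_chunks, threshold=0.5):
    """Pass only chunks at or above the threshold; otherwise refuse explicitly."""
    top_score = max((score for score, _ in scored_chunks), default=0.0)
    confident = [(score, chunk) for score, chunk in scored_chunks if score >= threshold]
    if not confident:
        # Nothing clears the bar: return a structured refusal, not weak context.
        return RetrievalResult("no_confident_match", [], top_score)
    confident.sort(key=lambda pair: pair[0], reverse=True)
    return RetrievalResult("ok", [chunk for _, chunk in confident], top_score)

# Invented (score, chunk) pairs as a re-ranker might produce them.
strong = gate_by_confidence([(0.82, "clause 14.3 text"), (0.41, "unrelated annex")])
weak = gate_by_confidence([(0.22, "tangential paragraph")])
```

A caller can route a `no_confident_match` result to a human reviewer or to a fixed response. Either way, the LLM never receives the tangential chunks.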
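The six positions above converge on a single principle: retrieval in enterprise RAG is a layered pipeline, not a single operation. A production-ready retrieval brick, assembled from those positions, runs in this order: metadata filtering driven by the parsed query, lexical (BM25) and dense retrieval over the filtered candidates, fusion of the two result lists with RRF, cross-encoder re-ranking of the shortlist, and a confidence gate before anything reaches the generation layer.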
Skipping any of these steps does not simplify the system. It transfers complexity downstream to the generation layer, where it cannot be handled cleanly.
The cosine-first reflex is not wrong for general-purpose search. It is insufficient for enterprise document intelligence, where documents are structured, queries are specific, identifiers matter, and the cost of a wrong answer is measured in contracts, compliance exposure, and operational decisions — not click-through rates.
Next in this series: the generation brick — how the LLM is prompted, constrained, and audited once the right chunks have been retrieved.
Xenturia designs and implements enterprise document intelligence systems with mid-sized companies across Latin America. If your current RAG setup returns plausible answers from the wrong documents, that is a retrieval architecture problem — and it is fixable.