Retrieval-Augmented Generation (RAG) systems have risen to prominence as a powerful way for organizations to leverage large language models (LLMs). By combining external knowledge sources with the generative prowess of LLMs, RAG systems produce more accurate and contextually grounded responses. But to make these systems really sing, we must get a fundamental ingredient right: retrieval. After all, the best LLM in the world won’t help if it’s fed the wrong data or incomplete context.
In this article, we’ll explore how LLMs can improve retrieval in a few targeted ways:
Smarter parsing of both text and visual source data
Query preprocessing, expansion, and context-aware embeddings
Reranking and iterative refinement of the retrieved results
By focusing on these enhancements, we can build RAG systems that consistently produce high-fidelity, contextually rich responses.
At the heart of retrieval lies the process of understanding large swaths of text. Traditional methods rely on keyword extraction, TF-IDF scoring, or basic embeddings to catalog content. However, LLMs—especially models fine-tuned to parse text with context—can go deeper:
Semantic Chunking: Instead of treating text in uniform blocks or n-grams, LLMs can detect semantic boundaries (e.g., transitions in topic, argument boundaries, or relevant entities) and mark these boundaries for more meaningful document chunking. This leads to fewer retrieval misses and more specific results when responding to a query.
Enhanced Metadata Extraction: LLMs can automatically extract key entities (people, organizations, events) or relational metadata (e.g., who did what, when, and why). This structured data can subsequently enable a more refined search and improve how we rank the retrieved documents.
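To make the chunking idea concrete, here’s a minimal sketch of LLM-driven semantic chunking, assuming the OpenAI Python SDK (any chat-capable model would do); the delimiter-based prompt and the model name are illustrative choices rather than a prescribed recipe, and a similar prompt that returns JSON can cover the metadata extraction.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CHUNK_PROMPT = (
    "Insert the marker <<<SPLIT>>> wherever the topic, argument, or section "
    "changes in the following text. Return the full text with markers added "
    "and no other commentary.\n\nTEXT:\n{text}"
)

def semantic_chunks(text: str, model: str = "gpt-4o-mini") -> list[str]:
    """Ask the LLM to mark semantic boundaries, then split on them."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": CHUNK_PROMPT.format(text=text)}],
    )
    marked = response.choices[0].message.content
    # Split on the marker and drop empty fragments.
    return [chunk.strip() for chunk in marked.split("<<<SPLIT>>>") if chunk.strip()]
```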
But the real world (and enterprise archives!) often doesn’t limit itself to text. A wealth of knowledge lives in images, videos, or scanned PDFs. Vision-enabled LLMs (sometimes called multimodal LLMs) open the door to:
Detecting and Extracting Text from Images: OCR (Optical Character Recognition) has been around for a while, but LLMs integrated with vision models can interpret text in context. For example, they can detect the difference between handwritten notes on a photograph vs. labels on a chart. This context lets the system decide what data is relevant for downstream retrieval.
Object and Scene Understanding: Beyond simple text, vision-capable LLMs can parse objects in images, identify brand logos or product details in pictures, and connect those details to textual references in your database. This can be invaluable in systems that rely on both visual and textual cues—for instance, an e-commerce platform retrieving relevant product manuals and reviews.
By incorporating LLM-driven text and vision parsing, RAG systems gain a comprehensive understanding of the source data, ensuring that no relevant piece of information goes unnoticed.
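As an illustration of the vision step, the sketch below sends an image to a multimodal model and asks for a transcription plus a short, indexable description; it assumes the OpenAI Python SDK and a vision-capable model such as gpt-4o, and the prompt wording is only a starting point.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def describe_image(path: str, model: str = "gpt-4o") -> str:
    """Extract text and a brief description of visual content from an image."""
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe any text in this image and briefly describe "
                         "charts, labels, or objects so the result can be indexed "
                         "for search."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```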
While robust data parsing is one side of the retrieval coin, query preprocessing is the other. The way your system processes an incoming user query and transforms it into something search engines (or vector databases) understand can make all the difference.
When a user asks a question, it might be ambiguous, misspelled, or missing crucial context. LLMs can step in by:
Correcting spelling and normalizing product or domain terms
Disambiguating vague phrasing using conversational or domain context
Expanding the query with synonyms, related terms, and alternate phrasings
For example, a user might type “How do I fix a jam in a kyocera 3212 printer?” An LLM can detect the key terms—“fix,” “jam,” “Kyocera 3212 printer”—and generate synonyms or alternate phrasings like “repair,” “paper jam,” “printer model 3212.” The final expanded query might be “(fix OR repair) AND (jam OR paper jam) AND (Kyocera 3212 printer).” Having these expansions boosts the recall of relevant documents during retrieval.
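Here’s a minimal sketch of that expansion step, again assuming the OpenAI Python SDK; the prompt and the boolean output format are illustrative and should be adapted to whatever query syntax your search engine actually accepts.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

EXPANSION_PROMPT = (
    "Rewrite the user's search query as a boolean keyword query. "
    "Fix spelling, keep product names intact, and add synonyms or alternate "
    "phrasings in OR groups. Return only the query.\n\nUser query: {query}"
)

def expand_query(query: str, model: str = "gpt-4o-mini") -> str:
    """Produce a keyword-friendly expansion of a raw user query."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": EXPANSION_PROMPT.format(query=query)}],
    )
    return response.choices[0].message.content.strip()

# e.g. expand_query("How do I fix a jam in a kyocera 3212 printer?")
# might return: '(fix OR repair) AND (jam OR "paper jam") AND (Kyocera 3212 printer)'
```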
Modern retrieval engines increasingly rely on embeddings to gauge semantic similarity. LLMs can produce query embeddings that are sensitive to context. However, if the initial user query is underspecified (e.g., “climate change policy outcomes?”), the embedding might not fully reflect the user’s underlying intent.
A helpful approach is to ask the model to expand or clarify the user’s query before generating the final embedding. For instance, the model might restate it as:
“A request for analysis of changes in legislation, regulations, or international agreements pertaining to climate change and the impacts of these policy decisions.”
Then, it generates an embedding that more accurately captures the user’s need, leading to more relevant search results.
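A minimal sketch of this clarify-then-embed flow, assuming the OpenAI Python SDK; the restatement prompt and the text-embedding-3-small model are illustrative choices.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def embed_clarified_query(query: str) -> list[float]:
    """Restate an underspecified query, then embed the restatement."""
    restated = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Restate this search query as one explicit, self-contained "
                       f"sentence describing what the user wants: {query}",
        }],
    ).choices[0].message.content.strip()

    # Embed the clarified restatement instead of the raw query.
    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=restated,
    )
    return embedding.data[0].embedding
```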
An important technique is to combine classical keyword-based search with embedding-based search. You could:
Run the (expanded) query through a keyword engine such as BM25
Run the same query, or its clarified restatement, through a vector database using embeddings
Why both? In some cases, exact matches (via keywords) are critical, especially for domain-specific terms, while in other cases, semantic closeness (via embeddings) finds conceptually relevant documents. Combining them can be done in two main ways:
Merging the two result sets and reranking the union with a single downstream scorer
Fusing the two rankings directly, for example with a weighted sum or reciprocal rank fusion
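The second option is illustrated below with reciprocal rank fusion, a simple and widely used recipe; keyword_search and vector_search are hypothetical stand-ins for your BM25 index and vector database.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in, so
    documents ranked highly by either retriever float to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in result_lists:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical usage: keyword_search() over a BM25 index and vector_search()
# over a vector database, each returning ranked document IDs.
# fused = reciprocal_rank_fusion([keyword_search(query), vector_search(query)])
```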
After retrieving an initial set of documents—usually the top k—there’s one more step that can significantly improve your RAG system’s accuracy: reranking. LLMs excel here, because they can read the entire set of candidate documents, compare them to the user’s query (expanded or otherwise), and produce a score or preference ordering.
LLMs can answer questions like:
Does this document directly address the user’s question, or only mention the topic in passing?
Does it cover the specific product, version, or entity the user named?
Does it contain actionable detail, or just background material?
You can prompt an LLM with instructions to score or label each document, such as:
“For each of these 10 documents, provide a score from 1 to 5 indicating how well it addresses the question ‘How do I fix a jam in a Kyocera 3212 printer?’”
A subsequent module can then finalize the ranking by combining the LLM’s scores with existing ranking signals (like BM25 or embedding similarity).
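Here’s a minimal reranking sketch along those lines, assuming the OpenAI Python SDK; the 1-to-5 scoring prompt mirrors the instruction above, and the 50/50 blend with the retriever’s own score is just one reasonable weighting.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

SCORING_PROMPT = (
    "On a scale of 1 to 5, how well does the following document answer the "
    "question '{question}'? Reply with a single digit.\n\nDOCUMENT:\n{doc}"
)

def rerank(question: str, docs: list[dict], model: str = "gpt-4o-mini") -> list[dict]:
    """Score each candidate with the LLM and blend with the retriever score.

    Each doc is expected to look like {"text": ..., "retriever_score": ...},
    with retriever_score already normalized to [0, 1].
    """
    for doc in docs:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": SCORING_PROMPT.format(question=question,
                                                        doc=doc["text"])}],
        ).choices[0].message.content
        llm_score = int(reply.strip()[0]) / 5.0  # map "1".."5" onto 0.2..1.0
        doc["final_score"] = 0.5 * llm_score + 0.5 * doc["retriever_score"]
    return sorted(docs, key=lambda d: d["final_score"], reverse=True)
```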
For complex queries, an iterative approach can help. The system can:
Retrieve an initial set of documents for the original query
Ask the LLM whether those documents actually answer the question, and what is missing
Reformulate or expand the query based on the identified gaps
Retrieve again, repeating until the context is sufficient or a round limit is reached
This iterative approach refines retrieval in real time and can dramatically improve your RAG system’s capacity to handle open-ended or extremely specialized questions.
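The loop below sketches that idea; retrieve is a hypothetical stand-in for your hybrid retriever, the OpenAI Python SDK is assumed for the coverage check, and the stopping prompt and three-round cap are illustrative.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def iterative_retrieve(question: str, retrieve, max_rounds: int = 3) -> list[str]:
    """Retrieve, let the LLM judge coverage, and reformulate until satisfied.

    `retrieve` is a hypothetical callable: query string -> list of passages.
    """
    query, context = question, []
    for _ in range(max_rounds):
        context.extend(retrieve(query))
        verdict = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": (
                f"Question: {question}\n\nPassages:\n" + "\n---\n".join(context) +
                "\n\nIf these passages fully answer the question, reply DONE. "
                "Otherwise reply with a better search query for the missing "
                "information."
            )}],
        ).choices[0].message.content.strip()
        if verdict.upper().startswith("DONE"):
            break
        query = verdict  # use the LLM's reformulation in the next round
    return context
```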
Imagine you have a tech support knowledge base with both text documents (manuals, FAQs, release notes) and images (diagrams, scanned setup instructions). The steps might look like this:
1. Parse the manuals, FAQs, and release notes with an LLM to produce semantically coherent chunks and structured metadata.
2. Run the diagrams and scanned setup instructions through a vision-enabled LLM to transcribe labels and describe what each image shows, indexing those descriptions alongside the text.
3. When a user asks how to install a spool, expand and clarify the query, then run both keyword and embedding search over the combined index.
4. Have the LLM rerank the fused candidates against the user’s question and pass the top results, text and diagrams alike, to the generation step.
The result? The user gets a short, accurate snippet describing the spool installation steps, with relevant diagrams at the ready.
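Pulling the pieces together, one possible orchestration looks like the sketch below; every argument is a hypothetical callable standing in for one of the components sketched earlier, so this is a wiring diagram rather than a drop-in implementation.

```python
def answer_support_question(question: str, expand_query, keyword_search,
                            vector_search, fuse, rerank, generate,
                            top_k: int = 5) -> str:
    """Expand -> hybrid retrieve -> fuse -> rerank -> generate.

    All callables are hypothetical stand-ins for the components sketched
    earlier in this article (query expansion, BM25 and vector retrieval,
    rank fusion, LLM reranking, and answer generation).
    """
    expanded = expand_query(question)                # keyword-friendly boolean query
    candidates = fuse([keyword_search(expanded),     # exact/term matches
                       vector_search(question)])     # semantic matches
    best = rerank(question, candidates)[:top_k]      # LLM-scored ordering
    return generate(question, best)                  # grounded answer plus any diagrams
```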
These techniques inject an intelligent, context-aware layer into the retrieval pipeline. When done right, your RAG system transforms from a so-so aggregator into a powerful, knowledge-driven assistant—fulfilling the promise of LLMs to deliver smarter, more relevant information at scale.
Conclusion
Building robust RAG systems is all about marrying smart retrieval with generative AI. By allowing LLMs to parse text and visuals in context, expand queries accurately, and provide a final reranking step, you supercharge your system’s ability to return the right information—and thus generate more informed, correct responses. Whether you’re dealing with customer support, product manuals, or academic research, applying these techniques will make your RAG system not just a repository of knowledge, but a truly insightful solution.