Modern AI applications often need to index and search across a variety of documents (text files, HTML pages, PDFs, Word/Excel docs, JSON data, etc.). Microsoft’s Kernel Memory project provides a reference implementation of such an indexer, integrating Azure Cognitive Search (Azure AI Search) as a vector store and Azure OpenAI for language model processing (Building an Intelligent Document Chatbot with Kernel Memory | Office 365 Clinic). This guide explores how Kernel Memory’s indexer works and how you can replicate its functionality in C#. We’ll cover the content ingestion pipeline (text extraction, chunking, embedding generation, and indexing), show how to use Azure OpenAI for embeddings and other LLM-driven processing, and discuss extending the approach for structured data like Excel tables and JSON. Implementation tips, code snippets, and architecture patterns are included to help you build a proof-of-concept that can scale into a robust framework.
Kernel Memory (KM) is a Retrieval-Augmented Generation (RAG) service that can ingest multi-format documents and enable natural language queries over them (Building an Intelligent Document Chatbot with Kernel Memory | Office 365 Clinic) (Building an Intelligent Document Chatbot with Kernel Memory | Office 365 Clinic). At its core, KM uses a four-step ingestion pipeline to index content:
- Extract Text – Automatically detect the file type (e.g. PDF, Word, Excel, HTML) and extract its text content (Building an Intelligent Document Chatbot with Kernel Memory | Office 365 Clinic). This may involve OCR for images or using appropriate decoders for each format.
- Partition Text (Chunking) – Split the extracted text into smaller chunks or “partitions” optimized for search and LLM processing (Building an Intelligent Document Chatbot with Kernel Memory | Office 365 Clinic). Each chunk is typically a few hundred words or tokens long.
- Generate Embeddings – For each text chunk, call an embedding model (e.g. OpenAI’s text-embedding-ada-002) to obtain a high-dimensional vector embedding that captures the semantic meaning (Building an Intelligent Document Chatbot with Kernel Memory | Office 365 Clinic) (Building an Intelligent Document Chatbot with Kernel Memory | Office 365 Clinic). Azure OpenAI is used as the embedding generator in KM’s default setup.
- Store in Vector Index – Save the embedding vectors and associated metadata into a vector database for retrieval (Building an Intelligent Document Chatbot with Kernel Memory | Office 365 Clinic) (Building an Intelligent Document Chatbot with Kernel Memory | Office 365 Clinic). In KM, this can be Azure Cognitive Search (Azure AI Search) configured with vector search capability, among other options (Architecture | Kernel Memory) (Architecture | Kernel Memory).
These steps form the default indexing pipeline in Kernel Memory (Architecture | Kernel Memory). The diagram below illustrates this pipeline and how it can be customized or extended (for tasks like summarization or synthetic data generation):
(GitHub - microsoft/kernel-memory: RAG architecture: index and query any data using LLM and natural language, track sources, show citations, asynchronous memory patterns.) Kernel Memory’s default ingestion pipeline (top) consists of content extraction, chunking, vectorization (embedding generation), and storage. This pipeline can be extended or customized – for example, adding a summarization step or custom transformations – before storing results (Architecture | Kernel Memory) (Architecture | Kernel Memory).
Under the hood, KM implements each step as a handler in a processing chain (Architecture | Kernel Memory) (Architecture | Kernel Memory). For example, a TextExtractionHandler extracts content and a GenerateEmbeddingsHandler calls the Azure OpenAI embedding service for each chunk (Architecture | Kernel Memory). If you’re building your own indexer, you can follow a similar pattern: treat each stage as a separate component or function. In the sections below, we break down these stages and show how to implement them in C# with Azure OpenAI and Azure Cognitive Search.
Text extraction is the first step – pulling out textual content from files so it can be indexed. Kernel Memory supports many file types (DOCX, PDF, XLSX, PPTX, HTML, JSON, images with text, Markdown, etc.) (Querying data and documents using LLM models with Kernel Memory | by Adolfo | Globant | Medium). It uses format-specific decoders to handle each type (for instance, a Word decoder for .docx, an Excel decoder for .xlsx, etc.). In your C# solution, you have a few options for implementing this:
- Use Libraries/SDKs: Leverage file format libraries to extract text:
- Word Documents (.docx) – Use the Open XML SDK or libraries like DocumentFormat.OpenXml to read paragraphs, or Syncfusion/Aspose for richer parsing.
- PDFs – Use libraries like PdfSharp, iText7, or Azure SDKs (e.g. Azure.AI.FormRecognizer for OCR) to extract text from PDF content.
- Excel Spreadsheets (.xlsx) – Use a library such as ClosedXML or EPPlus to read cell values. You might concatenate cell texts row by row or column by column into a plain text representation.
- HTML/Web Pages – Use an HTML parser like HtmlAgilityPack or AngleSharp to extract visible text (skipping script/style tags).
- JSON files – Simply treat JSON as text, or format it into a more readable form (more on this in the Tabular Data section below).
- Images (scans) – Use Azure Cognitive Services (Azure AI Vision) OCR or Tesseract OCR to get text from images if needed.
Using specialized decoders ensures content is extracted in a clean, readable way. For example, KM’s Excel decoder reads each cell’s text content and replaces non-text cells with blanks by default ([Bug] MsExcelDecoder.DecodeAsync only works on text data types · Issue #447 · microsoft/kernel-memory · GitHub) ([Bug] MsExcelDecoder.DecodeAsync only works on text data types · Issue #447 · microsoft/kernel-memory · GitHub). In a custom implementation, you’d improve this by converting all cell values to text (e.g. using XLCell.Value.ToString() in ClosedXML to capture numbers) so nothing is lost. The goal is to produce a raw text stream (or set of text sections) from the input file.
When implementing extraction, preserve basic structure if possible. Kernel Memory organizes extracted text into sections (e.g. pages or segments) and ensures clean separation between sections in the output (kernel-memory/service/Core/Handlers/TextExtractionHandler.cs at main · microsoft/kernel-memory · GitHub) (kernel-memory/service/Core/Handlers/TextExtractionHandler.cs at main · microsoft/kernel-memory · GitHub). You can do similarly – for instance, insert double line breaks between logical sections like pages or spreadsheet tabs to signal boundaries in the text. This helps later during chunking and for retrieving context.
Implementation Tip: Start by identifying the MIME type or file extension and routing the file to the appropriate parser. In code, you might have a function like
ExtractTextFromFile(string filePath)that checks the extension and calls a specific helper (one for PDFs, one for Word, etc.). For example: if the file is.docx, open it via OpenXML and extract all text; if.html, use an HTML parser to get innerText of the body. Ensure you handle exceptions (e.g. if a file is encrypted or an unsupported format).
Once you have the raw text, the next step is partitioning it into smaller chunks. Long documents must be broken into chunks for two reasons: to fit token limits of models and to improve retrieval granularity. Kernel Memory’s default TextPartitioningHandler splits the extracted text into sentences and then groups sentences into paragraphs (partitions) up to a certain token length (Architecture | Kernel Memory) (Partitioning & chunking | Kernel Memory). By default, it targets ~1000 tokens per chunk with an overlap of ~100 tokens between chunks (Partitioning & chunking | Kernel Memory).
In a custom C# indexer, you can implement chunking as follows:
- Tokenization: If possible, use a tokenizer to count tokens (OpenAI uses GPT-3/GPT-4 tokenization). You might use the Azure OpenAI SDK’s
GetEmbeddingsas a rough measure (it will error if input is too large) or use a library like GPTToken or TokenizerSharp to estimate tokens. If a tokenizer isn’t available, splitting by character count (e.g. ~4,000 characters ~ 1000 tokens) can serve as a proxy, but token-aware splitting is more precise. - Sentence Splitting: Break the text into sentences or logical units. You can use simple heuristics (split on
.?!followed by whitespace) or regex, being careful with abbreviations. Another approach is splitting by newline for text that already has line breaks (e.g. OCR or certain formats). - Group into Chunks: Accumulate sentences until the chunk’s token count is near the limit (e.g. 800–1000 tokens). Ensure chunks don’t exceed the max token size of your embedding model (for text-embedding-ada-002, that’s around 8191 tokens, but keeping chunks much smaller is better for relevance (Partitioning & chunking | Kernel Memory)).
- Overlap: When starting a new chunk, prepend some content from the end of the previous chunk (e.g. last 1-2 sentences or ~100 tokens) (Partitioning & chunking | Kernel Memory) (Partitioning & chunking | Kernel Memory). This overlapping context helps preserve continuity – queries that hit content straddling a boundary might then find it in at least one of the chunks.
For example, you might implement chunking like:
List<string> ChunkText(string text, int maxTokens = 1000, int overlapTokens = 100)
{
var chunks = new List<string>();
// (Pseudo-code) Use a tokenizer to split into sentences with token counts
var sentences = SplitIntoSentences(text);
int i = 0;
while (i < sentences.Count)
{
var chunkBuilder = new StringBuilder();
int tokenCount = 0;
// Start each chunk with some overlap from end of last chunk
if (chunks.Count > 0 && overlapTokens > 0)
{
string overlapText = GetLastNTokens(chunks.Last(), overlapTokens);
chunkBuilder.Append(overlapText);
tokenCount = CountTokens(overlapText);
}
// Add sentences until reaching maxTokens
while (i < sentences.Count && tokenCount < maxTokens)
{
string sentence = sentences[i];
int sentenceTokens = CountTokens(sentence);
if (tokenCount + sentenceTokens > maxTokens) break;
chunkBuilder.Append(sentence).Append(" ");
tokenCount += sentenceTokens;
i++;
}
chunks.Add(chunkBuilder.ToString().Trim());
}
return chunks;
}In practice, you will need a real token counting mechanism (either via an API or a library). The above is a conceptual outline. The result should be a list of textual chunks, each a few hundred words/tokens long. These chunks are what we’ll feed into the embedding generator.
Note: The default KM partitioner handles most text as a simple string. It doesn’t have special logic for code or JSON, which may result in suboptimal splits for those formats (Architecture | Kernel Memory). If your use case involves a lot of structured text (like JSON), you might implement a custom chunker that, say, splits JSON objects or key-value pairs more intelligently (e.g. one JSON object per chunk if they are large, or grouping related key-values together). The goal is to make each chunk a self-contained, coherent piece of content.
With text chunks in hand, the indexer next creates an embedding vector for each chunk using an LLM embedding model. Kernel Memory supports multiple embedding providers (OpenAI, Azure OpenAI, local models, etc.), but commonly uses OpenAI’s Ada v2 model for text embeddings (Embedding | Kernel Memory). In our case, we’ll use Azure OpenAI to generate embeddings. Each chunk of text is sent to Azure OpenAI’s embedding endpoint, and we get back a vector (for text-embedding-ada-002, a 1536-dimension vector of floats).
Setup: First, make sure you have an Azure OpenAI resource with the embedding model deployed, or access to OpenAI’s API. In C#, you can use the Azure.AI.OpenAI SDK or call the REST API directly. Here’s a snippet using the Azure SDK:
// Initialize Azure OpenAI client
var openAiEndpoint = new Uri("https://<your-azure-openai-resource>.openai.azure.com/");
var openAiKey = new AzureKeyCredential("<YOUR-OPENAI-KEY>");
var client = new OpenAIClient(openAiEndpoint, openAiKey);
// Choose your deployment or model name
string embeddingModel = "text-embedding-ada-002"; // or the deployment name of the embedding model
// For each text chunk, get embedding
foreach (string chunk in chunks)
{
EmbeddingsOptions options = new EmbeddingsOptions(chunk);
Response<Embeddings> response = await client.GetEmbeddingsAsync(embeddingModel, options);
IList<float> vector = response.Value.Data[0].Embedding;
// TODO: Store the vector along with metadata (we'll handle storage in the next step)
}This will call Azure OpenAI and retrieve the embedding vector for the input text. In practice, you’ll want to handle exceptions (network issues, rate limits, etc.) and maybe throttle your requests. The Kernel Memory docs note that embedding generation can become a bottleneck if the AI service is throttling (since it processes one chunk at a time by default) (Architecture | Kernel Memory) (Architecture | Kernel Memory). To speed up indexing of large documents, you could send multiple embedding requests in parallel (Azure OpenAI allows concurrent requests within rate limit bounds) or batch chunks together if the API supports it.
Each vector you get back is typically a list/array of floats. Keep track of which chunk it corresponds to (e.g. by index or an ID). You will need to store both the vector and the chunk text (and possibly other metadata like document ID or section) in the search index.
Memory considerations: Each embedding is 1536 floats (~6 KB if using 4-byte floats). For thousands of chunks, this is manageable, but note that storing very large numbers of vectors will consume storage in the search index. Azure Cognitive Search can handle millions of vectors, but you should plan index sizing accordingly.
Finally, it’s worth noting that Azure OpenAI can also be used for other language processing tasks during indexing. For example, one could use an OpenAI completion model to summarize a long document and store the summary as additional metadata, or to extract key entities or generate keywords for tagging. Kernel Memory focuses on embeddings for indexing and then uses the language model at query time to generate answers (Building an Intelligent Document Chatbot with Kernel Memory | Office 365 Clinic). In your custom pipeline, consider if any pre-processing with GPT models would help (though often it’s more efficient to let the model do that at query time rather than at indexing time). We’ll discuss using the model at query time later, but keep in mind Azure OpenAI is a versatile tool in the pipeline beyond just embeddings.
The final step is to store the embeddings and related data into Azure Cognitive Search (Azure AI Search) so that you can run vector similarity queries and retrieve content. Azure Cognitive Search now supports vector search, which allows you to index a field of type “Collection(Edm.Single)” (a float array) with a specified dimensionality and similarity algorithm (cosine for OpenAI embeddings) (Create a vector index - Azure AI Search | Microsoft Learn). To set this up:
- Index Schema: Define an index with appropriate fields. At minimum, you’ll need:
- An ID field (key) – a unique identifier for each chunk or record (e.g. combine a document ID and chunk number).
- A Content field – the chunk’s text content (so you can retrieve or display it, and possibly use keyword search or full-text search on it in addition to vector search).
- A Vector field – to store the embedding (e.g. 1536-dim float array). In the index definition, you’d specify
"dimensions": 1536and"vectorSearchAlgorithm": "cosine"for this field if using the REST API (Create a vector index - Azure AI Search | Microsoft Learn). With the Azure Search .NET SDK, you can use attributes or field builder to denote a vector field. For example, using the latest SDK:Ensure your Azure Cognitive Search service is in a region and SKU that supports vector search (as of writing, vector search is in Azure Cognitive Search preview). You might need to use a preview API version (like 2023-10-01-Preview).public class ChunkRecord { [Searchable] public string Content { get; set; } [SimpleField(IsKey=true)] public string Id { get; set; } [VectorField(VectorDimension = 1536, VectorSearcher = VectorSearcher.Cosine)] public float[] Embedding { get; set; } // ... you can add other fields like Tags, FileName, etc. }
- Index Creation: Create the index in Azure Search (via REST or SDK) with the above schema. For example, using the SDK:
(Alternatively, use the Azure Portal or REST API to create the index with a JSON schema.)
var indexDefinition = new SearchIndex("<your-index-name>", FieldBuilder.BuildForType<ChunkRecord>()); indexDefinition.VectorSearch = new VectorSearch(new VectorSearchAlgorithmConfiguration("my-vector-config", VectorSearchAlgorithm.Cosine)); await searchAdminClient.CreateOrUpdateIndexAsync(indexDefinition);
- Uploading Documents: Once the index is ready, you can start uploading the chunk records. Each record will contain the chunk text, the vector, and any metadata. Using the Azure.Search.Documents SDK:
Here,
var searchClient = new SearchClient(searchEndpoint, "<your-index-name>", new AzureKeyCredential("<SEARCH-API-KEY>")); var batch = IndexDocumentsBatch.Upload(chunkRecords); await searchClient.IndexDocumentsAsync(batch);
chunkRecordsis a collection ofChunkRecordobjects like defined above. Make sure each has a uniqueId. For example, if your document ID is “DOC123” and it was split into 5 chunks, you could assign IDs “DOC123-1”, “DOC123-2”, … “DOC123-5”. Store the original document reference as well (maybe as part of Id or an additional field) so you know which file the chunk came from.
In Kernel Memory’s design, the final SaveRecordsHandler takes the embeddings and saves them to one or more “Memory DBs” (which could be Azure Search, a SQL DB, Qdrant, etc.) along with metadata like source links, tags, etc. (Architecture | Kernel Memory) (Architecture | Kernel Memory). If you want to include additional metadata in Azure Search, you can add fields for those. For instance, you might add a DocumentId field to group chunks by the original document, or Tags for things like author, category, etc., which can be used for filtering results. Azure Search supports filtering and facets on non-vector fields, so you can combine semantic vector search with traditional filtering (e.g. only find results where DocumentId = "DOC123" or Category = "Finance").
After indexing, you’ll have a populated Azure Cognitive Search index where each entry represents a chunk of content with its embedding. You can now perform searches. A vector similarity search query (via the Search REST API or SDK) will allow you to provide a query embedding and retrieve the most similar chunks (with their content). Typically, for a Q&A scenario, you would take a user’s question, use the same embedding model to embed the question, then ask Azure Search for the top N most similar chunks (this is the retrieval step of RAG). Those chunks’ text can then be fed into an Azure OpenAI Completions (or ChatCompletions) API call to generate a final answer with citations.
In the indexing phase, we used Azure OpenAI’s embedding model. Azure OpenAI can also be used with GPT-3.5/GPT-4 models for language generation tasks that augment the index or assist in query answering. Kernel Memory leverages Azure OpenAI not only to create embeddings but also as a Text Generator for answering questions over the indexed data (GitHub - microsoft/kernel-memory: RAG architecture: index and query any data using LLM and natural language, track sources, show citations, asynchronous memory patterns.) (GitHub - microsoft/kernel-memory: RAG architecture: index and query any data using LLM and natural language, track sources, show citations, asynchronous memory patterns.). Here are a couple of ways Azure OpenAI’s language models come into play:
- Answer Generation (RAG): After indexing, when a user asks a question, you retrieve relevant chunks (as described above) and then prompt a GPT model with those chunks to produce a final answer. For example, you might construct a prompt like: “Use the information below to answer the question. Include citations from the source.\n\nSources:\n1. {chunk1 text}\n2. {chunk2 text}\n\nQuestion: {user question}\nAnswer:”. Azure OpenAI’s
ChatCompletionAPI can be used with a system and user message to instruct the model to answer using the provided context. This is essentially what Kernel Memory’sAskAsyncdoes – it performs a search and then asks the LLM to generate an answer, including citations (the chunk sources) in the response (Querying data and documents using LLM models with Kernel Memory | by Adolfo | Globant | Medium). When replicating, you’d use the Azure OpenAI gpt-35-turbo or gpt-4 model for this step. While this goes beyond indexing, it’s important to design your index to support it (store enough information to identify sources for citation). - Summarization & Insights: During indexing, you might also use an LLM to create summaries or extract structured info. For instance, after chunking a long document, you could feed each chunk to a GPT model with a prompt like “Summarize the following text...”. The summaries could be stored in the index (maybe in a separate field or even as a shorter embedding) to allow faster rough answers or to display a blurb in search results. Another example: for JSON or tabular data, an LLM could transform it into readable sentences (e.g. converting a row of data into a sentence) which can then be embedded. This could yield more meaningful embeddings for certain structured data than raw serialization.
Using Azure OpenAI in these ways can greatly enhance the usability of your indexed data, especially for building a Q&A chatbot or analytic helper. Just keep in mind the cost and performance implications (embedding every chunk and also possibly summarizing every chunk doubles the OpenAI calls, for example). It might be something to enable selectively (e.g. only summarize if the document is very large or if it’s a known format like a table that needs it). Always test if the additional LLM processing yields better search results or answers to justify the extra steps.
Handling tabular or structured data (spreadsheets, JSON, CSV, etc.) often requires special consideration. By default, treating such data as a blob of text might lead to suboptimal results. Let’s discuss how to better support these in your indexer:
Excel Spreadsheets (.xlsx): An Excel file might contain tables, lists, or even free-form text. The naive approach is to read all cells row by row and produce one big text blob. However, important context (like column headers) might be lost or repeated excessively. To improve:
- Include headers with each row’s text. For example, if you have a table with headers “Product, Price, Quantity”, and a row “Widget, 10, 5”, instead of just “Widget 10 5”, combine them into a sentence or structured line like: “Product: Widget, Price: 10, Quantity: 5”. This way, if a user asks "What is the price of Widget?" the chunk containing that row will semantically reflect the relationship of Widget to Price.
- Consider one row per chunk if rows are reasonably sized. Each row becomes a small chunk of text. If a sheet has many rows, this creates many chunks, but each chunk is highly focused. You can also group small related rows (e.g. all rows for a single product or category) into one chunk if that makes more sense for retrieval.
- Preserve sheet names or any hierarchical structure. If the workbook has multiple sheets (e.g. “2022 Sales”, “2023 Sales”), you might add a tag or include the sheet name in the text (e.g. prefix each chunk with “2022 Sales - ...”). This metadata can help filter or give context in results.
- Retain numeric data as text. Ensure numbers aren’t dropped. The KM Excel decoder issue ([Bug] MsExcelDecoder.DecodeAsync only works on text data types · Issue #447 · microsoft/kernel-memory · GitHub) ([Bug] MsExcelDecoder.DecodeAsync only works on text data types · Issue #447 · microsoft/kernel-memory · GitHub) showed that non-text cells were coming out blank – you should avoid that by converting every cell value to a string. Even a formula or date can be turned into a textual representation (though format appropriately, e.g. “2024-01-01” for a date).
- If the Excel contains a lot of textual commentary (e.g. a cell has a paragraph of text), handle that like any text (it will naturally be part of the chunk).
- Optionally, use LLM to summarize tables: For example, you could feed an entire table to GPT (if not too large) and ask for a summary (“This spreadsheet lists products and their prices and quantities...”). However, for query retrieval, it’s usually better to index the detailed data so specific queries can find the exact figures.
JSON Documents: JSON can represent structured records or hierarchical data. A straightforward way is to index the raw JSON string, but that’s not semantically rich for vector search. Instead:
- Flatten JSON to text: Suppose you have JSON like
{"Name": "Alice", "Age": 30, "Skills": ["C#", "SQL"]}. You can convert this to a sentence: “Name: Alice; Age: 30; Skills: C#, SQL.” This reads more like natural language and will embed in a way that captures the meaning (Alice is associated with age 30 and those skills). - If the JSON is an array of objects (e.g. a list of records), treat each object as a separate chunk/document (similar to the row-per-chunk approach for Excel). For a large array, you’ll have many small chunks, but that’s fine for search. Make sure to carry an identifier if the JSON as a whole is one file – like
DocumentIdfor the file and maybe an index for the object. - Nested JSON: For nested structures, you might create text that reflects the path. E.g., JSON:
could be flattened into chunks: “Department: HR; Employee Name: Bob; Title: Manager.” and another: “Department: HR; Employee Name: Carol; Title: Analyst.”. This way, a question like "Who is the HR Manager?" would find the first chunk.
{ "Department": "HR", "Employees": [ {"Name": "Bob", "Title": "Manager"}, {"Name": "Carol", "Title": "Analyst"} ] } - Indexes for fields: In Azure Search, you might also consider indexing some JSON fields as separate search fields (especially if you want to support filtering or exact match). For example, you could have a field
Namewith content “Alice” and still have the vector for the combined text. This allows you to do hybrid queries (vector + keyword). However, designing a hybrid approach can be complex. As a simpler approach, converting JSON to plain sentences as above and using purely vector search often works surprisingly well for natural language questions.
Preserve context for numbers and codes: One challenge with tabular data is that numbers or codes by themselves have little semantic meaning to an embedding model (e.g. a part number “X123” or a value “42” might not embed usefully). By surrounding them with context (keys, units, descriptions), the embeddings will be more meaningful. For instance, instead of just “42” in text, have “Age: 42 years” or “Score: 42 points”. This ensures the embedding space understands relationships (42 as an age or score).
Finally, test your approach: ask sample questions that you expect the index to answer (like “What is the Price of Widget?” or “Who is the HR Manager?”) and see if the relevant chunk would be retrieved via vector similarity. You may need to iterate on how you structure the text for such data to improve retrieval accuracy.
Designing an indexer that works for a few documents is one thing; making it scalable and maintainable is another. Kernel Memory is built with a pipeline and handler architecture to allow asynchronous, distributed processing and customization (Architecture | Kernel Memory) (Architecture | Kernel Memory). You can draw inspiration from these patterns when building your own system:
- Decoupling via Storage & Queues: In KM’s service, when a file is uploaded, it’s first saved to a content store (like Azure Blob Storage) and a pipeline job is queued, so the user’s request returns quickly (Architecture | Kernel Memory) (Architecture | Kernel Memory). The actual heavy lifting (OCR, embedding calls, etc.) happens asynchronously. For a scalable solution, consider using Azure Blob Storage for raw files and Azure Queue (or Service Bus) to trigger processing. For example, an Azure Function could be triggered by a new blob and run the indexing steps. This way, you can scale out to multiple processing instances if many files come in at once.
- Microservices or Azure Functions: Each stage could theoretically be its own function or microservice (though that might be overkill). A simpler POC might just have one worker that does all steps sequentially for a document. But as you scale, you could have a function that only extracts text and saves the result, then another function (triggered by a queue message) that chunks and embeds, then another that uploads to search. Chaining via queues allows resilience (if one step fails, you can retry just that step) and flexibility in scaling each part.
- Serverless vs Persistent Service: Running in an Azure Function (serverless) is convenient for event-driven processing. Kernel Memory also offers an embedded mode where the whole pipeline runs in-process (useful for quickly indexing on startup, for example) (Building an Intelligent Document Chatbot with Kernel Memory | Office 365 Clinic) (Building an Intelligent Document Chatbot with Kernel Memory | Office 365 Clinic). In a web app, you might call the indexer pipeline directly if the documents are small and few. But for large corpora, offload to background processing.
- Use of DI and Configuration: If building in C#, structure your code so that components (extractors, chunker, embedder, index writer) are injectable or at least modular. This allows swapping implementations (maybe you start with local OpenAI API, later switch to Azure OpenAI, or change vector DB from Azure Search to another). Kernel Memory achieves this through dependency injection and a builder that registers services like
WithOpenAIDefaults(...)(Querying data and documents using LLM models with Kernel Memory | by Adolfo | Globant | Medium). - Tagging and Metadata: Incorporate the concept of tags/metadata early. Kernel Memory allows tagging documents (like assigning user or category tags) which then flow into the index for filtering (GitHub - microsoft/kernel-memory: RAG architecture: index and query any data using LLM and natural language, track sources, show citations, asynchronous memory patterns.) (GitHub - microsoft/kernel-memory: RAG architecture: index and query any data using LLM and natural language, track sources, show citations, asynchronous memory patterns.). In a custom solution, decide what metadata is important (e.g. source URL, document type, security classification) and ensure it’s captured and stored either in the search index or separately. Azure Search can store facets/filters which is convenient for many metadata types.
- Monitoring and Retries: In a production scenario, you need to monitor the pipeline. If an embedding call fails or a document is of an unsupported format, log it and perhaps mark the document indexing as failed. Kernel Memory likely keeps track of pipeline state (since it’s stateful); you could maintain a status (e.g. in a database or the search index itself with a “status” field that updates once indexing is complete). This way, you can query which documents are ready, which failed, etc. For a POC, logging to console or App Insights might suffice.
- Scaling Azure OpenAI: Keep in mind Azure OpenAI has throughput limits. If you plan to index a large volume quickly, you may need to request a throughput increase or distribute load (for example, use multiple embedding deployments or even multiple regions). Alternatively, batch your requests if possible. Kernel Memory mentions the possibility of optimizing the embedding handler to send requests in parallel or use caching (Architecture | Kernel Memory). A simple approach is to multi-thread the embedding calls for chunks of the same document (since order doesn’t usually matter). Just be cautious to not overwhelm the service – implement a degree of parallelism that your OpenAI resource can handle.
Architecture Diagram – Embedded Indexer: The following diagram illustrates a high-level architecture where the indexing component is embedded in a .NET application, similar to Kernel Memory’s embedded mode:
(GitHub - microsoft/kernel-memory: RAG architecture: index and query any data using LLM and natural language, track sources, show citations, asynchronous memory patterns.) Architecture of an embedded AI indexer in a .NET app (inspired by Kernel Memory (Building an Intelligent Document Chatbot with Kernel Memory | Office 365 Clinic) (Building an Intelligent Document Chatbot with Kernel Memory | Office 365 Clinic)). The Memory Component handles Document Ingestion (parsing and chunking documents), Semantic Search (vector retrieval via Azure Cognitive Search), and Answer Generation (using Azure OpenAI for RAG). Data is stored externally: raw files in document storage, vector index in Azure Search (semantic memory storage), and the OpenAI service provides the AI capabilities.
In the above pattern, your application could be an ASP.NET Web API or a background service. Users upload documents (or point to them), the ingestion pipeline runs (storing files if needed and indexing content into Azure Search). For querying, the app would take user questions, use Azure Search to find relevant chunks, then call Azure OpenAI to generate answers, returning grounded responses with citations. This essentially reproduces what Kernel Memory offers as a service, but tailored to your needs.
For a proof of concept, you might implement everything in a simple console app or a single Azure Function that triggers on a new file and does all steps sequentially. That’s fine to start. Just design the code such that you can later refactor it into separate pieces or add more robust error handling and scaling. For instance, you could write a DocumentIndexer class with methods ExtractText, ChunkText, EmbedChunks, IndexChunks. In the POC, call them in sequence. In a more robust solution, each could be its own service.
By following the above approach, you can build a custom “Kernel Memory”-like indexer using Azure OpenAI and Azure Cognitive Search. To recap, the process is: extract content from documents, chunk it into manageable pieces, embed each piece with Azure OpenAI, and index those vectors into Azure Search for fast similarity search (Building an Intelligent Document Chatbot with Kernel Memory | Office 365 Clinic) (Building an Intelligent Document Chatbot with Kernel Memory | Office 365 Clinic). We also discussed leveraging the power of GPT models for answering queries (completing the RAG loop) and how to refine the pipeline for structured data like tables and JSON.
As you move from POC to a production-ready system, focus on robustness and flexibility:
- Incorporate monitoring, logging, and perhaps a small database to track document indexing status.
- Allow configuration of chunk sizes, embedding model choices, etc., so you can tune for different content types.
- Consider security and privacy – e.g., if certain documents should only be accessible to certain users, you’d want to index with user-specific tags and filter at query time (KM supports multitenancy and security filters in a similar way (GitHub - microsoft/kernel-memory: RAG architecture: index and query any data using LLM and natural language, track sources, show citations, asynchronous memory patterns.) (GitHub - microsoft/kernel-memory: RAG architecture: index and query any data using LLM and natural language, track sources, show citations, asynchronous memory patterns.)).
- Test with real-world data and questions to ensure the embeddings and chunks are yielding relevant results. You may need to adjust the chunking strategy or add custom prompts for the LLM if certain queries aren’t answered well.
Building an AI-powered search over your own data is a rewarding endeavor. With Azure OpenAI and Azure Cognitive Search, much of the heavy lifting (language understanding and scalable search indexing) is handled for you, and projects like Kernel Memory provide a blueprint of best practices (Building an Intelligent Document Chatbot with Kernel Memory | Office 365 Clinic). By replicating and extending these ideas in C#, you can create a powerful indexer tailored to your data, and scale it up from a simple prototype to a robust solution integrated into your applications. Good luck, and happy coding!
Sources:
- Microsoft Kernel Memory – GitHub repository and documentation for the Kernel Memory project, describing its RAG-based indexing pipeline and supported formats (Building an Intelligent Document Chatbot with Kernel Memory | Office 365 Clinic) (Building an Intelligent Document Chatbot with Kernel Memory | Office 365 Clinic) (Querying data and documents using LLM models with Kernel Memory | by Adolfo | Globant | Medium) (Architecture | Kernel Memory).
- Azure Cognitive Search Documentation – Official Microsoft docs on vector search capabilities and index schema requirements for embeddings (Create a vector index - Azure AI Search | Microsoft Learn).
- Azure OpenAI Service Documentation – Guidance on using the Azure OpenAI .NET SDK for embedding and completion tasks. (See Azure.AI.OpenAI client library usage in the code snippet above.)
Great, I’ll update the guide to include how to manage tabular data within the RAG pipeline. I’ll cover both options: (1) using Azure OpenAI embeddings and GPT models to support tabular retrieval, and (2) exploring enhanced options like semantic table encoders or LLM prompt engineering to better represent structured data. The guide will explain how to import Excel spreadsheets or tabular text, how to index them as header-row structures, and how to build a targeted index for querying tabular information via an LLM agent.
In addition to unstructured text, the retrieval-augmented generation (RAG) pipeline can be extended to handle structured tabular data (e.g. Excel sheets, CSV files, database tables). This extension ensures that rows of data can be indexed and retrieved just as effectively as text passages, allowing questions like “What is the total revenue in 2023?” or “List all customers in Texas” to be answered using the tabular content. The overall architecture remains similar – we still parse, chunk, embed, and index the data – but we add steps to import table structure and maintain it through retrieval (RAG data pipeline description and processing steps - Azure Databricks | Microsoft Learn). The diagram below illustrates how tabular data fits into the existing RAG pipeline (ingestion on the left, retrieval and LLM on the right):
(RAG data pipeline description and processing steps - Azure Databricks | Microsoft Learn) Overview of a RAG data pipeline, which we extend to handle tabular data. The parse step will extract tables (rows/columns) as discrete chunks that are then embedded and indexed (RAG data pipeline description and processing steps - Azure Databricks | Microsoft Learn).
1. Importing tabular files: First, gather the structured data from sources such as Excel (.xlsx), CSV, or HTML tables. Depending on the source, you may use different tools: for example, use a CSV parser or Excel library in C# to read the file, or use Azure AI services (like Form Recognizer) if the table is embedded in PDFs. The goal is to extract each row along with its column headers. For an Excel/CSV file, you can read all rows into a data structure (e.g. a list of dictionaries where each key is a column name).
Example (C# - reading a CSV):
using var reader = new StreamReader("CustomerData.csv");
string headerLine = reader.ReadLine();
string[] headers = headerLine.Split(','); // e.g. ["Customer","State","Year","Revenue"]
List<Dictionary<string,string>> rows = new List<Dictionary<string,string>>();
while (!reader.EndOfStream)
{
string line = reader.ReadLine();
if (string.IsNullOrWhiteSpace(line)) continue;
string[] values = line.Split(',');
var rowDict = new Dictionary<string,string>();
for (int i = 0; i < headers.Length; i++)
rowDict[headers[i]] = values[i];
rows.Add(rowDict);
}This snippet reads a CSV and stores each row with its column values. For Excel files, you can use libraries like NPOI or EPPlus similarly to iterate through rows and columns. Make sure to include the header names – these will serve as field identifiers to preserve structure.
2. Structuring each row for embedding: Each row needs to be converted into a self-contained text representation that captures its structure. A recommended format is “ColumnName: Value; ColumnName: Value; ...” for all columns in the row (Creating embeddings of tabular data - API - OpenAI Developer Community) (Three Paths to Table Understanding with LLMs | by Katsiaryna Ruksha | Medium). This ensures that when the row is later retrieved, an LLM can easily see which value corresponds to which column. For example, if our table has columns Customer, State, Year, Revenue, a row might be serialized as:
"Customer: Alice; State: Texas; Year: 2023; Revenue: 10000"
By listing each attribute-value pair with the column name as context, we maintain the schema in text form. This approach is known as using attribute-value pairs to represent a table row (Three Paths to Table Understanding with LLMs | by Katsiaryna Ruksha | Medium). It’s also helpful to include any unique identifier if available (e.g., a Customer ID or row number) in the text or as metadata, especially if you have multiple rows with similar values (Creating embeddings of tabular data - API - OpenAI Developer Community).
If the table is very large, consider whether you need to index every row individually or if you can filter relevant subsets. In many cases, indexing each row works, but extremely large tables might require additional strategies (like splitting by sections or aggregating some data) to stay within index limits.
1. Creating a dedicated index (or index section) for tables: It’s often practical to use a separate vector index in Azure Cognitive Search for structured data, distinct from the index used for unstructured text. This isolation ensures you can target table data specifically during queries (the Kernel Memory pattern similarly keeps data separated by collection/index (Index | Kernel Memory)). For example, you might create an index named “customer-table-index” for the table above. Each indexed document will correspond to one table row.
2. Defining the index schema: The index schema should reflect the structure of your table. At minimum, include:
- A key field (e.g.,
id) – unique identifier for the row. - A content field (e.g.,
content) – the combined text representation of the row (as constructed in the previous step). - A vector embedding field (e.g.,
contentVector) – to store the embedding of the row for similarity search. - Optional: individual fields for each column (e.g.,
Customer,State,Year,Revenue). Mark these as filterable (and facetable, if useful) so you can perform structured queries or filtering on them (more on this below). Storing separate fields also allows returning structured data easily.
For example, using the Azure Cognitive Search .NET SDK, we can define an index as:
SearchIndex index = new SearchIndex("customer-table-index")
{
Fields =
{
new SimpleField("id", SearchFieldDataType.String) { IsKey = true },
new SearchableField("content") { IsFilterable = false, IsSortable = false },
new SearchField("contentVector", SearchFieldDataType.Collection(SearchFieldDataType.Single))
{
IsSearchable = true, // vector fields must be searchable to use vector search
VectorSearchDimensions = 1536, // dimension of embedding vector (e.g., 1536 for Ada)
VectorSearchAlgorithmConfiguration = "myHnswConfig"
},
// Separate fields for structured filtering (optional):
new SimpleField("Customer", SearchFieldDataType.String) { IsFilterable = true, IsFacetable = true },
new SimpleField("State", SearchFieldDataType.String) { IsFilterable = true, IsFacetable = true },
new SimpleField("Year", SearchFieldDataType.Int32) { IsFilterable = true, IsFacetable = true, IsSortable = true },
new SimpleField("Revenue", SearchFieldDataType.Int32) { IsFilterable = true, IsSortable = true }
},
VectorSearch = new VectorSearch(
new HnswVectorSearchAlgorithmConfiguration(name: "myHnswConfig", capacity: 100000)
)
};
indexClient.CreateOrUpdateIndex(index);In this schema, each row’s textual content is in the content field (which could be used for keyword search or viewed), and its embedding is in contentVector. We also include the original columns as separate fields for filtering. (Ensure your service is on an API version that supports vector search and that you’ve configured a vector algorithm like HNSW as shown.)
3. Generating embeddings for each row: Before uploading the row documents, generate an embedding vector for the content text of each row. Using Azure OpenAI embeddings is straightforward: for example, with the Azure.OpenAI client library in C# you can call the embedding model (such as text-embedding-ada-002 deployed in your Azure OpenAI resource) to get a 1536-dimension vector.
Example (C# - generating and adding embeddings):
var openAIClient = new Azure.AI.OpenAI.OpenAIClient(
new Uri(AzureOpenAIEndpoint), new AzureKeyCredential(OpenAIKey));
var batch = new List<SearchDocument>();
foreach(var row in rows) // rows from earlier step
{
string content = string.Join("; ", row.Select(kv => $"{kv.Key}: {kv.Value}"));
// Call Azure OpenAI Embeddings
EmbeddingsOptions embedOptions = new EmbeddingsOptions(content);
Response<Embeddings> resp = await openAIClient.GetEmbeddingsAsync(modelDeploymentName, embedOptions);
float[] embeddingVector = resp.Value.Data[0].Embedding.ToArray();
var doc = new SearchDocument
{
["id"] = Guid.NewGuid().ToString(),
["content"] = content,
["contentVector"] = embeddingVector,
["Customer"] = row["Customer"],
["State"] = row["State"],
["Year"] = int.Parse(row["Year"]),
["Revenue"] = int.Parse(row["Revenue"])
};
batch.Add(doc);
}
await searchClient.UploadDocumentsAsync(batch);This code loops through each row, creates the content string with the "Column: Value" format, gets its embedding from Azure OpenAI, and then adds the document to the search index (using the UploadDocumentsAsync API). If the volume of data is large, consider batching or using an indexer with integrated vectorization. Azure Cognitive Search now supports built-in vectorizers (e.g., calling Azure OpenAI embedding model directly in an indexer pipeline), which can simplify the ingestion – but the end result is the same, with each row stored along with its vector (python - Azure Cognitive Vector search query and index creation - Stack Overflow).
4. Indexing table data alongside text: If you also have an existing index for unstructured text (documents, PDFs, etc.), you will now have two indexes: one for text content and one for tabular content. Keep track of index names and their schema. It’s a good practice to add a metadata field indicating the source (e.g., ContentType = "TableRow" or "DocumentText") if you combine them, but typically keeping them separate is simpler. The next step is ensuring queries can retrieve from the appropriate index.
The way we stored the table rows (as semi-structured text with consistent key-value format) ensures that when those rows are retrieved, they retain their structure. Instead of getting a plain paragraph of text, the LLM will see something like “Customer: Alice; State: Texas; Year: 2023; Revenue: 10000”. This makes it clear which number is the revenue and what it pertains to. In some cases, you might even choose to return the data in a more structured format to the LLM, such as a small Markdown table or a JSON snippet. For example, you could store a Markdown version of each row in a separate field (or reconstruct it at query time) like:
| Customer | State | Year | Revenue |
|----------|-------|------|---------|
| Alice | Texas | 2023 | 10000 |
However, returning it exactly as stored (key-value text) is usually sufficient for the LLM to parse. The important part is that the column headers (field names) are included in each chunk of data the model sees, providing context for understanding. This approach was proven to help LLMs answer questions on tabular data (Creating embeddings of tabular data - API - OpenAI Developer Community). It effectively treats each table row as a self-contained passage of text that the embedding model and LLM can interpret.
By contrast, if we had only stored raw CSV lines without headers, the model might mix up what each position means. Always include either the headers or some identifier in the content. In summary, each row is a semantically searchable unit thanks to this representation.
With the data indexed, the query workflow for structured data is as follows:
1. Vector similarity search for semantic matches: When a user’s question comes in, you’ll use the same embedding model to encode the query into a vector. Then, run a vector search on the table index to retrieve relevant rows. For example, if the question is “What is the total revenue in 2023?”, the query embedding will hopefully be closer to rows from 2023. The vector search may return the top k most relevant rows, say the rows for all customers in 2023 with their revenues. If the question is “List all customers in Texas”, the vector search should bring up rows where State: Texas. Since we encoded state in text, those rows will likely be scored as similar to the query embedding (which contains “Texas”).
2. Using filters for precise structured queries: In addition to semantic similarity, Azure Cognitive Search allows applying filters on fields even in vector searches. For instance, we can restrict the vector search to only documents where State == "Texas" or Year == 2023. This can be done by adding a filter in the search query (and marking it as a pre-filter) (Vector query filters - Azure AI Search | Microsoft Learn) (Vector query filters - Azure AI Search | Microsoft Learn). Using filters ensures you retrieve all matching rows, not just those deemed most similar. For example, to retrieve all customers in Texas, you might do:
var vector = openAIClient.GetEmbeddingsAsync(... query ...); // embedding for query
var options = new SearchOptions
{
Filter = "State eq 'Texas'",
VectorQueries = { new VectorQuery(vector, "contentVector") { KNearestNeighbors = 50 } }
};
var results = searchClient.Search<SearchDocument>("*", options);Here we combine a filter on the State field with a vector similarity on contentVector. The filter guarantees only Texas rows are considered, and the vector ensures they’re ranked by relevance to the query. In this particular case (“list all customers in Texas”), pure filtering might suffice without even needing an embedding similarity rank. Best practice: If the user query clearly corresponds to a structured filter (e.g. contains a specific value like a state or year), you can leverage the index’s filtering to get a comprehensive result set, rather than relying purely on vector semantics. Filters apply to fields marked filterable and can be used in vector searches as inclusion/exclusion criteria (Vector query filters - Azure AI Search | Microsoft Learn).
3. Returning structured “chunks” to the LLM: After retrieval, the pipeline will pass the found rows (as text snippets) to the LLM as reference context. Because each snippet still looks like structured data, the LLM can be instructed to utilize them accordingly. For example, if three rows for Year 2023 are returned, the assistant prompt might include those three lines of "Customer: X; Year: 2023; Revenue: Y" etc., and then ask the model to compute the total or list the customers. Since the question is asking for a numeric aggregation or a list, how we prompt the LLM is important (discussed next). But crucially, we do not have to manually format these rows into a table – the LLM can understand the key-value format. If needed, you can always post-process the retrieved rows into a nicer format (like combining them into a single markdown table in the prompt) for clarity.
4. Example query flows:
-
Query: “What is the total revenue in 2023?” – The system embeds this query and performs vector search (perhaps filtered to Year=2023). The retrieved rows might be all rows where Year 2023 (with each customer’s revenue). These are provided to the GPT model. The prompt could say: “Based on the following data, calculate the total revenue for 2023.” The model will then sum the
Revenuevalues for the year 2023. This is an aggregation question, which LLMs can handle on small sets of data. However, caution is needed: if there are many rows, the model might miss some or make an arithmetic mistake. In a basic implementation, you rely on the GPT model to sum them up correctly. -
Query: “List all customers in Texas.” – The system might use a filter or rely on the embedding to retrieve all rows with
State: Texas. These rows (each with a customer name and other info) are fed to the model. The prompt could be: “From the data below, list all unique customers who are in State = Texas.” The model should then enumerate, for example, “Alice, Bob, Charlie” if those were the customers in Texas. Since this is a listing task (essentially a filtered list), using a filter in retrieval ensures no relevant entry is missed.
5. Hybrid retrieval (text + tables): If your overall knowledge base has both documents and tables, you might run two searches – one against the text index and one against the table index – and then merge the results. For instance, a question like “What is the total revenue in 2023 according to our sales data?” might retrieve some explanatory text from documents and the raw data from the table index. Your application can then give both to the LLM (perhaps with an indicator of which is which) so it can ground its answer with combined knowledge. In practice, you could also detect query intent: if the query mentions terms like “total”, “average”, or looks like a factual lookup, you might prioritize the table index.
When enabling querying of structured data, there are two levels of sophistication you can choose for your pipeline:
The basic approach treats each table row as just another piece of content for the RAG system, using Azure OpenAI embeddings for retrieval and GPT-3.5/4 for generation and reasoning. Key characteristics of this approach:
-
Embedding generation: Use a general-purpose text embedding model (like text-embedding-ada-002) to vectorize each row’s text. We did this in the ingestion step. The model doesn’t inherently “know” it’s dealing with a table, but because we included the column names in text, the semantic meaning is mostly preserved. Azure OpenAI’s embeddings will place rows with similar content near each other in vector space, and will position queries near relevant rows.
-
Search and retrieval: Use Azure Cognitive Search’s vector search to get relevant rows. Optionally combine with keyword search or filtering for accuracy. This is essentially using semantic search over the structured data.
-
LLM reasoning: Once the relevant rows are retrieved, pass them to an Azure OpenAI GPT model (for example, GPT-4 via the ChatCompletion API) as context and ask the question. The model will read the structured rows and generate an answer. If the question asks for a summary or aggregation, the GPT model will attempt to compute or summarize from the data. For instance, GPT can add up a handful of numbers or filter items from the provided list. It can also format the answer appropriately (a natural language sentence, or even recreate a small table in the answer if instructed).
-
Example: Using the earlier example, after retrieving rows for 2023, you might prompt GPT-4 with: “The following are records of revenue by customer for the year 2023:\nCustomer: Alice; Year: 2023; Revenue: 10000\nCustomer: Bob; Year: 2023; Revenue: 5000\n... \nWhat is the total revenue in 2023?”. A GPT model is usually capable of reading these values and responding: “The total revenue in 2023 is 15,000.” (Hopefully summing correctly).
-
Summarization: Likewise, the GPT model can summarize parts of the table if asked (e.g., “summarize the revenue by state”), or extract specific information (“who is the top customer by revenue?”) from the retrieved data. The summarization ability comes essentially for free as part of the generation step, since GPT can be instructed to produce a concise summary or analysis of the retrieved rows.
The basic approach is easier to implement and leverages the power of GPT for reasoning. However, it has some limitations:
- The embedding model might not capture numeric relationships perfectly (it treats the row text like any sentence).
- The GPT model might make mistakes in calculations or miss an entry if a lot of data is given (context length and inherent stochastic nature).
- It doesn’t inherently understand schema beyond what’s provided in text (which is why we provide headers in text).
Despite these, this approach works well for many scenarios and is relatively straightforward: it uses your existing Azure OpenAI and Cognitive Search resources without requiring new model types. Many implementations of RAG stick to this approach initially.
For more accurate and complex querying of structured data, you can incorporate table-aware techniques:
-
Table-specific semantic encoders: These are models specifically trained to handle tables. For example, TAPAS (Table Parser) is a BERT-based model designed for question-answering over tabular data (TAPAS). TAPAS encodes the structure of a table (rows and columns) directly into the embedding, allowing it to understand queries that involve aggregations or comparisons by selecting table cells and even performing operations like summing or counting (TAPAS). Other models in this family include TaBERT, T5-based table QA models, etc. (Three Paths to Table Understanding with LLMs | by Katsiaryna Ruksha | Medium). In an Azure context, you might not have TAPAS as a managed service, but you could use a Hugging Face endpoint or Azure ML to host such a model. An enhanced pipeline could route tabular questions to this model: it would take the whole table (or relevant chunk of it) plus the question, and directly output an answer. This bypasses the need for manually summing via GPT and can improve accuracy on questions like totals or counts since the model is explicitly trained for that. However, integrating such a model is more complex and might not leverage Azure Cognitive Search’s vector index; it’s a separate path where the model itself handles retrieval internally (often you still need to select which table or which portion of a table to feed it).
-
Advanced prompt engineering: Without changing models, you can engineer the interaction with GPT to better handle structured data. Some strategies:
- Explicit schema prompting: Provide the LLM with a brief schema description before the data. For example: “You have a table with columns [Customer, State, Year, Revenue]. I will give you some rows. Answer the question using these.” This reminds the model of the structure so it’s less likely to confuse Year and Revenue, etc.
- Use of tools or functions: With the latest GPT models supporting function calling, you could equip the system with a “calculator” function. The model, upon seeing a question like “total revenue”, could be guided to output a function call (e.g.,
CalculateSum(column="Revenue", filter="Year=2023")), which your code handles by actually computing the sum from the data, and then the model presents the result. This ensures 100% accuracy for arithmetic. It requires more coding, but is a powerful approach for numeric heavy queries. - Chunking large tables by columns or by meaningful sections: If a table has many columns, you might not include all in the embedding text to avoid diluting relevance. For example, if most questions are about revenue by region, you might split or index a “Revenue by region” subset of the data differently. This is a form of customizing the content that goes into the embedding.
- Quality checks in prompt: Ask the model to verify its answer against the provided data explicitly, or even do a step-by-step reasoning: “First, extract all the Revenue values from the data for 2023, then add them.” This is similar to how you might prompt Chain-of-Thought for math problems, and can reduce errors.
-
Maintaining schema in responses: You might want the LLM to output answers in a structured format occasionally (for instance, returning a small table as the answer). Through prompt engineering, you can achieve this by providing an example or instruction like: “Answer in a JSON format with keys as column names if applicable” or “Present the list of customers as a bullet list.” This doesn’t improve retrieval per se, but it leverages the fact we have structured data to deliver structured answers.
-
Combining structured and unstructured context: An advanced pipeline might try to correlate text and tables. For example, if a document paragraph says “Total 2023 revenue was $15K (an increase from 2022). See Table 4.”, and the table has the detailed numbers, an intelligent agent could retrieve both and use the text to double-check the table data (or vice versa). Achieving this level of cross-referencing might involve establishing links (perhaps via a common key or identifier) between text content and table content during ingestion (like tagging a table row with a document ID if it came from that document). Then the agent or application can use those tags to pull in related info. This is an advanced scenario and tool like LlamaIndex or others can help build such relationships.
When to use advanced approaches? If you find the basic approach is struggling with certain queries – especially those requiring precise computation or understanding of many rows at once – it might be time to incorporate an advanced method. Research has shown that direct use of LLMs on tabular data works well for simple lookups, but for more complex analytical questions (aggregations, conditional logic) a structured query approach or specialized model yields better results (Three Paths to Table Understanding with LLMs | by Katsiaryna Ruksha | Medium). For instance, instead of asking GPT-4 to compute an average over 100 rows (which it might approximate or err on), a text-to-SQL approach could be used: the LLM translates the natural language question into a SQL query that runs on the dataset (which could be stored in a database or in-memory) and then the result is given. Azure doesn’t have a built-in text-to-SQL on Cognitive Search, but you could export the table to an Azure SQL or use an in-memory DataTable in C# and have the LLM formulate a query.
In summary, the enhanced approach means going beyond treating the table as text. It involves either smarter models that know it’s a table (encoders like TAPAS that embed the table with structure (TAPAS)) or smarter prompts that instruct the LLM how to use the table data (like performing calculations stepwise or calling functions). These can significantly improve accuracy for questions like “What’s the average, maximum, or total…”, or multi-step queries like “Which state had the highest revenue in 2023 and what was the amount?”.
Designing how the LLM agent uses the retrieved tabular data is critical:
-
Let the LLM do analysis vs. pre-compute: If the question requires simple analysis on a small set of retrieved rows, letting the LLM handle it in the answer is convenient. For example, summing a handful of numbers or filtering a few records is well within GPT-4’s capabilities. You can prompt it clearly to do so and even ask it to show the steps if necessary. However, if the task is more complex or the stakes of accuracy are high (financial figures, etc.), you should compute the answer outside the LLM and provide it, or at least verify the LLM’s answer. As a rule of thumb, use the LLM for qualitative insights or small-scale calculations, but use deterministic methods for large-scale or critical calculations (Three Paths to Table Understanding with LLMs | by Katsiaryna Ruksha | Medium). In our revenue example, if “total revenue 2023” needs absolute accuracy and the table had 1000 rows, you’d be safer summing those 1000 values with code and then just asking the LLM to format the result in a nice sentence.
-
Aggregations ahead of time: You can enhance your index by storing some pre-aggregated values. For instance, along with individual transactions or customer rows, store a document that is “Year 2023 – Total Revenue: X” and perhaps embed that as well. Then a query asking for total revenue might directly retrieve this “summary” document. This is a form of caching frequent queries. It’s not always feasible (you must anticipate what summaries are needed), but for known metrics it can be very effective.
-
Use of Cognitive Search facets or aggregations: Azure Cognitive Search itself can do some aggregations (counts, facets) on structured fields, but not sums. If you just need to count how many records match a filter, you could use
$count=trueor facet on that field. But for summing revenue, you’d need an external computation. Consider using an Azure Function or minimal API that the system can call when certain queries are recognized (this is similar to the function-calling approach but implemented server-side). -
Prompt structure for the agent: Clearly delimit the data you provide to the LLM and separate it from the question. You might use a system or user message like: “Here is relevant data in tabular form:\n\nPlease answer the query based on this data.” This helps the model focus on using the data to ground its answer. Also, remind the model to not hallucinate extra rows – it should stick to the provided info. If it’s supposed to list items and there are only 3 retrieved rows, it shouldn’t invent a 4th.
-
Validation: After the LLM gives an answer, you could implement a check for certain query types. For example, if the question was asking for a numeric value and the LLM responded with a number, you could quickly verify it by recomputing from the data. If there’s a discrepancy, you know the LLM made an error and you can either correct it or ask the LLM to reconsider (perhaps by showing the calculation steps).
-
User feedback and iteration: Sometimes the user might ask follow-up questions like “How did you get that number?”. If you have structured data, you can respond by showing the relevant rows (which you already have) and explaining the calculation. This builds trust that the answer is grounded in actual data. Designing the agent to handle such clarifications is easier when the data is structured (since you can quickly pinpoint which rows contributed to the answer).
Finally, let’s integrate these ideas into the original Kernel Memory indexer architecture. The pipeline now has an additional branch for tables:
-
Ingestion Phase: Alongside text chunking, we have a tabular ingestion process. This process reads tabular files, transforms rows to text (with column tags), and indexes them into a new Cognitive Search index (with its own vector store). This could be done in parallel with text ingestion. Ensure both indexes use the same embedding model (or compatible vector space) if you plan to embed the query once and search both – typically you will, using Ada for both text and tables.
-
Query Phase: When a query arrives, the system can perform retrieval on both indexes:
- Use the user’s query embedding to query the text index (for any relevant passages) – as originally done.
- Use the same query embedding to query the table index – retrieving any matching rows.
- Combine the top results from both (maybe preferring higher score ones regardless of source, or you might always include at least some table results if the query seems to imply a data question).
Another design is to classify the query first: perhaps using a prompt or a simple heuristic to decide if this question is about data (table) or about general knowledge (text) or both. For example, questions that mention numbers, years, or specific entities might be routed more to tables. You could implement a lightweight classifier (even a few keywords like “total”, “list all”, “number of” might hint at tabular). In a Kernel Memory or Semantic Kernel scenario, you might have two separate semantic memory stores and choose which one to query based on context.
-
LLM Response Phase: The content retrieved from both indexes is passed into the prompt. You should maintain the distinction if needed – e.g., you might prefix table data with a note like “(This is data from our spreadsheet)”. The LLM (as the “agent”) then has all relevant info to answer. It will incorporate both narrative text and factual rows as needed. Ensure your prompt encourages using the provided data and citing it if your application returns citations.
-
Example integration: Suppose the user asks, “In 2023, did we have more revenue from Texas or California, and by how much?”. Your table index retrieval might fetch rows: “State: Texas; Year: 2023; Revenue: 15000” and “State: California; Year: 2023; Revenue: 12000”. Your text index might fetch a sentence from a report: “...in 2023, sales in Texas surpassed California by around $3k...”. The LLM sees both and can confidently answer that Texas had $15k vs California $12k, so Texas led by $3k – backing it up with both the raw data and the explanatory text. This is a powerful demonstration of combining unstructured and structured sources in one answer.
By following these steps and best practices, the existing guide’s solution is now expanded to handle structured data effectively. You can import tables, index them with Azure Cognitive Search as a separate vector store, and retrieve structured row-wise chunks that the Azure OpenAI service can use to answer user queries accurately. Whether using the basic embedding approach or introducing advanced table-aware models, this addition empowers the RAG pipeline to field analytical questions and data lookups that previously required manual effort or SQL knowledge.
References:
- Azure OpenAI Developer Community – advice on embedding tabular data by including “Column name: value” pairs (Creating embeddings of tabular data - API - OpenAI Developer Community).
- Ruksha et al. (2024) – survey of LLMs on tabular data; recommends representing tables as attribute-value pairs or using specialized encoders like TAPAS for better schema understanding (Three Paths to Table Understanding with LLMs | by Katsiaryna Ruksha | Medium) (Three Paths to Table Understanding with LLMs | by Katsiaryna Ruksha | Medium).
- Hugging Face Transformers – documentation on TAPAS, a table QA model that extends BERT to encode table structure and handle aggregation questions (TAPAS) (TAPAS).
- Azure Cognitive Search Documentation – on using filters with vector search (e.g., filter on
StateorYearfields to narrow down vector results) (Vector query filters - Azure AI Search | Microsoft Learn). - Three Paths to Table Understanding with LLMs – notes that for complex table queries (requiring aggregation or sorting), a text-to-SQL approach or external computation is often more reliable than raw LLM reasoning (Three Paths to Table Understanding with LLMs | by Katsiaryna Ruksha | Medium).
Comprehensive Guide to Implementing a Generic Tabular Data Index for Azure OpenAI and Azure Cognitive Search
This document outlines a complete, detailed technical solution for building a flexible, generic tabular data index capable of ingesting and querying data from various sources (Excel files, databases, JSON, CSV) using Azure Cognitive Search and Azure OpenAI embeddings.
The solution follows a Retrieval-Augmented Generation (RAG) architecture, allowing semantic search and natural language queries across heterogeneous structured data sources:
-
Ingestion Pipeline
- Extract and normalize rows from Excel, CSV, JSON, and databases.
- Convert each row into a canonical text representation.
- Generate embeddings using Azure OpenAI.
- Index rows and metadata into Azure Cognitive Search.
-
Semantic Search & Retrieval
- Perform vector search using embeddings.
- Apply metadata filters for targeted queries.
-
Answer Generation
- Retrieve relevant rows from Cognitive Search.
- Generate answers using GPT models in Azure OpenAI.
- Represent each table row as key-value pairs, independent of original schemas:
"ColumnA: ValueA; ColumnB: ValueB; ColumnC: ValueC"
"Customer: ABC Corp; Date: 2024-01-01; Revenue: 10000; Region: East"
A unified, flexible schema suitable for all datasets:
- id (Edm.String): Unique document ID (
Dataset|Table|RowID) - content (Edm.String): Canonical text representation
- contentVector (Collection(Edm.Single)): 1536-dim vector embedding
- DatasetName (Edm.String): Source dataset name
- SourceType (Edm.String): Source type (
Excel,Database, etc.) - TableName (Edm.String): Name of the table or sheet
- RowID (Edm.String): Original row identifier
- IngestionTime (Edm.DateTimeOffset): Timestamp of ingestion
All metadata fields are filterable and facetable.
-
Extraction:
- Excel: Use ClosedXML or EPPlus.
- CSV: Use StreamReader or CSVHelper.
- Databases: Use ADO.NET or EF Core.
-
Normalization:
string content = string.Join("; ", rowDict.Select(kv => $"{kv.Key}: {kv.Value}"));- Embedding Generation:
var embeddings = await openAIClient.GetEmbeddingsAsync(model, new EmbeddingsOptions(content));
float[] vector = embeddings.Value.Data[0].Embedding.ToArray();- Indexing:
var doc = new
{
id = $"{dataset}|{table}|{rowId}",
content,
contentVector = vector,
DatasetName = dataset,
SourceType = sourceType,
TableName = table,
RowID = rowId,
IngestionTime = DateTimeOffset.UtcNow
};
await searchClient.IndexDocumentsAsync(IndexDocumentsBatch.Upload(new[] { doc }));- Embed the user query into a vector.
- Perform vector similarity search:
var searchOptions = new SearchOptions
{
VectorQueries = { new VectorQuery { Vector = queryVector, KNearestNeighborsCount = 5, Fields = { "contentVector" } } }
};
var results = searchClient.Search<SearchDocument>("*", searchOptions);- Include metadata filters:
searchOptions.Filter = "DatasetName eq 'Financial2023' and SourceType eq 'Excel'";Structure retrieved data clearly for GPT:
Dataset: Financial2023, Table: Sales
Row: "Region: East; Revenue: 10000; Month: January"
Row: "Region: East; Revenue: 12000; Month: February"
Question: "What is the total revenue for the East region?"
The GPT model processes structured rows clearly to produce accurate answers.
- Schema Summaries: Maintain short descriptions for each dataset to aid in query routing.
- Precomputed Aggregations: Store pre-calculated summaries or aggregations to improve efficiency for common queries.
- Advanced LLM Function Integration: Allow GPT models to call external calculation APIs for precise numeric answers.
- Batch processing and parallelization to optimize ingestion.
- Incremental updates supported via unique IDs for easy upsert operations.
- The architecture scales efficiently using Azure Cognitive Search’s built-in scalability.
- Incorporate user or role-based filtering if dataset access controls are needed.
This unified indexing solution effectively leverages Azure Cognitive Search and Azure OpenAI for flexible, semantic querying across diverse structured data. By standardizing data ingestion, embedding-based retrieval, and LLM-based answer generation, this solution provides an extensible, maintainable, and powerful data interaction framework.
Here's the fully detailed and integrated technical guide, consolidating all key insights from our conversation into a complete solution for creating a generic Azure OpenAI-powered tabular index.
This comprehensive solution integrates all elements discussed, offering a robust foundation for your generic tabular indexing needs. Let me know if there's any additional detail or enhancement you'd like to explore!