4.2 Configure Structural Element-Based Chunking¶

Structural Element-Based Chunking is a critical step in preparing documents for retrieval-augmented generation (RAG) and other AI-driven workflows. It involves breaking down large documents into smaller, semantically meaningful chunks that enable better indexing, searching, and retrieval. Using Azure Document Intelligence, Structural Element-Based Chunking can be configured to optimize data flow for downstream processes.

What Are Chunks in Document Intelligence?¶

In Azure AI Document Intelligence, "chunks" refer to structured segments of a document that are intelligently divided to improve data extraction, retrieval, and validation. This process, known as Structural Element-Based Chunking, ensures that information is logically grouped based on its meaning and context, rather than arbitrary divisions like page numbers.

How Chunks Work¶

Semantic Understanding: Instead of treating a document as a single block of text, AI models recognize key sections such as headings, paragraphs, tables, and lists.
Context Preservation: By maintaining the relationship between different parts of the document, chunking helps AI models interpret financial statements, SOWs, and invoices more accurately.
Efficient Retrieval: Retrieval-Augmented Generation (RAG) models can retrieve relevant chunks dynamically, improving AI-driven document validation, compliance checks, and financial data reconciliation.
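
The grouping idea above can be sketched in a few lines of Python. This is a minimal illustration, not the workshop's implementation: it assumes each document element arrives as a (role, text) pair, where the role values "title" and "sectionHeading" mirror the paragraph roles Document Intelligence reports, and the Chunk class and group_by_heading function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """One heading plus the body text that follows it (hypothetical type)."""
    heading: str
    parts: list = field(default_factory=list)

    @property
    def content(self) -> str:
        return " ".join(self.parts)


def group_by_heading(elements):
    """Group (role, text) pairs into chunks, starting a new chunk at each heading."""
    chunks = []
    for role, text in elements:
        if role in ("title", "sectionHeading"):
            chunks.append(Chunk(heading=text))
        elif chunks:
            chunks[-1].parts.append(text)
        # Text that appears before the first heading is skipped in this sketch.
    return chunks


elements = [
    ("sectionHeading", "Payment Terms"),
    (None, "Invoices are due within 30 days."),
    (None, "Late payments incur a 2% fee."),
    ("sectionHeading", "Milestones"),
    (None, "Milestone 1: design sign-off."),
]
chunks = group_by_heading(elements)
```

Each resulting chunk keeps its heading next to its body text, which is what lets a retrieval step return a whole section rather than an arbitrary slice of a page.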

Why Chunks Matter in AI-Powered Data Validation¶

Higher Accuracy: Structured document segmentation improves AI’s ability to extract and validate key financial data.
Better Contextual Retrieval: When comparing invoices to Statements of Work (SOWs), AI can retrieve specific clauses related to payment terms.
Optimized Query Performance: AI-powered searches return more precise and context-aware results.

Defining the Chunking Strategy in Document Intelligence¶

When working with Statements of Work (SOWs) and Invoices, defining an effective chunking strategy is essential to ensure accurate data extraction, validation, and cross-referencing. The way chunks are structured depends on the document type, as SOWs and Invoices have different formats, purposes, and data requirements.

Chunking Strategy for Statements of Work (SOWs)¶

SOWs are long, structured documents that outline contractual obligations, payment milestones, deliverables, and compliance clauses. The chunking strategy for SOWs must focus on preserving contextual integrity while making it easy to retrieve relevant sections.

Key Chunking Considerations for SOWs¶

Logical Sectioning:
1. Chunk based on headings, subheadings, and numbered clauses (e.g., "Scope of Work", "Payment Terms", "Milestones").
2. Ensures that AI can retrieve relevant contract details when cross-referencing invoices.
Paragraph-Based Chunking:
1. Break sections into semantically meaningful paragraphs for precise retrieval.
2. Helps ensure RAG-based AI models understand contract obligations.
Compliance Clause Recognition:
1. Identify and separately chunk regulatory or compliance language to facilitate automated legal checks.
Deliverable-Based Segmentation:
1. Define chunks based on payment deliverables, linking them to expected invoice amounts.
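
As a rough sketch of logical sectioning, the snippet below splits plain SOW text at numbered clause headings. It assumes clause headings sit on their own line and look like "1. Scope of Work" or "2.1 Milestones"; the regular expression and the split_into_clauses function are illustrative assumptions, not part of the workshop code.

```python
import re

# Assumed heading shape: "1. Scope of Work" or "2.1 Milestones".
# A dot is required after the first number so body lines such as
# "30 days after..." are not mistaken for headings. A body line that starts
# with a decimal number (for example "1.5 million") would still match.
CLAUSE_RE = re.compile(r"^(\d+\.(?:\d+\.?)*)\s+(.+)$")


def split_into_clauses(text):
    """Return a list of dicts with clause number, heading, and body text."""
    clauses = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = CLAUSE_RE.match(line)
        if match:
            clauses.append({
                "number": match.group(1).rstrip("."),
                "heading": match.group(2),
                "body": "",
            })
        elif clauses:
            # Body text belongs to the most recent clause heading.
            clauses[-1]["body"] = (clauses[-1]["body"] + " " + line).strip()
    return clauses


sow_text = """
1. Scope of Work
The vendor will deliver the data platform.
2. Payment Terms
Payment is due 30 days after each milestone.
2.1 Milestones
Milestone 1 is design sign-off.
"""
clauses = split_into_clauses(sow_text)
```

Keeping the clause number with each chunk makes it possible to cite the exact clause when an invoice is checked against the SOW.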

Chunking Strategy for Invoices¶

Invoices are structured, tabular documents designed to list billable items, amounts, and due dates. Unlike SOWs, invoice chunking must focus on tabular data extraction, line-item accuracy, and direct matching with SOWs.

Key Chunking Considerations for Invoices¶

Line-Item Chunking:
1. Each line item (e.g., service, quantity, amount, tax, etc.) is treated as a separate chunk for precise AI validation.
Table-Aware Chunking:
1. Preserve tabular structures to ensure AI correctly extracts and processes invoice totals, taxes, and discounts.
Structural Element-Based Chunking for Descriptions:
1. Invoice descriptions may reference project deliverables—chunking ensures AI can cross-validate them against SOW deliverables.
Vendor & Payment Terms Segmentation:
1. Vendor details, invoice dates, and due dates must be separately chunked to ensure accurate payment milestone validation.
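
Line-item chunking can be illustrated with a small sketch that renders each line item as one self-describing text chunk. The dictionary keys (description, quantity, amount) and the line_items_to_chunks function are hypothetical; in practice the values would come from the fields the invoice model extracts.

```python
def line_items_to_chunks(invoice_number, items):
    """Render each invoice line item as one self-describing text chunk."""
    chunks = []
    for index, item in enumerate(items, start=1):
        content = (
            f"Invoice {invoice_number}, line {index}: {item['description']}; "
            f"quantity {item['quantity']}; amount {item['amount']:.2f}"
        )
        chunks.append({"line": index, "content": content})
    return chunks


items = [
    {"description": "Design workshop", "quantity": 2, "amount": 1500.0},
    {"description": "Milestone 1 delivery", "quantity": 1, "amount": 10000.0},
]
chunks = line_items_to_chunks("INV-001", items)
```

Repeating the invoice number inside every chunk keeps each line item meaningful on its own when it is retrieved without the rest of the table.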

Differences in Chunking Strategy¶

Aspect	SOW Chunking Strategy	Invoice Chunking Strategy
Document Structure	Long-form contract with sections	Structured tables with line items
Chunking Method	Section & paragraph-based chunking	Table-aware & line-item chunking
Key Chunking Focus	Milestones, deliverables, compliance clauses	Billable items, payment details
AI Retrieval Purpose	Validate contractual obligations & compliance	Match billable amounts & due dates with SOWs

Why an Optimized Chunking Strategy Matters¶

Improves AI Accuracy – Ensures that document validation is based on structured, meaningful chunks.
Enhances Cross-Document Validation – Allows AI models to match invoices to SOWs efficiently.
Supports Compliance & Auditability – Ensures that all necessary sections are properly reviewed.
Optimizes Retrieval-Augmented Generation (RAG) Queries – AI can retrieve precise contract and invoice details for analysis.

Leverage Azure Document Intelligence¶

Azure Document Intelligence provides tools and capabilities to automate the chunking process:

Use Pre-Trained Models:
Apply pre-trained models to extract structured data and identify logical sections within documents. This workshop uses two models:
1. The prebuilt-invoice model in Document Intelligence processes invoices. It is designed to extract key invoice fields, such as vendor details, invoice date, and total amount due, which reduces manual data entry and improves accuracy.
2. The prebuilt-document model in Document Intelligence processes Statements of Work. It is designed to extract text and structure from a variety of document types, such as contracts, receipts, and forms.
Apply Custom Models:
1. If the prebuilt models do not work for your domain-specific documents, you can train custom models on your own documents to identify sections unique to your use case. Follow this guide.
Configure Semantic Rules:
1. Define rules or conditions to segment documents based on semantic content.
2. Example: Break a contract into "Parties Involved," "Obligations," and "Terms and Conditions" sections.
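
One simple way to express such rules is a keyword table that maps each section label to the phrases that signal it. The sketch below is an assumption-laden illustration: the RULES table, its keywords, and the label_paragraph and segment functions are hypothetical, and real documents would likely need richer rules than substring matching.

```python
# Hypothetical rules: section label -> keywords that signal that section.
RULES = {
    "Parties Involved": ("between", "party", "parties"),
    "Obligations": ("shall", "must", "agrees to"),
    "Terms and Conditions": ("terminate", "liability", "governing law"),
}


def label_paragraph(paragraph, rules=RULES, default="Other"):
    """Return the first label whose keywords appear in the paragraph."""
    lowered = paragraph.lower()
    for label, keywords in rules.items():
        if any(keyword in lowered for keyword in keywords):
            return label
    return default


def segment(paragraphs):
    """Group paragraphs by label, preserving their original order."""
    sections = {}
    for paragraph in paragraphs:
        sections.setdefault(label_paragraph(paragraph), []).append(paragraph)
    return sections


paragraphs = [
    "This agreement is made between Contoso and Woodgrove.",
    "The vendor shall deliver monthly status reports.",
    "Either side may terminate with 30 days notice.",
    "Signed on the first of May.",
]
sections = segment(paragraphs)
```

Rules are checked in order, so a paragraph that matches several labels lands in the first one listed.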

Store Chunks in Azure Database for PostgreSQL¶

Once documents are semantically chunked, the resulting chunks are stored in Azure Database for PostgreSQL for efficient querying and retrieval:

Data Structure:

Each chunk is stored as an individual record with metadata for easy identification.

Example Schema:

SQL
CREATE TABLE sow_chunks (
    id BIGSERIAL PRIMARY KEY,
    sow_id BIGINT NOT NULL,
    heading TEXT,
    content TEXT,
    page_number INT,
    FOREIGN KEY (sow_id) REFERENCES sows (id) ON DELETE CASCADE
);

Metadata:
1. Each record can include metadata such as document ID, chunk sequence, and semantic labels for efficient retrieval.
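
To make the metadata idea concrete, here is a minimal sketch of a chunk record that carries a document ID, a sequence number, and a semantic label. Only sow_id, heading, content, and page_number correspond to columns in the example schema; the sequence and label fields, the ChunkRecord class, and the build_records function are hypothetical additions for illustration.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class ChunkRecord:
    sow_id: int        # document ID (matches sow_chunks.sow_id)
    sequence: int      # hypothetical: position of the chunk within the document
    label: str         # hypothetical: semantic label such as "Payments"
    heading: str
    content: str
    page_number: int


def build_records(sow_id, chunks):
    """Attach a document ID and sequence to (heading, content, page_number) tuples."""
    return [
        # This sketch simply reuses the heading as the semantic label.
        ChunkRecord(sow_id, seq, heading, heading, content, page)
        for seq, (heading, content, page) in enumerate(chunks, start=1)
    ]


records = build_records(7, [
    ("Project Scope", "Build the platform.", 1),
    ("Payments", "Net 30 days.", 2),
])
```

A sequence number lets a query return chunks in reading order, which page_number alone cannot guarantee when several chunks share a page.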

Benefits of Structural Element-Based Chunking¶

Enhanced Retrieval:
1. Enables precise querying of specific document sections, improving the accuracy of AI-driven responses.
Optimized Indexing:
1. Reduces the complexity of indexing large documents by focusing on smaller, meaningful sections.
Improved AI Performance:
1. Ensures AI models operate within token limits, avoiding truncation and enhancing output quality.
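
When a single section is still too long for a model's input limit, it can be split further. The sketch below uses whitespace-separated words as a rough stand-in for tokens, which is an assumption: a real tokenizer counts differently, so the budget should be set conservatively. The split_by_budget function is a hypothetical helper.

```python
def split_by_budget(text, max_tokens=200):
    """Split text into pieces of at most max_tokens whitespace-separated words."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_tokens])
        for start in range(0, len(words), max_tokens)
    ]


pieces = split_by_budget("one two three four five six seven", max_tokens=3)
```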

Write Chunks to PostgreSQL¶

Intelligent Data Storage in Azure Database for PostgreSQL¶

Chunked text data is stored in Azure Database for PostgreSQL, ensuring efficient storage and retrieval.

Data Organization: Documents are stored in structured formats with metadata.
Scalability: Optimized for handling large data volumes.

Generating Vector Embeddings with Azure AI Extension¶

The Azure AI extension for PostgreSQL enables advanced AI functionalities such as semantic search.

Embedding Storage: Vector columns, which use the VECTOR type provided by the pgvector extension, store the document embeddings.
Embedding Generation: Uses Azure OpenAI to generate high-dimensional vector embeddings.

Example SQL command:

SQL
ALTER TABLE invoices ADD COLUMN embeddings VECTOR(1536);
ALTER TABLE sows ADD COLUMN embeddings VECTOR(1536);
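
Once the columns exist, they can be filled in-database with the azure_openai.create_embeddings function from the azure_ai extension. The sketch below only builds the SQL statement and its parameters, so it can be passed to whichever database driver the application uses. The text column name ("summary") and the deployment name are assumptions for illustration; the deployment must produce 1536-dimensional vectors to match the column definition.

```python
def build_embedding_update(table, text_column, deployment):
    """Build an UPDATE that fills the embeddings column inside the database."""
    # Identifiers cannot be bound as query parameters, so this sketch only
    # accepts simple names to avoid building unsafe SQL.
    for name in (table, text_column):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    sql = (
        f"UPDATE {table} "
        f"SET embeddings = azure_openai.create_embeddings($1, {text_column})::vector "
        f"WHERE embeddings IS NULL;"
    )
    return sql, (deployment,)


# "summary" and the deployment name are hypothetical values.
sql, params = build_embedding_update("sows", "summary", "text-embedding-ada-002")
```

With a driver that uses $1-style placeholders, the result could be run as conn.execute(sql, *params). Restricting the update to rows where embeddings IS NULL avoids regenerating vectors that already exist.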

API Endpoint to Insert Chunks on Upload¶

When a SOW is uploaded in the application, the document workflow starts with the analysis API endpoint /sows/. This endpoint is called with an HTTP POST request that passes the vendor_id of the vendor and the document file being uploaded.

The document is passed to Document Intelligence to extract the text from the document.

src/api/app/routers/sows.py
  analysis_result = await doc_intelligence_service.extract_text_from_sow_document(document_data)

The application implements a service for calling Document Intelligence in the src/api/app/services/azure_doc_intelligence_service.py file, which contains the .extract_text_from_sow_document method. This code uses the prebuilt-document model within Document Intelligence to extract the text from the document.

You can expand the section below to see the code that calls Document Intelligence to extract the text from the document and generate the text chunks.

Call Document Intelligence to extract text from document

src/api/app/services/azure_doc_intelligence_service.py
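  # Requires at module level: from azure.ai.formrecognizer.aio import DocumentAnalysisClient
  # DocumentAnalysisResult and TextChunk are classes defined by the application.
  async def extract_text_from_sow_document(self, document_data):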
    """Extract text and structure using Azure AI Document Intelligence."""
    docClient = DocumentAnalysisClient(
        endpoint=self.docIntelligenceEndpoint,
        credential=self.credential
    )

    poller = await docClient.begin_analyze_document(
        model_id="prebuilt-document",
        document=document_data
    )

    result = await poller.result()

    analysis = DocumentAnalysisResult()
    analysis.extracted_text = []
    analysis.text_chunks = []

    known_headings = [
        "Project Scope", "Project Objectives", "Location", "Tasks", "Schedules",
        "Standards and Testing", "Payments", "Compliance", "Requirements", "Project Deliverables"
    ]

    for page in result.pages:
        page_text = " ".join([line.content for line in page.lines])
        analysis.extracted_text.append(page_text)

        current_heading = None  # Reset per page: lines before a page's first heading are skipped
        for line in page.lines:
            text = line.content
            if self.__is_heading(text, known_headings): # Detect headings
                current_heading = text
                newTextChunk = TextChunk()
                newTextChunk.heading = text
                newTextChunk.content = ""
                newTextChunk.page_number = page.page_number
                analysis.text_chunks.append(newTextChunk)
            elif current_heading:
                analysis.text_chunks[-1].content += " " + text

    await docClient.close()

    analysis.full_text = "\n".join(analysis.extracted_text)

    return analysis
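
The private __is_heading helper called in the loop is not shown in this section. As a rough idea of what such a check might look like, here is a hypothetical stand-alone sketch that compares a line against the known headings after normalizing case, surrounding whitespace, and a trailing colon. The application's actual helper may behave differently.

```python
def is_heading(text, known_headings):
    """Return True when a line matches one of the known section headings."""
    # Normalize case, surrounding whitespace, and a trailing colon before comparing.
    normalized = text.strip().rstrip(":").strip().casefold()
    return any(normalized == heading.casefold() for heading in known_headings)


known_headings = ["Project Scope", "Payments", "Compliance"]
matches = [
    is_heading("Payments:", known_headings),
    is_heading("  project scope ", known_headings),
    is_heading("Payments are due monthly", known_headings),
]
```

Exact matching keeps ordinary sentences that merely mention a heading word from starting a new chunk.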

The .extract_text_from_sow_document method returns an object whose .text_chunks collection holds the text chunks extracted from the document. The code then loops through these chunks and inserts each one into the database.

src/api/app/routers/sows.py
for chunk in analysis_result.text_chunks:
    await conn.execute('''
        INSERT INTO sow_chunks (sow_id, heading, content, page_number) VALUES ($1, $2, $3, $4);
    ''', sow.id, chunk.heading, chunk.content, chunk.page_number)

The text chunks loaded into the database are now available for the Copilot to query later, implementing retrieval-augmented generation (RAG) based on the text content of the documents.
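
As a sketch of how retrieved chunks might feed a RAG prompt, the function below joins chunk rows into a single context string with a size cap. It assumes rows expose the sow_chunks columns by key (plain dictionaries are used here); the build_context function and the character budget are hypothetical and are not taken from the Copilot's implementation.

```python
def build_context(rows, max_chars=2000):
    """Join chunk rows into a prompt context, stopping before max_chars is exceeded."""
    parts = []
    used = 0
    for row in rows:
        block = f"[{row['heading']} (page {row['page_number']})]\n{row['content'].strip()}"
        if used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)


rows = [
    {"heading": "Payments", "content": " Net 30 days.", "page_number": 2},
    {"heading": "Compliance", "content": " Vendor follows ISO 27001.", "page_number": 3},
]
context = build_context(rows)
```

Labeling each block with its heading and page number lets the model's answer point back to the section it relied on.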

Additional Learning References¶

CONGRATULATIONS. You just learned the key concepts of configuring Structural Element-Based Chunking.