Skip to content

4.1 Extract Data with Document Intelligence

Azure Document Intelligence enables automated data extraction and processing from documents using AI. This section provides details on the AI-enhanced data ingestion and processing pipeline.

Using your own data?

If you are using your own data, you must copy the structure and format of ../scripts/sql/deploy-database-tables.sql and incorporate INSERT statements for the vendors, sows, sow_chunks, milestones, deliverables, sow_validation_results, invoices, invoice_line_items, and invoice_validation_results tables.

Additionally, if your database has custom tables and table structures, you will need to map your data model to the existing schema or modify the scripts to reflect your own entity relationships. Ensure that vector columns and indexing strategies align with your AI and retrieval needs, and update queries, joins, and constraints accordingly to maintain data consistency and integrity.

Overview of Document Upload and Ingestion Workflow

  1. The process begins with financial documents, such as invoices and statements of work (SOWs), uploaded through the application UI.

  2. The setup of infrastructure seeds the database with sample data for vendors, statement of works and invoices.

  3. Sample documents to load into Document Intelligence have been provided within the repository and are located at ../data/sample_docs.

    1. You can load invoice documents from the Exercise_1_Load_Invoices folder to complete the invoices for the Trey Research vendor.
    2. You can then load a Statement of Work and 3 invoices from the Exercise_2_Load_SOW_Invoices folder for the Lucerne Publishing vendor.
    3. To load in bad data - there are documents loacted in Exercise_3_Load_Bad_SOWandInv folder.

    Generating your own documents to load

    If you want to generate your own documents, a document generator is located in ../data/data_generator/src. Follow the instructions in the README document. You need to have installed python and installed the necessary dependencies listed in requirements.txt to run the scripts.

  4. When new documents are uploaded they are saved to Azure Blob Storage at the start of the data ingestion workflow.

  5. The data ingestion workflow handles data extraction and processing by sending uploaded documents to Azure AI Document Intelligence service.

  6. The solution uses pre-built AI models within Document Intelligence which analyze data fields, such as project deliverables, due dates, billable amounts, and vendor details. These models are designed for specific document types, such as invoices, SOWs, contracts, receipts, and identity documents, and leverage AI-powered optical character recognition (OCR) and natural language processing (NLP) to extract relevant data.

  7. Once the document is analyzed, the model:

    1. Extracts key fields and assigns confidence scores (e.g., "Invoice Total: $1,250.00 (Confidence: 0.98)").
    2. Processes tables and line items using structured JSON or key-value pairs.
    3. Recognizes relationships between extracted entities (e.g., linking an invoice number to its corresponding due date).
  8. Document Intelligence's Structural Element-Based Chunking capability recognizes document structures, capturing headings and chunking the content body based on semantic coherence, such as paragraphs and sentences. This ensures that the chunks are of higher quality for use in Retrieval-Augmented Generation (RAG) pattern queries.

  9. Once extracted, the document data is securely stored in Azure Database for PostgreSQL flexible server.

  10. As part of the database insert statement, the GenAI capabilities of the azure_ai extension are leveraged to:

    • Generate and save vector embeddings of document text using Azure OpenAI, enabling efficient semantic search and retrieval.
    • Create abstractive summaries of Statements of Work (SOWs) using the Azure AI Language service, distilling key insights from lengthy documents into concise, actionable summaries.
  11. If you have your own documents then you can use the application UI to load them into blob storage which will trigger the ingestion pipeline into AI services.

AI-Enhanced Data Ingestion and Validation

Azure Document Intelligence extracts structured and unstructured data while applying AI-driven validation to ensure accuracy.

Automated Text Extraction: Identifies key fields such as transaction amounts and customer details. AI-Driven Data Validation: Cross-references extracted data with predefined rules to ensure compliance. Structural Element-Based Chunking: Segments large documents into meaningful parts for efficient processing.

Benefits of the Enhanced Pipeline

  • High-Quality Data: AI-driven validation ensures accuracy.
  • Scalability: Handles large document volumes efficiently.
  • Semantic Understanding: Enables AI-powered search and analytics.
  • End-to-End Automation: Streamlines document processing workflows.

By leveraging Azure Document Intelligence, unstructured documents can be transformed into actionable insights efficiently.

Additional Learning References