A document intelligence solution accelerator built on Azure AI. Extracts structured answers from document collections using AI agents and proves those answers are grounded in actual source material.
This template is built to showcase Azure AI services. We strongly advise against using this code in production without implementing additional security features. See productionizing guide.
| Challenge | Prism's Solution |
|---|---|
| Inconsistent PDF extraction | Azure Document Intelligence extracts PDFs with native markdown output, HTML tables, and layout detection. No local dependencies. |
| Poor table extraction | Document Intelligence preserves table structure with merged cells/rowspan/colspan. openpyxl extracts Excel with formulas and formatting. |
| Lost document structure | Structure-aware chunking respects markdown hierarchy (##, ###). Extracts section titles as metadata. |
| Hallucinated answers | Agentic retrieval with strict grounding instructions. Always cites sources. Distinguishes "not found" vs "explicitly excluded." |
| Manual Q&A workflows | Define question templates per project. Run workflows against your knowledge base. Export results to CSV. |
Documents are extracted with Azure services and then enhanced by Microsoft Agent Framework agents.
PDF Processing
- Azure Document Intelligence: `prebuilt-layout` model with native markdown output, HTML tables, figure tags, and selection marks
- Custom instructions: Project-specific extraction prompts via `config.json`
Excel Processing
- openpyxl: Extracts all worksheets (including hidden), formulas, merged cells
- Excel_Enhancement agent: Restructures raw data into search-optimized markdown, preserving item numbers, part codes, specifications
Email Processing
- python-oxmsg: MIT-licensed .msg file parsing with metadata and body extraction
- Email_Enhancement agent: Classifies email purpose and urgency, extracts requirements and action items, identifies deadlines, generates summaries
Upload → Extract → Deduplicate → Chunk → Embed → Index → Query
| Stage | What It Does |
|---|---|
| Extract | Azure Document Intelligence + AI agent extraction to structured markdown |
| Deduplicate | SHA256 hashing removes duplicate content |
| Chunk | Document-aware recursive chunking (1000 tokens, 200 overlap) |
| Embed | text-embedding-3-large (1024 dimensions, batch processing) |
| Index | Azure AI Search with hybrid search + semantic ranking |
| Query | Agentic retrieval with Knowledge Source + Knowledge Base |
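The Deduplicate stage above can be sketched in a few lines. This is a minimal illustration, not the project's actual implementation; the function names (`content_hash`, `deduplicate`) and the whitespace normalization are assumptions:

```python
import hashlib

def content_hash(text: str) -> str:
    """SHA-256 of normalized text, so whitespace-only differences don't defeat dedup."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(chunks: list[dict]) -> list[dict]:
    """Keep the first chunk for each content hash; drop later duplicates."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        h = content_hash(chunk["content"])
        if h not in seen:
            seen.add(h)
            unique.append(chunk)
    return unique

chunks = [
    {"id": "a", "content": "Rated voltage: 230 V"},
    {"id": "b", "content": "Rated  voltage:  230 V"},  # duplicate after normalization
    {"id": "c", "content": "Operating range: -20 to 60 C"},
]
print([c["id"] for c in deduplicate(chunks)])  # → ['a', 'c']
```

Hashing normalized content (rather than raw bytes) is one way to catch re-uploads of the same document that differ only in formatting.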
Before embedding, documents go through document-aware recursive chunking:
- PDFs split on page boundaries, Excel on sheet markers, emails on metadata/body/attachment sections
- Chunks target 1000 tokens with 200-token overlap, using tiktoken for accurate counting
- Preserves markdown header hierarchy (H1-H4) as metadata, merges small sections with neighbors
- Table-aware regex avoids breaking markdown tables mid-row
- Each chunk enriched with context prefix (document name, section hierarchy, location) to improve embedding quality
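As a rough illustration of the rules above — header-aware splitting, small-section merging, and a context prefix per chunk — here is a self-contained sketch. It approximates token counts by character length instead of using tiktoken, omits overlap and the table-aware regex, and all names are hypothetical:

```python
import re

MIN_TOKENS = 200  # sections smaller than this are merged with neighbors

def approx_tokens(text: str) -> int:
    # Rough approximation (~4 chars/token); the real pipeline uses tiktoken.
    return max(1, len(text) // 4)

def split_sections(markdown: str) -> list[tuple[str, str]]:
    """Split on H1-H4 headers, returning (section_title, body) pairs."""
    parts = re.split(r"^(#{1,4} .+)$", markdown, flags=re.MULTILINE)
    sections, title = [], ""
    for part in parts:
        if re.match(r"^#{1,4} ", part):
            title = part.lstrip("#").strip()
        elif part.strip():
            sections.append((title, part.strip()))
    return sections

def chunk_document(doc_name: str, markdown: str) -> list[dict]:
    chunks, buffer, titles = [], "", []
    for title, body in split_sections(markdown):
        titles.append(title)
        buffer = f"{buffer}\n\n{body}".strip()
        # Merge small sections with neighbors; flush once the buffer is large enough.
        if approx_tokens(buffer) >= MIN_TOKENS:
            prefix = f"[{doc_name} > {' > '.join(t for t in titles if t)}]"
            chunks.append({"content": f"{prefix}\n{buffer}", "sections": titles})
            buffer, titles = "", []
    if buffer:
        prefix = f"[{doc_name} > {' > '.join(t for t in titles if t)}]"
        chunks.append({"content": f"{prefix}\n{buffer}", "sections": titles})
    return chunks

doc = "# Spec\nShort intro.\n## Voltage\n" + "Rated voltage 230 V. " * 50
chunks = chunk_document("spec.pdf", doc)
print(chunks[0]["content"][:40])
```

The context prefix (document name plus section path) travels with each chunk into the embedding step, which is what lets a chunk like "230 V" remain retrievable out of context.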
PrismRAG uses Azure AI Search Agentic Retrieval for intelligent document retrieval.
The search index uses hybrid search: HNSW vectors with cosine distance, full-text search, and semantic ranking (required for agentic retrieval). On top of the index sits a two-layer architecture:
- Knowledge Source - wraps the search index with properties for agentic retrieval
- Knowledge Base - orchestrates the multi-query pipeline, connects to the LLM
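For orientation, a hybrid index of the shape described above might be declared roughly like this through the Azure AI Search REST API. Field, algorithm, and profile names here are illustrative, not the project's actual schema; consult the service documentation for the exact versioned syntax:

```json
{
  "name": "prism-index",
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true },
    { "name": "content", "type": "Edm.String", "searchable": true },
    {
      "name": "content_vector",
      "type": "Collection(Edm.Single)",
      "searchable": true,
      "dimensions": 1024,
      "vectorSearchProfile": "vector-profile"
    }
  ],
  "vectorSearch": {
    "algorithms": [
      { "name": "hnsw-config", "kind": "hnsw", "hnswParameters": { "metric": "cosine" } }
    ],
    "profiles": [
      { "name": "vector-profile", "algorithm": "hnsw-config" }
    ]
  },
  "semantic": {
    "configurations": [
      {
        "name": "semantic-config",
        "prioritizedFields": {
          "prioritizedContentFields": [ { "fieldName": "content" } ]
        }
      }
    ]
  }
}
```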
When you submit a query with conversation history, agentic retrieval:
- Uses the LLM (gpt-4o, gpt-4.1, or gpt-5) to analyze context and break the query into focused subqueries
- Executes all subqueries in parallel against the knowledge source
- Applies semantic reranking to filter results
- Returns grounding data, source references, and execution details
Your application then uses this grounding data to generate the final answer. PrismRAG adds custom retry logic: if the original query returns nothing, it tries a simplified version (removing acronyms), then an expanded version (adding synonyms).
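The retry ladder can be sketched as follows. The acronym regex and the synonym table are stand-ins for whatever the real pipeline uses; only the try-original, then-simplified, then-expanded ordering comes from the description above:

```python
import re

ACRONYM = re.compile(r"\b[A-Z]{2,}\b")

# Hypothetical synonym table; a real pipeline would derive expansions differently.
SYNONYMS = {"voltage": "voltage OR potential", "temperature": "temperature OR thermal"}

def simplify(query: str) -> str:
    """Drop acronyms that may not appear verbatim in the corpus."""
    return ACRONYM.sub("", query).strip()

def expand(query: str) -> str:
    """Add synonyms for known terms to widen recall."""
    return " ".join(SYNONYMS.get(w.lower(), w) for w in query.split())

def retrieve_with_retry(query: str, search) -> list:
    """Try the original query, then a simplified, then an expanded variant."""
    for candidate in (query, simplify(query), expand(query)):
        results = search(candidate)
        if results:
            return results
    return []

# Example: a stub search that only matches once the acronym is removed.
def fake_search(q):
    return ["hit"] if "RMS" not in q else []

print(retrieve_with_retry("RMS voltage rating", fake_search))  # → ['hit']
```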
Define structured Q&A templates for systematic document analysis:
```json
{
  "sections": [
    {
      "name": "Technical Specifications",
      "template": "Answer based on technical documents. Provide specific values with units.",
      "questions": [
        { "question": "What is the rated voltage?", "instructions": "Check electrical specs" },
        { "question": "Operating temperature range?", "instructions": "Check environmental specs" }
      ]
    }
  ]
}
```
- Run workflows against your knowledge base
- Track completion percentage per section
- Export results to CSV
- Edit and comment on answers
- Evaluation: Assess answer quality with Azure AI Evaluation SDK (relevance, coherence, fluency, groundedness)
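Running a template like the one above and exporting to CSV can be sketched with the standard library. `run_workflow`, `to_csv`, and the answer callback are hypothetical names; in the real system the answers would come from the retrieval pipeline, not a lambda:

```python
import csv, io, json

def run_workflow(config: dict, answer_fn) -> list[dict]:
    """Ask every question in every section, building one result row per question."""
    rows = []
    for section in config["sections"]:
        for q in section["questions"]:
            prompt = f"{section['template']}\n{q['question']}\n({q['instructions']})"
            rows.append({
                "section": section["name"],
                "question": q["question"],
                "answer": answer_fn(prompt),
            })
    return rows

def to_csv(rows: list[dict]) -> str:
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["section", "question", "answer"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

config = json.loads("""{ "sections": [ { "name": "Technical Specifications",
  "template": "Answer based on technical documents.",
  "questions": [ { "question": "What is the rated voltage?",
                   "instructions": "Check electrical specs" } ] } ] }""")

rows = run_workflow(config, lambda prompt: "230 V (Datasheet, p. 4)")
print(to_csv(rows))
```

Completion percentage per section then falls out of counting rows with non-empty answers.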
See Architecture Documentation for detailed system design.
| Service | Purpose |
|---|---|
| Azure AI Foundry | GPT-4.1 (chat, evaluation), GPT-5-chat (extraction agents, workflows), text-embedding-3-large (1024 dimensions) |
| Azure AI Search Agentic Retrieval | Knowledge Source + Knowledge Base for multi-query retrieval pipeline |
| Azure AI Evaluation SDK | Answer quality scoring (relevance, coherence, fluency, groundedness) |
| Azure Blob Storage | Document and project data storage |
| Container Apps | Serverless hosting for backend/frontend |
| Framework | Purpose |
|---|---|
| Microsoft Agent Framework | Orchestrates extraction agents (Vision_Validator, Excel_Enhancement, Email_Enhancement) and workflow agents |
| Service | Purpose |
|---|---|
| Azure Document Intelligence | PDF extraction with prebuilt-layout model (markdown, tables, figures) |
| Library | License | Purpose |
|---|---|---|
| openpyxl | MIT | Excel extraction with formula support |
| python-oxmsg | MIT | Outlook .msg email parsing |
| tiktoken | MIT | Token counting for accurate chunk sizing |
| LangChain text splitters | MIT | Structure-aware recursive chunking |
| Component | Technology |
|---|---|
| Backend | FastAPI (Python 3.11) |
| Frontend | Vue 3 + Vite + TailwindCSS + Pinia |
| Infrastructure | Bicep + Azure Developer CLI |
- Azure subscription with permissions to create resources
- Azure Developer CLI
- Docker
```shell
# Clone and deploy
git clone https://github.com/Azure-Samples/Prism---Transform-Data-into-Queryable-Knowledge.git
cd Prism---Transform-Data-into-Queryable-Knowledge
azd auth login
azd up
```

What gets deployed:
- AI Foundry with GPT-4.1, text-embedding-3-large
- Azure Document Intelligence for PDF extraction
- Azure AI Search with semantic ranking enabled
- Azure Blob Storage for project data
- Container Apps with system-assigned managed identity (backend + frontend)
- RBAC role assignments (Storage Blob Data Contributor, Cognitive Services OpenAI User)
- Container Registry, Log Analytics, Application Insights
⚠️ Manual step required: `azd up` does not deploy the `gpt-5-chat` model. If you want to use it for workflows, deploy it manually through the Azure Portal → your AI Foundry resource → Model deployments. Otherwise, update `AZURE_OPENAI_WORKFLOW_DEPLOYMENT_NAME` in your `.env` to `gpt-4.1`.
Get the auth password:

```shell
az containerapp secret show --name prism-backend --resource-group <your-rg> --secret-name auth-password --query value -o tsv
```

After running `azd up`, generate a local `.env` file from your deployed Container App:

```shell
# Set your resource group
RG=<your-rg>

# Get environment variables and secrets
az containerapp show --name prism-backend --resource-group $RG \
  --query "properties.template.containers[0].env[?value!=null].{name:name, value:value}" \
  -o tsv | awk '{print $1"="$2}' > .env

# Append secrets
echo "AZURE_OPENAI_API_KEY=$(az containerapp secret show --name prism-backend --resource-group $RG --secret-name ai-services-key --query value -o tsv)" >> .env
echo "AZURE_SEARCH_ADMIN_KEY=$(az containerapp secret show --name prism-backend --resource-group $RG --secret-name search-admin-key --query value -o tsv)" >> .env
echo "AUTH_PASSWORD=$(az containerapp secret show --name prism-backend --resource-group $RG --secret-name auth-password --query value -o tsv)" >> .env
```

Then run locally:

```shell
docker-compose -f infra/docker/docker-compose.yml --env-file .env up -d
```

Access at http://localhost:3000
```
prism/
├── apps/
│   ├── api/                  # FastAPI backend
│   │   └── app/
│   │       ├── api/          # REST endpoints
│   │       └── services/     # Pipeline, workflow, storage services
│   └── web/                  # Vue 3 frontend
│       └── src/views/        # Dashboard, Query, Workflows, Results
├── scripts/
│   ├── extraction/           # Document extractors
│   │   ├── pdf_extraction_di.py          # Azure Document Intelligence
│   │   ├── excel_extraction_agents.py    # openpyxl + AI
│   │   └── email_extraction_agents.py    # python-oxmsg + AI
│   ├── rag/                  # RAG pipeline
│   │   ├── deduplicate_documents.py
│   │   ├── chunk_documents.py            # Structure-aware chunking
│   │   └── generate_embeddings.py
│   ├── search_index/         # Azure AI Search
│   │   ├── create_search_index.py
│   │   ├── create_knowledge_source.py
│   │   └── create_knowledge_agent.py
│   └── evaluation/           # Answer quality evaluation
│       └── evaluate_results.py
├── workflows/
│   └── workflow_agent.py     # Q&A workflow execution
└── infra/
    ├── bicep/                # Azure infrastructure
    └── docker/               # Local development (includes Azurite)
```
All project data is stored in Azure Blob Storage:
- Production: Azure Blob Storage with managed identity authentication (no keys required)
- Local Development: Azurite (Azure Storage emulator, included in docker-compose)
Authentication: Uses `DefaultAzureCredential` from `azure-identity`, which automatically selects the right credential:
- In Container Apps: system-assigned managed identity
- In local development: Azure CLI credentials (`az login`)
```
Container: prism-projects
└── {project-name}/
    ├── documents/                    # Uploaded files
    ├── output/                       # Processed results
    │   ├── extraction_results/*.md
    │   ├── chunked_documents/*.json
    │   ├── embedded_documents/*.json
    │   └── results.json              # Workflow answers + evaluations
    ├── config.json                   # Extraction instructions
    └── workflow_config.json          # Q&A templates
```
Browse local storage with Azure Storage Explorer connected to http://localhost:10000.
| Service | SKU | Pricing |
|---|---|---|
| Azure Container Apps | Consumption | Pricing |
| Azure OpenAI | Standard | Pricing |
| Azure AI Search | Basic | Pricing |
Note: Azure Document Intelligence pricing is ~$1.50/1000 pages. See Azure pricing.
To remove all deployed Azure resources:

```shell
azd down
```

- Quick Start - Get running in 5 minutes
- User Guide - Complete usage instructions
- Architecture - System design details
- Data Ingestion - Supported formats and pipeline
- Troubleshooting - Common issues
- Productionizing - Production readiness
- Local Development - Development setup
- Azure AI Foundry
- Azure AI Search Agentic Retrieval
- Microsoft Agent Framework
- Azure AI Evaluation SDK
- Azure Document Intelligence
MIT License - see LICENSE
All third-party dependencies use permissive licenses (MIT, BSD, Apache 2.0). See THIRD_PARTY_LICENSES.md for details.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines.