Oam11/InsightDocs

📚 InsightDocs - Multi-Format RAG Application

A Retrieval-Augmented Generation (RAG) application that processes many document types, including images via OCR. Chat with your documents and images using natural-language queries; no configuration files are required.

✨ Features

🔑 Zero Configuration Setup

  • No Config Files Required: Enter your API key directly in the web interface
  • Local Storage: API key is saved locally so you only enter it once
  • Session-Based Security: Option to use session-only storage for enhanced privacy
  • Instant Setup: Get running in under 2 minutes

📄 Comprehensive Document Support

  • Text Documents: PDF, Word (.docx), Text files (.txt), Markdown (.md)
  • Spreadsheets: Excel (.xlsx, .xls), CSV files
  • Presentations: PowerPoint (.pptx)
  • Data Files: JSON, XML, YAML
  • Web Content: HTML documents
  • Images with OCR: JPG, PNG, TIFF, BMP, WebP, GIF

🧠 Advanced RAG Technology

  • Hybrid Retrieval: Combines semantic search (FAISS) with keyword search (BM25)
  • Ensemble Retrieval: Weights the two search methods (60% semantic, 40% keyword)
  • Dual OCR Engines: EasyOCR + Tesseract for maximum text extraction accuracy
  • Smart Chunking: Document-type aware text splitting for optimal context
  • Source Attribution: Shows exactly which documents provided each answer
  • Metadata Preservation: Tracks file types, sources, and chunk information
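
The 60/40 weighting can be illustrated with weighted reciprocal rank fusion, one common way to merge ranked lists from two retrievers. This is only a sketch and may differ from what the app does internally: the function name `fuse_rankings`, the smoothing constant `k=60`, and the document IDs are assumptions made for the example.

```python
from collections import defaultdict


def fuse_rankings(rankings, weights, k=60):
    """Merge ranked lists of document IDs with weighted reciprocal rank fusion.

    rankings: one ranked list (best first) per retriever.
    weights:  one weight per retriever, e.g. [0.6, 0.4].
    k:        smoothing constant; 60 is a conventional choice (assumption).
    """
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties are broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result orders from the semantic and keyword retrievers.
semantic = ["report.pdf#3", "notes.md#1", "scan.png#0"]
keyword = ["notes.md#1", "table.csv#2", "report.pdf#3"]
fused = fuse_rankings([semantic, keyword], weights=[0.6, 0.4])
print(fused)
```

A chunk that both retrievers rank highly ends up ahead of one that only a single retriever found, which is the practical benefit of combining the two.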

🎯 User Experience

  • Real-time Processing Stats: See files processed, chunks created, and error details
  • Source References: Expandable sections showing document sources for each answer
  • Error Handling: Detailed feedback when files can't be processed
  • PDF Export: Download complete Q&A sessions as formatted reports
  • Progress Tracking: Visual indicators and processing summaries

🚀 Installation & Setup

Prerequisites

  • Python 3.8 or higher
  • Groq API key

Quick Install

  1. Clone the repository:

       git clone https://github.com/Oam11/InsightDocs.git
       cd InsightDocs

  2. Install dependencies:

       pip install -r requirements.txt

  3. Optional: install Tesseract OCR (for better image processing):

    • Windows: Download from the Tesseract GitHub page
    • macOS: brew install tesseract
    • Linux (Debian/Ubuntu): sudo apt-get install tesseract-ocr

  4. Run the application:

       streamlit run app.py

  5. Open your browser and go to http://localhost:8501

  6. Enter your Groq API key when prompted (get one free at console.groq.com)

    • Check "Remember this key" to store it locally for future use
    • Your API key will be saved in ~/.insightdocs/config.json for convenience

That's it! No configuration files needed.
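
For readers curious how a "remember this key" option might persist the key, here is a minimal sketch using only the standard library. It is not taken from the app: the helper names `save_api_key` and `load_api_key` and the JSON field `groq_api_key` are assumptions, and the demo writes to a temporary directory rather than to ~/.insightdocs/config.json.

```python
import json
import os
import tempfile
from pathlib import Path


def save_api_key(api_key, config_path):
    """Write the key to a JSON file, restricted to the current user where possible."""
    config_path = Path(config_path)
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps({"groq_api_key": api_key}))
    os.chmod(config_path, 0o600)  # best effort; has limited effect on Windows


def load_api_key(config_path):
    """Return the stored key, or None if the file is missing or malformed."""
    try:
        data = json.loads(Path(config_path).read_text())
    except (OSError, ValueError):  # ValueError covers invalid JSON and bad encoding
        return None
    key = data.get("groq_api_key") if isinstance(data, dict) else None
    # Groq keys start with "gsk_", as noted in this README.
    return key if isinstance(key, str) and key.startswith("gsk_") else None


# Demo in a throwaway directory so no real configuration is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / ".insightdocs" / "config.json"
    save_api_key("gsk_example_key", demo_path)
    loaded = load_api_key(demo_path)
    missing = load_api_key(Path(tmp) / "does_not_exist.json")
print(loaded, missing)
```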

🎮 How to Use

1. Get Your API Key

  • Visit console.groq.com and create a free account
  • Generate an API key (starts with gsk_)
  • Enter it in the app when prompted

2. Upload Your Documents

  • Drag and drop files or use the file browser
  • Mix different types: PDFs, images, spreadsheets, presentations
  • Upload multiple files at once for comprehensive analysis

3. Process Documents

  • Click "Process Documents"
  • View the processing summary showing:
    • Total files processed
    • Number of text chunks created
    • File types detected
    • Any processing errors

4. Ask Questions

  • Use natural language queries
  • Be specific for better results
  • Reference document types when needed

5. Review Answers

  • Get detailed responses with source attribution
  • Click "Source Documents Used" to see which files provided the answer
  • Download your Q&A session as a PDF

💡 Example Use Cases & Questions

Business Intelligence

"What are the key performance metrics mentioned in the quarterly report?"
"List all action items from the meeting minutes"
"Compare sales figures between Q1 and Q2"

Research & Analysis

"Summarize the main findings from all research papers"
"What methodologies were used in the studies?"
"Extract all statistical data mentioned in the documents"

Document Review

"Find all references to budget allocations"
"What are the compliance requirements mentioned?"
"List all contact information from the uploaded files"

Image Analysis

"What text is visible in the uploaded screenshots?"
"Extract data from the chart in the image"
"What information can you read from the scanned document?"

🔧 Troubleshooting

"No text content could be extracted" Error

This usually happens when files can't be read properly:

  1. Test with a simple file: Try uploading the included test_document.txt
  2. Check file formats: Ensure files have proper extensions
  3. Remove passwords: Documents must not be password-protected
  4. Upload individually: Test files one at a time to identify issues
  5. Check file corruption: Open files in their native applications first

API Key Issues

  • Ensure your key starts with gsk_
  • Verify your Groq account is active
  • Try generating a new API key if issues persist

OCR Not Working

  • Install Tesseract OCR for better image processing
  • Use high-resolution, clear images
  • Ensure good contrast between text and background

Upload Issues

  • Check file size limits (default 200MB per file)
  • Try smaller files first
  • Use supported file formats only

📊 Technical Architecture

Core Components

  • Frontend: Streamlit with enhanced UI/UX
  • Embeddings: SentenceTransformers all-MiniLM-L6-v2 (CPU optimized)
  • Vector Store: FAISS for semantic search
  • Keyword Search: BM25 for exact term matching
  • OCR: Dual-engine approach (EasyOCR + Tesseract)

Performance Features

  • Lightweight: No large transformer models; the compact all-MiniLM-L6-v2 embedding model runs on CPU
  • Fast Processing: Efficient document chunking and indexing
  • Hybrid Search: Best of both semantic and keyword retrieval
  • Memory Efficient: Optimized for standard hardware

File Processing Pipeline

  1. Upload & Validation: Files saved with proper extensions
  2. Type Detection: Smart MIME type and extension-based detection
  3. Content Extraction: Format-specific readers with fallbacks
  4. OCR Processing: Images processed with dual OCR engines
  5. Smart Chunking: Context-aware text splitting
  6. Dual Indexing: Both vector and keyword indexes created
  7. Ensemble Retrieval: Weighted combination of search methods
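
The type-detection step of this pipeline could look roughly like the following: try the file extension first, then fall back to a MIME guess. This is a sketch under assumptions, not the app's code. The category names, the `detect_category` helper, and the extra aliases (.jpeg, .tif, .yml, .htm) are illustrative.

```python
import mimetypes
from pathlib import Path

# Hypothetical category map built from the formats this README lists,
# plus a few common aliases (assumption).
CATEGORY_BY_EXTENSION = {
    ".pdf": "document", ".docx": "document", ".txt": "document", ".md": "document",
    ".xlsx": "spreadsheet", ".xls": "spreadsheet", ".csv": "spreadsheet",
    ".pptx": "presentation",
    ".json": "data", ".xml": "data", ".yaml": "data", ".yml": "data",
    ".html": "web", ".htm": "web",
    ".jpg": "image", ".jpeg": "image", ".png": "image", ".tiff": "image",
    ".tif": "image", ".bmp": "image", ".webp": "image", ".gif": "image",
}


def detect_category(filename):
    """Classify a file by extension, falling back to a MIME type guess."""
    ext = Path(filename).suffix.lower()
    if ext in CATEGORY_BY_EXTENSION:
        return CATEGORY_BY_EXTENSION[ext]
    mime, _ = mimetypes.guess_type(filename)
    if mime and mime.startswith("text/"):
        return "document"  # treat unknown plain-text types as documents
    return "unsupported"


for name in ["Report.PDF", "scan.jpeg", "data.yaml", "archive.zip"]:
    print(name, "->", detect_category(name))
```

Checking the extension case-insensitively avoids rejecting files such as Report.PDF, and returning an explicit "unsupported" value gives the caller something to report in the processing summary.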

📈 Supported File Formats

| Category      | Formats                        | Notes                   |
|---------------|--------------------------------|-------------------------|
| Documents     | PDF, DOCX, TXT, MD             | Text-based content only |
| Spreadsheets  | XLSX, XLS, CSV                 | All sheets processed    |
| Presentations | PPTX                           | Text and slide content  |
| Images        | JPG, PNG, TIFF, BMP, WebP, GIF | OCR text extraction     |
| Data          | JSON, XML, YAML                | Structured data parsing |
| Web           | HTML                           | Text content extraction |
| Code          | PY, JS, TS, JAVA, CPP, C       | Code-aware chunking     |

🏆 Key Advantages

vs. Traditional RAG Systems

  • ✅ Zero Configuration: No complex setup or config files
  • ✅ Multi-Modal: Handles both text and images seamlessly
  • ✅ Hybrid Search: Better retrieval than pure vector search
  • ✅ Source Attribution: Always know where answers come from
  • ✅ Error Resilience: Graceful handling of problematic files

vs. Chat Interfaces

  • ✅ Document Context: Maintains awareness of document structure
  • ✅ Batch Processing: Handle multiple documents simultaneously
  • ✅ Persistent Sessions: Keep context across conversations
  • ✅ Export Capability: Generate reports from your analysis

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments


🚀 Ready to explore your documents?
Run streamlit run app.py and start chatting with your files in minutes!

🎯 Core Capabilities

  • Multi-Document & Image Support: Upload and process multiple files simultaneously
  • Advanced OCR: Extract text from images using EasyOCR and Tesseract
  • Hybrid Retrieval: Combines semantic search (FAISS) with keyword search (BM25)
  • Session Management: Unique session tracking with persistent chat history
  • PDF Export: Download your entire Q&A session as a formatted PDF

📄 Supported File Formats

Documents:

  • PDF documents
  • Word documents (DOCX)
  • Text files (TXT, MD)
  • Spreadsheets (CSV, XLS, XLSX)
  • JSON & XML data
  • YAML configuration files
  • PowerPoint presentations (PPTX)
  • HTML documents

Images (with OCR):

  • JPEG, PNG, TIFF, BMP
  • WebP, GIF formats
  • Automatic text extraction
  • Image preprocessing for better OCR accuracy

Code Files:

  • Python, JavaScript, TypeScript
  • Java, C++, C
  • Optimized chunking for code structure

🧠 Enhanced RAG Features

  • Ensemble Retrieval: Combines vector similarity and keyword matching
  • Smart Chunking: Document-type aware text splitting
  • Source Tracking: Know which documents provided each answer
  • Metadata Enrichment: Rich context with file types and chunk information
  • Lightweight Architecture: No large transformer models, optimized for efficiency

Prerequisites

  • Python 3.8 or higher
  • Groq API key

🚀 Installation

  1. Clone the repository:

       git clone https://github.com/Oam11/InsightDocs.git
       cd InsightDocs

  2. Create and activate a virtual environment:

       python -m venv vev
       # On Windows:
       vev\Scripts\activate
       # On macOS/Linux:
       source vev/bin/activate

  3. Install dependencies:

       pip install -r requirements.txt

  4. Set up OCR (optional but recommended) by installing Tesseract:

    • Windows: Download from the Tesseract GitHub page
    • macOS: brew install tesseract
    • Linux (Debian/Ubuntu): sudo apt-get install tesseract-ocr

  5. Get your Groq API key:

    • Visit console.groq.com
    • Sign up for a free account
    • Generate an API key (starts with gsk_)
    • No configuration files needed - you'll enter it directly in the app!

🎮 Usage

  1. Start the application:

       streamlit run app.py

  2. Upload documents and images using the sidebar file uploader
  3. Click "Process Documents" to analyze content with OCR and text extraction
  4. Ask questions about your uploaded content
  5. View source references to see which documents provided the answers
  6. Download Q&A session as a formatted PDF

🔧 How It Works

📊 Document Processing Pipeline

  1. File Type Detection: Automatic identification of document and image types
  2. OCR Processing: Extract text from images using dual OCR engines
  3. Smart Chunking: Document-type aware text splitting for optimal context
  4. Hybrid Indexing: Combines FAISS vector store with BM25 keyword search
  5. Metadata Enrichment: Track sources, chunk IDs, and content types
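
The chunking and metadata steps can be sketched as a fixed-size splitter with overlap that tags each chunk with its source and ID. This is a simplified illustration: the app's document-type aware splitting is not shown here, and the `chunk_text` helper, its field names, and the sizes are assumptions.

```python
def chunk_text(text, source, chunk_size=500, overlap=50):
    """Split text into overlapping character windows, each tagged with metadata."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    chunks = []
    step = chunk_size - overlap
    for chunk_id, start in enumerate(range(0, len(text), step)):
        chunks.append({
            "text": text[start:start + chunk_size],
            "source": source,      # file the chunk came from
            "chunk_id": chunk_id,  # position of the chunk within that file
            "start": start,        # character offset, useful for attribution
        })
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
chunks = chunk_text(sample, source="notes.txt", chunk_size=12, overlap=4)
for chunk in chunks:
    print(chunk["chunk_id"], chunk["start"], chunk["text"])
```

The overlap means text that straddles a boundary appears whole in at least one chunk, and the metadata is what later makes source attribution possible.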

🧠 Enhanced RAG System

  1. Ensemble Retrieval: Combines semantic and keyword search (60/40 weight)
  2. Context-Aware Generation: Uses retrieved chunks with source information
  3. Source Attribution: Shows which documents contributed to each answer
  4. Session Persistence: Maintains conversation history and document context
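
As a rough illustration of context-aware generation with source attribution, the retrieved chunks can be numbered and labeled with their source before being sent to the model. The prompt wording, the `build_prompt` helper, and the sample chunks are assumptions made for this sketch, not the app's actual prompt.

```python
def build_prompt(question, chunks):
    """Format retrieved chunks into a prompt and collect the distinct sources."""
    context_lines = []
    sources = []
    for i, chunk in enumerate(chunks, start=1):
        context_lines.append(f"[{i}] ({chunk['source']}) {chunk['text']}")
        if chunk["source"] not in sources:
            sources.append(chunk["source"])  # keep first-seen order, no duplicates
    prompt = (
        "Answer the question using only the numbered context below.\n\n"
        + "\n".join(context_lines)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    return prompt, sources


# Hypothetical retrieved chunks.
retrieved = [
    {"source": "report.pdf", "text": "Quarterly figures are summarized in section 2."},
    {"source": "sales.csv", "text": "region,quarter,revenue"},
    {"source": "report.pdf", "text": "Section 2 compares Q1 and Q2."},
]
prompt, sources = build_prompt("Where are Q1 and Q2 compared?", retrieved)
print(prompt)
print(sources)
```

The returned `sources` list is what a UI could display in an expandable "sources used" section next to the answer.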

🖼️ Image Processing Features

  • Dual OCR Engines: EasyOCR + Tesseract for maximum accuracy
  • Image Preprocessing: Denoising, thresholding, morphological operations
  • Format Support: JPEG, PNG, TIFF, BMP, WebP, GIF
  • Metadata Extraction: Image properties, EXIF data when available

🛠️ Technical Architecture

Core Components:

  • Frontend: Streamlit with enhanced UI/UX
  • Embeddings: SentenceTransformers all-MiniLM-L6-v2 (CPU optimized)
  • Vector Store: FAISS for semantic search
  • Keyword Search: BM25Okapi for exact matching
  • OCR: EasyOCR + Tesseract (dual engine approach)
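
To show what the keyword side of the search computes, here is a small pure-Python version of Okapi BM25 scoring. It is a sketch, not the library the app uses: the `bm25_scores` function, the whitespace tokenization, the parameters `k1=1.5` and `b=0.75`, and the non-negative IDF variant are assumptions, and a dedicated BM25 package may differ in details.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in the corpus against the query."""
    if not corpus_tokens:
        return []
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Number of documents that contain each term.
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Non-negative IDF variant: ln((N - n + 0.5) / (n + 0.5) + 1).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates via k1; b controls length normalization.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores


# Hypothetical three-document corpus, tokenized on whitespace.
corpus = [
    "quarterly revenue grew in q2".split(),
    "meeting minutes and action items".split(),
    "q2 budget allocation and revenue forecast".split(),
]
scores = bm25_scores("action items".split(), corpus)
print(scores)
```

Only the second document contains the query terms, so it is the only one with a non-zero score; exact-term behavior like this is what complements the semantic search.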

Performance Optimizations:

  • Lightweight architecture (no large transformer models)
  • CPU-optimized embeddings
  • Efficient document chunking
  • Ensemble retrieval for accuracy
  • Minimal bandwidth requirements

🎯 Use Cases

  • Research: Analyze academic papers, reports, and documents
  • Business Intelligence: Extract insights from presentations, spreadsheets
  • Document Management: Search through large document collections
  • Image Analysis: Extract text from scanned documents, charts, diagrams
  • Code Review: Understand code repositories and documentation
  • Legal/Compliance: Review contracts, policies, and regulatory documents

🔍 Troubleshooting

Common Issues:

  1. API Key Problems:

    • Ensure your Groq API key is entered correctly in the app (or, if you keep it in .streamlit/secrets.toml, that the value there is correct)
    • Verify the API key format starts with gsk_
    • Check API key permissions and rate limits
  2. OCR Issues:

    • Install Tesseract OCR if image processing fails
    • Ensure image files are clear and readable
    • Try different image formats (PNG often works better than JPEG)
  3. File Processing Errors:

    • Check if the file format is supported
    • Ensure files aren't corrupted or password-protected
    • Try processing files individually to isolate issues
  4. Performance Optimization:

    • Process documents in smaller batches for large collections
    • Use high-quality images for better OCR accuracy
    • Clear browser cache if UI becomes unresponsive

📈 Roadmap

  • Support for more image formats (SVG, HEIC)
  • Advanced document layout analysis
  • Multi-language OCR support
  • Integration with cloud storage (Google Drive, Dropbox)
  • Real-time collaboration features
  • API endpoint for programmatic access

Built with ❤️ for the AI community

  • Verify file encoding (UTF-8 recommended)
  1. Model Errors:
    • Ensure you're using a compatible version of the Groq API
    • Check your internet connection


About

A Streamlit application that allows you to chat with your documents using Groq's Gemma2 model. This application implements a Retrieval Augmented Generation (RAG) model to provide accurate and context-aware responses based on your document content.
