A powerful Retrieval-Augmented Generation (RAG) application that processes multiple document types including images using advanced OCR technology. Chat with your documents and images using natural language queries with zero configuration required.
- No Config Files Required: Enter your API key directly in the web interface
- Local Storage: API key is saved locally so you only enter it once
- Session-Based Security: Option to use session-only storage for enhanced privacy
- Instant Setup: Get running in under 2 minutes
- Text Documents: PDF, Word (.docx), Text files (.txt), Markdown (.md)
- Spreadsheets: Excel (.xlsx, .xls), CSV files
- Presentations: PowerPoint (.pptx)
- Data Files: JSON, XML, YAML
- Web Content: HTML documents
- Images with OCR: JPG, PNG, TIFF, BMP, WebP, GIF
- Hybrid Retrieval: Combines semantic search (FAISS) with keyword search (BM25)
- Ensemble Retrieval: Intelligently weights different search methods (60% semantic + 40% keyword)
- Dual OCR Engines: EasyOCR + Tesseract for maximum text extraction accuracy
- Smart Chunking: Document-type aware text splitting for optimal context
- Source Attribution: Shows exactly which documents provided each answer
- Metadata Preservation: Tracks file types, sources, and chunk information
- Real-time Processing Stats: See files processed, chunks created, and error details
- Source References: Expandable sections showing document sources for each answer
- Error Handling: Detailed feedback when files can't be processed
- PDF Export: Download complete Q&A sessions as formatted reports
- Progress Tracking: Visual indicators and processing summaries
- Python 3.8 or higher
- A free Groq API key from console.groq.com
- Clone the repository:
git clone https://github.com/Oam11/InsightDocs.git
cd InsightDocs- Install dependencies:
pip install -r requirements.txt-
Optional - Install Tesseract OCR (for better image processing):
- Windows: Download from GitHub Tesseract
- macOS:
brew install tesseract - Linux:
sudo apt-get install tesseract-ocr
-
Run the application:
streamlit run app.py-
Open your browser and go to
http://localhost:8501 -
Enter your Groq API key when prompted (get one free at console.groq.com)
- Check "Remember this key" to store it locally for future use
- Your API key will be saved in
~/.insightdocs/config.jsonfor convenience
That's it! No configuration files needed.
- Visit console.groq.com and create a free account
- Generate an API key (starts with
gsk_) - Enter it in the app when prompted
- Drag and drop files or use the file browser
- Mix different types: PDFs, images, spreadsheets, presentations
- Upload multiple files at once for comprehensive analysis
- Click "Process Documents"
- View the processing summary showing:
- Total files processed
- Number of text chunks created
- File types detected
- Any processing errors
- Use natural language queries
- Be specific for better results
- Reference document types when needed
- Get detailed responses with source attribution
- Click "Source Documents Used" to see which files provided the answer
- Download your Q&A session as a PDF
"What are the key performance metrics mentioned in the quarterly report?"
"List all action items from the meeting minutes"
"Compare sales figures between Q1 and Q2"
"Summarize the main findings from all research papers"
"What methodologies were used in the studies?"
"Extract all statistical data mentioned in the documents"
"Find all references to budget allocations"
"What are the compliance requirements mentioned?"
"List all contact information from the uploaded files"
"What text is visible in the uploaded screenshots?"
"Extract data from the chart in the image"
"What information can you read from the scanned document?"
This usually happens when files can't be read properly:
- Test with simple file: Try uploading the included
test_document.txt - Check file formats: Ensure files have proper extensions
- Remove passwords: Documents must not be password-protected
- Upload individually: Test files one at a time to identify issues
- Check file corruption: Open files in their native applications first
- Ensure your key starts with
gsk_ - Verify your Groq account is active
- Try generating a new API key if issues persist
- Install Tesseract OCR for better image processing
- Use high-resolution, clear images
- Ensure good contrast between text and background
- Check file size limits (default 200MB per file)
- Try smaller files first
- Use supported file formats only
- Frontend: Streamlit with enhanced UI/UX
- Embeddings: SentenceTransformers all-MiniLM-L6-v2 (CPU optimized)
- Vector Store: FAISS for semantic search
- Keyword Search: BM25 for exact term matching
- OCR: Dual-engine approach (EasyOCR + Tesseract)
- Lightweight: No heavy transformers, CPU optimized
- Fast Processing: Efficient document chunking and indexing
- Hybrid Search: Best of both semantic and keyword retrieval
- Memory Efficient: Optimized for standard hardware
- Upload & Validation: Files saved with proper extensions
- Type Detection: Smart MIME type and extension-based detection
- Content Extraction: Format-specific readers with fallbacks
- OCR Processing: Images processed with dual OCR engines
- Smart Chunking: Context-aware text splitting
- Dual Indexing: Both vector and keyword indexes created
- Ensemble Retrieval: Weighted combination of search methods
| Category | Formats | Notes |
|---|---|---|
| Documents | PDF, DOCX, TXT, MD | Text-based content only |
| Spreadsheets | XLSX, XLS, CSV | All sheets processed |
| Presentations | PPTX | Text and slide content |
| Images | JPG, PNG, TIFF, BMP, WebP, GIF | OCR text extraction |
| Data | JSON, XML, YAML | Structured data parsing |
| Web | HTML | Text content extraction |
| Code | PY, JS, TS, JAVA, CPP, C | Code-aware chunking |
- โ Zero Configuration: No complex setup or config files
- โ Multi-Modal: Handles both text and images seamlessly
- โ Hybrid Search: Better retrieval than pure vector search
- โ Source Attribution: Always know where answers come from
- โ Error Resilience: Graceful handling of problematic files
- โ Document Context: Maintains awareness of document structure
- โ Batch Processing: Handle multiple documents simultaneously
- โ Persistent Sessions: Keep context across conversations
- โ Export Capability: Generate reports from your analysis
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License - see the LICENSE file for details.
- Groq - Ultra-fast LLM inference
- Streamlit - Elegant web framework
- LangChain - RAG orchestration
- EasyOCR - OCR capabilities
- Tesseract - Text recognition
๐ Ready to explore your documents?
Run streamlit run app.py and start chatting with your files in minutes!
- Multi-Document & Image Support: Upload and process multiple files simultaneously
- Advanced OCR: Extract text from images using EasyOCR and Tesseract
- Hybrid Retrieval: Combines semantic search (FAISS) with keyword search (BM25)
- Session Management: Unique session tracking with persistent chat history
- PDF Export: Download your entire Q&A session as a formatted PDF
Documents:
- PDF documents
- Word documents (DOCX)
- Text files (TXT, MD)
- Spreadsheets (CSV, XLS, XLSX)
- JSON & XML data
- YAML configuration files
- PowerPoint presentations (PPTX)
- HTML documents
Images (with OCR):
- JPEG, PNG, TIFF, BMP
- WebP, GIF formats
- Automatic text extraction
- Image preprocessing for better OCR accuracy
Code Files:
- Python, JavaScript, TypeScript
- Java, C++, C
- Optimized chunking for code structure
- Ensemble Retrieval: Combines vector similarity and keyword matching
- Smart Chunking: Document-type aware text splitting
- Source Tracking: Know which documents provided each answer
- Metadata Enrichment: Rich context with file types and chunk information
- Lightweight Architecture: No heavy transformers, optimized for efficiency
- Python 3.8 or higher
- Groq API key
- Clone the repository:
git clone https://github.com/Oam11/InsightDocs.git
cd InsightDocs- Create and activate a virtual environment:
python -m venv vev
# On Windows:
vev\Scripts\activate
# On macOS/Linux:
source vev/bin/activate- Install dependencies:
pip install -r requirements.txt-
Set up OCR (Optional but recommended):
- Install Tesseract OCR: Installation Guide
- On Windows: Download from GitHub releases
- On macOS:
brew install tesseract - On Ubuntu:
sudo apt install tesseract-ocr
-
Get your Groq API key:
- Visit console.groq.com
- Sign up for a free account
- Generate an API key (starts with
gsk_) - No configuration files needed - you'll enter it directly in the app!
- Start the application:
streamlit run app.py- Upload documents and images using the sidebar file uploader
- Click "Process Documents" to analyze content with OCR and text extraction
- Ask questions about your uploaded content
- View source references to see which documents provided the answers
- Download Q&A session as a formatted PDF
- File Type Detection: Automatic identification of document and image types
- OCR Processing: Extract text from images using dual OCR engines
- Smart Chunking: Document-type aware text splitting for optimal context
- Hybrid Indexing: Combines FAISS vector store with BM25 keyword search
- Metadata Enrichment: Track sources, chunk IDs, and content types
- Ensemble Retrieval: Combines semantic and keyword search (60/40 weight)
- Context-Aware Generation: Uses retrieved chunks with source information
- Source Attribution: Shows which documents contributed to each answer
- Session Persistence: Maintains conversation history and document context
- Dual OCR Engines: EasyOCR + Tesseract for maximum accuracy
- Image Preprocessing: Denoising, thresholding, morphological operations
- Format Support: JPEG, PNG, TIFF, BMP, WebP, GIF
- Metadata Extraction: Image properties, EXIF data when available
Core Components:
- Frontend: Streamlit with enhanced UI/UX
- Embeddings: SentenceTransformers all-MiniLM-L6-v2 (CPU optimized)
- Vector Store: FAISS for semantic search
- Keyword Search: BM25Okapi for exact matching
- OCR: EasyOCR + Tesseract (dual engine approach)
Performance Optimizations:
- Lightweight architecture (no heavy transformers)
- CPU-optimized embeddings
- Efficient document chunking
- Ensemble retrieval for accuracy
- Minimal bandwidth requirements
- Research: Analyze academic papers, reports, and documents
- Business Intelligence: Extract insights from presentations, spreadsheets
- Document Management: Search through large document collections
- Image Analysis: Extract text from scanned documents, charts, diagrams
- Code Review: Understand code repositories and documentation
- Legal/Compliance: Review contracts, policies, and regulatory documents
Common Issues:
-
API Key Problems:
- Ensure your Groq API key is correctly set in
.streamlit/secrets.toml - Verify the API key format starts with
gsk_ - Check API key permissions and rate limits
- Ensure your Groq API key is correctly set in
-
OCR Issues:
- Install Tesseract OCR if image processing fails
- Ensure image files are clear and readable
- Try different image formats (PNG often works better than JPEG)
-
File Processing Errors:
- Check if the file format is supported
- Ensure files aren't corrupted or password-protected
- Try processing files individually to isolate issues
-
Performance Optimization:
- Process documents in smaller batches for large collections
- Use high-quality images for better OCR accuracy
- Clear browser cache if UI becomes unresponsive
- Support for more image formats (SVG, HEIC)
- Advanced document layout analysis
- Multi-language OCR support
- Integration with cloud storage (Google Drive, Dropbox)
- Real-time collaboration features
- API endpoint for programmatic access
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License - see the LICENSE file for details.
- Groq for the fast inference API
- Streamlit for the web framework
- LangChain for RAG components
- EasyOCR for OCR capabilities
- Tesseract for text recognition
Built with โค๏ธ for the AI community
- Verify file encoding (UTF-8 recommended)
- Model Errors:
- Ensure you're using a compatible version of the Groq API
- Check your internet connection
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.