Sfenbox (SFIT Enquiry box) - Advanced Admission Enquiry Chatbot with Unified Hybrid RAG Framework (URAG)
A modular pipeline for building a college information chatbot using LLMs, with support for ingesting both crawled website data and PDF documents as context.
- URAG-D (Document Augmentation): Processes crawled web data or PDFs, then semantically chunks, rewrites, and summarizes the content for robust retrieval.
- URAG-F (FAQ Enrichment): Generates and paraphrases FAQs from augmented documents for diverse and accurate chatbot responses.
- PDF Support: Ingests and processes PDF files as context.
- FastAPI Backend: Exposes a /chat endpoint for chatbot queries.
urag-sfenbox/
│
├── python_backend/
│ ├── urag_preparation.py # Data preparation and augmentation pipeline
│ ├── main.py # FastAPI app for chatbot API
│ └── __init__.py
├── pdf_docs/ # (Recommended) Place your PDF files here
├── data/ # (Optional) JSON data files
├── .gitignore
└── README.md
- Clone the repository:
    git clone https://github.com/AnleaMJ/urag-sfenbox.git
    cd urag-sfenbox
- Create and activate a virtual environment:
    python -m venv sfenbox
    sfenbox\Scripts\activate # On Windows
- Install dependencies:
    pip install -r requirements.txt
  (If requirements.txt is missing, install manually: pip install fastapi uvicorn langchain-community langchain-core and other required packages.)
- Configure your environment:
    - Edit config.py with your HuggingFace API token and model names.
    - Place your PDF files in a folder (e.g., pdf_docs/).
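As a rough illustration of the configuration step, the sketch below shows one way a config.py could read the HuggingFace token and model names from environment variables instead of hard-coding them. The project's actual config.py may be laid out differently; the names `Settings`, `load_settings`, `HF_API_TOKEN`, `URAG_LLM_MODEL`, and `URAG_EMBEDDING_MODEL`, as well as the placeholder model names, are assumptions made for this example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container; field names are illustrative."""
    hf_api_token: str
    llm_model: str
    embedding_model: str


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping of environment variables.

    `env` defaults to os.environ; passing a plain dict makes testing easy.
    """
    env = os.environ if env is None else env
    token = env.get("HF_API_TOKEN", "")
    if not token:
        # Fail early rather than sending unauthenticated requests later.
        raise RuntimeError("HF_API_TOKEN is not set")
    return Settings(
        hf_api_token=token,
        # Placeholder defaults; replace with the model names you actually use.
        llm_model=env.get("URAG_LLM_MODEL", "your-llm-model-name"),
        embedding_model=env.get("URAG_EMBEDDING_MODEL", "your-embedding-model-name"),
    )
```

Reading the token from the environment keeps it out of version control, which fits the advice to keep generated and sensitive files out of commits.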
Run the preparation pipeline to process your data:
    python python_backend/urag_preparation.py
- PDF Crawling & Caching:
  All PDFs in the pdf_docs folder are automatically extracted and cached as pdf_crawled_data.json for faster future runs.
  On subsequent runs, the pipeline loads PDF data from this JSON file instead of re-processing the PDFs.
- To use both PDF files and firecrawl (web-crawled JSON) data as context:
    # In urag_preparation.py __main__ section:
    augmented_docs = prep.urag_d_augment_documents(
        use_pdf=True,
        pdf_folder="pdf_docs",
        use_firecrawl=True,
        firecrawl_json=None,  # or path to your firecrawl JSON file
        pdf_json="pdf_crawled_data.json"
    )
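The caching behaviour described above (extract once, then reuse the JSON file) can be pictured with a small sketch. This is not the pipeline's actual code; `load_or_build_cache` and `build_fn` are hypothetical names, and the real extraction logic in urag_preparation.py may differ.

```python
import json
from pathlib import Path


def load_or_build_cache(cache_path, build_fn):
    """Return cached JSON data if the file exists; otherwise build and cache it.

    cache_path: path to the JSON cache file (e.g. "pdf_crawled_data.json").
    build_fn:   zero-argument callable that performs the expensive extraction
                and returns JSON-serializable data (hypothetical stand-in for
                the PDF extraction step).
    """
    path = Path(cache_path)
    if path.exists():
        # Cache hit: skip re-processing the PDFs entirely.
        with path.open("r", encoding="utf-8") as f:
            return json.load(f)

    # Cache miss: run the extraction once and store the result.
    data = build_fn()
    with path.open("w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    return data
```

One consequence of this pattern is that a stale cache file hides newly added PDFs, so deleting the JSON file (or otherwise forcing a rebuild) is needed when the source documents change.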
Start the FastAPI server:
    uvicorn python_backend.main:app --reload
- The API will be available at http://127.0.0.1:8000
- Test the /chat endpoint with a POST request:
    { "question": "What courses are offered?" }
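For a quick check from Python, the POST request above can be sent with the standard library alone. This is a minimal sketch: the helper names `build_chat_request` and `ask` are made up for this example, and the shape of the JSON response is not documented here, so the code simply returns whatever JSON the server sends back.

```python
import json
import urllib.request


def build_chat_request(question, base_url="http://127.0.0.1:8000"):
    """Build a POST request for the /chat endpoint with a JSON body."""
    payload = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, base_url="http://127.0.0.1:8000", timeout=30):
    """Send the question to a running server and return the decoded JSON reply."""
    req = build_chat_request(question, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running locally, `ask("What courses are offered?")` should return the chatbot's reply as parsed JSON.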
- For production, use a process manager (e.g., Gunicorn with Uvicorn workers).
- Deploy on a cloud VM or platform (Azure, AWS, GCP, Heroku, etc.).
- Connect a frontend (React, Streamlit, etc.) to the FastAPI backend.
- Use the cached JSON (e.g., pdf_crawled_data.json) as context for faster repeated runs.
- Add new PDFs to pdf_docs/ and re-run the preparation pipeline as needed.
- Use .gitignore to avoid committing cache files.
MIT License