

Sfenbox (SFIT Enquiry box) - Advanced Admission Enquiry Chatbot with Unified Hybrid RAG Framework (URAG)

A modular pipeline for building a college information chatbot using LLMs, with support for ingesting both crawled website data and PDF documents as context.


Features

  • URAG-D (Document Augmentation):
    Processes crawled web data or PDFs, then semantically chunks, rewrites, and summarizes the content for robust retrieval.
  • URAG-F (FAQ Enrichment):
    Generates and paraphrases FAQs from the augmented documents for diverse and accurate chatbot responses (a rough sketch of both stages follows this list).
  • PDF Support:
    Ingests and processes PDF files as context.
  • FastAPI Backend:
    Exposes a /chat endpoint for chatbot queries.
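
The two stages can be pictured as a simple pipeline. The sketch below is illustrative only: the function names and their trivial bodies are placeholders for the LLM-driven steps described above, not the actual API of urag_preparation.py.

    # Illustrative sketch of the two URAG stages; the helper bodies are trivial
    # placeholders (the real pipeline uses an LLM for chunk rewriting,
    # summarizing, and FAQ generation).

    def urag_d_sketch(raw_docs: list[str]) -> list[str]:
        """URAG-D: semantically chunk, rewrite, and summarize each document."""
        augmented = []
        for doc in raw_docs:
            chunks = [p for p in doc.split("\n\n") if p.strip()]  # stand-in for semantic chunking
            for chunk in chunks:
                augmented.append(chunk)        # in practice: an LLM-rewritten chunk
                augmented.append(chunk[:200])  # in practice: an LLM summary of the chunk
        return augmented

    def urag_f_sketch(augmented_docs: list[str]) -> list[dict]:
        """URAG-F: generate and paraphrase FAQs from the augmented documents."""
        # In practice an LLM writes question/answer pairs plus paraphrased variants.
        return [{"question": f"FAQ #{i}", "answer": doc} for i, doc in enumerate(augmented_docs)]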

Folder Structure

urag-sfenbox/
│
├── python_backend/
│   ├── urag_preparation.py      # Data preparation and augmentation pipeline
│   ├── main.py                  # FastAPI app for chatbot API
│   └── __init__.py
├── pdf_docs/                    # (Recommended) Place your PDF files here
├── data/                        # (Optional) JSON data files
├── .gitignore
└── README.md

Setup

  1. Clone the repository:

    git clone https://github.com/AnleaMJ/urag-sfenbox.git
    cd urag-sfenbox
  2. Create and activate a virtual environment:

    python -m venv sfenbox
    sfenbox\Scripts\activate       # On Windows
    source sfenbox/bin/activate    # On macOS/Linux
  3. Install dependencies:

    pip install -r requirements.txt

    (If requirements.txt is missing, install the core packages manually, e.g. pip install fastapi uvicorn langchain-community langchain-core, plus any other packages the code imports.)

  4. Configure your environment:

    • Edit config.py with your HuggingFace API token and model names (an illustrative config.py is sketched after these steps).
    • Place your PDF files in a folder (e.g., pdf_docs/).
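
A minimal config.py might look like the sketch below. The variable names and model identifiers are assumptions for illustration; match them to whatever names the repository's code actually imports from config.py.

    # config.py -- illustrative sketch only; variable names and model IDs are
    # assumptions, so align them with what the code actually reads.
    import os

    HF_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN", "hf_xxx")   # HuggingFace API token
    LLM_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"                 # example generation model
    EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"       # example embedding model
    PDF_FOLDER = "pdf_docs"                                          # folder containing your PDFs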

Data Preparation

Run the preparation pipeline to process your data:

python python_backend/urag_preparation.py
  • PDF Crawling & Caching:
    All PDFs in the pdf_docs folder are automatically extracted and cached as pdf_crawled_data.json for faster future runs.
    On subsequent runs, the pipeline loads PDF data from this JSON file instead of re-processing the PDFs.

  • To use both PDF files and Firecrawl (web-crawled JSON) data as context:

    # In urag_preparation.py __main__ section:
    augmented_docs = prep.urag_d_augment_documents(
        use_pdf=True,
        pdf_folder="pdf_docs",
        use_firecrawl=True,
        firecrawl_json=None,  # or path to your firecrawl JSON file
        pdf_json="pdf_crawled_data.json"
    )
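
The extract-and-cache behaviour can be pictured roughly as follows. This is a sketch under assumptions: it uses pypdf for text extraction and a simple JSON layout, whereas urag_preparation.py may use a different loader and schema.

    # Rough sketch of the PDF extract-and-cache step (assumes pypdf and a simple
    # JSON schema; the actual pipeline may differ).
    import json
    from pathlib import Path
    from pypdf import PdfReader

    def load_or_extract_pdfs(pdf_folder="pdf_docs", cache_file="pdf_crawled_data.json"):
        cache = Path(cache_file)
        if cache.exists():                       # reuse the cached extraction on later runs
            return json.loads(cache.read_text(encoding="utf-8"))
        data = []
        for pdf_path in sorted(Path(pdf_folder).glob("*.pdf")):
            text = "\n".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)
            data.append({"source": pdf_path.name, "text": text})
        cache.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")
        return data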

Running the Chatbot API

Start the FastAPI server:

uvicorn python_backend.main:app --reload
  • The API will be available at http://127.0.0.1:8000
  • Test the /chat endpoint with a POST request body such as the following (a Python client example is shown after this list):
    {
      "question": "What courses are offered?"
    }
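
A quick way to exercise the endpoint is a small Python client like the one below; the exact fields in the response depend on how main.py builds its reply, so treat the printed output shape as an assumption.

    # Minimal test client for the /chat endpoint (assumes the server started
    # with the uvicorn command above is running on port 8000).
    import requests

    resp = requests.post(
        "http://127.0.0.1:8000/chat",
        json={"question": "What courses are offered?"},
    )
    print(resp.status_code)
    print(resp.json())   # response field names depend on main.py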

Deployment

  • For production, use a process manager (e.g., Gunicorn with Uvicorn workers; an example command follows this list).
  • Deploy on a cloud VM or platform (Azure, AWS, GCP, Heroku, etc.).
  • Connect a frontend (React, Streamlit, etc.) to the FastAPI backend.
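
For example, a typical Gunicorn invocation with Uvicorn workers looks like the following; the worker count and bind address are illustrative, so tune them for your host:

    gunicorn python_backend.main:app -w 4 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000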

Tips

  • Reuse the cached JSON (pdf_crawled_data.json or a Firecrawl export) as context for faster repeated runs.
  • Add new PDFs to pdf_docs/ and re-run the preparation pipeline as needed.
  • Use .gitignore to avoid committing cache files and the virtual environment (example entries follow this list).
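
For example, a .gitignore along these lines keeps the virtual environment, bytecode caches, and the PDF cache out of version control (the entries are suggestions based on the setup above):

    # suggested .gitignore entries (adjust to your setup)
    sfenbox/
    __pycache__/
    pdf_crawled_data.json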

License

MIT License


Acknowledgements
