A full-stack application for scraping Discord servers and querying the collected messages with either RAG (Retrieval-Augmented Generation) or SQL. The system uses Modal for serverless deployment, FastAPI for the backend, and React with TypeScript for the frontend.
The application consists of two main services:
- Backend Service (Modal/FastAPI)
  - Handles Discord server scraping
  - Manages SQLite database with vector embeddings
  - Processes queries using either RAG or SQL approaches
  - Integrates with OpenAI for embeddings and completions
- Frontend Service (React/TypeScript)
  - Provides user interface for server scraping
  - Enables natural language querying of Discord data
  - Visualizes the query processing pipeline
  - Built with shadcn/ui components
Prerequisites:
- Python 3.8+
- Bun 1.0+
- Modal CLI
- OpenAI API key
- Discord Bot Token with necessary permissions
- uv (Python package installer)
- Clone the repository:
git clone <repository-url>
cd discord-rag-pipeline
- Create a .env file in the root directory:
OPENAI_API_KEY=your_openai_api_key
DISCORD_TOKEN=your_discord_bot_token
- Create a .env file in the frontend_service directory:
VITE_MODAL_URL=http://localhost:8000 # For local development
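Before deploying, it can help to confirm that the root .env file actually contains both keys. The following is a minimal sketch, not the project's own loader: the helper names `load_env_file` and `require_settings` are hypothetical, and the real backend may read these values differently (for example through Modal secrets).

```python
from pathlib import Path


def load_env_file(path):
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    values = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, full-line comments, and anything without "=".
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing " # comment" and any surrounding quotes.
        value = value.split(" #", 1)[0].strip().strip("\"'")
        values[key.strip()] = value
    return values


def require_settings(values, names=("OPENAI_API_KEY", "DISCORD_TOKEN")):
    """Return the required settings, or raise naming the ones that are missing."""
    missing = [name for name in names if not values.get(name)]
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))
    return {name: values[name] for name in names}
```

Calling `require_settings(load_env_file(".env"))` from the repository root raises a `RuntimeError` that lists any key left empty or absent.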
- Navigate to the backend directory:
cd backend_service
- Install dependencies with uv:
uv pip install -r requirements.txt
- Deploy to Modal:
modal deploy src/modal_app/main.py
- Navigate to the frontend directory:
cd frontend_service
- Install dependencies with Bun:
bun install
- Start the development server:
bun dev
- Scraping Discord Data
  - Enter your Discord server ID in the scraper form
  - Set the desired message limit
  - Click "Scrape Server" to begin data collection
- Querying Data
  - Enter your question in natural language
  - The system automatically determines whether to use RAG or SQL
  - View the complete processing pipeline in the UI
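The message limit set in the scraping step has to be reached in batches, because Discord's channel-messages endpoint returns at most 100 messages per request and pages backwards through history with a `before` message ID. The sketch below shows one way to do that paging. It is an illustration, not the project's scraper: `collect_messages` and the `fetch_page(limit, before)` callable are hypothetical names, and `fetch_page` stands in for whatever HTTP client or Discord library the backend uses.

```python
def collect_messages(fetch_page, message_limit, page_size=100):
    """Collect up to message_limit messages, newest first.

    fetch_page(limit, before) is a hypothetical callable that returns a list
    of message dicts (each with an "id" key), newest first, containing only
    messages older than the message whose ID is `before` (or the latest
    messages when `before` is None).
    """
    collected = []
    before = None
    while len(collected) < message_limit:
        # Never request more than is still needed, nor more than one page.
        wanted = min(page_size, message_limit - len(collected))
        page = fetch_page(wanted, before)
        if not page:
            break  # channel history is exhausted
        collected.extend(page)
        before = page[-1]["id"]  # oldest message seen so far
    return collected[:message_limit]
```

Keeping the fetch function as a parameter makes the paging logic testable without network access or a bot token.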
- RAG-based queries:
  - "What are the main topics discussed in the server?"
  - "Summarize recent conversations about React"
- SQL-based queries:
  - "How many messages were sent today?"
  - "Who are the most active users?"
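The split between these two groups can be illustrated with a small router. This is only a keyword heuristic written for illustration: the README does not say how the backend decides, and it may well ask a language model instead. The cue list and the `choose_strategy` name are assumptions.

```python
import re

# Hypothetical cue phrases: questions about counts, rankings, or dates tend
# to map onto SQL aggregates; everything else falls back to RAG.
SQL_CUES = re.compile(
    r"\b(how many|count|number of|most active|average|today|yesterday)\b",
    re.IGNORECASE,
)


def choose_strategy(question):
    """Return "sql" for aggregate-style questions, otherwise "rag"."""
    return "sql" if SQL_CUES.search(question) else "rag"


if __name__ == "__main__":
    for q in (
        "Summarize recent conversations about React",
        "How many messages were sent today?",
    ):
        print(q, "->", choose_strategy(q))
```

A heuristic like this is cheap and predictable, but it misclassifies questions that mix both styles, which is one reason a model-based classifier may be preferable.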
CREATE TABLE discord_messages (
id TEXT PRIMARY KEY,
channel_id TEXT NOT NULL,
author_id TEXT NOT NULL,
content TEXT NOT NULL,
created_at TIMESTAMP NOT NULL
);
CREATE VIRTUAL TABLE vec_discord_messages USING vec0(
id TEXT PRIMARY KEY,
embedding FLOAT[1536]
);
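To show how the first table supports SQL-style questions, here is a small sketch using Python's built-in `sqlite3` module. The sample rows and the helper names `build_demo_db` and `most_active_authors` are invented for illustration. The `vec_discord_messages` virtual table is left out because creating it requires the sqlite-vec extension to be loaded first.

```python
import sqlite3

SCHEMA = """
CREATE TABLE discord_messages (
    id TEXT PRIMARY KEY,
    channel_id TEXT NOT NULL,
    author_id TEXT NOT NULL,
    content TEXT NOT NULL,
    created_at TIMESTAMP NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database with a few made-up messages."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    rows = [
        ("1", "c1", "alice", "Has anyone tried React 18?", "2024-01-01 10:00:00"),
        ("2", "c1", "bob", "Yes, the upgrade went fine.", "2024-01-01 10:05:00"),
        ("3", "c2", "alice", "Deploy finished.", "2024-01-02 09:00:00"),
    ]
    conn.executemany("INSERT INTO discord_messages VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def most_active_authors(conn, top_n=5):
    """Answer "Who are the most active users?" with a GROUP BY query."""
    cursor = conn.execute(
        "SELECT author_id, COUNT(*) AS message_count "
        "FROM discord_messages "
        "GROUP BY author_id "
        "ORDER BY message_count DESC, author_id "
        "LIMIT ?",
        (top_n,),
    )
    return cursor.fetchall()
```

With the sample rows above, `most_active_authors(build_demo_db())` ranks `alice` first with two messages and `bob` second with one.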
- Vector Search: Uses the sqlite-vec extension for similarity search over message embeddings
- Hybrid Query Processing: Automatically chooses between RAG and SQL approaches
- Real-time Processing: Processes Discord messages and generates embeddings on-the-fly
- Interactive UI: Visualizes the complete query processing pipeline
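The vector search feature boils down to ranking stored embeddings by how close they are to the query embedding. The sketch below does this by brute force in pure Python to make the idea concrete. It is a stand-in, not the project's code: in the real system sqlite-vec would do this work inside the database, the distance metric it is configured with is not stated here, and the three-dimensional vectors are toy values rather than 1536-dimensional OpenAI embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, stored, k=3):
    """Return the IDs of the k stored vectors most similar to query_vec.

    `stored` maps a message ID to its embedding vector.
    """
    scored = [
        (cosine_similarity(query_vec, vec), msg_id)
        for msg_id, vec in stored.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [msg_id for _, msg_id in scored[:k]]


if __name__ == "__main__":
    toy_store = {"m1": [1.0, 0.0, 0.0], "m2": [0.0, 1.0, 0.0], "m3": [0.9, 0.1, 0.0]}
    print(top_k([1.0, 0.0, 0.0], toy_store, k=2))
```

In a RAG flow, the messages behind the returned IDs would then be passed to the completion model as context for the answer.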
The backend uses Modal for serverless deployment and includes:
- FastAPI for API endpoints
- SQLite with vector extension for data storage
- OpenAI integration for embeddings and completions
- uv for fast, reliable Python package management
The frontend is built with:
- React 18 with TypeScript
- Vite for build tooling
- shadcn/ui for components
- Lucide icons
- Bun as the JavaScript runtime and package manager
- Discord Scraping Issues
  - Ensure the bot has the necessary permissions
  - Check that the server ID is correct
  - Verify that the Discord token is valid
- Query Processing Issues
  - Confirm that the OpenAI API key is valid
  - Check that the database contains scraped messages
  - Verify that embeddings are being generated correctly
- Package Management Issues
  - For Python: try uv pip install --force-reinstall -r requirements.txt
  - For JavaScript: try bun install --force
License: MIT
Contributing:
- Fork the repository
- Create your feature branch
- Commit your changes
- Push to the branch
- Create a new Pull Request