A voice-powered guesstimation analysis system built with OpenAI's Voice Agents SDK, FastAPI, Next.js, and MongoDB. This project demonstrates how to create a speech-to-speech AI assistant for breaking down complex estimation problems and generating structured markdown reports with visual flow diagrams.
- Voice-First Interface: Natural voice commands for all operations
- Guesstimation Analysis: Break down complex estimation problems step by step
- Conversation Tracking: Automatically track and analyze voice conversations
- Markdown Export: Generate structured reports with flow diagrams
- Session Management: Save and retrieve previous analysis sessions
- Real-time Updates: WebSocket-based communication for instant feedback
- Visual Flow Diagrams: Mermaid-based diagrams showing estimation logic
- Multi-tenant Database: Supports multiple users with proper data isolation
```mermaid
graph TB
    subgraph "Frontend (Next.js)"
        UI[User Interface]
        Audio[Audio Recording/Playback]
        WS[WebSocket Client]
    end

    subgraph "Backend (FastAPI)"
        API[FastAPI Server]
        VoicePipeline[Voice Pipeline]
        Workflow[Voice Workflow]
    end

    subgraph "OpenAI Voice Agents SDK"
        Agent[AI Agent]
        Tools[Function Tools]
        Runner[Agent Runner]
    end

    subgraph "External Services"
        OpenAI[OpenAI API]
        Whisper[Whisper STT]
        TTS[Text-to-Speech]
    end

    subgraph "Database"
        MongoDB[(MongoDB)]
        Sessions[Guesstimation Sessions]
        Context[Conversation Context]
    end

    UI --> Audio
    Audio --> WS
    WS --> API
    API --> VoicePipeline
    VoicePipeline --> Workflow
    Workflow --> Runner
    Runner --> Agent
    Agent --> Tools
    Tools --> MongoDB
    Agent --> OpenAI
    OpenAI --> Whisper
    OpenAI --> TTS
    MongoDB --> Sessions
    MongoDB --> Context

    style UI fill:#e1f5fe
    style API fill:#f3e5f5
    style VoicePipeline fill:#fff3e0
    style Agent fill:#ffe0b2
    style Tools fill:#ffcc02
    style Runner fill:#ffb74d
    style OpenAI fill:#e8f5e8
    style Whisper fill:#c8e6c9
    style TTS fill:#a5d6a7
    style MongoDB fill:#fce4ec
```
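The middle layers of the diagram (agent, function tools, runner, and voice pipeline) come from the OpenAI Agents SDK's voice support. The sketch below shows how those pieces are typically wired together; the agent name, instructions, and the `save_session` stub are illustrative assumptions rather than this repo's exact code.

```python
# Sketch of the agent/voice-pipeline wiring with the OpenAI Agents SDK
# (openai-agents[voice]). Names, instructions, and the tool body are
# illustrative assumptions.
from agents import Agent, function_tool
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline


@function_tool
def save_session(scenario_title: str, markdown_content: str) -> str:
    """Persist a finished guesstimation session (stubbed here)."""
    return f"Saved session '{scenario_title}'"


agent = Agent(
    name="Guesstimation Assistant",
    instructions="Help the user break estimation problems into explicit, numbered steps.",
    model="gpt-4o-mini",
    tools=[save_session],
)

pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))


async def respond(audio_buffer):
    """Feed raw microphone audio in, stream synthesized speech back out."""
    result = await pipeline.run(AudioInput(buffer=audio_buffer))
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            yield event.data  # PCM chunk ready to forward over the WebSocket
```

The pipeline handles speech-to-text and text-to-speech around the agent run, so the backend only has to move audio bytes between the browser and `respond()`.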
- "I want to estimate how many people go to movies in India"
- "Let's analyze the market size for electric vehicles"
- "Help me estimate the number of coffee shops in New York"
- "Create markdown from this conversation"
- "Export my analysis"
- "Save this guesstimation session"
- "Generate a report"
- "Show my previous analyses"
- "Clear this session"
- "What sessions do I have?"
- Backend: FastAPI, OpenAI Voice Agents SDK, Motor (MongoDB async driver)
- Frontend: Next.js 15, TypeScript, Tailwind CSS, wavtools
- Database: MongoDB (with indexes for performance)
- AI: GPT-4o-mini, OpenAI Whisper (STT), OpenAI TTS
- Communication: WebSockets for real-time voice interaction
- Package Management: uv (Python), npm (Node.js)
- Python 3.11+
- Node.js 18+
- MongoDB (local or Atlas)
- OpenAI API key with voice capabilities
- Microphone access
```bash
git clone <your-repo>
cd experiment-openai-speech-to-speech
```
Create a `.env` file in the root directory:
```env
# OpenAI API Configuration
OPENAI_API_KEY=your_openai_api_key_here

# Database Configuration
MONGODB_URI=mongodb://localhost:27017
DATABASE_NAME=guesstimation_analysis
```
```bash
# Backend dependencies
cd server && uv sync

# Frontend dependencies
cd ../frontend && npm install
```
```bash
# From the root directory
make dev
```
Or run separately:
```bash
# Backend (Terminal 1)
cd server && uv run server.py

# Frontend (Terminal 2)
cd frontend && npm run dev
```
- Frontend: http://localhost:3000
- Backend API: http://localhost:8000
- WebSocket: ws://localhost:8000/ws
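The `ws://localhost:8000/ws` endpoint is an ordinary FastAPI WebSocket route that shuttles audio between the browser and the voice pipeline. Here is a minimal sketch, reusing the hypothetical `respond()` helper from the architecture sketch above; the 16-bit PCM framing is an assumption and the actual server may differ.

```python
# Minimal FastAPI WebSocket endpoint: receive raw audio from the browser,
# run it through the voice pipeline, and stream synthesized audio back.
import numpy as np
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()


@app.websocket("/ws")
async def voice_ws(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            pcm_bytes = await websocket.receive_bytes()
            audio = np.frombuffer(pcm_bytes, dtype=np.int16)
            # respond() is the pipeline helper sketched in the Architecture section
            async for chunk in respond(audio):
                await websocket.send_bytes(chunk.tobytes())
    except WebSocketDisconnect:
        pass
```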
**Guesstimation Sessions**

```javascript
{
  "_id": ObjectId,
  "user_id": ObjectId,  // Multi-tenant support
  "scenario_title": "Movie attendance in India",
  "conversation_summary": "Full conversation text...",
  "markdown_content": "Generated markdown with flow diagram...",
  "created_at": ISODate,
  "updated_at": ISODate
}
```
**Conversation Context**

```javascript
{
  "_id": ObjectId,
  "user_id": ObjectId,
  "session_id": ObjectId,
  "context": {
    "current_conversation": [...],
    "scenario_title": "Movie attendance in India",
    "analysis_steps": [...],
    "final_estimate": {...}
  },
  "created_at": ISODate,
  "updated_at": ISODate
}
```
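With Motor, the two collections above can be created, indexed, and written roughly as follows. The collection names, index choices, and helper signature are assumptions for illustration, not necessarily the repo's actual data layer.

```python
# Sketch of the async MongoDB layer with Motor. Connection settings come from
# the .env values shown above; collection and index names are assumed.
import os
from datetime import datetime, timezone

from motor.motor_asyncio import AsyncIOMotorClient

client = AsyncIOMotorClient(os.environ.get("MONGODB_URI", "mongodb://localhost:27017"))
db = client[os.environ.get("DATABASE_NAME", "guesstimation_analysis")]


async def init_indexes() -> None:
    # Multi-tenant lookups: filter by user, newest sessions first.
    await db.guesstimation_sessions.create_index([("user_id", 1), ("created_at", -1)])
    await db.conversation_context.create_index([("user_id", 1), ("session_id", 1)])


async def save_session(user_id, scenario_title: str, markdown_content: str):
    now = datetime.now(timezone.utc)
    result = await db.guesstimation_sessions.insert_one({
        "user_id": user_id,
        "scenario_title": scenario_title,
        "markdown_content": markdown_content,
        "created_at": now,
        "updated_at": now,
    })
    return result.inserted_id
```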
```bash
# Setup
make setup     # Initial setup with .env creation
make sync      # Install/update dependencies

# Development
make dev       # Start both frontend and backend
make frontend  # Start only frontend
make server    # Start only backend

# Production
make build     # Build frontend for production
make start     # Start production server

# Utilities
make clean     # Clean build artifacts
make help      # Show all commands
```
```bash
cd server && uv run test_db.py
```
- Start the application
- Grant microphone permissions
- Try voice commands from the examples above
- Check database for created sessions
- User: "I want to estimate how many people go to movies in India every month"
- Assistant: "Great! Let's break this down step by step. What's your starting point?"
- User: "Well, India has about 1.4 billion people, and maybe 35% live in cities..."
- Assistant: "Good start! Let's think about the age demographics..."
- User: "Create markdown from this conversation"
- System: Generates structured markdown with flow diagram and saves to database
# Guesstimation Analysis: Movie Attendance in India
## Scenario
Estimate how many people go to movies every month and every year in India.
## Conversation Summary
**User**: I want to estimate how many people go to movies in India
**Assistant**: Let's break this down step by step. What's your starting point?
**User**: Well, India has about 1.4 billion people, and maybe 35% live in cities...
## Step-by-Step Analysis
### 1. Total Population
- **Value**: 1.4 billion people
- **Source**: User estimate
- **Confidence**: High
### 2. Urban Population
- **Value**: 35% of total population
- **Calculation**: 1.4B × 0.35 = 490 million
- **Assumption**: Movie theaters primarily in urban areas
- **Confidence**: Medium
## Flow Diagram
```mermaid
graph TD
A[Total Population: 1.4B] --> B[Urban Population: 35%]
B --> C[Urban Population: 490M]
C --> D[Movie-going Age: 15-65]
D --> E[Target Age Group: 60%]
E --> F[Potential Audience: 294M]
F --> G[Monthly Attendance Rate: 50%]
G --> H[Monthly Attendance: 147M]
H --> I[Yearly Attendance: 1.76B]
style A fill:#e1f5fe
style H fill:#c8e6c9
style I fill:#ffcc02
```

## Final Estimates

- **Monthly Movie Attendance**: 147 million people
- **Yearly Movie Attendance**: 1.76 billion people
- **Overall Confidence Level**: Medium (60%)
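The headline numbers in this example are just the product of the assumptions shown in the flow diagram, which is easy to sanity-check:

```python
# Sanity check of the example estimate chain.
population = 1.4e9                 # total population of India
urban = population * 0.35          # 490M urban residents
movie_age = urban * 0.60           # 294M in the movie-going age band
monthly = movie_age * 0.50         # 147M monthly attendances
yearly = monthly * 12              # ~1.76B yearly attendances
print(f"{monthly / 1e6:.0f}M per month, {yearly / 1e9:.2f}B per year")
# -> 147M per month, 1.76B per year
```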
## Troubleshooting
### Common Issues
1. **Microphone Access Denied**
   - Check browser permissions
   - Use HTTPS in production
   - Clear browser cache

2. **Database Connection Failed**
   - Verify MongoDB is running
   - Check the connection string
   - Ensure the database exists

3. **OpenAI API Errors**
   - Verify the API key is valid
   - Check API quota
   - Ensure voice capabilities are enabled

4. **WebSocket Connection Issues**
   - Check that the server is running
   - Verify the WebSocket URL
   - Check firewall settings
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- OpenAI for the Voice Agents SDK
- FastAPI for the excellent web framework
- Next.js for the frontend framework
- MongoDB for the database solution
## Support
For questions or issues:
1. Check the troubleshooting section
2. Review the documentation
3. Open an issue on GitHub
---
**Happy Voice-Enabled Guesstimation Analysis!**