Commit 59758c5

[AI-271] ElevenLabs Scribe2 (#170)
* scribe2
* scribe2
* cleanup
* add an elevenlabs example
1 parent ddc1433 commit 59758c5

13 files changed: +3399 -2560 lines changed

conftest.py

Lines changed: 7 additions & 0 deletions
@@ -17,6 +17,7 @@
 
 from getstream.video.rtc.track_util import PcmData, AudioFormat
 from vision_agents.core.stt.events import STTTranscriptEvent, STTErrorEvent, STTPartialTranscriptEvent
+from vision_agents.core.edge.types import Participant
 
 load_dotenv()
 
@@ -127,6 +128,12 @@ def assets_dir():
     return get_assets_dir()
 
 
+@pytest.fixture
+def participant():
+    """Create a test participant for STT testing."""
+    return Participant({}, user_id="test-user")
+
+
 @pytest.fixture
 def mia_audio_16khz():
     """Load mia.mp3 and convert to 16kHz PCM data."""

docs/ai/instructions/ai-utils.md

Lines changed: 1 addition & 0 deletions
@@ -27,3 +27,4 @@ PcmData.from_response
 * AudioForwarder to forward audio. See audio_forwarder.py
 * QueuedVideoTrack to have a writable video track
 * QueuedAudioTrack to have a writable audio track
+* AudioQueue enables you to buffer audio, and read a certain number of ms or number of samples of audio
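
The `AudioQueue` note above describes buffering audio and reading it back either by duration or by sample count. The sketch below illustrates only that idea. `SampleBuffer` and its methods are hypothetical names, not the `AudioQueue` API from this commit, and the sketch assumes mono samples held as plain Python integers.

```python
from collections import deque


class SampleBuffer:
    """Hypothetical FIFO buffer of mono PCM samples (not the library's AudioQueue)."""

    def __init__(self, sample_rate: int = 16000) -> None:
        self.sample_rate = sample_rate
        self._samples = deque()

    def put(self, samples: list) -> None:
        """Append new samples to the end of the buffer."""
        self._samples.extend(samples)

    def get_samples(self, count: int) -> list:
        """Remove and return up to `count` samples, oldest first."""
        take = min(count, len(self._samples))
        return [self._samples.popleft() for _ in range(take)]

    def get_duration(self, ms: int) -> list:
        """Remove and return up to `ms` milliseconds of audio."""
        # Convert milliseconds to a sample count at the buffer's sample rate.
        return self.get_samples(self.sample_rate * ms // 1000)

    def __len__(self) -> int:
        return len(self._samples)


if __name__ == "__main__":
    buf = SampleBuffer(sample_rate=16000)
    buf.put(list(range(480)))      # 30 ms of audio at 16 kHz
    chunk = buf.get_duration(20)   # read 20 ms worth of samples
    print(len(chunk), len(buf))
```

Reading by duration is just reading by sample count after a unit conversion, which is why `get_duration` delegates to `get_samples` in this sketch.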
Lines changed: 137 additions & 0 deletions
@@ -0,0 +1,137 @@
+# ElevenLabs TTS and STT Example
+
+This directory contains an example demonstrating how to use the ElevenLabs TTS and Scribe v2 STT plugins with Vision Agents.
+
+## Overview
+
+This example creates an AI agent that uses ElevenLabs' state-of-the-art voice technology for both speech synthesis and recognition.
+
+## Features
+
+- **ElevenLabs TTS**: High-quality, natural-sounding text-to-speech with customizable voices
+- **ElevenLabs Scribe v2**: Real-time speech-to-text with low latency (~150ms) and 99 language support
+- **GetStream**: Real-time communication infrastructure
+- **Smart Turn Detection**: Natural conversation flow management
+- **Gemini LLM**: Intelligent response generation
+
+## Setup
+
+1. Install dependencies:
+   ```bash
+   cd plugins/elevenlabs/example
+   uv sync
+   ```
+
+2. Create a `.env` file with your API keys:
+   ```bash
+   # Required for ElevenLabs TTS and STT
+   ELEVENLABS_API_KEY=your_elevenlabs_api_key
+
+   # Required for GetStream (real-time communication)
+   STREAM_API_KEY=your_stream_api_key
+   STREAM_API_SECRET=your_stream_api_secret
+
+   # Required for Gemini LLM
+   GEMINI_API_KEY=your_gemini_api_key
+   ```
+
+## Running the Example
+
+```bash
+uv run elevenlabs_example.py
+```
+
+The agent will:
+1. Connect to the GetStream edge network
+2. Initialize ElevenLabs TTS and Scribe v2 STT
+3. Join a call and greet you
+4. Listen and respond to your voice input in real-time
+
+## Customization
+
+### Voice Selection
+
+You can customize the ElevenLabs voice:
+
+```python
+# Use a specific voice ID
+tts = elevenlabs.TTS(voice_id="your_voice_id")
+
+# Use a different model
+tts = elevenlabs.TTS(model_id="eleven_flash_v2_5")
+```
+
+### STT Configuration
+
+Customize the speech-to-text settings:
+
+```python
+# Use a different language
+stt = elevenlabs.STT(language_code="es")  # Spanish
+
+# Adjust VAD settings
+stt = elevenlabs.STT(
+    vad_threshold=0.5,
+    vad_silence_threshold_secs=2.0,
+)
+```
+
+### Turn Detection
+
+Adjust turn detection sensitivity:
+
+```python
+turn_detection = smart_turn.TurnDetection(
+    buffer_in_seconds=2.0,  # How long to wait for speech
+    confidence_threshold=0.5,  # How confident to be before ending turn
+)
+```
+
+## ElevenLabs Models
+
+### TTS Models
+- `eleven_multilingual_v2`: High-quality, emotionally rich (default)
+- `eleven_flash_v2_5`: Ultra-fast with low latency (~75ms)
+- `eleven_turbo_v2_5`: Balanced quality and speed
+
+### STT Model
+- `scribe_v2_realtime`: Real-time transcription with 99 language support
+
+## Architecture
+
+```
+User Voice Input
+
+ElevenLabs Scribe v2 STT (Real-time transcription)
+
+Gemini LLM (Generate response)
+
+ElevenLabs TTS (Synthesize speech)
+
+User Hears Response
+```
+
+## Additional Resources
+
+- [ElevenLabs Documentation](https://elevenlabs.io/docs)
+- [ElevenLabs Voice Library](https://elevenlabs.io/voice-library)
+- [Vision Agents Documentation](https://visionagents.ai)
+- [GetStream Documentation](https://getstream.io)
+
+## Troubleshooting
+
+### No audio output
+- Verify your `ELEVENLABS_API_KEY` is valid
+- Check your audio device settings
+- Ensure GetStream connection is established
+
+### Poor transcription quality
+- Use 16kHz sample rate audio for optimal results
+- Speak clearly and avoid background noise
+- Adjust `vad_threshold` if needed
+
+### High latency
+- Consider using `eleven_flash_v2_5` for TTS
+- Check your network connection
+- Reduce `buffer_in_seconds` in turn detection
+
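
The README added above lists four environment variables in its `.env` step. A small pre-flight check along the following lines could catch a missing key before the agent starts. This is a sketch rather than part of the commit: `REQUIRED_KEYS` and `missing_keys` are hypothetical names, and only the variable names themselves come from the README.

```python
import os

# Variable names taken from the README's .env step.
REQUIRED_KEYS = (
    "ELEVENLABS_API_KEY",
    "STREAM_API_KEY",
    "STREAM_API_SECRET",
    "GEMINI_API_KEY",
)


def missing_keys(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Passing a plain dictionary instead of `os.environ` keeps the check easy to exercise in tests.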
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+# Example package
+
Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
+You're a friendly voice AI assistant. Here's your personality and style:
+
+## Communication Style
+
+- Be warm, approachable, and helpful
+- Keep responses concise and conversational
+- Use natural language without being overly formal
+- Show enthusiasm when appropriate
+
+## Response Guidelines
+
+### Helpfulness
+- Always aim to provide clear, actionable information
+- If you don't know something, admit it honestly
+- Offer to help with follow-up questions
+
+### Tone
+- Friendly but professional
+- Patient and understanding
+- Encouraging and positive
+
+### Conversation Flow
+- Listen actively to what the user says
+- Ask clarifying questions when needed
+- Stay on topic unless the user changes direction
+- Remember context from earlier in the conversation
+
+## Example Phrases
+
+**Greetings:**
+- "Hello! How can I help you today?"
+- "Hi there! What can I do for you?"
+- "Good to hear from you! What's on your mind?"
+
+**Acknowledgment:**
+- "I understand, let me help with that."
+- "That's a great question!"
+- "I see what you mean."
+
+**Clarification:**
+- "Just to make sure I understand, you're asking about..."
+- "Could you tell me a bit more about..."
+- "To clarify, you want to..."
+
+**Assistance:**
+- "Here's what I can help you with..."
+- "Let me walk you through this..."
+- "I'd be happy to explain..."
+
+**Uncertainty:**
+- "I'm not entirely sure about that, but..."
+- "That's outside my expertise, but I can try to help..."
+- "Let me think about the best way to answer that..."
+
+**Closing:**
+- "Is there anything else I can help you with?"
+- "Let me know if you need anything else!"
+- "Feel free to ask if you have more questions."
+
+## Usage Notes
+
+- Keep responses under 2-3 sentences when possible
+- Use contractions (I'm, you're, let's) for natural speech
+- Avoid jargon unless the user uses it first
+- Match the user's energy level (casual vs. professional)
+- Be empathetic to the user's needs and emotions
+
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
+"""
+ElevenLabs TTS and STT Example
+
+This example demonstrates ElevenLabs TTS and Scribe v2 STT integration with Vision Agents.
+
+This example creates an agent that uses:
+- ElevenLabs for text-to-speech (TTS)
+- ElevenLabs Scribe v2 for speech-to-text (STT)
+- GetStream for edge/real-time communication
+- Smart Turn for turn detection
+
+Requirements:
+- ELEVENLABS_API_KEY environment variable
+- STREAM_API_KEY and STREAM_API_SECRET environment variables
+"""
+
+import asyncio
+import logging
+
+from dotenv import load_dotenv
+
+from vision_agents.core import User, Agent, cli
+from vision_agents.core.agents import AgentLauncher
+from vision_agents.plugins import elevenlabs, getstream, smart_turn, gemini
+
+
+logger = logging.getLogger(__name__)
+
+load_dotenv()
+
+
+async def create_agent(**kwargs) -> Agent:
+    """Create the agent with ElevenLabs TTS and STT."""
+    agent = Agent(
+        edge=getstream.Edge(),
+        agent_user=User(name="Friendly AI", id="agent"),
+        instructions="You're a friendly voice AI assistant. Keep your replies conversational and concise. Read @assistant.md for personality guidelines",
+        tts=elevenlabs.TTS(),  # Uses ElevenLabs for text-to-speech
+        stt=elevenlabs.STT(),  # Uses ElevenLabs Scribe v2 for speech-to-text
+        llm=gemini.LLM("gemini-2.5-flash-lite"),
+        turn_detection=smart_turn.TurnDetection(),
+    )
+    return agent
+
+
+async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
+    """Join the call and start the agent."""
+    # Ensure the agent user is created
+    await agent.create_user()
+    # Create a call
+    call = await agent.create_call(call_type, call_id)
+
+    logger.info("🤖 Starting ElevenLabs Agent...")
+
+    # Have the agent join the call/room
+    with await agent.join(call):
+        logger.info("Joining call")
+        logger.info("LLM ready")
+
+        await asyncio.sleep(5)
+        await agent.llm.simple_response(text="Hello! How can I help you today?")
+
+        await agent.finish()  # Run till the call ends
+
+
+if __name__ == "__main__":
+    cli(AgentLauncher(create_agent=create_agent, join_call=join_call))
+
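
The example file above wires speech-to-text, an LLM, and text-to-speech into one agent. As a rough illustration of that data flow only, the sketch below chains three stand-in coroutines. `fake_stt`, `fake_llm`, `fake_tts`, and `handle_turn` are hypothetical and make no calls to ElevenLabs, Gemini, or Vision Agents; the real agent handles this internally.

```python
import asyncio


async def fake_stt(audio: bytes) -> str:
    """Stand-in for speech-to-text: treat the audio bytes as UTF-8 text."""
    return audio.decode("utf-8")


async def fake_llm(transcript: str) -> str:
    """Stand-in for the LLM: echo the transcript back in a reply."""
    return f"You said: {transcript}"


async def fake_tts(text: str) -> bytes:
    """Stand-in for text-to-speech: encode the reply text as bytes."""
    return text.encode("utf-8")


async def handle_turn(audio: bytes) -> bytes:
    """Run one user turn through the STT -> LLM -> TTS stages in order."""
    transcript = await fake_stt(audio)
    reply = await fake_llm(transcript)
    return await fake_tts(reply)


if __name__ == "__main__":
    print(asyncio.run(handle_turn(b"hello")))
```

Each stage awaits the previous one, so a turn's end-to-end latency in this sketch is the sum of the three stages.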
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+[project]
+name = "elevenlabs-example"
+version = "0.0.0"
+requires-python = ">=3.10"
+
+dependencies = [
+    "python-dotenv>=1.0",
+    "vision-agents-plugins-elevenlabs",
+    "vision-agents-plugins-getstream",
+    "vision-agents-plugins-smart-turn",
+    "vision-agents-plugins-gemini",
+    "vision-agents",
+]
+
+[tool.uv.sources]
+"vision-agents-plugins-elevenlabs" = {path = "..", editable=true}
+"vision-agents-plugins-getstream" = {path = "../../getstream", editable=true}
+"vision-agents-plugins-smart-turn" = {path = "../../smart_turn", editable=true}
+"vision-agents-plugins-gemini" = {path = "../../gemini", editable=true}
+"vision-agents" = {path = "../../../agents-core", editable=true}
+

plugins/elevenlabs/pyproject.toml

Lines changed: 3 additions & 3 deletions
@@ -5,14 +5,14 @@ build-backend = "hatchling.build"
 [project]
 name = "vision-agents-plugins-elevenlabs"
 dynamic = ["version"]
-description = "ElevenLabs TTS integration for Vision Agents"
+description = "ElevenLabs TTS and STT integration for Vision Agents"
 readme = "README.md"
-keywords = ["elevenlabs", "TTS", "text-to-speech", "AI", "voice agents", "agents"]
+keywords = ["elevenlabs", "TTS", "text-to-speech", "STT", "speech-to-text", "AI", "voice agents", "agents"]
 requires-python = ">=3.10"
 license = "MIT"
 dependencies = [
     "vision-agents",
-    "elevenlabs>=2.5.0",
+    "elevenlabs>=2.22.1",
 ]
 
 [project.urls]
0 commit comments
