Open Vision Agents by Stream



Build Real-Time Vision AI Agents

Watch the demo

Multi-modal AI agents that watch, listen, and understand video.

Vision Agents give you the building blocks to create intelligent, low-latency video experiences powered by your models, your infrastructure, and your use cases.

Key Highlights

  • Video AI: Built for real-time video AI. Combine YOLO, Roboflow, and others with Gemini/OpenAI in real time.
  • Low Latency: Join calls quickly (~500 ms) and keep audio/video latency under 30 ms using Stream's edge network.
  • Open: Built by Stream, but works with any video edge network.
  • Native APIs: Native SDK methods from OpenAI (create response), Gemini (generate), and Claude (create message) give you access to the latest LLM capabilities (see the sketch after this list).
  • SDKs: SDKs for React, Android, iOS, Flutter, React Native, and Unity, powered by Stream's ultra-low-latency network.
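
For example, the OpenAI plugin can be driven with the provider's own call shape. The sketch below is a hypothetical illustration based on the "create response" wording above; the exact method name, signature, and plugin class should be verified in the plugin docs.

# Hypothetical sketch: calling a provider-native method through the plugin.
# The method name follows the "create response" wording above and is an assumption.
from vision_agents.plugins import openai

llm = openai.Realtime(fps=10)

async def ask(question: str):
    # Forwarded native OpenAI-style call (assumed name; verify in the plugin docs)
    return await llm.create_response(input=question)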

See It In Action

Sports Coaching

This example shows how to build a golf coaching AI with YOLO and OpenAI Realtime. Combining a fast object-detection model (like YOLO) with a full realtime AI is useful for many video AI use cases, for example: drone fire detection, sports/video-game coaching, physical therapy, workout coaching, and Just Dance-style games.

# Partial example; full example: examples/02_golf_coach_example/golf_coach_example.py
# Imports assume the plugin layout used across this README (check the example file).
from vision_agents.core import Agent
from vision_agents.plugins import getstream, openai, ultralytics  # add gemini if you switch LLMs

agent = Agent(
    edge=getstream.Edge(),
    agent_user=agent_user,
    instructions="Read @golf_coach.md",
    llm=openai.Realtime(fps=10),
    # llm=gemini.Realtime(fps=1),  # careful: higher FPS can get expensive
    processors=[ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt")],
)
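
To actually run this agent, you also create the agent user and join a Stream call. The sketch below mirrors the flow in the full example referenced above; treat the helper names (create_user, join, finish) as assumptions and check them against examples/02_golf_coach_example/golf_coach_example.py.

# Sketch of running the agent above, based on the flow in the repo's examples.
# Helper names may differ by version; see the full example file.
import asyncio
from uuid import uuid4

async def main():
    await agent.create_user()  # assumed helper: registers the agent's user on Stream
    call = agent.edge.client.video.call("default", str(uuid4()))  # assumed: create a video call
    with await agent.join(call):  # agent joins the call and starts watching/listening
        await agent.finish()  # run until the call ends

asyncio.run(main())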

Golf Example

Cluely-style Invisible Assistant (coming soon)

Apps like Cluely offer realtime coaching via an invisible overlay. This example shows how to build your own invisible assistant. It combines Gemini Realtime (to watch your screen and audio) and replies with text only, without broadcasting audio. This approach is quite versatile and can be used for sales coaching, job interview cheating, or physical-world/on-the-job coaching with glasses.

Demo video

# Partial example; imports assume the plugin layout used in the golf example above.
from vision_agents.core import Agent
from vision_agents.plugins import getstream, gemini

agent = Agent(
    edge=getstream.Edge(),  # low-latency edge; clients for React, iOS, Android, React Native, Flutter, etc.
    agent_user=agent_user,  # the user object for the agent (name, image, etc.)
    instructions="You are silently helping the user pass this interview. See @interview_coach.md",
    # Gemini Realtime handles speech natively, so no separate TTS or STT is needed (though both are supported)
    llm=gemini.Realtime(),
)

Quick Start

Step 1: Install via uv

uv add vision-agents

Step 2: (Optional) Install with extra integrations

uv add "vision-agents[getstream, openai, elevenlabs, deepgram]"

Step 3: Obtain your Stream API credentials

Get a free API key from Stream. Developers receive 333,000 participant minutes per month, plus extra credits via the Maker Program.
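
The edge client typically reads these credentials from environment variables. The variable names below are conventional for Stream's Python SDK and are an assumption here; confirm them in the dashboard and docs.

# Sketch: set credentials before constructing getstream.Edge().
# Variable names are assumptions (conventional for Stream's Python SDK); verify in the docs.
import os

os.environ["STREAM_API_KEY"] = "<your-api-key>"
os.environ["STREAM_API_SECRET"] = "<your-api-secret>"
os.environ["OPENAI_API_KEY"] = "<your-openai-key>"  # only needed if you use the OpenAI plugin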

Features

True real-time via WebRTC: Stream directly to model providers that support it for instant visual understanding.
Interval/processor pipeline: For providers without WebRTC, process frames with pluggable video processors (e.g., YOLO, Roboflow, or custom PyTorch/ONNX) before/after model calls.
Turn detection & diarization: Keep conversations natural; know when the agent should speak or stay quiet and who's talking.
Voice activity detection (VAD): Trigger actions intelligently and use resources efficiently.
Speech↔Text↔Speech: Enable low-latency loops for smooth, conversational voice UX.
Tool/function calling: Execute arbitrary code and APIs mid-conversation. Create Linear issues, query weather, trigger telephony, or hit internal services.
Built-in memory via Stream Chat: Agents recall context naturally across turns and sessions.
Text back-channel: Message the agent silently during a call.

Out-of-the-Box Integrations

Cartesia: TTS plugin for realistic voice synthesis in real-time voice applications
Deepgram: STT plugin for fast, accurate real-time transcription with speaker diarization
ElevenLabs: TTS plugin with highly realistic and expressive voices for conversational agents
Kokoro: Local TTS engine for offline voice synthesis with low latency
Moonshine: STT plugin optimized for fast, locally runnable transcription on constrained devices
OpenAI: LLM plugin for real-time reasoning, conversation, and multimodal capabilities using OpenAI's Realtime API
Gemini: Multimodal plugin for real-time audio, video, and text understanding powered by Google's Gemini Live models
Silero: VAD plugin for voice activity detection and turn-taking in low-latency real-time conversations
Wizper: Real-time variant of OpenAI's Whisper v3 for speech-to-text and on-the-fly translation, hosted by Fal.ai

Processors

Processors let your agent manage state and handle audio/video in real time.

They take care of the hard stuff, like:

  • Running smaller models
  • Making API calls
  • Transforming media

… so you can focus on your agent logic.
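
As a rough illustration of the idea, a processor is a component that receives frames, runs a small model or API call, and keeps the result as state the agent can use. The class and method names below are hypothetical placeholders, not the real processor interface; see the processor docs for that.

# Hypothetical sketch of a custom video processor (names are placeholders for illustration).
from dataclasses import dataclass, field
from typing import Any

def detect_objects(frame: Any) -> dict:
    # Stand-in for a small model (e.g. YOLO) or an external API call
    return {"objects": [], "frame_size": getattr(frame, "shape", None)}

@dataclass
class ObjectStateProcessor:
    latest: dict = field(default_factory=dict)  # state the agent/LLM can read each turn

    def process_frame(self, frame: Any) -> None:
        # Transform/analyze the frame and update shared state
        self.latest = detect_objects(frame)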

Documentation

Check out our getting started guide at VisionAgents.ai.

Quickstart: Building a Voice AI app
Quickstart: Building a Video AI app
Tutorial: Building real-time sports coaching
Tutorial: Building a real-time meeting assistant

Development

See DEVELOPMENT.md

Open Platform

Want to add your platform or provider? Reach out to [email protected].

Awesome Video AI

Our favorite people & projects to follow for vision AI

@demishassabis
CEO @ Google DeepMind
Won a Nobel Prize
@OfficialLoganK
Product Lead @ Gemini
Posts about robotics vision
@ultralytics
Various fast vision AI models
Pose, detect, segment, classify
@skalskip92
Open Source Lead @ Roboflow
Building tools for vision AI
@moondreamai
The tiny vision model that could
Lightweight, fast, efficient
@kwindla
Pipecat / Daily
Sharing AI and vision insights
@juberti
Head of Realtime AI @ OpenAI
Realtime AI systems
@romainhuet
Head of DX @ OpenAI
Developer tooling & APIs
@thorwebdev
Eleven Labs
Voice and AI experiments
@mervenoyann
Hugging Face
Posts extensively about Video AI
@stash_pomichter
Spatial memory for robots
Robotics & AI navigation

Inspiration

  • Livekit Agents: Great syntax, but Livekit only
  • Pipecat: Flexible, but more verbose
  • OpenAI Agents: Focused on OpenAI only

Roadmap

0.1 – First Release

  • Support for 10+ out-of-the-box integrations
  • Video processors
  • Native Stream Chat integration for memory
  • MCP & function calling for Gemini and OpenAI
  • Realtime WebRTC video and voice with GPT Realtime

Coming Soon

[ ] Improved Python WebRTC library
[ ] Hosting & production deploy example
[ ] More built-in YOLO processors (object & person detection)
[ ] Roboflow support
[ ] Computer use support
[ ] AI avatar integrations (e.g., Tavus)
[ ] Qwen3 vision support
[ ] Buffered video capture (for "catch the moment" scenarios)
[ ] Moondream vision

Star History

Star History Chart
