Agentic Very Long Video Understanding

Aniket Rege, Arka Sadhu, Yuliang Li, Kejie Li, Ramya Korlakai Vinayak, Yuning Chai, Yong Jae Lee, Hyo Jin Kim

This repository provides code for EGAgent, our agentic framework for very long video understanding powered by entity scene graphs. EGAgent consists of a planning agent that performs multi-hop cross-modal reasoning by querying three tools: a visual search tool, an audio transcript search tool, and an entity graph search tool. The repository is structured as follows:

Create Data Sources for Tool Querying
Agent Inference
Baselines and Evaluation

Installation

Install prerequisite packages with conda.

conda env create -f environment.yml
conda activate egagent

Configure Paths

Update dataset, model, and API key locations in paths.py before running the scripts.

Set up Multimodal Embedding Model

Download the multimodal embedding model used by EGAgent's visual search tool. We use SigLIP 2 by default, but it can be replaced with another state-of-the-art image-text encoder. By default, the embedding model checkpoints are downloaded into this repository; to use another location, change the path in paths.py.

git lfs install
git clone https://huggingface.co/google/siglip2-giant-opt-patch16-384
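To illustrate how an image-text encoder such as SigLIP 2 can back a visual search tool, here is a minimal sketch of top-k retrieval by cosine similarity over precomputed frame embeddings. The function names and the toy 3-dimensional vectors are hypothetical and chosen for illustration only; they are not taken from this repository, and real embeddings would come from the encoder.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as matching nothing
    return dot / (norm_a * norm_b)


def top_k_frames(
    query_emb: Sequence[float],
    frame_embs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (frame_index, score) pairs for the k most similar frames, best first."""
    if k < 0:
        raise ValueError("k must be non-negative")
    scored = [(i, cosine_similarity(query_emb, e)) for i, e in enumerate(frame_embs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data (assumption): stand-ins for a text query embedding and frame embeddings.
frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query = [1.0, 0.1, 0.0]
print(top_k_frames(query, frames, k=2))
```

In practice the frame embeddings would be computed once and cached, so that each query only needs one text embedding and a similarity search.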

Create Data Sources for Tool Querying

We create data sources for the visual search tool and entity graph in prepare_datasources/. The audio transcripts are queried on the fly and do not require an explicit data source.
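As a rough illustration of what an entity graph data source could look like, the sketch below stores timestamped relations between entities and filters them by entity and time window. The schema (subject, relation, object, start and end time in seconds), the entity names, and the function names are all assumptions made for illustration; the actual format produced by prepare_datasources/ may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Relation:
    """One timestamped edge of a hypothetical entity graph."""
    subject: str
    relation: str
    obj: str
    start_s: float
    end_s: float


def query_entity(
    graph: List[Relation],
    entity: str,
    start_s: Optional[float] = None,
    end_s: Optional[float] = None,
) -> List[Relation]:
    """Return relations involving `entity` that overlap [start_s, end_s], sorted by start time."""
    results = []
    for r in graph:
        if entity not in (r.subject, r.obj):
            continue
        if start_s is not None and r.end_s < start_s:
            continue  # relation ends before the window opens
        if end_s is not None and r.start_s > end_s:
            continue  # relation starts after the window closes
        results.append(r)
    return sorted(results, key=lambda r: r.start_s)


# Toy graph (assumption): made-up entities and timestamps.
graph = [
    Relation("Alice", "uses", "laptop", 100.0, 160.0),
    Relation("Alice", "talks_to", "Bob", 10.0, 25.0),
    Relation("Bob", "holds", "cup", 30.0, 40.0),
]
for rel in query_entity(graph, "Alice", start_s=0.0, end_s=60.0):
    print(rel)
```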

EGAgent Inference

We provide code for EGAgent inference on EgoLife and Video-MME in egagent/.
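The idea of a planning agent that queries several tools can be sketched as a simple dispatch loop. Everything below is a hypothetical illustration: the tool functions are stubs that return canned strings, the plan is hard-coded rather than produced by an LLM, and none of the names correspond to the code in egagent/.

```python
from typing import Callable, Dict, List, Tuple


# Stub tools (assumption): stand-ins for the three search tools.
def visual_search(query: str) -> str:
    return f"[visual] frames matching '{query}'"


def transcript_search(query: str) -> str:
    return f"[transcript] utterances matching '{query}'"


def entity_graph_search(query: str) -> str:
    return f"[entity_graph] relations matching '{query}'"


TOOLS: Dict[str, Callable[[str], str]] = {
    "visual": visual_search,
    "transcript": transcript_search,
    "entity_graph": entity_graph_search,
}


def run_plan(plan: List[Tuple[str, str]]) -> List[str]:
    """Execute (tool_name, query) steps in order and collect the returned evidence."""
    evidence = []
    for tool_name, query in plan:
        tool = TOOLS.get(tool_name)
        if tool is None:
            raise KeyError(f"unknown tool: {tool_name}")
        evidence.append(tool(query))
    return evidence


# A hard-coded two-hop plan; a real planner would generate the steps itself.
plan = [
    ("entity_graph", "who used the laptop"),
    ("visual", "person at the laptop"),
]
for item in run_plan(plan):
    print(item)
```

Keeping the tools behind a name-to-function table means a planner only has to emit tool names and queries, and new tools can be added without changing the loop.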

Baseline Inference

In baselines/, we provide code to evaluate baselines on very long video understanding, i.e., multimodal LLMs that uniformly sample frames and transcripts.
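Uniform frame sampling can be sketched as follows. This version splits the video into equal-width segments and takes the midpoint frame of each, which is one common convention; the function name is hypothetical and the scripts in baselines/ may use a different scheme.

```python
from typing import List


def uniform_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick num_samples frame indices spread evenly over [0, num_frames)."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= num_frames:
        # Fewer frames than requested samples: return every frame once.
        return list(range(num_frames))
    # Take the midpoint of each of num_samples equal-width segments.
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


print(uniform_frame_indices(100, 4))
print(uniform_frame_indices(3, 10))
```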

Ablations

In ablations/, we provide code to compute the retrieval recall of EGAgent's tools and to generate the plots from our paper.
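For readers unfamiliar with the metric, a standard recall@k computation looks like the sketch below. It assumes retrieved and relevant items are identified by integer IDs and uses the textbook definition; the exact definition used in ablations/ (for example, how relevance is matched to timestamps) may differ.

```python
from typing import Iterable, Sequence


def recall_at_k(retrieved: Sequence[int], relevant: Iterable[int], k: int) -> float:
    """Fraction of relevant items that appear among the top-k retrieved items."""
    relevant_set = set(relevant)
    if not relevant_set or k <= 0:
        return 0.0
    hits = relevant_set.intersection(retrieved[:k])
    return len(hits) / len(relevant_set)


# Toy example (assumption): IDs are arbitrary.
retrieved = [5, 2, 9, 1]
relevant = {2, 1, 7}
print(recall_at_k(retrieved, relevant, k=3))
print(recall_at_k(retrieved, relevant, k=4))
```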

Citation

If you find this project useful in your research, please consider citing:

@misc{rege2025agentic,
  title={Agentic Very Long Video Understanding},
  author={Rege, Aniket and Sadhu, Arka and Li, Yuliang and Li, Kejie and Vinayak, Ramya Korlakai and Chai, Yuning and Lee, Yong Jae and Kim, Hyo Jin},
  month={January},
  year={2026},
  eprint={2601.18157},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2601.18157},
}

Contribution

See the CONTRIBUTING file for how to help out.

License

This code is licensed under CC BY-NC 4.0, as found in the LICENSE file.
