Junfu Pu*,
Teng Wang*,
Yixiao Ge†,
Yuying Ge,
Chen Li,
Ying Shan
ARC Lab, Tencent PCG
*Core contributors †Project lead
An illustration of the capabilities of our video chaptering model. Given a video, our model is able to generate timestamped chapters with three-level structured output: 1) Short Title - a concise label summarizing each chapter; 2) Structural Chapter - a detailed, structured annotation for each chapter, including a rewritten comprehensive Title, an Abstract summarizing the core content, and an Introduction describing key details and highlights; and 3) Timestamp-Aligned Video Description - fine-grained descriptions aligned with precise temporal boundaries. This hierarchical structure facilitates an efficient and precise understanding of video content.
With the explosion of video content, users on video platforms now expect video chapters to navigate long videos and find key moments instantly. However, most chapters are created manually or with simple rules—a process that is slow, expensive, and often fails to capture the true semantic structure of the content. For information-dense videos like lectures, tutorials, product reviews, and meetings, there is a critical need to automatically generate a clear table of contents with precise timestamps. For viewers, this means a "navigable" experience where they can jump to the exact information they need. For creators and platforms, it means powerful tools for content clipping, fast search, and smarter recommendations, ultimately boosting user engagement.
To meet this demand, we introduce ARC-Chapter: a large-scale model from Tencent ARC Lab for deep video understanding and structured chapter generation. ARC-Chapter automatically analyzes videos that have a clear narrative or semantic structure, segmenting them into meaningful chapters, identifying precise timestamps, and generating summaries for each part. Our goal is to let users browse videos like they read a document—getting straight to the point—while empowering platforms and creators with more efficient ways to manage and distribute their content.
Ideal Use Cases for ARC-Chapter:
- Education & Knowledge: Lectures, courses, and scientific explainers.
- Tutorials & How-To's: Software demos, cooking guides, fitness routines, and gaming walkthroughs.
- Presentations & Talks: Conference keynotes, product announcements, and interviews.
- Reviews & Guides: Unboxing videos, product comparisons, and travel vlogs.
- Analysis & Commentary: Movie reviews, plot summaries, and deep dives.
- ...and more.
- [2025.11.19] Technical report and API of ARC-Chapter released!
The model seamlessly processes both Chinese and English videos.
Choose the level of detail you need with three distinct output formats; a hypothetical sample of the structured output is sketched after this list:
- Simple Chapters: Generates precise start times and a concise title for each chapter. Perfect for quick navigation.
- Detailed Structured Chapters: Provides a rich, structured output including:
- Chapter Timestamps
- Short and Long Titles
- Segment Summaries
- An Overall Video-level Summary
- Segmented Video Descriptions: Creates timestamped descriptions covering the entire video from start to finish.
- Multimodal Input: ARC-Chapter leverages both video frames and ASR transcripts to deeply understand content.
- Video Length: It effectively processes everything from short 3-5 minute clips to feature-length, hour-long videos.
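
To make these formats concrete, below is a hypothetical sample of the detailed structured output, written as a Python dict. The field names and values are illustrative placeholders only, not the API's actual response schema.

```python
# Hypothetical structured-chapter output (illustrative field names only).
example_output = {
    "video_summary": "One-paragraph overview of the whole video.",
    "chapters": [
        {
            "start": "00:00:00",
            "short_title": "Opening & Agenda",
            "title": "Welcome, Speaker Introduction, and Session Agenda",
            "abstract": "The host welcomes viewers and previews the topics to come.",
            "introduction": "Key details and highlights of this segment...",
        },
        # ...one entry per chapter...
    ],
}
```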
ARC-Chapter's exceptional performance comes from being trained on a large-scale, high-quality dataset built with our proprietary semi-automatic annotation pipeline. This dataset covers a vast array of topics: Knowledge, Gaming, Technology, Music, Lifestyle, Film & TV, Sports, Automotive, Food, and more. This diverse training ensures the model is highly generalizable and performs reliably across different genres and video formats.
ARC-Chapter is built on top of the powerful Qwen2.5VL model, with several key architectural designs that make it exceptionally effective for video chaptering.
- Flexible Multimodal Inputs:
To deeply understand video content, ARC-Chapter is designed to be multimodal and robust:
- Efficient Long-Video Processing: We transcribe audio into text (ASR), significantly reducing the input context length and enabling the model to process hour-long videos efficiently.
- Modality Dropout: During training, we randomly drop either the visual or text modality. This technique makes the final model highly versatile, allowing it to work with video-only, text-only (ASR), or combined video+text inputs.
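
As a rough sketch of this technique (the function name and drop probabilities below are our own placeholders, not values from the report):

```python
import random

def apply_modality_dropout(frames, asr_text, p_visual=0.15, p_text=0.15):
    # Drop at most one modality per training sample so the model learns to
    # handle video-only, text-only (ASR), and combined inputs.
    r = random.random()
    if r < p_visual:
        return None, asr_text   # text-only (ASR) sample
    if r < p_visual + p_text:
        return frames, None     # video-only sample
    return frames, asr_text     # combined video + text sample
```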
- Large-Scale Data Annotation Pipeline:
The performance of ARC-Chapter is powered by a massive, high-quality dataset:
- We developed a semi-automatic annotation pipeline to process hundreds of thousands of videos. This pipeline allowed us to construct a large-scale, high-precision dataset that is foundational to the model's accuracy and generalization capabilities.
- Explicit Timestamp Injection:
To ensure precise temporal localization, we enhance the model's awareness of time:
- During training, we randomly overlay the timestamp (in HH:MM:SS format) directly onto each video frame. This explicitly teaches the model to associate visual cues with their exact moment in the video, leading to more accurate chapter boundaries.
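
A minimal sketch of such an overlay, assuming OpenCV-style numpy frames; the font, position, and color here are arbitrary choices rather than the model's actual settings:

```python
import cv2

def overlay_timestamp(frame, t_seconds):
    # Burn the frame's position in the video into the image as HH:MM:SS text.
    t = int(t_seconds)
    label = f"{t // 3600:02d}:{t % 3600 // 60:02d}:{t % 60:02d}"
    cv2.putText(frame, label, (10, 30), cv2.FONT_HERSHEY_SIMPLEX,
                1.0, (255, 255, 255), 2, cv2.LINE_AA)
    return frame
```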
- Dynamic Prompt Engineering:
We guide the model to generate the desired output with tailored instructions:
- We use customized prompts that adapt to the specific task. These prompts vary based on the input modality, the requested output format (e.g., short titles vs. structured summaries), and the video's language (Chinese/English).
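
As an illustration, a prompt builder might look like the sketch below. The templates and keys are hypothetical ("structinfo" is borrowed from the API example later on this page; everything else is a placeholder), since ARC-Chapter's actual prompts are not published.

```python
def build_prompt(modality: str, output_type: str, language: str) -> str:
    # Hypothetical templates; the real ARC-Chapter prompts are not public.
    source = {
        "video+text": "the video frames and the ASR transcript",
        "video": "the video frames",
        "text": "the ASR transcript",
    }[modality]
    task = {
        "chapters": "generate chapter start times with a concise title for each chapter",
        "structinfo": "generate structured chapters with titles, abstracts, and an overall summary",
    }[output_type]
    lang = "Chinese" if language == "zh" else "English"
    return f"Based on {source}, {task}. Respond in {lang}."
```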
The model's structured video chaptering and summarization capabilities are shown below:
For more video chaptering results, please visit our blog: 👉 Blog
Please visit the online ARC-Chapter Demo on the ARC Lab website if you are interested in our work.
How to find the demo on the ARC Lab homepage:
ARC Lab -> AI Demo -> Register with Phone No. -> Multimodal Comprehension and Generation -> ARC-Chapter-7B
We provide model access via an API service. A brief tutorial on how to use the API follows; for more details, please refer to the documentation.
Before using the ARC-Chapter API, you must obtain an access token (ARC-Token). Users who are not logged in must complete account verification first.
Steps to get your token:
- Log in: Visit ARC Website and log in with your mobile number.
- Retrieve Token: Once logged in, click the user icon in the top-right corner and select "View Token" from the dropdown menu to get your ARC-TOKEN.
Python package install: `pip install requests`

```python
import requests
import json
from typing import Tuple, Dict, Optional, Any

ARC_TOKENS = {
    "tokens": "YOUR_ARC_TOKEN",
}

headers = {
    "Content-Type": "application/json", "Accept": "text/plain", "charset": "UTF-8",
    "Authorization": ARC_TOKENS["tokens"]
}


class ArcChapter7bClient:
    # Minimal client for the ARC-Chapter 7B endpoint; every call returns a
    # (success, payload) tuple.
    def __init__(self, base_api_url: str):
        self.base_url = base_api_url.rstrip("/")
        self.api_endpoint = f"{self.base_url}/cvc_function/arc_chapter_7b/"
        self.session = requests.Session()

    def upload_and_process_video(self, video_url: str, language: str = "",
                                 no_vid: Optional[bool] = None, no_asr: Optional[bool] = None,
                                 output_type: str = "", timeout: int = 600) -> Tuple[bool, Dict[str, Any]]:
        # no_vid / no_asr presumably disable the visual / ASR input, respectively.
        data = {
            "video_url": video_url, "language": language, "no_vid": no_vid,
            "no_asr": no_asr, "output_type": output_type
        }
        try:
            response = self.session.post(
                url=self.api_endpoint,
                data=json.dumps(data),
                timeout=timeout,
                headers=headers
            )
            response.raise_for_status()
            result = response.json()
            if result.get("code") == 0:
                return True, result
            return False, {
                "message": f"API returned error: {result.get('message', 'Unknown error')}",
                "code": result.get("code"),
                "raw_response": result
            }
        except requests.exceptions.Timeout:
            return False, {"error": f"Request timed out ({timeout}s)"}
        except requests.exceptions.ConnectionError:
            return False, {"error": "Connection failed, please check API URL or network"}
        except requests.exceptions.HTTPError as e:
            return False, {"error": f"HTTP error: {str(e)}", "status_code": response.status_code}
        except json.JSONDecodeError:
            return False, {"error": "API returned non-JSON data", "raw_data": response.text}
        except Exception as e:  # pylint: disable=broad-except
            return False, {"error": f"Call failed: {str(e)}"}


if __name__ == "__main__":
    API_BASE_URL = "https://arc.tencent.com/"
    client = ArcChapter7bClient(base_api_url=API_BASE_URL)

    # Example: detailed structured chapters ("structinfo") for a Chinese-language
    # video, with both the visual and ASR inputs enabled.
    video_url = "https://50058.gzc.svp.tencent-cloud.com/0b53jqazwaabhuaigpyjaruywtgdtngadgya.f0.mp4?dis_k=408b92763fc60607fc538781de0444dc&dis_t=1762138225"
    output_type = "structinfo"
    language = "zh"
    no_vid = False
    no_asr = False

    success_short, result_short = client.upload_and_process_video(
        video_url=video_url,
        output_type=output_type,
        language=language,
        no_vid=no_vid,
        no_asr=no_asr,
        timeout=300
    )

    print("success_short", success_short)
    if success_short:
        print("=== Service [Call Succeeded] ===")
        print(f"Status: {result_short['message']}")
        print(f"Result: \n{result_short}")
    else:
        print("=== Service [Call Failed] ===")
        print(f"Error: {result_short['error'] if 'error' in result_short else result_short['raw_response']}")
```
If you find the work helpful, please consider giving a star and citing the following article:
@article{pu2025arc,
title={ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries},
author={Pu, Junfu and Wang, Teng and Ge, Yixiao and Ge, Yuying and Li, Chen and Shan, Ying},
journal={arXiv preprint arXiv:2511.14349},
year={2025}
}

