Junfu Pu*,
Teng Wang*,
Yixiao Ge†,
Yuying Ge,
Chen Li,
Ying Shan
ARC Lab, Tencent PCG
*Core contributors †Project lead
An illustration of the capabilities of our video chaptering model. Given a video, our model is able to generate timestamped chapters with three-level structured output: 1) Short Title - a concise label summarizing each chapter; 2) Structural Chapter - a detailed, structured annotation for each chapter, including a rewritten comprehensive Title, an Abstract summarizing the core content, and an Introduction describing key details and highlights; and 3) Timestamp-Aligned Video Description - fine-grained descriptions aligned with precise temporal boundaries. This hierarchical structure facilitates an efficient and precise understanding of video content.
With the explosion of video content, users on video platforms now expect video chapters to navigate long videos and find key moments instantly. However, most chapters are created manually or with simple rules—a process that is slow, expensive, and often fails to capture the true semantic structure of the content. For information-dense videos like lectures, tutorials, product reviews, and meetings, there is a critical need to automatically generate a clear table of contents with precise timestamps. For viewers, this means a "navigable" experience where they can jump to the exact information they need. For creators and platforms, it means powerful tools for content clipping, fast search, and smarter recommendations, ultimately boosting user engagement.
To meet this demand, we introduce ARC-Chapter: a large-scale model from Tencent ARC Lab for deep video understanding and structured chapter generation. ARC-Chapter automatically analyzes videos that have a clear narrative or semantic structure, segmenting them into meaningful chapters, identifying precise timestamps, and generating summaries for each part. Our goal is to let users browse videos like they read a document—getting straight to the point—while empowering platforms and creators with more efficient ways to manage and distribute their content.
Ideal Use Cases for ARC-Chapter:
- Education & Knowledge: Lectures, courses, and scientific explainers.
- Tutorials & How-To's: Software demos, cooking guides, fitness routines, and gaming walkthroughs.
- Presentations & Talks: Conference keynotes, product announcements, and interviews.
- Reviews & Guides: Unboxing videos, product comparisons, and travel vlogs.
- Analysis & Commentary: Movie reviews, plot summaries, and deep dives.
- ...and more.
- [2025.11.19] Technical report and API of ARC-Chapter released!
The model seamlessly processes both Chinese and English videos.
Choose the level of detail you need with three distinct output formats; a hypothetical sample of the structured output is sketched after this list:
- Simple Chapters: Generates precise start times and a concise title for each chapter. Perfect for quick navigation.
- Detailed Structured Chapters: Provides a rich, structured output including:
- Chapter Timestamps
- Short and Long Titles
- Segment Summaries
- An Overall Video-level Summary
- Segmented Video Descriptions: Creates timestamped descriptions covering the entire video from start to finish.
- Multimodal Input: ARC-Chapter leverages both video frames and ASR transcripts to deeply understand content.
- Video Length: It effectively processes everything from short 3-5 minute clips to feature-length, hour-long videos.
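
To make these formats concrete, below is a hypothetical sample of the detailed structured output, written as a Python dict. The field names and values are illustrative placeholders only, not the API's actual response schema.

```python
# Hypothetical structured-chapter output (illustrative field names only).
example_output = {
    "video_summary": "One-paragraph overview of the whole video.",
    "chapters": [
        {
            "start": "00:00:00",
            "short_title": "Opening & Agenda",
            "title": "Welcome, Speaker Introduction, and Session Agenda",
            "abstract": "The host welcomes viewers and previews the topics to come.",
            "introduction": "Key details and highlights of this segment...",
        },
        # ...one entry per chapter...
    ],
}
```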
ARC-Chapter's exceptional performance comes from being trained on a large-scale, high-quality dataset built with our proprietary semi-automatic annotation pipeline. This dataset covers a vast array of topics: Knowledge, Gaming, Technology, Music, Lifestyle, Film & TV, Sports, Automotive, Food, and more. This diverse training ensures the model is highly generalizable and performs reliably across different genres and video formats.
ARC-Chapter is built on top of the powerful Qwen2.5VL model, with several key architectural designs that make it exceptionally effective for video chaptering.
- Flexible Multimodal Inputs:
To deeply understand video content, ARC-Chapter is designed to be multimodal and robust:
- Efficient Long-Video Processing: We transcribe audio into text (ASR), significantly reducing the input context length and enabling the model to process hour-long videos efficiently.
- Modality Dropout: During training, we randomly drop either the visual or text modality. This technique makes the final model highly versatile, allowing it to work with video-only, text-only (ASR), or combined video+text inputs.
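
As a rough sketch of this technique (the function name and drop probabilities below are our own placeholders, not values from the report):

```python
import random

def apply_modality_dropout(frames, asr_text, p_visual=0.15, p_text=0.15):
    # Drop at most one modality per training sample so the model learns to
    # handle video-only, text-only (ASR), and combined inputs.
    r = random.random()
    if r < p_visual:
        return None, asr_text   # text-only (ASR) sample
    if r < p_visual + p_text:
        return frames, None     # video-only sample
    return frames, asr_text     # combined video + text sample
```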
- Large-Scale Data Annotation Pipeline:
The performance of ARC-Chapter is powered by a massive, high-quality dataset:
- We developed a semi-automatic annotation pipeline to process hundreds of thousands of videos. This pipeline allowed us to construct a large-scale, high-precision dataset that is foundational to the model's accuracy and generalization capabilities.
- Explicit Timestamp Injection:
To ensure precise temporal localization, we enhance the model's awareness of time:
- During training, we randomly overlay the timestamp (in HH:MM:SS format) directly onto each video frame. This explicitly teaches the model to associate visual cues with their exact moment in the video, leading to more accurate chapter boundaries.
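
A minimal sketch of such an overlay, assuming OpenCV-style numpy frames; the font, position, and color here are arbitrary choices rather than the model's actual settings:

```python
import cv2

def overlay_timestamp(frame, t_seconds):
    # Burn the frame's position in the video into the image as HH:MM:SS text.
    t = int(t_seconds)
    label = f"{t // 3600:02d}:{t % 3600 // 60:02d}:{t % 60:02d}"
    cv2.putText(frame, label, (10, 30), cv2.FONT_HERSHEY_SIMPLEX,
                1.0, (255, 255, 255), 2, cv2.LINE_AA)
    return frame
```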
- Dynamic Prompt Engineering:
We guide the model to generate the desired output with tailored instructions:
- We use customized prompts that adapt to the specific task. These prompts vary based on the input modality, the requested output format (e.g., short titles vs. structured summaries), and the video's language (Chinese/English).
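
As an illustration, a prompt builder might look like the sketch below. The templates and keys are hypothetical ("structinfo" is borrowed from the API example later on this page; everything else is a placeholder), since ARC-Chapter's actual prompts are not published.

```python
def build_prompt(modality: str, output_type: str, language: str) -> str:
    # Hypothetical templates; the real ARC-Chapter prompts are not public.
    source = {
        "video+text": "the video frames and the ASR transcript",
        "video": "the video frames",
        "text": "the ASR transcript",
    }[modality]
    task = {
        "chapters": "generate chapter start times with a concise title for each chapter",
        "structinfo": "generate structured chapters with titles, abstracts, and an overall summary",
    }[output_type]
    lang = "Chinese" if language == "zh" else "English"
    return f"Based on {source}, {task}. Respond in {lang}."
```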
The model's structured video chaptering and summarization capabilities are shown below:
For more video chaptering results, please visit our blog: 👉 Blog
Please visit the online ARC-Chapter Demo on the ARC Lab website if you are interested in our work.
How to find the demo on the ARC Lab homepage:
ARC Lab -> AI Demo -> Register with Phone No. -> Multimodal Comprehension and Generation -> ARC-Chapter-7B
We provide model access via an API service. A brief tutorial on how to use the API follows; for more details, please refer to the documentation.
Before using the ARC-Chapter API, you must obtain an access token (ARC-Token). Users who are not logged in must complete account verification first.
Steps to get your token:
- Log in: Visit ARC Website and log in with your mobile number.
- Retrieve Token: Once logged in, click the user icon in the top-right corner and select "View Token" from the dropdown menu to get your ARC-TOKEN.
Python package install: `pip install requests`

```python
import requests
import json
from typing import Tuple, Dict, Optional, Any

ARC_TOKENS = {
    "tokens": "YOUR_ARC_TOKEN",
}

headers = {
    "Content-Type": "application/json", "Accept": "text/plain", "charset": "UTF-8",
    "Authorization": ARC_TOKENS["tokens"]
}


class ArcChapter7bClient:
    # Minimal client for the ARC-Chapter 7B endpoint; every call returns a
    # (success, payload) tuple.
    def __init__(self, base_api_url: str):
        self.base_url = base_api_url.rstrip("/")
        self.api_endpoint = f"{self.base_url}/cvc_function/arc_chapter_7b/"
        self.session = requests.Session()

    def upload_and_process_video(self, video_url: str, language: str = "",
                                 no_vid: Optional[bool] = None, no_asr: Optional[bool] = None,
                                 output_type: str = "", timeout: int = 600) -> Tuple[bool, Dict[str, Any]]:
        # no_vid / no_asr presumably disable the visual / ASR input, respectively.
        data = {
            "video_url": video_url, "language": language, "no_vid": no_vid,
            "no_asr": no_asr, "output_type": output_type
        }
        try:
            response = self.session.post(
                url=self.api_endpoint,
                data=json.dumps(data),
                timeout=timeout,
                headers=headers
            )
            response.raise_for_status()
            result = response.json()
            if result.get("code") == 0:
                return True, result
            return False, {
                "message": f"API returned error: {result.get('message', 'Unknown error')}",
                "code": result.get("code"),
                "raw_response": result
            }
        except requests.exceptions.Timeout:
            return False, {"error": f"Request timed out ({timeout}s)"}
        except requests.exceptions.ConnectionError:
            return False, {"error": "Connection failed, please check API URL or network"}
        except requests.exceptions.HTTPError as e:
            return False, {"error": f"HTTP error: {str(e)}", "status_code": response.status_code}
        except json.JSONDecodeError:
            return False, {"error": "API returned non-JSON data", "raw_data": response.text}
        except Exception as e:  # pylint: disable=broad-except
            return False, {"error": f"Call failed: {str(e)}"}


if __name__ == "__main__":
    API_BASE_URL = "https://arc.tencent.com/"
    client = ArcChapter7bClient(base_api_url=API_BASE_URL)

    # Example: detailed structured chapters ("structinfo") for a Chinese-language
    # video, with both the visual and ASR inputs enabled.
    video_url = "https://50058.gzc.svp.tencent-cloud.com/0b53jqazwaabhuaigpyjaruywtgdtngadgya.f0.mp4?dis_k=408b92763fc60607fc538781de0444dc&dis_t=1762138225"
    output_type = "structinfo"
    language = "zh"
    no_vid = False
    no_asr = False

    success_short, result_short = client.upload_and_process_video(
        video_url=video_url,
        output_type=output_type,
        language=language,
        no_vid=no_vid,
        no_asr=no_asr,
        timeout=300
    )

    print("success_short", success_short)
    if success_short:
        print("=== Service [Call Succeeded] ===")
        print(f"Status: {result_short['message']}")
        print(f"Result: \n{result_short}")
    else:
        print("=== Service [Call Failed] ===")
        print(f"Error: {result_short['error'] if 'error' in result_short else result_short['raw_response']}")
```
If you find the work helpful, please consider giving a star and citing the following article:
@article{pu2025arc,
title={ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries},
author={Pu, Junfu and Wang, Teng and Ge, Yixiao and Ge, Yuying and Li, Chen and Shan, Ying},
journal={arXiv preprint arXiv:2511.14349},
year={2025}
}

