@hanouticelina
Contributor

This PR adds a new CLI command that checks cached files against their checksums on the Hub. It verifies all cached revisions for a repo, or specific snapshots if a revision is provided.

Under the hood, it lists the remote files for each revision using list_repo_tree, maps them to local snapshots, and compares the two sets to find files that are missing locally or missing on the Hub. Then, for each file, it computes the local checksum and compares it with the one on the Hub.
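
The set comparison and checksum step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: `verify_snapshot` and `sha256_of` are hypothetical helpers, the `expected` mapping (relative path to SHA-256 hex digest) stands in for whatever metadata list_repo_tree returns, and using SHA-256 for every file is a simplification.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large files are never fully loaded in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_snapshot(local_dir: Path, expected: dict[str, str]) -> dict[str, list[str]]:
    """Compare a local folder against expected checksums (hypothetical helper).

    `expected` maps POSIX-style relative paths to SHA-256 hex digests.
    """
    local = {
        p.relative_to(local_dir).as_posix()
        for p in local_dir.rglob("*")
        if p.is_file()
    }
    remote = set(expected)
    return {
        "missing": sorted(remote - local),  # listed remotely, absent on disk
        "extra": sorted(local - remote),    # present on disk, not listed remotely
        "mismatched": sorted(
            path
            for path in local & remote
            if sha256_of(local_dir / path) != expected[path]
        ),
    }
```

Only files present on both sides are hashed; missing and extra files are reported separately so the caller can decide how to treat them.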

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Wauplin
Contributor

Wauplin commented Oct 22, 2025

Hey! Thanks for opening this PR. Here are some high-level thoughts about this feature:

  • 💯% agree that the purpose of the command is to compute file checksums
  • as requested in #3298 (command for verifying local files), I would make the command compatible with local directories as well (not necessarily the cache). It is a bit counter-intuitive given the name hf cache verify, but it's still fine IMO. Another possibility would be a top-level hf verify command, but it's less self-explanatory.
  • I don't think we should scan the entire cache just to verify one repo and/or one revision. Scanning the cache is a heavy task (i.e. listing all files from all revisions of all repos), and most of that work is wasted if we target a single repo
  • I think it's fine to assume we want to be able to target a single folder per command execution. This makes the CLI much easier to extend with the "generic" arguments like --repo-type, --revision, --local-dir, etc. existing in the hf download command.
  • I don't think the command should fail on missing files. It's quite common for someone to download only a subpart of a repo in which case the verify command should not fail if the downloaded files are valid. Same for files that are present locally but not on remote. So having optional flags like --fail-on-missing-files and --fail-on-extra-files makes sense IMO.
    • without these flags, I'd say it's ok to print a warning on missing/extra files with a message like ("12 local files do not exist on remote repo. Use --fail-on-extra-files for more details.")
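
The failure policy proposed in the last two bullets could look roughly like the sketch below. It is only an illustration of the suggestion: `VerifyReport` and `exit_code` are hypothetical names, treating a checksum mismatch as the only unconditional failure is an assumption, and the wording of the missing-files warning is invented by analogy with the extra-files message quoted above.

```python
from dataclasses import dataclass, field


@dataclass
class VerifyReport:
    """Hypothetical result container; not part of huggingface_hub."""

    mismatched: list[str] = field(default_factory=list)
    missing: list[str] = field(default_factory=list)
    extra: list[str] = field(default_factory=list)


def exit_code(
    report: VerifyReport,
    fail_on_missing_files: bool = False,
    fail_on_extra_files: bool = False,
) -> tuple[int, list[str]]:
    """Return (exit code, warnings) for a verification report."""
    warnings: list[str] = []
    # Assumption: a checksum mismatch always fails the command.
    failed = bool(report.mismatched)

    if report.missing:
        if fail_on_missing_files:
            failed = True
        else:
            # Assumed wording, mirroring the extra-files message.
            warnings.append(
                f"{len(report.missing)} remote files do not exist locally. "
                "Use --fail-on-missing-files for more details."
            )

    if report.extra:
        if fail_on_extra_files:
            failed = True
        else:
            warnings.append(
                f"{len(report.extra)} local files do not exist on remote repo. "
                "Use --fail-on-extra-files for more details."
            )

    return (1 if failed else 0), warnings
```

With this shape, a partial download whose files all match exits with 0 and a warning, and only turns into a failure when the corresponding flag is passed.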

In the end, the CLI I suggest would look like this:

hf cache verify <repo-id> [--repo-type ...] [--revision ...] [--cache-dir ...] [--token ...] [--local-dir ...] [--fail-on-missing-files] [--fail-on-extra-files]

# Verify main revision of "deepseek-ai/DeepSeek-OCR" in cache
hf cache verify deepseek-ai/DeepSeek-OCR

# Verify specific revision
hf cache verify deepseek-ai/DeepSeek-OCR --revision refs/pr/1
hf cache verify deepseek-ai/DeepSeek-OCR --revision abcdef123

# Verify using private repo
hf cache verify me/private-model --token ...

# Verify dataset
hf cache verify karpathy/fineweb-edu-100b-shuffle --repo-type dataset

# Verify local dir
hf cache verify deepseek-ai/DeepSeek-OCR --local-dir /path/to/repo

Let me know what you think. I might not have thought of all possible use cases, so I'm happy to have it challenged ^^

Base automatically changed from v1.0-release to main October 23, 2025 12:48
@hanouticelina
Contributor Author

agh the commit history is messed up since we merged v1.0-release into main. fixing it now!

@hanouticelina hanouticelina marked this pull request as ready for review October 24, 2025 15:34
@hanouticelina hanouticelina requested a review from Wauplin October 24, 2025 15:34
Contributor

@Wauplin Wauplin left a comment

(haven't reviewed the tests)

@hanouticelina hanouticelina requested a review from Wauplin October 29, 2025 17:28
@hanouticelina
Contributor Author

thanks @Wauplin for the very thorough review! I addressed all your comments and refactored the logic a bit

Contributor

@Wauplin Wauplin left a comment

Thanks for the iteration! This time I've checked the tests which look great 🤗

Left a last round of comments but overall looks good :)

Contributor Author

@hanouticelina hanouticelina left a comment

thanks @Wauplin for the review! i addressed all your comments, could you have a final look? 🙏
