Commit a3e5a67

checksum verification for cached repos
1 parent 39ebbc0 commit a3e5a67

9 files changed: +732 -2 lines changed

docs/source/en/guides/cli.md

Lines changed: 38 additions & 0 deletions
@@ -673,6 +673,44 @@ Deleted 3 unreferenced revision(s); freed 2.4G.
 
 As with the other cache commands, `--dry-run`, `--yes`, and `--cache-dir` are available. Refer to the [Manage your cache](./manage-cache) guide for more examples.
 
+## hf cache verify
+
+Use `hf cache verify` to validate local files against their checksums on the Hub. Target a single repo per invocation and choose between verifying the cache snapshot or a regular local directory.
+
+Examples:
+
+```bash
+# Verify main revision of a model in cache
+>>> hf cache verify deepseek-ai/DeepSeek-OCR
+
+# Verify a specific revision
+>>> hf cache verify deepseek-ai/DeepSeek-OCR --revision refs/pr/1
+>>> hf cache verify deepseek-ai/DeepSeek-OCR --revision abcdef123
+
+# Verify a private repo
+>>> hf cache verify me/private-model --token hf_***
+
+# Verify a dataset
+>>> hf cache verify karpathy/fineweb-edu-100b-shuffle --repo-type dataset
+
+# Verify files in a local directory
+>>> hf cache verify deepseek-ai/DeepSeek-OCR --local-dir /path/to/repo
+```
+
+By default, the command warns about missing or extra files but does not fail. Use flags to make these conditions fail the command:
+
+```bash
+>>> hf cache verify gpt2 --fail-on-missing-files --fail-on-extra-files
+```
+
+On success, you will see a summary:
+
+```text
+✅ Verified 60 file(s) at e7da7f221d5bf496a48136c0cd264e630fe9fcc8; no checksum mismatches.
+```
+
+If mismatches are detected, the command prints a detailed list and exits with a non-zero status.
+
 ## hf repo tag create
 
 The `hf repo tag create` command allows you to tag, untag, and list tags for repositories.
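
For scripted use (for example a CI job), the invocations documented in this hunk can be assembled programmatically. The following is a minimal sketch and not part of the commit: `build_verify_command` is a hypothetical helper, and it only emits flags that appear in the documentation added by this commit (`--repo-type`, `--revision`, `--local-dir`, `--cache-dir`, `--fail-on-missing-files`, `--fail-on-extra-files`).

```python
from typing import List, Optional


def build_verify_command(
    repo_id: str,
    *,
    repo_type: str = "model",
    revision: Optional[str] = None,
    local_dir: Optional[str] = None,
    cache_dir: Optional[str] = None,
    fail_on_missing_files: bool = False,
    fail_on_extra_files: bool = False,
) -> List[str]:
    """Hypothetical helper: build the argv list for `hf cache verify`."""
    # The CLI rejects --local-dir combined with --cache-dir, so fail early here too.
    if local_dir is not None and cache_dir is not None:
        raise ValueError("Pass either local_dir or cache_dir, not both.")

    cmd = ["hf", "cache", "verify", repo_id]
    if repo_type != "model":  # "model" is the documented default
        cmd += ["--repo-type", repo_type]
    if revision is not None:
        cmd += ["--revision", revision]
    if local_dir is not None:
        cmd += ["--local-dir", local_dir]
    if cache_dir is not None:
        cmd += ["--cache-dir", cache_dir]
    if fail_on_missing_files:
        cmd.append("--fail-on-missing-files")
    if fail_on_extra_files:
        cmd.append("--fail-on-extra-files")
    return cmd


if __name__ == "__main__":
    print(build_verify_command("gpt2", fail_on_missing_files=True, fail_on_extra_files=True))
```

The resulting list can be handed to `subprocess.run`; per the documentation above, a non-zero return code signals checksum mismatches, or missing or extra files when the corresponding flag is set.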

docs/source/en/guides/manage-cache.md

Lines changed: 22 additions & 0 deletions
@@ -479,6 +479,28 @@ HFCacheInfo(
 )
 ```
 
+### Verify your cache
+
+`huggingface_hub` can verify that your cached files match the checksums on the Hub. Use `hf cache verify` from the CLI to validate one or more cached repositories or specific revisions.
+
+Verify a whole cached repository by repo ID (verifies every cached revision for that repo):
+
+```bash
+>>> hf cache verify model/sentence-transformers/all-MiniLM-L6-v2
+✅ Verified 28 file(s) across 1 revision(s); no checksum mismatches detected.
+```
+
+Verify specific cached revisions by hash (you can pass several targets at once):
+
+```text
+➜ hf cache verify 1c610f6b3f5e7d8a d4ec9b72
+❌ Checksum verification failed for the following file(s):
+- dataset/nyu-mll/glue@bcdcba79d07bc864c1c254ccfcedcce55bcc9a8c::cola/test-00000-of-00001.parquet: missing locally.
+```
+
+> [!TIP]
+> Pair `hf cache verify` with `--cache-dir PATH` when working outside the default cache, and `--token` to verify against private or gated repositories.
+
 ### Clean your cache
 
 Scanning your cache is interesting but what you really want to do next is usually to

docs/source/en/package_reference/cli.md

Lines changed: 32 additions & 0 deletions
@@ -152,6 +152,7 @@ $ hf cache [OPTIONS] COMMAND [ARGS]...
 * `ls`: List cached repositories or revisions.
 * `prune`: Remove detached revisions from the cache.
 * `rm`: Remove cached repositories or revisions.
+* `verify`: Verify checksums for a single repo...
 
 ### `hf cache ls`
 
@@ -210,6 +211,37 @@ $ hf cache rm [OPTIONS] TARGETS...
 * `--dry-run / --no-dry-run`: Preview deletions without removing anything. [default: no-dry-run]
 * `--help`: Show this message and exit.
 
+### `hf cache verify`
+
+Verify checksums for a single repo revision from cache or a local directory.
+
+Examples:
+- Verify main revision in cache: `hf cache verify gpt2`
+- Verify specific revision: `hf cache verify gpt2 --revision refs/pr/1`
+- Verify dataset: `hf cache verify karpathy/fineweb-edu-100b-shuffle --repo-type dataset`
+- Verify local dir: `hf cache verify deepseek-ai/DeepSeek-OCR --local-dir /path/to/repo`
+
+**Usage**:
+
+```console
+$ hf cache verify [OPTIONS] REPO_ID
+```
+
+**Arguments**:
+
+* `REPO_ID`: The ID of the repo (e.g. `username/repo-name`). [required]
+
+**Options**:
+
+* `--repo-type [model|dataset|space]`: The type of repository (model, dataset, or space). [default: model]
+* `--revision TEXT`: Git revision id which can be a branch name, a tag, or a commit hash.
+* `--cache-dir TEXT`: Cache directory to use when verifying files from cache (defaults to Hugging Face cache).
+* `--local-dir TEXT`: If set, verify files under this directory instead of the cache.
+* `--fail-on-missing-files / --no-fail-on-missing-files`: Fail if some files exist on the remote but are missing locally. [default: no-fail-on-missing-files]
+* `--fail-on-extra-files / --no-fail-on-extra-files`: Fail if some files exist locally but are not present on the remote revision. [default: no-fail-on-extra-files]
+* `--token TEXT`: A User Access Token generated from https://huggingface.co/settings/tokens.
+* `--help`: Show this message and exit.
+
 ## `hf download`
 
 Download files from the Hub.

src/huggingface_hub/cli/cache.py

Lines changed: 99 additions & 1 deletion
@@ -37,7 +37,7 @@
     tabulate,
 )
 from ..utils._parsing import parse_duration, parse_size
-from ._cli_utils import typer_factory
+from ._cli_utils import RepoIdArg, RepoTypeOpt, RevisionOpt, TokenOpt, get_hf_api, typer_factory
 
 
 cache_cli = typer_factory(help="Manage local cache directory.")
@@ -634,3 +634,101 @@ def prune(
 
     strategy.execute()
     print(f"Deleted {counts.total_revision_count} unreferenced revision(s); freed {strategy.expected_freed_size_str}.")
+
+
+@cache_cli.command()
+def verify(
+    repo_id: RepoIdArg,
+    repo_type: RepoTypeOpt = RepoTypeOpt.model,
+    revision: RevisionOpt = None,
+    cache_dir: Annotated[
+        Optional[str],
+        typer.Option(
+            help="Cache directory to use when verifying files from cache (defaults to Hugging Face cache).",
+        ),
+    ] = None,
+    local_dir: Annotated[
+        Optional[str],
+        typer.Option(
+            help="If set, verify files under this directory instead of the cache.",
+        ),
+    ] = None,
+    fail_on_missing_files: Annotated[
+        bool,
+        typer.Option(
+            help="Fail if some files exist on the remote but are missing locally.",
+        ),
+    ] = False,
+    fail_on_extra_files: Annotated[
+        bool,
+        typer.Option(
+            help="Fail if some files exist locally but are not present on the remote revision.",
+        ),
+    ] = False,
+    token: TokenOpt = None,
+) -> None:
+    """Verify checksums for a single repo revision from cache or a local directory.
+
+    Examples:
+    - Verify main revision in cache: `hf cache verify gpt2`
+    - Verify specific revision: `hf cache verify gpt2 --revision refs/pr/1`
+    - Verify dataset: `hf cache verify karpathy/fineweb-edu-100b-shuffle --repo-type dataset`
+    - Verify local dir: `hf cache verify deepseek-ai/DeepSeek-OCR --local-dir /path/to/repo`
+    """
+
+    if local_dir is not None and cache_dir is not None:
+        print("Cannot pass both --local-dir and --cache-dir. Use one or the other.")
+        raise typer.Exit(code=2)
+
+    api = get_hf_api(token=token)
+
+    try:
+        result = api.verify_repo_checksums(
+            repo_id=repo_id,
+            repo_type=repo_type.value if hasattr(repo_type, "value") else str(repo_type),
+            revision=revision,
+            local_dir=local_dir,
+            cache_dir=cache_dir,
+            token=token,
+        )
+    except ValueError as exc:
+        print(str(exc))
+        raise typer.Exit(code=1)
+
+    # Print mismatches first if any
+    if result.mismatches:
+        print("❌ Checksum verification failed for the following file(s):")
+        for m in result.mismatches:
+            print(f" - {m['path']}: expected {m['expected']} ({m['algorithm']}), got {m['actual']}")
+
+    # Handle missing/extra
+    exit_code = 0
+    if result.missing_paths:
+        if fail_on_missing_files:
+            print("Missing files (present remotely, absent locally):")
+            for p in result.missing_paths:
+                print(f" - {p}")
+            exit_code = 1
+        else:
+            print(
+                f"{len(result.missing_paths)} remote file(s) are missing locally. Use --fail-on-missing-files for details."
+            )
+
+    if result.extra_paths:
+        if fail_on_extra_files:
+            print("Extra files (present locally, absent remotely):")
+            for p in result.extra_paths:
+                print(f" - {p}")
+            exit_code = 1
+        else:
+            print(
+                f"{len(result.extra_paths)} local file(s) do not exist on remote repo. Use --fail-on-extra-files for more details."
+            )
+
+    if result.mismatches:
+        exit_code = 1
+
+    if exit_code != 0:
+        raise typer.Exit(code=exit_code)
+
+    print(f"✅ Verified {result.checked_count} file(s) at {result.revision}; no checksum mismatches.")
src/huggingface_hub/hf_api.py

Lines changed: 81 additions & 0 deletions
@@ -105,11 +105,13 @@
 from .utils._auth import _get_token_from_environment, _get_token_from_file, _get_token_from_google_colab
 from .utils._deprecation import _deprecate_arguments
 from .utils._typing import CallableT
+from .utils._verification import collect_local_files, resolve_local_root, verify_maps
 from .utils.endpoint_helpers import _is_emission_within_threshold
 
 
 if TYPE_CHECKING:
     from .inference._providers import PROVIDER_T
+    from .utils._verification import Verification
 
 R = TypeVar("R") # Return type
 CollectionItemType_T = Literal["model", "dataset", "space", "paper", "collection"]
@@ -3080,6 +3082,84 @@ def list_repo_tree(
         for path_info in paginate(path=tree_url, headers=headers, params={"recursive": recursive, "expand": expand}):
             yield (RepoFile(**path_info) if path_info["type"] == "file" else RepoFolder(**path_info))
 
+    @validate_hf_hub_args
+    def verify_repo_checksums(
+        self,
+        repo_id: str,
+        *,
+        repo_type: Optional[str] = None,
+        revision: Optional[str] = None,
+        local_dir: Optional[Union[str, Path]] = None,
+        cache_dir: Optional[Union[str, Path]] = None,
+        token: Union[str, bool, None] = None,
+    ) -> "Verification":
+        """
+        Verify local files for a repo against Hub checksums.
+
+        Args:
+            repo_id (`str`):
+                A namespace (user or an organization) and a repo name separated by a `/`.
+            repo_type (`str`, *optional*):
+                The type of the repository from which to get the tree (`"model"`, `"dataset"` or `"space"`.
+                Defaults to `"model"`.
+            revision (`str`, *optional*):
+                The revision of the repository from which to get the tree. Defaults to `"main"` branch.
+            local_dir (`str` or `Path`, *optional*):
+                The local directory to verify.
+            cache_dir (`str` or `Path`, *optional*):
+                The cache directory to verify.
+            token (Union[bool, str, None], optional):
+                A valid user access token (string). Defaults to the locally saved
+                token, which is the recommended method for authentication (see
+                https://huggingface.co/docs/huggingface_hub/quick-start#authentication).
+                To disable authentication, pass `False`.
+
+        Returns:
+            [`Verification`]: a structured result containing the verification details.
+
+        Raises:
+            [`~utils.RepositoryNotFoundError`]:
+                If repository is not found (error 404): wrong repo_id/repo_type, private but not authenticated or repo
+                does not exist.
+            [`~utils.RevisionNotFoundError`]:
+                If revision is not found (error 404) on the repo.
+            [`~utils.RemoteEntryNotFoundError`]:
+                If the tree (folder) does not exist (error 404) on the repo.
+
+        """
+
+        if repo_type is None:
+            repo_type = constants.REPO_TYPE_MODEL
+
+        if local_dir is not None and cache_dir is not None:
+            raise ValueError("Pass either `local_dir` or `cache_dir`, not both.")
+
+        root, remote_revision = resolve_local_root(
+            repo_id=repo_id,
+            repo_type=repo_type,
+            revision=revision,
+            cache_dir=Path(cache_dir) if cache_dir is not None else None,
+            local_dir=Path(local_dir) if local_dir is not None else None,
+        )
+        local_by_path = collect_local_files(root)
+
+        # get remote entries
+        remote_by_path: dict[str, object] = {}
+        for entry in self.list_repo_tree(
+            repo_id=repo_id, recursive=True, revision=remote_revision, repo_type=repo_type, token=token
+        ):
+            path = getattr(entry, "path", None)
+            if not path:
+                continue
+            lfs = getattr(entry, "lfs", None)
+            has_lfs_sha = (getattr(lfs, "sha256", None) is not None) or (
+                isinstance(lfs, dict) and lfs.get("sha256") is not None
+            )
+            if hasattr(entry, "blob_id") or has_lfs_sha:
+                remote_by_path[path] = entry
+
+        return verify_maps(remote_by_path=remote_by_path, local_by_path=local_by_path, revision=remote_revision)
+
     @validate_hf_hub_args
     def list_repo_refs(
         self,
@@ -10733,6 +10813,7 @@ def _parse_revision_from_pr_url(pr_url: str) -> str:
 list_repo_commits = api.list_repo_commits
 list_repo_tree = api.list_repo_tree
 get_paths_info = api.get_paths_info
+verify_repo_checksums = api.verify_repo_checksums
 
 get_model_tags = api.get_model_tags
 get_dataset_tags = api.get_dataset_tags
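
`verify_maps` lives in `utils/_verification.py`, which is not shown in this excerpt, so its exact behavior is unknown here. The sketch below illustrates one plausible shape of the comparison, under two labeled assumptions: regular files are checked against the git blob id (`blob_id`, a SHA-1 over a `blob <size>` header, a NUL byte, and the content), and LFS files against the SHA-256 of their content (`lfs.sha256`). The function names, the `"git-sha1"` label, and the in-memory `bytes` inputs are all hypothetical; a real implementation would presumably stream files from disk.

```python
import hashlib
from typing import Dict, List, Tuple


def git_blob_sha1(data: bytes) -> str:
    """Git object id of a blob: SHA-1 over a 'blob <size>' header, a NUL byte, then the content."""
    header = f"blob {len(data)}\0".encode()
    return hashlib.sha1(header + data).hexdigest()


def lfs_sha256(data: bytes) -> str:
    """Git LFS object id: plain SHA-256 of the file content."""
    return hashlib.sha256(data).hexdigest()


def compare_maps(
    remote: Dict[str, Tuple[str, str]],
    local: Dict[str, bytes],
) -> Tuple[List[str], List[str], List[Dict[str, str]]]:
    """Hypothetical comparison step.

    `remote` maps path -> (algorithm, expected hex digest);
    `local` maps path -> file content.
    Returns (missing_paths, extra_paths, mismatches).
    """
    missing = sorted(set(remote) - set(local))  # on the Hub, absent locally
    extra = sorted(set(local) - set(remote))  # present locally, absent on the Hub
    mismatches: List[Dict[str, str]] = []
    for path in sorted(set(remote) & set(local)):
        algorithm, expected = remote[path]
        if algorithm == "sha256":
            actual = lfs_sha256(local[path])
        else:  # assumed label "git-sha1" for regular (non-LFS) files
            actual = git_blob_sha1(local[path])
        if actual != expected:
            mismatches.append(
                {"path": path, "expected": expected, "actual": actual, "algorithm": algorithm}
            )
    return missing, extra, mismatches


if __name__ == "__main__":
    remote = {"README.md": ("git-sha1", git_blob_sha1(b"hello\n"))}
    print(compare_maps(remote, {"README.md": b"hello\n", "notes.txt": b""}))
```

The mismatch dictionaries use the keys `path`, `expected`, `actual`, and `algorithm` because those are the keys the `hf cache verify` command reads when printing failures.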
