diff --git a/docs/source/en/guides/cli.md b/docs/source/en/guides/cli.md index 68927ac38a..d88d1b8232 100644 --- a/docs/source/en/guides/cli.md +++ b/docs/source/en/guides/cli.md @@ -35,6 +35,20 @@ On Windows: >>> powershell -ExecutionPolicy ByPass -c "irm https://hf.co/cli/install.ps1 | iex" ``` +Alternatively, you can install the `hf` CLI with a single command: + +On macOS and Linux: + +```bash +>>> curl -LsSf https://hf.co/cli/install.sh | sh +``` + +On Windows: + +```powershell +>>> powershell -ExecutionPolicy ByPass -c "irm https://hf.co/cli/install.ps1 | iex" +``` + Once installed, you can check that the CLI is correctly setup: ``` @@ -1016,3 +1030,34 @@ Manage scheduled jobs using # Delete a scheduled job >>> hf jobs scheduled delete ``` + +## hf endpoints + +Use `hf endpoints` to list, deploy, describe, and manage Inference Endpoints directly from the terminal. The legacy +`hf inference-endpoints` alias remains available for compatibility. + +```bash +# List endpoints in your namespace +>>> hf endpoints ls + +# Deploy an endpoint from the Model Catalog +>>> hf endpoints catalog deploy my-endpoint --repo openai/gpt-oss-120b + +# Deploy an endpoint from the Hugging Face Hub +>>> hf endpoints deploy my-endpoint --repo gpt2 --framework pytorch --accelerator cpu --vendor aws --region us-east-1 --instance-size x2 --instance-type intel-icl + +# List catalog entries +>>> hf endpoints catalog ls + +# Show status and metadata +>>> hf endpoints describe my-endpoint + +# Pause the endpoint +>>> hf endpoints pause my-endpoint + +# Delete without confirmation prompt +>>> hf endpoints delete my-endpoint --yes +``` + +> [!TIP] +> Add `--namespace` to target an organization and `--token` to override the default authentication. diff --git a/docs/source/en/guides/inference_endpoints.md b/docs/source/en/guides/inference_endpoints.md index c89c47621a..1a1d64b8a9 100644 --- a/docs/source/en/guides/inference_endpoints.md +++ b/docs/source/en/guides/inference_endpoints.md @@ -33,6 +33,16 @@ The first step is to create an Inference Endpoint using [`create_inference_endpo ... ) ``` +Or via CLI: + +```bash +hf endpoints deploy my-endpoint-name --repo gpt2 --framework pytorch --accelerator cpu --vendor aws --region us-east-1 --instance-size x2 --instance-type intel-icl --task text-generation + +# Deploy from the catalog with a single command +hf endpoints catalog deploy my-endpoint-name --repo openai/gpt-oss-120b +``` + + In this example, we created a `protected` Inference Endpoint named `"my-endpoint-name"`, to serve [gpt2](https://huggingface.co/gpt2) for `text-generation`. A `protected` Inference Endpoint means your token is required to access the API. We also need to provide additional information to configure the hardware requirements, such as vendor, region, accelerator, instance type, and size. You can check out the list of available resources [here](https://api.endpoints.huggingface.cloud/#/v2%3A%3Aprovider/list_vendors). Alternatively, you can create an Inference Endpoint manually using the [Web interface](https://ui.endpoints.huggingface.co/new) for convenience. Refer to this [guide](https://huggingface.co/docs/inference-endpoints/guides/advanced) for details on advanced settings and their usage.
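The catalog route has a Python counterpart too. Below is a minimal sketch using the same `HfApi.create_inference_endpoint_from_catalog` call that `hf endpoints catalog deploy` wraps in this diff; the endpoint name and printed attributes are illustrative:

```py
from huggingface_hub import HfApi

api = HfApi()  # token resolved from your local login by default

# Same call issued by `hf endpoints catalog deploy my-endpoint-name --repo openai/gpt-oss-120b`
endpoint = api.create_inference_endpoint_from_catalog(
    repo_id="openai/gpt-oss-120b",
    name="my-endpoint-name",
)

endpoint.wait()  # block until the endpoint is deployed
print(endpoint.status, endpoint.url)
```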
The value returned by [`create_inference_endpoint`] is an [`InferenceEndpoint`] object: @@ -42,6 +52,12 @@ The value returned by [`create_inference_endpoint`] is an [`InferenceEndpoint`] InferenceEndpoint(name='my-endpoint-name', namespace='Wauplin', repository='gpt2', status='pending', url=None) ``` +Or via CLI: + +```bash +hf endpoints describe my-endpoint-name +``` + It's a dataclass that holds information about the endpoint. You can access important attributes such as `name`, `repository`, `status`, `task`, `created_at`, `updated_at`, etc. If you need it, you can also access the raw response from the server with `endpoint.raw`. Once your Inference Endpoint is created, you can find it on your [personal dashboard](https://ui.endpoints.huggingface.co/). @@ -101,6 +117,14 @@ InferenceEndpoint(name='my-endpoint-name', namespace='Wauplin', repository='gpt2 [InferenceEndpoint(name='aws-starchat-beta', namespace='huggingface', repository='HuggingFaceH4/starchat-beta', status='paused', url=None), ...] ``` +Or via CLI: + +```bash +hf endpoints describe my-endpoint-name +hf endpoints ls --namespace huggingface +hf endpoints ls --namespace '*' +``` + ## Check deployment status In the rest of this guide, we will assume that we have a [`InferenceEndpoint`] object called `endpoint`. You might have noticed that the endpoint has a `status` attribute of type [`InferenceEndpointStatus`]. When the Inference Endpoint is deployed and accessible, the status should be `"running"` and the `url` attribute is set: @@ -117,6 +141,12 @@ Before reaching a `"running"` state, the Inference Endpoint typically goes throu InferenceEndpoint(name='my-endpoint-name', namespace='Wauplin', repository='gpt2', status='pending', url=None) ``` +Or via CLI: + +```bash +hf endpoints describe my-endpoint-name +``` + Instead of fetching the Inference Endpoint status while waiting for it to run, you can directly call [`~InferenceEndpoint.wait`]. This helper takes as input a `timeout` and a `fetch_every` parameter (in seconds) and will block the thread until the Inference Endpoint is deployed. Default values are respectively `None` (no timeout) and `5` seconds. ```py @@ -189,6 +219,14 @@ InferenceEndpoint(name='my-endpoint-name', namespace='Wauplin', repository='gpt2 # Endpoint is not 'running' but still has a URL and will restart on first call. ``` +Or via CLI: + +```bash +hf endpoints pause my-endpoint-name +hf endpoints resume my-endpoint-name +hf endpoints scale-to-zero my-endpoint-name +``` + ### Update model or hardware requirements In some cases, you might also want to update your Inference Endpoint without creating a new one. You can either update the hosted model or the hardware requirements to run the model. You can do this using [`~InferenceEndpoint.update`]: @@ -207,6 +245,14 @@ InferenceEndpoint(name='my-endpoint-name', namespace='Wauplin', repository='gpt2 InferenceEndpoint(name='my-endpoint-name', namespace='Wauplin', repository='gpt2-large', status='pending', url=None) ``` +Or via CLI: + +```bash +hf endpoints update my-endpoint-name --repo gpt2-large +hf endpoints update my-endpoint-name --min-replica 2 --max-replica 6 +hf endpoints update my-endpoint-name --accelerator cpu --instance-size x4 --instance-type intel-icl +``` + ### Delete the endpoint Finally if you won't use the Inference Endpoint anymore, you can simply call [`~InferenceEndpoint.delete()`]. 
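Since every `hf endpoints` command prints the endpoint's raw JSON payload (and `ls` wraps the results in an `"items"` array), the CLI output is easy to consume from scripts. A minimal sketch, assuming `hf` is on your `PATH` and you are already logged in:

```py
import json
import subprocess

# Capture the JSON that `hf endpoints ls` writes to stdout.
result = subprocess.run(
    ["hf", "endpoints", "ls"],
    capture_output=True,
    text=True,
    check=True,
)

payload = json.loads(result.stdout)
for item in payload["items"]:
    # Keys mirror the raw endpoint payload returned by the Endpoints API.
    print(item.get("name"))
```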
diff --git a/docs/source/en/package_reference/cli.md b/docs/source/en/package_reference/cli.md index da0bed04d0..cb60a58732 100644 --- a/docs/source/en/package_reference/cli.md +++ b/docs/source/en/package_reference/cli.md @@ -26,6 +26,7 @@ $ hf [OPTIONS] COMMAND [ARGS]... * `auth`: Manage authentication (login, logout, etc.). * `cache`: Manage local cache directory. * `download`: Download files from the Hub. +* `endpoints`: Manage Hugging Face Inference Endpoints. * `env`: Print information about the environment. * `jobs`: Run and manage Jobs on the Hub. * `lfs-enable-largefiles`: Configure your repository to enable upload... @@ -274,6 +275,279 @@ $ hf download [OPTIONS] REPO_ID [FILENAMES]... * `--max-workers INTEGER`: Maximum number of workers to use for downloading files. Default is 8. [default: 8] * `--help`: Show this message and exit. +## `hf endpoints` + +Manage Hugging Face Inference Endpoints. + +**Usage**: + +```console +$ hf endpoints [OPTIONS] COMMAND [ARGS]... +``` + +**Options**: + +* `--help`: Show this message and exit. + +**Commands**: + +* `catalog`: Interact with the Inference Endpoints... +* `delete`: Delete an Inference Endpoint permanently. +* `deploy`: Deploy an Inference Endpoint from a Hub... +* `describe`: Get information about an existing endpoint. +* `list-catalog`: List available Catalog models. +* `ls`: Lists all Inference Endpoints for the... +* `pause`: Pause an Inference Endpoint. +* `resume`: Resume an Inference Endpoint. +* `scale-to-zero`: Scale an Inference Endpoint to zero. +* `update`: Update an existing endpoint. + +### `hf endpoints catalog` + +Interact with the Inference Endpoints catalog. + +**Usage**: + +```console +$ hf endpoints catalog [OPTIONS] COMMAND [ARGS]... +``` + +**Options**: + +* `--help`: Show this message and exit. + +**Commands**: + +* `deploy`: Deploy an Inference Endpoint from the... +* `ls`: List available Catalog models. + +#### `hf endpoints catalog deploy` + +Deploy an Inference Endpoint from the Model Catalog. + +**Usage**: + +```console +$ hf endpoints catalog deploy [OPTIONS] NAME +``` + +**Arguments**: + +* `NAME`: Endpoint name. [required] + +**Options**: + +* `--repo TEXT`: The name of the model repository associated with the Inference Endpoint (e.g. 'openai/gpt-oss-120b'). [required] +* `--namespace TEXT`: The namespace associated with the Inference Endpoint. Defaults to the current user's namespace. +* `--token TEXT`: A User Access Token generated from https://huggingface.co/settings/tokens. +* `--help`: Show this message and exit. + +#### `hf endpoints catalog ls` + +List available Catalog models. + +**Usage**: + +```console +$ hf endpoints catalog ls [OPTIONS] +``` + +**Options**: + +* `--token TEXT`: A User Access Token generated from https://huggingface.co/settings/tokens. +* `--help`: Show this message and exit. + +### `hf endpoints delete` + +Delete an Inference Endpoint permanently. + +**Usage**: + +```console +$ hf endpoints delete [OPTIONS] NAME +``` + +**Arguments**: + +* `NAME`: Endpoint name. [required] + +**Options**: + +* `--namespace TEXT`: The namespace associated with the Inference Endpoint. Defaults to the current user's namespace. +* `--yes`: Skip confirmation prompts. +* `--token TEXT`: A User Access Token generated from https://huggingface.co/settings/tokens. +* `--help`: Show this message and exit. + +### `hf endpoints deploy` + +Deploy an Inference Endpoint from a Hub repository. + +**Usage**: + +```console +$ hf endpoints deploy [OPTIONS] NAME +``` + +**Arguments**: + +* `NAME`: Endpoint name. 
[required] + +**Options**: + +* `--repo TEXT`: The name of the model repository associated with the Inference Endpoint (e.g. 'openai/gpt-oss-120b'). [required] +* `--framework TEXT`: The machine learning framework used for the model (e.g. 'vllm'). [required] +* `--accelerator TEXT`: The hardware accelerator to be used for inference (e.g. 'cpu'). [required] +* `--instance-size TEXT`: The size or type of the instance to be used for hosting the model (e.g. 'x4'). [required] +* `--instance-type TEXT`: The cloud instance type where the Inference Endpoint will be deployed (e.g. 'intel-icl'). [required] +* `--region TEXT`: The cloud region in which the Inference Endpoint will be created (e.g. 'us-east-1'). [required] +* `--vendor TEXT`: The cloud provider or vendor where the Inference Endpoint will be hosted (e.g. 'aws'). [required] +* `--namespace TEXT`: The namespace associated with the Inference Endpoint. Defaults to the current user's namespace. +* `--task TEXT`: The task on which to deploy the model (e.g. 'text-classification'). +* `--token TEXT`: A User Access Token generated from https://huggingface.co/settings/tokens. +* `--help`: Show this message and exit. + +### `hf endpoints describe` + +Get information about an existing endpoint. + +**Usage**: + +```console +$ hf endpoints describe [OPTIONS] NAME +``` + +**Arguments**: + +* `NAME`: Endpoint name. [required] + +**Options**: + +* `--namespace TEXT`: The namespace associated with the Inference Endpoint. Defaults to the current user's namespace. +* `--token TEXT`: A User Access Token generated from https://huggingface.co/settings/tokens. +* `--help`: Show this message and exit. + +### `hf endpoints list-catalog` + +List available Catalog models. + +**Usage**: + +```console +$ hf endpoints list-catalog [OPTIONS] +``` + +**Options**: + +* `--token TEXT`: A User Access Token generated from https://huggingface.co/settings/tokens. +* `--help`: Show this message and exit. + +### `hf endpoints ls` + +Lists all Inference Endpoints for the given namespace. + +**Usage**: + +```console +$ hf endpoints ls [OPTIONS] +``` + +**Options**: + +* `--namespace TEXT`: The namespace associated with the Inference Endpoint. Defaults to the current user's namespace. +* `--token TEXT`: A User Access Token generated from https://huggingface.co/settings/tokens. +* `--help`: Show this message and exit. + +### `hf endpoints pause` + +Pause an Inference Endpoint. + +**Usage**: + +```console +$ hf endpoints pause [OPTIONS] NAME +``` + +**Arguments**: + +* `NAME`: Endpoint name. [required] + +**Options**: + +* `--namespace TEXT`: The namespace associated with the Inference Endpoint. Defaults to the current user's namespace. +* `--token TEXT`: A User Access Token generated from https://huggingface.co/settings/tokens. +* `--help`: Show this message and exit. + +### `hf endpoints resume` + +Resume an Inference Endpoint. + +**Usage**: + +```console +$ hf endpoints resume [OPTIONS] NAME +``` + +**Arguments**: + +* `NAME`: Endpoint name. [required] + +**Options**: + +* `--namespace TEXT`: The namespace associated with the Inference Endpoint. Defaults to the current user's namespace. +* `--fail-if-already-running`: If `True`, the method will raise an error if the Inference Endpoint is already running. +* `--token TEXT`: A User Access Token generated from https://huggingface.co/settings/tokens. +* `--help`: Show this message and exit. + +### `hf endpoints scale-to-zero` + +Scale an Inference Endpoint to zero. 
+ +**Usage**: + +```console +$ hf endpoints scale-to-zero [OPTIONS] NAME +``` + +**Arguments**: + +* `NAME`: Endpoint name. [required] + +**Options**: + +* `--namespace TEXT`: The namespace associated with the Inference Endpoint. Defaults to the current user's namespace. +* `--token TEXT`: A User Access Token generated from https://huggingface.co/settings/tokens. +* `--help`: Show this message and exit. + +### `hf endpoints update` + +Update an existing endpoint. + +**Usage**: + +```console +$ hf endpoints update [OPTIONS] NAME +``` + +**Arguments**: + +* `NAME`: Endpoint name. [required] + +**Options**: + +* `--namespace TEXT`: The namespace associated with the Inference Endpoint. Defaults to the current user's namespace. +* `--repo TEXT`: The name of the model repository associated with the Inference Endpoint (e.g. 'openai/gpt-oss-120b'). +* `--accelerator TEXT`: The hardware accelerator to be used for inference (e.g. 'cpu'). +* `--instance-size TEXT`: The size or type of the instance to be used for hosting the model (e.g. 'x4'). +* `--instance-type TEXT`: The cloud instance type where the Inference Endpoint will be deployed (e.g. 'intel-icl'). +* `--framework TEXT`: The machine learning framework used for the model (e.g. 'custom'). +* `--revision TEXT`: The specific model revision to deploy on the Inference Endpoint (e.g. '6c0e6080953db56375760c0471a8c5f2929baf11'). +* `--task TEXT`: The task on which to deploy the model (e.g. 'text-classification'). +* `--min-replica INTEGER`: The minimum number of replicas (instances) to keep running for the Inference Endpoint. +* `--max-replica INTEGER`: The maximum number of replicas (instances) to scale to for the Inference Endpoint. +* `--scale-to-zero-timeout INTEGER`: The duration in minutes before an inactive endpoint is scaled to zero. +* `--token TEXT`: A User Access Token generated from https://huggingface.co/settings/tokens. +* `--help`: Show this message and exit. + ## `hf env` Print information about the environment. 
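Each subcommand documented above is a thin wrapper around an existing `HfApi` method (`list_inference_endpoints`, `get_inference_endpoint`, `pause_inference_endpoint`, and so on), so the same operations are available programmatically. A short sketch of the Python equivalents of `ls`, `pause`, and `resume`; the endpoint name is illustrative:

```py
from huggingface_hub import HfApi

api = HfApi()  # token resolved from your local login by default

# Equivalent of `hf endpoints ls`
for endpoint in api.list_inference_endpoints():
    print(endpoint.name, endpoint.status)

# Equivalent of `hf endpoints pause my-endpoint-name`
api.pause_inference_endpoint("my-endpoint-name")

# Equivalent of `hf endpoints resume my-endpoint-name`
api.resume_inference_endpoint("my-endpoint-name")
```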
diff --git a/src/huggingface_hub/cli/hf.py b/src/huggingface_hub/cli/hf.py index 8306eff084..6ee60ed194 100644 --- a/src/huggingface_hub/cli/hf.py +++ b/src/huggingface_hub/cli/hf.py @@ -17,13 +17,12 @@ from huggingface_hub.cli.auth import auth_cli from huggingface_hub.cli.cache import cache_cli from huggingface_hub.cli.download import download +from huggingface_hub.cli.inference_endpoints import ie_cli from huggingface_hub.cli.jobs import jobs_cli from huggingface_hub.cli.lfs import lfs_enable_largefiles, lfs_multipart_upload from huggingface_hub.cli.repo import repo_cli from huggingface_hub.cli.repo_files import repo_files_cli from huggingface_hub.cli.system import env, version - -# from huggingface_hub.cli.jobs import jobs_app from huggingface_hub.cli.upload import upload from huggingface_hub.cli.upload_large_folder import upload_large_folder from huggingface_hub.utils import logging @@ -48,6 +47,7 @@ app.add_typer(repo_cli, name="repo") app.add_typer(repo_files_cli, name="repo-files") app.add_typer(jobs_cli, name="jobs") +app.add_typer(ie_cli, name="endpoints") def main(): diff --git a/src/huggingface_hub/cli/inference_endpoints.py b/src/huggingface_hub/cli/inference_endpoints.py new file mode 100644 index 0000000000..cfc169672d --- /dev/null +++ b/src/huggingface_hub/cli/inference_endpoints.py @@ -0,0 +1,373 @@ +"""CLI commands for Hugging Face Inference Endpoints.""" + +import json +from typing import Annotated, Optional + +import typer + +from huggingface_hub._inference_endpoints import InferenceEndpoint +from huggingface_hub.errors import HfHubHTTPError + +from ._cli_utils import TokenOpt, get_hf_api, typer_factory + + +ie_cli = typer_factory(help="Manage Hugging Face Inference Endpoints.") + +catalog_app = typer_factory(help="Interact with the Inference Endpoints catalog.") + +NameArg = Annotated[ + str, + typer.Argument(help="Endpoint name."), +] + +NamespaceOpt = Annotated[ + Optional[str], + typer.Option( + help="The namespace associated with the Inference Endpoint. Defaults to the current user's namespace.", + ), +] + + +def _print_endpoint(endpoint: InferenceEndpoint) -> None: + typer.echo(json.dumps(endpoint.raw, indent=2, sort_keys=True)) + + +@ie_cli.command() +def ls( + namespace: NamespaceOpt = None, + token: TokenOpt = None, +) -> None: + """Lists all Inference Endpoints for the given namespace.""" + api = get_hf_api(token=token) + try: + endpoints = api.list_inference_endpoints(namespace=namespace, token=token) + except HfHubHTTPError as error: + typer.echo(f"Listing failed: {error}") + raise typer.Exit(code=error.response.status_code) from error + + typer.echo( + json.dumps( + {"items": [endpoint.raw for endpoint in endpoints]}, + indent=2, + sort_keys=True, + ) + ) + + +@ie_cli.command(name="deploy") +def deploy( + name: NameArg, + repo: Annotated[ + str, + typer.Option( + help="The name of the model repository associated with the Inference Endpoint (e.g. 'openai/gpt-oss-120b').", + ), + ], + framework: Annotated[ + str, + typer.Option( + help="The machine learning framework used for the model (e.g. 'vllm').", + ), + ], + accelerator: Annotated[ + str, + typer.Option( + help="The hardware accelerator to be used for inference (e.g. 'cpu').", + ), + ], + instance_size: Annotated[ + str, + typer.Option( + help="The size or type of the instance to be used for hosting the model (e.g. 'x4').", + ), + ], + instance_type: Annotated[ + str, + typer.Option( + help="The cloud instance type where the Inference Endpoint will be deployed (e.g. 
'intel-icl').", + ), + ], + region: Annotated[ + str, + typer.Option( + help="The cloud region in which the Inference Endpoint will be created (e.g. 'us-east-1').", + ), + ], + vendor: Annotated[ + str, + typer.Option( + help="The cloud provider or vendor where the Inference Endpoint will be hosted (e.g. 'aws').", + ), + ], + *, + namespace: NamespaceOpt = None, + task: Annotated[ + Optional[str], + typer.Option( + help="The task on which to deploy the model (e.g. 'text-classification').", + ), + ] = None, + token: TokenOpt = None, +) -> None: + """Deploy an Inference Endpoint from a Hub repository.""" + api = get_hf_api(token=token) + endpoint = api.create_inference_endpoint( + name=name, + repository=repo, + framework=framework, + accelerator=accelerator, + instance_size=instance_size, + instance_type=instance_type, + region=region, + vendor=vendor, + namespace=namespace, + task=task, + token=token, + ) + + _print_endpoint(endpoint) + + +@catalog_app.command(name="deploy") +def deploy_from_catalog( + name: NameArg, + repo: Annotated[ + str, + typer.Option( + help="The name of the model repository associated with the Inference Endpoint (e.g. 'openai/gpt-oss-120b').", + ), + ], + namespace: NamespaceOpt = None, + token: TokenOpt = None, +) -> None: + """Deploy an Inference Endpoint from the Model Catalog.""" + api = get_hf_api(token=token) + try: + endpoint = api.create_inference_endpoint_from_catalog( + repo_id=repo, + name=name, + namespace=namespace, + token=token, + ) + except HfHubHTTPError as error: + typer.echo(f"Deployment failed: {error}") + raise typer.Exit(code=error.response.status_code) from error + + _print_endpoint(endpoint) + + +def list_catalog( + token: TokenOpt = None, +) -> None: + """List available Catalog models.""" + api = get_hf_api(token=token) + try: + models = api.list_inference_catalog(token=token) + except HfHubHTTPError as error: + typer.echo(f"Catalog fetch failed: {error}") + raise typer.Exit(code=error.response.status_code) from error + + typer.echo(json.dumps({"models": models}, indent=2, sort_keys=True)) + + +catalog_app.command(name="ls")(list_catalog) +ie_cli.command(name="list-catalog", help="List available Catalog models.", hidden=True)(list_catalog) + + +ie_cli.add_typer(catalog_app, name="catalog") + + +@ie_cli.command() +def describe( + name: NameArg, + namespace: NamespaceOpt = None, + token: TokenOpt = None, +) -> None: + """Get information about an existing endpoint.""" + api = get_hf_api(token=token) + try: + endpoint = api.get_inference_endpoint(name=name, namespace=namespace, token=token) + except HfHubHTTPError as error: + typer.echo(f"Fetch failed: {error}") + raise typer.Exit(code=error.response.status_code) from error + + _print_endpoint(endpoint) + + +@ie_cli.command() +def update( + name: NameArg, + namespace: NamespaceOpt = None, + repo: Annotated[ + Optional[str], + typer.Option( + help="The name of the model repository associated with the Inference Endpoint (e.g. 'openai/gpt-oss-120b').", + ), + ] = None, + accelerator: Annotated[ + Optional[str], + typer.Option( + help="The hardware accelerator to be used for inference (e.g. 'cpu').", + ), + ] = None, + instance_size: Annotated[ + Optional[str], + typer.Option( + help="The size or type of the instance to be used for hosting the model (e.g. 'x4').", + ), + ] = None, + instance_type: Annotated[ + Optional[str], + typer.Option( + help="The cloud instance type where the Inference Endpoint will be deployed (e.g. 
'intel-icl').", + ), + ] = None, + framework: Annotated[ + Optional[str], + typer.Option( + help="The machine learning framework used for the model (e.g. 'custom').", + ), + ] = None, + revision: Annotated[ + Optional[str], + typer.Option( + help="The specific model revision to deploy on the Inference Endpoint (e.g. '6c0e6080953db56375760c0471a8c5f2929baf11').", + ), + ] = None, + task: Annotated[ + Optional[str], + typer.Option( + help="The task on which to deploy the model (e.g. 'text-classification').", + ), + ] = None, + min_replica: Annotated[ + Optional[int], + typer.Option( + help="The minimum number of replicas (instances) to keep running for the Inference Endpoint.", + ), + ] = None, + max_replica: Annotated[ + Optional[int], + typer.Option( + help="The maximum number of replicas (instances) to scale to for the Inference Endpoint.", + ), + ] = None, + scale_to_zero_timeout: Annotated[ + Optional[int], + typer.Option( + help="The duration in minutes before an inactive endpoint is scaled to zero.", + ), + ] = None, + token: TokenOpt = None, +) -> None: + """Update an existing endpoint.""" + api = get_hf_api(token=token) + try: + endpoint = api.update_inference_endpoint( + name=name, + namespace=namespace, + repository=repo, + framework=framework, + revision=revision, + task=task, + accelerator=accelerator, + instance_size=instance_size, + instance_type=instance_type, + min_replica=min_replica, + max_replica=max_replica, + scale_to_zero_timeout=scale_to_zero_timeout, + token=token, + ) + except HfHubHTTPError as error: + typer.echo(f"Update failed: {error}") + raise typer.Exit(code=error.response.status_code) from error + _print_endpoint(endpoint) + + +@ie_cli.command() +def delete( + name: NameArg, + namespace: NamespaceOpt = None, + yes: Annotated[ + bool, + typer.Option("--yes", help="Skip confirmation prompts."), + ] = False, + token: TokenOpt = None, +) -> None: + """Delete an Inference Endpoint permanently.""" + if not yes: + confirmation = typer.prompt(f"Delete endpoint '{name}'? 
Type the name to confirm.") + if confirmation != name: + typer.echo("Aborted.") + raise typer.Exit(code=2) + + api = get_hf_api(token=token) + try: + api.delete_inference_endpoint(name=name, namespace=namespace, token=token) + except HfHubHTTPError as error: + typer.echo(f"Delete failed: {error}") + raise typer.Exit(code=error.response.status_code) from error + + typer.echo(f"Deleted '{name}'.") + + +@ie_cli.command() +def pause( + name: NameArg, + namespace: NamespaceOpt = None, + token: TokenOpt = None, +) -> None: + """Pause an Inference Endpoint.""" + api = get_hf_api(token=token) + try: + endpoint = api.pause_inference_endpoint(name=name, namespace=namespace, token=token) + except HfHubHTTPError as error: + typer.echo(f"Pause failed: {error}") + raise typer.Exit(code=error.response.status_code) from error + + _print_endpoint(endpoint) + + +@ie_cli.command() +def resume( + name: NameArg, + namespace: NamespaceOpt = None, + fail_if_already_running: Annotated[ + bool, + typer.Option( + "--fail-if-already-running", + help="If `True`, the method will raise an error if the Inference Endpoint is already running.", + ), + ] = False, + token: TokenOpt = None, +) -> None: + """Resume an Inference Endpoint.""" + api = get_hf_api(token=token) + try: + endpoint = api.resume_inference_endpoint( + name=name, + namespace=namespace, + token=token, + running_ok=not fail_if_already_running, + ) + except HfHubHTTPError as error: + typer.echo(f"Resume failed: {error}") + raise typer.Exit(code=error.response.status_code) from error + _print_endpoint(endpoint) + + +@ie_cli.command() +def scale_to_zero( + name: NameArg, + namespace: NamespaceOpt = None, + token: TokenOpt = None, +) -> None: + """Scale an Inference Endpoint to zero.""" + api = get_hf_api(token=token) + try: + endpoint = api.scale_to_zero_inference_endpoint(name=name, namespace=namespace, token=token) + except HfHubHTTPError as error: + typer.echo(f"Scale To Zero failed: {error}") + raise typer.Exit(code=error.response.status_code) from error + + _print_endpoint(endpoint) diff --git a/tests/test_cli.py b/tests/test_cli.py index f9f66c1c2c..03abb29a36 100644 --- a/tests/test_cli.py +++ b/tests/test_cli.py @@ -1265,6 +1265,236 @@ def test_repo_delete_with_all_options(self, runner: CliRunner) -> None: ) +class TestInferenceEndpointsCommands: + def test_list(self, runner: CliRunner) -> None: + endpoint = Mock(raw={"name": "demo"}) + with patch("huggingface_hub.cli.inference_endpoints.get_hf_api") as api_cls: + api = api_cls.return_value + api.list_inference_endpoints.return_value = [endpoint] + result = runner.invoke(app, ["endpoints", "ls"]) + assert result.exit_code == 0 + api_cls.assert_called_once_with(token=None) + api.list_inference_endpoints.assert_called_once_with(namespace=None, token=None) + assert '"items"' in result.stdout + assert '"name": "demo"' in result.stdout + + def test_inference_endpoints_alias(self, runner: CliRunner) -> None: + endpoint = Mock(raw={"name": "alias"}) + with patch("huggingface_hub.cli.inference_endpoints.get_hf_api") as api_cls: + api = api_cls.return_value + api.list_inference_endpoints.return_value = [endpoint] + result = runner.invoke(app, ["endpoints", "ls"]) + assert result.exit_code == 0 + api_cls.assert_called_once_with(token=None) + api.list_inference_endpoints.assert_called_once_with(namespace=None, token=None) + assert '"name": "alias"' in result.stdout + + def test_deploy_from_hub(self, runner: CliRunner) -> None: + endpoint = Mock(raw={"name": "hub"}) + with 
patch("huggingface_hub.cli.inference_endpoints.get_hf_api") as api_cls: + api = api_cls.return_value + api.create_inference_endpoint.return_value = endpoint + result = runner.invoke( + app, + [ + "endpoints", + "deploy", + "my-endpoint", + "--repo", + "my-repo", + "--framework", + "custom", + "--accelerator", + "cpu", + "--instance-size", + "x4", + "--instance-type", + "standard", + "--region", + "us-east-1", + "--vendor", + "aws", + ], + ) + assert result.exit_code == 0 + api_cls.assert_called_once_with(token=None) + api.create_inference_endpoint.assert_called_once_with( + name="my-endpoint", + repository="my-repo", + framework="custom", + accelerator="cpu", + instance_size="x4", + instance_type="standard", + region="us-east-1", + vendor="aws", + namespace=None, + token=None, + task=None, + ) + assert '"name": "hub"' in result.stdout + + def test_deploy_from_catalog(self, runner: CliRunner) -> None: + endpoint = Mock(raw={"name": "catalog"}) + with patch("huggingface_hub.cli.inference_endpoints.get_hf_api") as api_cls: + api = api_cls.return_value + api.create_inference_endpoint_from_catalog.return_value = endpoint + result = runner.invoke( + app, + [ + "endpoints", + "catalog", + "deploy", + "catalog-endpoint", + "--repo", + "catalog/model", + ], + ) + assert result.exit_code == 0 + api_cls.assert_called_once_with(token=None) + api.create_inference_endpoint_from_catalog.assert_called_once_with( + repo_id="catalog/model", + name="catalog-endpoint", + namespace=None, + token=None, + ) + assert '"name": "catalog"' in result.stdout + + def test_describe(self, runner: CliRunner) -> None: + endpoint = Mock(raw={"name": "describe"}) + with patch("huggingface_hub.cli.inference_endpoints.get_hf_api") as api_cls: + api = api_cls.return_value + api.get_inference_endpoint.return_value = endpoint + result = runner.invoke(app, ["endpoints", "describe", "my-endpoint"]) + assert result.exit_code == 0 + api_cls.assert_called_once_with(token=None) + api.get_inference_endpoint.assert_called_once_with(name="my-endpoint", namespace=None, token=None) + assert '"name": "describe"' in result.stdout + + def test_update(self, runner: CliRunner) -> None: + endpoint = Mock(raw={"name": "updated"}) + with patch("huggingface_hub.cli.inference_endpoints.get_hf_api") as api_cls: + api = api_cls.return_value + api.update_inference_endpoint.return_value = endpoint + result = runner.invoke( + app, + [ + "endpoints", + "update", + "my-endpoint", + "--repo", + "my-repo", + "--accelerator", + "gpu", + "--instance-size", + "x4", + ], + ) + assert result.exit_code == 0 + api_cls.assert_called_once_with(token=None) + api.update_inference_endpoint.assert_called_once_with( + name="my-endpoint", + namespace=None, + repository="my-repo", + framework=None, + revision=None, + task=None, + accelerator="gpu", + instance_size="x4", + instance_type=None, + min_replica=None, + max_replica=None, + scale_to_zero_timeout=None, + token=None, + ) + assert '"name": "updated"' in result.stdout + + def test_delete(self, runner: CliRunner) -> None: + with patch("huggingface_hub.cli.inference_endpoints.get_hf_api") as api_cls: + api = api_cls.return_value + result = runner.invoke(app, ["endpoints", "delete", "my-endpoint", "--yes"]) + assert result.exit_code == 0 + api_cls.assert_called_once_with(token=None) + api.delete_inference_endpoint.assert_called_once_with(name="my-endpoint", namespace=None, token=None) + assert "Deleted 'my-endpoint'." 
in result.stdout + + def test_pause(self, runner: CliRunner) -> None: + endpoint = Mock(raw={"name": "paused"}) + with patch("huggingface_hub.cli.inference_endpoints.get_hf_api") as api_cls: + api = api_cls.return_value + api.pause_inference_endpoint.return_value = endpoint + result = runner.invoke(app, ["endpoints", "pause", "my-endpoint"]) + assert result.exit_code == 0 + api_cls.assert_called_once_with(token=None) + api.pause_inference_endpoint.assert_called_once_with(name="my-endpoint", namespace=None, token=None) + assert '"name": "paused"' in result.stdout + + def test_resume(self, runner: CliRunner) -> None: + endpoint = Mock(raw={"name": "resumed"}) + with patch("huggingface_hub.cli.inference_endpoints.get_hf_api") as api_cls: + api = api_cls.return_value + api.resume_inference_endpoint.return_value = endpoint + result = runner.invoke(app, ["endpoints", "resume", "my-endpoint"]) + assert result.exit_code == 0 + api_cls.assert_called_once_with(token=None) + api.resume_inference_endpoint.assert_called_once_with( + name="my-endpoint", + namespace=None, + token=None, + running_ok=True, + ) + assert '"name": "resumed"' in result.stdout + + def test_resume_fail_if_already_running(self, runner: CliRunner) -> None: + endpoint = Mock(raw={"name": "resumed"}) + with patch("huggingface_hub.cli.inference_endpoints.get_hf_api") as api_cls: + api = api_cls.return_value + api.resume_inference_endpoint.return_value = endpoint + result = runner.invoke( + app, + [ + "endpoints", + "resume", + "my-endpoint", + "--fail-if-already-running", + ], + ) + assert result.exit_code == 0 + api_cls.assert_called_once_with(token=None) + api.resume_inference_endpoint.assert_called_once_with( + name="my-endpoint", + namespace=None, + token=None, + running_ok=False, + ) + assert '"name": "resumed"' in result.stdout + + def test_scale_to_zero(self, runner: CliRunner) -> None: + endpoint = Mock(raw={"name": "zero"}) + with patch("huggingface_hub.cli.inference_endpoints.get_hf_api") as api_cls: + api = api_cls.return_value + api.scale_to_zero_inference_endpoint.return_value = endpoint + result = runner.invoke(app, ["endpoints", "scale-to-zero", "my-endpoint"]) + assert result.exit_code == 0 + api_cls.assert_called_once_with(token=None) + api.scale_to_zero_inference_endpoint.assert_called_once_with( + name="my-endpoint", + namespace=None, + token=None, + ) + assert '"name": "zero"' in result.stdout + + def test_list_catalog(self, runner: CliRunner) -> None: + with patch("huggingface_hub.cli.inference_endpoints.get_hf_api") as api_cls: + api = api_cls.return_value + api.list_inference_catalog.return_value = ["model"] + result = runner.invoke(app, ["endpoints", "catalog", "ls"]) + assert result.exit_code == 0 + api_cls.assert_called_once_with(token=None) + api.list_inference_catalog.assert_called_once_with(token=None) + assert '"models"' in result.stdout + assert '"model"' in result.stdout + + @contextmanager def tmp_current_directory() -> Generator[str, None, None]: with SoftTemporaryDirectory() as tmp_dir:
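One path the tests above do not exercise is the interactive confirmation in `delete`. A sketch of such a test, following the suite's existing patching pattern; the test name is hypothetical, and it assumes the same `runner` fixture, `app` object, and `patch`/`CliRunner` imports already used in `tests/test_cli.py`:

```py
# Hypothetical addition to TestInferenceEndpointsCommands above.
def test_delete_aborts_on_wrong_confirmation(self, runner: CliRunner) -> None:
    # Without --yes, a mismatched confirmation prints "Aborted." and exits with
    # code 2 before any API call is made.
    with patch("huggingface_hub.cli.inference_endpoints.get_hf_api") as api_cls:
        api = api_cls.return_value
        result = runner.invoke(
            app,
            ["endpoints", "delete", "my-endpoint"],
            input="not-the-endpoint\n",
        )
    assert result.exit_code == 2
    assert "Aborted." in result.stdout
    api.delete_inference_endpoint.assert_not_called()
```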