38 changes: 22 additions & 16 deletions docs/getting_started/installation.md
@@ -1,7 +1,7 @@
---
title: Installation
---
[](){ #installation }

This guide provides instructions on running vLLM with Intel Gaudi devices.

## Requirements
@@ -16,10 +16,13 @@ This guide provides instructions on running vLLM with Intel Gaudi devices.
[Optimizing Training Platform Guide](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_Training_Platform.html).

## Running vLLM on Gaudi with Docker Compose

Starting with the 1.22 release, we are introducing ready-to-run container images that bundle vLLM and the Gaudi software stack. Please follow the [instructions](https://github.com/vllm-project/vllm-gaudi/tree/main/.cd) to quickly launch vLLM on Gaudi using a prebuilt Docker image and Docker Compose, with options for custom parameters and benchmarking.
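
As a rough sketch of that flow (the linked instructions are authoritative; it is an assumption here that the `.cd` Compose setup accepts the same `MODEL`, `HF_TOKEN`, and `DOCKER_IMAGE` variables as the quickstart later on this page):

```bash
# Sketch only: variable names follow the quickstart below and may differ
# for this repository's .cd setup.
git clone https://github.com/vllm-project/vllm-gaudi
cd vllm-gaudi/.cd

MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="<prebuilt vLLM Gaudi image>" \
docker compose up
```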

## Quick Start Using Dockerfile
# --8<-- [start:docker_quickstart]

## --8<-- [start:docker_quickstart]

Set up the container with the latest Intel Gaudi Software Suite release using the Dockerfile.

=== "Ubuntu"
@@ -34,11 +37,13 @@ Set up the container with the latest Intel Gaudi Software Suite release using th
of [Install Driver and Software](https://docs.habana.ai/en/latest/Installation_Guide/Driver_Installation.html#install-driver-and-software) and "Configure Container
Runtime" section of [Docker Installation](https://docs.habana.ai/en/latest/Installation_Guide/Installation_Methods/Docker_Installation.html#configure-container-runtime).
Make sure you have the ``habanalabs-container-runtime`` package installed and that the ``habana`` container runtime is registered.
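
As a quick sanity check before building the container (a sketch using standard Docker and Debian packaging tools, not part of the official Habana instructions):

```bash
# Confirm the habanalabs-container-runtime package is installed (Debian/Ubuntu)
dpkg -l | grep habanalabs-container-runtime

# Confirm Docker lists "habana" among its registered runtimes
docker info | grep -iA3 runtimes
```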
# --8<-- [end:docker_quickstart]

## --8<-- [end:docker_quickstart]

## Build from Source

### Environment Verification

To verify that the Intel Gaudi software was correctly installed, run the following:

$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
@@ -62,11 +67,12 @@ Use the following commands to run a Docker image. Make sure to update the versio

=== "Step 1: Get Last good commit on vllm"

NOTE: vllm-gaudi is always follow latest vllm commit, however, vllm upstream
API update may crash vllm-gaudi, this commit saved is verified with vllm-gaudi
in a hourly basis
!!! note
vLLM-Gaudi always follows the latest vLLM commit. However, updates to the upstream vLLM
API may cause vLLM-Gaudi to crash. This saved commit has been verified with vLLM-Gaudi
on an hourly basis.

```bash{.console}
```bash
git clone https://github.com/vllm-project/vllm-gaudi
cd vllm-gaudi
export VLLM_COMMIT_HASH=$(git show "origin/vllm/last-good-commit-for-vllm-gaudi:VLLM_STABLE_COMMIT" 2>/dev/null)
@@ -76,7 +82,7 @@ Use the following commands to run a Docker image. Make sure to update the versio

Install vLLM with `pip` or [from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source)

```bash{.console}
```bash
# Build vLLM from source for empty platform, reusing existing torch installation
git clone https://github.com/vllm-project/vllm
cd vllm
@@ -88,14 +94,14 @@ Use the following commands to run a Docker image. Make sure to update the versio

=== "Step 3: Install vLLM Plugin"

Install vLLM-Gaudi from source:
```{.console}
cd vllm-gaudi
pip install -e .
cd ..
Install vLLM-Gaudi from source:
```bash
cd vllm-gaudi
pip install -e .
cd ..
```
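
Optionally, verify the installation afterwards (a sketch; the exact distribution name the plugin registers under is an assumption):

```bash
# Confirm vLLM imports cleanly and report its version
python -c "import vllm; print(vllm.__version__)"

# Look for the vLLM packages among installed distributions
# (the plugin showing up as "vllm-gaudi" is an assumption)
pip list | grep -i vllm
```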

### Build and Install vLLM with nixl:
### Build and Install vLLM with nixl

=== "Install vLLM Plugin with nixl"

@@ -107,7 +113,7 @@ Use the following commands to run a Docker image. Make sure to update the versio

=== "Install vLLM Gaudi and nixl with Docker file"

```{.console}
```bash
docker build -t ubuntu.pytorch.vllm.nixl.latest \
-f .cd/Dockerfile.ubuntu.pytorch.vllm.nixl.latest github.com/vllm-project/vllm-gaudi
docker run -it --rm --runtime=habana \
@@ -119,7 +125,7 @@ Use the following commands to run a Docker image. Make sure to update the versio

=== "Full installation from source vLLM Gaudi with nixl"

```{.console}
```bash
# Fetch last good commit on vllm
git clone https://github.com/vllm-project/vllm-gaudi
cd vllm-gaudi
203 changes: 196 additions & 7 deletions docs/getting_started/quickstart.md
@@ -1,12 +1,13 @@
---
title: Quickstart
---
[](){ #quickstart }

This guide will help you quickly get started with vLLM to perform:
## vLLM Quick Start Guide

- [Offline batched inference][quickstart-offline]
- [Online serving using OpenAI-compatible server][quickstart-online]
This guide shows how to quickly launch vLLM on Gaudi using a prebuilt Docker
image with Docker Compose, which is supported on Ubuntu only. It supports model
benchmarking, custom runtime parameters, and a selection of validated models,
including Llama, Mistral, and Qwen. Advanced configuration is available via
environment variables or YAML files.

## Requirements

@@ -19,15 +20,204 @@ This guide will help you quickly get started with vLLM to perform:
To achieve the best performance on HPU, please follow the methods outlined in the
[Optimizing Training Platform Guide](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_Training_Platform.html).

## Quick Start Using Dockerfile
## Running vLLM on Gaudi with Docker Compose

--8<-- "docs/getting_started/installation.md:docker_quickstart"
Follow the steps below to run the vLLM server or launch benchmarks on Gaudi using Docker Compose.

### 1. Clone the vLLM fork repository and navigate to the `.cd` directory

git clone https://github.com/HabanaAI/vllm-fork.git
cd vllm-fork/.cd/

This ensures you have the required files and Docker Compose configurations.

### 2. Set the following environment variables

| **Variable** | **Description** |
| --- |--- |
| `MODEL` | Choose a model name from the [`vllm supported models`][supported-models] list. |
| `HF_TOKEN` | Your Hugging Face token (generate one at <https://huggingface.co>). |
| `DOCKER_IMAGE` | The Docker image name or URL for the vLLM Gaudi container. When using the Gaudi repository, make sure to select Docker images with the *vllm-installer* prefix in the image name. |

### 3. Run the vLLM server using Docker Compose

MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="vault.habana.ai/gaudi-docker/|Version|/ubuntu22.04/habanalabs/vllm-installer-|PT_VERSION|:latest" \
docker compose up

To automatically run benchmarking for a selected model using default settings, add the `--profile benchmark` option to the `docker compose up` command:

MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE=="vault.habana.ai/gaudi-docker/|Version|/ubuntu22.04/habanalabs/vllm-installer-|PT_VERSION|:latest" \
docker compose --profile benchmark up

This command launches the vLLM server and runs the associated benchmark suite.
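
Once the server reports it is ready, you can send a quick request to confirm it is serving (a sketch, assuming the Compose setup publishes vLLM's default port 8000 on the host, as the `docker run` example further below does):

```bash
# List the models currently being served
curl http://localhost:8000/v1/models

# Send a minimal completion request to the OpenAI-compatible endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-14B-Instruct", "prompt": "Hello, Gaudi!", "max_tokens": 16}'
```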

## Advanced Options

The following steps cover optional advanced configurations for
running the vLLM server and benchmark. These allow you to fine-tune performance,
memory usage, and request handling using additional environment variables or configuration files.
For most users, the basic setup is sufficient, but advanced users may benefit from these customizations.

=== "Run vLLM Using Docker Compose with Custom Parameters"

To override default settings, you can provide additional environment variables when starting the server. This advanced method allows fine-tuning for performance and memory usage.

**Environment variables**

| **Variable** | **Description** |
|---|---|
| `PT_HPU_LAZY_MODE` | Enables Lazy execution mode, potentially improving performance by batching operations. |
| `VLLM_SKIP_WARMUP` | Skips the model warmup phase to reduce startup time (may affect initial latency). |
| `MAX_MODEL_LEN` | Sets the maximum supported sequence length for the model. |
| `MAX_NUM_SEQS` | Specifies the maximum number of sequences processed concurrently. |
| `TENSOR_PARALLEL_SIZE` | Defines the degree of tensor parallelism. |
| `VLLM_EXPONENTIAL_BUCKETING` | Enables or disables exponential bucketing for warmup strategy. |
| `VLLM_DECODE_BLOCK_BUCKET_STEP` | Configures the step size for decode block allocation, affecting memory granularity. |
| `VLLM_DECODE_BS_BUCKET_STEP` | Sets the batch size step for decode operations, impacting how decode batches are grouped. |
| `VLLM_PROMPT_BS_BUCKET_STEP` | Adjusts the batch size step for prompt processing. |
| `VLLM_PROMPT_SEQ_BUCKET_STEP` | Controls the step size for prompt sequence allocation. |

**Example**

```bash
MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="vault.habana.ai/gaudi-docker/|Version|/ubuntu22.04/habanalabs/vllm-installer-|PT_VERSION|:latest" \
TENSOR_PARALLEL_SIZE=1 \
MAX_MODEL_LEN=2048 \
docker compose up
```

=== "Run vLLM and Benchmark with Custom Parameters"

You can customize benchmark behavior by setting additional environment variables before running Docker Compose.

**Benchmark parameters:**

| **Variable** | **Description** |
|---|---|
| `INPUT_TOK` | Number of input tokens per prompt. |
| `OUTPUT_TOK` | Number of output tokens to generate per prompt. |
| `CON_REQ` | Number of concurrent requests during benchmarking. |
| `NUM_PROMPTS`| Total number of prompts to use in the benchmark. |

**Example:**

```bash
MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="vault.habana.ai/gaudi-docker/|Version|/ubuntu22.04/habanalabs/vllm-installer-|PT_VERSION|:latest" \
INPUT_TOK=128 \
OUTPUT_TOK=128 \
CON_REQ=16 \
NUM_PROMPTS=64 \
docker compose --profile benchmark up
```

This launches the vLLM server and runs the benchmark using your specified parameters.

=== "Run vLLM and Benchmark with Combined Custom Parameters"

You can launch the vLLM server and benchmark together, providing any combination of server and benchmark-specific parameters.

**Example:**

```bash
MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="vault.habana.ai/gaudi-docker/|Version|/ubuntu22.04/habanalabs/vllm-installer-|PT_VERSION|:latest" \
TENSOR_PARALLEL_SIZE=1 \
MAX_MODEL_LEN=2048 \
INPUT_TOK=128 \
OUTPUT_TOK=128 \
CON_REQ=16 \
NUM_PROMPTS=64 \
docker compose --profile benchmark up
```

This command starts the server and executes benchmarking with the provided configuration.

=== "Run vLLM and Benchmark Using Configuration Files"

You can also configure the server and benchmark via YAML configuration files. Set the following environment variables:

| **Variable** | **Description** |
|---|---|
| `VLLM_SERVER_CONFIG_FILE` | Path to the server config file inside the Docker container. |
| `VLLM_SERVER_CONFIG_NAME` | Name of the server config section. |
| `VLLM_BENCHMARK_CONFIG_FILE` | Path to the benchmark config file inside the container. |
| `VLLM_BENCHMARK_CONFIG_NAME` | Name of the benchmark config section. |

**Example**

```bash
HF_TOKEN=<your huggingface token> \
VLLM_SERVER_CONFIG_FILE=server_configurations/server_text.yaml \
VLLM_SERVER_CONFIG_NAME=llama31_8b_instruct \
VLLM_BENCHMARK_CONFIG_FILE=benchmark_configurations/benchmark_text.yaml \
VLLM_BENCHMARK_CONFIG_NAME=llama31_8b_instruct \
docker compose --profile benchmark up
```

!!! note
When using configuration files, you do not need to set the `MODEL` variable because the model details are included in the config files. However, the `HF_TOKEN` variable is still required.

=== "Run vLLM Directly Using Docker"

For maximum control, you can run the server directly using the `docker run` command, allowing full customization of Docker runtime settings.

**Example:**

```bash
docker run -it --rm \
-e MODEL=$MODEL \
-e HF_TOKEN=$HF_TOKEN \
-e http_proxy=$http_proxy \
-e https_proxy=$https_proxy \
-e no_proxy=$no_proxy \
--cap-add=sys_nice \
--ipc=host \
--runtime=habana \
-e HABANA_VISIBLE_DEVICES=all \
-p 8000:8000 \
--name vllm-server \
<docker image name>
```

This method provides full flexibility over how the vLLM server is executed within the container.
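
Since the example names the container `vllm-server`, the standard Docker commands can be used to follow its logs or shut it down (shown as a sketch):

```bash
# Follow the server logs
docker logs -f vllm-server

# Stop the server; the --rm flag above removes the container on exit
docker stop vllm-server
```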

---

## Supported Models

| **Model Name** | **Validated TP Size** |
|---|---|
| deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 8 |
| meta-llama/Llama-3.1-70B-Instruct | 4 |
| meta-llama/Llama-3.1-405B-Instruct | 8 |
| meta-llama/Llama-3.1-8B-Instruct | 1 |
| meta-llama/Llama-3.3-70B-Instruct | 4 |
| mistralai/Mistral-7B-Instruct-v0.2 | 1 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 2 |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | 4 |
| Qwen/Qwen2.5-7B-Instruct | 1 |
| Qwen/Qwen2.5-VL-7B-Instruct | 1 |
| Qwen/Qwen2.5-14B-Instruct | 1 |
| Qwen/Qwen2.5-32B-Instruct | 1 |
| Qwen/Qwen2.5-72B-Instruct | 4 |
| ibm-granite/granite-8b-code-instruct-4k | 1 |
| ibm-granite/granite-20b-code-instruct-8k | 1 |
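
The validated TP sizes above map onto the `TENSOR_PARALLEL_SIZE` variable from the advanced options. For example, a sketch for a model validated at TP size 4 (reusing the placeholder image URL from the earlier examples):

```bash
MODEL="meta-llama/Llama-3.3-70B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="vault.habana.ai/gaudi-docker/|Version|/ubuntu22.04/habanalabs/vllm-installer-|PT_VERSION|:latest" \
TENSOR_PARALLEL_SIZE=4 \
docker compose up
```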

## Executing Inference

=== "Offline Batched Inference"

[](){ #quickstart-offline }

```python
from vllm import LLM, SamplingParams

@@ -48,7 +238,6 @@

=== "OpenAI Completions API"

[](){ #quickstart-online }
WIP

=== "OpenAI Chat Completions API with vLLM"
3 changes: 0 additions & 3 deletions docs/models/validated_models.md
@@ -17,9 +17,6 @@ The following configurations have been validated to function with Gaudi 2 or Gau
| [meta-llama/Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) | 2, 4, 8 | BF16, FP8, FP16 (Gaudi 2) |Gaudi 2, Gaudi 3|
| [meta-llama/Meta-Llama-3.1-405B](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B) | 8 | BF16, FP8 |Gaudi 3|
| [meta-llama/Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) | 8 | BF16, FP8 |Gaudi 3|
| [meta-llama/Llama-3.2-11B-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision) | 1 | BF16, FP8 | Gaudi 2, Gaudi 3|
| [meta-llama/Llama-3.2-90B-Vision](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision) | 4, 8 (min. for Gaudi 2) | BF16, FP8 | Gaudi 2, Gaudi 3|
| [meta-llama/Llama-3.2-90B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct) | 4, 8 (min. for Gaudi 2) | BF16 | Gaudi 2, Gaudi 3 |
| [meta-llama/Meta-Llama-3.3-70B](https://huggingface.co/meta-llama/Llama-3.3-70B) | 4 | BF16, FP8 | Gaudi 3|
| [meta-llama/Granite-3B-code-instruct-128k](https://huggingface.co/ibm-granite/granite-3b-code-instruct-128k) | 1 | BF16 | Gaudi 3|
| [meta-llama/Granite-3.0-8B-instruct](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct) | 1 | BF16, FP8 | Gaudi 2, Gaudi 3|
1 change: 1 addition & 0 deletions mkdocs.yaml
@@ -82,6 +82,7 @@ plugins:

markdown_extensions:
- attr_list
- sane_lists
- md_in_html
- admonition
- pymdownx.details