38 changes: 22 additions & 16 deletions docs/getting_started/installation.md
@@ -1,7 +1,7 @@
---
title: Installation
---
[](){ #installation }

This guide provides instructions on running vLLM with Intel Gaudi devices.

## Requirements
@@ -16,10 +16,13 @@ This guide provides instructions on running vLLM with Intel Gaudi devices.
[Optimizing Training Platform Guide](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_Training_Platform.html).

## Running vLLM on Gaudi with Docker Compose

Starting with the 1.22 release, we are introducing ready-to-run container images that bundle vLLM and the Gaudi software stack. Please follow the [instructions](https://github.com/vllm-project/vllm-gaudi/tree/main/.cd) to quickly launch vLLM on Gaudi using a prebuilt Docker image and Docker Compose, with options for custom parameters and benchmarking.
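
As a rough sketch of that flow (the linked instructions are authoritative; it is an assumption here that the `.cd` Compose setup accepts the same `MODEL`, `HF_TOKEN`, and `DOCKER_IMAGE` variables as the quickstart later on this page):

```bash
# Sketch only: variable names follow the quickstart below and may differ
# for this repository's .cd setup.
git clone https://github.com/vllm-project/vllm-gaudi
cd vllm-gaudi/.cd

MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="<prebuilt vLLM Gaudi image>" \
docker compose up
```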

## Quick Start Using Dockerfile
# --8<-- [start:docker_quickstart]

## --8<-- [start:docker_quickstart]

Set up the container with the latest Intel Gaudi Software Suite release using the Dockerfile.

=== "Ubuntu"
@@ -34,11 +37,13 @@ Set up the container with the latest Intel Gaudi Software Suite release using th
of [Install Driver and Software](https://docs.habana.ai/en/latest/Installation_Guide/Driver_Installation.html#install-driver-and-software) and "Configure Container
Runtime" section of [Docker Installation](https://docs.habana.ai/en/latest/Installation_Guide/Installation_Methods/Docker_Installation.html#configure-container-runtime).
Make sure you have the ``habanalabs-container-runtime`` package installed and that the ``habana`` container runtime is registered.
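
As a quick sanity check before building the container (a sketch using standard Docker and Debian packaging tools, not part of the official Habana instructions):

```bash
# Confirm the habanalabs-container-runtime package is installed (Debian/Ubuntu)
dpkg -l | grep habanalabs-container-runtime

# Confirm Docker lists "habana" among its registered runtimes
docker info | grep -iA3 runtimes
```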
# --8<-- [end:docker_quickstart]

## --8<-- [end:docker_quickstart]

## Build from Source

### Environment Verification

To verify that the Intel Gaudi software was correctly installed, run the following:

$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
@@ -62,11 +67,12 @@ Use the following commands to run a Docker image. Make sure to update the versio

=== "Step 1: Get Last good commit on vllm"

NOTE: vllm-gaudi is always follow latest vllm commit, however, vllm upstream
API update may crash vllm-gaudi, this commit saved is verified with vllm-gaudi
in a hourly basis
!!! note
vLLM-Gaudi always follows the latest vLLM commit. However, updates to the upstream vLLM
API may cause vLLM-Gaudi to crash. This saved commit has been verified with vLLM-Gaudi
on an hourly basis.

```bash{.console}
```bash
git clone https://github.com/vllm-project/vllm-gaudi
cd vllm-gaudi
export VLLM_COMMIT_HASH=$(git show "origin/vllm/last-good-commit-for-vllm-gaudi:VLLM_STABLE_COMMIT" 2>/dev/null)
@@ -76,7 +82,7 @@ Use the following commands to run a Docker image. Make sure to update the versio

Install vLLM with `pip` or [from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source)

```bash{.console}
```bash
# Build vLLM from source for empty platform, reusing existing torch installation
git clone https://github.com/vllm-project/vllm
cd vllm
@@ -88,14 +94,14 @@ Use the following commands to run a Docker image. Make sure to update the versio

=== "Step 3: Install vLLM Plugin"

Install vLLM-Gaudi from source:
```{.console}
cd vllm-gaudi
pip install -e .
cd ..
Install vLLM-Gaudi from source:
```bash
cd vllm-gaudi
pip install -e .
cd ..
```
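
Optionally, verify the installation afterwards (a sketch; the exact distribution name the plugin registers under is an assumption):

```bash
# Confirm vLLM imports cleanly and report its version
python -c "import vllm; print(vllm.__version__)"

# Look for the vLLM packages among installed distributions
# (the plugin showing up as "vllm-gaudi" is an assumption)
pip list | grep -i vllm
```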

### Build and Install vLLM with nixl:
### Build and Install vLLM with nixl

=== "Install vLLM Plugin with nixl"

@@ -107,7 +113,7 @@ Use the following commands to run a Docker image. Make sure to update the versio

=== "Install vLLM Gaudi and nixl with Docker file"

```{.console}
```bash
docker build -t ubuntu.pytorch.vllm.nixl.latest \
-f .cd/Dockerfile.ubuntu.pytorch.vllm.nixl.latest github.com/vllm-project/vllm-gaudi
docker run -it --rm --runtime=habana \
@@ -119,7 +125,7 @@ Use the following commands to run a Docker image. Make sure to update the versio

=== "Full installation from source vLLM Gaudi with nixl"

```{.console}
```bash
# Fetch last good commit on vllm
git clone https://github.com/vllm-project/vllm-gaudi
cd vllm-gaudi
203 changes: 196 additions & 7 deletions docs/getting_started/quickstart.md
@@ -1,12 +1,13 @@
---
title: Quickstart
---
[](){ #quickstart }

This guide will help you quickly get started with vLLM to perform:
## vLLM Quick Start Guide

- [Offline batched inference][quickstart-offline]
- [Online serving using OpenAI-compatible server][quickstart-online]
This guide shows how to quickly launch vLLM on Gaudi using a prebuilt Docker
image with Docker Compose, which is supported on Ubuntu only. It supports model
benchmarking, custom runtime parameters, and a selection of validated models,
including Llama, Mistral, and Qwen. Advanced configuration is available via
environment variables or YAML files.

## Requirements

@@ -19,15 +20,204 @@ This guide will help you quickly get started with vLLM to perform:
To achieve the best performance on HPU, please follow the methods outlined in the
[Optimizing Training Platform Guide](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_Training_Platform.html).

## Quick Start Using Dockerfile
## Running vLLM on Gaudi with Docker Compose

--8<-- "docs/getting_started/installation.md:docker_quickstart"
Follow the steps below to run the vLLM server or launch benchmarks on Gaudi using Docker Compose.

### 1. Clone the vLLM fork repository and navigate to the `.cd` directory

git clone https://github.com/HabanaAI/vllm-fork.git
cd vllm-fork/.cd/

This ensures you have the required files and Docker Compose configurations.

### 2. Set the following environment variables

| **Variable** | **Description** |
| --- |--- |
| `MODEL` | Choose a model name from the [`vllm supported models`][supported-models] list. |
| `HF_TOKEN` | Your Hugging Face token (generate one at <https://huggingface.co>). |
| `DOCKER_IMAGE` | The Docker image name or URL for the vLLM Gaudi container. When using the Gaudi repository, make sure to select Docker images with the *vllm-installer* prefix in the image name. |

### 3. Run the vLLM server using Docker Compose

MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="vault.habana.ai/gaudi-docker/|Version|/ubuntu22.04/habanalabs/vllm-installer-|PT_VERSION|:latest" \
docker compose up

To automatically run benchmarking for a selected model using default settings, add the `--profile benchmark` option to the `docker compose up` command:

MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE=="vault.habana.ai/gaudi-docker/|Version|/ubuntu22.04/habanalabs/vllm-installer-|PT_VERSION|:latest" \
docker compose --profile benchmark up

This command launches the vLLM server and runs the associated benchmark suite.
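
Once the server reports it is ready, you can send a quick request to confirm it is serving (a sketch, assuming the Compose setup publishes vLLM's default port 8000 on the host, as the `docker run` example further below does):

```bash
# List the models currently being served
curl http://localhost:8000/v1/models

# Send a minimal completion request to the OpenAI-compatible endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-14B-Instruct", "prompt": "Hello, Gaudi!", "max_tokens": 16}'
```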

## Advanced Options

The following steps cover optional advanced configurations for
running the vLLM server and benchmark. These allow you to fine-tune performance,
memory usage, and request handling using additional environment variables or configuration files.
For most users, the basic setup is sufficient, but advanced users may benefit from these customizations.

=== "Run vLLM Using Docker Compose with Custom Parameters"

To override default settings, you can provide additional environment variables when starting the server. This advanced method allows fine-tuning for performance and memory usage.

**Environment variables**

| **Variable** | **Description** |
|---|---|
| `PT_HPU_LAZY_MODE` | Enables Lazy execution mode, potentially improving performance by batching operations. |
| `VLLM_SKIP_WARMUP` | Skips the model warmup phase to reduce startup time (may affect initial latency). |
| `MAX_MODEL_LEN` | Sets the maximum supported sequence length for the model. |
| `MAX_NUM_SEQS` | Specifies the maximum number of sequences processed concurrently. |
| `TENSOR_PARALLEL_SIZE` | Defines the degree of tensor parallelism. |
| `VLLM_EXPONENTIAL_BUCKETING` | Enables or disables exponential bucketing for warmup strategy. |
| `VLLM_DECODE_BLOCK_BUCKET_STEP` | Configures the step size for decode block allocation, affecting memory granularity. |
| `VLLM_DECODE_BS_BUCKET_STEP` | Sets the batch size step for decode operations, impacting how decode batches are grouped. |
| `VLLM_PROMPT_BS_BUCKET_STEP` | Adjusts the batch size step for prompt processing. |
| `VLLM_PROMPT_SEQ_BUCKET_STEP` | Controls the step size for prompt sequence allocation. |

**Example**

```bash
MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="vault.habana.ai/gaudi-docker/|Version|/ubuntu22.04/habanalabs/vllm-installer-|PT_VERSION|:latest" \
TENSOR_PARALLEL_SIZE=1 \
MAX_MODEL_LEN=2048 \
docker compose up
```

=== "Run vLLM and Benchmark with Custom Parameters"

You can customize benchmark behavior by setting additional environment variables before running Docker Compose.

**Benchmark parameters:**

| **Variable** | **Description** |
|---|---|
| `INPUT_TOK` | Number of input tokens per prompt. |
| `OUTPUT_TOK` | Number of output tokens to generate per prompt. |
| `CON_REQ` | Number of concurrent requests during benchmarking. |
| `NUM_PROMPTS`| Total number of prompts to use in the benchmark. |

**Example:**

```bash
MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="vault.habana.ai/gaudi-docker/|Version|/ubuntu22.04/habanalabs/vllm-installer-|PT_VERSION|:latest" \
INPUT_TOK=128 \
OUTPUT_TOK=128 \
CON_REQ=16 \
NUM_PROMPTS=64 \
docker compose --profile benchmark up
```

This launches the vLLM server and runs the benchmark using your specified parameters.

=== "Run vLLM and Benchmark with Combined Custom Parameters"

You can launch the vLLM server and benchmark together, providing any combination of server and benchmark-specific parameters.

**Example:**

```bash
MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="vault.habana.ai/gaudi-docker/|Version|/ubuntu22.04/habanalabs/vllm-installer-|PT_VERSION|:latest" \
TENSOR_PARALLEL_SIZE=1 \
MAX_MODEL_LEN=2048 \
INPUT_TOK=128 \
OUTPUT_TOK=128 \
CON_REQ=16 \
NUM_PROMPTS=64 \
docker compose --profile benchmark up
```

This command starts the server and executes benchmarking with the provided configuration.

=== "Run vLLM and Benchmark Using Configuration Files"

You can also configure the server and benchmark via YAML configuration files. Set the following environment variables:

| **Variable** | **Description** |
|---|---|
| `VLLM_SERVER_CONFIG_FILE` | Path to the server config file inside the Docker container. |
| `VLLM_SERVER_CONFIG_NAME` | Name of the server config section. |
| `VLLM_BENCHMARK_CONFIG_FILE` | Path to the benchmark config file inside the container. |
| `VLLM_BENCHMARK_CONFIG_NAME` | Name of the benchmark config section. |

**Example**

```bash
HF_TOKEN=<your huggingface token> \
VLLM_SERVER_CONFIG_FILE=server_configurations/server_text.yaml \
VLLM_SERVER_CONFIG_NAME=llama31_8b_instruct \
VLLM_BENCHMARK_CONFIG_FILE=benchmark_configurations/benchmark_text.yaml \
VLLM_BENCHMARK_CONFIG_NAME=llama31_8b_instruct \
docker compose --profile benchmark up
```

!!! note
When using configuration files, you do not need to set the `MODEL` variable because the model details are included in the config files. However, the `HF_TOKEN` variable is still required.

=== "Run vLLM Directly Using Docker"

For maximum control, you can run the server directly using the `docker run` command, allowing full customization of Docker runtime settings.

**Example:**

```bash
docker run -it --rm \
-e MODEL=$MODEL \
-e HF_TOKEN=$HF_TOKEN \
-e http_proxy=$http_proxy \
-e https_proxy=$https_proxy \
-e no_proxy=$no_proxy \
--cap-add=sys_nice \
--ipc=host \
--runtime=habana \
-e HABANA_VISIBLE_DEVICES=all \
-p 8000:8000 \
--name vllm-server \
<docker image name>
```

This method provides full flexibility over how the vLLM server is executed within the container.
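
Since the example names the container `vllm-server`, the standard Docker commands can be used to follow its logs or shut it down (shown as a sketch):

```bash
# Follow the server logs
docker logs -f vllm-server

# Stop the server; the --rm flag above removes the container on exit
docker stop vllm-server
```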

---

## Supported Models

| **Model Name** | **Validated TP Size** |
|---|---|
| deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 8 |
| meta-llama/Llama-3.1-70B-Instruct | 4 |
| meta-llama/Llama-3.1-405B-Instruct | 8 |
| meta-llama/Llama-3.1-8B-Instruct | 1 |
| meta-llama/Llama-3.3-70B-Instruct | 4 |
| mistralai/Mistral-7B-Instruct-v0.2 | 1 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 2 |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | 4 |
| Qwen/Qwen2.5-7B-Instruct | 1 |
| Qwen/Qwen2.5-VL-7B-Instruct | 1 |
| Qwen/Qwen2.5-14B-Instruct | 1 |
| Qwen/Qwen2.5-32B-Instruct | 1 |
| Qwen/Qwen2.5-72B-Instruct | 4 |
| ibm-granite/granite-8b-code-instruct-4k | 1 |
| ibm-granite/granite-20b-code-instruct-8k | 1 |
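
The validated TP sizes above map onto the `TENSOR_PARALLEL_SIZE` variable from the advanced options. For example, a sketch for a model validated at TP size 4 (reusing the placeholder image URL from the earlier examples):

```bash
MODEL="meta-llama/Llama-3.3-70B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="vault.habana.ai/gaudi-docker/|Version|/ubuntu22.04/habanalabs/vllm-installer-|PT_VERSION|:latest" \
TENSOR_PARALLEL_SIZE=4 \
docker compose up
```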

## Executing Inference

=== "Offline Batched Inference"

[](){ #quickstart-offline }

```python
from vllm import LLM, SamplingParams

@@ -48,7 +238,6 @@

=== "OpenAI Completions API"

[](){ #quickstart-online }
WIP

=== "OpenAI Chat Completions API with vLLM"
3 changes: 0 additions & 3 deletions docs/models/validated_models.md
@@ -17,9 +17,6 @@ The following configurations have been validated to function with Gaudi 2 or Gau
| [meta-llama/Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) | 2, 4, 8 | BF16, FP8, FP16 (Gaudi 2) |Gaudi 2, Gaudi 3|
| [meta-llama/Meta-Llama-3.1-405B](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B) | 8 | BF16, FP8 |Gaudi 3|
| [meta-llama/Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) | 8 | BF16, FP8 |Gaudi 3|
| [meta-llama/Llama-3.2-11B-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision) | 1 | BF16, FP8 | Gaudi 2, Gaudi 3|
| [meta-llama/Llama-3.2-90B-Vision](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision) | 4, 8 (min. for Gaudi 2) | BF16, FP8 | Gaudi 2, Gaudi 3|
| [meta-llama/Llama-3.2-90B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct) | 4, 8 (min. for Gaudi 2) | BF16 | Gaudi 2, Gaudi 3 |
| [meta-llama/Meta-Llama-3.3-70B](https://huggingface.co/meta-llama/Llama-3.3-70B) | 4 | BF16, FP8 | Gaudi 3|
| [meta-llama/Granite-3B-code-instruct-128k](https://huggingface.co/ibm-granite/granite-3b-code-instruct-128k) | 1 | BF16 | Gaudi 3|
| [meta-llama/Granite-3.0-8B-instruct](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct) | 1 | BF16, FP8 | Gaudi 2, Gaudi 3|
1 change: 1 addition & 0 deletions mkdocs.yaml
@@ -82,6 +82,7 @@ plugins:

markdown_extensions:
- attr_list
- sane_lists
- md_in_html
- admonition
- pymdownx.details