
Commit 68af055

Updated README.md for June 10 release (#574)

* Updated README.md for June 10 release
* Added Docker Manifest git hash

1 parent 93d7a4d commit 68af055

File tree

1 file changed: +55 -45 lines changed


docs/dev-docker/README.md

Lines changed: 55 additions & 45 deletions
@@ -10,9 +10,9 @@ This documentation includes information for running the popular Llama 3.1 series

 The pre-built image includes:

-- ROCm™ 6.3.1
+- ROCm™ 6.4.1
 - HipblasLT 0.15
-- vLLM 0.8.5
+- vLLM 0.9.0.1
 - PyTorch 2.7

 ## Pull latest Docker Image
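For context on the component list above, a minimal sketch of pulling the referenced image and checking the bundled versions could look like the following; the `main` tag comes from the README's own pull instruction, while the version-check one-liner is an assumption, not part of this commit:

```bash
# Pull the most recent validated image (tag taken from the README's pull instruction).
docker pull rocm/vllm-dev:main

# Assumed check: print the bundled PyTorch, vLLM, and HIP versions from a throwaway container.
docker run --rm rocm/vllm-dev:main \
  python3 -c "import torch, vllm; print(torch.__version__, vllm.__version__, torch.version.hip)"
```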
@@ -21,10 +21,14 @@ Pull the most recent validated docker image with `docker pull rocm/vllm-dev:main`

 ## What is New

-- AITER V1 engine performance improvement
+- Updated to ROCm 6.4.1 and vLLM v0.9.0.1
+- AITER MHA
+- IBM 3d kernel for unified attention
+- Full graph capture for split attention

 ## Known Issues and Workarounds
-- None
+
+- No AITER MoE. Do not use VLLM_ROCM_USE_AITER for Mixtral or DeepSeek models.

 ## Performance Results
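A hedged illustration of the workaround noted in the hunk above (keeping AITER disabled for MoE models); the model name and serve flags are placeholders, not taken from this commit:

```bash
# Workaround sketch: keep AITER off for MoE models such as Mixtral or DeepSeek.
# Setting VLLM_ROCM_USE_AITER=0 just makes the default explicit.
export VLLM_ROCM_USE_AITER=0

# Illustrative server start; the model name and TP size are placeholders.
vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 8
```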

@@ -37,14 +41,14 @@ The table below shows performance data where a local inference client is fed requests

 | Model | Precision | TP Size | Input | Output | Num Prompts | Max Num Seqs | Throughput (tokens/s) |
 |-------|-----------|---------|-------|--------|-------------|--------------|-----------------------|
-| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 16622.2 |
-| | | | 128 | 4096 | 1500 | 1500 | 13779.8 |
-| | | | 500 | 2000 | 2000 | 2000 | 13424.9 |
-| | | | 2048 | 2048 | 1500 | 1500 | 8356.5 |
-| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4243.9 |
-| | | | 128 | 4096 | 1500 | 1500 | 3394.4 |
-| | | | 500 | 2000 | 2000 | 2000 | 3201.8 |
-| | | | 2048 | 2048 | 500 | 500 | 2208.0 |
+| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 16581.5 |
+| | | | 128 | 4096 | 1500 | 1500 | 13667.3 |
+| | | | 500 | 2000 | 2000 | 2000 | 13367.1 |
+| | | | 2048 | 2048 | 1500 | 1500 | 8352.6 |
+| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4275.0 |
+| | | | 128 | 4096 | 1500 | 1500 | 3356.7 |
+| | | | 500 | 2000 | 2000 | 2000 | 3201.4 |
+| | | | 2048 | 2048 | 500 | 500 | 2179.7 |

 *TP stands for Tensor Parallelism.*
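The throughput rows above could plausibly be approximated with vLLM's bundled benchmark script; the sketch below shows one row's settings, with the script path and exact flag names treated as assumptions to verify against the vLLM version in the image:

```bash
# Hypothetical reproduction of one throughput row (70B, 128 in / 2048 out, 3200 prompts).
# Script path and flags assume vLLM's benchmarks/benchmark_throughput.py; verify against the image.
python3 /app/vllm/benchmarks/benchmark_throughput.py \
  --model amd/Llama-3.1-70B-Instruct-FP8-KV \
  --tensor-parallel-size 8 \
  --input-len 128 \
  --output-len 2048 \
  --num-prompts 3200 \
  --max-num-seqs 3200
```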

@@ -54,38 +58,38 @@ The table below shows latency measurement, which typically involves assessing the

 | Model | Precision | TP Size | Batch Size | Input | Output | MI300X Latency (sec) |
 |-------|-----------|----------|------------|--------|---------|-------------------|
-| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 15.851 |
-| | | | 2 | 128 | 2048 | 16.995 |
-| | | | 4 | 128 | 2048 | 17.578 |
-| | | | 8 | 128 | 2048 | 19.277 |
-| | | | 16 | 128 | 2048 | 21.111 |
-| | | | 32 | 128 | 2048 | 23.902 |
-| | | | 64 | 128 | 2048 | 30.976 |
-| | | | 128 | 128 | 2048 | 44.107 |
-| | | | 1 | 2048 | 2048 | 15.981 |
-| | | | 2 | 2048 | 2048 | 17.322 |
-| | | | 4 | 2048 | 2048 | 18.025 |
-| | | | 8 | 2048 | 2048 | 20.218 |
-| | | | 16 | 2048 | 2048 | 22.690 |
-| | | | 32 | 2048 | 2048 | 27.407 |
-| | | | 64 | 2048 | 2048 | 37.099 |
-| | | | 128 | 2048 | 2048 | 56.659 |
-| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 45.929 |
-| | | | 2 | 128 | 2048 | 46.871 |
-| | | | 4 | 128 | 2048 | 48.763 |
-| | | | 8 | 128 | 2048 | 51.621 |
-| | | | 16 | 128 | 2048 | 54.822 |
-| | | | 32 | 128 | 2048 | 63.642 |
-| | | | 64 | 128 | 2048 | 82.256 |
-| | | | 128 | 128 | 2048 | 110.142 |
-| | | | 1 | 2048 | 2048 | 46.489 |
-| | | | 2 | 2048 | 2048 | 47.465 |
-| | | | 4 | 2048 | 2048 | 49.906 |
-| | | | 8 | 2048 | 2048 | 54.252 |
-| | | | 16 | 2048 | 2048 | 60.275 |
-| | | | 32 | 2048 | 2048 | 74.346 |
-| | | | 64 | 2048 | 2048 | 104.508 |
-| | | | 128 | 2048 | 2048 | 154.134 |
+| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 15.566 |
+| | | | 2 | 128 | 2048 | 16.858 |
+| | | | 4 | 128 | 2048 | 17.518 |
+| | | | 8 | 128 | 2048 | 18.898 |
+| | | | 16 | 128 | 2048 | 21.023 |
+| | | | 32 | 128 | 2048 | 23.896 |
+| | | | 64 | 128 | 2048 | 30.753 |
+| | | | 128 | 128 | 2048 | 43.767 |
+| | | | 1 | 2048 | 2048 | 15.496 |
+| | | | 2 | 2048 | 2048 | 17.380 |
+| | | | 4 | 2048 | 2048 | 17.983 |
+| | | | 8 | 2048 | 2048 | 19.771 |
+| | | | 16 | 2048 | 2048 | 22.702 |
+| | | | 32 | 2048 | 2048 | 27.392 |
+| | | | 64 | 2048 | 2048 | 36.879 |
+| | | | 128 | 2048 | 2048 | 57.003 |
+| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 45.828 |
+| | | | 2 | 128 | 2048 | 46.757 |
+| | | | 4 | 128 | 2048 | 48.322 |
+| | | | 8 | 128 | 2048 | 51.479 |
+| | | | 16 | 128 | 2048 | 54.861 |
+| | | | 32 | 128 | 2048 | 63.119 |
+| | | | 64 | 128 | 2048 | 82.362 |
+| | | | 128 | 128 | 2048 | 109.698 |
+| | | | 1 | 2048 | 2048 | 46.514 |
+| | | | 2 | 2048 | 2048 | 47.271 |
+| | | | 4 | 2048 | 2048 | 49.679 |
+| | | | 8 | 2048 | 2048 | 54.366 |
+| | | | 16 | 2048 | 2048 | 60.390 |
+| | | | 32 | 2048 | 2048 | 74.209 |
+| | | | 64 | 2048 | 2048 | 104.728 |
+| | | | 128 | 2048 | 2048 | 154.041 |

 *TP stands for Tensor Parallelism.*
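Likewise, a single latency row could be approximated with vLLM's latency benchmark; again, the script path and flags below are assumptions rather than the exact command used to produce this table:

```bash
# Hypothetical reproduction of one latency row (70B, batch size 1, 128 in / 2048 out).
# Script path and flags assume vLLM's benchmarks/benchmark_latency.py; verify against the image.
python3 /app/vllm/benchmarks/benchmark_latency.py \
  --model amd/Llama-3.1-70B-Instruct-FP8-KV \
  --tensor-parallel-size 8 \
  --batch-size 1 \
  --input-len 128 \
  --output-len 2048
```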

@@ -487,7 +491,7 @@ To reproduce the release docker:
 ```bash
 git clone https://github.com/ROCm/vllm.git
 cd vllm
-git checkout 91a56009841e11b84a2aeb9cc5aa305ab2808ede
+git checkout 71faa188073d427c57862c45bf17745f3b54b1b1
 docker build -f docker/Dockerfile.rocm -t <your_tag> --build-arg USE_CYTHON=1 .
 ```
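After the build step above, starting a container from the resulting image usually requires passing the ROCm devices through; the flags below follow a common ROCm pattern and are an assumption here, not part of this commit:

```bash
# Sketch: start an interactive container from the image built above.
# Device and group flags are the usual ROCm passthrough options; adjust to your host setup.
docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  --ipc=host --shm-size 16G \
  <your_tag> bash
```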

@@ -504,6 +508,12 @@ Use AITER release candidate branch instead:

 ## Changelog

+20250605_aiter:
+- Updated to ROCm 6.4.1 and vLLM v0.9.0.1
+- AITER MHA
+- IBM 3d kernel for unified attention
+- Full graph capture for split attention
+
 20250521_aiter:
 - AITER V1 engine performance improvement
