Release v3.7.0
Release Highlights
Llama 3.1 405B Support on MI355X
This release significantly enhances support for large language models by enabling Llama 3.1 405B FP4 to run on a single MI355X GPU, leveraging its higher memory capacity. Key components include:
- FP16 Attention: Optimized attention mechanisms using FP16 precision reduce memory usage and speed up computation, improving inference efficiency.
- FP8 KV-Cache: Support for FP8 precision in the key-value (KV) cache further reduces the memory footprint of transformer attention, enabling faster processing and better utilization of GPU memory (a rough sizing example appears at the end of this section).
- MXFP4 GEMM Kernels: Introduction of GEMM operations in MXFP4 (Microscaling FP4), a low-bit floating-point format designed to maximize performance and memory efficiency without compromising accuracy. MXFP4 shares one 8-bit scale across each block of 32 FP4 values, using the F8E8M0FNU type for the shared scale and F4E2M1FN for the values (a quantization sketch follows below).
Together, these enhancements make it possible to serve extremely large models more effectively on MI355X GPUs, enabling higher throughput and better utilization of resources.
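As an illustration of the MXFP4 layout described above, here is a minimal NumPy sketch of block quantization: one shared power-of-two scale per block of 32 FP4 (E2M1) values. The function names and the scale-selection rule are illustrative assumptions, not the sharktank kernels.

```python
import numpy as np

# Magnitudes representable by the F4E2M1FN (FP4) element type.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize_block(block):
    """Quantize one block of 32 floats into a shared power-of-two scale
    (the E8M0 part) plus 32 signed FP4 magnitudes (indices into FP4_GRID)."""
    assert block.shape == (32,)
    amax = float(np.abs(block).max())
    # Shared scale: a power of two chosen so the largest element lands in the
    # FP4 range (the largest E2M1 exponent is 2, i.e. max magnitude 6.0).
    exponent = int(np.floor(np.log2(amax))) - 2 if amax > 0.0 else 0
    scaled = block / 2.0 ** exponent
    # Round each element to the nearest representable FP4 magnitude.
    idx = np.abs(FP4_GRID[None, :] - np.abs(scaled)[:, None]).argmin(axis=1)
    return exponent, np.sign(scaled), idx

def mxfp4_dequantize_block(exponent, signs, idx):
    """Reconstruct approximate float values from the MXFP4 encoding."""
    return signs * FP4_GRID[idx] * 2.0 ** exponent

# Round-trip a random block and inspect the quantization error.
x = np.random.randn(32)
e, s, q = mxfp4_quantize_block(x)
print("max abs error:", np.abs(x - mxfp4_dequantize_block(e, s, q)).max())
```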
See the user guide for more usage information.
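To give a sense of why the FP8 KV-cache matters at this scale, the following back-of-the-envelope sizing sketch assumes the published Llama 3.1 405B configuration (126 layers, 8 KV heads, head dimension 128); the numbers are illustrative estimates, not measurements.

```python
# Rough KV-cache sizing, assuming the published Llama 3.1 405B configuration
# (126 layers, 8 KV heads, head dimension 128). Illustrative estimates only.
LAYERS, KV_HEADS, HEAD_DIM = 126, 8, 128

def kv_cache_bytes_per_token(bytes_per_element):
    # Both keys and values are cached, hence the factor of 2.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_element

for name, width in [("fp16", 2), ("fp8", 1)]:
    per_token = kv_cache_bytes_per_token(width)
    ctx_128k_gib = per_token * 128 * 1024 / 2**30
    print(f"{name}: {per_token / 2**20:.2f} MiB/token, "
          f"~{ctx_128k_gib:.0f} GiB for a 128K-token context")
```

Halving the element width directly halves KV-cache traffic and frees memory headroom for weights and activations, which is what makes single-GPU serving of very large models practical.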
SHARK-UI v0.4
SHARK UI v0.4 lays the foundation for reliable test coverage. See its release notes for more details, and stay tuned!
Change Log
Git History
- Prepare FP8 quantization for sharded support by @Alex-Vasile in #1849
- Move create_sample_tensor_from_class to utils by @Alex-Vasile in #1877
- [Fusilli] Test harness for MLIR asm emitter by @AaronStGeorge in #1872
- [Fusilli][NFC] Adopt LLVM code style throughout by @sjain-stanford in #1885
- [sharktank] Yet another refactor of assert_tensor_close for tree support and integer dtypes by @sogartar in #1858
- [sharktank] Add FP4 quantized tensor split and cat by @sogartar in #1873
- Add missing `tensor-parallelism-size` flag from llama_serving.md by @vivekkhandelwal1 in #1887
- [sharktank][llm] Restrict page dimension to device_block_count by @Groverkss in #1880
- [tuner] add support for attention op by @bangtianliu in #1772
- [Fusilli] Dependency cleanup (remove IREE/LLVM/Catch2 source builds, bring in `lit` & `filecheck` standalone) by @sjain-stanford in #1886
- Fix barrier and transfer test by @Alex-Vasile in #1890
- [Fusilli] Bring in `iree-opt` for basic MLIR verification test by @AaronStGeorge in #1898
- [sharktank] Make sharded-split tensor support quantized tensors by @sogartar in #1875
- Ensure consistent names for tensor shards. by @Alex-Vasile in #1893
- Add tests for roundtripping datatypes by @Alex-Vasile in #1892
- [Fusilli] Boilerplate to bring in `iree-compile` for subprocess call by @AaronStGeorge in #1901
- added fix for python_pipe by @bmullick-amd in #1712
- Revert 2 related PR's that were causing a llama numerics regression by @IanNod in #1908
- [Fusilli] MLIR Assembly Emitter for Fusilli Graph with ConvFProp Node by @sjain-stanford in #1906
- [sharktank] Add toy Llama FP4 quantization by @sogartar in #1857
- Bump IREE requirement pins to 3.6.0rc20250718 by @shark-pr-automator[bot] in #1874
- [Fusilli] Use pre-built IREE rather than pip dependency by @AaronStGeorge in #1907
- Ensure error message is properly logged when `kvcache OOM` occurs by @stbaione in #1923
- [Fusilli] Use docker in CI by @sjain-stanford in #1922
- [Fusilli] s/RankedTensorType/ValueTensorType by @sjain-stanford in #1927
- [Fusilli] Remove UID and associated functionality by @sjain-stanford in #1918
- Remove write_range by @rsuderman in #1928
- Cast attn_output to h dtype in paged_llama_attention_block by @sebvince in #1761
- Bump IREE requirement pins to 3.7.0rc20250724 by @shark-pr-automator[bot] in #1921
- [Fusilli] Create ErrorOr type by @AaronStGeorge in #1930
- Cleanup ops.attention overrides for uniform behavior by @rsuderman in #1929
- Migrate existing mi300 runners to new mi325 capacity. by @deedongala in #1926
- Fix for the repeatable response due to truncation of prompt by @amd-vivekag in #1924
- Rate Limiting based on available page by @dezhiAmd in #1736
- Fix perplexity case for SDPA scale by @rsuderman in #1938
- [Fusilli] Track used symbols on tensors/nodes for SSA; fixes UB by @sjain-stanford in #1940
- [tuner] improve support for attention op by @bangtianliu in #1909
- [sharktank] Improve toy llama generation to account for pipeline/tensor parallelism and quantization block size requirements by @sogartar in #1915
- [sharktank] Improve non-default torch device placement by @sogartar in #1914
- [sharktank] Move functions for construction of random LLM input by @sogartar in #1913
- [sharktank] In eager do not use kernel for fp4 matmul by @sogartar in #1899
- [sharktank] Fix ShardedRotaryLayer to not produce nested replicated tensors by @sogartar in #1916
- [sharktank] make fp4 block-quantized have scales with trailing singleton dimension by @sogartar in #1900
- [sharktank] Add sharding of fp4 quantized toy Llama theta by @sogartar in #1876
- [Fusilli] Update find_program build macro to also register executable and set import location by @sjain-stanford in #1945
- [Fusilli] Since we're doing the pasta theme now (just needed one more `L`) by @sjain-stanford in #1946
- [Fusilli] Set import location to path, not variable name by @AaronStGeorge in #1947
- Add an MLIR kernel for properly handling FP4 casting by @KyleHerndon in #1882
- shortfin_apps.sd: Add benchmark by @bmullick-amd in #1738
- Fix error: zero-dimensional arrays cannot be concatenated by @dezhiAmd in #1951
- [sharktank] propagate transpose_rhs into the shards on sharded matmuls by @sogartar in #1902
- [sharktank] make quantize a proper op with multiple dispatch by @sogartar in #1944
- [Fusilli] Generate compiled artifacts from Fusilli Graph by @AaronStGeorge in #1936
- [Fusilli] Ensure tests do not write to cache by @AaronStGeorge in #1952
- Bump version to 3.7.0 after 3.6.0 release. by @sa-faizal in #1895
- [sharktank] make unpack an op with dispatch by @sogartar in #1959
- [sharktank] Fix loading of legacy fp4 block scaled quantized tensors having scale with no external tensor trait by @sogartar in #1969
- [sharktank] Add op shards and implement for BlockScaledLayout by @sogartar in #1960
- Remove pipelining and sharding from LLM by @rsuderman in #1932
- [Fusilli] NFC refactor by @sjain-stanford in #1972
- [Fusilli] Switch to (pre-) source-built and statically linked IREERuntime, and python packages for `iree-compile` by @sjain-stanford in #1977
- [sharktank] Remove sharded LLM CI tests by @sogartar in #1980
- [sharktank] make unpack_qs a dispatchable op by @sogartar in #1961
- Add more info on MI350 LDS to the kernel optimization guide by @sebvince in #1970
- Minor changes on kernel optimization guide by @sebvince in #1983
- [Fusilli] Enable gfx942 compilation test among other changes by @sjain-stanford in #1984
- Retool export scripting to be invokable from python by @rsuderman in #1978
- Added LLM testing utilities including decode and perplexity by @rsuderman in #1958
- [sharktank] Remove lingering sharded LLM nightly tests by @sogartar in #1981
- Update Wave dependency to `wave-lang` from `iree-turbine` by @paulzzy in #1925
- [FUSILLI] Use lowercase tool names in CMake cache variables by @AaronStGeorge in #1989
- Fix Model Integration Test by @dezhiAmd in #1967
- [sharktank] Add dequantize op to allow multiple dispatch by @sogartar in #1962
- [sharktank] make cat work with f8 eager CPU by @sogartar in #1963
- [sharktank] swiglu op by @oyazdanb in #1992
- test cleanup: removes tests which are not valid anymore by @amd-vivekag in #1994
- Disable offline serving for CI by @pravg-amd in #1974
- [Fusilli] Set opt level to O3 for GFX942 backend by @sjain-stanford in #1986
- Remove unused models from model management by @rsuderman in #2002
- [docs] Fix typo in amdgpu_kernel_optimization_guide.md by @kuhar in #2004
- Add CI tests against torch version 2.6.0 and remove for 2.4.1 by @PhaneeshB in #1678
- [sharktank] Fix nightly testCompareV1_1XxlFluxRepoIreeBf16AgainstTorchEagerF32 by @sogartar in #1995
- [sharktank] fix nightly Flux test testCompareDevIreeBf16AgainstEagerF32 by @sogartar in #1997
- Bump IREE requirement pins to 3.7.0rc20250730 by @shark-pr-automator[bot] in #1933
- [doc] Add MI355X info to .md by @RattataKing in #2005
- Strip Tests and Support for multigreedy by @rsuderman in #2012
- Add temperature support to new decoder by @rsuderman in #2011
- Add wave fp4 gemm unit test for export, compile, and run by @aviator19941 in #1976
- Disable number of page restriction by @rsuderman in #2020
- Remove wave fp4 gemm optimizations that cause nan's in prefill and PPL segfault in decode by @aviator19941 in #2022
- Tweak temperature default and temperature value check by @rsuderman in #2019
- [sharktank] fix test for wave fp4 gemm kernel by @sogartar in #2023
- Update amdgpu_kernel_optimization_guide.md by @kuhar in #2018
- Argmax export should not drop the options dimension by @rsuderman in #2025
- [sharktank] Rope for openweight by @oyazdanb in #2003
- Refactor decoder with stateful tools by @rsuderman in #1871
- Bump IREE requirement pins to 3.7.0rc20250811 by @shark-pr-automator[bot] in #2010
- Add export tooling for generating testing artifacts by @rsuderman in #1993
- Fix vmfb runner by @rsuderman in #2027
- Enable new decoder - remove multigreedy case by @rsuderman in #2026
- Unify attention dispatching by @KyleHerndon in #1935
- Lisal.prefill allocation in decoder by @lisaliu1 in #2015
- Remove unused server tests / infrastructure by @rsuderman in #2030
- Use int8 for handling float8_e4m3fn compatibility by @KyleHerndon in #1883
- Handle prefill EOT and remove tokenizer padding by @rsuderman in #2038
- Refactor `LlmExecutorProcess` by @stbaione in #1966
- Delete old token selection strategy by @rsuderman in #2029
- [sharktank] Update rotary layer to support prefill offset by @archana-ramalingam in #2031
- Correct the sharktank attention kernel overrides to correctly allow a… by @KyleHerndon in #2036
- Add a raw benchmark utility by @rsuderman in #2042
- [sharktank] Add attention updates required for prefill offset by @archana-ramalingam in #2043
- [sharktank] Add fp4 gemm asm kernel integration by @jinchen62 in #2041
- [sharktank] Add chunked prefill to llm by @archana-ramalingam in #2013
- Generating logs for export, compilation and iree_benchmark test by @yash-amd in #2035
- [CI] Llama 3.1 8b fp16 nightly harness by @amd-vivekag in #2007
- Strip down tinystories to fewer tests - strip beam search crud by @rsuderman in #2051
- Remove unused cruft from exec request by @rsuderman in #2052
- Move allocate_cache from batcher.py to decoder.py by @dezhiAmd in #2050
- [Shortfin] Fix bugs in workloadbuilder by @Alex-Vasile in #1917
- [sharktank] Add scales shuffler kernel for gemm asm kernel by @jinchen62 in #2058
- Add kvcache dtype option to run_llm_vmfb by @pravg-amd in #2037
- test adding checks by @yash-amd in #2065
- Softmax should occur in f32 by @rsuderman in #2062
- [sharktank] Add a more concise trampoline description by @KyleHerndon in #2061
- Stripping out unused or meaningless model tests by @rsuderman in #2064
- Run Deploy to Github Pages and Push Logs every time by @yash-amd in #2068
- [Dataset] Add json load/save support to Dataset by @Groverkss in #2070
- shortfin_apps.sd: Add precompile shellscript, update flagfile for gfx942. by @monorimet in #2069
- [Sharktank] Pipeline parallelism FFN example refactor by @Alex-Vasile in #2072
- Bump iree to `3.7.0rc20250822` and kernels for matmul_transpose change by @stbaione in #2083
- Kernel selection for attention and matmul by @KyleHerndon in #2071
- Add EOS token ids for decode utility by @rsuderman in #2084
- [sharktank] Slice gemm asm kernel output and fix inputs padding by @jinchen62 in #2074
- Add a vmfb perplexity generating tool by @rsuderman in #2063
- removes serialization when return_input_ids is True by @amd-vivekag in #2079
- Updated setup to enable model to run every 6 hours from IREE source code by @pdhirajkumarprasad in #2089
- Online serving response checks by @yash-amd in #2073
- Refactor transfer_between_blocks_if_needed by @Alex-Vasile in #2080
- Improve kernel selection to enable a priority list of matches by @KyleHerndon in #2087
- Update parameter loading to make runtime much faster by @IanNod in #2092
- [sharktank] Embedding gemm asm kernels as hex by @jinchen62 in #2094
- Finish plumbing matmul-kernel arg through perplexity_iree.py by @aviator19941 in #2093
- fixed ci failure and added new CI to generate IRPA by @pdhirajkumarprasad in #2095
- Llm invocation cleanup by @stbaione in #2077
- Replace KVCache list of tensors with simple CacheAllocation wrapper object by @Alex-Vasile in #2081
- Add Wave gemm optimizations by @aviator19941 in #2085
- [Sharktank] Improve and use transfer_between_blocks by @Alex-Vasile in #2100
- Llm task responder by @stbaione in #2078
- [Sharktank] Allocate KVCache with .zeros instead of .empty by @Alex-Vasile in #2099
- [Sharktank] Change pipeline_parallism_size to be a calculated property by @Alex-Vasile in #2101
- Lisal.return ids from allocation by @lisaliu1 in #2057
- Llm task remove exec requests by @stbaione in #2082
- [Sharktank] Add some external model support by @dan-garvey in #2103
- [tuner] add acc layout match to constraint generation for attention by @bangtianliu in #2104
- Add model based VMFB tool and perplexity option by @rsuderman in #2111
- [Sharktank] Type hints for CachedRotaryLayer by @Alex-Vasile in #2102
- [Sharktank] Create paged_attention method by @Alex-Vasile in #2106
- [Sharktank] Remove old unused code by @Alex-Vasile in #2114
- [Sharktank] Move implementation of helper functions out of BaseCausalLMModel by @Alex-Vasile in #2113
- Change script to eval and dedupe some behavior by @rsuderman in #2115
- [Sharktank] Remove BaseCausalLLMModel methods and replace usages with calls to new helper functions. by @Alex-Vasile in #2117
- Add decode perplexity test to toy llama tests by @rsuderman in #2112
- [sharktank] Make torch SDPA the default attention implementation by @sogartar in #2118
- updated CI to export mistral IRPA by @pdhirajkumarprasad in #2120
- Changes for six hour report by @yash-amd in #2121
- [Sharktank] Replumb PagedAttention to have their own KVCache object by @Alex-Vasile in #2109
- [Tuner] Add time limit to benchmark phases by @RattataKing in #2108
- Update amdgpu_kernel_optimization_guide.md by @kuhar in #2126
- [Sharktank] attention for openweight by @oyazdanb in #2017
- Enable `start_positions` in prefill execution by @stbaione in #2129
- Add script to validate numerics by @pravg-amd in #2142
- added vmfb numeric check in CI by @pdhirajkumarprasad in #2143
- Run mistral on isl 2048 by @yash-amd in #2138
- [sharktank] in assert_tensor_close print abs diff and expected stats by @sogartar in #2124
- Bump iree pin on shortfin runtime build to iree-3.7.0rc20250828 by @rsuderman in #2128
- [shortfin][llm] Move batching logic behind a strategy pattern by @vinayakdsci in #2075
- [Sharktank] Replumb KV Quantizers by @Alex-Vasile in #2107
- fixed file path by @pdhirajkumarprasad in #2148
- Add llama 3.1 8b new perplexity script to CI by @rsuderman in #2134
- [Fusilli] Cleanup `ErrorOr<>` error propagation tests by @AaronStGeorge in #2149
- [Sharktank] Pipeline parallelism version of rotary embedding layer by @Alex-Vasile in #2110
- [Sharktank] Remove model.paged_attention by @Alex-Vasile in #2146
- Moves paged kv cache kernel to kernels dir by @IanNod in #2131
- [Sharktank] Remove use_attention_mask by @Alex-Vasile in #2153
- Revert "Moves paged kv cache kernel to kernels dir" by @IanNod in #2154
- Refactor paged attention by @dezhiAmd in #2139
- Add an iree baseline test for llama 8b by @rsuderman in #2158
- Revert "Refactor paged attention (#2139)" by @rsuderman in #2166
- Update amdgpu_kernel_optimization_guide.md by @kuhar in #2168
- [Sharktank] Remove block_index from PagedAttention.forward by @Alex-Vasile in #2163
- [Fusilli] Runtime Interface 1/N by @sjain-stanford in #2125
- [Sharktank] Change TestAttentionBlock to use causal attention mask by @Alex-Vasile in #2160
- Add flag to compile with tensor ukernels. by @yash-amd in #2177
New Contributors
- @AaronStGeorge made their first contribution in #1872
- @bmullick-amd made their first contribution in #1712
- @sebvince made their first contribution in #1761
- @deedongala made their first contribution in #1926
- @sa-faizal made their first contribution in #1895
- @paulzzy made their first contribution in #1925
- @yash-amd made their first contribution in #2035
Full Changelog: v3.6.0...v3.7.0