Release v3.7.0
Release Highlights
Llama 3.1 405B Support on MI355X
This release significantly enhances support for large language models by enabling Llama 3.1 405B FP4 to run on a single MI355X GPU, leveraging its higher memory capacity. Key components include:
- FP16 Attention: Optimized attention mechanisms using FP16 precision reduce memory usage and speed up computation, improving inference efficiency.
- FP8 KV-Cache: Support for FP8 precision in the key-value (KV) cache further reduces the memory footprint of transformer attention, enabling faster processing and better utilization of GPU memory (a rough sizing example appears at the end of this section).
- MXFP4 GEMM Kernels: Introduction of GEMM operations in MXFP4 (Microscaling FP4), a low-bit floating-point format designed to maximize performance and memory efficiency without compromising accuracy. MXFP4 shares one 8-bit scale across each block of 32 FP4 values, using the F8E8M0FNU type for the shared scale and F4E2M1FN for the values (a quantization sketch follows below).
Together, these enhancements make it possible to serve extremely large models more effectively on MI355X GPUs, enabling higher throughput and better utilization of resources.
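As an illustration of the MXFP4 layout described above, here is a minimal NumPy sketch of block quantization: one shared power-of-two scale per block of 32 FP4 (E2M1) values. The function names and the scale-selection rule are illustrative assumptions, not the sharktank kernels.

```python
import numpy as np

# Magnitudes representable by the F4E2M1FN (FP4) element type.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize_block(block):
    """Quantize one block of 32 floats into a shared power-of-two scale
    (the E8M0 part) plus 32 signed FP4 magnitudes (indices into FP4_GRID)."""
    assert block.shape == (32,)
    amax = float(np.abs(block).max())
    # Shared scale: a power of two chosen so the largest element lands in the
    # FP4 range (the largest E2M1 exponent is 2, i.e. max magnitude 6.0).
    exponent = int(np.floor(np.log2(amax))) - 2 if amax > 0.0 else 0
    scaled = block / 2.0 ** exponent
    # Round each element to the nearest representable FP4 magnitude.
    idx = np.abs(FP4_GRID[None, :] - np.abs(scaled)[:, None]).argmin(axis=1)
    return exponent, np.sign(scaled), idx

def mxfp4_dequantize_block(exponent, signs, idx):
    """Reconstruct approximate float values from the MXFP4 encoding."""
    return signs * FP4_GRID[idx] * 2.0 ** exponent

# Round-trip a random block and inspect the quantization error.
x = np.random.randn(32)
e, s, q = mxfp4_quantize_block(x)
print("max abs error:", np.abs(x - mxfp4_dequantize_block(e, s, q)).max())
```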
See the user guide for more usage information.
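To give a sense of why the FP8 KV-cache matters at this scale, the following back-of-the-envelope sizing sketch assumes the published Llama 3.1 405B configuration (126 layers, 8 KV heads, head dimension 128); the numbers are illustrative estimates, not measurements.

```python
# Rough KV-cache sizing, assuming the published Llama 3.1 405B configuration
# (126 layers, 8 KV heads, head dimension 128). Illustrative estimates only.
LAYERS, KV_HEADS, HEAD_DIM = 126, 8, 128

def kv_cache_bytes_per_token(bytes_per_element):
    # Both keys and values are cached, hence the factor of 2.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_element

for name, width in [("fp16", 2), ("fp8", 1)]:
    per_token = kv_cache_bytes_per_token(width)
    ctx_128k_gib = per_token * 128 * 1024 / 2**30
    print(f"{name}: {per_token / 2**20:.2f} MiB/token, "
          f"~{ctx_128k_gib:.0f} GiB for a 128K-token context")
```

Halving the element width directly halves KV-cache traffic and frees memory headroom for weights and activations, which is what makes single-GPU serving of very large models practical.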
SHARK-UI v0.4
SHARK UI v0.4 lays the foundation for reliable test coverage. See its release notes for more details, and stay tuned!
Change Log
Git History
- Prepare FP8 quantization for sharded support by @Alex-Vasile in #1849
- Move create_sample_tensor_from_class to utils by @Alex-Vasile in #1877
- [Fusilli] Test harness for MLIR asm emitter by @AaronStGeorge in #1872
- [Fusilli][NFC] Adopt LLVM code style throughout by @sjain-stanford in #1885
- [sharktank] Yet another refactor of assert_tensor_close for tree support and integer dtypes by @sogartar in #1858
- [sharktank] Add FP4 quantized tensor split and cat by @sogartar in #1873
- Add missing `tensor-parallelism-size` flag from llama_serving.md by @vivekkhandelwal1 in #1887
- [sharktank][llm] Restrict page dimension to device_block_count by @Groverkss in #1880
- [tuner] add support for attention op by @bangtianliu in #1772
- [Fusilli] Dependency cleanup (remove IREE/LLVM/Catch2 source builds, bring in `lit` & `filecheck` standalone) by @sjain-stanford in #1886
- Fix barrier and transfer test by @Alex-Vasile in #1890
- [Fusilli] Bring in `iree-opt` for basic MLIR verification test by @AaronStGeorge in #1898
- [sharktank] Make sharded-split tensor support quantized tensors by @sogartar in #1875
- Ensure consistent names for tensor shards. by @Alex-Vasile in #1893
- Add tests for roundtripping datatypes by @Alex-Vasile in #1892
- [Fusilli] Boilerplate to bring in `iree-compile` for subprocess call by @AaronStGeorge in #1901
- added fix for python_pipe by @bmullick-amd in #1712
- Revert 2 related PR's that were causing a llama numerics regression by @IanNod in #1908
- [Fusilli] MLIR Assembly Emitter for Fusilli Graph with ConvFProp Node by @sjain-stanford in #1906
- [sharktank] Add toy Llama FP4 quantization by @sogartar in #1857
- Bump IREE requirement pins to 3.6.0rc20250718 by @shark-pr-automator[bot] in #1874
- [Fusilli] Use pre-built IREE rather than pip dependency by @AaronStGeorge in #1907
- Ensure error message is properly logged when `kvcache OOM` occurs by @stbaione in #1923
- [Fusilli] Use docker in CI by @sjain-stanford in #1922
- [Fusilli] s/RankedTensorType/ValueTensorType by @sjain-stanford in #1927
- [Fusilli] Remove UID and associated functionality by @sjain-stanford in #1918
- Remove write_range by @rsuderman in #1928
- Cast attn_output to h dtype in paged_llama_attention_block by @sebvince in #1761
- Bump IREE requirement pins to 3.7.0rc20250724 by @shark-pr-automator[bot] in #1921
- [Fusilli] Create ErrorOr type by @AaronStGeorge in #1930
- Cleanup ops.attention overrides for uniform behavior by @rsuderman in #1929
- Migrate existing mi300 runners to new mi325 capacity. by @deedongala in #1926
- Fix for the repeatable response due to truncation of prompt by @amd-vivekag in #1924
- Rate Limiting based on available page by @dezhiAmd in #1736
- Fix perplexity case for SDPA scale by @rsuderman in #1938
- [Fusilli] Track used symbols on tensors/nodes for SSA; fixes UB by @sjain-stanford in #1940
- [tuner] improve support for attention op by @bangtianliu in #1909
- [sharktank] Improve toy llama generation to account for pipeline/tensor parallelism and quantization block size requirements by @sogartar in #1915
- [sharktank] Improve non-default torch device placement by @sogartar in #1914
- [sharktank] Move functions for construction of random LLM input by @sogartar in #1913
- [sharktank] In eager do not use kernel for fp4 matmul by @sogartar in #1899
- [sharktank] Fix ShardedRotaryLayer to not produce nested replicated tensors by @sogartar in #1916
- [sharktank] make fp4 block-quantized have scales with trailing singleton dimension by @sogartar in #1900
- [sharktank] Add sharding of fp4 quantized toy Llama theta by @sogartar in #1876
- [Fusilli] Update find_program build macro to also register executable and set import location by @sjain-stanford in #1945
- [Fusilli] Since we're doing the pasta theme now (just needed one more `L`) by @sjain-stanford in #1946
- [Fusilli] Set import location to path, not variable name by @AaronStGeorge in #1947
- Add an MLIR kernel for properly handling FP4 casting by @KyleHerndon in #1882
- shortfin_apps.sd: Add benchmark by @bmullick-amd in #1738
- Fix error: zero-dimensional arrays cannot be concatenated by @dezhiAmd in #1951
- [sharktank] propagate transpose_rhs into the shards on sharded matmuls by @sogartar in #1902
- [sharktank] make quantize a proper op with multiple dispatch by @sogartar in #1944
- [Fusilli] Generate compiled artifacts from Fusilli Graph by @AaronStGeorge in #1936
- [Fusilli] Ensure tests do not write to cache by @AaronStGeorge in #1952
- Bump version to 3.7.0 after 3.6.0 release. by @sa-faizal in #1895
- [sharktank] make unpack an op with dispatch by @sogartar in #1959
- [sharktank] Fix loading of legacy fp4 block scaled quantized tensors having scale with no external tensor trait by @sogartar in #1969
- [sharktank] Add op shards and implement for BlockScaledLayout by @sogartar in #1960
- Remove pipelining and sharding from LLM by @rsuderman in #1932
- [Fusilli] NFC refactor by @sjain-stanford in #1972
- [Fusilli] Switch to (pre-) source-built and statically linked IREERuntime, and python packages for `iree-compile` by @sjain-stanford in #1977
- [sharktank] Remove sharded LLM CI tests by @sogartar in #1980
- [sharktank] make unpack_qs a dispatchable op by @sogartar in #1961
- Add more info on MI350 LDS to the kernel optimization guide by @sebvince in #1970
- Minor changes on kernel optimization guide by @sebvince in #1983
- [Fusilli] Enable gfx942 compilation test among other changes by @sjain-stanford in #1984
- Retool export scripting to be invokable from python by @rsuderman in #1978
- Added LLM testing utilities including decode and perplexity by @rsuderman in #1958
- [sharktank] Remove lingering sharded LLM nightly tests by @sogartar in #1981
- Update Wave dependency to `wave-lang` from `iree-turbine` by @paulzzy in #1925
- [FUSILLI] Use lowercase tool names in CMake cache variables by @AaronStGeorge in #1989
- Fix Model Integration Test by @dezhiAmd in #1967
- [sharktank] Add dequantize op to allow multiple dispatch by @sogartar in #1962
- [sharktank] make cat work with f8 eager CPU by @sogartar in #1963
- [sharktank] swiglu op by @oyazdanb in #1992
- test cleanup: removes tests which are not valid anymore by @amd-vivekag in #1994
- Disable offline serving for CI by @pravg-amd in #1974
- [Fusilli] Set opt level to O3 for GFX942 backend by @sjain-stanford in #1986
- Remove unused models from model management by @rsuderman in #2002
- [docs] Fix typo in amdgpu_kernel_optimization_guide.md by @kuhar in #2004
- Add CI tests against torch version 2.6.0 and remove for 2.4.1 by @PhaneeshB in #1678
- [sharktank] Fix nightly testCompareV1_1XxlFluxRepoIreeBf16AgainstTorchEagerF32 by @sogartar in #1995
- [sharktank] fix nightly Flux test testCompareDevIreeBf16AgainstEagerF32 by @sogartar in #1997
- Bump IREE requirement pins to 3.7.0rc20250730 by @shark-pr-automator[bot] in #1933
- [doc] Add MI355X info to .md by @RattataKing in #2005
- Strip Tests and Support for multigreedy by @rsuderman in #2012
- Add temperature support to new decoder by @rsuderman in #2011
- Add wave fp4 gemm unit test for export, compile, and run by @aviator19941 in #1976
- Disable number of page restriction by @rsuderman in #2020
- Remove wave fp4 gemm optimizations that cause nan's in prefill and PPL segfault in decode by @aviator19941 in #2022
- Tweak temperature default and temperature value check by @rsuderman in #2019
- [sharktank] fix test for wave fp4 gemm kernel by @sogartar in #2023
- Update amdgpu_kernel_optimization_guide.md by @kuhar in #2018
- Argmax export should not drop the options dimension by @rsuderman in #2025
- [sharktank] Rope for openweight by @oyazdanb in #2003
- Refactor decoder with stateful tools by @rsuderman in #1871
- Bump IREE requirement pins to 3.7.0rc20250811 by @shark-pr-automator[bot] in #2010
- Add export tooling for generating testing artifacts by @rsuderman in #1993
- Fix vmfb runner by @rsuderman in #2027
- Enable new decoder - remove multigreedy case by @rsuderman in #2026
- Unify attention dispatching by @KyleHerndon in #1935
- Lisal.prefill allocation in decoder by @lisaliu1 in #2015
- Remove unused server tests / infrastructure by @rsuderman in #2030
- Use int8 for handling float8_e4m3fn compatibility by @KyleHerndon in #1883
- Handle prefill EOT and remove tokenizer padding by @rsuderman in #2038
- Refactor `LlmExecutorProcess` by @stbaione in #1966
- Delete old token selection strategy by @rsuderman in #2029
- [sharktank] Update rotary layer to support prefill offset by @archana-ramalingam in #2031
- Correct the sharktank attention kernel overrides to correctly allow a… by @KyleHerndon in #2036
- Add a raw benchmark utility by @rsuderman in #2042
- [sharktank] Add attention updates required for prefill offset by @archana-ramalingam in #2043
- [sharktank] Add fp4 gemm asm kernel integration by @jinchen62 in #2041
- [sharktank] Add chunked prefill to llm by @archana-ramalingam in #2013
- Generating logs for export, compilation and iree_benchmark test by @yash-amd in #2035
- [CI] Llama 3.1 8b fp16 nightly harness by @amd-vivekag in #2007
- Strip down tinystories to fewer tests - strip beam search crud by @rsuderman in #2051
- Remove unused cruft from exec request by @rsuderman in #2052
- Move allocate_cache from batcher.py to decoder.py by @dezhiAmd in #2050
- [Shortfin] Fix bugs in workloadbuilder by @Alex-Vasile in #1917
- [sharktank] Add scales shuffler kernel for gemm asm kernel by @jinchen62 in #2058
- Add kvcache dtype option to run_llm_vmfb by @pravg-amd in #2037
- test adding checks by @yash-amd in #2065
- Softmax should occur in f32 by @rsuderman in #2062
- [sharktank] Add a more concise trampoline description by @KyleHerndon in #2061
- Stripping out unused or meaningless model tests by @rsuderman in #2064
- Run Deploy to Github Pages and Push Logs every time by @yash-amd in #2068
- [Dataset] Add json load/save support to Dataset by @Groverkss in #2070
- shortfin_apps.sd: Add precompile shellscript, update flagfile for gfx942. by @monorimet in #2069
- [Sharktank] Pipeline parallelism FFN example refactor by @Alex-Vasile in #2072
- Bump iree to `3.7.0rc20250822` and kernels for matmul_transpose change by @stbaione in #2083
- Kernel selection for attention and matmul by @KyleHerndon in #2071
- Add EOS token ids for decode utility by @rsuderman in #2084
- [sharktank] Slice gemm asm kernel output and fix inputs padding by @jinchen62 in #2074
- Add a vmfb perplexity generating tool by @rsuderman in #2063
- removes serialization when return_input_ids is True by @amd-vivekag in #2079
- Updated setup to enable model to run every 6 hours from IREE source code by @pdhirajkumarprasad in #2089
- Online serving response checks by @yash-amd in #2073
- Refactor transfer_between_blocks_if_needed by @Alex-Vasile in #2080
- Improve kernel selection to enable a priority list of matches by @KyleHerndon in #2087
- Update parameter loading to make runtime much faster by @IanNod in #2092
- [sharktank] Embedding gemm asm kernels as hex by @jinchen62 in #2094
- Finish plumbing matmul-kernel arg through perplexity_iree.py by @aviator19941 in #2093
- fixed ci failure and added new CI to generate IRPA by @pdhirajkumarprasad in #2095
- Llm invocation cleanup by @stbaione in #2077
- Replace KVCache list of tensors with simple CacheAllocation wrapper object by @Alex-Vasile in #2081
- Add Wave gemm optimizations by @aviator19941 in #2085
- [Sharktank] Improve and use transfer_between_blocks by @Alex-Vasile in #2100
- Llm task responder by @stbaione in #2078
- [Sharktank] Allocate KVCache with .zeros instead of .empty by @Alex-Vasile in #2099
- [Sharktank] Change pipeline_parallism_size to be a calculated property by @Alex-Vasile in #2101
- Lisal.return ids from allocation by @lisaliu1 in #2057
- Llm task remove exec requests by @stbaione in #2082
- [Sharktank] Add some external model support by @dan-garvey in #2103
- [tuner] add acc layout match to constraint generation for attention by @bangtianliu in #2104
- Add model based VMFB tool and perplexity option by @rsuderman in #2111
- [Sharktank] Type hints for CachedRotaryLayer by @Alex-Vasile in #2102
- [Sharktank] Create paged_attention method by @Alex-Vasile in #2106
- [Sharktank] Remove old unused code by @Alex-Vasile in #2114
- [Sharktank] Move implementation of helper functions out of BaseCausalLMModel by @Alex-Vasile in #2113
- Change script to eval and dedupe some behavior by @rsuderman in #2115
- [Sharktank] Remove BaseCausalLLMModel methods and replace usages with calls to new helper functions. by @Alex-Vasile in #2117
- Add decode perplexity test to toy llama tests by @rsuderman in #2112
- [sharktank] Make torch SDPA the default attention implementation by @sogartar in #2118
- updated CI to export mistral IRPA by @pdhirajkumarprasad in #2120
- Changes for six hour report by @yash-amd in #2121
- [Sharktank] Replumb PagedAttention to have their own KVCache object by @Alex-Vasile in #2109
- [Tuner] Add time limit to benchmark phases by @RattataKing in #2108
- Update amdgpu_kernel_optimization_guide.md by @kuhar in #2126
- [Sharktank] attention for openweight by @oyazdanb in #2017
- Enable `start_positions` in prefill execution by @stbaione in #2129
- Add script to validate numerics by @pravg-amd in #2142
- added vmfb numeric check in CI by @pdhirajkumarprasad in #2143
- Run mistral on isl 2048 by @yash-amd in #2138
- [sharktank] in assert_tensor_close print abs diff and expected stats by @sogartar in #2124
- Bump iree pin on shortfin runtime build to iree-3.7.0rc20250828 by @rsuderman in #2128
- [shortfin][llm] Move batching logic behind a strategy pattern by @vinayakdsci in #2075
- [Sharktank] Replumb KV Quantizers by @Alex-Vasile in #2107
- fixed file path by @pdhirajkumarprasad in #2148
- Add llama 3.1 8b new perplexity script to CI by @rsuderman in #2134
- [Fusilli] Cleanup `ErrorOr<>` error propagation tests by @AaronStGeorge in #2149
- [Sharktank] Pipeline parallelism version of rotary embedding layer by @Alex-Vasile in #2110
- [Sharktank] Remove model.paged_attention by @Alex-Vasile in #2146
- Moves paged kv cache kernel to kernels dir by @IanNod in #2131
- [Sharktank] Remove use_attention_mask by @Alex-Vasile in #2153
- Revert "Moves paged kv cache kernel to kernels dir" by @IanNod in #2154
- Refactor paged attention by @dezhiAmd in #2139
- Add an iree baseline test for llama 8b by @rsuderman in #2158
- Revert "Refactor paged attention (#2139)" by @rsuderman in #2166
- Update amdgpu_kernel_optimization_guide.md by @kuhar in #2168
- [Sharktank] Remove block_index from PagedAttention.forward by @Alex-Vasile in #2163
- [Fusilli] Runtime Interface 1/N by @sjain-stanford in #2125
- [Sharktank] Change TestAttentionBlock to use causal attention mask by @Alex-Vasile in #2160
- Add flag to compile with tensor ukernels. by @yash-amd in #2177
New Contributors
- @AaronStGeorge made their first contribution in #1872
- @bmullick-amd made their first contribution in #1712
- @sebvince made their first contribution in #1761
- @deedongala made their first contribution in #1926
- @sa-faizal made their first contribution in #1895
- @paulzzy made their first contribution in #1925
- @yash-amd made their first contribution in #2035
Full Changelog: v3.6.0...v3.7.0