
Release v3.7.0


@sa-faizal sa-faizal released this 05 Sep 18:48
· 255 commits to main since this release
ed942a5

Release Highlights

Llama 3.1 405B Support on MI355X

This release significantly enhances support for large language models by enabling Llama 3.1 405B FP4 to run on a single MI355X GPU, leveraging its higher memory capacity. Key components include:

  • FP16 Attention: Optimized attention mechanisms using FP16 precision reduce memory usage and speed up computation, improving inference efficiency.
  • FP8 KV-Cache: Support for FP8 precision in the key-value (KV) cache further reduces memory footprint during transformer attention, enabling faster processing and better utilization of GPU memory.
  • MXFP4 GEMM Kernels: Introduction of MXFP4 (Microscaling FP4) GEMM operations, a low-bit floating-point format designed to maximize performance and memory efficiency without compromising accuracy. MXFP4 pairs one shared FP8 scale with each block of 32 FP4 values, using the F8E8M0FNU type for the scales and F4E2M1FN for the elements.
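
To make the block layout above concrete, here is a minimal, illustrative sketch of MXFP4-style quantization (this is not the actual SHARK/IREE kernel code): one power-of-two scale, as in E8M0, shared by a block of 32 E2M1 elements.

```python
import math

# Representable magnitudes of the F4E2M1FN element type.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize a block of 32 floats into one shared power-of-two
    scale (E8M0-style) plus one FP4 (E2M1) code per element."""
    assert len(block) == 32
    amax = max(abs(x) for x in block) or 1.0
    # Pick a power-of-two scale so the largest magnitude lands near
    # 6.0, the top of the E2M1 range (E8M0 scales hold only exponents).
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0))
    def to_fp4(x):
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        return math.copysign(mag, x)
    return scale, [to_fp4(x) for x in block]

def dequantize_block(scale, codes):
    return [scale * c for c in codes]
```

The per-block scale is what keeps 4-bit elements usable: outliers only distort the 32 values that share their block, not the whole tensor.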

Together, these enhancements make it possible to serve extremely large models more effectively on MI355X GPUs, enabling higher throughput and better utilization of resources.
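As a back-of-envelope illustration of why the FP8 KV-cache matters at this scale, the sketch below sizes the cache assuming the published Llama 3.1 405B configuration (126 layers, 8 KV heads via grouped-query attention, head dimension 128); these figures are assumptions for illustration, not measurements from this release.

```python
def kv_cache_bytes(seq_len, bytes_per_elem,
                   layers=126, kv_heads=8, head_dim=128):
    # 2x accounts for the separate key and value tensors per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

fp16 = kv_cache_bytes(seq_len=8192, bytes_per_elem=2)
fp8 = kv_cache_bytes(seq_len=8192, bytes_per_elem=1)
print(f"FP16 KV cache @ 8K context: {fp16 / 2**30:.2f} GiB")
print(f"FP8  KV cache @ 8K context: {fp8 / 2**30:.2f} GiB")
```

Halving the bytes per element halves the cache for any sequence length, which frees memory for larger batches or longer contexts on the same GPU.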

See the user guide for more usage information.

SHARK-UI v0.4

SHARK UI v0.4 lays the foundation for reliable test coverage. See its release notes for more details, and stay tuned!

Change Log

Git History

New Contributors

Full Changelog: v3.6.0...v3.7.0