Releases · unslothai/llama.cpp

26 Aug 05:00

74f52f7

b6277 Latest

CUDA: Accelerate MXFP4 table lookup using `__byte_perm` (#15451)

* CUDA: optimize get_int_from_table_16

* CUDA: use v_perm_b32 to replace byte_perm on AMD GPUs

* revise documentation

---------

Co-authored-by: xix
Co-authored-by: Johannes Gäßler
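
The idea behind a `__byte_perm`-based lookup can be sketched in plain Python. This is an illustration only, not the kernel from #15451: `byte_perm` emulates the CUDA intrinsic (each selector nibble uses its low 3 bits to pick one of the 8 bytes of the two inputs), `lookup4` is a hypothetical helper name, and the table contents are an assumption (E2M1 magnitudes doubled so they fit in `int8`). Since one permute can only choose among 8 bytes, a 16-entry table needs two permutes (low half, high half) plus a third that selects between them using bit 3 of each index.

```python
import struct

# Assumed 16-entry int8 table (E2M1 magnitudes doubled); illustrative only.
TABLE = [0, 1, 2, 3, 4, 6, 8, 12, 0, -1, -2, -3, -4, -6, -8, -12]


def byte_perm(x, y, s):
    """Emulate CUDA __byte_perm: build a 32-bit word from 4 of the 8 bytes of (y:x)."""
    src = (x & 0xFFFFFFFF) | ((y & 0xFFFFFFFF) << 32)
    out = 0
    for i in range(4):
        sel = (s >> (4 * i)) & 0x7          # only the low 3 bits of each selector nibble are used
        out |= ((src >> (8 * sel)) & 0xFF) << (8 * i)
    return out


def lookup4(q, table=TABLE):
    """Look up four 4-bit indices (packed in the low 16 bits of q) in a 16-entry int8 table."""
    words = struct.unpack("<4I", struct.pack("16b", *table))  # table as 4 little-endian words
    lo = byte_perm(words[0], words[1], q)    # candidates from entries 0..7
    hi = byte_perm(words[2], words[3], q)    # candidates from entries 8..15
    # Output byte i takes lo byte i (selector i) or hi byte i (selector i + 4),
    # depending on bit 3 of index i.
    mask = 0x3210 | ((q & 0x8888) >> 1)
    packed = byte_perm(lo, hi, mask)
    return list(struct.unpack("4b", struct.pack("<I", packed)))


print(lookup4(0xF081))  # indices 1, 8, 0, 15 (lowest nibble first)
```

Three permutes replace four separate byte loads here; whether that is a net win on a given GPU is something the PR itself measures, not this sketch.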

Assets 15

| Asset | SHA-256 | Size | Uploaded (UTC) |
| --- | --- | --- | --- |
| cudart-llama-bin-win-cuda-12.4-x64.zip | sha256:8c79a9b226de4b3cacfd1f83d24f962d0773be79f1e7b75c6af4ded7e32ae1d6 | 373 MB | 2025-08-26T05:01:00Z |
| llama-b6277-bin-macos-arm64.zip | sha256:eb920e446397e1c88115f592e6bc3964a3bdffaaaa183824b473004e2f4175db | 10.9 MB | 2025-08-26T05:01:08Z |
| llama-b6277-bin-macos-x64.zip | sha256:3a4c6fbfe8863437d4586a35d45180f59fd44d1cfee89e5ccb3f7993c891e9c8 | 28.1 MB | 2025-08-26T05:01:09Z |
| llama-b6277-bin-ubuntu-vulkan-x64.zip | sha256:8d0ff9b3eada7bc4e0149dbc9d8e86163ac1827e54226632652ede559054b69d | 24.8 MB | 2025-08-26T05:01:11Z |
| llama-b6277-bin-ubuntu-x64.zip | sha256:435275c8199149921112baaa77feb09aa2d4bb30c689f021dc1a034c5e646378 | 12.9 MB | 2025-08-26T05:01:12Z |
| llama-b6277-bin-win-cpu-arm64.zip | sha256:a533a3b72812fa7d279a25a231553955a5f58cd63d01c594e3e2f140dc310384 | 11.1 MB | 2025-08-26T05:01:13Z |
| llama-b6277-bin-win-cpu-x64.zip | sha256:92510a62eb6b00ea3ec9698c9feeb22d75a531643dd8af57679a4f3070981217 | 14.1 MB | 2025-08-26T05:01:14Z |
| llama-b6277-bin-win-cuda-12.4-x64.zip | sha256:3a3688829dcfda70d1c9a5a5b6c5a6335455d156c4d6c3899574fded6615615d | 137 MB | 2025-08-26T05:01:15Z |
| llama-b6277-bin-win-hip-radeon-x64.zip | sha256:54eead84e7011bafe7b50054d4ec2c66efdd9312de239aaf31957755d9c4b3a8 | 287 MB | 2025-08-26T05:01:19Z |
| llama-b6277-bin-win-opencl-adreno-arm64.zip | sha256:2e59e340e43368e8c416a32dba3fedc132c9e7483d621397ee9c928cd2cb1725 | 11.5 MB | 2025-08-26T05:01:27Z |
| Source code (zip) | | | 2025-08-25T21:21:22Z |
| Source code (tar.gz) | | | 2025-08-25T21:21:22Z |
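
The `sha256:` digests listed with the assets above can be used to check a download before unpacking it. A minimal sketch, assuming the archive has already been saved locally; `sha256_of_file` and `verify_sha256` are hypothetical helper names, not part of any llama.cpp tooling.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(path, expected):
    """Compare a file's digest with an expected value, with or without a 'sha256:' prefix."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return sha256_of_file(path) == expected_hex
```

For example, `verify_sha256("llama-b6277-bin-ubuntu-x64.zip", "sha256:435275c8199149921112baaa77feb09aa2d4bb30c689f021dc1a034c5e646378")` should return `True` for an intact copy of that asset.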

05 Aug 23:55

github-actions

b6096

fd1234c

b6096

llama : add gpt-oss (#15091)

* oai moe

* compat with new checkpoint

* add attn sink impl

* add rope scaling yarn

* logits match with latest transformers code

* wip chat template

* rm trailing space

* use ggml_scale_bias

* rm redundant is_swa_all

* convert interleaved gate_up

* graph : fix activation function to match reference (#7)

* vocab : handle o200k_harmony special tokens

* ggml : add attention sinks support (#1)

* llama : add attn sinks

* ggml : add attn sinks

* cuda : add attn sinks

* vulkan : add support for sinks in softmax

remove unnecessary return

* ggml : add fused swiglu_oai op (#11)

* ggml : add fused swiglu_oai op

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov

* update CUDA impl

* cont : metal impl

* add vulkan impl

* test-backend-ops : more test cases, clean up

* llama : remove unfused impl

* remove extra lines

---------

Co-authored-by: Georgi Gerganov

---------

Co-authored-by: slaren
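
For readers unfamiliar with the "attn sinks" items above, here is a rough sketch of a softmax with an attention sink. It assumes the sink is one extra logit per head that takes part in the normalization but has no value vector, so it only absorbs probability mass; that is a reading of the technique, not code taken from ggml, and `softmax_with_sink` is a hypothetical name.

```python
import math


def softmax_with_sink(scores, sink):
    """Softmax over `scores` where an extra `sink` logit joins the denominator only."""
    m = max(max(scores), sink)                 # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink - m)     # the sink adds mass to the denominator
    return [e / denom for e in exps]


probs = softmax_with_sink([0.0, 0.0], sink=0.0)
print(probs, sum(probs))  # the weights sum to less than 1
```

With `sink=float("-inf")` the extra term vanishes and the function reduces to an ordinary softmax.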

* repack mxfp4 upon conversion

* clean up a bit

* enable thinking

* add quick hack to render only some special tokens

* fix bf16 conversion

* remove vocab hack

* webui ok

* support chat parsing for gpt-oss

* fix webui

* direct mapping mxfp4, FINALLY

* force using mxfp4

* properly use lazy tensor

* ggml : add mxfp4

ggml : use e8m0 conversion instead of powf

Co-authored-by: Diego Devesa

change kvalues_mxfp4 table to match e2m1 (#6)

metal : remove quantization for now (not used)

cuda : fix disabled CUDA graphs due to ffn moe bias

vulkan : add support for mxfp4

cont : add cm2 dequant

* ggml : add ggml_add_id (#13)

* ggml : add ggml_add_id

* add cuda impl

* llama : add weight support check for add_id

* perf opt

* add vulkan impl

* rename cuda files

* add metal impl

* allow in-place ggml_add_id

* llama : keep biases on CPU with --cpu-moe

* llama : fix compile error

ggml-ci

* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw

ggml-ci

* cleanup

ggml-ci

* sycl : fix supports_op for MXFP4

ggml-ci

* fix Unknown reasoning format

* ggml-cpu : fix AVX build

ggml-ci

* fix hip build

ggml-ci

* cuda : add mxfp4 dequantization support for cuBLAS

ggml-ci

* ggml-cpu : fix mxfp4 fallback definitions for some architectures

ggml-ci

* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw

---------

Co-authored-by: Xuan Son Nguyen
Co-authored-by: slaren
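
The `mxfp4` items above refer to an e8m0 scale and an e2m1 value table. As a hedged illustration of what dequantizing one MXFP4 block involves: the OCP Microscaling format groups 32 elements under one shared E8M0 scale (a bare exponent, value 2^(e-127)), and each element is a 4-bit E2M1 code. The nibble layout below (low nibbles hold the first 16 elements, high nibbles the last 16) and the function names are assumptions for this sketch, not a statement of ggml's actual block layout or API.

```python
import struct

# E2M1 magnitudes; bit 3 of the 4-bit code is the sign.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_VALUES = E2M1 + [-v for v in E2M1]


def e8m0_to_float(e):
    """Decode an E8M0 scale (2**(e - 127)) by building float32 bits, without calling pow."""
    if e == 0xFF:
        return float("nan")                    # 0xFF encodes NaN in E8M0
    # e == 0 means 2**-127, which is subnormal in float32 (mantissa bit 22 set).
    bits = 0x00400000 if e == 0 else e << 23
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dequantize_block(scale_e, qs):
    """Dequantize one 32-element block: 1 scale byte plus 16 bytes of packed 4-bit codes."""
    assert len(qs) == 16
    d = e8m0_to_float(scale_e)
    low = [E2M1_VALUES[b & 0x0F] * d for b in qs]    # assumed: elements 0..15
    high = [E2M1_VALUES[b >> 4] * d for b in qs]     # assumed: elements 16..31
    return low + high


print(dequantize_block(128, bytes([0x21] * 16))[:2])
```

Placing the exponent straight into the float32 exponent field is the kind of bit-level conversion the "use e8m0 conversion instead of powf" item points at; the exact helper used in ggml may differ.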

Assets 15

28 Jul 03:29

github-actions

b6006

7f97599

b6006

quantize : update README.md (#14905)

* Update README.md

* Fix trailing whitespace

* Update README.md

Co-authored-by: Sigbjørn Skjæret

---------

Co-authored-by: Sigbjørn Skjæret

Assets 15

15 Jul 11:08

github-actions

b5926

c659ace

b5926

Merge branch 'ggml-org:master' into master

Assets 15

14 Jul 23:02

github-actions

b5924

c06e908

b5924

Merge branch 'ggml-org:master' into master

Assets 15

14 Jul 14:00

github-actions

b5922

d099160

b5922

Merge branch 'ggml-org:master' into master

Assets 15

14 Jul 13:32

github-actions

b5919

c6d2de8

b5919

Merge branch 'ggml-org:master' into master

Assets 15

14 Jul 10:28

github-actions

b5917

914f3aa

b5917

Merge branch 'ggml-org:master' into master

Assets 15

14 Jul 01:42

github-actions

b5913

835a0b6

b5913

Update unicode.cpp

Assets 15

13 Jul 21:42

github-actions

b5912

7f4e47f

b5912

Merge branch 'master' of https://github.com/unslothai/llama.cpp

Assets 15
