v1.4.0 #838
ggerganov announced in Announcements
Replies: 1 comment

Is it possible to get the command tool binary for Windows in this version?
Overview
This is a new major release adding integer quantization and partial GPU (NVIDIA) support.
Integer quantization
This allows the ggml Whisper models to be converted from the default 16-bit floating point weights to 4-, 5- or 8-bit integer weights. The resulting quantized models are smaller in disk size and memory usage and can be processed faster on some architectures. The transcription quality is degraded to some extent - not quantified at the moment.
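A minimal sketch of the workflow, assuming a whisper.cpp checkout with the `quantize` example built from this release and an already-downloaded `ggml-base.en.bin` model (paths and sample file are illustrative):

```shell
# build the quantization tool
make quantize

# convert the F16 model to 5-bit (Q5_0) integer weights
./quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0

# run transcription with the quantized model as usual
./main -m models/ggml-base.en-q5_0.bin -f samples/jfk.wav
```

The quantized file should come out noticeably smaller than the F16 original - very roughly a third of the size for Q5_0, depending on the mode chosen.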
Supported quantization modes: Q4_0, Q4_1, Q4_2, Q5_0, Q5_1, Q8_0

Q5 quantized models are available at: https://whisper.ggerganov.com

Here is a quantitative evaluation of the different quantization modes applied to the LLaMA and RWKV large language models. These results can give an impression about the expected quality, size and performance improvements for quantized Whisper models:
LLaMA quantization (measured on M1 Pro)
ref: https://github.com/ggerganov/llama.cpp#quantization
RWKV quantization
Modes compared: Q4_0, Q4_1, Q4_2, Q5_0, Q5_1, Q8_0, FP16, FP32
ref: ggml-org/ggml#89 (comment)
This feature is possible thanks to the many contributions in the llama.cpp project: "ggml : improve integer quantization"
GPU support via cuBLAS
Using cuBLAS results mainly in improved Encoder inference speed. I haven't done proper timings, but one can expect at least 2-3 times faster Encoder evaluation with modern NVIDIA GPU cards compared to CPU-only processing. Feel free to post your Encoder benchmarks in issue #89.
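As a sketch, assuming the NVIDIA CUDA toolkit is installed and using the Makefile flag from this release (verify against your checkout):

```shell
# rebuild whisper.cpp with cuBLAS-accelerated matrix multiplication
make clean
WHISPER_CUBLAS=1 make -j
```

The resulting binaries offload the large matrix multiplications of the Encoder to the GPU via cuBLAS, while the rest of the pipeline stays on the CPU - hence the "partial" GPU support mentioned in the overview.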
This is another feature made possible by the llama.cpp project. Special recognition to @slaren for putting almost all of this work together.
This release remains in "beta" stage as I haven't verified that everything works as expected.
What's Changed
New Contributors
Full Changelog: v1.3.0...v1.4.0
This discussion was created from the release v1.4.0.