
Commit f90900f

committed
added quantization notebook
1 parent 83bc2c4 commit f90900f


3 files changed: +249 -0 lines changed

docs/72_quantization/quantization.ipynb

Lines changed: 227 additions & 0 deletions
@@ -0,0 +1,227 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "96fd1423-4a3f-43ae-9803-bd8e2110cb93",
"metadata": {},
"source": [
"# Quantization\n",
"\n",
"In this notebook we demonstrate how models can be quantized to save memory. Note that the quantized model is not just smaller but may also perform worse.\n",
11+
"\n",
12+
"Read more\n",
13+
"* [Quantization in Huggingface Transformers documentation](https://huggingface.co/docs/transformers/main/en/quantization/overview)\n",
14+
"* [Quantization using bitsandbytes](https://huggingface.co/docs/transformers/main/en/quantization/bitsandbytes)\n",
15+
"* [Blog post about 8-bit quantization using bitsandbytes](https://huggingface.co/blog/hf-bitsandbytes-integration)"
16+
]
17+
},
18+
{
19+
"cell_type": "code",
20+
"execution_count": 1,
21+
"id": "a02ffeb1-1f8d-4051-9bf1-1658edce16b9",
22+
"metadata": {},
23+
"outputs": [],
24+
"source": [
25+
"from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n",
26+
"from utilities import calculate_model_memory_in_gb\n",
27+
"import torch"
28+
]
29+
},
30+
{
31+
"cell_type": "code",
32+
"execution_count": 2,
33+
"id": "17369f45-2bd3-45bd-8172-29d8761f809e",
34+
"metadata": {},
35+
"outputs": [],
36+
"source": [
37+
"model_name = \"google/gemma-2b-it\"\n",
38+
"attn_implementation = \"eager\""
39+
]
40+
},
41+
{
42+
"cell_type": "markdown",
43+
"id": "8aeedc5a-3ed3-4abc-b864-b8a03ee4671e",
44+
"metadata": {},
45+
"source": [
46+
"This is the very normal way to load a model from Huggingface. Note that we specify to store the model in RAM, and not in GPU memory. This makes sense as we do not plan to run the model, and CPU typically has access to more memory. "
47+
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "1282fc16-0737-4c6c-a32c-cca9546175c2",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.\n",
"Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use\n",
"`config.hidden_activation` if you want to override this behaviour.\n",
"See https://github.com/huggingface/transformers/pull/29402 for more details.\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "79e5f5a054c9469f9f88b23e7c7fb962",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"model = AutoModelForCausalLM.from_pretrained(\n",
" model_name,\n",
" device_map=\"cpu\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "e04fb7ef-db4f-41b2-bd85-9a5c5b23a832",
"metadata": {},
"source": [
"We can then determine the model size in memory:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "eeb96995-2889-4637-b80b-534bf28dddc3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"9.336219787597656"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"calculate_model_memory_in_gb(model)"
]
},
{
"cell_type": "markdown",
"id": "f80b3b70-e6d1-4e64-b82e-8eec72adf696",
"metadata": {},
"source": [
"## 8-bit quantization\n",
"\n",
"We will now load the model again with a defined 8-bit quantization configuration."
124+
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "5073ee43-de9d-49b4-9f58-3419a02ad8c9",
"metadata": {},
"outputs": [],
"source": [
"bnb_config = BitsAndBytesConfig(\n",
" load_in_8bit=True\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "843c1267-3224-46c4-9d20-65fd375f5896",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "67b0f9a20d2149b99c7ea59304af5675",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"quantized_model = AutoModelForCausalLM.from_pretrained(\n",
" model_name,\n",
" quantization_config=bnb_config,\n",
" device_map=\"cpu\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "1dad8eb5-39ae-4ae8-86f5-51c1e2d9e3c3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4.668109893798828"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"calculate_model_memory_in_gb(quantized_model)"
]
},
{
"cell_type": "markdown",
"id": "196a61ae-2008-474f-8ccc-1b4b04b0da54",
"metadata": {},
"source": [
"## Exercise\n",
"Explore alternative [Quantization configurations](https://huggingface.co/docs/transformers/main/en/main_classes/quantization#transformers.BitsAndBytesConfig) and try to make the model as small as possible."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e7176b15-0bfe-4f36-9ec7-7b166eb08157",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
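For the exercise at the end of the notebook, a natural next step is 4-bit quantization. The following is a minimal sketch, not part of this commit; it reuses `model_name` and `calculate_model_memory_in_gb` from above and assumes a bitsandbytes installation that supports 4-bit loading (typically on a CUDA GPU):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from utilities import calculate_model_memory_in_gb

model_name = "google/gemma-2b-it"

# 4-bit NF4 quantization; nested ("double") quantization also compresses the
# quantization constants, and computation is carried out in bfloat16.
bnb_4bit_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

quantized_model_4bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_4bit_config,
)

print(calculate_model_memory_in_gb(quantized_model_4bit))
```

Whether this halves the footprint again compared to 8 bit depends on which layers bitsandbytes keeps in higher precision, so checking the reported size as in the notebook is the safest way to compare configurations.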

docs/72_quantization/utilities.py

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
def get_folder_size_in_gb(directory):
    """
    Inspects a folder recursively and returns its size in gigabytes.
    """
    from os.path import join, getsize
    from os import walk
    total_size = 0
    # Walk the directory tree and sum up the size of every file.
    for dirpath, dirnames, filenames in walk(directory):
        for filename in filenames:
            file_path = join(dirpath, filename)
            total_size += getsize(file_path)
    return total_size / (1024 ** 3)


def calculate_model_memory_in_gb(model):
    """
    Inspects a PyTorch model and returns its size in memory in gigabytes.
    """
    # Each parameter occupies numel() * element_size() bytes
    # (e.g. 4 bytes per element for float32, 1 byte for int8).
    total_size_in_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    size_in_gb = total_size_in_bytes / (1024 ** 3)
    return size_in_gb
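As a quick sanity check of `calculate_model_memory_in_gb` (not part of this commit), one can compare its result against a hand-computed size for a tiny model; the `torch.nn.Linear` layer here is just an arbitrary example module:

```python
import torch
from utilities import calculate_model_memory_in_gb

# One float32 linear layer: 1024*1024 weights + 1024 biases, 4 bytes per value.
layer = torch.nn.Linear(1024, 1024)
expected_gb = (1024 * 1024 + 1024) * 4 / (1024 ** 3)

print(calculate_model_memory_in_gb(layer))  # ~0.0039
print(expected_gb)                          # ~0.0039
```

Because the helper reads `element_size()` per parameter, it also reports sensible numbers for models whose parameters use mixed dtypes.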

docs/_toc.yml

Lines changed: 1 addition & 0 deletions
@@ -135,6 +135,7 @@ parts:
 - file: 71_fine_tuning_hf/merging_model.ipynb
 - file: 71_fine_tuning_hf/test_model.ipynb
 - file: 71_fine_tuning_hf/hf_data_upload.ipynb
+- file: 72_quantization/quantization.ipynb

 - file: 80_benchmarking_llms/readme.md
   sections:
