Commit 652e065

compare quantization on cpu and gpu
1 parent f90900f commit 652e065

1 file changed: docs/72_quantization/quantization.ipynb (+134 −4 lines changed)
@@ -24,7 +24,8 @@
    "source": [
     "from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n",
     "from utilities import calculate_model_memory_in_gb\n",
-    "import torch"
+    "import torch\n",
+    "import numpy as np"
    ]
   },
   {
@@ -65,7 +66,7 @@
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
-      "model_id": "79e5f5a054c9469f9f88b23e7c7fb962",
+      "model_id": "937bfa80eb814bc6b7848d3646777ef7",
       "version_major": 2,
       "version_minor": 0
      },
@@ -144,7 +145,7 @@
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
-      "model_id": "67b0f9a20d2149b99c7ea59304af5675",
+      "model_id": "aba28d966b9f405488417df37091bb67",
       "version_major": 2,
       "version_minor": 0
      },
@@ -185,13 +186,142 @@
     "calculate_model_memory_in_gb(quantized_model)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "a091cca6-69e6-47a0-9e97-453e989705d3",
+   "metadata": {},
+   "source": [
+    "Apparently, quantization is implemented differently for CPU and GPU devices: if we load the model into GPU memory, its size is different."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "be0c1a05-ebd8-4e5b-b294-b158f8b630ee",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "57d988878f9b40309daade21a83020b9",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "quantized_gpu_model = AutoModelForCausalLM.from_pretrained(\n",
+    "    model_name,\n",
+    "    quantization_config=bnb_config,\n",
+    "    device_map=\"cuda:0\"\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "d4eb965a-6bcf-4f6e-9257-13db1c43a32c",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "2.822406768798828"
+      ]
+     },
+     "execution_count": 9,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "calculate_model_memory_in_gb(quantized_gpu_model)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0510f950-3803-41e8-bb86-71c46146a619",
+   "metadata": {},
+   "source": [
+    "We can elaborate on this by inspecting the [element sizes in bytes](https://pytorch.org/docs/stable/generated/torch.Tensor.element_size.html) of the parameters in the models."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "6b36498b-b719-4896-a287-d34cec95022c",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([4])"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "np.unique([p.element_size() for p in model.parameters()])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "13199fe3-c955-4787-9f56-53248e6793b2",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([2])"
+      ]
+     },
+     "execution_count": 11,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "np.unique([p.element_size() for p in quantized_model.parameters()])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "id": "41993337-3dd9-46ce-98b5-920ee75d0a51",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([1, 2])"
+      ]
+     },
+     "execution_count": 12,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "np.unique([p.element_size() for p in quantized_gpu_model.parameters()])"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "196a61ae-2008-474f-8ccc-1b4b04b0da54",
    "metadata": {},
    "source": [
     "## Exercise\n",
-    "Explore alternative [Quantization configurations](https://huggingface.co/docs/transformers/main/en/main_classes/quantization#transformers.BitsAndBytesConfig) and try to make the model as small as possible."
+    "Explore alternative [Quantization configurations](https://huggingface.co/docs/transformers/main/en/main_classes/quantization#transformers.BitsAndBytesConfig) and try to make the model as small as possible. Hint: Compare different approaches using `device_map=\"cpu\"` and `device_map=\"cuda:0\"` on a machine with a GPU. One possible starting point is sketched below the diff."
    ]
   },
   {
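
The element sizes inspected in the new cells explain the memory numbers: the unquantized model stores 4 bytes per parameter (float32), the CPU-quantized model 2 bytes, and the GPU-quantized model a mix of 1 and 2 bytes. A minimal sketch of a memory helper built on this, assuming `calculate_model_memory_in_gb` simply sums `element_size() * nelement()` over all parameters (the actual implementation in the notebook's `utilities.py` may differ):

```python
import torch


def model_memory_in_gb(model: torch.nn.Module) -> float:
    """Estimate parameter memory: bytes per element times element count,
    summed over all parameters and converted to gigabytes."""
    total_bytes = sum(p.element_size() * p.nelement() for p in model.parameters())
    return total_bytes / 1024**3
```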
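
As a starting point for the exercise referenced in the diff: a 4-bit NF4 configuration typically shrinks the model further than the 8-bit setup used above. This is one possible configuration, not the notebook's solution; `model_name` and `calculate_model_memory_in_gb` are the objects defined earlier in the notebook, and the bitsandbytes 4-bit kernels require a CUDA device.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 weights; double quantization also compresses the quantization
# constants themselves for a small additional saving.
bnb_4bit_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

quantized_4bit_model = AutoModelForCausalLM.from_pretrained(
    model_name,  # the checkpoint loaded earlier in the notebook
    quantization_config=bnb_4bit_config,
    device_map="cuda:0",
)

calculate_model_memory_in_gb(quantized_4bit_model)
```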
