-[](https://star-history.com/#johko/computer-vision-course&Date)
+[](https://star-history.com/#huggingface/computer-vision-course&Date)

chapters/en/unit0/welcome/welcome.mdx (2 additions, 27 deletions)

@@ -6,36 +6,11 @@ Welcome to the **community-driven course on computer vision**. Computer vision i
Throughout this course, we'll cover everything from the basics to the latest advancements in computer vision. It's structured to include various foundational topics, making it friendly and accessible for everyone. We're delighted to have you join us for this exciting journey!

-On this page, you can find how to join the learners community, make a submission and get a certificate, and more details about the course!
-
-## Assignment 📄
-
-To obtain your certification for completing the course, complete the following assignments:
-
-1. Training/fine-tuning a model
-2. Building an application and hosting it on Hugging Face Spaces
-
-### Training/fine-tuning a Model
-
-There are notebooks under the Notebooks/Vision Transformers section. As of now, we have notebooks for object detection, image segmentation, and image classification. You can either train a model on a dataset that exists on 🤗 Hub or upload a dataset to a dataset repository and train a model on that.
-
-The model repository needs to have the following:
-
-1. A properly filled model card, you can check out [here for more information](https://huggingface.co/docs/hub/en/model-cards).
-2. If you trained a model with transformers and pushed it to Hub, the model card will be generated. In that case, edit the card and fill in more details.
-3. Add the dataset’s ID to the model card to link the model repository to the dataset repository.
-
-### Creating a Space
-
-In this assignment section, you'll be building a Gradio-based application for your computer vision model and sharing it on 🤗 Spaces. Learn more about these tasks using the following resources:
-
-[Getting started with Gradio](https://huggingface.co/learn/nlp-course/chapter9/1?fw=pt#introduction-to-gradio)
-[How to share your application on 🤗 Spaces](https://huggingface.co/learn/nlp-course/chapter9/4?fw=pt)
+On this page, you can find how to join the learners community and more details about the course!

## Certification 🥇

-Once you've finished the assignments — Training/fine-tuning a Model and Creating a Space — please complete the [form](https://forms.gle/isiVSw59oiiHP6pN9) with your name, email, and links to your model and Space repositories to receive your certificate.
+Sorry, but currently we don't offer certification for this course. If you want to get involved in building a way for people to prove what they have learned in this course and make it a highly automated process, feel free to open a discussion or an issue.

chapters/en/unit3/vision-transformers/knowledge-distillation.mdx (16 additions, 7 deletions)

@@ -1,6 +1,6 @@
# Knowledge Distillation with Vision Transformers

-We are going to learn about Knowledge Distillation, the method behind [distilGPT](https://huggingface.co/distilgpt2) and [distilbert](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english), two of *the most downloaded models on the Hugging Face Hub!*
+We are going to learn about Knowledge Distillation, the method behind [distilGPT](https://huggingface.co/distilgpt2) and [distilbert](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english), two of _the most downloaded models on the Hugging Face Hub!_

Presumably, we've all had teachers who "teach" by simply providing us the correct answers and then testing us on questions we haven't seen before, analogous to supervised learning of machine learning models where we provide a labeled dataset to train on. Instead of having a model train on labels, however, we can pursue [Knowledge Distillation](https://arxiv.org/abs/1503.02531) as an alternative to arriving at a much smaller model that can perform comparably to the larger model and much faster to boot.

@@ -12,13 +12,13 @@ Imagine you were given this multiple-choice question:
If you had someone just tell you, "The answer is Draco Malfoy," that doesn't teach you a whole lot about each of the characters' relative relationships with Harry Potter.

-On the other hand, if someone tells you, "I am very confident it is not Ron Weasley, I am somewhat confident it is not Neville Longbottom, and I am very confident that it *is* Draco Malfoy," this gives you some information about these characters' relationships to Harry Potter! This is precisely the kind of information that gets passed down to our student model under the Knowledge Distillation paradigm.
+On the other hand, if someone tells you, "I am very confident it is not Ron Weasley, I am somewhat confident it is not Neville Longbottom, and I am very confident that it _is_ Draco Malfoy," this gives you some information about these characters' relationships to Harry Potter! This is precisely the kind of information that gets passed down to our student model under the Knowledge Distillation paradigm.

## Distilling the Knowledge in a Neural Network

-In the paper [*Distilling the Knowledge in a Neural Network*](https://arxiv.org/abs/1503.02531), Hinton et al. introduced the training methodology known as knowledge distillation, taking inspiration from *insects*, of all things. Just as insects transition from larval to adult forms that are optimized for different tasks, large-scale machine learning models can initially be cumbersome, like larvae, for extracting structure from data but can distill their knowledge into smaller, more efficient models for deployment.
+In the paper [_Distilling the Knowledge in a Neural Network_](https://arxiv.org/abs/1503.02531), Hinton et al. introduced the training methodology known as knowledge distillation, taking inspiration from _insects_, of all things. Just as insects transition from larval to adult forms that are optimized for different tasks, large-scale machine learning models can initially be cumbersome, like larvae, for extracting structure from data but can distill their knowledge into smaller, more efficient models for deployment.

-The essence of Knowledge Distillation is using the predicted logits from a teacher network to pass information to a smaller, more efficient student model. We do this by re-writing the loss function to contain a *distillation loss*, which encourages the student model's distribution over the output space to approximate the teacher's.
+The essence of Knowledge Distillation is using the predicted logits from a teacher network to pass information to a smaller, more efficient student model. We do this by re-writing the loss function to contain a _distillation loss_, which encourages the student model's distribution over the output space to approximate the teacher's.

The distillation loss is formulated as:

@@ -28,8 +28,14 @@ The KL loss refers to the [Kullback-Leibler Divergence](https://en.wikipedia.org
To see this loss function implemented in Python and a fully worked out example in Python, let's check out the [notebook for this section](https://github.com/huggingface/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/KnowledgeDistillation.ipynb).
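
A common formulation, following Hinton et al., blends a temperature-softened KL term between teacher and student with the usual hard-label cross-entropy. Here is a minimal PyTorch sketch of that idea; the temperature `T`, the weighting `alpha`, and the function name are illustrative choices rather than the notebook's exact code:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft (teacher -> student) KL term with the usual hard-label cross-entropy."""
    # Soften both distributions with the temperature T before comparing them.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between the softened distributions; T**2 keeps gradient magnitudes comparable.
    kd_term = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T**2)
    # Standard cross-entropy against the ground-truth labels.
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1 - alpha) * ce_term
```
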
# Leveraging Knowledge Distillation for Edge Devices
@@ -39,19 +45,22 @@ Knowledge distillation has become increasingly crucial as AI models are deployed
## The Consequences (Good & Bad) of Knowledge Distillation

### 1. Entropy Gain

In the context of information theory, entropy is analogous to its counterpart in physics, where it measures the "chaos" or disorder within a system. In our scenario, it quantifies the amount of information a distribution contains. Consider the following example:

- Which is harder to remember: `[0, 1, 0, 0]` or `[0.2, 0.5, 0.2, 0.1]`?

The first vector, `[0, 1, 0, 0]`, is easier to remember and compress, as it contains less information: it can be summarized as "a 1 in the second position." On the other hand, `[0.2, 0.5, 0.2, 0.1]` contains more information.
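
To make this concrete, here is a tiny sketch that computes the Shannon entropy of both vectors; the helper function is ours, purely for illustration:

```python
import numpy as np

def shannon_entropy(p, eps=1e-12):
    """Entropy (in bits) of a discrete probability distribution."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log2(p + eps)))

print(shannon_entropy([0, 1, 0, 0]))          # ~0.0 bits: fully certain, nothing extra to encode
print(shannon_entropy([0.2, 0.5, 0.2, 0.1]))  # ~1.76 bits: a softer, more informative distribution
```
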
Building on that, let’s say, for example, we trained an 80M parameter network on ImageNet and then distilled it (as discussed earlier) into a 5M parameter student model. We would find that the entropy contained in the output of the teacher model is much lower than that of the student model. This means that the output of the student model, even though correct, is more chaotic than the teacher’s outputs. This comes down to a simple fact: the teacher’s additional parameters help it discern between classes more easily as it extracts more features. This perspective on knowledge distillation is very interesting and is actively being researched to reduce the student’s entropy, either by using it as a loss function or by applying similar metrics inspired by physics (such as energy).

### 2. Coherent Gradient Updates

Models learn iteratively by minimizing a loss function and updating their parameters through gradient descent. Consider a set of parameters `P = {w1, w2, w3, ..., wn}`, whose role in the teacher model is to activate when detecting a sample of class A. If an ambiguous sample resembles class A but belongs to class B, the model's gradient update will be aggressive after the misclassification, leading to instability. In contrast, the distillation process, with the teacher model's soft targets, promotes more stable and coherent gradient updates during training, resulting in a smoother learning process for the student model.
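
As a rough illustration of why soft targets calm these updates: for a softmax classifier trained with cross-entropy, the gradient with respect to the logits is simply `softmax(logits) - target`. The toy numbers below are invented for illustration and compare that gradient for a hard one-hot label versus a hypothetical teacher's soft target on an ambiguous sample:

```python
import torch
import torch.nn.functional as F

# Student logits for an ambiguous sample that "looks like" class A (index 0) but belongs to class B (index 1).
logits = torch.tensor([2.0, 1.8, -1.0])

hard_target = torch.tensor([0.0, 1.0, 0.0])     # ground-truth one-hot label
soft_target = torch.tensor([0.45, 0.50, 0.05])  # hypothetical teacher distribution

probs = F.softmax(logits, dim=-1)
print(probs - hard_target)  # large push away from class A and pull toward class B -> aggressive update
print(probs - soft_target)  # much smaller corrections -> gentler, more coherent update
```
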
### 3. Ability to Train on Unlabeled Data

The presence of a teacher model allows the student model to train on unlabeled data. The teacher model can generate pseudo-labels for these unlabeled samples, which the student model can then use for training. This approach significantly increases the amount of usable training data.
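
A minimal sketch of this idea, assuming generic PyTorch classifiers for the teacher and student and an optimizer that is already set up; every name here is a placeholder rather than code from the course notebooks:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_label(teacher, images, T=2.0):
    """Run the frozen teacher and return softened probabilities to use as targets."""
    teacher.eval()
    return F.softmax(teacher(images) / T, dim=-1)

def distill_step(student, teacher, images, optimizer, T=2.0):
    """One training step on an unlabeled batch: no ground-truth labels required."""
    targets = pseudo_label(teacher, images, T)
    log_probs = F.log_softmax(student(images) / T, dim=-1)
    loss = F.kl_div(log_probs, targets, reduction="batchmean") * (T**2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
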
### 4. A Shift in Perspective

Deep learning models are typically trained with the assumption that providing enough data will allow them to approximate a function `F` that accurately represents the underlying phenomenon. However, in many cases, data scarcity makes this assumption unrealistic. The traditional approach involves building larger models and fine-tuning them iteratively to achieve optimal results. In contrast, knowledge distillation shifts this perspective: given that we already have a well-trained teacher model `F`, the goal becomes approximating `F` using a smaller model `f`.
But there is a catch! Convolutional Neural Networks (CNNs) are designed with an assumption that is missing in ViTs. This assumption is based on how we, as humans, perceive objects in images. It is described in the following section.

## What are the differences between CNNs and Vision Transformers?

### Inductive Bias
@@ -28,7 +28,6 @@ CNN models are very good at these two biases. ViT do not have this assumption. T
The transformer architecture, being (mostly) composed of different types of linear functions, allows ViT to become highly scalable. That, in turn, allows ViT to overcome the problem of not having the above two inductive biases with a massive amount of data!

### But how can everyone get access to massive datasets?

It's not feasible for everyone to train a Vision Transformer on millions of images to get good performance. Instead, one can use openly available model weights from places such as the [Hugging Face Hub](https://huggingface.co/models?sort=trending).
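
For example, a pretrained ViT and its matching image processor can be pulled from the Hub in a few lines with 🤗 Transformers; the checkpoint ID below is just one commonly used example:

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

checkpoint = "google/vit-base-patch16-224"  # example checkpoint; any image-classification model ID works
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForImageClassification.from_pretrained(checkpoint)

# Classify a sample image (a COCO photo commonly used in the docs).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```
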
@@ -47,7 +46,7 @@ You can go through the transfer learning tutorial using Vision Transformers for
chapters/en/unit3/vision-transformers/vision-transformers-for-image-segmentation.mdx (1 addition, 1 deletion)

@@ -92,7 +92,7 @@ As you can see below, the results include multiple instances of the same classes
With many pre-trained segmentation models available, transfer learning and finetuning are commonly used to adapt these models to specific use cases, especially since transformer-based segmentation models, like MaskFormer, are data-hungry and challenging to train from scratch.
These techniques leverage pre-trained representations to adapt these models to new data efficiently. Typically, for MaskFormer, the backbone, the pixel decoder, and the transformer decoder are kept frozen to leverage their learned general features, while the transformer module is finetuned to adapt its class prediction and mask generation capabilities to new segmentation tasks.

-[This notebook](https://colab.research.google.com/github/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/transfer-learning-segmentation.ipynb) will walk you through a transfer learning tutorial on image segmentation using MaskFormer.
+[This notebook](https://colab.research.google.com/github/huggingface/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/transfer-learning-segmentation.ipynb) will walk you through a transfer learning tutorial on image segmentation using MaskFormer.
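
A rough sketch of the freezing strategy described above, using 🤗 Transformers. The checkpoint ID is only an example, and the parameter-name filters are assumptions to verify against `model.named_parameters()` for your library version:

```python
from transformers import MaskFormerForInstanceSegmentation

# Example ADE20k checkpoint; any MaskFormer checkpoint from the Hub works the same way.
model = MaskFormerForInstanceSegmentation.from_pretrained("facebook/maskformer-swin-base-ade")

# Freeze everything first, then selectively unfreeze the parts to be finetuned.
for param in model.parameters():
    param.requires_grad = False

# Unfreeze the transformer module and the prediction heads.
# NOTE: these name filters are assumptions -- inspect model.named_parameters() to confirm
# how the backbone, pixel decoder, and transformer module are named in your version.
for name, param in model.named_parameters():
    if any(key in name for key in ("transformer_module", "class_predictor", "mask_embedder")):
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Finetuning {trainable:,} of {total:,} parameters")
```
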