Commit 0e01b07

Merge pull request #408 from johko/general-updates

Some General Updates

2 parents b66b88e + ed047cc

5 files changed: +26 -43 lines


README.md

Lines changed: 2 additions & 2 deletions
@@ -40,9 +40,9 @@ Join [the Hugging Face discord](https://discord.gg/hugging-face-8795489624644936
 ### Contributors
 
 <a href="https://github.com/huggingface/computer-vision-course/graphs/contributors">
-  <img src="https://contrib.rocks/image?repo=johko/computer-vision-course" />
+  <img src="https://contrib.rocks/image?repo=huggingface/computer-vision-course" />
 </a>
 
 ### Star History
 
-[![Star History Chart](https://api.star-history.com/svg?repos=johko/computer-vision-course&type=Date)](https://star-history.com/#johko/computer-vision-course&Date)
+[![Star History Chart](https://api.star-history.com/svg?repos=huggingface/computer-vision-course&type=Date)](https://star-history.com/#huggingface/computer-vision-course&Date)

chapters/en/unit0/welcome/welcome.mdx

Lines changed: 2 additions & 27 deletions
@@ -6,36 +6,11 @@ Welcome to the **community-driven course on computer vision**. Computer vision i
 
 Throughout this course, we'll cover everything from the basics to the latest advancements in computer vision. It's structured to include various foundational topics, making it friendly and accessible for everyone. We're delighted to have you join us for this exciting journey!
 
-On this page, you can find how to join the learners community, make a submission and get a certificate, and more details about the course!
-
-## Assignment 📄
-
-To obtain your certification for completing the course, complete the following assignments:
-
-1. Training/fine-tuning a model
-2. Building an application and hosting it on Hugging Face Spaces
-
-### Training/fine-tuning a Model
-
-There are notebooks under the Notebooks/Vision Transformers section. As of now, we have notebooks for object detection, image segmentation, and image classification. You can either train a model on a dataset that exists on 🤗 Hub or upload a dataset to a dataset repository and train a model on that.
-
-The model repository needs to have the following:
-
-
-1. A properly filled model card, you can check out [here for more information](https://huggingface.co/docs/hub/en/model-cards).
-2. If you trained a model with transformers and pushed it to Hub, the model card will be generated. In that case, edit the card and fill in more details.
-3. Add the dataset’s ID to the model card to link the model repository to the dataset repository.
-
-### Creating a Space
-
-In this assignment section, you'll be building a Gradio-based application for your computer vision model and sharing it on 🤗 Spaces. Learn more about these tasks using the following resources:
-
-- [Getting started with Gradio](https://huggingface.co/learn/nlp-course/chapter9/1?fw=pt#introduction-to-gradio)
-- [How to share your application on 🤗 Spaces](https://huggingface.co/learn/nlp-course/chapter9/4?fw=pt)
+On this page, you can find how to join the learners community and more details about the course!
 
 ## Certification 🥇
 
-Once you've finished the assignments — Training/fine-tuning a Model and Creating a Space — please complete the [form](https://forms.gle/isiVSw59oiiHP6pN9) with your name, email, and links to your model and Space repositories to receive your certificate.
+Sorry, but currently we don't offer certification for this course. If you want to get involved in building a way for people to prove what they have learned in this course and make it a highly automated process, feel free to open a discussion or an issue.
 
 ## Join the community!
 
chapters/en/unit3/vision-transformers/knowledge-distillation.mdx

Lines changed: 16 additions & 7 deletions
@@ -1,6 +1,6 @@
 # Knowledge Distillation with Vision Transformers
 
-We are going to learn about Knowledge Distillation, the method behind [distilGPT](https://huggingface.co/distilgpt2) and [distilbert](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english), two of *the most downloaded models on the Hugging Face Hub!*
+We are going to learn about Knowledge Distillation, the method behind [distilGPT](https://huggingface.co/distilgpt2) and [distilbert](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english), two of _the most downloaded models on the Hugging Face Hub!_
 
 Presumably, we've all had teachers who "teach" by simply providing us the correct answers and then testing us on questions we haven't seen before, analogous to supervised learning of machine learning models where we provide a labeled dataset to train on. Instead of having a model train on labels, however, we can pursue [Knowledge Distillation](https://arxiv.org/abs/1503.02531) as an alternative to arriving at a much smaller model that can perform comparably to the larger model and much faster to boot.
 
@@ -12,13 +12,13 @@ Imagine you were given this multiple-choice question:
 
 If you had someone just tell you, "The answer is Draco Malfoy," that doesn't teach you a whole lot about each of the characters' relative relationships with Harry Potter.
 
-On the other hand, if someone tells you, "I am very confident it is not Ron Weasley, I am somewhat confident it is not Neville Longbottom, and I am very confident that it *is* Draco Malfoy," this gives you some information about these characters' relationships to Harry Potter! This is precisely the kind of information that gets passed down to our student model under the Knowledge Distillation paradigm.
+On the other hand, if someone tells you, "I am very confident it is not Ron Weasley, I am somewhat confident it is not Neville Longbottom, and I am very confident that it _is_ Draco Malfoy," this gives you some information about these characters' relationships to Harry Potter! This is precisely the kind of information that gets passed down to our student model under the Knowledge Distillation paradigm.
 
 ## Distilling the Knowledge in a Neural Network
 
-In the paper [*Distilling the Knowledge in a Neural Network*](https://arxiv.org/abs/1503.02531), Hinton et al. introduced the training methodology known as knowledge distillation, taking inspiration from *insects*, of all things. Just as insects transition from larval to adult forms that are optimized for different tasks, large-scale machine learning models can initially be cumbersome, like larvae, for extracting structure from data but can distill their knowledge into smaller, more efficient models for deployment.
+In the paper [_Distilling the Knowledge in a Neural Network_](https://arxiv.org/abs/1503.02531), Hinton et al. introduced the training methodology known as knowledge distillation, taking inspiration from _insects_, of all things. Just as insects transition from larval to adult forms that are optimized for different tasks, large-scale machine learning models can initially be cumbersome, like larvae, for extracting structure from data but can distill their knowledge into smaller, more efficient models for deployment.
 
-The essence of Knowledge Distillation is using the predicted logits from a teacher network to pass information to a smaller, more efficient student model. We do this by re-writing the loss function to contain a *distillation loss*, which encourages the student model's distribution over the output space to approximate the teacher's.
+The essence of Knowledge Distillation is using the predicted logits from a teacher network to pass information to a smaller, more efficient student model. We do this by re-writing the loss function to contain a _distillation loss_, which encourages the student model's distribution over the output space to approximate the teacher's.
 
 The distillation loss is formulated as:
 
@@ -28,8 +28,14 @@ The KL loss refers to the [Kullback-Leibler Divergence](https://en.wikipedia.org
 
 To see this loss function implemented in Python and a fully worked out example in Python, let's check out the [notebook for this section](https://github.com/huggingface/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/KnowledgeDistillation.ipynb).
 
-<a target="_blank" href="https://colab.research.google.com/github/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/KnowledgeDistillation.ipynb">
-  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
+<a
+  target="_blank"
+  href="https://colab.research.google.com/github/huggingface/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/KnowledgeDistillation.ipynb"
+>
+  <img
+    src="https://colab.research.google.com/assets/colab-badge.svg"
+    alt="Open In Colab"
+  />
 </a>
 
 # Leveraging Knowledge Distillation for Edge Devices
@@ -39,19 +45,22 @@ Knowledge distillation has become increasingly crucial as AI models are deployed
 ## The Consequences (good & bad) of Knowledge Distillation
 
 ### 1. Entropy Gain
+
 In the context of information theory, entropy is analogous to its counterpart in physics, where it measures the "chaos" or disorder within a system. In our scenario, it quantifies the amount of information a distribution contains. Consider the following example:
 
 - Which is harder to remember: `[0, 1, 0, 0]` or `[0.2, 0.5, 0.2, 0.1]`?
 
 The first vector, `[0, 1, 0, 0]`, is easier to remember and compress, as it contains less information. This can be represented as "1" in the second position. On the other hand, `[0.2, 0.5, 0.2, 0.1]` contains more information.
 Building on that, let’s say, for example, we trained an 80M parameter network on ImageNet and then distilled it (as discussed earlier) into a 5M parameter student model. We would find that the entropy contained in the output of the teacher model is much lower than that of the student model. This means that the output of the student model, even though correct, is more chaotic than the teacher’s outputs. This comes down to a simple fact: the teacher’s additional parameters help it discern between classes more easily as it extracts more features. This perspective on knowledge distillation is very interesting and is actively being researched to reduce the student’s entropy, either by using it as a loss function or by applying similar metrics inspired by physics (such as energy).
 
-
 ### 2. Coherent Gradient Updates
+
 Models learn iteratively by minimizing a loss function and updating their parameters through gradient descent. Consider a set of parameters `P = {w1, w2, w3, ..., wn}`, whose role in the teacher model is to activate when detecting a sample of class A. If an ambiguous sample resembles class A but belongs to class B, the model's gradient update will be aggressive after the misclassification, leading to instability. In contrast, the distillation process, with the teacher model's soft targets, promotes more stable and coherent gradient updates during training, resulting in a smoother learning process for the student model.
 
 ### 3. Ability to Train on Unlabeled Data
+
 The presence of a teacher model allows the student model to train on unlabeled data. The teacher model can generate pseudo-labels for these unlabeled samples, which the student model can then use for training. This approach significantly increases the amount of usable training data.
 
 ### 4. A Shift in Perspective
+
 Deep learning models are typically trained with the assumption that providing enough data will allow them to approximate a function `F` that accurately represents the underlying phenomenon. However, in many cases, data scarcity makes this assumption unrealistic. The traditional approach involves building larger models and fine-tuning them iteratively to achieve optimal results. In contrast, knowledge distillation shifts this perspective: given that we already have a well-trained teacher model `F`, the goal becomes approximating `F` using a smaller model `f`.
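As a quick check on the entropy example in this hunk, here is a small self-contained sketch (not from the course materials) that computes the Shannon entropy of the two vectors:

```python
import math

def entropy(p, eps=1e-12):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(pi * math.log2(pi + eps) for pi in p)

print(entropy([0, 1, 0, 0]))          # ~0.0 bits: fully certain, nothing extra to remember
print(entropy([0.2, 0.5, 0.2, 0.1]))  # ~1.76 bits: softer distribution, more information
```

The one-hot vector carries roughly 0 bits while the softened distribution carries about 1.76 bits, matching the intuition that soft targets hold more information than hard labels.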

chapters/en/unit3/vision-transformers/vision-transformers-for-image-classification.mdx

Lines changed: 5 additions & 6 deletions
@@ -8,12 +8,12 @@ As the Transformers architecture scaled well in Natural Language Processing, the
 
 To summarize, in Vision transformer, images are reorganized as 2D grids of patches. The models are trained on those patches.
 
-The main idea can be found at the picture below:
+The main idea can be found at the picture below:
 ![Vision Transformer](https://huggingface.co/datasets/hf-vision/course-assets/blob/main/Screenshot%20from%202024-12-27%2014-25-49.png)
 
-But there is a catch! The Convolutional Neural Networks (CNN) are designed with an assumption missing in the VT. This assumption is based on how we perceive the objects in the images as humans. It is described in the following section.
+But there is a catch! The Convolutional Neural Networks (CNN) are designed with an assumption missing in the VT. This assumption is based on how we perceive the objects in the images as humans. It is described in the following section.
 
-## What are the differences between CNNs and Vision Transformers?
+## What are the differences between CNNs and Vision Transformers?
 
 ### Inductive Bias
 
@@ -28,7 +28,6 @@ CNN models are very good at these two biases. ViT do not have this assumption. T
 The transformer architecture being (mostly) different types of linear functions allows ViT to become highly scalable. And that in turn allows ViT to overcome the problem of not having the above two
 inductive biases with massive amounts of data!
 
-
 ### But how can everyone get access to massive datasets?
 
 It's not feasible for everyone to train a Vision Transformer on millions of images to get good performance. Instead, one can use openly available model weights from places such as the [Hugging Face Hub](https://huggingface.co/models?sort=trending).
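For instance, here is a minimal sketch of reusing pretrained ViT weights from the Hub via the `transformers` pipeline; the checkpoint `google/vit-base-patch16-224` and the image path are illustrative choices, not requirements of the course:

```python
from transformers import pipeline

# Download a ViT checkpoint pretrained on ImageNet from the Hugging Face Hub.
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")

# Classify an image from a local path or a URL.
predictions = classifier("path/to/your/image.jpg")
print(predictions)  # top predicted labels with confidence scores
```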
@@ -47,7 +46,7 @@ You can go through the transfer learning tutorial using Vision Transformers for
 
 <a
   target="_blank"
-  href="https://colab.research.google.com/github/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/transfer-learning-image-classification.ipynb"
+  href="https://colab.research.google.com/github/huggingface/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/transfer-learning-image-classification.ipynb"
 >
   <img
     src="https://colab.research.google.com/assets/colab-badge.svg"
@@ -81,7 +80,7 @@ This notebook will walk you through a fine-tuning tutorial using Vision Transfor
 
 <a
   target="_blank"
-  href="https://colab.research.google.com/github/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/fine-tuning-multilabel-image-classification.ipynb"
+  href="https://colab.research.google.com/github/huggingface/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/fine-tuning-multilabel-image-classification.ipynb"
 >
   <img
     src="https://colab.research.google.com/assets/colab-badge.svg"

chapters/en/unit3/vision-transformers/vision-transformers-for-image-segmentation.mdx

Lines changed: 1 addition & 1 deletion
@@ -92,7 +92,7 @@ As you can see below, the results include multiple instances of the same classes
 With many pre-trained segmentation models available, transfer learning and finetuning are commonly used to adapt these models to specific use cases, especially since transformer-based segmentation models, like MaskFormer, are data-hungry and challenging to train from scratch.
 These techniques leverage pre-trained representations to adapt these models to new data efficiently. Typically, for MaskFormer, the backbone, the pixel decoder, and the transformer decoder are kept frozen to leverage their learned general features, while the transformer module is finetuned to adapt its class prediction and mask generation capabilities to new segmentation tasks.
 
-[This notebook](https://colab.research.google.com/github/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/transfer-learning-segmentation.ipynb) will walk you through a transfer learning tutorial on image segmentation using MaskFormer.
+[This notebook](https://colab.research.google.com/github/huggingface/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/transfer-learning-segmentation.ipynb) will walk you through a transfer learning tutorial on image segmentation using MaskFormer.
 
 ## References
 
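As a rough sketch of the freezing strategy this hunk describes, assuming the module names used in the `transformers` MaskFormer implementation (treat the prefixes as assumptions and verify them against `model.named_parameters()` for your version):

```python
from transformers import MaskFormerForInstanceSegmentation

# An example checkpoint; any MaskFormer checkpoint from the Hub would do.
model = MaskFormerForInstanceSegmentation.from_pretrained(
    "facebook/maskformer-swin-base-ade"
)

# Freeze the pre-trained feature extractors (backbone + pixel decoder and the
# transformer decoder), leaving the prediction heads trainable. The name
# prefixes below are assumptions about the implementation's module layout.
frozen_prefixes = ("model.pixel_level_module", "model.transformer_module.decoder")
for name, param in model.named_parameters():
    if name.startswith(frozen_prefixes):
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```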
