Commit a2eec65

Merge pull request #403 from Kallinteris-Andreas/patch-4
fix "Deformable DETR" -> "Conditional DETR" typo
2 parents: 778b3ff + 52bf1c7

File tree

• chapters/en/unit3/vision-transformers/detr.mdx

1 file changed (+3, -3 lines)

chapters/en/unit3/vision-transformers/detr.mdx

Lines changed: 3 additions & 3 deletions
@@ -37,9 +37,9 @@ The second problem is resolved similarly to YOLOv3, in which multi-scale feature
 
 ### Conditional DETR
 Conditional DETR also sets out to resolve the problem of slow training convergence in the original DETR, resulting in convergence that is over 6.7 times faster. The authors found that the object queries are general and are not specific to the input image. Using **Conditional Cross-Attention** in the decoder, the queries can better localize the areas for bounding box regression.
-![A decoder layer for Deformable DETR](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/DETR_DecoderLayer.png)
-_Left: DETR Decoder Layer. Right: Deformable DETR Decoder Layer_
-The original DETR and Deformable DETR decoder layers are compared in the figure above, with the main difference being the query input of the cross-attention block. The authors make a distinction between content query c<sub>q</sub> (decoder self attention output) and spatial query p<sub>q</sub>. The original DETR simply adds them together. In Deformable DETR, they are concatenated, with c<sub>q</sub> focusing on the content of the object and p<sub>q</sub> focusing on the bounding box regions.
+![A decoder layer for Conditional DETR](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/DETR_DecoderLayer.png)
+_Left: DETR Decoder Layer. Right: Conditional DETR Decoder Layer_
+The original DETR and Conditional DETR decoder layers are compared in the figure above, with the main difference being the query input of the cross-attention block. The authors make a distinction between content query c<sub>q</sub> (decoder self attention output) and spatial query p<sub>q</sub>. The original DETR simply adds them together. In Conditional DETR, they are concatenated, with c<sub>q</sub> focusing on the content of the object and p<sub>q</sub> focusing on the bounding box regions.
 The spatial query p<sub>q</sub> is the result of both the decoder embeddings and object queries projecting to the same space (to become T and p<sub>s</sub> respectively) and multiplied together. The previous layers' decoder embeddings contain information for the bounding box regions, and the object queries contains information of learned reference points for each bounding box. Thus, their projections combine into a representation that allows for cross-attention to measure their similarities with the encoder input and sinusoidal positional embedding. This is more effective than DETR which only uses object queries and fixed reference points.
 
 ## DETR Inference
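To make the corrected text easier to follow, here is a minimal PyTorch sketch of the query construction it describes: DETR simply adds the content and spatial parts of the decoder query, while Conditional DETR projects the decoder embeddings and object queries (to T and p<sub>s</sub>), multiplies them into a spatial query p<sub>q</sub>, and concatenates it with the content query c<sub>q</sub>. The tensor names, shapes, and linear projections below are illustrative assumptions, not the reference Conditional DETR implementation.

```python
import torch
import torch.nn as nn

d_model = 256
num_queries = 4

# --- DETR-style query: content and spatial parts are simply added ---
content_query = torch.randn(num_queries, d_model)   # c_q: decoder self-attention output
object_query = torch.randn(num_queries, d_model)    # learned object query (spatial part)
detr_query = content_query + object_query           # shape: (num_queries, d_model)

# --- Conditional DETR-style query: spatial query p_q = T * p_s, then concatenate ---
proj_T = nn.Linear(d_model, d_model)    # hypothetical projection of decoder embeddings -> T
proj_ps = nn.Linear(d_model, d_model)   # hypothetical projection of object queries    -> p_s

decoder_embedding = torch.randn(num_queries, d_model)               # previous decoder layer output
spatial_query = proj_T(decoder_embedding) * proj_ps(object_query)   # p_q, elementwise product
conditional_query = torch.cat([content_query, spatial_query], dim=-1)

print(detr_query.shape)         # torch.Size([4, 256])
print(conditional_query.shape)  # torch.Size([4, 512])
```

Because the concatenation keeps c<sub>q</sub> and p<sub>q</sub> in separate channels, cross-attention can match object content and localize the bounding box region independently, which is the behavior the corrected paragraph attributes to Conditional DETR.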
