Commit a2eec65

Merge pull request #403 from Kallinteris-Andreas/patch-4
fix "Deformable DETR" -> "Conditional DETR" typo
2 parents: 778b3ff + 52bf1c7

File tree

• chapters/en/unit3/vision-transformers/detr.mdx

1 file changed (+3, -3 lines)

chapters/en/unit3/vision-transformers/detr.mdx

Lines changed: 3 additions & 3 deletions
@@ -37,9 +37,9 @@ The second problem is resolved similarly to YOLOv3, in which multi-scale feature
 
 ### Conditional DETR
 Conditional DETR also sets out to resolve the problem of slow training convergence in the original DETR, resulting in convergence that is over 6.7 times faster. The authors found that the object queries are general and are not specific to the input image. Using **Conditional Cross-Attention** in the decoder, the queries can better localize the areas for bounding box regression.
-![A decoder layer for Deformable DETR](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/DETR_DecoderLayer.png)
-_Left: DETR Decoder Layer. Right: Deformable DETR Decoder Layer_
-The original DETR and Deformable DETR decoder layers are compared in the figure above, with the main difference being the query input of the cross-attention block. The authors make a distinction between content query c<sub>q</sub> (decoder self attention output) and spatial query p<sub>q</sub>. The original DETR simply adds them together. In Deformable DETR, they are concatenated, with c<sub>q</sub> focusing on the content of the object and p<sub>q</sub> focusing on the bounding box regions.
+![A decoder layer for Conditional DETR](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/DETR_DecoderLayer.png)
+_Left: DETR Decoder Layer. Right: Conditional DETR Decoder Layer_
+The original DETR and Conditional DETR decoder layers are compared in the figure above, with the main difference being the query input of the cross-attention block. The authors make a distinction between content query c<sub>q</sub> (decoder self attention output) and spatial query p<sub>q</sub>. The original DETR simply adds them together. In Conditional DETR, they are concatenated, with c<sub>q</sub> focusing on the content of the object and p<sub>q</sub> focusing on the bounding box regions.
 The spatial query p<sub>q</sub> is the result of both the decoder embeddings and object queries projecting to the same space (to become T and p<sub>s</sub> respectively) and multiplied together. The previous layers' decoder embeddings contain information for the bounding box regions, and the object queries contains information of learned reference points for each bounding box. Thus, their projections combine into a representation that allows for cross-attention to measure their similarities with the encoder input and sinusoidal positional embedding. This is more effective than DETR which only uses object queries and fixed reference points.
 
 ## DETR Inference
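To make the corrected text easier to follow, here is a minimal PyTorch sketch of the query construction it describes: DETR simply adds the content and spatial parts of the decoder query, while Conditional DETR projects the decoder embeddings and object queries (to T and p<sub>s</sub>), multiplies them into a spatial query p<sub>q</sub>, and concatenates it with the content query c<sub>q</sub>. The tensor names, shapes, and linear projections below are illustrative assumptions, not the reference Conditional DETR implementation.

```python
import torch
import torch.nn as nn

d_model = 256
num_queries = 4

# --- DETR-style query: content and spatial parts are simply added ---
content_query = torch.randn(num_queries, d_model)   # c_q: decoder self-attention output
object_query = torch.randn(num_queries, d_model)    # learned object query (spatial part)
detr_query = content_query + object_query           # shape: (num_queries, d_model)

# --- Conditional DETR-style query: spatial query p_q = T * p_s, then concatenate ---
proj_T = nn.Linear(d_model, d_model)    # hypothetical projection of decoder embeddings -> T
proj_ps = nn.Linear(d_model, d_model)   # hypothetical projection of object queries    -> p_s

decoder_embedding = torch.randn(num_queries, d_model)               # previous decoder layer output
spatial_query = proj_T(decoder_embedding) * proj_ps(object_query)   # p_q, elementwise product
conditional_query = torch.cat([content_query, spatial_query], dim=-1)

print(detr_query.shape)         # torch.Size([4, 256])
print(conditional_query.shape)  # torch.Size([4, 512])
```

Because the concatenation keeps c<sub>q</sub> and p<sub>q</sub> in separate channels, cross-attention can match object content and localize the bounding box region independently, which is the behavior the corrected paragraph attributes to Conditional DETR.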
