Weekly Seminar (Member) #2
Date: 2025-09-09 OT / 전경호
Date: 2025-09-16 Program
Question wrap-up: TIP) Try to frame the research results more in terms of computing efficiency (latency, throughput, ...); a measurement sketch follows below.
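As one way to act on this tip, here is a minimal sketch of measuring per-request latency and generated-token throughput for a Hugging Face-style causal LM; `model`, `input_ids`, and the iteration counts are placeholder assumptions, not part of the seminar material.

```python
import time
import torch

def measure_latency_throughput(model, input_ids, max_new_tokens=128, warmup=3, iters=10):
    """Rough wall-clock latency (s/request) and throughput (generated tokens/s)."""
    # Warm-up so one-time costs (kernel compilation, caches) do not skew timings.
    for _ in range(warmup):
        model.generate(input_ids, max_new_tokens=max_new_tokens)

    if torch.cuda.is_available():
        torch.cuda.synchronize()  # flush pending GPU work before timing
    start = time.perf_counter()
    for _ in range(iters):
        out = model.generate(input_ids, max_new_tokens=max_new_tokens)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    latency = elapsed / iters                          # seconds per request
    gen_tokens = out.shape[-1] - input_ids.shape[-1]   # new tokens per request
    throughput = gen_tokens * iters / elapsed          # generated tokens per second
    return latency, throughput
```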
Program: Data Center Optimization for AI / 박재욱
Date: 2025-09-30 Program
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention / 김형균
An introduction to the MInference paper, which proposes a Dynamic Sparse Attention algorithm.
Question wrap-up (a toy sketch of the block-sparse idea follows the list):
Q) In the evaluation, exactly which stages does the end-to-end latency measurement cover? Prefill only, or prefill plus decode?
Q) How were the per-head latency graphs in the evaluation measured?
Q) In the online stage, does the sparsity-index approximation use full attention?
Q) Is Dynamic Sparse Attention also applied to the decode stage?
Q) Would the batch-size setting also affect the algorithm's performance?
Q) Is MMInference aimed at VLMs? How does it differ from MInference?
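To make the block-sparse idea concrete, below is a minimal, illustrative sketch: cheaply estimate online which key/value blocks matter for each query block, then run exact attention only inside the selected blocks. The mean-pooling index estimation, block size, and keep ratio are assumptions for illustration, not MInference's actual pattern search or kernels, and causal masking is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block=64, keep_ratio=0.1):
    """Toy dynamic block-sparse attention for one head.

    q, k, v: [seq, dim]; assumes seq is a multiple of `block`.
    The index-estimation step is a simplified stand-in, not MInference's.
    """
    seq, dim = q.shape
    n_blocks = seq // block

    # 1) Online estimation of important K blocks: mean-pool Q and K per block
    #    and score block pairs with a low-resolution attention map.
    q_pool = q.view(n_blocks, block, dim).mean(1)   # [nb, dim]
    k_pool = k.view(n_blocks, block, dim).mean(1)   # [nb, dim]
    scores = q_pool @ k_pool.T / dim**0.5           # [nb, nb]

    # 2) Dynamic sparse index: keep only the top-k key blocks per query block.
    topk = max(1, int(keep_ratio * n_blocks))
    keep = scores.topk(topk, dim=-1).indices        # [nb, topk]

    # 3) Exact attention restricted to the selected blocks (no causal mask here).
    out = torch.zeros_like(q)
    for qi in range(n_blocks):
        q_blk = q[qi * block:(qi + 1) * block]
        k_sel = torch.cat([k[j * block:(j + 1) * block] for j in keep[qi].tolist()])
        v_sel = torch.cat([v[j * block:(j + 1) * block] for j in keep[qi].tolist()])
        attn = F.softmax(q_blk @ k_sel.T / dim**0.5, dim=-1)
        out[qi * block:(qi + 1) * block] = attn @ v_sel
    return out
```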
Date: 2025-09-30 Paper presentation slides
Date: 2025-10-14 Papers
Presentation slides on research that tackles the outlier and accuracy problems that arise when applying quantization to attention.
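For background on the outlier issue, the sketch below shows one widely used mitigation in activation quantization: migrating per-channel activation outliers into the weights with a smoothing scale before uniform fake quantization, in the spirit of SmoothQuant. The function name, the alpha parameter, and per-tensor scaling are illustrative assumptions, not the presented paper's method.

```python
import torch

def smooth_and_quantize_matmul(x, w, alpha=0.5, n_bits=8):
    """Outlier smoothing + symmetric fake quantization for y = x @ w.

    x: activations [tokens, channels]; w: weights [channels, out].
    Illustrative sketch only; not a specific paper's implementation.
    """
    # Per-channel activation outliers dominate the quantization range, so
    # migrate difficulty from activations to weights with a scale s
    # (the full-precision product x @ w is unchanged by this rescaling).
    s = (x.abs().amax(0).clamp(min=1e-5) ** alpha
         / w.abs().amax(1).clamp(min=1e-5) ** (1 - alpha))
    x_s, w_s = x / s, w * s[:, None]

    def fake_quant(t):
        # Symmetric per-tensor fake quantization to n_bits.
        qmax = 2 ** (n_bits - 1) - 1
        scale = t.abs().max().clamp(min=1e-8) / qmax
        return (t / scale).round().clamp(-qmax, qmax) * scale

    return fake_quant(x_s) @ fake_quant(w_s)
```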
EfficientLLM seminar notes