🎯 Goal (What & Why)
Speed up inference by adding KV caching to the attention layers.
The implementation should follow the corresponding functionality in Hugging Face Transformers and must support data, model, and sequence parallelism.
🚀 Execution Plan
Step 1
Support position ranges starting at an offset greater than 0 in Fast-LLM inference. This is required for HF `generate` with a cache, which passes only the last generated token to the model's forward pass (see the sketch below).
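A minimal sketch of what the offset means in practice (PyTorch; `build_position_ids` and its `past_length` parameter are illustrative names, not Fast-LLM's actual API):

```python
import torch

def build_position_ids(input_ids: torch.Tensor, past_length: int) -> torch.Tensor:
    # With a warm KV cache, HF generate sends only the newly generated
    # token(s), so positions must start at past_length rather than 0.
    batch_size, seq_len = input_ids.shape
    positions = torch.arange(
        past_length, past_length + seq_len, device=input_ids.device
    )
    return positions.unsqueeze(0).expand(batch_size, -1)

# Prefill: past_length == 0, positions are 0..seq_len-1 as before.
# Decode step: a single token at position past_length, not position 0.
ids = torch.tensor([[42]])                       # only the last token
print(build_position_ids(ids, past_length=17))   # tensor([[17]])
```

Rotary or absolute position embeddings then consume these offset positions, so a cached decode step attends at the correct position instead of restarting from 0.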
Step 2
Implement a KV cache for HF `generate` (see the sketch below).
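A rough sketch of the cache mechanics (the `KVCache` class below is hypothetical, not Fast-LLM's or Transformers' actual API; in Transformers this role is played by its `Cache` classes such as `DynamicCache`): each layer appends the current step's keys and values and attends over the full cached sequence.

```python
import torch

class KVCache:
    """Per-layer key/value cache; tensors are (batch, heads, seq, head_dim)."""

    def __init__(self, num_layers: int):
        self.keys = [None] * num_layers
        self.values = [None] * num_layers

    def update(self, layer: int, k: torch.Tensor, v: torch.Tensor):
        # Append this step's keys/values along the sequence dimension and
        # return the full sequence for the attention computation.
        if self.keys[layer] is None:
            self.keys[layer], self.values[layer] = k, v
        else:
            self.keys[layer] = torch.cat([self.keys[layer], k], dim=2)
            self.values[layer] = torch.cat([self.values[layer], v], dim=2)
        return self.keys[layer], self.values[layer]

    def seq_length(self) -> int:
        # Number of tokens already cached; doubles as the position
        # offset needed in Step 1.
        return 0 if self.keys[0] is None else self.keys[0].shape[2]
```

Note that under tensor or sequence parallelism each rank would hold only its local shard of heads or sequence, which is why the cache layout has to respect the parallelism modes listed in the goal.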
📌 Acceptance Criteria (Must-Haves for Completion)
- The feature must be functional and tested.
- The implementation must be documented in practical terms.
- The PR must include a performance/impact summary.
- No refactors unless directly necessary for feature completion.
🛠️ Project Management
- Assign the project to the Fast-LLM project.
- Set the `Estimate` field (in days) in the GitHub project.
- Use the `Size` field to categorize the PR size (Small/Medium/Large).
- Assign an owner when opening the issue.