
Speed up inference performance for attention and SSM layers with caching #279

@bigximik

Description

🎯 Goal (What & Why)

Speed up inference performance for attention layers with KV caching

To be implemented based on the caching functionality in Hugging Face Transformers. The implementation needs to support data, model, and sequence parallelism.
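For context, this is the caching behavior HF `generate` already provides, and it is the contract a Fast-LLM model wrapped for HF generation would need to satisfy. A minimal usage sketch (using `gpt2` purely as a stand-in checkpoint, not a Fast-LLM model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Fast-LLM KV cache test", return_tensors="pt")

# With use_cache=True, generate() carries past key/value states across steps:
# the prompt is processed once (prefill), and every later forward call only
# receives the most recently generated token plus the cache.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=16, use_cache=True)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```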

🚀 Execution Plan

Step 1

Support a range of positions starting from an offset > 0 in Fast-LLM inference. This is required for HF `generate` with caching, since it passes only the last generated token to the model's forward pass (see the sketch below).
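A minimal sketch of why the offset matters, assuming a hypothetical helper (`build_position_ids` is not an existing Fast-LLM function): when the cache already holds `past_length` tokens, the incoming tokens must be assigned positions starting at `past_length`, not 0, so that positional/rotary embeddings and the attention mask line up with the cached keys and values.

```python
import torch

def build_position_ids(input_ids: torch.Tensor, past_length: int) -> torch.Tensor:
    # input_ids: [batch, new_len]. With an empty cache, past_length == 0 and
    # positions are 0..new_len-1 as usual. During cached decoding, new_len is
    # typically 1 and the single new token gets position past_length.
    batch, new_len = input_ids.shape
    positions = torch.arange(past_length, past_length + new_len, device=input_ids.device)
    return positions.unsqueeze(0).expand(batch, -1)
```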

Step 2

Implement a KV cache for HF `generate` (see the sketch below).
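A rough sketch of the kind of cache object involved, loosely mirroring the `update`/`get_seq_length` interface of Hugging Face's `DynamicCache`. This is illustrative only; it ignores the data-, model-, and sequence-parallel layouts the real implementation must handle.

```python
import torch

class SimpleKVCache:
    """Minimal per-layer KV cache: append new keys/values along the sequence dim."""

    def __init__(self, num_layers: int):
        self.keys = [None] * num_layers
        self.values = [None] * num_layers

    def update(self, key: torch.Tensor, value: torch.Tensor, layer_idx: int):
        # key/value: [batch, heads, new_seq, head_dim]
        if self.keys[layer_idx] is None:
            self.keys[layer_idx], self.values[layer_idx] = key, value
        else:
            self.keys[layer_idx] = torch.cat([self.keys[layer_idx], key], dim=2)
            self.values[layer_idx] = torch.cat([self.values[layer_idx], value], dim=2)
        # Return the full (past + new) keys/values for this layer's attention.
        return self.keys[layer_idx], self.values[layer_idx]

    def get_seq_length(self, layer_idx: int = 0) -> int:
        # Number of tokens already cached; used as the position offset in Step 1.
        return 0 if self.keys[layer_idx] is None else self.keys[layer_idx].shape[2]
```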

📌 Acceptance Criteria (Must-Haves for Completion)

  • The feature must be functional and tested.
  • The implementation must be documented in practical terms.
  • The PR must include a performance/impact summary.
  • No refactors unless directly necessary for feature completion.

🛠️ Project Management

  • Assign the project to the Fast-LLM project.
  • Set the Estimate field (in days) in the GitHub project.
  • Use the Size field to categorize the PR size (Small/Medium/Large).
  • Assign an owner when opening the issue.
