This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Linting:

```bash
pre-commit run --all-files
```

Style: PEP8, max line length 120, double quotes, LF endings. C++ source under `src/` uses clang-format.
Tests:

```bash
pytest tests/test_lmdeploy                 # all unit tests
pytest tests/test_lmdeploy/test_model.py   # specific file
pytest tests/test_lmdeploy/test_lite/      # quantization tests
pytest tests/test_lmdeploy/test_vl/        # vision-language tests
```

Debug logging:

```bash
LMDEPLOY_LOG_LEVEL=DEBUG python ...
```

Build (TurboMind C++ extension):
- Controlled via `setup.py` + CMake. Relevant env vars: `LMDEPLOY_TARGET_DEVICE` (default `cuda`), `DISABLE_TURBOMIND`, `CMAKE_BUILD_TYPE`, `CUDACXX`.
- Requirements are split by device: `requirements/runtime_cuda.txt`, `runtime_ascend.txt`, etc.
`lmdeploy/pipeline.py` is the main user-facing entry point (`pipeline()` in `api.py`). It instantiates either the PyTorch engine (`lmdeploy/pytorch/`) or the TurboMind engine (`lmdeploy/turbomind/`) based on config.
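The config-based dispatch can be sketched in plain Python. The two dataclasses below are simplified stand-ins for LMDeploy's real engine configs (the actual classes carry many more fields), and `select_engine()` is a hypothetical helper, not the library's API:

```python
from dataclasses import dataclass

# Simplified stand-ins for LMDeploy's engine config classes (illustration only).
@dataclass
class PytorchEngineConfig:
    tp: int = 1

@dataclass
class TurbomindEngineConfig:
    tp: int = 1

def select_engine(backend_config):
    """Dispatch on the config's type, mirroring how pipeline() picks an engine."""
    if isinstance(backend_config, TurbomindEngineConfig):
        return "turbomind"
    if isinstance(backend_config, PytorchEngineConfig):
        return "pytorch"
    raise TypeError(f"unsupported config: {type(backend_config).__name__}")

print(select_engine(TurbomindEngineConfig()))  # turbomind
```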
Model patching is the core mechanism: HuggingFace models are loaded normally, then their layers are dynamically replaced with optimized LMDeploy implementations.
- `lmdeploy/pytorch/models/module_map.py` — registry mapping HF class names → LMDeploy replacement classes. Device-specific overrides in `DEVICE_SPECIAL_MODULE_MAP`.
- `lmdeploy/pytorch/models/patch.py` — applies the substitutions at runtime via `_get_rewrite_qualname()` / `_class_from_qualname()`.
- `lmdeploy/pytorch/models/` — 40+ per-model files (e.g., `llama.py`, `qwen.py`, `deepseek_v2.py`). Each reimplements attention, MLP, and embeddings using custom kernels.
- `lmdeploy/pytorch/nn/` — reusable optimized modules: `linear/` (AWQ, W8A8, blocked-FP8, LoRA variants), `attention.py`, `norm.py`, `rotary_embedding.py`, `moe/`.
- `lmdeploy/pytorch/kernels/` — Triton/CUDA kernels (e.g., `w8a8_triton_kernels.py`).
- `lmdeploy/pytorch/backends/` — kernel/operator dispatchers per quantization type (FP8, AWQ, CUDA).
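The registry-plus-patch mechanism can be illustrated with a minimal sketch. All class names and the `register`/`patch` helpers below are made up for illustration; they only mirror the shape of the module-map idea (LMDeploy's real implementation works on qualified names and `torch.nn.Module` trees):

```python
# Registry: original class name -> optimized replacement class.
MODULE_MAP = {}

def register(orig_name):
    """Decorator that records a rewrite class under the original's name."""
    def deco(cls):
        MODULE_MAP[orig_name] = cls
        return cls
    return deco

class HFAttention:                    # stands in for a HuggingFace layer
    def forward(self, x):
        return f"slow({x})"

@register("HFAttention")
class FastAttention:                  # stands in for an LMDeploy rewrite
    def forward(self, x):
        return f"fast({x})"

def patch(model):
    """Swap child modules whose class name appears in MODULE_MAP."""
    for name, child in list(vars(model).items()):
        rewrite = MODULE_MAP.get(type(child).__name__)
        if rewrite is not None:
            setattr(model, name, rewrite())

class Model:
    def __init__(self):
        self.attn = HFAttention()     # loaded "normally" first

m = Model()
patch(m)                              # then layers are replaced in place
print(m.attn.forward("x"))            # fast(x)
```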
Engine execution flow (key files):

- `engine.py` — main PyTorch engine.
- `paging/scheduler.py` — sequences → batches; prefill/decode, block eviction, prefix caching (`BlockTrie`).
- `engine/engine_loop.py` — async inference loop.
- (See `pytorch/engine/` and `pytorch/paging/` for full execution detail.)
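The prefix-caching idea behind a block trie can be sketched in a few lines. This is a toy model only, assuming fixed-size token blocks as keys; the real `BlockTrie` additionally tracks GPU block IDs, reference counts, and eviction:

```python
BLOCK = 4  # tokens per KV-cache block (illustrative size)

class BlockTrie:
    """Toy trie keyed by full token blocks; matches reusable prompt prefixes."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        # Only complete blocks are cacheable; the tail remainder is skipped.
        for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
            node = node.setdefault(tuple(tokens[i:i + BLOCK]), {})

    def match_prefix(self, tokens):
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
            key = tuple(tokens[i:i + BLOCK])
            if key not in node:
                break
            node = node[key]
            matched += BLOCK
        return matched

trie = BlockTrie()
trie.insert([1, 2, 3, 4, 5, 6, 7, 8])
print(trie.match_prefix([1, 2, 3, 4, 9, 9, 9, 9]))  # 4
```

A new request whose prompt shares a cached block prefix can skip prefill for those tokens, which is the payoff of prefix caching.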
Configuration dataclasses (`lmdeploy/pytorch/config.py`): `ModelConfig`, `CacheConfig`, `SchedulerConfig`, `BackendConfig`, `DistConfig`, `MiscConfig`.
- Python wrapper: `lmdeploy/turbomind/turbomind.py` (~800 lines). Bridges into `lmdeploy/lib/_turbomind` (pybind11 extension built from `src/turbomind/`).
- Tensor interop via `torch.from_dlpack()` / `_tm.from_dlpack()`.
- Config and model conversion: `lmdeploy/turbomind/deploy/config.py`, `supported_models.py`.
- Parallel config helpers: `update_parallel_config()`, `complete_parallel_config()` in `messages.py`.
Entrypoints in `lmdeploy/lite/apis/`: `calibrate.py` (main), `auto_awq.py`, `gptq.py`, `smooth_quant.py`.
Flow: load HF model → `CalibrationContext` collects activation statistics → scale computation (`lmdeploy/lite/quantization/`) → write quantized weights.
- `lite/quantization/awq.py` — AWQ (`NORM_FCS_MAP`, `FC_FCS_MAP` define per-model layer structure).
- `lite/quantization/weight/quantizer.py` — weight quantizer.
- `lite/quantization/activation/observer.py` — activation statistics.
- `lite/modeling/` — model-specific GPTQ implementations (e.g., `internlm2_gptq.py`).
- `lite/utils/cal_qparams.py` — quantization parameter calculation utilities.
Layer/norm/head mappings per model family are defined directly in `calibrate.py` and `awq.py`.
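The observe-then-derive-scales step can be reduced to a small sketch. `AbsMaxObserver` is a hypothetical class, not LMDeploy's observer; it only illustrates the symmetric per-tensor case (the real `cal_qparams` utilities also handle per-channel and asymmetric schemes):

```python
class AbsMaxObserver:
    """Tracks the running absolute maximum of observed activations."""

    def __init__(self):
        self.absmax = 0.0

    def observe(self, values):
        self.absmax = max(self.absmax, max(abs(v) for v in values))

    def scale(self, n_bits=8):
        """Symmetric scale mapping [-absmax, absmax] onto the signed int range."""
        qmax = 2 ** (n_bits - 1) - 1   # 127 for int8
        return self.absmax / qmax

obs = AbsMaxObserver()
obs.observe([0.5, -2.0, 1.25])   # running absmax -> 2.0
obs.observe([1.0, -0.75])        # unchanged
print(obs.scale())               # 2.0 / 127, about 0.01575
```

During calibration one such observer is attached per tracked tensor; the collected scales are then baked into the quantized checkpoint.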
- `lmdeploy/vl/model/` — VLM preprocessing (InternVL, Qwen-VL, LLaVA, CogVLM, etc.).
- `lmdeploy/vl/media/` — image/video loaders and base classes.
- `lmdeploy/pytorch/multimodal/` — multimodal input handling for the PyTorch engine.
- Reference VLM implementation: `lmdeploy/vl/model/qwen3.py`.
- `lmdeploy/messages.py` — core types: `GenerationConfig`, `EngineConfig`, `TurbomindEngineConfig`, `SchedulerSequence`, `MessageStatus`.
- `lmdeploy/model.py` — chat templates; critical for correct conversation formatting.
- `lmdeploy/archs.py` — architecture registry mapping model arch names to runtime patches.
- `lmdeploy/tokenizer.py` — HuggingFace/SentencePiece tokenizer wrapper.
- `lmdeploy/serve/openai/` — OpenAI-compatible API server.
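A toy example of what a chat template does, since getting this wrong silently degrades model output: it converts role-tagged messages into the exact prompt string a model was trained on. The tag strings below are invented for illustration and do not match any real model's template:

```python
def apply_template(messages):
    """Render role-tagged messages into a single prompt string (toy format)."""
    parts = []
    for m in messages:
        # Each turn is wrapped in (made-up) role tags plus an end-of-turn token.
        parts.append(f"<|{m['role']}|>\n{m['content']}</s>")
    # Trailing assistant tag cues the model to start generating.
    parts.append("<|assistant|>\n")
    return "".join(parts)

prompt = apply_template([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi"},
])
print(prompt)
```

Because every model family uses different tags and stop tokens, the per-model templates in `lmdeploy/model.py` are what make conversations format correctly.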
Use the `/support-new-model` skill for a complete step-by-step guide.