reduce maxTokens for glm-4-9b-chat to fit 50GB GPU #47
base: main
Conversation
Summary of Changes: Hello @nicole-lihui, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses a critical memory issue encountered when deploying the glm-4-9b-chat model on 50GB GPUs by limiting how many tokens the vLLM runtime will process.
Code Review
This pull request aims to resolve an out-of-memory issue on 50GB GPUs for the glm-4-9b-chat model by limiting the token processing capacity. It introduces customRuntimeArgs to set --max-model-len and --max-num-batched-tokens for the vLLM runtime. While this is a good step, I've pointed out a significant inconsistency where spec.config.maxTokens is not updated to match the new runtime limit, which could lead to runtime errors and a poor user experience. I've recommended aligning these values for consistency.
```yaml
- customRuntimeArgs:
    - --max-num-batched-tokens=32768 # default
    - --max-model-len=32768
```
While adding --max-model-len=32768 correctly configures the vLLM runtime to prevent out-of-memory errors, there's an inconsistency with spec.config.maxTokens, which remains at 128000 on line 7. This can lead to a confusing user experience or runtime errors if a user requests a generation length that is valid according to maxTokens but, combined with the prompt length, exceeds the max-model-len limit.
To ensure consistency and prevent unexpected failures, spec.config.maxTokens should be aligned with max-model-len. I recommend reducing spec.config.maxTokens to 32768. Since this line is not part of the current changes, please consider amending this pull request to include this change.
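A minimal sketch of the aligned configuration this review suggests, assuming spec.config.maxTokens and deployments[].customRuntimeArgs live in the same manifest as the diff above (the exact nesting and surrounding fields are not shown in this PR and are illustrative):

```yaml
spec:
  config:
    maxTokens: 32768              # aligned with --max-model-len (previously 128000)
  deployments:
    - customRuntimeArgs:
        - --max-num-batched-tokens=32768
        - --max-model-len=32768   # vLLM context limit that fits a 50GB GPU
```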
Reason: The default 128k maxTokens causes OOM on 50GB GPUs for long-context inference.
Force-pushed the branch from fd96c01 to 7d1edc3.
```diff
 deployments:
-  - customRuntimeArgs: []
+  - customRuntimeArgs:
+      - --max-num-batched-tokens=32768 # Reduce maxTokens from 128k to 32k to fit 50GB GPU and avoid OOM
```
Generally, this shouldn't be set to the maximum length; in a single-machine setup it can be left unchanged.
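If that comment refers to --max-num-batched-tokens (the flag on the line it is attached to), a more conservative variant, sketched here for illustration only and not part of this PR, would cap just the context length and leave the batching limit at the vLLM default:

```yaml
deployments:
  - customRuntimeArgs:
      - --max-model-len=32768   # cap context length for a 50GB GPU; --max-num-batched-tokens left at the vLLM default
```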
part of #44
The default 128k maxTokens causes OOM on 50GB GPUs for long-context inference.
https://huggingface.co/zai-org/glm-4-9b-chat/blob/main/generation_config.json