
Conversation

@0xrushi 0xrushi commented Oct 26, 2025

Purpose

Resolves #27509

  • Introduces the new model nvidia/omnivinci.
  • Adds support for Hugging Face configs and tokenizers located in subfolders (see the sketch below).
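
For context, the subfolder loading mirrors the subfolder keyword that Hugging Face transformers already exposes; a minimal sketch of the intended usage, assuming the llm/ repo layout described in the review below (trust_remote_code and auth arguments omitted for brevity):

from transformers import AutoConfig, AutoTokenizer

# nvidia/omnivinci keeps its language-model artifacts under an llm/
# subfolder rather than at the repo root, so the config and tokenizer
# must both be resolved with subfolder="llm".
config = AutoConfig.from_pretrained("nvidia/omnivinci", subfolder="llm")
tokenizer = AutoTokenizer.from_pretrained("nvidia/omnivinci", subfolder="llm")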

Test Plan

python -m vllm.entrypoints.openai.api_server --model nvidia/omnivinci --quantization bitsandbytes --dtype float16 --gpu-memory-utilization 0.8 --max-model-len 8192
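
Once the server is up, it can be exercised with a standard OpenAI-compatible completions request; a minimal sketch using requests (host, port, and endpoint are vLLM's defaults):

import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",  # default vllm serve address
    json={
        "model": "nvidia/omnivinci",
        "prompt": "Explain the difference between BFS and DFS in one paragraph.",
        "max_tokens": 256,
    },
)
print(resp.json()["choices"][0]["text"])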

Alternatively, with the offline LLM API:

from vllm import LLM, SamplingParams

def main() -> None:
    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=0.9,
        max_tokens=256,
    )

    prompts = [
        "Explain the difference between BFS and DFS in one paragraph.",
        "Write a short poem about autumn in two lines.",
    ]

    # Engine configuration mirrors the serve command above.
    llm = LLM(
        model="nvidia/omnivinci",
        quantization="bitsandbytes",
        dtype="float16",
        max_model_len=8192,
        gpu_memory_utilization=0.8,
        trust_remote_code=False,
    )

    outputs = llm.generate(prompts, sampling_params)

    for i, out in enumerate(outputs):
        print(f"\n=== Prompt {i} ===")
        print(prompts[i])
        print("\n--- Output ---")
        print(out.outputs[0].text.strip())


if __name__ == "__main__":
    main()

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@0xrushi 0xrushi requested a review from 22quinn as a code owner October 26, 2025 03:03
Signed-off-by: 0xrushi <[email protected]>
@mergify mergify bot added the new-model (Requests to new models) label Oct 26, 2025
@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces support for the nvidia/omnivinci model and adds functionality to load Hugging Face configurations, tokenizers, and model weights from a subfolder, which is a valuable enhancement. The changes are well-structured and include corresponding tests. However, I've identified two critical issues. First, the logic for loading subfolder configurations does not correctly handle offline mode, which will cause failures for users in environments without internet access. Second, the subfolder weight loading mechanism appears to be implemented only for the BitsAndBytesModelLoader, meaning the new omnivinci model will fail to load in the default non-quantized use case. Addressing these issues is crucial for the feature to be robust and work as expected for all users.

Comment on lines 131 to 139

    sub_cfg = AutoConfig.from_pretrained(
        model,
        trust_remote_code=trust_remote_code,
        revision=revision,
        code_revision=code_revision,
        token=_get_hf_token(),
        subfolder="llm",
        **{k: v for k, v in kwargs.items() if k != "local_files_only"},
    )

critical

The local_files_only keyword argument is explicitly removed when calling AutoConfig.from_pretrained for the subfolder configuration. This will cause issues in offline mode, as from_pretrained will attempt to connect to the Hugging Face Hub even if HF_HUB_OFFLINE is set. This could lead to failures for users operating in environments without internet access. The kwargs should be passed directly to respect the offline setting.

Suggested change

-    sub_cfg = AutoConfig.from_pretrained(
-        model,
-        trust_remote_code=trust_remote_code,
-        revision=revision,
-        code_revision=code_revision,
-        token=_get_hf_token(),
-        subfolder="llm",
-        **{k: v for k, v in kwargs.items() if k != "local_files_only"},
-    )
+    sub_cfg = AutoConfig.from_pretrained(
+        model,
+        trust_remote_code=trust_remote_code,
+        revision=revision,
+        code_revision=code_revision,
+        token=_get_hf_token(),
+        subfolder="llm",
+        **kwargs,
+    )
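
To make the failure mode concrete: callers can propagate offline mode by passing local_files_only=True through kwargs, and filtering that key out defeats the setting. A minimal illustration (assumes the model is already in the local cache; trust_remote_code omitted for brevity):

import os
from transformers import AutoConfig

os.environ["HF_HUB_OFFLINE"] = "1"

# Resolves from the local cache because local_files_only is honored;
# stripping it from kwargs would make this call attempt a Hub lookup
# and fail in an air-gapped environment.
cfg = AutoConfig.from_pretrained(
    "nvidia/omnivinci",
    subfolder="llm",
    local_files_only=True,
)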

Comment on lines +18 to +23

        # direct the default loader to read weights from the llm/ subfolder
        self.allow_patterns_overrides = [
            "llm/*.safetensors",
            "llm/consolidated*.safetensors",
            "llm/*.pt",
        ]

critical

The mechanism to load weights from a subfolder using allow_patterns_overrides seems to be implemented only in BitsAndBytesModelLoader. This means the OmniVinciForCausalLM model will likely fail to load its weights when it is not used with bitsandbytes quantization, as the default model loader is not aware of this attribute or of the logic to search the llm/ subfolder. This leaves the model support partial and broken in the common non-quantized use case. To fix this, the subfolder weight-loading logic should be added to the default model loader as well. The comment "direct the default loader" is also currently misleading.
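
A minimal sketch of the kind of fix being suggested, assuming the default loader gains a pattern-resolution step (the helper name and signature here are illustrative, not vLLM's actual API): consult the model's allow_patterns_overrides attribute before falling back to the loader's built-in glob patterns.

def resolve_allow_patterns(model, default_patterns):
    # Prefer a model-specific override (e.g. ["llm/*.safetensors", ...])
    # so subfolder layouts also work in the default, non-quantized path.
    overrides = getattr(model, "allow_patterns_overrides", None)
    return list(overrides) if overrides is not None else list(default_patterns)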

@0xrushi 0xrushi changed the title feat: Add Omnivinci model + subfolder HF config/tokenizer support feat: [DRAFT] Add Omnivinci model + subfolder HF config/tokenizer support Oct 27, 2025
@0xrushi 0xrushi changed the title feat: [DRAFT] Add Omnivinci model + subfolder HF config/tokenizer support feat: [DRAFT Ignore for now] Add Omnivinci model + subfolder HF config/tokenizer support Oct 27, 2025

Labels

new-model (Requests to new models)


Development

Successfully merging this pull request may close these issues.

[Feature]: Support nvidia/omnivinci

1 participant