Unnecessary Download of Model Weights using Rclone during Performance Test (Existing Weights Already Available) #640

@vinaykagithapu

Description

When running the MLPerf inference benchmark with pre-downloaded model weights, the mlcr run-mlperf script still attempts to download the weights with rclone instead of using the provided local path. As a result, the run fails unnecessarily whenever the remote storage is unavailable.


Steps to Reproduce

  1. Set environment variables:
export MLC_SCRIPT_EXTRA_CMD="--adr.python.name=mlperf"
export HF_TOKEN="huggingface-api-access-token"
export CHECKPOINT_PATH="/work/models/llama31_8b"
export DATASET_PATH="/work/dataset/cnndm"
  2. Download model weights & dataset (pre-downloaded successfully):
mlcr get,ml-model,llama3,_meta-llama/Llama-3.1-8B-Instruct,_hf \
  --outdirname=${CHECKPOINT_PATH} --hf_token=${HF_TOKEN} -j --skip_system_deps

mlcr get,dataset,cnndm,_validation,_datacenter,_llama3,_mlc,_r2-downloader \
  --outdirname=$DATASET_PATH -j --skip_system_deps
  3. Run benchmark with local paths:
mlcr run-mlperf,inference,_r5.1-dev,_performance-only \
    --model=llama3_1-8b \
    --use_model_from_host=${CHECKPOINT_PATH}/repo \
    --use_dataset_from_host=${DATASET_PATH}/llama3-1-8b-cnn-eval.uri/cnn_eval.json \
    --implementation=reference \
    --framework=vllm \
    --vllm_tp_size=8 \
    --category=datacenter \
    --scenario=Offline \
    --execution_mode=valid \
    --device=cuda \
    --threads=8 \
    --quiet \
    --precision=bfloat16 \
    --test_query_count=100 \
    --batch_size=16 \
    --results_dir=/work/result18 \
    --hf_token $HF_TOKEN \
    --skip_system_deps=True

Expected Behavior

If --use_model_from_host is specified and the weights already exist locally, the benchmark should skip the rclone download and use the local path directly.


Actual Behavior

The script always triggers rclone sync:

Downloading: rclone sync 'mlc-llama3-1:inference/Llama-3.1-8b-Instruct' ...
2025/09/18 09:29:47 ERROR : Google drive root 'inference/Llama-3.1-8b-Instruct': directory not found
...
mlc.script_action.ScriptExecutionError: Script run execution failed. Error : MLC script failed (name = download-file, return code = 768)

This happens even though the model weights are already present in ${CHECKPOINT_PATH}/repo.
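For anyone reading the log above: the return code 768 looks like a raw POSIX wait status rather than a plain exit code. The sketch below decodes it under that assumption (I have not confirmed how mlcflow obtains the value; `decode_wait_status` is a hypothetical helper, not part of mlcflow). The exit-code meanings are a small subset of those listed in the rclone documentation.

```python
def decode_wait_status(status: int) -> tuple[str, int]:
    """Split a raw POSIX wait status into (kind, value).

    Assumption: `status` is the 16-bit value that os.system() returns on
    Unix, where the low 7 bits hold a terminating signal number and the
    high byte holds the exit code.
    """
    signal_number = status & 0x7F
    if signal_number:
        return ("signal", signal_number)
    return ("exit", (status >> 8) & 0xFF)


# Subset of rclone's documented exit codes, for illustration only.
RCLONE_EXIT_CODES = {
    0: "success",
    1: "syntax or usage error",
    2: "error not otherwise categorised",
    3: "directory not found",
    4: "file not found",
}

kind, value = decode_wait_status(768)
print(kind, value, RCLONE_EXIT_CODES.get(value, "see rclone docs"))
```

If the assumption holds, 768 corresponds to exit code 3, which matches the "directory not found" message printed by rclone just before the failure.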


Environment

  • mlcflow version: 1.1.1
  • Python version: 3.12
  • Framework: vLLM
  • GPU: 8x H100
  • OS/Distro: Ubuntu 24.04.1 LTS

Suggested Fix

  • Add a skip-download mechanism when --use_model_from_host is passed.
  • Check local path existence before attempting rclone sync.
  • Alternatively, provide a dedicated flag like --disable_remote_download.
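The suggestions above could be combined into a single guard that runs before the download step. This is only a minimal sketch of the idea, not the actual mlcflow code: `needs_remote_download` and both of its parameters are hypothetical names, and "usable local checkpoint" is assumed here to mean an existing, non-empty directory.

```python
import tempfile
from pathlib import Path
from typing import Optional


def needs_remote_download(host_model_path: Optional[str],
                          disable_remote_download: bool = False) -> bool:
    """Return True only when a remote fetch (e.g. rclone sync) is required.

    host_model_path: hypothetical stand-in for the --use_model_from_host value.
    disable_remote_download: hypothetical stand-in for the proposed flag.
    """
    if disable_remote_download:
        # The user explicitly opted out of any remote access.
        return False
    if not host_model_path:
        # No local path was given, so the weights must be fetched.
        return True
    path = Path(host_model_path)
    # Assumption: an existing, non-empty directory counts as a usable checkpoint.
    return not (path.is_dir() and any(path.iterdir()))


# Small demonstration with a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    print(needs_remote_download(tmp))  # empty directory: True
    (Path(tmp) / "config.json").write_text("{}")
    print(needs_remote_download(tmp))  # directory now has a file: False
```

A stricter check (for example, verifying specific files in the checkpoint) might be preferable in practice; the non-empty test is just the simplest condition that would have avoided the rclone call in this report.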
