Since QwQ supports a maximum of 32768 tokens, but the default max_tokens is 81920 in scripts/run_web_thinker.py, I am wondering what length is used when evaluating the QwQ model on the different benchmarks?
```shell
python scripts/run_web_thinker_report.py \
    --dataset_name glaive \
    --split test \
    --concurrent_limit 32 \
    --search_engine "serper" \
    --serper_api_key "YOUR_GOOGLE_SERPER_API" \
    --api_base_url "YOUR_API_BASE_URL" \
    --model_name "QwQ-32B" \
    --aux_api_base_url "YOUR_AUX_API_BASE_URL" \
    --aux_model_name "Qwen2.5-32B-Instruct" \
    --tokenizer_path "PATH_TO_YOUR_TOKENIZER" \
    --aux_tokenizer_path "PATH_TO_YOUR_AUX_TOKENIZER"
```
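For context, here is a minimal sketch of how a caller could clamp the script's default 81920 max_tokens down to QwQ's 32768-token context limit before issuing a request. The function name and logic are illustrative assumptions, not taken from the repository:

```python
# QwQ's maximum context length, per the question above.
MODEL_CONTEXT_LIMIT = 32768

def effective_max_tokens(requested: int, prompt_tokens: int = 0) -> int:
    """Clamp a requested generation budget so that prompt plus
    completion never exceed the model's context window."""
    available = MODEL_CONTEXT_LIMIT - prompt_tokens
    return max(0, min(requested, available))

# The default 81920 would be reduced to 32768 for an empty prompt,
# and shrinks further as the prompt consumes context.
print(effective_max_tokens(81920))                      # 32768
print(effective_max_tokens(81920, prompt_tokens=2768))  # 30000
```

If the script does not do something like this internally, requests with max_tokens above the model limit may be rejected or silently truncated by the serving backend.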