Conversation

@pgmpablo157321
Contributor

Fix #419

@github-actions

github-actions bot commented Aug 28, 2025

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@pgmpablo157321 force-pushed the standalone_score_compute branch 2 times, most recently from f5cffb2 to b917d71 on August 29, 2025 19:29
@pgmpablo157321 marked this pull request as ready for review August 29, 2025 19:29
@pgmpablo157321 requested review from a team as code owners August 29, 2025 19:29
ShriyaRishab previously approved these changes Sep 4, 2025
@ShriyaRishab
Contributor

ShriyaRishab commented Sep 4, 2025

I tested it locally on a few results.

With scaling.json -

$ python3 -m mlperf_logging.result_summarizer.compute_score --benchmark llama31_405b  --system tyche_ngpu512_ngc25.04_nemo --benchmark_folder /training_results_v5.0/NVIDIA/results/tyche_ngpu512_ngc25.04_nemo/llama31_405b --usage training --ruleset 5.0.0 --scale
NOTICE: Applying scaling factor 1.1538461538461537 to dir /training_results_v5.0/NVIDIA/results/tyche_ngpu512_ngc25.04_nemo/llama31_405b
MLPerf training
Folder: /training_results_v5.0/NVIDIA/results/tyche_ngpu512_ngc25.04_nemo/llama31_405b
Version: 5.0.0
System: tyche_ngpu512_ngc25.04_nemo
Benchmark: llama31_405b
Score - Time to Train (minutes): 121.7573269230769

Without --scale, but with a scaling.json file still present in the folder from the previous run -

$ python3 -m mlperf_logging.result_summarizer.compute_score --benchmark llama31_405b  --system tyche_ngpu512_ngc25.04_nemo --benchmark_folder /training_results_v5.0/NVIDIA/results/tyche_ngpu512_ngc25.04_nemo/llama31_405b --usage training --ruleset 5.0.0
NOTICE: Applying scaling factor 1.1538461538461537 to dir /training_results_v5.0/NVIDIA/results/tyche_ngpu512_ngc25.04_nemo/llama31_405b
MLPerf training
Folder: /training_results_v5.0/NVIDIA/results/tyche_ngpu512_ngc25.04_nemo/llama31_405b
Version: 5.0.0
System: tyche_ngpu512_ngc25.04_nemo
Benchmark: llama31_405b
Score - Time to Train (minutes): 121.7573269230769

After manually deleting scaling.json -

$ python3 -m mlperf_logging.result_summarizer.compute_score --benchmark llama31_405b  --system tyche_ngpu512_ngc25.04_nemo --benchmark_folder /training_results_v5.0/NVIDIA/results/tyche_ngpu512_ngc25.04_nemo/llama31_405b --usage training --ruleset 5.0.0
ruleset 5.0.0
MLPerf training
Folder: /training_results_v5.0/NVIDIA/results/tyche_ngpu512_ngc25.04_nemo/llama31_405b
Version: 5.0.0
System: tyche_ngpu512_ngc25.04_nemo
Benchmark: llama31_405b
Score - Time to Train (minutes): 105.52301666666666

But if I don't manually delete scaling.json and run without the --scale flag, it still applies scaling automatically because there is a preexisting scaling.json file in the folder. @pgmpablo157321 - is this expected behavior, and should we add some information to the README about how to deal with the scaling.json files in the folder?
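
For reference, a minimal sketch of the behavior observed above, assuming the tool applies any factor it finds in a scaling.json left in the benchmark folder, regardless of the --scale flag (the file layout and key name are assumptions, not the actual implementation):

import json
import os

def apply_scaling_if_present(benchmark_folder, raw_ttt_minutes):
    # Hypothetical helper: a scaling.json left over from a previous --scale
    # run is picked up and applied even when --scale is not passed.
    scaling_path = os.path.join(benchmark_folder, "scaling.json")
    if not os.path.exists(scaling_path):
        return raw_ttt_minutes
    with open(scaling_path) as f:
        factor = json.load(f)["scaling_factor"]  # key name is an assumption
    print(f"NOTICE: Applying scaling factor {factor} to dir {benchmark_folder}")
    return raw_ttt_minutes * factor

# Consistent with the runs above:
# 105.52301666666666 * 1.1538461538461537 ≈ 121.7573269230769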

@ShriyaRishab
Contributor

Testing power scores -

With --has_power

$ python3 -m mlperf_logging.result_summarizer.compute_score --benchmark llama2_70b_lora  --system xyz --benchmark_folder /training_results_v5.0/Lenovo/results/SR780aV3-8xB200_SXM_180GB/llama2_70b_lora --usage training --ruleset 5.0.0 --has_power
NOTICE: Applying scaling factor 1.0188034188034187 to dir /training_results_v5.0/Lenovo/results/SR780aV3-8xB200_SXM_180GB/llama2_70b_lora
MLPerf training
Folder: /training_results_v5.0/Lenovo/results/SR780aV3-8xB200_SXM_180GB/llama2_70b_lora
Version: 5.0.0
System: xyz
Benchmark: llama2_70b_lora
Score - Time to Train (minutes): 11.324490299145298
Power Score - Energy (kJ): 6114237.986822284

Without --has_power -

$ python3 -m mlperf_logging.result_summarizer.compute_score --benchmark llama2_70b_lora  --system xyz --benchmark_folder /training_results_v5.0/Lenovo/results/SR780aV3-8xB200_SXM_180GB/llama2_70b_lora --usage training --ruleset 5.0.0
NOTICE: Applying scaling factor 1.0188034188034187 to dir /training_results_v5.0/Lenovo/results/SR780aV3-8xB200_SXM_180GB/llama2_70b_lora
MLPerf training
Folder: /training_results_v5.0/Lenovo/results/SR780aV3-8xB200_SXM_180GB/llama2_70b_lora
Version: 5.0.0
System: xyz
Benchmark: llama2_70b_lora
Score - Time to Train (minutes): 11.324490299145298

After deleting scaling.json and with --has_power -

$ python3 -m mlperf_logging.result_summarizer.compute_score --benchmark llama2_70b_lora  --system xyz --benchmark_folder training_results_v5.0/Lenovo/results/SR780aV3-8xB200_SXM_180GB/llama2_70b_lora --usage training --ruleset 5.0.0 --has_power
MLPerf training
Folder: training_results_v5.0/Lenovo/results/SR780aV3-8xB200_SXM_180GB/llama2_70b_lora
Version: 5.0.0
System: xyz
Benchmark: llama2_70b_lora
Score - Time to Train (minutes): 11.11548125
Power Score - Energy (kJ): 6001391.312568853

@ShriyaRishab
Contributor

ShriyaRishab commented Sep 4, 2025

A few more issues that need to be dealt with -

Trying to compute scores for just 1 or 2 files returns None, although it would help to print out the individual scores of each of the files in the folder -

$ ls /temp_results
result_0.txt  result_1.txt
$ python3 -m mlperf_logging.result_summarizer.compute_score --benchmark llama31_405b  --system xyz --benchmark_folder /temp_results --usage training --ruleset 5.0.0
MLPerf training
Folder: /temp_results
Version: 5.0.0
System: xyz
Benchmark: llama31_405b
Score - Time to Train (minutes): None

Changing the file names to anything other than result_*.txt also computes no scores, although this is expected.

$ ls /temp_results
0.txt  1.txt  2.txt
$ python3 -m mlperf_logging.result_summarizer.compute_score --benchmark llama31_405b  --system xyz --benchmark_folder /temp_results --usage training --ruleset 5.0.0
MLPerf training
Folder: /temp_results
Version: 5.0.0
System: xyz
Benchmark: llama31_405b
Score - Time to Train (minutes): None
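
Printing per-file scores, as suggested above, seems straightforward; a minimal sketch, assuming standard :::MLLOG result lines and that a file's score is the wall time between its run_start and run_stop events (helper name is illustrative):

import json

def time_to_train_minutes(result_file):
    # Wall time between the run_start and run_stop events, in minutes.
    times = {}
    with open(result_file) as f:
        for line in f:
            if ":::MLLOG" not in line:
                continue
            event = json.loads(line.split(":::MLLOG", 1)[1])
            if event.get("key") in ("run_start", "run_stop"):
                times[event["key"]] = event["time_ms"]
    return (times["run_stop"] - times["run_start"]) / 60000.0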

@ShriyaRishab
Contributor

@pgmpablo157321 TODO items as discussed in the training WG

  1. Always delete the scaling.json file so that scores are computed without scaling, unless --scale is passed, in which case scaling.json is created and scores are printed after scaling.
  2. When m < N log files are present, print the score per file and also add a NOTICE stating that N logs are needed but only m are provided (see the sketch below).

An additional piece for (2) would be to also print the samples to converge along with the score for each log file, so submitters get a sense of their convergence as well. Is that also something we can add?
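
For item 2, a minimal sketch of what the aggregation could look like, assuming olympic scoring here means dropping the single best and worst runs and averaging the rest, and that a plain mean is the fallback when too few runs are present (both assumptions are consistent with the sample runs posted later in this thread):

def final_score(per_file_scores, required_runs):
    # Aggregate per-file time-to-train scores (sketch, names illustrative).
    n = len(per_file_scores)
    if n < required_runs:
        print(f"WARNING: Not enough runs found for an official submission. "
              f"Found: {n}, required: {required_runs}")
        print("WARNING: Olympic scoring skipped")
        return sum(per_file_scores) / n  # assumed fallback: plain mean
    # Olympic scoring: drop the fastest and slowest runs, average the rest.
    trimmed = sorted(per_file_scores)[1:-1]
    return sum(trimmed) / len(trimmed)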

"--has_power", action="store_true", help="Compute power score as well"
)
parser.add_argument(
"--benchmark_folder",
Contributor

I'd recommend taking a list of files rather than a folder name. Then the user could specify the list of files as folder/result*.txt to get all the result*.txt files in a folder, but could also specify a single file, and could specify log files and directories that are named differently than result*.txt, like foo/bar/baz/*.log
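
For illustration, a minimal sketch of the suggested interface, using a positional nargs="+" argument so the shell can expand globs like folder/result_*.txt (the argument name is illustrative, not the actual CLI):

import argparse
import glob

parser = argparse.ArgumentParser()
parser.add_argument(
    "log_files", nargs="+",
    help="Result log files, e.g. folder/result_*.txt or foo/bar/baz/*.log",
)
args = parser.parse_args()

# For shells that pass glob patterns through unexpanded, expand them here;
# a literal path with no matches is kept as-is.
files = [path
         for pattern in args.log_files
         for path in (sorted(glob.glob(pattern)) or [pattern])]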

Contributor Author

Unfortunately, this requires significant changes in the RCP checker, particularly changing the check_directory function and its interactions:

def check_directory(dir, usage, version, verbose, bert_train_samples, rcp_file=None, rcp_pass='full_rcp', rcp_bypass=False, set_scaling=False):

Given the time to the next submission, I recommend that we postpone this change.

@pgmpablo157321
Contributor Author

The following changes were added:

  1. The scaling factor gets reset when computing the score, and is recalculated in case --scale is passed
  2. Per-file scores/results are included in the output
  3. Olympic scoring is skipped if there are fewer results than needed for a submission; a warning is raised when this happens
  4. The benchmark argument is no longer required; it is inferred from the result files
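
For change 4, a minimal sketch of how the benchmark name could be inferred, assuming the result files contain the standard :::MLLOG submission_benchmark entry (the actual implementation may differ):

import json

def infer_benchmark(result_file):
    # Scan for the submission_benchmark event, e.g.
    # :::MLLOG {"key": "submission_benchmark", "value": "rgat", ...}
    with open(result_file) as f:
        for line in f:
            if ":::MLLOG" not in line:
                continue
            event = json.loads(line.split(":::MLLOG", 1)[1])
            if event.get("key") == "submission_benchmark":
                return event["value"]
    return None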

@pgmpablo157321
Contributor Author

Sample run 1:

python -m mlperf_logging.result_summarizer.compute_score --system TEST \
    --benchmark_folder training_results_v5.0/GigaComputing/results/G893-SD1_dgl/rgat \
    --usage training --ruleset 5.0.0 --scale

Output:

NOTICE: Applying scaling factor 1.0014814814814814 to dir training_results_v5.0/GigaComputing/results/G893-SD1_dgl/rgat
INFO -------------------------------------------------------
MLPerf training
Folder: training_results_v5.0/GigaComputing/results/G893-SD1_dgl/rgat
Version: 5.0.0
System: TEST
Benchmark: rgat
-------------------------------------------------------------
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_dgl/rgat/result_9.txt: 4.88135
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_dgl/rgat/result_8.txt: 5.164983333333334
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_dgl/rgat/result_6.txt: 5.131083333333333
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_dgl/rgat/result_7.txt: 5.11245
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_dgl/rgat/result_5.txt: 5.101683333333334
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_dgl/rgat/result_4.txt: 5.379166666666667
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_dgl/rgat/result_0.txt: 4.59125
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_dgl/rgat/result_1.txt: 5.06715
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_dgl/rgat/result_3.txt: 5.142033333333334
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_dgl/rgat/result_2.txt: 4.86545
Final score - Time to Train (minutes): 5.065766654320987

Sample run 2 (after manually deleting result_3.txt):

python -m mlperf_logging.result_summarizer.compute_score --system TEST \
    --benchmark_folder training_results_v5.0/GigaComputing/results/G893-SD1_dgl/rgat \
    --usage training --ruleset 5.0.0 --scale

Output:

WARNING: Not enough runs found for an official submission. Found: 9, required: 10
NOTICE: Applying scaling factor 1.0033927056827818 to dir training_results_v5.0/GigaComputing/results/G893-SD1_dgl/rgat
INFO -------------------------------------------------------
MLPerf training
Folder: training_results_v5.0/GigaComputing/results/G893-SD1_dgl/rgat
Version: 5.0.0
System: TEST
Benchmark: rgat
-------------------------------------------------------------
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_dgl/rgat/result_9.txt: 4.88135
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_dgl/rgat/result_8.txt: 5.164983333333334
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_dgl/rgat/result_6.txt: 5.131083333333333
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_dgl/rgat/result_7.txt: 5.11245
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_dgl/rgat/result_5.txt: 5.101683333333334
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_dgl/rgat/result_4.txt: 5.379166666666667
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_dgl/rgat/result_0.txt: 4.59125
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_dgl/rgat/result_1.txt: 5.06715
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_dgl/rgat/result_2.txt: 4.86545
WARNING: Olympic scoring skipped
Final score - Time to Train (minutes): 5.049804200043979

@pgmpablo157321 force-pushed the standalone_score_compute branch from a47029e to 32e2eec on September 9, 2025 00:16
@pgmpablo157321 force-pushed the standalone_score_compute branch from 32e2eec to 105f189 on September 9, 2025 00:35
ShriyaRishab previously approved these changes Sep 9, 2025
@ShriyaRishab left a comment
Contributor

Looks great, thanks!

@pgmpablo157321
Contributor Author

Also added logging of the sample count:

NOTICE: Applying scaling factor 1.0511463844797178 to dir training_results_v5.0/GigaComputing/results/G893-SD1_pytorch/llama2_70b_lora
INFO -------------------------------------------------------
MLPerf training
Folder: training_results_v5.0/GigaComputing/results/G893-SD1_pytorch/llama2_70b_lora
Version: 5.0.0
System: TEST
Benchmark: llama2_70b_lora
-------------------------------------------------------------
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_pytorch/llama2_70b_lora/result_9.txt: 10.89865. Samples to converge: 3072
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_pytorch/llama2_70b_lora/result_8.txt: 9.451066666666668. Samples to converge: 2688
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_pytorch/llama2_70b_lora/result_6.txt: 10.892983333333333. Samples to converge: 3072
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_pytorch/llama2_70b_lora/result_7.txt: 10.90115. Samples to converge: 3072
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_pytorch/llama2_70b_lora/result_5.txt: 10.900933333333333. Samples to converge: 3072
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_pytorch/llama2_70b_lora/result_4.txt: 9.4506. Samples to converge: 2688
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_pytorch/llama2_70b_lora/result_0.txt: 10.90205. Samples to converge: 3072
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_pytorch/llama2_70b_lora/result_1.txt: 10.899316666666667. Samples to converge: 3072
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_pytorch/llama2_70b_lora/result_3.txt: 10.894383333333334. Samples to converge: 3072
Score - Time to Train (minutes) for training_results_v5.0/GigaComputing/results/G893-SD1_pytorch/llama2_70b_lora/result_2.txt: 10.89935. Samples to converge: 3072
Final score - Time to Train (minutes): 11.265376690182244
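
A rough sketch of where the extra number could come from, assuming the samples to converge are read from the metadata of the final eval_accuracy event; the exact metadata key is an assumption here and may vary by benchmark:

import json

def samples_to_converge(result_file):
    # Assumption: the last eval_accuracy event's metadata carries the number
    # of training samples processed so far (the LLM benchmarks use epoch_num
    # as a sample counter; other benchmarks may use a different key).
    samples = None
    with open(result_file) as f:
        for line in f:
            if ":::MLLOG" not in line:
                continue
            event = json.loads(line.split(":::MLLOG", 1)[1])
            if event.get("key") == "eval_accuracy":
                samples = event.get("metadata", {}).get("epoch_num", samples)
    return samples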

@pgmpablo157321 force-pushed the standalone_score_compute branch from 05e7fae to b0b2fe3 on September 9, 2025 22:25
@pgmpablo157321 merged commit 5e82c8f into master Sep 10, 2025
1 check passed
@github-actions bot locked and limited conversation to collaborators Sep 10, 2025
Successfully merging this pull request may close these issues.

Can we have a simple script to compute training scores?