1 file changed: +0 -23 lines changed
@@ -7,18 +7,6 @@
 pullPolicy: Never
 workflow:
 completed: 10
-# What we could try:
-# Having a change to epochs
-# Stopping the workflow when we reach a certain threshold accuracy, etc.
-# events:
-# Custom metric - derived from parsing lammps log
-# Max decreases down to 1 (default), with 3 breaks in between, and 3 times
-# - metric: mean.lammps.lammps-walltime
-# when: "<= 10"
-# action: shrink
-# repetitions: 3
-# backoff: 3
-# minSize: 1

 cluster:
 maxSize: 4
@@ -40,14 +28,3 @@
 PYTHONUNBUFFERED: "0"
 epochs: "1"
 script: torchrun --rdzv_id=123 --nnodes=${nodes} --nproc_per_node=1 --master_addr=${jobname}-jobset-0-0.${jobname}.default.svc.cluster.local --master_port=$MASTER_PORT --node_rank=$RANK mnist.py --epochs=${epochs} --log-interval=1
-
-# Event parsing. Assume for a log for now
-# events:
-# script: |
-# def parse_log(log):
-# import re
-# match = re.search('Total wall time: (?P<walltime>.*)', log)
-# walltime = match.groupdict()['walltime']
-# hours, minutes, seconds = walltime.split(':')
-# walltime = (float(hours) * 60 * 60) + (float(minutes) * 60) + (float(seconds))
-# return {"lammps-walltime": walltime}
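
For reference, the removed commented-out `parse_log` helper can be sketched as standalone Python. This is a minimal sketch, not the project's implementation. It assumes the log contains a LAMMPS-style line such as `Total wall time: 0:01:23`. The stricter `H:MM:SS` pattern and the `None` return on a missing line are assumptions added here; the removed version matched `.*` and would raise on a log with no wall time line.

```python
import re
from typing import Dict, Optional

# Assumption: wall time is printed as H:MM:SS (optionally with fractional seconds).
# The removed snippet used '.*' for the captured group instead.
WALLTIME_PATTERN = re.compile(
    r"Total wall time: (?P<walltime>\d+:\d+:\d+(?:\.\d+)?)"
)


def parse_log(log: str) -> Optional[Dict[str, float]]:
    """Return the wall time in seconds under the key 'lammps-walltime'.

    Returns None when no wall time line is found (an assumption of this
    sketch; the removed version did not handle that case).
    """
    match = WALLTIME_PATTERN.search(log)
    if match is None:
        return None
    hours, minutes, seconds = match.group("walltime").split(":")
    walltime = float(hours) * 60 * 60 + float(minutes) * 60 + float(seconds)
    return {"lammps-walltime": walltime}
```

The returned key matches the last component of the removed metric name `mean.lammps.lammps-walltime`; how the metric name is assembled from that key is not shown in this diff.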