Skip to content

Conversation

@ChrisPaulBennett
Copy link
Contributor

@ChrisPaulBennett ChrisPaulBennett commented Mar 12, 2025

This apart of 3 pull requests for adding CPU time and Max RSS analysis to the Cylc UI.

This adds the Max RSS and CPU time (as measured by cgroups) to the table view, box plot and time series views.

This adds a python profiler script. This profiler will will be ran by cylc in the same crgroup as the cylc task. It will periodically poll cgroups and save data to a file. Cylc will then store these values in the sql db file.

Linked to;
cylc/cylc-ui#2100
cylc/cylc-uiserver#675

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • Changelog entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@ChrisPaulBennett ChrisPaulBennett marked this pull request as draft March 12, 2025 09:19
Copy link
Member

@oliver-sanders oliver-sanders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

Changed the name of the profiler module.
Linting

Profiler sends KB instead of bytes

Time Series now working

CPU/Memory Logging working
@ChrisPaulBennett ChrisPaulBennett force-pushed the cylc_profiler branch 2 times, most recently from 4a6dbe8 to 30a7bb0 Compare April 1, 2025 15:31
ChrisPaulBennett and others added 19 commits April 2, 2025 09:34
Initial profiler implementation (non working)

Changed the name of the profiler module.
Linting

Profiler sends KB instead of bytes

Time Series now working

CPU/Memory Logging working

Adding profiler unit tests

updating tests

Fail gracefully if cgroups cannot be found

Revert "Fail gracefully if cgroups cannot be found"

This reverts commit 92e1e11c9b392b4742501d399f191f590814e95e.

Linting

Modifying unit tests
Linting
Changed the name of the profiler module.

Profiler sends KB instead of bytes

Time Series now working
* The COPYING file appears to have moved from `dist-info/COPYING` into
  `dist-info/licenses/COPYING`.
* Attempt to fix flaky test.
* Cut out shell profile files to omit some spurious stderr.
* Attempt to fix flaky test.
* Cut out shell profile files to omit some spurious stderr.
Revert "Fail gracefully if cgroups cannot be found"

This reverts commit 92e1e11c9b392b4742501d399f191f590814e95e.

Adding profiler unit tests

updating tests
@oliver-sanders
Copy link
Member

oliver-sanders commented Oct 21, 2025

We seem to have agreement on the validity cgroups field(s) we're polling, we should be able to push forward with this quickly now.

Last couple of outstanding comments:

  • Remove global variables.
  • Backup the profiler output to the job.status file.
    • This PR is using the cylc.flow.send_messages interface to send the message back to the scheduler.
    • However, we should probably be using the record_messages interface (MB I think, sry).
    • This interface will additionally write the message to the job.status file.
    • This way, the message will not be lost if the scheduler is stopped at the time the message is sent, or if a network issue prevents the transmission of the message.
    • Note, this interface does not presently support a comms_timeout option. This isn't a biggie, we can drop this.

Then we can get Dave to ok the cgroup stuff and we're away...

@ChrisPaulBennett
Copy link
Contributor Author

ChrisPaulBennett commented Oct 21, 2025

* Pass the `Process` object into the `stop_profiler` method.

How do I do that? I couldn't see a way to do it. Since I'm not calling the function, its the registered function for sigkill
The use of globals is the only way I could see to get around. I'd love to get rid of them.

*EDIT. Sorry, I didn't see the pull request, I'll go through it now

Comment on lines 175 to 183
try:
# Get the cgroup information for the current process
with open('/proc/' + str(pid) + '/cgroup', 'r') as f:
result = f.read()
result = PID_REGEX.search(result).group()
return result
except FileNotFoundError as err:
raise FileNotFoundError(
'/proc/' + str(pid) + '/cgroup not found') from err
Copy link
Member

@oliver-sanders oliver-sanders Nov 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is catching a FileNotFoundError and raising a near identical FileNotFoundError in it's place.

I think the intention was to replace a scary looking traceback with a more informative error. If so, try this out (note you'll likely need to import cylc.exceptions.CylcError first):

     try:
         # Get the cgroup information for the current process
         with open('/proc/' + str(pid) + '/cgroup', 'r') as f:
             result = f.read()
         result = PID_REGEX.search(result).group()
         return result
     except FileNotFoundError as err:
-        raise FileNotFoundError(
-            '/proc/' + str(pid) + '/cgroup not found') from err
+        raise CylcError('CGroup file not found: {err}') from None
  • CylcError is the class that (almost) all Cylc exceptions inherit from.
  • tldr; If you want a short clean error message, use CylcError or a subclass of it. If you want a scary traceback, use a plain exception.
  • CylcErrors get special treatment, the str(exc) gets written to stderr in red text. The traceback is not displayed unless running with --debug.
  • Note the from None hides the parent exception, preventing it from appearing in the traceback.

Copy link
Contributor

@dpmatthews dpmatthews left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cgroups usage looks sensible

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants