Add Finetuning LLMs on CPUs Blog Post #1748
Conversation
Force-pushed 803329f to c6b73a6
Force-pushed c6b73a6 to b3e28fb
MarcusSorealheis
left a comment
I'd love to see an explanation of how work is distributed or accelerated by NativeLink. My guess is the scheduler, but you should make that clear to the reader in a subsequent iteration. For now, you can ship this one.
Reviewable status: 0 of 2 LGTMs obtained, and 0 of 3 files reviewed, and 1 discussions need to be resolved
MarcusSorealheis
left a comment
Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: 1 of 2 LGTMs obtained, and 2 of 3 files reviewed, and 1 discussions need to be resolved
aaronmondal
left a comment
Reviewed 1 of 3 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: 1 of 2 LGTMs obtained, and all files reviewed, and 14 discussions need to be resolved
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 37 at r2 (raw file):
1\. Create a directory called `finetune-repo` and open your IDE from within it. Also create a `setup.sh` script. Alternatively, you can use the following shell commands.
All of these setup instructions seem redundant. We can cut down on reading and setup time significantly if we create a repo for this that the user can clone. Then the setup becomes a single `git clone github:TraceMachina/finetune-repo`.
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 132 at r2 (raw file):
# # This file is autogenerated by pip-compile via the following command: # pip-compile --output-file=bazel_requirements_lock.txt requirements.txt
If you use pip-compile, run it with `--generate-hashes`. The better approach is to use uv, which is about 7x faster than pip. Use pyproject.toml to specify the versions and create the lockfile with `uv lock`.
See also: bazel-contrib/rules_python#1975
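As a sketch, the uv-based flow suggested above might look like the following pyproject.toml (the package names and versions here are illustrative placeholders, not taken from the post):

```toml
# pyproject.toml (illustrative; pin the versions the post actually uses)
[project]
name = "finetune-repo"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    "torch==2.2.2",
    "transformers==4.40.0",
]

# Then generate the lockfile (includes hashes) with:
#   uv lock
```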
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 335 at r2 (raw file):
### Prologue - Docker Image For Remote Execution By default, NativeLink provides a minimal Ubuntu 22.04 image *without any* dependencies installed for remote execution. For this project, we created our own custom publicly-accessible docker image (Docker Hub reference: `container-image=docker://docker.io/evaanahmed2001/python-bazel-env:amd64-v2`) and we've made this free to use.
Use the sha256 when pulling container images from private sources to protect against compromised container registries. In this case it's docker.io/evaanahmed2001/python-bazel-env@sha256:8de13199d587964b218c0b671272b42031cf4944b2f426e6eee7d7542802bf7c as displayed on this page:
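A digest-pinned reference in a Bazel platform definition could look roughly like this (the attribute layout is a sketch based on typical remote-execution setups, not necessarily the post's exact config):

```python
# BUILD (sketch): pin the execution image by digest rather than by tag,
# so a repointed tag in the registry cannot silently change the image.
platform(
    name = "remote_platform",
    exec_properties = {
        "container-image": "docker://docker.io/evaanahmed2001/python-bazel-env@sha256:8de13199d587964b218c0b671272b42031cf4944b2f426e6eee7d7542802bf7c",
    },
)
```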
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 378 at r2 (raw file):
## **Important Note About Bazel’s Remote Execution Support** Bazel supports remote execution for building (compiling) and testing only, **NOT** for running the built executables. To use remote servers for remote build execution via Bazel, a roundabout way is to design tests that just execute the Python binary we want to run. If the binary terminates without any errors, the test passes.
This is incorrect. Bazel can run executables. What you probably meant to say here is that `bazel run` uses the host platform as its target platform, meaning that the executable will be invoked on whichever platform is marked as host. You could work around this by wrapping the binary in a test, since `bazel test` runs on the target platform.
See also: bazelbuild/bazel#21805
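A minimal sketch of that workaround (target names and the wrapper file are hypothetical):

```python
# BUILD (sketch): wrap the training entry point in a test so that it
# executes on the target (remote) platform under `bazel test`.
py_test(
    name = "train_remote_test",
    srcs = ["train_wrapper.py"],  # hypothetical thin wrapper calling main()
    main = "train_wrapper.py",
    deps = [":train_lib"],        # hypothetical library target
)
```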
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 384 at r2 (raw file):
1\. `py_test` uses the `pytest` package to create tests (convenient, user-friendly) 2\. `sh_test` involves coming up with a shell script that executes the test file (old-school but functional and robust)
There is also native_test from bazel-skylib: https://github.com/bazelbuild/bazel-skylib/blob/454b25912a8ddf3d90eb47f25260befd5ee274a8/rules/native_binary.bzl#L88
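For illustration, `native_test` can wrap an existing binary without any extra wrapper code (the target names below are hypothetical):

```python
# BUILD (sketch): native_test from bazel-skylib turns a *_binary
# into a test target directly.
load("@bazel_skylib//rules:native_binary.bzl", "native_test")

native_test(
    name = "finetune_smoke_test",
    src = ":finetune",  # hypothetical py_binary from the post
    out = "finetune_smoke_test",
)
```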
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 392 at r2 (raw file):
**DRAWBACK:** Print statements that track progress (like "Starting to fine tune model" or "Exiting this function") behave differently in remote execution. Unlike local runs where these appear in real-time, remote testing collects all logs on the server and only displays them after test completion when control returns to the local machine. The expected output is still fully preserved - just delayed until the process finishes running.
I believe you can control this via flags (i.e. --test_output=all): https://bazel.build/reference/command-line-reference#build-flag--test_output
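As a command-line sketch (run inside the project's Bazel workspace):

```shell
# Show full test logs after completion:
bazel test //... --test_output=all

# Or stream logs as they are produced; note that streamed
# output forces tests to run serially.
bazel test //... --test_output=streamed
```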
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 427 at r2 (raw file):
If you use a different model, you may need to adjust the output shape """ print("\nEvaluating model accuracy on test set...")
Don't recommend the use of print for "production examples". Use logger instead. See: https://docs.astral.sh/ruff/rules/print/
For bazel specifically this is relevant as you might want to use environment variables in your build to control log output.
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 464 at r2 (raw file):
false_negatives = np.sum((predicted_classes == 0) & (labels == 1)) print(f"\nConfusion Matrix (calculated with NumPy):")
Use multiline strings for subsequent log statements.
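For example, the four confusion-matrix log calls could be collapsed into one multiline f-string (the counts below are placeholder values, not results from the post):

```python
# Placeholder values standing in for the NumPy-computed counts.
true_positives, false_positives = 90, 5
false_negatives, true_negatives = 10, 85

# One multiline f-string instead of several separate print/log calls.
summary = (
    "Confusion Matrix (calculated with NumPy):\n"
    f"  TP: {true_positives}   FP: {false_positives}\n"
    f"  FN: {false_negatives}   TN: {true_negatives}"
)
print(summary)
```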
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 506 at r2 (raw file):
output_dir = os.path.expanduser(output_dir) # model_dir = os.path.join(output_dir, model_name.replace("/", "_"))
Remove comment
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 521 at r2 (raw file):
total_training_samples = len(dataset["train"]) print(f"Total training samples: {total_training_samples}") # Reserve the last 1000 samples for testing
nit: potentially missing newline above
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 693 at r2 (raw file):
```bash #!/bin/bash
Use `#!/usr/bin/env bash` for POSIX compliance. If you need to deviate from this, add a comment to explain why that's necessary.
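A minimal sketch of the suggested script header (the echoed message is a placeholder):

```shell
#!/usr/bin/env bash
# `env` resolves bash from PATH, so the script also works where bash
# is not installed at /bin/bash (e.g. NixOS, BSDs, Homebrew bash).
set -eu
echo "wrapper: starting training"
```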
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 736 at r2 (raw file):
) sh_test(
The builtin shell rules are deprecated. Use rules_shell instead: https://github.com/bazelbuild/rules_shell
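Migrating to rules_shell is roughly a two-line change (the version number below is illustrative; check the rules_shell releases for the current one):

```python
# MODULE.bazel (sketch)
bazel_dep(name = "rules_shell", version = "0.4.0")

# BUILD (sketch): load sh_test from rules_shell instead of
# relying on the deprecated builtin rule.
load("@rules_shell//shell:sh_test.bzl", "sh_test")

sh_test(
    name = "run_training_test",
    srcs = ["run_training.sh"],
)
```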
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 985 at r2 (raw file):
<div style="max-height: 250px; max-width: 50vw; overflow: auto;"> ```bash
Duplicating this kind of logic is not scalable and not a recommended practice. Instead, use the intended rules for this, as you don't need any functionality specific to this script compared to the run_training.sh script.
Head branch was pushed to by a user without write access
Force-pushed b3e28fb to d864e5f
Force-pushed d496821 to a9ee9bb
Force-pushed a9ee9bb to 7d54526
aaronmondal
left a comment
nit: The layout seems inconsistent. I.e., why do we need `1\.` instead of just `1.`, and why do we use bold fonts in the titles when they're already titles?
Reviewed 1 of 2 files at r3, 2 of 2 files at r5, all commit messages.
Reviewable status: 1 of 2 LGTMs obtained, and all files reviewed, and pending CI: Bazel Dev / macos-15, Cargo Dev / macos-15, Cargo Dev / ubuntu-24.04, Installation / macos-14, Installation / macos-15, Local / lre-rs / macos-15, NativeLink.com Cloud / Remote Cache / macos-15, Publish image, Publish nativelink-worker-init, Remote / lre-cc / xlarge-ubuntu-24.04, Remote / lre-rs / xlarge-ubuntu-24.04, Web Platform Deployment / macos-15, buildstream, windows-2022 / stable, and 10 discussions need to be resolved
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 26 at r5 (raw file):
### Prerequisites 1\. Bazel ([installation instructions](https://bazel.build/install)). This demo uses Bazel 8.1.1
nit: The demo uses Bazel 8.2.1. A more future-proof way in general might be to just use "1. A recent version of Bazel ([installation...)".
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 50 at r5 (raw file):
2\. `requirements.lock` - Ensures consistent Python dependencies across all environments<br> 3\. `.bazelrc` - Main Bazel configuration file setting global options for hermetic builds and remote execution<br> 4\. `MODULE.bazel` - Configures the project as a Bazel module, tells Bazel we'll need python, pip and CPU-only PyTorch, and manages external dependencies<br>
nit:
Suggestion:
Python, `pip`
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 75 at r5 (raw file):
### The NativeLink Difference:
nit: Headers probably shouldn't end in a colon.
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 77 at r5 (raw file):
### The NativeLink Difference: To demonstrate NativeLink’s efficacy, consistency, and reliability, we ran the same fine-tuning job on the CPU of an M1 Pro MacBook Pro, the free version of Google Colab on CPU, and [NativeLink](https://github.com/TraceMachina/nativelink),which is free and open-source. We executed the fine-tuning task 5 times and this is what we observed:
Suggestion:
, w
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 98 at r5 (raw file):
For forward-thinking AI teams, this infrastructure stack represents a shift from the "bigger is better" hardware arms race toward thoughtful resource utilization. The competitive advantage increasingly belongs to those who can extract maximum value from available compute rather than those who deploy more powerful hardware.
nit: Duplicate newline
web/platform/src/content/docs/docs/config/production-config.mdx line 18 at r5 (raw file):
To run NativeLink, you just pass the path to a single JSON5 Configuration file, such as: ```bash
nit: This is a nice fix, but unrelated to this PR. Consider pulling it out into a separate mini fix PR.
.github/styles/config/vocabularies/TraceMachina/accept.txt line 19 at r5 (raw file):
[Hh]ermeticity JDK json
This shouldn't be in here. JSON should always be written in capital letters. If you need the file ending `.json`, the corresponding text should be in backticks.
.github/styles/config/vocabularies/TraceMachina/accept.txt line 38 at r5 (raw file):
OSSF Reclient [Rr]osetta
Rosetta should always be capitalized.
.github/styles/config/vocabularies/TraceMachina/accept.txt line 39 at r5 (raw file):
Reclient [Rr]osetta [Rr]epo
Write out repository instead.
Evaan2001
left a comment
Without the backslash in `1\.`, the `1.` doesn't show in `bun preview` (screenshot attached)
I'll remove the bold from the titles (just thought it looked better).
Reviewable status: 1 of 2 LGTMs obtained, and all files reviewed, and 10 discussions need to be resolved
web/platform/src/content/docs/docs/config/production-config.mdx line 18 at r5 (raw file):
Previously, aaronmondal (Aaron Siddhartha Mondal) wrote…
nit: This is a nice fix, but unrelated to this PR. Consider pulling it out into a separate mini fix PR.
This was causing vale-related errors barring me from committing changes.
7d54526 to
92074b9
Compare
aaronmondal
left a comment
Reviewed 2 of 2 files at r6, all commit messages.
Dismissed @MarcusSorealheis from a discussion.
Reviewable status: 2 of 2 LGTMs obtained, and all files reviewed, and pending CI: Cargo Dev / macos-15, Local / lre-rs / macos-15, NativeLink.com Cloud / Remote Cache / macos-15, buildstream, and 1 discussions need to be resolved
aaronmondal
left a comment
Reviewable status: complete! 2 of 2 LGTMs obtained, and all files reviewed

Description
This PR:
Type of change
How Has This Been Tested?
The content and design of this web page are heavily inspired by the web page I made for my full AI Customer Service Agent blog post, which has already received LGTMs from Marcus and Aaron, though the latter has not yet been added to production. Once the page development was complete and the page was examined via `rm -rd dist && bun preview`, everything looked as expected, and no errors were thrown.