Add Finetuning LLMs on CPUs Blog Post #1748
Conversation
Force-pushed 803329f to c6b73a6
Force-pushed c6b73a6 to b3e28fb
MarcusSorealheis
left a comment
I'd love to see an explanation of how work is distributed or accelerated by NativeLink. My guess is the scheduler, but you should make that clear to the reader in a subsequent iteration. For now, you can ship this one.
Reviewable status: 0 of 2 LGTMs obtained, and 0 of 3 files reviewed, and 1 discussions need to be resolved
MarcusSorealheis
left a comment
Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: 1 of 2 LGTMs obtained, and 2 of 3 files reviewed, and 1 discussions need to be resolved
aaronmondal
left a comment
Reviewed 1 of 3 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: 1 of 2 LGTMs obtained, and all files reviewed, and 14 discussions need to be resolved
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 37 at r2 (raw file):
1\. Create a directory called `finetune-repo` and open your IDE from within it. Also create a `setup.sh` script. Alternatively, you can use the following shell commands.
All of these setup instructions seem redundant. We can cut down on reading and setup time significantly if we create a repo for this that the user can clone. Then the setup becomes a single `git clone github:TraceMachina/finetune-repo`.
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 132 at r2 (raw file):
# # This file is autogenerated by pip-compile via the following command: # pip-compile --output-file=bazel_requirements_lock.txt requirements.txt
If you use pip-compile, run it with `--generate-hashes`. The better approach is to use uv, which is about 7x faster than pip. Use pyproject.toml to specify the versions and create the lockfile with `uv lock`.
See also: bazel-contrib/rules_python#1975
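As a sketch, the uv-based flow suggested above might look like the following pyproject.toml (the package names and versions here are illustrative placeholders, not taken from the post):

```toml
# pyproject.toml (illustrative; pin the versions the post actually uses)
[project]
name = "finetune-repo"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    "torch==2.2.2",
    "transformers==4.40.0",
]

# Then generate the lockfile (includes hashes) with:
#   uv lock
```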
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 335 at r2 (raw file):
### Prologue - Docker Image For Remote Execution By default, NativeLink provides a minimal Ubuntu 22.04 image *without any* dependencies installed for remote execution. For this project, we created our own custom publicly-accessible docker image (Docker Hub reference: `container-image=docker://docker.io/evaanahmed2001/python-bazel-env:amd64-v2`) and we've made this free to use.
Use the sha256 when pulling container images from private sources to protect against compromised container registries. In this case it's docker.io/evaanahmed2001/python-bazel-env@sha256:8de13199d587964b218c0b671272b42031cf4944b2f426e6eee7d7542802bf7c as displayed on this page:
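A digest-pinned reference in a Bazel platform definition could look roughly like this (the attribute layout is a sketch based on typical remote-execution setups, not necessarily the post's exact config):

```python
# BUILD (sketch): pin the execution image by digest rather than by tag,
# so a repointed tag in the registry cannot silently change the image.
platform(
    name = "remote_platform",
    exec_properties = {
        "container-image": "docker://docker.io/evaanahmed2001/python-bazel-env@sha256:8de13199d587964b218c0b671272b42031cf4944b2f426e6eee7d7542802bf7c",
    },
)
```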
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 378 at r2 (raw file):
## **Important Note About Bazel’s Remote Execution Support** Bazel supports remote execution for building (compiling) and testing only, **NOT** for running the built executables. To use remote servers for remote build execution via Bazel, a roundabout way is to design tests that just execute the Python binary we want to run. If the binary terminates without any errors, the test passes.
This is incorrect. Bazel can run executables. What you probably meant to say here is that `bazel run` uses the host platform as its target platform, meaning that the executable will be invoked on whichever platform is marked as host. You could work around this by wrapping the binary in a test, since `bazel test` runs on the target platform.
See also: bazelbuild/bazel#21805
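A minimal sketch of that workaround (target names and the wrapper file are hypothetical):

```python
# BUILD (sketch): wrap the training entry point in a test so that it
# executes on the target (remote) platform under `bazel test`.
py_test(
    name = "train_remote_test",
    srcs = ["train_wrapper.py"],  # hypothetical thin wrapper calling main()
    main = "train_wrapper.py",
    deps = [":train_lib"],        # hypothetical library target
)
```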
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 384 at r2 (raw file):
1\. `py_test` uses the `pytest` package to create tests (convenient, user-friendly) 2\. `sh_test` involves coming up with a shell script that executes the test file (old-school but functional and robust)
There is also native_test from bazel-skylib: https://github.com/bazelbuild/bazel-skylib/blob/454b25912a8ddf3d90eb47f25260befd5ee274a8/rules/native_binary.bzl#L88
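For illustration, `native_test` can wrap an existing binary without any extra wrapper code (the target names below are hypothetical):

```python
# BUILD (sketch): native_test from bazel-skylib turns a *_binary
# into a test target directly.
load("@bazel_skylib//rules:native_binary.bzl", "native_test")

native_test(
    name = "finetune_smoke_test",
    src = ":finetune",  # hypothetical py_binary from the post
    out = "finetune_smoke_test",
)
```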
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 392 at r2 (raw file):
**DRAWBACK:** Print statements that track progress (like "Starting to fine tune model" or "Exiting this function") behave differently in remote execution. Unlike local runs where these appear in real-time, remote testing collects all logs on the server and only displays them after test completion when control returns to the local machine. The expected output is still fully preserved - just delayed until the process finishes running.
I believe you can control this via flags (i.e. --test_output=all): https://bazel.build/reference/command-line-reference#build-flag--test_output
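As a command-line sketch (run inside the project's Bazel workspace):

```shell
# Show full test logs after completion:
bazel test //... --test_output=all

# Or stream logs as they are produced; note that streamed
# output forces tests to run serially.
bazel test //... --test_output=streamed
```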
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 427 at r2 (raw file):
If you use a different model, you may need to adjust the output shape """ print("\nEvaluating model accuracy on test set...")
Don't recommend the use of print for "production examples". Use logger instead. See: https://docs.astral.sh/ruff/rules/print/
For bazel specifically this is relevant as you might want to use environment variables in your build to control log output.
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 464 at r2 (raw file):
false_negatives = np.sum((predicted_classes == 0) & (labels == 1)) print(f"\nConfusion Matrix (calculated with NumPy):")
Use multiline strings for subsequent log statements.
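For example, the four confusion-matrix log calls could be collapsed into one multiline f-string (the counts below are placeholder values, not results from the post):

```python
# Placeholder values standing in for the NumPy-computed counts.
true_positives, false_positives = 90, 5
false_negatives, true_negatives = 10, 85

# One multiline f-string instead of several separate print/log calls.
summary = (
    "Confusion Matrix (calculated with NumPy):\n"
    f"  TP: {true_positives}   FP: {false_positives}\n"
    f"  FN: {false_negatives}   TN: {true_negatives}"
)
print(summary)
```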
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 506 at r2 (raw file):
output_dir = os.path.expanduser(output_dir) # model_dir = os.path.join(output_dir, model_name.replace("/", "_"))
Remove comment
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 521 at r2 (raw file):
total_training_samples = len(dataset["train"]) print(f"Total training samples: {total_training_samples}") # Reserve the last 1000 samples for testing
nit: potentially missing newline above
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 693 at r2 (raw file):
```bash #!/bin/bash
Use `#!/usr/bin/env bash` for POSIX compliance. If you need to deviate from this, add a comment to explain why that's necessary.
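A minimal sketch of the suggested script header (the echoed message is a placeholder):

```shell
#!/usr/bin/env bash
# `env` resolves bash from PATH, so the script also works where bash
# is not installed at /bin/bash (e.g. NixOS, BSDs, Homebrew bash).
set -eu
echo "wrapper: starting training"
```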
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 736 at r2 (raw file):
) sh_test(
The builtin shell rules are deprecated. Use rules_shell instead: https://github.com/bazelbuild/rules_shell
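Migrating to rules_shell is roughly a two-line change (the version number below is illustrative; check the rules_shell releases for the current one):

```python
# MODULE.bazel (sketch)
bazel_dep(name = "rules_shell", version = "0.4.0")

# BUILD (sketch): load sh_test from rules_shell instead of
# relying on the deprecated builtin rule.
load("@rules_shell//shell:sh_test.bzl", "sh_test")

sh_test(
    name = "run_training_test",
    srcs = ["run_training.sh"],
)
```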
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 985 at r2 (raw file):
<div style="max-height: 250px; max-width: 50vw; overflow: auto;"> ```bash
Duplicating this kind of logic is not scalable and not a recommended practice. Instead, use the intended rules for this, as you don't need any functionality specific to this script compared to the run_training.sh script.
Head branch was pushed to by a user without write access
Force-pushed b3e28fb to d864e5f
Force-pushed d496821 to a9ee9bb
Force-pushed a9ee9bb to 7d54526
aaronmondal
left a comment
nit: The layout seems inconsistent. I.e., why do we need `1\.` instead of just `1.`, and why do we use bold fonts in the titles when they're already titles?
Reviewed 1 of 2 files at r3, 2 of 2 files at r5, all commit messages.
Reviewable status: 1 of 2 LGTMs obtained, and all files reviewed, and pending CI: Bazel Dev / macos-15, Cargo Dev / macos-15, Cargo Dev / ubuntu-24.04, Installation / macos-14, Installation / macos-15, Local / lre-rs / macos-15, NativeLink.com Cloud / Remote Cache / macos-15, Publish image, Publish nativelink-worker-init, Remote / lre-cc / xlarge-ubuntu-24.04, Remote / lre-rs / xlarge-ubuntu-24.04, Web Platform Deployment / macos-15, buildstream, windows-2022 / stable, and 10 discussions need to be resolved
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 26 at r5 (raw file):
### Prerequisites 1\. Bazel ([installation instructions](https://bazel.build/install)). This demo uses Bazel 8.1.1
nit: The demo uses Bazel 8.2.1. A more future-proof way in general might be to just use "1. A recent version of Bazel ([installation...)".
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 50 at r5 (raw file):
2\. `requirements.lock` - Ensures consistent Python dependencies across all environments<br> 3\. `.bazelrc` - Main Bazel configuration file setting global options for hermetic builds and remote execution<br> 4\. `MODULE.bazel` - Configures the project as a Bazel module, tells Bazel we'll need python, pip and CPU-only PyTorch, and manages external dependencies<br>
nit:
Suggestion:
Python, `pip`
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 75 at r5 (raw file):
### The NativeLink Difference:
nit: Headers probably shouldn't end in a colon.
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 77 at r5 (raw file):
### The NativeLink Difference: To demonstrate NativeLink’s efficacy, consistency, and reliability, we ran the same fine-tuning job on the CPU of an M1 Pro MacBook Pro, the free version of Google Colab on CPU, and [NativeLink](https://github.com/TraceMachina/nativelink),which is free and open-source. We executed the fine-tuning task 5 times and this is what we observed:
Suggestion:
, w
web/platform/src/content/posts/Finetune_LLM_On_CPU.md line 98 at r5 (raw file):
For forward-thinking AI teams, this infrastructure stack represents a shift from the "bigger is better" hardware arms race toward thoughtful resource utilization. The competitive advantage increasingly belongs to those who can extract maximum value from available compute rather than those who deploy more powerful hardware.
nit: Duplicate newline
web/platform/src/content/docs/docs/config/production-config.mdx line 18 at r5 (raw file):
To run NativeLink, you just pass the path to a single JSON5 Configuration file, such as: ```bash
nit: This is a nice fix, but unrelated to this PR. Consider pulling it out into a separate mini fix PR.
.github/styles/config/vocabularies/TraceMachina/accept.txt line 19 at r5 (raw file):
[Hh]ermeticity JDK json
This shouldn't be in here. JSON should always be written in capital letters. If you need the file ending `.json`, the corresponding text should be in backticks.
.github/styles/config/vocabularies/TraceMachina/accept.txt line 38 at r5 (raw file):
OSSF Reclient [Rr]osetta
Rosetta should always be capitalized.
.github/styles/config/vocabularies/TraceMachina/accept.txt line 39 at r5 (raw file):
Reclient [Rr]osetta [Rr]epo
Write out repository instead.
Evaan2001
left a comment
Without the backslash in `1\.`, the `1.` doesn't show in `bun preview` (screenshot attached)
I'll remove the bold from the titles (just thought it looked better).
Reviewable status: 1 of 2 LGTMs obtained, and all files reviewed, and 10 discussions need to be resolved
web/platform/src/content/docs/docs/config/production-config.mdx line 18 at r5 (raw file):
Previously, aaronmondal (Aaron Siddhartha Mondal) wrote…
nit: This is a nice fix, but unrelated to this PR. Consider pulling it out into a separate mini fix PR.
This was causing vale-related errors barring me from committing changes.
7d54526 to
92074b9
Compare
aaronmondal
left a comment
Reviewed 2 of 2 files at r6, all commit messages.
Dismissed @MarcusSorealheis from a discussion.
Reviewable status: 2 of 2 LGTMs obtained, and all files reviewed, and pending CI: Cargo Dev / macos-15, Local / lre-rs / macos-15, NativeLink.com Cloud / Remote Cache / macos-15, buildstream, and 1 discussions need to be resolved
aaronmondal
left a comment
Reviewable status: complete! 2 of 2 LGTMs obtained, and all files reviewed

Description
This PR:
Type of change
How Has This Been Tested?
The content and design of this web page are heavily inspired by the web page I made for my full AI Customer Service Agent blog post, which has already received LGTMs from Marcus and Aaron, though the latter has not yet been added to production. Once the page development was complete and the page was examined via `rm -rd dist && bun preview`, everything looked as expected, and no errors were thrown.