
Commit 18eb9e1 (1 parent: 7a8f561)

Add Finetuning LLMs on CPUs Blog Post (#1748)

File tree: 3 files changed (+109 -2 lines)

.github/styles/config/vocabularies/TraceMachina/accept.txt

Lines changed: 8 additions & 0 deletions

```diff
@@ -1,16 +1,22 @@
+[Aa]gentic
 Ansys
+Anthropic
 AMI
 Astro
 Bazel
 Bazelisk
+[Bb]oolean
 Cloudflare
+Colab
+CPUs
 ELB
 FFI
 FFIs
 GPUs
 Goma
 [Hh]ermeticity
 JDK
+JSON
 Kustomization
 Kustomizations
 LLD
@@ -28,6 +34,8 @@ NativeLink
 OCI
 OSSF
 Reclient
+Rosetta
+[Rr]epository
 SPDX
 Starlark
 Tokio
```

web/platform/src/content/docs/docs/config/production-config.mdx

Lines changed: 4 additions & 2 deletions

````diff
@@ -13,9 +13,11 @@ which both provide helpful reference for customers looking to deploy NativeLink
 ## Production CAS Overview
 At NativeLink we offer CAS-as-a-Service running on all the major cloud providers (AWS, GCP, Azure, etc). This allows our customers to get started with NativeLink to improve build & test performance with minimal effort. Behind the scenes, each CAS service runs in a Kubernetes namespace with a dedicated ActionCache store and a shared CAS store. In this article, we take a deep dive into how we Configure the CAS service in our cloud. Even if you’re not using our hosted CAS service, the insights covered here will help you Configure your own CAS to achieve high performance and scalability.
 
-To run NativeLink, you just pass the path to a single JSON Configuration file, such as:
+To run NativeLink, you just pass the path to a single JSON5 Configuration file, such as:
 
-/bin/NativeLink /etc/Config/cas.json
+```bash
+/bin/NativeLink /etc/Config/cas.json5
+```
 
 The entire JSON file we use for the cloud service is included at the end of this document.
 NativeLink Servers
````
Lines changed: 97 additions & 0 deletions

@@ -0,0 +1,97 @@

---
title: "Fine-tune a Language Model on x86 CPUs using Bazel and NativeLink"
tags: ["news", "blog-posts"]
image: https://github.com/user-attachments/assets/ddfb5684-327b-4af9-9618-be707eab894f
slug: finetune-with-bazel-nativelink
pubDate: 2025-05-15
readTime: 20 minutes
---

## Introduction

The future of AI development belongs not necessarily to those with the most powerful infrastructure but to those who can extract maximum value from available resources. This tutorial emphasizes CPU-based fine-tuning, demonstrating that with intelligent resource management through NativeLink, impressive results can be achieved without expensive GPU or TPU infrastructure. As compute becomes increasingly costly, competitive advantage will shift toward teams that optimize resource efficiency rather than those deploying state-of-the-art hardware.

This guide demonstrates how to establish an optimized AI development pipeline by integrating several key technologies:<br>
1\. **Bazel Build System**: For efficient repository management. A repository managed by Bazel allows your team to work in a unified codebase while maintaining clean separation of concerns.<br>
2\. **NativeLink**: A remote execution system hosted in your cloud. With NativeLink's remote execution capabilities, you can leverage your cloud resources optimally without wasteful duplication of work.<br>
3\. **Hugging Face Transformers**: For integrating with the rich ecosystem of open-source models that you can run locally or deploy anywhere. The transformers library also provides a sophisticated caching mechanism that optimizes the loading of model weights, as sketched below.
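
As a minimal illustration of that caching behavior (the model name here is a small stand-in, not necessarily the one this tutorial uses), the first `from_pretrained` call downloads weights into the local Hugging Face cache, and subsequent calls load them from disk:

```python
# Sketch only: "distilgpt2" is an illustrative small model that fine-tunes
# tolerably on CPU; substitute whichever model your pipeline targets.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "distilgpt2"

# The first call downloads weights into the local cache
# (~/.cache/huggingface by default); repeat calls reuse the cached files.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```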

## Setting Up Your Repository With Bazel

Bazel is a build system designed for large, multi-component repositories: it allows you to organize code into logical components while maintaining the dependency relationships between them. For AI workloads this is particularly valuable, as it lets you separate model definitions, data processing pipelines, training code, and inference services.

### Prerequisites

1\. A recent version of Bazel ([installation instructions](https://bazel.build/install)). <br>
2\. A NativeLink [Cloud Account](https://app.nativelink.com/) (it’s free to get started, and secure) or [NativeLink 0.6.0](https://github.com/TraceMachina/nativelink/releases/tag/v0.6.0) (Apache-licensed and open source, for those who prefer hard mode)

### Initial Setup

First, let's download the project files. From the folder where you want them, run the following commands:

<div style="max-height: 250px; max-width: 50vw; overflow: auto;">

```bash
# Clone the entire repository
git clone https://github.com/TraceMachina/nativelink-blogs.git

# Navigate to the subdirectory
cd nativelink-blogs/finetuning_on_cpu
```
</div>

Here's a description of some of the files:<br>
1\. `README.md` - Instructions on how to connect to and use NativeLink Cloud, and how to run the code both locally and remotely<br>
2\. `requirements.lock` - Ensures consistent Python dependencies across all environments<br>
3\. `.bazelrc` - Main Bazel configuration file setting global options for hermetic builds and remote execution<br>
4\. `MODULE.bazel` - Configures the project as a Bazel module, tells Bazel we'll need Python, `pip`, and CPU-only PyTorch, and manages external dependencies<br>
5\. `pyproject.toml` - Python package configuration specifying dependencies and development tools<br>
6\. `BUILD.bazel` (root) - Root build file defining lock targets for Python dependency management<br>
7\. `platforms/BUILD` - Defines the Linux x86_64 execution platform, running Ubuntu 24.04, for remote builds<br>
8\. `training/BUILD` - Defines the model training targets and their dependencies (a sketch follows this list)<br>
9\. `training/main.py` - Main script that handles fine-tuning of language models on CPU using efficient training techniques<br>
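
To make the layout concrete, here is a hedged sketch of what `training/BUILD` might look like; the target name, the `@pip` hub name, and the exact dependency list are illustrative assumptions, not the repository's actual contents:

```python
# training/BUILD (illustrative sketch, assuming rules_python and a pip hub
# named "pip" declared in MODULE.bazel)
load("@rules_python//python:defs.bzl", "py_binary")
load("@pip//:requirements.bzl", "requirement")

py_binary(
    name = "main",
    srcs = ["main.py"],
    deps = [
        # Resolved from requirements.lock; the CPU-only torch wheel is pinned there.
        requirement("torch"),
        requirement("transformers"),
    ],
)
```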

## Important Note About Bazel’s Remote Execution Support

Bazel supports remote execution for building (compiling) and testing, but `bazel run` uses the host platform as its target platform, meaning the executable is invoked locally rather than on remote machines. To execute binaries on remote servers, a workaround is to design tests that execute the binary, effectively leveraging `bazel test`, which runs on the target platform.
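
A minimal sketch of that workaround, again with illustrative names (the actual repository may structure this differently): wrapping the training entry point in a `py_test` means `bazel test` schedules the job on the remote executors instead of running it locally.

```python
# training/BUILD (continued sketch): a py_test wrapper around the training
# entry point, so the fine-tuning job runs on the remote execution platform.
load("@rules_python//python:defs.bzl", "py_test")
load("@pip//:requirements.bzl", "requirement")

py_test(
    name = "finetune_test",
    srcs = ["main.py"],
    main = "main.py",
    deps = [
        requirement("torch"),
        requirement("transformers"),
    ],
    size = "enormous",  # long-running training job; raises the default test timeout
)

# Invoked with something like (remote-execution flags live in .bazelrc):
#   bazel test //training:finetune_test
```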

**DRAWBACK:**

Logging and print statements that track progress (like "Starting to fine-tune model" or "Exiting this function") behave differently under remote execution. Unlike local runs, where these appear in real time, remote testing collects all logs on the server and only displays them after test completion, when control returns to the local machine. The expected output is still fully preserved, just delayed until the process finishes running.

## Aside - Configuring Remote Execution For ARM (Apple Silicon)

Most cloud infrastructure (including NativeLink) runs on x86_64 processors, while Apple Silicon Macs use ARM64. Dependencies such as PyTorch are built for specific architectures, and setting up Bazel to handle platform-specific dependencies is complex.

If you want to run this code and you only have a Mac, the simplest route is to run it on a cloud-based Linux VM (GCP/AWS). If you don't want to use a cloud server, you could create a minimal x86_64 Docker container (`FROM --platform=linux/amd64 ubuntu:22.04` with just `curl` and `Bazelisk` installed) with **Rosetta enabled** and work from your Mac's terminal inside that container. However, we highly recommend against this approach; the correct solution would be to use toolchain transitions from a local macOS platform to the remote Linux runner, but that's outside the scope of this article.

## The NativeLink Difference

To demonstrate NativeLink’s efficacy, consistency, and reliability, we ran the same fine-tuning job on the CPU of an M1 Pro MacBook Pro, on the free tier of Google Colab on CPU, and on [NativeLink](https://github.com/TraceMachina/nativelink), which is free and open source. We executed the fine-tuning task five times on each, and this is what we observed:

1\. The Mac: the quickest run took 18 minutes, while the slowest took 20 minutes.

2\. Free tier of Google Colab: the quickest run took 10 minutes, while the slowest took 20 minutes. The execution time varied widely; we suspect varying traffic on Google’s servers and the way Colab allocates its compute resources played a part in this variability.

3\. Free NativeLink: the quickest run took 4 minutes of compute time, while the slowest took 6 minutes. NativeLink Cloud provided the quickest execution times by far.

<img src="https://github.com/user-attachments/assets/c2bf8e0e-1500-4ee7-a0f7-6b4ca2196257" width="1000" alt="Model Fine-Tuning Times">

## Conclusion: Optimizing AI Development Through Resource Efficiency

As demonstrated throughout this tutorial, the integration of Bazel's repository management, NativeLink's CPU-optimized remote execution, and Hugging Face's transformers library creates a development ecosystem that prioritizes computational efficiency over raw processing power. This approach addresses several critical challenges facing modern AI teams:

1\. **Resource Optimization**: By leveraging NativeLink's intelligent scheduling and optimization on CPU infrastructure, teams can achieve impressive fine-tuning results without the capital expenditure of specialized GPU/TPU hardware.<br>
2\. **Strategic Advantage**: This CPU-focused approach provides a competitive edge through efficient resource utilization, enabling teams to allocate budget toward innovation rather than hardware acquisition.<br>
3\. **Sustainable Scaling**: As models grow in size and complexity, the ability to efficiently distribute workloads across existing CPU infrastructure provides a more sustainable path to scale than continuously upgrading to the latest accelerators.<br>

For forward-thinking AI teams, this infrastructure stack represents a shift from the "bigger is better" hardware arms race toward thoughtful resource utilization. The competitive advantage increasingly belongs to those who can extract maximum value from available compute rather than to those who deploy more powerful hardware.

The journey from experimental AI projects to production-grade systems demands both technical sophistication and resource awareness. By adopting this CPU-optimized approach with Bazel and NativeLink, your team can focus less on infrastructure limitations and more on the creative potential of fine-tuned models, developing applications that deliver genuine value while maintaining computational efficiency.
