Commits
88 commits
e848c02
Add make_dynamic_open_dataflow_graph_from_pcg.
elliottslaughter Feb 4, 2026
40c5609
Empty skeleton of the realm-execution backend.
elliottslaughter Feb 4, 2026
14c1b94
More Realm execution skeleton.
elliottslaughter Feb 4, 2026
a9a365d
Stub creation.
elliottslaughter Feb 4, 2026
788300c
More passes.
elliottslaughter Feb 4, 2026
7628271
Add Realm manager and test it.
elliottslaughter Feb 5, 2026
d2b3f01
Do not expose raw runtime and properly wait in test.
elliottslaughter Feb 5, 2026
37f1d20
Sketch more Realm manager APIs.
elliottslaughter Feb 5, 2026
1fe90c1
Add controller functionality.
elliottslaughter Feb 5, 2026
150d9f4
Fix Realm tests.
elliottslaughter Feb 5, 2026
9fcc76e
Support passing closure arguments to controllers.
elliottslaughter Feb 5, 2026
98c6053
Move task IDs into Realm and assign IDs to remaining tasks.
elliottslaughter Feb 5, 2026
f8ab575
Avoid pulling in the entire invocation.
elliottslaughter Feb 5, 2026
c9c2b18
Conversion into Realm task IDs.
elliottslaughter Feb 5, 2026
2a0bdd9
Add a top-level PRealm switch.
elliottslaughter Feb 5, 2026
05eeada
Some work on Realm task registry.
elliottslaughter Feb 6, 2026
621814b
Split out the Realm context.
elliottslaughter Feb 6, 2026
362b6c0
Switch to mapped PCG.
elliottslaughter Feb 6, 2026
b39058c
Add shard expansion pass (and implement shard expansion pass).
elliottslaughter Feb 6, 2026
32c4f61
Add instance field to dynamic graph, more task IDs.
elliottslaughter Feb 6, 2026
5bd089e
Fix filename.
elliottslaughter Feb 6, 2026
6bd47e0
Some work in instance allocation and registry/manager.
elliottslaughter Feb 6, 2026
8e4cd09
Instance allocation.
elliottslaughter Feb 6, 2026
ad53671
Simplify dims and use constructors.
elliottslaughter Feb 6, 2026
876ccc0
Refactor.
elliottslaughter Feb 6, 2026
2401234
Sketch out device mapping.
elliottslaughter Feb 6, 2026
d718012
Move instance backing to a separate map, remove realm from task-spec.
elliottslaughter Feb 6, 2026
1da6450
Implement processor queries.
elliottslaughter Feb 7, 2026
a507ce1
Enable PRealm.
elliottslaughter Feb 7, 2026
7b60556
Move tasks to dedicated file, stub out device state init, shuffle dir…
elliottslaughter Feb 10, 2026
950e6e8
Make use of task args struct.
elliottslaughter Feb 10, 2026
901f0cb
Use task args struct.
elliottslaughter Feb 10, 2026
0535c34
Refactor task APIs.
elliottslaughter Feb 10, 2026
1d65648
Finish implementation of device init task.
elliottslaughter Feb 10, 2026
95df073
Finish implementation of device state initialization.
elliottslaughter Feb 10, 2026
9a41fb4
Block on initialization.
elliottslaughter Feb 10, 2026
de338ae
Wire up rest of Realm implementation.
elliottslaughter Feb 11, 2026
46b7053
Implement Realm device idx.
elliottslaughter Feb 11, 2026
4563454
Updates to compile against latest local-execution.
elliottslaughter Feb 12, 2026
6daf370
Fix up function arguments.
elliottslaughter Feb 12, 2026
f7e58bd
Rename PCGInstance and add dependency set.
elliottslaughter Feb 12, 2026
bb5a54a
Dependency tracking.
elliottslaughter Feb 12, 2026
8588e36
Add event argument to controller.
elliottslaughter Feb 12, 2026
eacdc8c
Implement the allocator.
elliottslaughter Feb 12, 2026
6828cfa
Implement device handle.
elliottslaughter Feb 12, 2026
03cda52
Distributed device handle initialization.
elliottslaughter Feb 12, 2026
a10b35a
Distributed device handle initialization.
elliottslaughter Feb 13, 2026
2fc992c
Test distributed device handle.
elliottslaughter Feb 13, 2026
939c49a
Guard the kinds of procs we run on.
elliottslaughter Feb 13, 2026
d21558a
Switch to own DeviceSpecific implementation with raw pointers.
elliottslaughter Feb 13, 2026
1beaa05
Separate device handle test.
elliottslaughter Feb 13, 2026
68ce681
More work on Realm tests.
elliottslaughter Feb 13, 2026
2476d92
JSON serialization of a bunch of data types.
elliottslaughter Feb 14, 2026
9c6de3c
Make more stuff serializable.
elliottslaughter Feb 14, 2026
1d1586f
To-do notes.
elliottslaughter Feb 14, 2026
8e9cefc
More serialization routines.
elliottslaughter Feb 14, 2026
365dca0
Most of serializer finished.
elliottslaughter Feb 14, 2026
2c19493
Finish serialization of device init task.
elliottslaughter Feb 14, 2026
d05b73e
Switch over to explicit DTGs for task arguments and serialization.
elliottslaughter Feb 14, 2026
6a380ce
Convert op task args.
elliottslaughter Feb 14, 2026
a46dd46
Map the PCG for test.
elliottslaughter Feb 15, 2026
056312f
Fix a bug in shard expansion.
elliottslaughter Feb 15, 2026
c44035f
Finish body of instance allocation.
elliottslaughter Feb 15, 2026
b9417d0
Fix some bugs in loss insertion, instance allocation.
elliottslaughter Feb 17, 2026
aec4a19
Fixes for PCG initialization.
elliottslaughter Feb 17, 2026
6adf137
Fix a bug in device state handling.
elliottslaughter Feb 17, 2026
27660ec
Implement most of tensor backing in task.
elliottslaughter Feb 17, 2026
afad03b
Refactor and finish tensor instance backing.
elliottslaughter Feb 17, 2026
9da0b94
Don't execute tasks on input or weight nodes.
elliottslaughter Feb 17, 2026
ee32e03
Refactor device specific managed handle.
elliottslaughter Feb 18, 2026
f7bb5ec
Refactor per-device op state backing.
elliottslaughter Feb 18, 2026
0fc66ba
Register loss task.
elliottslaughter Feb 18, 2026
5ba6a61
Test loss in Realm.
elliottslaughter Feb 18, 2026
8b13e27
Test CPU model parallelism.
elliottslaughter Feb 18, 2026
a05fa06
Use Realm's own allocator in test.
elliottslaughter Feb 18, 2026
a59ba1e
Fix typo.
elliottslaughter Feb 18, 2026
ea76b0f
Add Realm top-level README.
elliottslaughter Feb 19, 2026
8be3fdc
Add and fix GPU test (no loss so far).
elliottslaughter Feb 20, 2026
bd0227c
Add a GPU distributed handle test.
elliottslaughter Feb 20, 2026
9b726fb
Test GPU loss values.
elliottslaughter Feb 20, 2026
0b75bbd
Update Realm to include build fixes.
elliottslaughter Feb 20, 2026
91904e8
Ensure that Realm tests do not leak instances.
elliottslaughter Feb 23, 2026
2f5decb
Update Realm allocator to follow pattern of other allocators.
elliottslaughter Feb 24, 2026
1db1448
Remove explicit deallocation which is not required by updated allocator.
elliottslaughter Feb 24, 2026
7a66f5a
Support for PRealm.
elliottslaughter Feb 23, 2026
1dff7af
Update to Realm main commit for PRealm.
elliottslaughter Feb 24, 2026
2abfb8d
Add a switch to control PRealm.
elliottslaughter Feb 25, 2026
9dd1f12
Update rect constructor.
elliottslaughter Feb 27, 2026
48 changes: 0 additions & 48 deletions .flake/pkgs/legion.nix

This file was deleted.

46 changes: 46 additions & 0 deletions .flake/pkgs/realm.nix
@@ -0,0 +1,46 @@
{ lib
, stdenv
, fetchFromGitHub
, cmake
, cudaPackages ? { }
, zlib
, maxDim ? 5
}:

let
inherit (cudaPackages) cudatoolkit;
in

stdenv.mkDerivation rec {
pname = "realm";
version = "2026-02-24";

src = fetchFromGitHub {
owner = "StanfordLegion";
repo = "realm";
rev = "42f7484a80e0bdacaf47d9a758822f5327348dd0";
sha256 = "sha256-IHiokPmTjEV5df3fr1Xubuyt2N1CFI2fA7Q2TsbxS3Y=";
};

nativeBuildInputs = [
cmake
];

cmakeFlags = [
"-DBUILD_SHARED_LIBS=ON"
"-DREALM_ENABLE_CUDA=ON"
"-DREALM_ENABLE_PREALM=ON"
"-DREALM_MAX_DIM=${toString maxDim}"
];

buildInputs = [
cudatoolkit
zlib
];

meta = with lib; {
description = "Realm is a distributed, event-based tasking runtime for building high-performance applications that span clusters of CPUs, GPUs, and other accelerators";
homepage = "https://legion.stanford.edu/realm";
license = licenses.asl20;
};
}
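The derivation above exposes `maxDim` as a function argument that feeds `-DREALM_MAX_DIM`. As a hedged sketch (the override value here is illustrative, not part of this PR), a caller using the standard `callPackage` pattern could raise the maximum index-space dimensionality at the call site:

```nix
# Illustrative only: override the default maxDim = 5 when instantiating the package.
realm = pkgs.callPackage ./.flake/pkgs/realm.nix { maxDim = 6; };
```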
7 changes: 7 additions & 0 deletions .proj.toml
@@ -85,6 +85,13 @@ has-cpu-only-benchmarks = false
has-cuda-tests = true
has-cuda-benchmarks = false

[targets.realm-execution]
type = "lib"
has-cpu-only-tests = true
has-cpu-only-benchmarks = false
has-cuda-tests = true
has-cuda-benchmarks = false

# [targets.local-pcg-execution]
# type = "lib"
# has-cpu-only-tests = true
1 change: 1 addition & 0 deletions CMakeLists.txt
@@ -33,6 +33,7 @@ set(FF_MAX_NUM_TASK_REGIONS "20" CACHE STRING
set(FF_MAX_NUM_TASK_ARGUMENTS "5" CACHE STRING
"Maximum number of arguments that can be declared in a TaskSignature")
option(FF_USE_NCCL "Run FlexFlow with NCCL" OFF)
option(FF_USE_PREALM "Build with PRealm profiling interface" ON)
option(FF_USE_ALL_PREBUILT_LIBRARIES "Enable use of all pre-compiled libraries, if available" OFF)
option(FF_USE_PYTHON "Enable Python" ON)
option(FF_BUILD_FROM_PYPI "Build from pypi" OFF)
1 change: 1 addition & 0 deletions cmake/flexflow-utils.cmake
@@ -17,6 +17,7 @@ function(define_ff_vars target)
MAX_NUM_FUSED_TENSORS=${FF_MAX_NUM_FUSED_TENSORS}
MAX_NUM_WORKERS=${FF_MAX_NUM_WORKERS}
FF_USE_NCCL=${FF_USE_NCCL}
FF_USE_PREALM=${FF_USE_PREALM}
MAX_TENSOR_DIM=${FF_MAX_DIM}
MAX_NUM_TASK_REGIONS=${FF_MAX_NUM_TASK_REGIONS}
MAX_NUM_TASK_ARGUMENTS=${FF_MAX_NUM_TASK_ARGUMENTS}
21 changes: 10 additions & 11 deletions flake.nix
@@ -30,8 +30,8 @@
};
};

outputs = { self, nixpkgs, flake-utils, proj-repo, nixGL, ... }: flake-utils.lib.eachSystem [ "x86_64-linux" ] (system:
let
outputs = { self, nixpkgs, flake-utils, proj-repo, nixGL, ... }: flake-utils.lib.eachSystem [ "x86_64-linux" ] (system:
let
pkgs = import nixpkgs {
inherit system;
config.allowUnfree = true;
@@ -41,21 +41,21 @@
mkShell = attrs: pkgs.mkShell.override {
stdenv = pkgs.cudaPackages.backendStdenv;
} (attrs // {
hardeningDisable = ["all"]; # disable nixpkgs default compiler arguments, otherwise ubsan doesn't catch
# signed overflows due to the signedoverflow hardening setting.
# for more details, see the following (long-running) nixpkgs github issues:
hardeningDisable = ["all"]; # disable nixpkgs default compiler arguments, otherwise ubsan doesn't catch
# signed overflows due to the signedoverflow hardening setting.
# for more details, see the following (long-running) nixpkgs github issues:
# - https://github.com/NixOS/nixpkgs/issues/18995
# - https://github.com/NixOS/nixpkgs/issues/60919
});

proj = proj-repo.packages.${system}.proj;
in
in
{
packages = rec {
libdwarf-lite = pkgs.callPackage ./.flake/pkgs/libdwarf-lite.nix { };
cpptrace = pkgs.callPackage ./.flake/pkgs/cpptrace.nix { inherit libdwarf-lite; };
libassert = pkgs.callPackage ./.flake/pkgs/libassert.nix { inherit cpptrace; };
legion = pkgs.callPackage ./.flake/pkgs/legion.nix { };
realm = pkgs.callPackage ./.flake/pkgs/realm.nix { };
bencher-cli = pkgs.callPackage ./.flake/pkgs/bencher-cli.nix { };
ffdb = pkgs.callPackage ./.flake/pkgs/ffdb { inherit proj; };
hpp2plantuml = pkgs.python3Packages.callPackage ./.flake/pkgs/hpp2plantuml.nix { };
@@ -83,8 +83,7 @@
shellHook = ''
export PATH="$HOME/ff/.scripts/:$PATH"
export RC_PARAMS="max_discard_ratio=100"
export CMAKE_FLAGS="-DFF_USE_EXTERNAL_LEGION=ON \
-DFF_USE_EXTERNAL_NCCL=ON \
export CMAKE_FLAGS="-DFF_USE_EXTERNAL_NCCL=ON \
-DFF_USE_EXTERNAL_JSON=ON \
-DFF_USE_EXTERNAL_FMT=ON \
-DFF_USE_EXTERNAL_SPDLOG=ON \
@@ -94,7 +93,7 @@
-DFF_USE_EXTERNAL_GBENCHMARK=ON \
-DFF_USE_EXTERNAL_LIBASSERT=ON"
'';

buildInputs = builtins.concatLists [
(with pkgs; [
zlib
@@ -125,7 +124,7 @@
])
(with self.packages.${system}; [
libassert
legion
realm
rapidcheckFull
doctest
])
1 change: 1 addition & 0 deletions lib/CMakeLists.txt
@@ -5,6 +5,7 @@ add_subdirectory(op-attrs)
add_subdirectory(kernels)
add_subdirectory(local-execution)
add_subdirectory(local-pcg-execution)
add_subdirectory(realm-execution)
add_subdirectory(task-spec)
add_subdirectory(utils)
add_subdirectory(ffi)
3 changes: 3 additions & 0 deletions lib/kernels/include/kernels/device_handle_t.h
@@ -9,6 +9,9 @@ namespace FlexFlow {
device_handle_t device_handle_t_from_managed_handle(
std::optional<ManagedPerDeviceFFHandle> const &managed_handle);

device_handle_t device_handle_t_from_managed_handle_ptr(
std::optional<ManagedPerDeviceFFHandle *> const &managed_handle);

device_handle_t gpu_make_device_handle_t(PerDeviceFFHandle const &ff_handle);
device_handle_t cpu_make_device_handle_t();

9 changes: 9 additions & 0 deletions lib/kernels/src/kernels/device_handle_t.cc
@@ -11,6 +11,15 @@ device_handle_t device_handle_t_from_managed_handle(
}
}

device_handle_t device_handle_t_from_managed_handle_ptr(
std::optional<ManagedPerDeviceFFHandle *> const &managed_handle) {
if (managed_handle.has_value()) {
return gpu_make_device_handle_t(managed_handle.value()->raw_handle());
} else {
return cpu_make_device_handle_t();
}
}

device_handle_t gpu_make_device_handle_t(PerDeviceFFHandle const &ff_handle) {
return device_handle_t{
ff_handle,
@@ -81,7 +81,8 @@ ComputationGraphInstance create_computation_graph_instance(
auto [loss_inserted_dg, label_v, logit_grad_v] = perform_loss_insertion(
dg,
assert_unwrap(loss_attrs),
dynamic_tensor_guid_t{assert_unwrap(logit_tensor)});
dynamic_tensor_guid_t{assert_unwrap(logit_tensor)},
std::nullopt);
dg = loss_inserted_dg;
logit_grad_value = logit_grad_v;
inputs.insert(std::pair{label_v, assert_unwrap(label_tensor)});
4 changes: 2 additions & 2 deletions lib/local-execution/test/src/local-execution/test_e2e.cc
@@ -21,8 +21,8 @@

using namespace ::FlexFlow;

bool did_loss_decrease(GenericTensorAccessorR const &first_epoch,
GenericTensorAccessorR const &last_epoch) {
static bool did_loss_decrease(GenericTensorAccessorR const &first_epoch,
GenericTensorAccessorR const &last_epoch) {
Allocator cpu_allocator = create_local_cpu_memory_allocator();

return tensor_accessor_all(
1 change: 1 addition & 0 deletions lib/pcg/include/pcg/layer_guid_t.dtg.toml
@@ -6,6 +6,7 @@ features = [
"ord",
"hash",
"fmt",
"json",
]

includes = [
@@ -5,6 +5,7 @@
#include "pcg/machine_space_coordinate.dtg.h"
#include "pcg/mapped_parallel_computation_graph/operator_atomic_task_shard_binding.dtg.h"
#include "utils/bidict/bidict.h"
#include <nlohmann/json.hpp>

namespace FlexFlow {

@@ -45,4 +46,15 @@ struct hash<::FlexFlow::MappedOperatorTaskGroup> {
};

} // namespace std

namespace nlohmann {

template <>
struct adl_serializer<::FlexFlow::MappedOperatorTaskGroup> {
static ::FlexFlow::MappedOperatorTaskGroup from_json(json const &j);
static void to_json(json &j, ::FlexFlow::MappedOperatorTaskGroup const &t);
};

} // namespace nlohmann

#endif
@@ -32,6 +32,10 @@ ParallelLayerAddedResult add_parallel_layer(
ParallelLayerAddedResult pcg_add_input_layer(ParallelComputationGraph &pcg,
TensorShape const &tensor_shape);

ParallelLayerAddedResult
pcg_add_input_layer_with_grad(ParallelComputationGraph &pcg,
TensorShape const &tensor_shape);

OperatorTaskSpace get_operator_task_space(ParallelComputationGraph const &pcg,
parallel_layer_guid_t const &layer);

@@ -54,6 +58,9 @@ std::unordered_map<TensorSlotName, ParallelComputationGraphEdge>
std::unordered_set<parallel_layer_guid_t>
get_initial_layers(ParallelComputationGraph const &);

std::unordered_map<TensorSlotName, parallel_tensor_guid_t>
get_outgoing_tensors(ParallelComputationGraph const &,
parallel_layer_guid_t const &);
std::unordered_map<TensorSlotName, parallel_tensor_guid_t>
get_incoming_tensors(ParallelComputationGraph const &,
parallel_layer_guid_t const &);
@@ -107,6 +114,9 @@ ParallelTensorShape get_parallel_tensor_shape(ParallelComputationGraph const &,
std::vector<parallel_layer_guid_t>
topological_ordering(ParallelComputationGraph const &);

std::unordered_map<parallel_layer_guid_t, ParallelLayerAttrs>
get_parallel_layer_attrs_mapping(ParallelComputationGraph const &pcg);

parallel_layer_guid_t
get_parallel_layer_by_name(ParallelComputationGraph const &pcg,
std::string const &name);
@@ -6,6 +6,7 @@ features = [
"ord",
"hash",
"fmt",
"json",
]

includes = [
@@ -6,6 +6,7 @@ features = [
"ord",
"hash",
"fmt",
"json",
]

includes = [
1 change: 1 addition & 0 deletions lib/pcg/include/pcg/tensor_guid_t.dtg.toml
@@ -6,6 +6,7 @@ features = [
"ord",
"hash",
"fmt",
"json",
]

includes = [
@@ -90,3 +90,20 @@ size_t hash<::FlexFlow::MappedOperatorTaskGroup>::operator()(
}

} // namespace std

namespace nlohmann {

::FlexFlow::MappedOperatorTaskGroup
adl_serializer<::FlexFlow::MappedOperatorTaskGroup>::from_json(
json const &j) {
return ::FlexFlow::MappedOperatorTaskGroup{j.template get<
::FlexFlow::bidict<::FlexFlow::MachineSpaceCoordinate,
::FlexFlow::OperatorAtomicTaskShardBinding>>()};
}

void adl_serializer<::FlexFlow::MappedOperatorTaskGroup>::to_json(
json &j, ::FlexFlow::MappedOperatorTaskGroup const &t) {
j = t.get_shard_bindings();
}

} // namespace nlohmann