Skip to content

Conversation

uurien
Copy link

@uurien uurien commented Sep 17, 2025

When a garbage collector is executed, the callback of FinalizationRegistry it is not executed synchronously with the GC, it is executed later, in the next event loop.

That means that there is a corner case in the WeakRefMap object in diagnostics channel. Eventually could happen that an event is GC and created again before the execution of the callback of FinalizationRegistry. When this happens, the key object is deleted from the WeakRefMap even when it has a valid value.

This behavior can be reproduced with this code added in tests:

const assert = require('assert');
const { channel } = require('diagnostics_channel');

function test () {
  const testChannel = channel('test-gc');

  setTimeout(() => {
    const testChannel2 = channel('test-gc');

    assert.ok(testChannel === testChannel2, 'Channel instances must be the same');
  });
}

test();

setTimeout(() => {
  global.gc();
  test();
}, 10);

This code fails in main branch, but it works as expected in the current branch.

@nodejs-github-bot nodejs-github-bot added diagnostics_channel Issues and PRs related to diagnostics channel needs-ci PRs that need a full CI run. labels Sep 17, 2025
@IlyasShabi IlyasShabi added the request-ci Add this label to start a Jenkins CI on a PR. label Sep 17, 2025
@github-actions github-actions bot added request-ci-failed An error occurred while starting CI via request-ci label, and manual interventon is needed. and removed request-ci Add this label to start a Jenkins CI on a PR. labels Sep 17, 2025
Copy link
Contributor

Failed to start CI
   ⚠  No approving reviews found
   ✘  Refusing to run CI on potentially unsafe PR
https://github.com/nodejs/node/actions/runs/17791170306

@simon-id
Copy link
Contributor

simon-id commented Sep 17, 2025

Secondary point

diagnostics_channel has always been subject to GC uintuitiveness, and some of its API was thus deprecated because of it.
This issue and this PR, revoked the deprecation notice because @Qard apparently fixed the GC problems caused by WeakRefs.
But it looks like that it wasn't completely fixed since we ran into more problems, that the present PR is trying to fix.
I propose to also update the deprecation notice in the docs to be bumped to the version of node where this PR will land.

Copy link
Member

@Qard Qard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just one non-blocking suggestion.

Copy link

codecov bot commented Sep 17, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.29%. Comparing base (f1a8f44) to head (4ddedfd).
⚠️ Report is 12 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff            @@
##             main   #59910    +/-   ##
========================================
  Coverage   88.28%   88.29%            
========================================
  Files         702      702            
  Lines      206904   207030   +126     
  Branches    39808    39829    +21     
========================================
+ Hits       182665   182789   +124     
+ Misses      16256    16246    -10     
- Partials     7983     7995    +12     
Files with missing lines Coverage Δ
lib/diagnostics_channel.js 99.09% <100.00%> (+0.01%) ⬆️

... and 41 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@uurien uurien force-pushed the ugaitz/fix-diagnostics_channel-and-gc-race-condition branch from e84ab55 to a40e16b Compare September 17, 2025 10:19
@Renegade334
Copy link
Contributor

Nice spot.

This issue and this PR, revoked the deprecation notice because @Qard apparently fixed the GC problems caused by WeakRefs.
But it looks like that it wasn't completely fixed since we ran into more problems, that the present PR is trying to fix.
I propose to also update the deprecation notice in the docs to be bumped to the version of node where this PR will land.

The issue raised here relates to the behaviour of the channel API as a whole, not specifically to that part of the API that was subject to the deprecation notice. (Indeed, the deprecation notice's previous "safe" suggestion of using the top-level subscribe functions is equally affected.) I don't think that the rationale for DEP0163's revocation is affected by this PR.

Copy link
Member

@BridgeAR BridgeAR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might misread some code, I just believe we might be able to use SafeWeakRef directly and not increment and decrement at all in here after the change. I believe that is effectively the same. That would simplify the code quite a bit.

@BridgeAR BridgeAR removed the request-ci-failed An error occurred while starting CI via request-ci label, and manual interventon is needed. label Sep 17, 2025
@Renegade334
Copy link
Contributor

I might misread some code, I just believe we might be able to use SafeWeakRef directly and not increment and decrement at all in here after the change. I believe that is effectively the same. That would simplify the code quite a bit.

The channel object may never strongly exist in user space, hence the rationale for WeakReference to ensure that the channel object is strongly held if the subscriber count is nonzero. Just using WeakRefs would expose the channel object to being immediately GC'd if it weren't referenced by the user, no?

@BridgeAR
Copy link
Member

@Renegade334 I mean the following:

diff --git a/lib/diagnostics_channel.js b/lib/diagnostics_channel.js
index 312bd258f58..6d6e392446d 100644
--- a/lib/diagnostics_channel.js
+++ b/lib/diagnostics_channel.js
@@ -17,6 +17,7 @@ const {
   ReflectApply,
   SafeFinalizationRegistry,
   SafeMap,
+  SafeWeakRef,
   SymbolHasInstance,
 } = primordials;
 
@@ -31,30 +32,25 @@ const {
 
 const { triggerUncaughtException } = internalBinding('errors');
 
-const { WeakReference } = require('internal/util');
-
-// Can't delete when weakref count reaches 0 as it could increment again.
 // Only GC can be used as a valid time to clean up the channels map.
 class WeakRefMap extends SafeMap {
   #finalizers = new SafeFinalizationRegistry((key) => {
-    this.delete(key);
+    if (!this.has(key)) {
+      this.delete(key);
+    }
   });
 
   set(key, value) {
     this.#finalizers.register(value, key);
-    return super.set(key, new WeakReference(value));
+    return super.set(key, new SafeWeakRef(value));
   }
 
   get(key) {
-    return super.get(key)?.get();
-  }
-
-  incRef(key) {
-    return super.get(key)?.incRef();
+    return super.get(key)?.deref();
   }
 
-  decRef(key) {
-    return super.get(key)?.decRef();
+  has(key) {
+    return !!this.get(key);
   }
 }
 
@@ -101,7 +97,6 @@ class ActiveChannel {
     validateFunction(subscription, 'subscription');
     this._subscribers = ArrayPrototypeSlice(this._subscribers);
     ArrayPrototypePush(this._subscribers, subscription);
-    channels.incRef(this.name);
   }
 
   unsubscribe(subscription) {
@@ -113,15 +108,12 @@ class ActiveChannel {
     this._subscribers = before;
     ArrayPrototypePushApply(this._subscribers, after);
 
-    channels.decRef(this.name);
     maybeMarkInactive(this);
 
     return true;
   }
 
   bindStore(store, transform) {
-    const replacing = this._stores.has(store);
-    if (!replacing) channels.incRef(this.name);
     this._stores.set(store, transform);
   }
 
@@ -132,7 +124,6 @@ class ActiveChannel {
 
     this._stores.delete(store);
 
-    channels.decRef(this.name);
     maybeMarkInactive(this);
 
     return true;
+++ b/test/fixtures/source-map/output/source_map_assert_source_line.snapshot
@@ -7,7 +7,7 @@ AssertionError [ERR_ASSERTION]: The expression evaluated to a falsy value:
     *
     *
     *
-    at TracingChannel.traceSync (node:diagnostics_channel:322:14)
+    at TracingChannel.traceSync (node:diagnostics_channel:313:14)
     *
     *
     *

This passes all tests and I believe this is correct, since we never actually use the increment / decrement counter in this code.

I didn't check: do we have a similar issue with domain? That is also using the increment / decrement counter at the moment.

@Qard
Copy link
Member

Qard commented Sep 17, 2025

The incRef was needed because we want the channel object itself to be possible to garbage collect if there are no references, however we also need the subscribe to hold it alive for as long as a subscriber is present.

The reason for this is that one can do channel(name).subscribe(subscriber) which would immediately lose the channel reference because it's only pushing a subscriber into it without storing it somewhere that it could be referenced in the future.

@uurien
Copy link
Author

uurien commented Sep 17, 2025

If I understand your message @Qard, the reason of having a WeakReference is to be safe of this type of things:

dc.channel('mychannel').subscribe(() => { /* my function */ }) // no references to the channel

// => later. after some Garbage collections
dc.channel('mychannel').hasSubscriptions // without the ref count, this can be false and the subscription can be lost

@Qard
Copy link
Member

Qard commented Sep 17, 2025

Yes, without the incRef the second channel(...) call would construct a new channel object and the subscriber would be lost.

@BridgeAR
Copy link
Member

Thanks, that makes sense! Should we add another test that covers this case as well? That way it would be more obvious. The example by @uurien would just need a gc call in-between.

Copy link
Member

@BridgeAR BridgeAR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

It would be nice to get the other test we spoke about and I am fine either way about adding the has method or not.

@uurien
Copy link
Author

uurien commented Sep 17, 2025

Should we add another test that covers this case as well? That way it would be more obvious. The example by @uurien would just need a gc call in-between.

I'm going to add another test to this use case

I am fine either way about adding the has method or not

I'm also changing it to has

@uurien uurien force-pushed the ugaitz/fix-diagnostics_channel-and-gc-race-condition branch from a40e16b to 4ddedfd Compare September 17, 2025 14:48
@uurien
Copy link
Author

uurien commented Sep 17, 2025

I've addressed the changes, it'd be nice if some maintainer could run the CI. Thanks!

@BridgeAR BridgeAR added request-ci Add this label to start a Jenkins CI on a PR. author ready PRs that have at least one approval, no pending requests for changes, and a CI started. labels Sep 17, 2025
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Sep 17, 2025
@nodejs-github-bot
Copy link
Collaborator

@IlyasShabi IlyasShabi added the commit-queue Add this label to land a pull request using GitHub Actions. label Sep 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
author ready PRs that have at least one approval, no pending requests for changes, and a CI started. commit-queue Add this label to land a pull request using GitHub Actions. diagnostics_channel Issues and PRs related to diagnostics channel needs-ci PRs that need a full CI run.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants