Skip to content

Conversation

eswar2001
Copy link

Previously, caches that failed during initialization were silently dropped from the substituters list, causing unnecessary rebuilds from source. This change preserves temporarily failed caches and implements smart error classification to distinguish recoverable errors (network timeouts, DNS failures) from permanent failures.

Key improvements:

  • Add isRecoverableStoreError() for intelligent error classification
  • Preserve caches with temporary failures for retry during substitution
  • Enhanced user feedback with clear warning messages
  • Maintain cache parallelism for working substituters

Fixes scenarios where network issues would disable entire cache system instead of falling back to remaining working caches.

Motivation

This change addresses a critical reliability issue in Nix's cache system that significantly impacts user experience and system efficiency.

Currently, when individual binary caches fail during store initialization due to temporary network issues (DNS failures, connection timeouts, network unreachability), Nix silently removes these caches from the substituters list entirely. This causes several problems:

  1. Unnecessary rebuilds from source: Packages that could be downloaded from working caches are instead rebuilt locally, taking hours instead of minutes for large packages like Firefox
  2. Poor user experience: Users receive no clear indication why their configured caches aren't being used
  3. Reliability issues in unstable networks: Mobile users, corporate environments, and areas with poor connectivity frequently experience this problem
  4. Wasted resources: Excessive CPU usage, bandwidth consumption, and build times when binary substitution should work

This particularly affects users with multiple cache configurations (local caches + remote caches) where one cache being temporarily unavailable shouldn't disable the entire substitution system. The current behavior defeats the purpose of having multiple caches for redundancy.

Context

This change fixes a well-documented issue tracked in #6901 ("Nix fails if my local cache (substituer) is offline. Even when everything is available on the next one: cache.nixos.org"), which has few reactions showing widespread community impact. Related issues include #7127, #3514, #4383, and #2661, all describing similar cache fallback problems.

Implementation Strategy

The solution implements intelligent cache preservation through two key changes:

  1. Smart Error Classification: A new isRecoverableStoreError() function distinguishes between:

    • Recoverable errors (temporary): Network timeouts, DNS failures, connection issues, HTTP 5xx errors, specific curl error codes (6, 7, 28, 56)
    • Non-recoverable errors (permanent): Authentication failures, certificate errors, malformed URLs, permission issues
  2. Cache Preservation Logic: Modified getDefaultSubstituters() in src/libstore/store-registration.cc to preserve caches that fail with recoverable errors during initialization, allowing retry during actual substitution phase.

Alternative Approaches Considered

  1. Exponential backoff during initialization: Would slow down all operations unnecessarily
  2. Background cache health monitoring: Adds complexity and resource overhead
  3. User-configurable "optional" vs "required" caches: Requires configuration changes and doesn't solve the problem by default
  4. Lazy cache initialization: Major architectural change with performance implications

The chosen approach is optimal because it requires no configuration changes, has minimal performance overhead, and provides immediate improvement for existing setups.

Instructions for Reviewers

The diff focuses on:

  1. Core Logic (src/libstore/store-registration.cc):
    • New isRecoverableStoreError() function (lines 45-80)
    • Modified getDefaultSubstituters() cache handling (lines 120-150)
    • Enhanced error messaging for user transparency

The change maintains full backward compatibility - all existing cache configurations continue to work unchanged, but now with improved reliability during temporary failures.


Add 👍 to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

Previously, caches that failed during initialization were silently
dropped from the substituters list, causing unnecessary rebuilds from
source. This change preserves temporarily failed caches and implements
smart error classification to distinguish recoverable errors (network
timeouts, DNS failures) from permanent failures.

Key improvements:
- Add isRecoverableStoreError() for intelligent error classification
- Preserve caches with temporary failures for retry during substitution
- Enhanced user feedback with clear warning messages
- Maintain cache parallelism for working substituters

Fixes scenarios where network issues would disable entire cache system
instead of falling back to remaining working caches.
@github-actions github-actions bot added the with-tests Issues related to testing. PRs with tests have some priority label Sep 12, 2025
std::transform(lower_msg.begin(), lower_msg.end(), lower_msg.begin(), ::tolower);

// Network timeout errors - be more specific to avoid false positives
if (lower_msg.find("connection timeout") != std::string::npos ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is curl not also providing some error codes? This might be a more robust way of checking the root cause of certain errors.

@@ -0,0 +1,161 @@
#include <iostream>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you review the code yourself as well? The whole file seems to be unused.

@@ -0,0 +1,100 @@
#!/bin/bash
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test is not properly integrated int the build system and also wouldn't work in the nix sandbox because it's tries to reach the network. We might something like a NixOS test for this.

Copy link
Member

@edolstra edolstra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scanning the text of exceptions for certain keywords does not strike me as a good approach since it's very ad hoc.

We already have logic for retrying transient errors (which doesn't require parsing error messages), e.g.

// We treat most errors as transient, but won't retry when hopeless
Error err = Transient;
if (httpStatus == 404 || httpStatus == 410 || code == CURLE_FILE_COULDNT_READ_FILE) {
// The file is definitely not there
err = NotFound;
} else if (httpStatus == 401 || httpStatus == 403 || httpStatus == 407) {
// Don't retry on authentication/authorization failures
err = Forbidden;
} else if (httpStatus == 429) {
// 429 means too many requests, so we retry (with a substantially longer delay)
retryTimeMs = RETRY_TIME_MS_TOO_MANY_REQUESTS;
} else if (httpStatus >= 400 && httpStatus < 500 && httpStatus != 408) {
// Most 4xx errors are client errors and are probably not worth retrying:
// * 408 means the server timed out waiting for us, so we try again
err = Misc;
} else if (httpStatus == 501 || httpStatus == 505 || httpStatus == 511) {
// Let's treat most 5xx (server) errors as transient, except for a handful:
// * 501 not implemented
// * 505 http version not supported
// * 511 we're behind a captive portal
err = Misc;
} else {

Can you be more specific about what sort of transient errors you've encountered that are currently not being retried?

@edolstra
Copy link
Member

Note that most of the cited issues (e.g. #6901) are about a different problem: namely, wanting Nix to continue if a binary cache is offline. Presumably, for those users, retrying would make things worse.

@eswar2001
Copy link
Author

Note that most of the cited issues (e.g. #6901) are about a different problem: namely, wanting Nix to continue if a binary cache is offline. Presumably, for those users, retrying would make things worse.

if we have multiple binary caches configured and one of them fail the we are not trying other caches ,we are falling back to build from source with default binary cache (when we pass --fallback)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
with-tests Issues related to testing. PRs with tests have some priority
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants