fix: preserve failed caches for substitution fallback #13965

eswar2001 · 2025-09-12T02:38:37Z

Previously, caches that failed during initialization were silently dropped from the substituters list, causing unnecessary rebuilds from source. This change preserves temporarily failed caches and implements smart error classification to distinguish recoverable errors (network timeouts, DNS failures) from permanent failures.

Key improvements:

Add isRecoverableStoreError() for intelligent error classification
Preserve caches with temporary failures for retry during substitution
Enhanced user feedback with clear warning messages
Maintain cache parallelism for working substituters

Fixes scenarios where network issues would disable entire cache system instead of falling back to remaining working caches.

Motivation

This change addresses a critical reliability issue in Nix's cache system that significantly impacts user experience and system efficiency.

Currently, when individual binary caches fail during store initialization due to temporary network issues (DNS failures, connection timeouts, network unreachability), Nix silently removes these caches from the substituters list entirely. This causes several problems:

Unnecessary rebuilds from source: Packages that could be downloaded from working caches are instead rebuilt locally, taking hours instead of minutes for large packages like Firefox
Poor user experience: Users receive no clear indication why their configured caches aren't being used
Reliability issues in unstable networks: Mobile users, corporate environments, and areas with poor connectivity frequently experience this problem
Wasted resources: Excessive CPU usage, bandwidth consumption, and build times when binary substitution should work

This particularly affects users with multiple cache configurations (local caches + remote caches) where one cache being temporarily unavailable shouldn't disable the entire substitution system. The current behavior defeats the purpose of having multiple caches for redundancy.

Context

This change fixes a well-documented issue tracked in #6901 ("Nix fails if my local cache (substituer) is offline. Even when everything is available on the next one: cache.nixos.org"), which has few reactions showing widespread community impact. Related issues include #7127, #3514, #4383, and #2661, all describing similar cache fallback problems.

Implementation Strategy

The solution implements intelligent cache preservation through two key changes:

Smart Error Classification: A new isRecoverableStoreError() function distinguishes between:
- Recoverable errors (temporary): Network timeouts, DNS failures, connection issues, HTTP 5xx errors, specific curl error codes (6, 7, 28, 56)
- Non-recoverable errors (permanent): Authentication failures, certificate errors, malformed URLs, permission issues
Cache Preservation Logic: Modified getDefaultSubstituters() in src/libstore/store-registration.cc to preserve caches that fail with recoverable errors during initialization, allowing retry during actual substitution phase.

Alternative Approaches Considered

Exponential backoff during initialization: Would slow down all operations unnecessarily
Background cache health monitoring: Adds complexity and resource overhead
User-configurable "optional" vs "required" caches: Requires configuration changes and doesn't solve the problem by default
Lazy cache initialization: Major architectural change with performance implications

The chosen approach is optimal because it requires no configuration changes, has minimal performance overhead, and provides immediate improvement for existing setups.

Instructions for Reviewers

The diff focuses on:

Core Logic (src/libstore/store-registration.cc):
- New isRecoverableStoreError() function (lines 45-80)
- Modified getDefaultSubstituters() cache handling (lines 120-150)
- Enhanced error messaging for user transparency

The change maintains full backward compatibility - all existing cache configurations continue to work unchanged, but now with improved reliability during temporary failures.

Add 👍 to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

Previously, caches that failed during initialization were silently dropped from the substituters list, causing unnecessary rebuilds from source. This change preserves temporarily failed caches and implements smart error classification to distinguish recoverable errors (network timeouts, DNS failures) from permanent failures. Key improvements: - Add isRecoverableStoreError() for intelligent error classification - Preserve caches with temporary failures for retry during substitution - Enhanced user feedback with clear warning messages - Maintain cache parallelism for working substituters Fixes scenarios where network issues would disable entire cache system instead of falling back to remaining working caches.

Mic92 · 2025-09-12T06:17:16Z

src/libstore-tests/cache-fallback-error-classification.cc

+    std::transform(lower_msg.begin(), lower_msg.end(), lower_msg.begin(), ::tolower);
+
+    // Network timeout errors - be more specific to avoid false positives
+    if (lower_msg.find("connection timeout") != std::string::npos ||


Is curl not also providing some error codes? This might be a more robust way of checking the root cause of certain errors.

Mic92 · 2025-09-12T06:18:55Z

src/libstore-tests/cache-fallback-error-classification.cc

@@ -0,0 +1,161 @@
+#include <iostream>


Can you review the code yourself as well? The whole file seems to be unused.

Mic92 · 2025-09-12T06:21:45Z

tests/functional/cache-fallback.sh

@@ -0,0 +1,100 @@
+#!/bin/bash


this test is not properly integrated int the build system and also wouldn't work in the nix sandbox because it's tries to reach the network. We might something like a NixOS test for this.

edolstra

Scanning the text of exceptions for certain keywords does not strike me as a good approach since it's very ad hoc.

We already have logic for retrying transient errors (which doesn't require parsing error messages), e.g.

nix/src/libstore/filetransfer.cc

Lines 483 to 505 in 377b60e

    
           // We treat most errors as transient, but won't retry when hopeless 
        
           Error err = Transient; 
        
           if (httpStatus == 404 || httpStatus == 410 || code == CURLE_FILE_COULDNT_READ_FILE) { 
        
               // The file is definitely not there 
        
               err = NotFound; 
        
           } else if (httpStatus == 401 || httpStatus == 403 || httpStatus == 407) { 
        
               // Don't retry on authentication/authorization failures 
        
               err = Forbidden; 
        
           } else if (httpStatus == 429) { 
        
               // 429 means too many requests, so we retry (with a substantially longer delay) 
        
               retryTimeMs = RETRY_TIME_MS_TOO_MANY_REQUESTS; 
        
           } else if (httpStatus >= 400 && httpStatus < 500 && httpStatus != 408) { 
        
               // Most 4xx errors are client errors and are probably not worth retrying: 
        
               //   * 408 means the server timed out waiting for us, so we try again 
        
               err = Misc; 
        
           } else if (httpStatus == 501 || httpStatus == 505 || httpStatus == 511) { 
        
               // Let's treat most 5xx (server) errors as transient, except for a handful: 
        
               //   * 501 not implemented 
        
               //   * 505 http version not supported 
        
               //   * 511 we're behind a captive portal 
        
               err = Misc; 
        
           } else {

Can you be more specific about what sort of transient errors you've encountered that are currently not being retried?

edolstra · 2025-09-12T12:40:29Z

Note that most of the cited issues (e.g. #6901) are about a different problem: namely, wanting Nix to continue if a binary cache is offline. Presumably, for those users, retrying would make things worse.

eswar2001 · 2025-09-12T13:19:12Z

Note that most of the cited issues (e.g. #6901) are about a different problem: namely, wanting Nix to continue if a binary cache is offline. Presumably, for those users, retrying would make things worse.

if we have multiple binary caches configured and one of them fail the we are not trying other caches ,we are falling back to build from source with default binary cache (when we pass --fallback)

eswar2001 requested review from Ericson2314 and edolstra as code owners September 12, 2025 02:38

github-actions bot added the with-tests Issues related to testing. PRs with tests have some priority label Sep 12, 2025

Mic92 reviewed Sep 12, 2025

View reviewed changes

edolstra requested changes Sep 12, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

fix: preserve failed caches for substitution fallback #13965

fix: preserve failed caches for substitution fallback #13965

eswar2001 commented Sep 12, 2025

Uh oh!

Mic92 Sep 12, 2025

Uh oh!

Mic92 Sep 12, 2025

Uh oh!

Mic92 Sep 12, 2025

Uh oh!

edolstra left a comment

Uh oh!

edolstra commented Sep 12, 2025

Uh oh!

eswar2001 commented Sep 12, 2025

Uh oh!

Uh oh!

	// We treat most errors as transient, but won't retry when hopeless
	Error err = Transient;

	if (httpStatus == 404 \|\| httpStatus == 410 \|\| code == CURLE_FILE_COULDNT_READ_FILE) {
	// The file is definitely not there
	err = NotFound;
	} else if (httpStatus == 401 \|\| httpStatus == 403 \|\| httpStatus == 407) {
	// Don't retry on authentication/authorization failures
	err = Forbidden;
	} else if (httpStatus == 429) {
	// 429 means too many requests, so we retry (with a substantially longer delay)
	retryTimeMs = RETRY_TIME_MS_TOO_MANY_REQUESTS;
	} else if (httpStatus >= 400 && httpStatus < 500 && httpStatus != 408) {
	// Most 4xx errors are client errors and are probably not worth retrying:
	// * 408 means the server timed out waiting for us, so we try again
	err = Misc;
	} else if (httpStatus == 501 \|\| httpStatus == 505 \|\| httpStatus == 511) {
	// Let's treat most 5xx (server) errors as transient, except for a handful:
	// * 501 not implemented
	// * 505 http version not supported
	// * 511 we're behind a captive portal
	err = Misc;
	} else {

Uh oh!

fix: preserve failed caches for substitution fallback #13965

Are you sure you want to change the base?

fix: preserve failed caches for substitution fallback #13965

Conversation

eswar2001 commented Sep 12, 2025

Motivation

Context

Implementation Strategy

Alternative Approaches Considered

Instructions for Reviewers

Uh oh!

Mic92 Sep 12, 2025

Choose a reason for hiding this comment

Uh oh!

Mic92 Sep 12, 2025

Choose a reason for hiding this comment

Uh oh!

Mic92 Sep 12, 2025

Choose a reason for hiding this comment

Uh oh!

edolstra left a comment

Choose a reason for hiding this comment

Uh oh!

edolstra commented Sep 12, 2025

Uh oh!

eswar2001 commented Sep 12, 2025

Uh oh!

Uh oh!