-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
fix: preserve failed caches for substitution fallback #13965
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Previously, caches that failed during initialization were silently dropped from the substituters list, causing unnecessary rebuilds from source. This change preserves temporarily failed caches and implements smart error classification to distinguish recoverable errors (network timeouts, DNS failures) from permanent failures. Key improvements: - Add isRecoverableStoreError() for intelligent error classification - Preserve caches with temporary failures for retry during substitution - Enhanced user feedback with clear warning messages - Maintain cache parallelism for working substituters Fixes scenarios where network issues would disable entire cache system instead of falling back to remaining working caches.
std::transform(lower_msg.begin(), lower_msg.end(), lower_msg.begin(), ::tolower); | ||
|
||
// Network timeout errors - be more specific to avoid false positives | ||
if (lower_msg.find("connection timeout") != std::string::npos || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is curl not also providing some error codes? This might be a more robust way of checking the root cause of certain errors.
@@ -0,0 +1,161 @@ | |||
#include <iostream> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you review the code yourself as well? The whole file seems to be unused.
@@ -0,0 +1,100 @@ | |||
#!/bin/bash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test is not properly integrated int the build system and also wouldn't work in the nix sandbox because it's tries to reach the network. We might something like a NixOS test for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scanning the text of exceptions for certain keywords does not strike me as a good approach since it's very ad hoc.
We already have logic for retrying transient errors (which doesn't require parsing error messages), e.g.
nix/src/libstore/filetransfer.cc
Lines 483 to 505 in 377b60e
// We treat most errors as transient, but won't retry when hopeless | |
Error err = Transient; | |
if (httpStatus == 404 || httpStatus == 410 || code == CURLE_FILE_COULDNT_READ_FILE) { | |
// The file is definitely not there | |
err = NotFound; | |
} else if (httpStatus == 401 || httpStatus == 403 || httpStatus == 407) { | |
// Don't retry on authentication/authorization failures | |
err = Forbidden; | |
} else if (httpStatus == 429) { | |
// 429 means too many requests, so we retry (with a substantially longer delay) | |
retryTimeMs = RETRY_TIME_MS_TOO_MANY_REQUESTS; | |
} else if (httpStatus >= 400 && httpStatus < 500 && httpStatus != 408) { | |
// Most 4xx errors are client errors and are probably not worth retrying: | |
// * 408 means the server timed out waiting for us, so we try again | |
err = Misc; | |
} else if (httpStatus == 501 || httpStatus == 505 || httpStatus == 511) { | |
// Let's treat most 5xx (server) errors as transient, except for a handful: | |
// * 501 not implemented | |
// * 505 http version not supported | |
// * 511 we're behind a captive portal | |
err = Misc; | |
} else { |
Can you be more specific about what sort of transient errors you've encountered that are currently not being retried?
Note that most of the cited issues (e.g. #6901) are about a different problem: namely, wanting Nix to continue if a binary cache is offline. Presumably, for those users, retrying would make things worse. |
if we have multiple binary caches configured and one of them fail the we are not trying other caches ,we are falling back to build from source with default binary cache (when we pass --fallback) |
Previously, caches that failed during initialization were silently dropped from the substituters list, causing unnecessary rebuilds from source. This change preserves temporarily failed caches and implements smart error classification to distinguish recoverable errors (network timeouts, DNS failures) from permanent failures.
Key improvements:
Fixes scenarios where network issues would disable entire cache system instead of falling back to remaining working caches.
Motivation
This change addresses a critical reliability issue in Nix's cache system that significantly impacts user experience and system efficiency.
Currently, when individual binary caches fail during store initialization due to temporary network issues (DNS failures, connection timeouts, network unreachability), Nix silently removes these caches from the substituters list entirely. This causes several problems:
This particularly affects users with multiple cache configurations (local caches + remote caches) where one cache being temporarily unavailable shouldn't disable the entire substitution system. The current behavior defeats the purpose of having multiple caches for redundancy.
Context
This change fixes a well-documented issue tracked in #6901 ("Nix fails if my local cache (substituer) is offline. Even when everything is available on the next one: cache.nixos.org"), which has few reactions showing widespread community impact. Related issues include #7127, #3514, #4383, and #2661, all describing similar cache fallback problems.
Implementation Strategy
The solution implements intelligent cache preservation through two key changes:
Smart Error Classification: A new
isRecoverableStoreError()
function distinguishes between:Cache Preservation Logic: Modified
getDefaultSubstituters()
insrc/libstore/store-registration.cc
to preserve caches that fail with recoverable errors during initialization, allowing retry during actual substitution phase.Alternative Approaches Considered
The chosen approach is optimal because it requires no configuration changes, has minimal performance overhead, and provides immediate improvement for existing setups.
Instructions for Reviewers
The diff focuses on:
src/libstore/store-registration.cc
):isRecoverableStoreError()
function (lines 45-80)getDefaultSubstituters()
cache handling (lines 120-150)The change maintains full backward compatibility - all existing cache configurations continue to work unchanged, but now with improved reliability during temporary failures.
Add 👍 to pull requests you find important.
The Nix maintainer team uses a GitHub project board to schedule and track reviews.