Skip to content

Conversation

@katrinafyi
Copy link
Contributor

@katrinafyi katrinafyi commented Oct 4, 2025

previously, errors which happened while collecting links (e.g., due to invalid base join) would print a terse message to the console and then they would be forgotten. this looked like:

[WARN] Error creating request: InvalidPathToUri("/relative-link/")
🔍 0 Total (in 0s) ✅ 0 OK 🚫 0 Errors

this is unhelpful because it only shows the internal enum name and it's not reported in the final error count, so it's very easy to miss.

now, these "early" errors get propagated into the final error list and displayed alongside the HTTP failures.

[a.md]:
   [ERROR] error:// | Error building URL for "/relative-link/" (Attribute: Some("href")): Cannot convert path '/relative-link/' to a URI: To resolve relative links in local files, provide a root dir

🔍 1 Total (in 0s) ✅ 0 OK 🚫 1 Error

this makes them more obvious and lets us make use of the existing functions for displaying error details and suggested fixes.

this is implemented by changing the request-constructing functions to return a Result<_, RequestError> rather than Result<_, ErrorKind>. RequestError is a new error type which contains information about the URL or input source which caused the error. it also contains the underlying ErrorKind.

/// An error which occurs while trying to construct a [`Request`] object.
/// That is, an error which happens while trying to load links from an input
/// source.
#[derive(Error, Debug, PartialEq, Eq, Hash)]
pub enum RequestError {
/// Unable to construct a URL for a link appearing within the given source.
#[error("Error building URL for {0}: {2}")]
CreateRequestItem(RawUri, ResolvedInputSource, #[source] Box<ErrorKind>),
/// Unable to load the content of an input source.
#[error("Error reading input '{0}': {1}")]
GetInputContent(InputSource, #[source] Box<ErrorKind>),
}

i think it's nice to have it as a new error type because it makes it clear to the user that they only have to handle these two errors, and it avoids making the ErrorKind type recursive.

this makes it easier to handle the returned error. previously, if an Err did occur in the collected request stream, this would lead to an unexpected early exit and a tokio error. so, i think that changing this to Result<_, RequestError> is no great burden because lychee-bin did not have very good handling of the old ErrorKind.

some more examples of the new error messages:

$ cargo run -- https://google.com 'noasdjfi/[' non-existing-fjdaifdsjai.com no-perms https://example.com 
[http://non-existing-fjdaifdsjai.com/]:
   [ERROR] error:// | Error reading input 'http://non-existing-fjdaifdsjai.com/': Network error: Connection failed. Check network connectivity and firewall settings

[https://google.com/]:
     [404] https://google.com/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png | Rejected status code (this depends on your "accept" configuration): Not Found

[no-perms]:
   [ERROR] error:// | Error reading input 'no-perms': Cannot traverse input directory: no-perms: IO error for operation on no-perms: Permission denied (os error 13): Directory traversal failed: no-perms: IO error for operation on no-perms: Permission denied (os error 13). Check directory permissions

[noasdjfi/[]:
   [ERROR] error:// | Error reading input 'noasdjfi/[': UNIX glob pattern is invalid: Invalid glob pattern: Pattern syntax error near position 9: invalid range pattern. Check pattern syntax

🔍 33 Total (in 4s) ✅ 5 OK 🚫 4 Errors 🔀 24 Redirects

compared to before (below). also note that the old behaviour is nondeterministic. any one of the errors could be the one which is printed, and it aborts the program with no other output.

$ cargo run -- https://google.com 'noasdjfi/[' non-existing-fjdaifdsjai.com no-perms 
thread 'tokio-runtime-worker' panicked at lychee-bin/src/commands/check.rs:256:18:
cannot send response to queue: SendError { .. }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Error: UNIX glob pattern is invalid

Caused by:
    Pattern syntax error near position 9: invalid range pattern

this implements step (2) of the plan here #1624 (comment)

outdated commentary. i changed it to a new RequestError type now

however, i don't like the way this is implemented. it has to "smuggle" the early errors through a new error case,

ErrorKind::CreateRequestItem(RawUri, ResolvedInputSource, Box<ErrorKind>)

and all of the commands have to deliberately handle this case by bypassing the check logic and directly constructing a failed Response. it would be very easy for a user to forget to do this. originally, i wanted to make this new error separate from the usual ErrorKind, but the lychee_lib::Result<T> type with ErrorKind is so pervasive and it would've needed extensive changes.

maybe, instead, i could embed this into the Request type by making its Uri field into a Result<Uri, CreatRequestError> but that seems not great too.

another downside of the current approach is it uses the fake error:// URL to display these messages. this is because the ResponseBody needs a "valid" Uri.

TODO:

  • tests. fix old and add new.
  • move check handling into handle function

also, along the way, this changes the ErrorKind::details cases for invalid path and invalid base join. these details previously just repeated the error message and would look like "Cannot convert path '/../../' to a URI: Cannot convert path to URI: '/../../'. Check path format".

related to #1265

@katrinafyi katrinafyi changed the title feat: propagate URL parsing/resolving errors to top level feat: propagate URL parsing/joining errors to top level Oct 4, 2025
@katrinafyi
Copy link
Contributor Author

I see that this PR contradicts an existing test case which says the current behaviour is intentional.

lychee/lychee-bin/tests/cli.rs

Lines 1458 to 1462 in 54e425c

/// If `base-dir` is not set, don't throw an error in case we encounter
/// an absolute local link (e.g. `/about`) within a file.
/// Instead, simply ignore the link.
#[test]
fn test_ignore_absolute_local_links_without_base() -> Result<()> {

Maybe there should be an option about whether to treat absolute links as an error when root-dir is not specified? At the moment, it would not be possible to get the old "ignoring" behaviour after this PR. Or, we just change the test

@katrinafyi katrinafyi changed the title feat: propagate URL parsing/joining errors to top level feat: propagate input loading/collecting errors to top level Oct 7, 2025
katrinafyi and others added 2 commits October 7, 2025 16:53
this shouldn't be needed anymore since we no longer use a recursive case in ErrorKind. instead, we use the new RequestError type.
@thomas-zahner
Copy link
Member

Thanks for the PR and thanks for the extensive description and thoughts. I like it and this should improve UX quite a bit.

Since we now pass errors through the Response struct we might consider making Uri an Option<Uri> so we could get rid of this sentinel value error://. But I'm not sure if it's really worth it.

Then it would also be great to add a CLI test which covers the example you provided in this PR cargo run -- https://google.com 'noasdjfi/[' non-existing-fjdaifdsjai.com no-perms https://example.com. To assert that lychee reports the errors as expected etc.

@katrinafyi
Copy link
Contributor Author

I can try to change Response and report back on how it goes. I should add a CLI test as well.

Did you have thoughts about whether absolute links should be flagged as errors in the absence of root-dir/base-url? I noticed in the feature table, there is a tick for the "Skip relative URLs" feature in Lychee. Making root-relative URLs into failures by default would also conflict with this, in addition to the unit test I mentioned earlier.

@thomas-zahner
Copy link
Member

Regarding the handling of absolute URLs I think it probably make sense to flag them as errors, as you proposed. We are now at quite a different stage compared to when Skip relative URLs was added to the feature table. This Skip relative URLs "feature" in the table predates the root-dir feature by many years so it makes sense to reconsider the behaviour. I think an error is much clearer than a warning that users easily glance over and I personally prefer that behaviour.

Maybe there should be an option about whether to treat absolute links as an error when root-dir is not specified? At the moment, it would not be possible to get the old "ignoring" behaviour after this PR. Or, we just change the test

So for me it makes sense to just change that behaviour, update/remove the test and update/remove the "feature" in the README.

@thomas-zahner
Copy link
Member

@mre and I had a quick chat about this. We realised that this PR improves the current situation of panicking and crashing (among improving other things like skipping over relative URLs) but also that this new behaviour is not ideal. In our opinion erroneous user inputs should not be handled the same way as link check errors. For example take a look at the exit codes and how we differentiate between user errors and link check errors. We think user provided erroneous arguments should fail fast and make lychee exit before link checking.

Ideally we handle errors properly (which happens in this PR but not on master) while exiting early upon encountering user errors. The output could look something like this:

$ lychee https://google.com 'noasdjfi/[' non-existing-fjdaifdsjai.com no-perms https://example.com 

Problems reading these inputs:
- 'http://non-existing-fjdaifdsjai.com/': Network error: Connection failed. Check network connectivity and firewall settings
- 'no-perms': Cannot traverse input directory: no-perms: IO error for operation on no-perms: Permission denied (os error 13)
- 'noasdjfi/[': UNIX glob pattern is invalid: Pattern syntax error near position 9: invalid range pattern. Check pattern syntax

Please check for typos and try again.

So, could you try to keep user errors from making the program fail early instead of moving and converting user errors into link check errors?

@katrinafyi
Copy link
Contributor Author

katrinafyi commented Oct 9, 2025

It's an interesting perspective, and thanks for discussing it and giving your thoughts. I agree it could do with more thinking.

With this PR, however, I think the errors that arise within an input file when resolving a URL are reasonably flagged as link checking errors. I'm interested in getting this part of the feature merged sooner, so would it be okay to revert the input reading errors to the old behaviour and merge the "Error building URL" errors on their own? That said, the changes to the Result type are such that the input reading errors would still have to be passed through the new RequestError::GetInputContent. So, I would make it behave the same on the outside while going through different logic on the inside.

About the erroneous inputs case, I think there is a scale. While some inputs are obviously wrong, like an invalid glob syntax, I think that the HTTP and filesystem errors are more nuanced. Unlike an invalid glob, these errors are affected by external state and this has some ramifications.

  • For one, they take time and resources to check - this may noticeably delay starting up Lychee.
  • For two, this means that, even if the inputs are checked preemptively, you must always account for the possibility that the error state happens after the check (for example, a laptop might lose wifi) (this is a TOCTOU-style problem). Lychee must still handle errors sensibly in these situations.
  • As another nuance, directories can obviously contain other directories. Since using a directory input walks recursively, it's possibly for a deep subdirectory to have no permissions. Would an early check recurse through all subdirectories?
  • Also, there is the possibility that a Lychee library user does not go through the usual CLI parsing and validating logic.

All this is to say, regardless of whether the inputs are validated beforehand, the link checking phase will need some way to sensibly handle these transient errors. I think this PR achieves that, and that it could complement - rather than replace - any extra early input checking which you want to add. (The glob syntax errors, however, should be handled through stricter checking in the InputSource constructor).

To summarise, I want URL building errors the most. For the input loading errors, I believe the current implementation doesn't conflict with your plans, but I'm also happy to scale it back. About implementing the early checks, I think that is best done in another PR (and I'm not the best person to do it, because I don't think special early checking is super beneficial (aside from the glob syntax)).

Sorry for the long text, let me know what you want :))

katrinafyi added a commit to rina-forks/lychee that referenced this pull request Oct 9, 2025
now, when providing an input with glob characters but an invalid glob
pattern, this will cause a "cannot parse inputs" error rather than
erroring out later during link checking. this makes use of the
existing ErrorKind::InvalidGlobPattern error case.

for example, a glob with unbalanced brackets is invalid and now looks
like this:
```console
$ cargo run -- 'noasdjfi/['
Error: Cannot parse inputs from arguments

Caused by:
    0: UNIX glob pattern is invalid
    1: Pattern syntax error near position 9: invalid range pattern
```

i've also added some TODO notes which would make the logic neater.
i didn't do those refactors in this PR because i am mindful of another
currently-open (and much bigger) PR which touches these lines
(https://www.github.com/lycheeverse/lychee/pull/1837)

i wonder if the reason FsGlob stores strings rather than a Pattern is
for flexibility to switch to different glob packages.

this PR was prompted by findings in
lycheeverse#1864 (comment)
@thomas-zahner
Copy link
Member

thomas-zahner commented Oct 10, 2025

Thanks for the thoughtful comment.

With this PR, however, I think the errors that arise within an input file when resolving a URL are reasonably flagged as link checking errors. I'm interested in getting this part of the feature merged sooner, so would it be okay to revert the input reading errors to the old behaviour and merge the "Error building URL" errors on their own?

Yes, splitting this PR up into two separate PRs would be great, if that's possible. Errors that arise within an input file should definitely be flagged as link check errors, as indicated by my previous comment. So it's just the direct user input which we are discussing about and as this might be getting somewhat complicated this would deserve a separate issue and PR. This would allow us to quickly merge the part you are most interested in and then we could continue on discussions with user input handling where you didn't need to participate when you mention you aren't as interested in that area.

About the erroneous inputs case, I think there is a scale. While some inputs are obviously wrong, like an invalid glob syntax, I think that the HTTP and filesystem errors are more nuanced. Unlike an invalid glob, these errors are affected by external state and this has some ramifications.

Thanks, that makes sense. Maybe I was focusing too much on "early exit" in my previous comment. It's true that some errors are impossible to catch "early". However, it is still possible to differentiate between user errors and link-check errors, which makes sense from a UX perspective. For example running lychee https://invalid.example.com should result in a clear user error with exit code 1 because the user supplied an invalid argument to lychee. (note that this is not the same as echo https://invalid.example.com | lychee -) But this PR moves/converts the error into a link check error with exit code 2 which means that the currently documented exit codes are no longer accurate. We would prefer to keep the separation between the two different types of errors with their respective exit codes.

I realise that the output example of my previous comment might not be worth the effort. It would be much simpler to exit on the first user error, as is the current behaviour, rather than listing all user errors.

In any case it would make sense to split up this PR if possible and potentially create a new issue.

@katrinafyi
Copy link
Contributor Author

katrinafyi commented Oct 10, 2025

For example running lychee https://invalid.example.com should result in a clear user error with exit code 1 because the user supplied an invalid argument to lychee.

Right, yeah, I see. The old code would've exited with 1 because rust uses 1 for an Err returned from the main function. I think the error handling just looked so noisy that I hadn't considered that some aspects of it could be desired.

I'll see what I can do. It'll need some way to distinguish input input errors from discovered input errors - maybe an extra case in RequestError.

@thomas-zahner thomas-zahner force-pushed the master branch 2 times, most recently from fcdf77c to e0912ab Compare October 21, 2025 12:53
…rrors

Conflicts:
lychee-lib/src/utils/request.rs
should this just be a boolean in GetInputContent case?
`make lint` fails

error[E0658]: `let` expressions in this position are unstable
   --> /home/x/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/wiremock-0.6.5/src/matchers.rs:214:12
    |
214 |         if let Ok(url) = Url::parse(&path)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
@katrinafyi katrinafyi force-pushed the propagate-early-errors branch from 2cb4afd to e22f13c Compare October 24, 2025 05:25
adjusted existing test case
`test_ignore_absolute_local_links_without_base`
should it be replaced with something else instead?
@katrinafyi
Copy link
Contributor Author

Bloop! I think the review comments should be addressed now. I've brought back the early exit with code 1 for failing to load CLI inputs, and I've updated the tests and readme to match the new error behaviour for InvalidBaseJoin.

In the implementation, the early user errors should behave the same as before, but the path they go through is changed. Now, they are packaged into a RequestError::UserInputContent and they get sent to the request channel task before being sent back to the main thread as an error.

Some other misc notes:

  • the Response inputsource is changed from ResolvedInputSource to InputSource, slightly reverting refactor!: use ResolvedInputSource downstream of InputContent #1840. this is because Response has to represent both real HTTP responses and fake "responses" which arise from failed input resolution. this is kind of undesirable
  • similar to above, i find it's too hard to change ResponseBody's uri field to an Option, just because of how often it's used. so, we have to use the fake error:// URL to represent URL resolving failures

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants