[Feature Request]: requeue url if it comes from a different context (parent_url) #1544
pivuong started this conversation in Feature requests
What needs to be done?
I appreciate the request in #843, where users do not want to re-crawl a URL that has already been crawled, both for efficiency and to avoid loops.
In my case, I am crawling websites whose structure varies widely, so I first crawl an entire website and then filter down to the URLs of interest based on the parent URL, rather than using a filter chain upfront. I cannot use a filter chain upfront because I don't know how far down the search tree I need to go before I find the parent_url pattern of interest.
Since URLs already added to the visited set aren’t re-crawled, even when encountered under a new parent, I miss some URLs in my post-filtering process.
Feature request: add an option to re-crawl a URL when it is reached from a different parent URL, or add the ability to use a ParentFilterChain that filters by parent_url.
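To make the second option concrete, here is a rough sketch of what a parent-based filter could look like. It is plain Python and does not use any crawl4ai API: `ParentPatternFilter` and its `apply(url, parent_url)` method are hypothetical names I made up for illustration, and the glob-style matching is just one possible choice.

```python
from fnmatch import fnmatchcase
from typing import Optional, Sequence


class ParentPatternFilter:
    """Hypothetical filter: accept a URL based on the page that linked to it."""

    def __init__(self, patterns: Sequence[str]) -> None:
        # Glob-style patterns matched against the parent URL, not the URL itself.
        self.patterns = list(patterns)

    def apply(self, url: str, parent_url: Optional[str]) -> bool:
        # Seed URLs have no parent, so they never match a parent pattern.
        if parent_url is None:
            return False
        return any(fnmatchcase(parent_url, p) for p in self.patterns)


catalog_only = ParentPatternFilter(["*/catalog/*"])
keep = catalog_only.apply(
    "https://example.com/item/42", "https://example.com/catalog/page1"
)
skip = catalog_only.apply(
    "https://example.com/item/42", "https://example.com/about"
)
```

The same URL is kept or skipped depending on which parent it was discovered under, which is the behaviour I am after.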
What problem does this solve?
Allows filtering by parent_url post hoc when the same URL can be reached through multiple parents.
Target users/beneficiaries
No response
Current alternatives/workarounds
I could try to capture parent-child relationships on my own, but this is cumbersome when crawl4ai should already have this information.
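For reference, the manual workaround would look roughly like the sketch below. It assumes I have already collected (parent, child) link pairs myself during the crawl; the helper names `build_parent_index` and `urls_with_parent_matching` are my own and not part of crawl4ai.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_parent_index(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each child URL to the set of parent URLs that linked to it."""
    parents: Dict[str, Set[str]] = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
    return parents


def urls_with_parent_matching(parents: Dict[str, Set[str]], pattern: str) -> List[str]:
    """Return the URLs that have at least one parent matching the regex."""
    rx = re.compile(pattern)
    return sorted(
        url for url, ps in parents.items() if any(rx.search(p) for p in ps)
    )


# Hypothetical link pairs: /item/42 is reachable from two different parents.
edges = [
    ("https://example.com/", "https://example.com/about"),
    ("https://example.com/", "https://example.com/catalog/page1"),
    ("https://example.com/about", "https://example.com/item/42"),
    ("https://example.com/catalog/page1", "https://example.com/item/42"),
]
index = build_parent_index(edges)
matches = urls_with_parent_matching(index, r"/catalog/")
```

This only works if every link is recorded, including links to URLs that were already visited. That second edge is exactly the one that gets lost when the crawler skips a known URL.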
Proposed approach
Capture both the visited_url and its parent in the visited set, then add an argument that controls whether the visited check uses the visited_url alone or the (visited_url, parent) combination.
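A minimal sketch of the idea, independent of how crawl4ai actually stores its visited set (which I have not checked in detail). The class name `VisitedSet` and the flag `key_by_parent` are hypothetical:

```python
from typing import Optional, Set, Tuple


class VisitedSet:
    """Hypothetical visited set that can key on the URL alone or on (URL, parent)."""

    def __init__(self, key_by_parent: bool = False) -> None:
        self.key_by_parent = key_by_parent
        self._seen: Set[Tuple[str, Optional[str]]] = set()

    def should_visit(self, url: str, parent: Optional[str] = None) -> bool:
        # With key_by_parent off, the parent is ignored (today's behaviour).
        key = (url, parent if self.key_by_parent else None)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


root = "https://example.com/"
a = "https://example.com/a"
b = "https://example.com/b"

default = VisitedSet()
first = default.should_visit(a, root)        # new URL
second = default.should_visit(a, b)          # same URL, different parent: skipped

per_parent = VisitedSet(key_by_parent=True)
first_pp = per_parent.should_visit(a, root)  # new (URL, parent) pair
second_pp = per_parent.should_visit(a, b)    # same URL, new parent: visited again
third_pp = per_parent.should_visit(a, b)     # same pair again: skipped
```

Loops still terminate with the pair key because each (URL, parent) pair is visited at most once, although the number of fetches can grow considerably on densely linked sites.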