Open
Labels
enhancement: New feature or request
Description
For broad web crawling we probably do not need any concurrency within a single job, which means we can save a lot of resources and annoy site owners less.
Additionally, I'm considering using this in a so-called "breeder": a dedicated non-concurrent web crawler whose purposes are:
- download and parse robots.txt, while
  - resolving redirects
  - resolving additional DNS requests (if any), as long as they fall within the same addr_key, see Async channel based DNS resolver #14
- HEAD the index page to figure out if there are any redirects (if allowed by robots.txt)
- jobs that resolved all DNS (within our restrictions) and successfully HEAD the index page are considered "breeded"
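The breeding steps above could be sketched roughly as follows. This is a language-neutral illustration written in Python, not the project's implementation: `breed`, `BredJob`, `request`, `resolve`, `addr_key_of`, and `MAX_REDIRECTS` are hypothetical names, how an addr_key is derived is left to the caller, and treating a missing robots.txt as "no rules" is an assumption. Only `urllib.robotparser` and `urllib.parse` are real standard-library modules here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple
from urllib.parse import urljoin, urlsplit
from urllib.robotparser import RobotFileParser

MAX_REDIRECTS = 5  # assumed limit, not taken from the issue
REDIRECTS = (301, 302, 303, 307, 308)

# Hypothetical transport: (method, url) -> (status, Location header or None, body)
Request = Callable[[str, str], Tuple[int, Optional[str], str]]


@dataclass
class BredJob:
    index_url: str               # index URL after following redirects
    dns: Dict[str, List[str]]    # every host resolved while breeding
    robots: RobotFileParser


def breed(job_url: str,
          request: Request,
          resolve: Callable[[str], List[str]],
          addr_key_of: Callable[[List[str]], str],
          agent: str = "*") -> Optional[BredJob]:
    """Run the breeding steps strictly one after another; None means rejected."""
    parts = urlsplit(job_url)
    dns = {parts.hostname: resolve(parts.hostname)}
    key = addr_key_of(dns[parts.hostname])

    def follow(method: str, url: str):
        # Follow redirects one hop at a time, resolving each new host and
        # rejecting the job as soon as a host leaves the job's addr_key.
        for _ in range(MAX_REDIRECTS + 1):
            host = urlsplit(url).hostname
            if host not in dns:
                addrs = resolve(host)
                if addr_key_of(addrs) != key:
                    return None
                dns[host] = addrs
            status, location, body = request(method, url)
            if status in REDIRECTS and location:
                url = urljoin(url, location)
                continue
            return url, status, body
        return None  # too many redirects

    # Step 1: download and parse robots.txt.
    fetched = follow("GET", f"{parts.scheme}://{parts.netloc}/robots.txt")
    if fetched is None:
        return None
    _, status, body = fetched
    robots = RobotFileParser()
    # Assumption: a robots.txt that is not served with 200 imposes no rules.
    robots.parse(body.splitlines() if status == 200 else [])

    # Step 2: HEAD the index page, but only if robots.txt allows it.
    if not robots.can_fetch(agent, job_url):
        return None
    headed = follow("HEAD", job_url)
    if headed is None or headed[1] >= 400:
        return None
    return BredJob(index_url=headed[0], dns=dns, robots=robots)
```

Because every request is issued sequentially inside `follow`, a single job never has more than one request in flight. The sketch does not re-check robots.txt for a host reached through a redirect; whether the breeder should do so is an open design question.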
All jobs extracted from JobQ will be added to a breeder first, and only then to a typical web crawler (if they survive the breeding process) with a StaticDnsResolver. The breeder and the regular web crawler will have quite different rules and settings.
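A minimal sketch of that two-stage flow, again in Python and under stated assumptions: the `StaticDnsResolver` class below only illustrates the idea of a resolver that answers from records collected during breeding and never touches the network, and it is not the project's actual type. `run_pipeline`, `breed`, and `crawl` are hypothetical names, and a plain `deque` stands in for JobQ.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Optional


class StaticDnsResolver:
    """Answers only from records gathered while breeding (illustrative)."""

    def __init__(self, records: Dict[str, List[str]]):
        # Copy the records so later changes by the caller have no effect.
        self._records = {host: list(addrs) for host, addrs in records.items()}

    def resolve(self, host: str) -> List[str]:
        try:
            return self._records[host]
        except KeyError:
            # Unknown hosts are refused instead of being looked up.
            raise LookupError(f"{host} was not resolved during breeding") from None


def run_pipeline(job_queue: Deque[str],
                 breed: Callable[[str], Optional[Dict[str, List[str]]]],
                 crawl: Callable[[str, StaticDnsResolver], object]) -> list:
    """Breed each job first; only survivors reach the regular crawler."""
    results = []
    while job_queue:
        job = job_queue.popleft()
        records = breed(job)      # None means the job did not survive breeding
        if records is None:
            continue
        results.append(crawl(job, StaticDnsResolver(records)))
    return results
```

Passing the breeder's DNS records into the crawler this way means the regular crawl cannot wander to hosts that were never vetted, which is one possible reading of why the two stages would need different rules and settings.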