
Implement non-concurrent crawler #12

@let4be


For broad web crawling we probably do not need any concurrency within a single job, which means we can save a fair amount of resources and annoy site owners less.
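For illustration, here is a minimal sketch of what "no concurrency within a single job" could mean: the job awaits each fetch before taking the next URL, so it never holds more than one connection to a site. All types and names below are hypothetical, not the crate's actual API.

```rust
use std::collections::VecDeque;

/// Hypothetical, simplified job state - a stand-in for the crawler's real types.
struct Job {
    frontier: VecDeque<String>,
}

impl Job {
    /// Stubbed fetch; a real implementation would use the crawler's HTTP client.
    async fn fetch(&self, url: &str) -> Result<String, std::io::Error> {
        Ok(format!("<body of {url}>"))
    }
}

/// Non-concurrent job loop: exactly one request in flight at any time.
async fn run_job_sequential(mut job: Job) -> Result<(), std::io::Error> {
    while let Some(url) = job.frontier.pop_front() {
        // Await each fetch before dequeuing the next URL - no per-job concurrency,
        // so the target site only ever sees one request from us at a time.
        let _body = job.fetch(&url).await?;
    }
    Ok(())
}
```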
Additionally, I'm considering using this approach in a so-called "breeder" - a dedicated non-concurrent web crawler whose purpose is to:

  • download && parse robots.txt, while:
      • resolving redirects
      • resolving additional DNS requests (if any), as long as they fall within the same addr_key - see Async channel based DNS resolver #14
  • HEAD the index page to figure out whether there are any redirects (if allowed by robots.txt)
  • jobs that resolved all DNS (within our restrictions) and successfully HEADed the index page are considered "breeded" (see the sketch after this list)
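A rough sketch of what one breeding pass could look like, assuming the steps above run strictly one after another. `fetch_robots`, `resolve_within_addr_key` and `head_index` are hypothetical placeholders for the real robots.txt fetcher, the channel-based DNS resolver from #14, and the HTTP client:

```rust
use std::net::IpAddr;

/// Hypothetical, simplified stand-ins for the crawler's real types.
struct BreederJob {
    domain: String,
    addr_key: String,
}

struct RobotsTxt {
    allow_index: bool,
}

enum BreedOutcome {
    /// robots.txt parsed, DNS resolved within addr_key, index HEAD succeeded.
    Breeded { robots: RobotsTxt, addrs: Vec<IpAddr> },
    /// Job is dropped before it ever reaches the regular crawler.
    Rejected(String),
}

// The three breeding steps, stubbed out; real implementations would use the
// crawler's HTTP client and the channel-based DNS resolver from #14.
async fn fetch_robots(_domain: &str) -> Result<RobotsTxt, String> {
    Ok(RobotsTxt { allow_index: true })
}
async fn resolve_within_addr_key(job: &BreederJob) -> Result<Vec<IpAddr>, String> {
    // Real impl: resolve any extra names and keep only those matching job.addr_key.
    let _ = &job.addr_key;
    Ok(Vec::new())
}
async fn head_index(_domain: &str) -> Result<(), String> {
    Ok(())
}

/// One sequential breeding pass per job - no concurrency inside the job.
async fn breed(job: &BreederJob) -> BreedOutcome {
    // 1. Download and parse robots.txt, resolving redirects along the way.
    let robots = match fetch_robots(&job.domain).await {
        Ok(r) => r,
        Err(e) => return BreedOutcome::Rejected(format!("robots.txt: {e}")),
    };
    // 2. Resolve any additional DNS requests, but only within the same addr_key.
    let addrs = match resolve_within_addr_key(job).await {
        Ok(a) => a,
        Err(e) => return BreedOutcome::Rejected(format!("dns: {e}")),
    };
    // 3. HEAD the index page (only if robots.txt allows it) to surface redirects.
    if robots.allow_index {
        if let Err(e) = head_index(&job.domain).await {
            return BreedOutcome::Rejected(format!("index HEAD: {e}"));
        }
    }
    BreedOutcome::Breeded { robots, addrs }
}
```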

All jobs extracted from the JobQ will be added to a breeder first, and only then (if they survive the breeding process) to a typical web crawler with a StaticDnsResolver (the breeder and the regular web crawler will have quite different rules and settings).
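The wiring could then look roughly like the sketch below, reusing the `breed` / `BreederJob` types from the previous sketch. The JobQ is modeled here as a tokio mpsc channel, and `StaticDnsResolver` / `crawl` are stubs standing in for whatever the real components end up being:

```rust
use std::net::IpAddr;

/// Hypothetical resolver that only answers from a fixed, pre-resolved address set
/// produced during breeding - the regular crawler never does live DNS lookups.
struct StaticDnsResolver {
    addrs: Vec<IpAddr>,
}

impl StaticDnsResolver {
    fn new(addrs: Vec<IpAddr>) -> Self {
        Self { addrs }
    }
}

/// Stub for the regular (concurrent) web crawler.
async fn crawl(_job: BreederJob, _resolver: StaticDnsResolver) {}

/// Pull jobs from the JobQ, breed them, and only hand the survivors to the crawler.
async fn pipeline(mut job_q: tokio::sync::mpsc::Receiver<BreederJob>) {
    while let Some(job) = job_q.recv().await {
        match breed(&job).await {
            BreedOutcome::Breeded { addrs, .. } => {
                // Reuse the addresses resolved during breeding via StaticDnsResolver.
                crawl(job, StaticDnsResolver::new(addrs)).await;
            }
            BreedOutcome::Rejected(reason) => {
                // Job did not survive breeding - it never reaches the regular crawler.
                eprintln!("dropped during breeding: {reason}");
            }
        }
    }
}
```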

Labels: enhancement (New feature or request)