[Features]: Enable sitemap.xml-based crawling and following domain redirects#14
[Features]: Enable sitemap.xml-based crawling and following domain redirects#14K-Mistele wants to merge 2 commits intoegoist:mainfrom
Conversation
This is useful in cases where the top-level site e.g. mywebsite.com redirects to www.mywebsite.com
|
New and updated dependencies detected. Learn more about Socket for GitHub ↗︎
|
| await this.#crawlSitemap(url) | ||
| } | ||
|
|
||
| await this.#fetchPage(url, { |
There was a problem hiding this comment.
we probably don't need to continue normal fetching if sitemap is enabled, ie:
if (this.options.enableSitemap) {
await this.#crawlSitemap(url)
} else {
await this.#fetchPage(url, {})
}There was a problem hiding this comment.
So, while sitemap.xml fetching is a great thing to have for larger sites or search-optimized sites, a lot of sites don't have it still.
I was viewing the enableSitemap options as "we are enabling sitemap-based crawling if it's available" but also assuming that if someone calls siteFetch() they want the site fetched regardless of whether or not there is a sitemap available. E.g. if someone is batch-fetching a bunch of sites, they probably want the sites fetched and to use sitemap if possible, but don't want the site skipped just because there's not a sitemap.
| crawled as normal: | ||
| ``` | ||
| sitefetch https://nextjs.org --enable-sitemap | ||
| ``` |
There was a problem hiding this comment.
can it simply be:
sitefetch https://example.com/sitemap.xml
There was a problem hiding this comment.
This requires that you know ahead of time if the site you're trying to crawl has a sitemap which requires making an additional request, see above comment. Happy to change it if you feel strongly about it.
There was a problem hiding this comment.
we can just check if the input url ends with /sitemap.xml
|
btw we use |
Oops! I will nuke the package-lock then. I actually use Bun for all of my personal stuff but most projects don't so I'm used to reverting to npm for OSS contributions. |
This PR adds two features:
sitemap.xml. if it is present, all the specified URLs will be added to the crawl queue.