The home for the spider that supports Search.gov.
With the move away from using Bing to provide search results for some domains, we need a solution that can index sites that were previously indexed by Bing and/or that do not have standard sitemaps. Additionally, the Scrutiny desktop application is being run manually to provide coverage for a few dozen domains that cannot be otherwise indexed. The spider application is our solution to both the Bing problem and the removal of manual steps. The documentation here represents the most current state of the application and our design.
We currently run Python 3.12. The spider is built on the open source Scrapy framework, along with several other open source libraries and Scrapy plugins. See our requirements file for more details.
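As a quick sanity check before digging in, you can confirm that your interpreter and Scrapy install line up with what the requirements file expects. This is an illustrative snippet only; the exact pinned versions live in requirements.txt:
import sys
import scrapy

print(sys.version_info[:2])   # expected: (3, 12)
print(scrapy.__version__)     # should match the version pinned in requirements.txt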
*Note: Other files and directories exist within the repository, but the folders and files below are those needed for the Scrapy framework.
├── search_gov_crawler # scrapy root
│ ├── dap # code for handling data from DAP
│ ├── domains # json files with domains to scrape
│ ├── elasticsearch # code related to indexing content in elasticsearch
│ ├── scheduling # code for job scheduling and storing schedules in redis
│ ├── search_gov_spider # scrapy project dir
│ │ ├── extensions # custom scrapy extensions
│ │ ├── helpers # common functions
│ │ ├── job_state # code related to storing job state in redis
│ │ ├── sitemaps # code related to indexing based on sitemap data
│ │ ├── spiders # all search_gov_spider spiders
│ │ │ ├── domain_spider.py # for html pages
│ │ │ ├── domain_spider_js.py # for js pages
│ │ ├── items.py # defines individual output of scrapes
│ │ ├── middlewares.py # custom middleware code
│ │ ├── monitors.py # custom spidermon monitors
│ │ ├── pipelines.py # custom item pipelines
│ │ ├── settings.py # settings that control all scrapy jobs
│ ├── scrapy.cfg

Docker can be used to run the spider from this repo or from search-services. If you want to run other Search.gov services besides the spider and its dependencies, use the search-services repo.
- Start Docker:
The spider profile must be used to start the spider and its dependencies.
docker compose --profile spider up
- Watch Logs and Check Output:
The default behavior is that the spider-scheduler and spider-sitemap containers start running based on our development schedule. It may be that no jobs are scheduled for a while, so nothing will run. Likewise, the sitemap process may not detect any changes and therefore will not index any documents.
If a crawl does start, watch the logs for information about records loaded to Elasticsearch and OpenSearch. Then visit Kibana and/or OpenSearch Dashboards to view the indexed documents.
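If you prefer the command line over Kibana, the following is a minimal sketch for spot-checking the index with the elasticsearch Python client. The endpoint and index name here are assumptions for a local docker compose setup, not the project's actual configuration; adjust them to match your environment:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local endpoint; adjust to your compose setup
index_name = "example-search-gov-content"    # placeholder index name, not the real one
print(es.count(index=index_name)["count"])   # rough count of documents indexed so far
for hit in es.search(index=index_name, size=3)["hits"]["hits"]:
    print(hit["_id"])                        # print a few document ids as a smoke test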
- Run an on-demand crawl:
To index documents from a specific domain, use the helper script to trigger an on-demand crawl. Here the spider crawl command can be used as a shortcut to trigger a non-js crawl starting at https://www.gsa.gov and limited to pages in the www.gsa.gov domain.
docker compose run spider /bin/bash -c "spider crawl www.gsa.gov https://www.gsa.gov"

- Install and activate virtual environment:
python -m venv venv
source venv/bin/activate
- Add required Python modules:
pip install -r requirements.txt
# required for domains that need javascript
playwright install --with-deps
playwright install chrome --force
- Start Required Infrastructure Using Docker:
docker compose up redis
- Run A Spider:
cd search_gov_crawler
# to run for a non-js domain:
scrapy crawl domain_spider -a allowed_domains=quotes.toscrape.com -a start_urls=https://quotes.toscrape.com -a output_target=csv
# or to run for a js domain
scrapy crawl domain_spider_js -a allowed_domains=quotes.toscrape.com -a start_urls=https://quotes.toscrape.com/js -a output_target=csv
- Check Output:
The output of this scrape is one or more CSV files containing URLs, written to the output directory.
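To spot-check those files without opening them by hand, a small sketch like the following works. It assumes the CSV files land in an output directory relative to where the crawl ran and makes no assumptions about column names:
import csv
from pathlib import Path

# Print every row from each CSV file the crawl produced (the "output" path is an assumption).
for csv_file in Path("output").glob("*.csv"):
    with csv_file.open(newline="") as f:
        for row in csv.reader(f):
            print(csv_file.name, row)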
- Learn More:
For more advanced usage, see the Advanced Setup and Use Page.
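For reference, the scrapy crawl commands above map onto Scrapy's standard Python API, which can be handy for debugging. This is a minimal sketch, assuming it is run from the search_gov_crawler directory so the project settings are picked up, and reusing the same spider name and arguments shown in the quickstart:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project's settings.py, then run the non-js spider with the same
# arguments the CLI passes via -a.
process = CrawlerProcess(get_project_settings())
process.crawl(
    "domain_spider",
    allowed_domains="quotes.toscrape.com",
    start_urls="https://quotes.toscrape.com",
    output_target="csv",
)
process.start()  # blocks until the crawl finishes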
- Scrapy Scheduler - Process that manages and runs spider crawls based on a schedule.
- Sitemap Monitor - Process that monitors domains for changes in their sitemaps and triggers spider runs to capture changes.
- DAP Extractor - Stand-alone job that handles extracting and loading DAP visits data for use in spider crawls.
- Benchmark - Allows for manual testing and benchmarking using similar mechanisms as scheduled runs.