The home for the spider that supports Search.gov.
With the move away from using Bing to provide search results for some domains, we need a solution that can index sites that were previously indexed by Bing and/or that do not have standard sitemaps. Additionally, the Scrutiny desktop application is being run manually to provide coverage for a few dozen domains that cannot be otherwise indexed. The spider application is our solution to both the Bing problem and the removal of manual steps. The documentation here represents the most current state of the application and our design.
We currently run Python 3.12. The spider is built on the open source Scrapy framework, along with several other open source libraries and Scrapy plugins. See our requirements file for more details.
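As a quick sanity check before digging in, you can confirm that your interpreter and Scrapy install line up with what the requirements file expects. This is an illustrative snippet only; the exact pinned versions live in requirements.txt:
import sys
import scrapy

print(sys.version_info[:2])   # expected: (3, 12)
print(scrapy.__version__)     # should match the version pinned in requirements.txt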
*Note: Other files and directories exist within the repository, but the folders and files below are those needed for the Scrapy framework.
├── search_gov_crawler # scrapy root
│ ├── dap # code for handling data from DAP
│ ├── domains # json files with domains to scrape
│ ├── elasticsearch # code related to indexing content in elasticsearch
│ ├── scheduling # code for job scheduling and storing schedules in redis
│ ├── search_gov_spider # scrapy project dir
│ │ ├── extensions # custom scrapy extensions
│ │ ├── helpers # common functions
│ │ ├── job_state # code related to storing job state in redis
│ │ ├── sitemaps # code related to indexing based on sitemap data
│ │ ├── spiders # all search_gov_spider spiders
│ │ │ ├── domain_spider.py # for html pages
│ │ │ ├── domain_spider_js.py # for js pages
│ │ ├── items.py # defines individual output of scrapes
│ │ ├── middlewares.py # custom middleware code
│ │ ├── monitors.py # custom spidermon monitors
│ │ ├── pipelines.py # custom item pipelines
│ │ ├── settings.py # settings that control all scrapy jobs
│ ├── scrapy.cfg

Docker can be used to run the spider from this repo or from search-services. If you want to run other Search.gov services besides the spider and its dependencies, use the search-services repo.
- Start Docker:
The spider profile must be used to start the spider and its dependencies.
docker compose --profile spider up
- Watch Logs and Check Output:
The default behavior is that the spider-scheduler and spider-sitemap containers start running based on our development schedule. It may be that no jobs are scheduled for a while, so nothing will run. Likewise, the sitemap process may not detect any changes and therefore will not index any documents.
If a crawl does start, watch the logs for information about records loaded to Elasticsearch and OpenSearch. Then visit Kibana and/or OpenSearch Dashboards to view the indexed documents.
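If you prefer the command line over Kibana, the following is a minimal sketch for spot-checking the index with the elasticsearch Python client. The endpoint and index name here are assumptions for a local docker compose setup, not the project's actual configuration; adjust them to match your environment:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local endpoint; adjust to your compose setup
index_name = "example-search-gov-content"    # placeholder index name, not the real one
print(es.count(index=index_name)["count"])   # rough count of documents indexed so far
for hit in es.search(index=index_name, size=3)["hits"]["hits"]:
    print(hit["_id"])                        # print a few document ids as a smoke test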
- Run an on-demand crawl:
To index documents from a specific domain, use the helper script to trigger an on-demand crawl. Here the spider crawl command can be used as a shortcut to trigger a non-js crawl starting at https://www.gsa.gov and limited to pages in the www.gsa.gov domain.
docker compose run spider /bin/bash -c "spider crawl www.gsa.gov https://www.gsa.gov"

- Install and activate virtual environment:
python -m venv venv
source venv/bin/activate
- Add required Python modules:
pip install -r requirements.txt
# required for domains that need javascript
playwright install --with-deps
playwright install chrome --force
- Start Required Infrastructure Using Docker:
docker compose up redis
- Run A Spider:
cd search_gov_crawler
# to run for a non-js domain:
scrapy crawl domain_spider -a allowed_domains=quotes.toscrape.com -a start_urls=https://quotes.toscrape.com -a output_target=csv
# or to run for a js domain
scrapy crawl domain_spider_js -a allowed_domains=quotes.toscrape.com -a start_urls=https://quotes.toscrape.com/js -a output_target=csv
- Check Output:
The output of this scrape is one or more CSV files containing URLs, written to the output directory.
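To spot-check those files without opening them by hand, a small sketch like the following works. It assumes the CSV files land in an output directory relative to where the crawl ran and makes no assumptions about column names:
import csv
from pathlib import Path

# Print every row from each CSV file the crawl produced (the "output" path is an assumption).
for csv_file in Path("output").glob("*.csv"):
    with csv_file.open(newline="") as f:
        for row in csv.reader(f):
            print(csv_file.name, row)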
- Learn More:
For more advanced usage, see the Advanced Setup and Use Page.
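For reference, the scrapy crawl commands above map onto Scrapy's standard Python API, which can be handy for debugging. This is a minimal sketch, assuming it is run from the search_gov_crawler directory so the project settings are picked up, and reusing the same spider name and arguments shown in the quickstart:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project's settings.py, then run the non-js spider with the same
# arguments the CLI passes via -a.
process = CrawlerProcess(get_project_settings())
process.crawl(
    "domain_spider",
    allowed_domains="quotes.toscrape.com",
    start_urls="https://quotes.toscrape.com",
    output_target="csv",
)
process.start()  # blocks until the crawl finishes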
- Scrapy Scheduler - Process that manages and runs spider crawls based on a schedule.
- Sitemap Monitor - Process that monitors domains for changes in their sitemaps and triggers spider runs to capture changes.
- DAP Extractor - Stand-alone job that handles extracting and loading DAP visits data for use in spider crawls.
- Benchmark - Allows for manual testing and benchmarking using similar mechanisms as scheduled runs.