USE 449 - support btrix-args-json #337

Draft
ghukill wants to merge 3 commits into main from USE-449-pickup-btrix-args
ghukill commented Mar 13, 2026

Purpose and background context

This PR allows btrix-args-json to be passed in the StepFunction input payload and propagates it to the browsertrix-harvester CLI command.

How can a reviewer manually see the effects of these changes?

1- Build SAM image:

make sam-build

2- Run the pre-existing sample that exercises this new input payload property:

make sam-example-libguides-extract

Payload sent with this:

{
  "next-step": "extract",
  "run-date": "2026-03-13",
  "run-type": "full",
  "source": "libguides",
  "verbose": "true",
  "btrix-config-yaml-file": "s3://timdex-extract-dev-222053980223/libguides/config/libguides.yaml",
  "btrix-sitemaps": [
    "https://libguides.mit.edu/sitemap.xml"
  ],
  "btrix-sitemap-urls-output-file": "s3://timdex-extract-dev-222053980223/libguides/last-sitemaps-urls.txt",
  --------> "btrix-args-json":"{\"--scopeType\":\"custom\",\"--include\":\".*libguides.mit.edu/(c.php\\?g=.*\\&p=.*)\"}"
}
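
The `btrix-args-json` value is a JSON document serialized into a string, which is why its inner quotes are escaped in the payload above. A minimal Python sketch of building such a payload, assuming a plain dict stands in for the payload and using only `--scopeType` from the example (the `--include` pattern is left out here to avoid guessing at its escaping):

```python
import json

# Overrides to hand to the harvester; only --scopeType is taken from the PR example.
btrix_args = {"--scopeType": "custom"}

# Serialize the overrides to a compact JSON string first...
btrix_args_json = json.dumps(btrix_args, separators=(",", ":"))

# ...then embed that string as an ordinary string value in the payload.
payload = {
    "next-step": "extract",
    "source": "libguides",
    "btrix-args-json": btrix_args_json,
}

# Serializing the payload escapes the inner quotes, as in the sample above.
serialized = json.dumps(payload, indent=2)

# Reading it back takes two decoding steps: the payload, then the embedded string.
round_tripped = json.loads(json.loads(serialized)["btrix-args-json"])
```

Letting `json.dumps` produce the inner string avoids hand-writing the escapes.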

Expected output with --btrix-args-json present:

{
  "next-step": "transform",
  "run-date": "2026-03-13",
  "run-type": "full",
  "run-id": "8afb9aa8-3605-4817-a114-0a4074aa4c8d",
  "source": "libguides",
  "verbose": true,
  "harvester-type": "browsertrix",
  "extract": {
    "extract-command": [
      "--verbose",
      "harvest",
      "--config-yaml-file=s3://timdex-extract-dev-222053980223/libguides/config/libguides.yaml",
      "--records-output-file=s3://timdex-extract-dev-222053980223/libguides/libguides-2026-03-13-full-extracted-records-to-index.jsonl",
      "--sitemap=https://libguides.mit.edu/sitemap.xml",
      "--sitemap-urls-output-file=s3://timdex-extract-dev-222053980223/libguides/last-sitemaps-urls.txt",
      ------> "--btrix-args-json={\"--scopeType\":\"custom\",\"--include\":\".*libguides.mit.edu/(c.php\\?g=.*\\&p=.*)\"}"
    ]
  }
}
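
The step from the input payload to the `extract-command` list above can be sketched roughly as follows. This is an illustration, not the code in `lambdas/commands.py`: the function name, the plain-dict payload, and the placeholder S3 path are assumptions, and only a few of the flags are reproduced.

```python
def build_extract_command(payload: dict) -> list[str]:
    """Hypothetical helper: build a harvester argument list from a payload dict."""
    cmd = []
    # The sample payload carries "verbose" as the string "true".
    if str(payload.get("verbose", "")).lower() == "true":
        cmd.append("--verbose")
    cmd.append("harvest")
    cmd.append(f"--config-yaml-file={payload['btrix-config-yaml-file']}")
    for sitemap in payload.get("btrix-sitemaps", []):
        cmd.append(f"--sitemap={sitemap}")
    # New behaviour: forward the overrides only when the property is present.
    if btrix_args_json := payload.get("btrix-args-json"):
        cmd.append(f"--btrix-args-json={btrix_args_json}")
    return cmd


example_payload = {
    "verbose": "true",
    "btrix-config-yaml-file": "s3://example-bucket/config.yaml",  # placeholder path
    "btrix-sitemaps": ["https://libguides.mit.edu/sitemap.xml"],
    "btrix-args-json": '{"--scopeType":"custom"}',
}
with_args = build_extract_command(example_payload)

# Without the property, the command is unchanged from the previous behaviour.
without_args = build_extract_command(
    {k: v for k, v in example_payload.items() if k != "btrix-args-json"}
)
```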

Includes new or updated dependencies?

YES

Changes expectations for external applications?

YES: StepFunction input payload can now pass browsertrix-harvester overrides

What are the relevant tickets?

ghukill added 2 commits March 13, 2026 15:37
How this addresses that need:
* If `btrix-args-json` present in input payload, add to browsertrix-harvester final CLI command
as `--btrix-args-json`

Side effects of this change:
* Allows runtime overrides as the `btrix-args-json` was designed to do.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-449

Copilot AI left a comment

Pull request overview

This PR extends the “format input / command generation” layer of the TIMDEX pipeline lambdas to accept a new Step Function input property (btrix-args-json) and include it in the generated Browsertrix harvester extract command.

Changes:

  • Add support for btrix-args-json when generating Browsertrix extract commands.
  • Update unit tests for Browsertrix extract command generation to include the new argument.
  • Update the SAM example fixture payload for libguides and refresh dependency lockfile entries.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| lambdas/commands.py | Appends btrix-args-json into the Browsertrix extract command argument list. |
| tests/test_commands.py | Adds test coverage asserting the new argument is present in Browsertrix extract commands. |
| tests/fixtures/event_payloads/libguides-full-extract.json | Updates the sample payload used for sam local invoke to include Browsertrix inputs and btrix-args-json. |
| Pipfile.lock | Updates locked dependency versions and adds/updates transitive deps. |

cmd.append(f"--previous-sitemap-urls-file={sitemap_urls_previous}")

if btrix_args_json := input_payload.raw.get("btrix-args-json"):
    cmd.append(f"""--btrix-args-json='{btrix_args_json}'""")
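
A note on the single quotes in the appended argument above: quote removal is something a shell does. If the command list reaches the process without a shell in between (an assumption; this PR does not show how the command is executed), the quotes stay part of the value. The expected output in the PR description shows the argument without them. A small sketch of the difference, with `option_value` and `parses_as_json` as hypothetical helpers:

```python
import json
import shlex

btrix_args_json = json.dumps({"--scopeType": "custom"})

quoted = f"--btrix-args-json='{btrix_args_json}'"
unquoted = f"--btrix-args-json={btrix_args_json}"


def option_value(arg: str) -> str:
    """Return everything after the first '=' in a --flag=value argument."""
    return arg.split("=", 1)[1]


def parses_as_json(value: str) -> bool:
    """Report whether the value is valid JSON."""
    try:
        json.loads(value)
    except json.JSONDecodeError:
        return False
    return True


# Passed as a list element, the quoted form keeps its quotes and is not valid JSON.
quoted_ok = parses_as_json(option_value(quoted))
unquoted_ok = parses_as_json(option_value(unquoted))

# Only shell-style tokenizing strips the quotes again.
shell_tokens = shlex.split(quoted)
```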
Contributor Author

This was exactly correct. Force push coming shortly. Encountered this during a trial run.
