
Add QuickStatements JSON output for software records #190

Merged
physikerwelt merged 4 commits into main from copilot/add-json-output-support
Feb 25, 2026
Contributor

Copilot AI commented Feb 24, 2026

  • Create src/zbmath_rest2oai/software_quickstatements.py with core logic to convert software result dicts to QuickStatements JSON rows
  • Create test/data/software/sagemath_825_raw.json fixture (real zbMATH API data for SageMath id=825)
  • Create test/data/software/sagemath_825_metadata.json expected QuickStatements metadata JSON
  • Create test/data/software/sagemath_825_references.json expected QuickStatements references JSON
  • Create test/test_software_quickstatements.py with 22 unit + integration tests (all passing)
  • Add CLI script src/zbmath_rest2oai/run_software_quickstatements.py to produce JSON files from the zbMath API
  • Update README.md to document the new JSON output feature
  • Verified key conventions from QuickStatements.php: syntax is P$xq$y, rows key is required
  • Fix all FLK-E501 (line too long) violations in the three new Python files
Original prompt

Create a PR in MaRDI4NFDI/python-zbMathRest2Oai to add JSON output support for software exports, replacing/augmenting the current XSLT->XML output.

Context:

  • The MediaWiki extension MathSearch QuickStatements job (includes/Graph/Job/QuickStatements.php) consumes a custom column encoding:
    • Keys like qP13 indicate lookup-by-PID: find the item whose property P13 equals the provided value (and optionally create missing).
    • Keys like P1458q13 indicate that the value is an external/software id that must be transformed to a Q-id by looking up the item via property P13 (the q<id> suffix indicates the PID used for lookup). After transformation, the statement is applied as property P1458 with a wikibase-item value.
    • L<lang> and D<lang> are labels and descriptions.
    • qalP<id> are qualifiers.
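
The key shapes listed above can be told apart with a few regular expressions. The following is a minimal sketch only: `classify_key` and `KEY_PATTERNS` are hypothetical names, the plain `P<id>` case is added for completeness, and none of it is copied from QuickStatements.php.

```python
import re

# Hypothetical classifier for the column keys described above. The patterns
# illustrate the encoding; they are not the PHP job's own logic.
KEY_PATTERNS = [
    ("lookup", re.compile(r"^qP(?P<pid>\d+)$")),
    ("item_via_lookup", re.compile(r"^P(?P<pid>\d+)q(?P<lookup_pid>\d+)$")),
    ("qualifier", re.compile(r"^qalP(?P<pid>\d+)$")),
    ("label", re.compile(r"^L(?P<lang>[a-z-]+)$")),
    ("description", re.compile(r"^D(?P<lang>[a-z-]+)$")),
    ("statement", re.compile(r"^P(?P<pid>\d+)$")),
]


def classify_key(key):
    """Return (kind, parts) for a column key, or ("other", {})."""
    for kind, pattern in KEY_PATTERNS:
        match = pattern.match(key)
        if match:
            return kind, match.groupdict()
    return "other", {}


for key in ("qP13", "P1458q13", "Len", "P29"):
    print(key, classify_key(key))
```

Anchoring every pattern with `^` and `$` keeps `qP13` (lookup) from being confused with `P1458q13` (value resolved through a lookup), which is the distinction the encoding relies on.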

Goal:

  • In python-zbMathRest2Oai, implement output of software records into a custom JSON format suitable as input rows for the MathSearch QuickStatements job.
  • The repo already iterates through all software and currently outputs XML via XSLT; extend it to output JSON instead (either new mode/CLI flag, or additional output option), without breaking existing XML output.

Required JSON deliverables:

  1. Two JSON files generated by the tooling that correspond to the two update phases described in MaRDIRoadmap issue referncePublication deposit #173:

    • software_quickstatements_metadata.json: QuickStatements rows for software metadata fields (swmathID/index qP13, label/name Len, classifications P226 (many), homepage P29, related software IDs P1458q13 (many), standard articles P286q1459 (many), source code P339).
    • software_quickstatements_references.json: QuickStatements rows for references fields (reference qP1451 (many), software id link P1463q13).
  2. Add/update tests and test fixtures:

    • Add a fixture and expected JSON outputs for SageMath with swMathWork id 825.
    • The expected output should be compared against what is present on the portal item https://portal.mardi4nfdi.de/wiki/Item:Q13572 (at least ensuring that the produced row keys/values align with the portal’s intended mapping: qP13=825, label Len=SageMath, homepage P29, source code P339, plus any classifications/related software/standard articles/references available from the upstream API).

Implementation notes:

  • Define a clear JSON schema, e.g. { "job": "QuickStatements", "rows": [ {"qP13": "825", "Len": "SageMath", "P29": "...", "P226": "68", "P1458q13": "<id>", ... } ] } or { "rows": [ ... ] }. Rows should match the column-based structure expected by the QuickStatements job.
  • Ensure multi-valued fields become multiple rows or multiple statements per row in a way consistent with the QuickStatements job’s expectation (it uses a statement key counter based on encountering P* keys; consider using suffixes _1, _2 etc for multiple statements for the same property as supported by the job which strips _(\d+) suffix).
  • Preserve or document how to handle empty/self-closing XML elements.
  • Add CLI option(s) to select output format (xml vs json) and possibly choose the update phase (metadata vs references) so that the two JSON files can be produced deterministically.
  • Update README/docs if appropriate.
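
The suffix convention for multi-valued fields mentioned in these notes can be sketched as follows. Both helper names are hypothetical, and the stripping step only mimics what the notes say the job does with a trailing `_(\d+)`.

```python
import re


def number_values(key, values):
    """Spread a multi-valued field over suffixed keys: key_1, key_2, ..."""
    return {f"{key}_{i}": str(v) for i, v in enumerate(values, start=1)}


def base_key(key):
    """Drop a trailing _<digits> suffix to recover the property key."""
    return re.sub(r"_\d+$", "", key)


row = {"qP13": "825", **number_values("P226", ["05", "11", "14"])}
print(row)
print([base_key(k) for k in row])
```

Keeping all values of one record in a single row means each row key must be unique, which is why the counter suffix is needed at all.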

Acceptance criteria:

  • Running the software export over the test data produces the two JSON files.
  • Unit/integration tests pass; SageMath fixture test verifies deterministic JSON output for swMathWork=825.
  • Existing XML/XSLT output continues to work unchanged unless explicitly deprecated.

This pull request was created from Copilot chat.



Co-authored-by: physikerwelt <2777736+physikerwelt@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add JSON output support for software exports" to "Add QuickStatements JSON output for software records" Feb 24, 2026
Copilot AI requested a review from physikerwelt February 24, 2026 15:44
@physikerwelt
Member

@copilot the result of the zbMATH API for sage is that

{
    "articles_count": 3439,
    "authors": [
      "The Sage Developers",
      "Stein, William",
      "Joyner, David",
      "Kohel, David",
      "Cremona, John",
      "Eröcal, Burçin"
    ],
    "classification": [
      "05",
      "11",
      "14",
      "20",
      "68",
      "00",
      "01",
      "03",
      "06",
      "08",
      "12",
      "13",
      "15",
      "16",
      "17",
      "18",
      "19",
      "22",
      "26",
      "28",
      "30",
      "31",
      "32",
      "33",
      "34",
      "35",
      "37",
      "39",
      "40",
      "41",
      "42",
      "43",
      "44",
      "46",
      "47",
      "49",
      "51",
      "52",
      "53",
      "54",
      "55",
      "57",
      "58",
      "60",
      "62",
      "65",
      "70",
      "74",
      "76",
      "78",
      "80",
      "81",
      "82",
      "83",
      "85",
      "86",
      "90",
      "91",
      "92",
      "93",
      "94",
      "97"
    ],
    "dependencies": null,
    "description": "zbMATH Open Web Interface contents unavailable due to conflicting licenses.",
    "homepage": "https://www.sagemath.org/",
    "id": 825,
    "keywords": [
      "orms",
      "Python",
      "Cython",
      "Sage",
      "Open Source",
      "Interfaces"
    ],
    "license_terms": "GPL (≥ 3)",
    "name": "SageMath",
    "operating_systems": null,
    "orms_id": "255",
    "programming_languages": null,
    "related_software": [
      {
        "id": 540,
        "name": "Magma"
      },
      {
        "id": 23170,
        "name": "GitHub"
      },
      {
        "id": 320,
        "name": "GAP"
      },
      {
        "id": 7248,
        "name": "OEIS"
      },
      {
        "id": 680,
        "name": "PARI/GP"
      },
      {
        "id": 866,
        "name": "SINGULAR"
      },
      {
        "id": 537,
        "name": "Macaulay2"
      },
      {
        "id": 554,
        "name": "Mathematica"
      },
      {
        "id": 16448,
        "name": "Sage-Combinat"
      },
      {
        "id": 7249,
        "name": "LMFDB"
      },
      {
        "id": 545,
        "name": "Maple"
      },
      {
        "id": 14460,
        "name": "Python"
      },
      {
        "id": 27596,
        "name": "ecdata"
      },
      {
        "id": 611,
        "name": "nauty"
      },
      {
        "id": 7823,
        "name": "Traces"
      },
      {
        "id": 23728,
        "name": "MathOverflow"
      },
      {
        "id": 4968,
        "name": "DLMF"
      },
      {
        "id": 724,
        "name": "polymake"
      },
      {
        "id": 560,
        "name": "Maxima"
      },
      {
        "id": 6874,
        "name": "SnapPy"
      }
    ],
    "source_code": "https://github.com/sagemath/sage",
    "standard_articles": [
      {
        "authors": [],
        "id": 7045592,
        "source": "zbMATH Open Web Interface contents unavailable due to conflicting licenses.",
        "title": "zbMATH Open Web Interface contents unavailable due to conflicting licenses.",
        "year": "2019"
      },
      {
        "authors": [],
        "id": 6418360,
        "source": "zbMATH Open Web Interface contents unavailable due to conflicting licenses.",
        "title": "zbMATH Open Web Interface contents unavailable due to conflicting licenses.",
        "year": "2015"
      },
      {
        "authors": [],
        "id": 6532438,
        "source": "zbMATH Open Web Interface contents unavailable due to conflicting licenses.",
        "title": "zbMATH Open Web Interface contents unavailable due to conflicting licenses.",
        "year": "2013"
      },
      {
        "authors": [],
        "id": 6462232,
        "source": "zbMATH Open Web Interface contents unavailable due to conflicting licenses.",
        "title": "zbMATH Open Web Interface contents unavailable due to conflicting licenses.",
        "year": "2013"
      },
      {
        "authors": [],
        "id": 6114556,
        "source": "zbMATH Open Web Interface contents unavailable due to conflicting licenses.",
        "title": "zbMATH Open Web Interface contents unavailable due to conflicting licenses.",
        "year": "2012"
      },
      {
        "authors": [],
        "id": 5785538,
        "source": "zbMATH Open Web Interface contents unavailable due to conflicting licenses.",
        "title": "zbMATH Open Web Interface contents unavailable due to conflicting licenses.",
        "year": "2010"
      },
      {
        "authors": [],
        "id": 5380273,
        "source": "zbMATH Open Web Interface contents unavailable due to conflicting licenses.",
        "title": "zbMATH Open Web Interface contents unavailable due to conflicting licenses.",
        "year": "2008"
      }
    ],
    "zbmath_url": "https://zbmath.org/software/825"
}

Can you use that as example data for the unit test and describe which data would be inserted into which fields in https://portal.mardi4nfdi.de/wiki/Item:Q13572? Double-check whether the syntax for linking items based on external identifiers is P$xq$y or P$xqP$y, based on the code in the MathSearch extension. Also check, based on MaRDI4NFDI/mediawiki-extensions-MathSearch#2, whether the rows key is needed in the result.

…d outputs

Co-authored-by: physikerwelt <2777736+physikerwelt@users.noreply.github.com>
Contributor Author

Copilot AI commented Feb 25, 2026

@copilot the result of the zbMATH API for sage is that

{

...

Updated in commit 6c969e3. Here's a summary of what changed and what I verified:

Fixture updated (sagemath_825_raw.json) with the real zbMATH API data you provided: 6 authors, 62 MSC classifications, 20 related software items, 7 standard articles, and 3 sample citing article references (fetched separately via add_references_to_software).

Field mapping to https://portal.mardi4nfdi.de/wiki/Item:Q13572:

| JSON key | Portal field | Value |
| --- | --- | --- |
| qP13 | Item lookup via swMath ID (P13) | "825" |
| Len | Label (English) | "SageMath" |
| P29 | Homepage | "https://www.sagemath.org/" |
| P339 | Source code repository | "https://github.com/sagemath/sage" |
| P226_1 … P226_62 | MSC classifications (P226) | "05", "11", …, "97" |
| P1458q13_1 … P1458q13_20 | Related software (P1458, looked up via swMath ID P13) | "540" (Magma), "23170" (GitHub), … |
| P286q1459_1 … P286q1459_7 | Standard articles (P286, looked up via zbMath ID P1459) | "7045592", "6418360", … |
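
As an illustration of this mapping, here is a sketch that turns one API record into a single metadata row. `metadata_row` is a hypothetical helper, not the function in software_quickstatements.py, and the sample record is a shortened excerpt of the SageMath data quoted earlier in the thread.

```python
def metadata_row(record):
    """Map one zbMATH software record (a dict) to a single metadata row.

    Hypothetical helper: the key names follow the mapping above, but the
    function is an illustration, not the code merged in this PR.
    """
    row = {"qP13": str(record["id"])}
    # Single-valued fields are skipped when the API returns null.
    single = (("Len", "name"), ("P29", "homepage"), ("P339", "source_code"))
    for key, field in single:
        if record.get(field):
            row[key] = record[field]
    # Multi-valued fields get _1, _2, ... suffixes.
    related = [s["id"] for s in record.get("related_software") or []]
    articles = [a["id"] for a in record.get("standard_articles") or []]
    multi = (
        ("P226", record.get("classification") or []),
        ("P1458q13", related),
        ("P286q1459", articles),
    )
    for key, values in multi:
        for i, value in enumerate(values, start=1):
            row[f"{key}_{i}"] = str(value)
    return row


sample = {
    "id": 825,
    "name": "SageMath",
    "homepage": "https://www.sagemath.org/",
    "source_code": "https://github.com/sagemath/sage",
    "classification": ["05", "11"],
    "related_software": [{"id": 540, "name": "Magma"}],
    "standard_articles": [{"id": 7045592}],
    "dependencies": None,
}
print(metadata_row(sample))
```

All values are emitted as strings, matching the quoted values in the mapping; whether the job also accepts integers is not something this sketch assumes.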

For references (software_quickstatements_references.json):

| JSON key | Portal field | Value |
| --- | --- | --- |
| qP1451 | Find article item where zbMath ID (P1451) equals value | e.g. "6383667" |
| P1463q13 | Property P1463 (citing software), value looked up via P13 | "825" |
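
The references phase can be sketched the same way: one row per citing article, each pointing back at the software. `reference_rows` is a hypothetical name, and the only article ID used is the example given above.

```python
def reference_rows(software_id, article_ids):
    """Build one row per citing article, each linking back to the software."""
    return [
        {"qP1451": str(article_id), "P1463q13": str(software_id)}
        for article_id in article_ids
    ]


rows = reference_rows(825, [6383667])
print(rows)
```

Unlike the metadata phase, no counter suffix is needed here, because each row carries a single statement for a different article item.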

On P$xq$y vs P$xqP$y: The PHP regex in QuickStatements.php is /P(?P<p>\d+)q(?P<q>\d+)/i — digits only after q, no P prefix. So P1458q13 is correct.
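
For a quick check outside PHP, the quoted pattern can be tried in Python. This is a sketch: `parse_link_key` is a hypothetical helper, `re.IGNORECASE` stands in for the trailing `/i`, and `re.search` is used on the assumption that the job matches anywhere in the key.

```python
import re

# Python rendering of the PHP pattern quoted above.
LINK_KEY = re.compile(r"P(?P<p>\d+)q(?P<q>\d+)", re.IGNORECASE)


def parse_link_key(key):
    """Return (property id, lookup property id), or None if no match."""
    match = LINK_KEY.search(key)
    return (match["p"], match["q"]) if match else None


print(parse_link_key("P1458q13"))
print(parse_link_key("P1458qP13"))
```

The second call returns `None` because the pattern requires digits directly after `q`, so a `P` prefix on the lookup property breaks the match.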

On the rows key: The PHP run() method iterates $this->params['rows'], so the {"rows": [...]} wrapper is required.
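
A minimal sketch of producing that wrapper from Python follows. `dump_rows` is a hypothetical helper, and the indentation settings are an assumption made for readable, repeatable output rather than something the job requires.

```python
import json


def dump_rows(rows):
    """Serialise rows inside the {"rows": [...]} wrapper."""
    payload = {"rows": rows}
    # Dicts keep insertion order, so the same input gives the same text.
    return json.dumps(payload, indent=2, ensure_ascii=False)


text = dump_rows([{"qP13": "825", "Len": "SageMath"}])
print(text)
```

Key order is left as inserted rather than sorted, since sorting would place `P226_10` before `P226_2` and the thread does not say whether the job is sensitive to column order.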

@physikerwelt
Member

@copilot fix FLK-E501 warnings and discuss the pros and cons of this PR being part of this repo or if it should be a new GitHub repo.

Co-authored-by: physikerwelt <2777736+physikerwelt@users.noreply.github.com>
Contributor Author

Copilot AI commented Feb 25, 2026

@copilot fix FLK-E501 warnings and discuss the pros and cons of this PR being part of this repo or if it should be a new G...

E501 violations fixed in commit e54393e — all three new Python files now pass flake8 with default 79-char line length.

On whether this belongs here or in a separate repo:

Arguments for keeping it here (current approach):

  • The conversion logic is tightly coupled to the zbMath REST API data structures already handled by getAsXml.py — sharing the same apply_zbmath_api_fixes, add_references_to_software, and data-dict format avoids duplication.
  • The existing test infrastructure, fixtures and CI are already in place — adding a module here requires no new repo setup.
  • The QuickStatements output is an alternative serialisation of the same data this repo already fetches; it's a format concern, not a different domain.
  • Fewer repos to maintain and coordinate releases across.

Arguments for a separate repo:

  • The QuickStatements consumer (MathSearch MediaWiki extension) is conceptually a different project from the OAI-PMH ingest pipeline; separating them would clarify responsibility.
  • A dedicated repo could have its own pyproject.toml with only the minimal dependencies (just requests), whereas this repo requires lxml, aiohttp, swh.deposit, etc.
  • It would be easier to version and release the QuickStatements exporter independently if portal property IDs ever change.
  • The CLI entry point (run_software_quickstatements.py) is oriented towards batch pipeline use rather than the OAI server use-case of this repo.

My recommendation: keep it here for now since the data coupling is tight, but if the QuickStatements exporter grows significantly (e.g. adds wikibase write-back, article exports, or its own test fixtures) it would be worth extracting into a dedicated python-mardi-quickstatements repo.

physikerwelt marked this pull request as ready for review February 25, 2026 11:26
physikerwelt merged commit 5e100c9 into main Feb 25, 2026
1 of 2 checks passed
physikerwelt deleted the copilot/add-json-output-support branch February 25, 2026 18:29