[Feature Request]: support for custom transforms via function calling #1476
PorridgeBear started this conversation in Feature requests
Replies: 0 comments
What needs to be done?
Firstly, great library. I've hit a couple of use cases that would benefit from the XPath/CSS extractors being able to call into custom code to "do something with the value" that the XPath/CSS selector returns today:

1. When scraping an AEM website's web components, e.g.
<foo-bar data-model="<stringified json>" />
I have needed to use type=attribute, attribute=data-model to get hold of the JSON string. It would be amazing to be able to route this string into a function for further processing, so that the function returns the final value for the field based on custom logic.
2. When scraping a blog and obtaining e.g. a date string like 23 June 2025, I would rather have an ISO date string. Here, type=transform seems like it would be handy, except that it supports a fairly limited set of transforms. So, much like (1), if it were possible to send the result of the XPath/CSS to a function, then I could return the ISO string.

Some further observations: I noticed in the code that there is a type=computed field, which looks for expression or function, so I played with this. The function idea looks like what I need, except that the function is not passed the current XPath/CSS selected value to do something with; it is only passed the already processed fields.

What problem does this solve?
It opens up the schema-based extractors to produce highly custom output that is far more useful for onward processing.
Target users/beneficiaries
Anyone wanting to extract structured data from the scrape.
Current alternatives/workarounds
A custom processing step that needs to read all the schema JSON output files to perform the transforms/customisations rather than doing it with the existing library features.
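For context, the workaround described above might look roughly like the sketch below. It assumes each output file holds a JSON list of records (dicts); the function name `post_process_outputs`, the directory layout, and the field names are illustrative only and not part of the library.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict


def post_process_outputs(out_dir: Path, transforms: Dict[str, Callable[[Any], Any]]) -> int:
    """Rewrite every *.json file in out_dir, applying transforms by field name.

    Assumes each file contains a JSON list of dict records (an assumption of
    this sketch). Returns the number of files rewritten.
    """
    count = 0
    for path in sorted(out_dir.glob("*.json")):
        records = json.loads(path.read_text(encoding="utf-8"))
        for record in records:
            for name, fn in transforms.items():
                # Only transform fields that were actually extracted.
                if record.get(name) is not None:
                    record[name] = fn(record[name])
        path.write_text(json.dumps(records, indent=2), encoding="utf-8")
        count += 1
    return count
```

A call such as `post_process_outputs(Path("out"), {"model": json.loads})` would decode the stringified `data-model` JSON in every record, but only after all files have been written, which is the extra pass this request would like to avoid.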
Proposed approach
Code for this appears to be present, or at least started (the type=computed field with function), but it did not work for me as expected.
What is key is that the standard extraction can still work, i.e. extracting an attribute, or extracting HTML or text, but it should then be possible to route this value into a custom function to get a final value.
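To illustrate the behaviour being requested, here is a minimal sketch in plain Python. It does not use the library's API: the `function` key on a non-computed field, the helper `finalize_field`, the two hooks, and the sample field names and raw values are all hypothetical, and `raw_values` stands in for whatever the standard extraction step already produces.

```python
import json
from datetime import datetime
from typing import Any, Callable, Dict, Optional


def parse_data_model(raw: str) -> Any:
    """Decode a stringified JSON attribute such as data-model."""
    return json.loads(raw)


def to_iso_date(raw: str) -> str:
    """Convert a date like '23 June 2025' to an ISO date string.

    %B matches month names in the current locale (English by default).
    """
    return datetime.strptime(raw.strip(), "%d %B %Y").date().isoformat()


def finalize_field(raw: Optional[str], field: Dict[str, Any]) -> Any:
    """Pass the selected value through the field's optional hook.

    The 'function' key on an attribute/text field is an assumption of this
    sketch; fields without it keep the raw extracted value.
    """
    hook: Optional[Callable[[str], Any]] = field.get("function")
    if raw is None or hook is None:
        return raw
    return hook(raw)


# Hypothetical schema fields: standard extraction settings plus a hook.
fields = [
    {"name": "model", "selector": "foo-bar", "type": "attribute",
     "attribute": "data-model", "function": parse_data_model},
    {"name": "published", "selector": ".date", "type": "text",
     "function": to_iso_date},
    {"name": "title", "selector": "h1", "type": "text"},
]

# Stand-in for the values the XPath/CSS selectors would have returned.
raw_values = {
    "model": '{"sku": "A1", "price": 9.5}',
    "published": "23 June 2025",
    "title": "Hello",
}

result = {f["name"]: finalize_field(raw_values.get(f["name"]), f) for f in fields}
```

The point of the design is that the hook runs after the normal attribute/text/HTML extraction and receives only that field's selected value, so existing schemas without a hook behave exactly as before.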