Using same crawl4ai code for various websites. #860
Unanswered
VAIBHAVAGARWAL12
asked this question in
Forums - Q&A
Replies: 1 comment
-
Glad to hear that you are using Crawl4AI; your use case sounds very interesting. I can give you a rough idea of how to proceed with this generic data extraction, but you'll have to experiment with some trial and error.
```python
from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# `result` is the CrawlResult from a prior crawler.arun() call.
# Generate a reusable CSS extraction schema from that page's HTML.
schema = JsonCssExtractionStrategy.generate_schema(
    result.html,
    llm_config=LLMConfig(
        provider="gemini/gemini-2.0-flash", api_token="env:GEMINI_API_KEY"
    ),
    target_json_example={
        "name": "McDonald's",
        "menu": [
            {"name": "Chicken Surprise Burger", "price": 76, "rating": 3.3},
            {"name": "Chicken Surprise Burger Combo", "price": 238, "rating": 3.3},
        ],
    },
)
```
This way you have the LLM generate the schema (from the HTML), but only once per site. Does this make sense? Try it once and let us know how it worked for you.
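To make "only once per site" concrete, here is a minimal stdlib sketch of a per-domain schema cache. The function names (`schema_cache_path`, `load_or_store_schema`) and the idea of passing the LLM call in as a `generate` callable are assumptions for illustration, not Crawl4AI API:

```python
import json
import os
from urllib.parse import urlparse


def schema_cache_path(url: str, cache_dir: str = "schemas") -> str:
    # One schema file per site, keyed by the domain of the URL.
    domain = urlparse(url).netloc or "unknown"
    return os.path.join(cache_dir, f"{domain}.json")


def load_or_store_schema(url: str, generate, cache_dir: str = "schemas") -> dict:
    # Return the cached schema if one was already generated for this site;
    # otherwise call `generate` (e.g. a wrapper around
    # JsonCssExtractionStrategy.generate_schema) exactly once and cache it.
    path = schema_cache_path(url, cache_dir)
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    schema = generate()
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "w") as f:
        json.dump(schema, f)
    return schema
```

With this, every page from the same domain reuses the cached schema, so the LLM is only hit the first time a new site is crawled.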
-
Hi everyone! I am trying to build an application with the help of Crawl4ai. If you have any suggestions, please help me with them, as I'm confused about how I should proceed. The application would do the following:
Crawl4ai can receive any website link, on which it will have to do specific scraping.
The website will certainly be product-related, meaning it will have products and then product-related PDFs, e.g. installation, specification, etc.
I want Crawl4ai to scrape those PDF links for me so that I can download the PDFs in a formatted way.
This is more or less what I want to do. I have done something similar, but it only works for one website, Hartzell. Here is how I did it:
The link I provide is the Hartzell products page link; from there, with the help of Crawl4ai, I scraped all the category links and stored them in a list.
Then I ran Crawl4ai on all those category links to scrape the product links.
Finally, I ran Crawl4ai on all the product links and stored all the product-related PDFs, which gave me the links of all the PDFs.
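The last stage above, collecting PDF links from a product page, can be sketched without any third-party library using Python's standard HTML parser. The class and function names here are illustrative assumptions; in practice Crawl4AI's extraction strategies do this work for you:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkExtractor(HTMLParser):
    # Collects absolute URLs of every <a href="..."> that points at a PDF.
    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.lower().endswith(".pdf"):
                # Resolve relative hrefs against the page URL.
                self.pdf_links.append(urljoin(self.base_url, href))


def extract_pdf_links(html: str, base_url: str) -> list:
    parser = PdfLinkExtractor(base_url)
    parser.feed(html)
    return parser.pdf_links
```

The same pattern (extract links at one level, feed them into the next crawl) is what the category-links and product-links stages do, just with different selectors.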
To scrape specific things from the website, e.g. category links, product links, and PDF links, I used the `JsonCssExtractionStrategy` present in Crawl4ai and gave it a schema in this extraction strategy, which I had to create manually for category links, product links, and PDF links.
Note: I'm also attaching the JSON file ([extracted_json.json](https://github.com/user-attachments/files/19337330/extracted_json.json)) that I receive right now as output after doing all the crawling and combining the results.
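A manually written schema for one of those link types might look roughly like this. The selectors are made-up assumptions (they are site-specific), while the overall shape with `baseSelector` and `fields` follows the schema format that Crawl4AI's `JsonCssExtractionStrategy` accepts:

```python
# Hypothetical hand-written schema: grab the href of every category link.
category_link_schema = {
    "name": "Category links",
    "baseSelector": "ul.category-list li",  # assumed selector, site-specific
    "fields": [
        {
            "name": "url",
            "selector": "a",
            "type": "attribute",  # extract an attribute, not the text
            "attribute": "href",
        },
    ],
}
```

Writing one of these per site and per link type is exactly the manual step that LLM-based schema generation is meant to replace.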
What I want to do is make this dynamic, so that it works with any website the user might provide, of course related to a product company like JCB or Hartzell.
Tell me if you want me to add anything else.
Thanks