Using same crawl4ai code for various websites. #860
Unanswered
VAIBHAVAGARWAL12
asked this question in
Forums - Q&A
Replies: 1 comment
-
Glad to hear that you are using Crawl4AI; your use case sounds very interesting. I can give you a rough idea of how to proceed with this generic data extraction, but you'll have to experiment with some trial and error.
```python
from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# `result` is the CrawlResult from a prior crawler.arun() call.
# Generate a reusable CSS extraction schema from that page's HTML.
schema = JsonCssExtractionStrategy.generate_schema(
    result.html,
    llm_config=LLMConfig(
        provider="gemini/gemini-2.0-flash", api_token="env:GEMINI_API_KEY"
    ),
    target_json_example={
        "name": "McDonald's",
        "menu": [
            {"name": "Chicken Surprise Burger", "price": 76, "rating": 3.3},
            {"name": "Chicken Surprise Burger Combo", "price": 238, "rating": 3.3},
        ],
    },
)
```
This way you have the LLM generate the schema (from the HTML), but only once per site. Does this make sense? Try it once and let us know how it worked for you.
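To make "only once per site" concrete, here is a minimal stdlib sketch of a per-domain schema cache. The function names (`schema_cache_path`, `load_or_store_schema`) and the idea of passing the LLM call in as a `generate` callable are assumptions for illustration, not Crawl4AI API:

```python
import json
import os
from urllib.parse import urlparse


def schema_cache_path(url: str, cache_dir: str = "schemas") -> str:
    # One schema file per site, keyed by the domain of the URL.
    domain = urlparse(url).netloc or "unknown"
    return os.path.join(cache_dir, f"{domain}.json")


def load_or_store_schema(url: str, generate, cache_dir: str = "schemas") -> dict:
    # Return the cached schema if one was already generated for this site;
    # otherwise call `generate` (e.g. a wrapper around
    # JsonCssExtractionStrategy.generate_schema) exactly once and cache it.
    path = schema_cache_path(url, cache_dir)
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    schema = generate()
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "w") as f:
        json.dump(schema, f)
    return schema
```

With this, every page from the same domain reuses the cached schema, so the LLM is only hit the first time a new site is crawled.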
-
Hi everyone! I am trying to build an application with the help of Crawl4ai. If you have any suggestions, please help me with them, as I'm confused about how I should proceed. The application would do the following:
Crawl4ai can receive any website link, on which it will have to do specific scraping.
The website will certainly be product-related, meaning it will have products and then product-related PDFs, e.g. installation, specification, etc.
I want Crawl4ai to scrape those PDF links for me so that I can download the PDFs in a formatted way.
This is more or less what I want to do. I have done something similar, but it only works for one website, Hartzell. Here is how I did it:
The link I provide is the Hartzell products page link; from there, with the help of Crawl4ai, I scraped all the category links and stored them in a list.
Then I ran Crawl4ai on all those category links to scrape the product links.
Finally, I ran Crawl4ai on all the product links and stored all the product-related PDFs, which gave me the links of all the PDFs.
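The last stage above, collecting PDF links from a product page, can be sketched without any third-party library using Python's standard HTML parser. The class and function names here are illustrative assumptions; in practice Crawl4AI's extraction strategies do this work for you:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkExtractor(HTMLParser):
    # Collects absolute URLs of every <a href="..."> that points at a PDF.
    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.lower().endswith(".pdf"):
                # Resolve relative hrefs against the page URL.
                self.pdf_links.append(urljoin(self.base_url, href))


def extract_pdf_links(html: str, base_url: str) -> list:
    parser = PdfLinkExtractor(base_url)
    parser.feed(html)
    return parser.pdf_links
```

The same pattern (extract links at one level, feed them into the next crawl) is what the category-links and product-links stages do, just with different selectors.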
To scrape specific things from the website, e.g. category links, product links, and PDF links, I used the `JsonCssExtractionStrategy` present in Crawl4ai and gave it a schema in this extraction strategy, which I had to create manually for category links, product links, and PDF links.
Note: I'm also attaching the JSON file ([extracted_json.json](https://github.com/user-attachments/files/19337330/extracted_json.json)) that I receive right now as output after doing all the crawling and combining the results.
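A manually written schema for one of those link types might look roughly like this. The selectors are made-up assumptions (they are site-specific), while the overall shape with `baseSelector` and `fields` follows the schema format that Crawl4AI's `JsonCssExtractionStrategy` accepts:

```python
# Hypothetical hand-written schema: grab the href of every category link.
category_link_schema = {
    "name": "Category links",
    "baseSelector": "ul.category-list li",  # assumed selector, site-specific
    "fields": [
        {
            "name": "url",
            "selector": "a",
            "type": "attribute",  # extract an attribute, not the text
            "attribute": "href",
        },
    ],
}
```

Writing one of these per site and per link type is exactly the manual step that LLM-based schema generation is meant to replace.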
What I want to do is make this dynamic, so that it works with any website the user might provide, of course related to a product company like JCB or Hartzell.
Tell me if you want me to add anything else.
Thanks