Crawling PDF does... nothing? #844
myconite asked this question in Forums - Q&A (unanswered)
-
@myconite Here's the example for PDF processing in the library:

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.processors.pdf import PDFCrawlerStrategy, PDFContentScrapingStrategy
import asyncio

async def main():
    async with AsyncWebCrawler(crawler_strategy=PDFCrawlerStrategy()) as crawler:
        result = await crawler.arun(
            "https://arxiv.org/pdf/2310.06825.pdf",
            config=CrawlerRunConfig(
                scraping_strategy=PDFContentScrapingStrategy()
            )
        )
        print(result.markdown)  # Access extracted text
        print(result.metadata)  # Access PDF metadata (title, author, etc.)

asyncio.run(main())

As you can see, there are new crawler and scraper strategies that have to be passed in for PDF processing. Based on the CLI docs, it appears the CLI doesn't support them yet. I think the best outcome would be for the CLI to switch to the PDF crawler and scraper strategies automatically based on the file extension in the URL. Anyway, I'll flag this for development in an upcoming release. Thanks for pointing this out.
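Until the CLI does that, the check can be done in your own script. Below is a rough sketch of such a check, not anything that ships in crawl4ai: `looks_like_pdf` is a hypothetical helper name, and it uses only the standard library. It assumes a URL counts as a PDF if its path ends in `.pdf` or if a `Content-Type` header you already have says `application/pdf`.

```python
from typing import Optional
from urllib.parse import urlparse


def looks_like_pdf(url: str, content_type: Optional[str] = None) -> bool:
    """Hypothetical helper (not part of crawl4ai): guess whether a URL is a PDF."""
    # Prefer the Content-Type header when the caller already has one,
    # ignoring parameters such as "; charset=...".
    if content_type is not None:
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type == "application/pdf":
            return True
    # Fall back to the path extension; urlparse keeps the query string
    # and fragment out of .path, so "file.pdf?x=1" still matches.
    return urlparse(url).path.lower().endswith(".pdf")
```

A script could call this first and only pass `PDFCrawlerStrategy()` and `PDFContentScrapingStrategy()` when it returns true. The extension check alone will miss PDF links that have no `.pdf` suffix, which is why the header is checked as well.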
-
I tried within a script as well as with the new CLI... can't seem to get any results out of a PDF file. Am I missing something?
Here's an example:
$ crwl https://bitcoin.org/bitcoin.pdf
[INIT].... → Crawl4AI 0.5.0.post4
[FETCH]... ↓ https://bitcoin.org/bitcoin.pdf... | Status: True | Time: 2.48s
[SCRAPE].. ◆ https://bitcoin.org/bitcoin.pdf... | Time: 0.002s
[COMPLETE] ● https://bitcoin.org/bitcoin.pdf... | Status: True | Total: 2.49s
{
"url": "https://bitcoin.org/bitcoin.pdf",
"html": "<body style="height: 100%; width: 100%; overflow: hidden; margin:0px; background-color: rgb(38, 38, 38);"><embed name="F01336A35C7ACC4FE528B2905DD22915" style="position:absolute; left: 0; top: 0;" width="100%" height="100%" src="about:blank" type="application/pdf" internalid="F01336A35C7ACC4FE528B2905DD22915">",
"success": true,
"cleaned_html": "",
"media": {
"images": [],
"videos": [],
"audios": []
},
"links": {
"internal": [],
"external": []
},
"downloaded_files": null,
"js_execution_result": null,
"screenshot": null,
"pdf": null,
"extracted_content": null,
"metadata": {
"title": null,
"description": null,
"keywords": null,
"author": null
},
"error_message": "",
"session_id": null,
"response_headers": {
"accept-ranges": "bytes",
"alt-svc": "h3=":443"; ma=86400",
"cf-cache-status": "DYNAMIC",
"cf-ray": "921ab9f629265db9-VIE",
"content-length": "184292",
"content-type": "application/pdf",
"date": "Mon, 17 Mar 2025 07:20:12 GMT",
"etag": ""6783dca2-2cfe4"",
"last-modified": "Sun, 12 Jan 2025 15:15:46 GMT",
"nel": "{"success_fraction":0,"report_to":"cf-nel","max_age":604800}",
"report-to": "{"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v4?s=4yKO3L7Y2YOayzE0QB3R72%2BPbrN50ubF2qVCgJLcsgcTQ5Mf%2BzEBA0nZX6SB3DJcZVwkFUF5rBdwE94SjOEUsY5pOtudmtMQOrNqxrXXkpRLoi5Ht7PjxrHLeXBbqw%3D%3D"}],"group":"cf-nel","max_age":604800}",
"server": "cloudflare",
"server-timing": "cfL4;desc="?proto=TCP&rtt=8405&min_rtt=6875&rtt_var=3661&sent=7&recv=9&lost=0&retrans=0&sent_bytes=3977&recv_bytes=2564&delivery_rate=309760&cwnd=253&unsent_bytes=0&cid=dc37bd6174bc1726&ts=203&x=0"",
"strict-transport-security": "max-age=31536000; includeSubDomains; preload"
},
"status_code": 200,
"ssl_certificate": null,
"dispatch_result": null,
"redirected_url": "https://bitcoin.org/bitcoin.pdf",
"markdown": {
"raw_markdown": "\n",
"markdown_with_citations": "\n",
"references_markdown": "\n\n## References\n\n",
"fit_markdown": "",
"fit_html": ""
}
}