[Bug]: PruningContentFilter strips out <a> and <strong> tags completely #629
crawl4ai version
Crawl4AI 0.4.247

Expected Behavior
PruningContentFilter works perfect for my use case, just keeps the meat / pure content.

Is this reproducible?
Yes

Inputs Causing the Bug
Steps to Reproduce
Code snippets

### INPUT:
###<p>If you want to examine this <a href="https://www.domain.com">Raspberry Pi project</a> in greater detail and see how it's constructed, visit <strong>GitHub</strong> where you can also follow on X for updates.</p>
### OUTPUT ("Raspberry Pi project" and "GitHub" are missing):
### If you want to examine this in greater detail and see how it's constructed, visit where you can also follow on X for updates.
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator


def get_md_conf():
    # Markdown generator whose fit_markdown is produced by the pruning filter
    prune_filter = PruningContentFilter(
        threshold_type="dynamic",
        min_word_threshold=20,
    )
    md_generator = DefaultMarkdownGenerator(
        content_filter=prune_filter,
    )
    return md_generator


def get_crawl_conf():
    # Bypass the cache so every run re-crawls and re-filters the page
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=get_md_conf(),
        excluded_tags=["form", "header", "footer", "nav"],
        word_count_threshold=20,
    )
    return crawl_config


async def process_url(
    url: str,
    crawler: AsyncWebCrawler,
):
    crawl_config = get_crawl_conf()
    result = await crawler.arun(
        url=url,
        config=crawl_config,
    )
    if result.success:
        print(result.markdown_v2.fit_markdown)

OS
Windows

Python version
3.13

Browser
No response

Browser version
No response

Error logs & Screenshots (if applicable)
No response
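
The symptom can be checked mechanically by comparing the text inside inline tags of the input HTML with the filtered output. The sketch below uses only the Python standard library and does not call crawl4ai. The names `InlineTextCollector`, `inline_texts` and `missing_inline_text` are hypothetical helpers introduced for illustration.

```python
from html.parser import HTMLParser


class InlineTextCollector(HTMLParser):
    """Collect the text found inside selected inline tags (default: <a>, <strong>)."""

    def __init__(self, tags=("a", "strong")):
        super().__init__()
        self.tags = set(tags)
        self.depth = 0      # > 0 while inside one of the watched tags
        self.buffer = []    # text pieces of the element currently being read
        self.texts = []     # completed inline texts, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                text = "".join(self.buffer).strip()
                if text:
                    self.texts.append(text)
                self.buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def inline_texts(html, tags=("a", "strong")):
    """Return the texts wrapped by the given inline tags."""
    collector = InlineTextCollector(tags)
    collector.feed(html)
    collector.close()
    return collector.texts


def missing_inline_text(html, output):
    """Return inline-tag texts from `html` that do not appear in `output`."""
    return [text for text in inline_texts(html) if text not in output]


# Input and output taken from the report above.
html_in = (
    '<p>If you want to examine this <a href="https://www.domain.com">'
    "Raspberry Pi project</a> in greater detail and see how it's constructed, "
    "visit <strong>GitHub</strong> where you can also follow on X for updates.</p>"
)
filtered = (
    "If you want to examine this in greater detail and see how it's "
    "constructed, visit where you can also follow on X for updates."
)

print(missing_inline_text(html_in, filtered))
# -> ['Raspberry Pi project', 'GitHub']
```

Running the same check against `result.markdown_v2.fit_markdown` for different filter settings shows quickly whether a given threshold keeps the link and bold text.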
Replies: 4 comments

@qoder1337 Thanks for reporting this. I'll investigate this tomorrow.

RCA

@qoder1337 This can't exactly be classified as a bug. If you set the PruningContentFilter as follows, it will also capture the <a> and <strong> tags in fit_markdown (i.e. you'll see the semantic meaning preserved in the markdown):

prune_filter = PruningContentFilter(
    threshold_type="fixed",
    min_word_threshold=20,
    threshold=-1
)

The issue with this setting is that it also lets all the other useless tags through, which makes your output very heavy downstream.

Now I think this (suppressing anchor and strong tags) is a feature rather than a bug 🤓. In the pruning algorithm, a score is calculated based on various factors, among which there's:

self.tag_importance = {
    "article": 1.5,
    "main": 1.4,
    "section": 1.3,
    "p": 1.2,
    "h1": 1.4,
    "h2": 1.3,
    "h3": 1.2,
    "div": 0.7,
    "span": 0.6,
}
self.tag_weights = {
    "div": 0.5,
    "p": 1.0,
    "article": 1.5,
    "section": 1.0,
    "span": 0.3,
    "li": 0.5,
    "ul": 0.5,
    "ol": 0.5,
    "h1": 1.2,
    "h2": 1.1,
    "h3": 1.0,
    "h4": 0.9,
    "h5": 0.8,
    "h6": 0.7,
}

As you can see both So if we are to say tags like

@unclecode What's your view on this?
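
Neither `a` nor `strong` has an entry in the two tables above, so any per-tag lookup for them has to fall back to some default. The sketch below is a simplified illustration of that idea and of a fixed threshold, not crawl4ai's actual scoring code. The fallback value of 0.5, the helper names `tag_score` and `keep_node`, and the `>=` comparison are all assumptions made for the example.

```python
# Tag weights copied from the tag_weights table quoted above.
TAG_WEIGHTS = {
    "div": 0.5,
    "p": 1.0,
    "article": 1.5,
    "section": 1.0,
    "span": 0.3,
    "li": 0.5,
    "ul": 0.5,
    "ol": 0.5,
    "h1": 1.2,
    "h2": 1.1,
    "h3": 1.0,
    "h4": 0.9,
    "h5": 0.8,
    "h6": 0.7,
}

# Assumption for illustration only: weight used for tags missing from the table.
DEFAULT_WEIGHT = 0.5


def tag_score(tag, weights=TAG_WEIGHTS, default=DEFAULT_WEIGHT):
    """Look up a tag's weight, falling back to the default for unlisted tags."""
    return weights.get(tag, default)


def keep_node(score, threshold):
    """Fixed-threshold pruning: keep a node only if its score reaches the threshold."""
    return score >= threshold


for tag in ("p", "a", "strong"):
    score = tag_score(tag)
    # With threshold=-1, every non-negative score passes, so nothing is pruned.
    print(tag, score, keep_node(score, threshold=-1))
```

In this simplified model, a threshold of -1 lets every node with a non-negative score through, which matches the observation that the setting keeps `<a>` and `<strong>` but also everything else.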

Hi, thanks for your answer, and sorry for the late reply. BM25ContentFilter doesn't work as intended for my use case here. I went with the following setting, where markdown syntax gets shown for the <a> and <strong> tags, but it seems to be a good compromise for my scenario to keep the other noise out:

@qoder1337 Cool then! Since changing the threshold got you your desired output, I'm closing this issue. I'm also moving this to the forums so that it may be of use to others as well.