[Bug]: PruningContentFilter strips out <a> and <strong> tags completely #629

qoder1337 · 2025-01-29T10:24:46Z

qoder1337
Jan 29, 2025

crawl4ai version

Crawl4AI 0.4.247

Expected Behavior

PruningContentFilter works perfectly for my use case: it keeps just the meat / pure content.
BUT it removes <a> and <strong> tags completely, including the text inside them.

Is this reproducible?

Yes

Inputs Causing the Bug

Steps to Reproduce

Code snippets

### INPUT:
###<p>If you want to examine this <a href="https://www.domain.com">Raspberry Pi project</a> in greater detail and see how it's constructed, visit <strong>GitHub</strong> where you can also follow on X for updates.</p>

### OUTPUT ("Raspberry Pi project" and "GitHub" are missing):
### If you want to examine this in greater detail and see how it's constructed, visit where you can also follow on X for updates.

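from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

def get_md_conf():
    prune_filter = PruningContentFilter(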
        threshold_type="dynamic",
        min_word_threshold=20,
    )
    md_generator = DefaultMarkdownGenerator(
        content_filter=prune_filter,
    )
    return md_generator


def get_crawl_conf():
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=get_md_conf(),
        excluded_tags=["form", "header", "footer", "nav"],
        word_count_threshold=20,
    )
    return crawl_config

async def process_url(
    url: str,
    crawler: AsyncWebCrawler,
):
    crawl_config = get_crawl_conf()
    result = await crawler.arun(
        url=url,
        config=crawl_config,
    )
    if result.success:
        print(result.markdown_v2.fit_markdown)

OS

Windows

Python version

3.13

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

aravindkarnam · 2025-01-31T12:31:38Z

aravindkarnam
Jan 31, 2025
Collaborator

@qoder1337 Thanks for reporting this. I'll investigate this tomorrow.

aravindkarnam · 2025-01-31T18:02:18Z

aravindkarnam
Jan 31, 2025
Collaborator

RCA

@qoder1337 This can't exactly be classified as a bug. If you set up the PruningContentFilter as follows, it will also capture the <a> and <strong> tags in fit_markdown (i.e., you'll see their semantic meaning preserved in the markdown):

prune_filter = PruningContentFilter(
    threshold_type="fixed",
    min_word_threshold=20,
    threshold=-1,
)

The issue with this setting is that it also lets all the other useless tags through, making your output very heavy downstream.

Now, I think this (suppressing anchor and strong tags) is a feature rather than a bug 🤓. The pruning algorithm calculates a score from various factors, among which are tag_importance (used for the dynamic threshold) and tag_weights. If the score is above the threshold, the tags are preserved for markdown conversion; otherwise they are decomposed. Here they are:

        self.tag_importance = {
            "article": 1.5,
            "main": 1.4,
            "section": 1.3,
            "p": 1.2,
            "h1": 1.4,
            "h2": 1.3,
            "h3": 1.2,
            "div": 0.7,
            "span": 0.6,
        }


        self.tag_weights = {
            "div": 0.5,
            "p": 1.0,
            "article": 1.5,
            "section": 1.0,
            "span": 0.3,
            "li": 0.5,
            "ul": 0.5,
            "ol": 0.5,
            "h1": 1.2,
            "h2": 1.1,
            "h3": 1.0,
            "h4": 0.9,
            "h5": 0.8,
            "h6": 0.7,
        }

As you can see, neither <a> nor <strong> is in either list. They get a very low score and are therefore discarded (-1.0 actually, which is why I had to set a negative threshold in the earlier example).
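To make the effect concrete, here is a deliberately simplified toy model of the keep-or-discard decision. It is not crawl4ai's actual scoring code: the real score combines several factors, while this sketch looks only at the tag name. The names `toy_score`, `kept_tags`, and `UNLISTED_TAG_SCORE` are hypothetical, the -1.0 fallback is taken from the remark above, the 0.5 threshold is an arbitrary illustrative value, and keeping a tag when its score equals the threshold is an assumption.

```python
# Toy model only: NOT the library's implementation.
# Subset of the tag_weights table quoted above.
TAG_WEIGHTS = {"div": 0.5, "p": 1.0, "article": 1.5, "span": 0.3, "h1": 1.2}

# Assumption: tags missing from the table end up with a score of -1.0,
# as mentioned in the reply above.
UNLISTED_TAG_SCORE = -1.0


def toy_score(tag: str) -> float:
    """Return the table weight for a tag, or the fallback for unlisted tags."""
    return TAG_WEIGHTS.get(tag, UNLISTED_TAG_SCORE)


def kept_tags(tags: list[str], threshold: float) -> list[str]:
    """Keep tags whose toy score reaches the threshold (assumed: score >= threshold)."""
    return [tag for tag in tags if toy_score(tag) >= threshold]


tags = ["p", "a", "strong", "span"]
print(kept_tags(tags, threshold=0.5))  # only "p" survives
print(kept_tags(tags, threshold=-1))   # everything survives, including "span"
```

Under these assumptions, a positive threshold drops `a` and `strong` along with low-weight tags like `span`, while a threshold of -1 keeps them all, which matches the trade-off described above: the links and bold text come back, but so does the noise.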

So if we are to say that tags like <a> and text-styling tags (such as <strong>) should be preserved for their semantic meaning in markdown, that's a wider discussion to be had. Did you try BM25ContentFilter? How's that working out in this scenario?

@unclecode What's your view on this?

qoder1337 · 2025-02-06T18:46:00Z

qoder1337
Feb 6, 2025
Author

Hi, thanks for your answer, and sorry for the late reply. BM25ContentFilter doesn't work as intended for my use case here.

I went with the following setting. With it, the <a> and <strong> tags show up as Markdown syntax, but it seems to be a good compromise for my scenario because it keeps the other noise out:
prune_filter = PruningContentFilter(
    threshold=0.6,
    threshold_type="dynamic",
)

aravindkarnam · 2025-02-07T03:10:03Z

aravindkarnam
Feb 7, 2025
Collaborator

@qoder1337 Cool then! Since changing the threshold got you the output you wanted, I'm closing this issue. I'm also moving this to the forums so that it may be of use to others as well.
