RCA

@qoder1337 This can't exactly be classified as a bug. If you configure PruningContentFilter as follows, fit_markdown will also capture the <a> and <strong> tags (i.e. you'll see the semantic meaning preserved in the markdown):

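from crawl4ai.content_filter_strategy import PruningContentFilter

# threshold=-1 with a fixed threshold type lets <a> and <strong> through the filter
prune_filter = PruningContentFilter(
    threshold_type="fixed",
    min_word_threshold=20,
    threshold=-1,
)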

The issue with this setting is that it also lets all the other useless tags through, which makes your output much heavier downstream.
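To make the trade-off concrete, here is a toy model of fixed-threshold pruning. It is not crawl4ai's actual scoring code: the `Node` class, the `prune` helper, the scores, and the 0.5 cutoff are all made up for illustration, and it assumes scores never go below zero.

```python
# Toy model of fixed-threshold pruning. NOT crawl4ai's implementation:
# Node, prune, and every score below are hypothetical.
from dataclasses import dataclass


@dataclass
class Node:
    tag: str
    text: str
    score: float  # made-up relevance score, assumed to be >= 0


def prune(nodes, threshold):
    """Keep only the nodes whose score is at least the threshold."""
    return [n for n in nodes if n.score >= threshold]


nodes = [
    Node("p", "Main article paragraph with plenty of words.", 0.80),
    Node("a", "inline link", 0.20),
    Node("strong", "bold phrase", 0.25),
    Node("nav", "Home | About | Contact", 0.05),
]

# A hypothetical mid-range cutoff drops the short inline tags and the nav bar.
kept_default = [n.tag for n in prune(nodes, 0.5)]

# A cutoff of -1 is below every score, so nothing is pruned at all.
kept_all = [n.tag for n in prune(nodes, -1)]

print(kept_default)
print(kept_all)
```

Under these assumptions, lowering the cutoff far enough to keep the `a` and `strong` nodes also keeps the `nav` node, which is the "heavy output" problem described above.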

Now I think this (suppressing anchor and strong tags) is a feature rather than a bug 🤓. In the pruning algorithm, a score is calculated based on various factors, among which t…

Answer selected by aravindkarnam

This discussion was converted from issue #582 on February 07, 2025 03:10.