Great work! I'm wondering how you identify Q&A pairs in web data. Is there a rule-based filter? Could you open-source this part?
The original paper mentions:
"Q&A Extraction Question-and-answer data is inherently well-structured and embodies
a concentrated form of knowledge, making it valuable for problem-solving benchmarks
(Maini et al., 2024). Recent work reveal that these data can be found in pre-training
data with massive quantity (Yue et al., 2024). We thus integrate and further verify this in
MegaMath. Our pipeline contains two steps: (1) identify and extract Q&A pairs from the
raw documents; (2) refine the Q&A to make up or improve the intermediate reasoning steps."
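
To make the question concrete, here is the kind of naive rule-based filter I have in mind for step (1). This is only my own guess, not the MegaMath method: the marker words, the regexes, and the `looks_like_qa` helper are all hypothetical, and a real pipeline may well use a model-based classifier instead.

```python
import re

# Hypothetical marker words; not taken from the MegaMath paper or code.
# A line counts as a marker if it starts with the word, an optional number,
# and then ":", "." or ")" (e.g. "Question:", "Problem 3.", "Solution)").
QUESTION_RE = re.compile(
    r"^\s*(?:question|problem|exercise)\s*\d*\s*[:.)]",
    re.IGNORECASE | re.MULTILINE,
)
ANSWER_RE = re.compile(
    r"^\s*(?:answer|solution|proof)\s*\d*\s*[:.)]",
    re.IGNORECASE | re.MULTILINE,
)


def looks_like_qa(text: str) -> bool:
    """Return True if a question marker is followed later by an answer marker."""
    question = QUESTION_RE.search(text)
    if question is None:
        return False
    # Only accept answer markers that appear after the first question marker.
    return ANSWER_RE.search(text, question.end()) is not None


if __name__ == "__main__":
    print(looks_like_qa("Question: What is 2 + 2?\nAnswer: 4"))
    print(looks_like_qa("The weather is nice today."))
```

A filter like this would obviously miss Q&A content without explicit markers, which is why I'm curious what you actually used.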