A tool for automatically generating hierarchical structures from scientific paper collections using:
- Embeddings clustering techniques
- LLM intelligence
The goal of this project is to develop an interpretable, hierarchical representation of scientific papers.
The requirements are listed in requirements.txt. Use the following commands to build the environment for this project:
conda create -n science python=3.8
conda activate science
pip install -r requirements.txt

We have two paper collections available:
- The 2k paper collection SciPile
- The 10k paper collection SciPileLarge
You can use the following command to download:
cd download/
TODO

The process has two main steps:
First, make sure you have generated all the embeddings for your papers using:
python generate.py --input_folder /path/to/your/papers --output_file ./embeddings/your_embedding_name.pkl

Then you can start creating the hierarchy with:
python main.py \
--embedding_generator qwen \
--summary_generator llama \
--clustering_method kmeans \
--evaluator qwen \
--clustering_direction top_down \
--base_path /project/directory/ \
--cluster_sizes 276 40 6 \
--run_time 1 \
--evaluate_time 1 \
--test_count 5 \
--pre_generated_embeddings_file ./embedding_file.pkl \
--evaluate_type normal \
--embedding_source all

- embedding_generator: Model used to generate embeddings (options: qwen, llama, etc.)
- summary_generator: Model used to generate summaries for clusters
- clustering_method: Algorithm for clustering (options: kmeans, hierarchical, etc.)
- clustering_direction: Direction of hierarchy building (top_down or bottom_up)
- cluster_sizes: Number of clusters at each level of the hierarchy
- embedding_source: Contribution type used to create the hierarchy:
  - all: Use all paper content
  - problem: Focus on problem statements
  - solution: Focus on proposed solutions
  - results: Focus on research results
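
To make the role of cluster_sizes more concrete, here is a minimal sketch of how a list of sizes could define the levels of a hierarchy with k-means. It is not the repository's implementation: the function names `kmeans` and `build_hierarchy` are hypothetical, the sketch assumes the sizes are ordered from the finest level to the coarsest, and it builds the levels by re-clustering centroids, which may differ from what main.py does for either clustering_direction.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on a list of equal-length float vectors.

    Returns (centroids, labels); labels[i] is the cluster index of points[i].
    """
    rng = random.Random(seed)
    # Initialise centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def build_hierarchy(embeddings, cluster_sizes, seed=0):
    """Cluster the embeddings, then repeatedly cluster the resulting centroids.

    Assumption: cluster_sizes runs from the finest level to the coarsest.
    Returns one label list per level. Level 0 labels the papers; each later
    level labels the clusters of the level below it.
    """
    levels = []
    current = embeddings
    for k in cluster_sizes:
        current, labels = kmeans(current, k, seed=seed)
        levels.append(labels)
    return levels


# Toy usage: 12 two-dimensional "embeddings", grouped into 4 and then 2 clusters.
toy_rng = random.Random(42)
toy = [[toy_rng.random(), toy_rng.random()] for _ in range(12)]
toy_levels = build_hierarchy(toy, cluster_sizes=[4, 2])
print(len(toy_levels[0]), len(toy_levels[1]))  # 12 labels, then 4 labels
```

With real data, the embeddings would come from the pre-generated .pkl file, whose internal layout is not documented here, so loading it is left out of the sketch.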
fLMSci is an LLM-based scientific hierarchography creation pipeline that offers two approaches:
| Script | Pipeline type | Main steps |
|---|---|---|
| run_par.sh | Parallel | 1. Generate topics & rationales → 2. Place topics in parallel → 3. Merge chunked taxonomy → 4. Map papers → (optional) Evaluate |
| run_incr.sh | Incremental | 1. Generate topics & rationales → 2. Incrementally place each topic → 3. Map papers → (optional) Evaluate |
Before running the pipelines, you need to:
- Place JSON files inside the jsons folder
- Give the shell scripts execute permission (one-time step):
chmod +x run_par.sh run_incr.sh
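
Before launching a pipeline, it can help to confirm that every file in the jsons folder actually parses. The following is a small optional pre-flight sketch; the helper name `check_json_folder` is hypothetical, and it only checks that each file is well-formed JSON, since the expected schema of the paper files is not described here.

```python
import json
from pathlib import Path


def check_json_folder(folder="jsons"):
    """Return (valid, invalid) lists of file names for *.json files in folder."""
    valid, invalid = [], []
    for path in sorted(Path(folder).glob("*.json")):
        try:
            with path.open(encoding="utf-8") as handle:
                json.load(handle)  # parse only; the content is not inspected
            valid.append(path.name)
        except (json.JSONDecodeError, UnicodeDecodeError):
            invalid.append(path.name)
    return valid, invalid


if __name__ == "__main__":
    ok, bad = check_json_folder("jsons")
    print(f"{len(ok)} parseable file(s), {len(bad)} unparseable file(s)")
    for name in bad:
        print("  cannot parse:", name)
```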
bash run_par.sh # basic run
bash run_par.sh --evaluate # run + evaluation

bash run_incr.sh # basic run
bash run_incr.sh --evaluate # run + evaluation

You can also customize the run with additional parameters:
bash run_incr.sh --batch_size 16 --max_depth 8 --evaluate

Note: Each pipeline can also be run step by step by following their individual README files.
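
If you want to launch the pipelines from Python, for example to sweep over several settings, the command line can be assembled programmatically. This is a minimal sketch: `build_command` and `run_pipeline` are hypothetical helpers, and they use only the flags shown above (--batch_size, --max_depth, --evaluate). Whether run_par.sh accepts the same optional flags as run_incr.sh is an assumption you should check against the scripts.

```python
import subprocess


def build_command(script, batch_size=None, max_depth=None, evaluate=False):
    """Build the argument list for one pipeline run.

    Only flags that are explicitly set are added, so the scripts' own
    defaults apply otherwise.
    """
    cmd = ["bash", script]
    if batch_size is not None:
        cmd += ["--batch_size", str(batch_size)]
    if max_depth is not None:
        cmd += ["--max_depth", str(max_depth)]
    if evaluate:
        cmd.append("--evaluate")
    return cmd


def run_pipeline(script, **options):
    """Run the pipeline and raise CalledProcessError if the script fails."""
    return subprocess.run(build_command(script, **options), check=True)


# Show the command that corresponds to the customized run above.
print(" ".join(build_command("run_incr.sh", batch_size=16, max_depth=8, evaluate=True)))
```

Calling `run_pipeline("run_incr.sh", batch_size=16, max_depth=8, evaluate=True)` from the repository root would then execute that command.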
See this Hugging Face demo: https://huggingface.co/spaces/jhu-clsp/ScienceHierarchography
@article{gao2025sciencehierarchographyhierarchicalorganization,
title={Science Hierarchography: Hierarchical Organization of Science Literature},
author={Muhan Gao and Jash Shah and Weiqi Wang and Daniel Khashabi},
year={2025},
eprint={2504.13834},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.13834},
}