🎨 SCIENCE HIERARCHOGRAPHY: Hierarchical Organization of Science Literature

A tool for automatically generating hierarchical structures from scientific paper collections using:

Embedding clustering techniques
LLM intelligence

The goal of this project is to develop an interpretable, hierarchical representation of scientific papers.

📋 Table of Contents

Requirements
Data Preparation
Approaches
- SciChic Hierarchy Generation
- fLMSci Pipeline
Parameters Explanation

💡 Requirements

The requirements are listed in requirements.txt. Use the following commands to set up the environment for this project:

conda create -n science python=3.8
conda activate science
pip install -r requirements.txt

🗂️ Data Preparation

We have two paper collections available:

The 2k paper collection SciPile
The 10k paper collection SciPileLarge

You can use the following command to download:

cd download/
TODO

🔬 Approaches

🔮 SciChic Hierarchy Generation

The process has two main steps:

Generate Embeddings

First, make sure you have generated all the embeddings for your papers using:

python generate.py --input_folder /path/to/your/papers --output_file ./embeddings/your_embedding_name.pkl
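
The layout of the generated .pkl file is not documented here. As a rough sketch, assuming (hypothetically) that it stores a dict mapping paper IDs to float vectors, a quick sanity check before clustering might look like the following. The helper name check_embeddings and the toy file are illustrative only.

```python
import pickle
import tempfile
from pathlib import Path


def check_embeddings(path):
    """Load a pickle and verify it is a non-empty dict of equal-length vectors.

    ASSUMPTION: the file holds {paper_id: [float, ...]}; the real layout
    written by generate.py may differ.
    """
    with open(path, "rb") as f:
        data = pickle.load(f)
    if not isinstance(data, dict) or not data:
        raise ValueError("expected a non-empty dict of embeddings")
    dims = {len(vec) for vec in data.values()}
    if len(dims) != 1:
        raise ValueError(f"inconsistent embedding dimensions: {sorted(dims)}")
    return len(data), dims.pop()


# Demo with a toy file standing in for ./embeddings/your_embedding_name.pkl
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_embeddings.pkl"
    toy = {"paper_a": [0.1, 0.2, 0.3], "paper_b": [0.4, 0.5, 0.6]}
    with open(demo_path, "wb") as f:
        pickle.dump(toy, f)
    n_papers, dim = check_embeddings(demo_path)
    print(n_papers, dim)
```

Only unpickle files you generated yourself or otherwise trust, since loading a pickle can execute arbitrary code.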

Create Hierarchy

Then create the hierarchy, pointing --pre_generated_embeddings_file at the embeddings file produced in the previous step:

python main.py \
  --embedding_generator qwen \
  --summary_generator llama \
  --clustering_method kmeans \
  --evaluator qwen \
  --clustering_direction top_down \
  --base_path /project/directory/ \
  --cluster_sizes 276 40 6 \
  --run_time 1 \
  --evaluate_time 1 \
  --test_count 5 \
  --pre_generated_embeddings_file ./embedding_file.pkl \
  --evaluate_type normal \
  --embedding_source all
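
The command above mixes choice flags with a multi-value flag (--cluster_sizes 276 40 6). As an illustration of how such flags are typically read with argparse, here is a hypothetical parser covering a subset of them. It is not the parser in main.py, and the defaults shown are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser for a subset of the flags shown above."""
    p = argparse.ArgumentParser(description="illustrative flag parsing")
    p.add_argument("--clustering_method", default="kmeans")  # assumed default
    p.add_argument(
        "--clustering_direction",
        choices=["top_down", "bottom_up"],
        default="top_down",  # assumed default
    )
    # nargs="+" collects one or more integers into a list, one per level.
    p.add_argument("--cluster_sizes", type=int, nargs="+", required=True)
    p.add_argument(
        "--embedding_source",
        choices=["all", "problem", "solution", "results"],
        default="all",  # assumed default
    )
    p.add_argument("--pre_generated_embeddings_file")
    return p


args = build_parser().parse_args(
    ["--cluster_sizes", "276", "40", "6", "--clustering_direction", "top_down"]
)
print(args.cluster_sizes, args.clustering_direction)
```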

Parameters Explanation

embedding_generator: Model used to generate embeddings (options: qwen, llama, etc.)
summary_generator: Model used to generate summaries for clusters
clustering_method: Algorithm for clustering (options: kmeans, hierarchical, etc.)
clustering_direction: Direction of hierarchy building (top_down or bottom_up)
cluster_sizes: Number of clusters at each level of the hierarchy
embedding_source: Contribution type used to create the hierarchy:
- all: Use all paper content
- problem: Focus on problem statements
- solution: Focus on proposed solutions
- results: Focus on research results
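
To make cluster_sizes more concrete, here is a minimal sketch of one way a multi-level hierarchy can be built from embeddings: cluster the papers, then cluster the resulting centroids, once per entry in cluster_sizes. This corresponds roughly to a bottom-up reading and uses plain Python with toy 2-D points. It is an illustration, not the project's implementation, and the names kmeans and build_levels are hypothetical.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, labels


def build_levels(embeddings, cluster_sizes):
    """Cluster the papers, then repeatedly cluster the resulting centroids.

    Returns one label list per level: level 0 assigns papers to clusters,
    level i > 0 assigns level i-1 clusters to coarser clusters.
    """
    levels, current = [], embeddings
    for k in cluster_sizes:
        current, labels = kmeans(current, k)
        levels.append(labels)
    return levels


# Toy stand-in for paper embeddings: 8 points in 2-D.
toy = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0],
       [0.0, 5.0], [0.1, 5.0], [5.0, 0.0], [5.1, 0.0]]
levels = build_levels(toy, [4, 2])
print(levels)
```

With the sizes from the example command (276 40 6), the same loop would yield three levels: 276 fine-grained clusters, grouped into 40, grouped into 6.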

🧵 fLMSci Pipeline

fLMSci is an LLM-based scientific hierarchography creation pipeline that offers two approaches:

Pipeline Types

| Script | Pipeline type | Main steps |
| --- | --- | --- |
| run_par.sh | Parallel | 1. Generate topics & rationales → 2. Place topics in parallel → 3. Merge chunked taxonomy → 4. Map papers → (optional) Evaluate |
| run_incr.sh | Incremental | 1. Generate topics & rationales → 2. Incrementally place each topic → 3. Map papers → (optional) Evaluate |
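
The structural difference between the two pipelines can be sketched in a few lines. In the toy example below, place is a hypothetical stand-in for the LLM placement call: it files a "parent/child" topic string under its parent. The incremental variant grows one taxonomy topic by topic, while the parallel variant builds partial taxonomies per chunk and merges them afterwards. None of these function names come from the fLMSci code.

```python
from concurrent.futures import ThreadPoolExecutor


def place(taxonomy, topic):
    """Toy stand-in for an LLM placement call (hypothetical rule)."""
    parent, child = topic.split("/", 1)
    children = taxonomy.setdefault(parent, [])
    if child not in children:
        children.append(child)
    return taxonomy


def incremental(topics):
    """Place topics one at a time into a single growing taxonomy."""
    taxonomy = {}
    for topic in topics:
        place(taxonomy, topic)
    return taxonomy


def merge(partials):
    """Union the partial taxonomies, dropping duplicate children."""
    merged = {}
    for partial in partials:
        for parent, children in partial.items():
            bucket = merged.setdefault(parent, [])
            for child in children:
                if child not in bucket:
                    bucket.append(child)
    return merged


def parallel(topics, chunk_size=2):
    """Place each chunk independently in a thread pool, then merge."""
    chunks = [topics[i:i + chunk_size] for i in range(0, len(topics), chunk_size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(incremental, chunks))
    return merge(partials)


topics = [
    "nlp/parsing",
    "vision/detection",
    "nlp/translation",
    "vision/segmentation",
    "nlp/parsing",
]
tax_incr = incremental(topics)
tax_par = parallel(topics)
print(tax_incr)
print(tax_par)
```

With this deterministic toy rule both variants produce the same taxonomy. With a real LLM they need not agree, because each incremental placement can see every earlier decision, whereas chunks placed in parallel only meet at the merge step.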

Setup & Execution

Before running the pipelines, you need to:

Place JSON files inside the jsons folder
Give the shell scripts execute permission (one-time step):
```
chmod +x run_par.sh run_incr.sh
```
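
Before launching a long run, it can help to confirm that every file in the jsons folder actually parses. The following is a small sketch of such a check. The expected schema of the JSON files is not described here, so the sketch only tests syntax, and the "title" key in the demo data is hypothetical.

```python
import json
import tempfile
from pathlib import Path


def find_invalid_json(folder):
    """Return names of *.json files in `folder` that fail to parse."""
    bad = []
    for path in sorted(Path(folder).glob("*.json")):
        try:
            with open(path, encoding="utf-8") as f:
                json.load(f)
        except (json.JSONDecodeError, UnicodeDecodeError):
            bad.append(path.name)
    return bad


# Demo on a temporary folder standing in for the real jsons directory.
with tempfile.TemporaryDirectory() as tmp:
    jsons = Path(tmp) / "jsons"
    jsons.mkdir()
    (jsons / "good.json").write_text('{"title": "A paper"}', encoding="utf-8")
    (jsons / "broken.json").write_text('{"title": ', encoding="utf-8")
    invalid = find_invalid_json(jsons)
    print(invalid)
```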

Running the Parallel Pipeline

bash run_par.sh                # basic run
bash run_par.sh --evaluate     # run + evaluation

Running the Incremental Pipeline

bash run_incr.sh               # basic run
bash run_incr.sh --evaluate    # run + evaluation

You can also customize the run with additional parameters:

bash run_incr.sh --batch_size 16 --max_depth 8 --evaluate

Note: Each pipeline can also be run step by step by following its own README file.

Demo of the resulting hierarchies

See this Hugging Face demo: https://huggingface.co/spaces/jhu-clsp/ScienceHierarchography

Relevant reference

@article{gao2025sciencehierarchographyhierarchicalorganization,
      title={Science Hierarchography: Hierarchical Organization of Science Literature}, 
      author={Muhan Gao and Jash Shah and Weiqi Wang and Daniel Khashabi},
      year={2025},
      eprint={2504.13834},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.13834}, 
}
