ProfOlaf

ProfOlaf was built to help researchers with literature reviews. It automates the process of snowballing articles through an initial seed, and helps raters through the process of screening. It also provides several LLM-assisted tools for article analysis.

ProfOlaf is described in our paper.

If you use our tool, please cite it:

@article{afonso2025profolaf,
  title={ProfOlaf: Semi-Automated Tool for Systematic Literature Reviews},
  author={Afonso, Martim and Saavedra, Nuno and Louren{\c{c}}o, Bruno and Mendes, Alexandra and Ferreira, Jo{\~a}o},
  journal={arXiv preprint arXiv:2510.26750},
  year={2025}
}

This document provides a walkthrough of ProfOlaf, demonstrating how the tool supports automated and semi-automated snowballing for literature reviews. The tool is available both as a web application and as a command-line interface. Here we describe the typical usage of the command-line version, which exposes the full pipeline.

Prerequisites and Input

Before running ProfOlaf, the user must prepare:

Seed file: A plain-text (.txt) or JSON file containing the titles of the seed articles. These articles represent the starting point of the snowballing process.
Python environment: Install dependencies with pip install -r requirements.txt.
LLM configuration (optional): For LLM-assisted screening and topic modeling, create utils/article_llm_analysis/llm_config.json with API keys for OpenAI, Gemini, or Anthropic.

Main Snowballing Pipeline

The snowballing workflow consists of the following steps, executed sequentially:

`utils/setup/generate_search_conf.py`

Generates the search configuration used to query the supported scholarly databases. It stores metadata filtering criteria, file paths, and search method in confs/search_conf.json. Run interactively—no arguments required.

Prompts for:

Year interval (start/end)
Venue rank list (A*, A, B, C, D, Q1–Q4, NA)
Proxy key (or env variable name) for web scraping
Initial seed file path
Database path
Output CSV path
Search method: google_scholar, semantic_scholar, or dblp
Annotations (inclusion criteria)

`0_generate_snowball_start.py`

Initializes the snowballing process using the seed file and stores the initial set of articles in the database.

Argument	Description	Default
`--input_file`	Path to seed file (txt or json)	`search_conf["initial_file"]`
`--delay`	Delay between API requests (seconds)	`1.0`
`--db_path`	Database path	`search_conf["db_path"]`
`--search_method`	`google_scholar`, `semantic_scholar`, or `dblp`	`search_conf["search_method"]`

`1_start_iteration.py`

Starts a new snowballing iteration, collecting forward and backward citations from the set of articles of the previous iteration (or the initial set when starting a new search). Currently, only Semantic Scholar provides both backward and forward snowballing; Google Scholar provides citations only. The CLI script also supports repairing broken or unidentified references.

Argument	Description	Default
`--iteration`	Iteration number	required
`--db_path`	Database path	`search_conf["db_path"]`
`--search_method`	`google_scholar`, `semantic_scholar`, or `dblp`	`search_conf["search_method"]`
`--repair`	Repair method for broken references: `remove` or `manual`	required
`--verbose`	Verbose output	flag

Note: The full citation collection is available through the web application (app.py). The CLI script focuses on reference repair.

`2_remove_duplicates.py`

Identifies and removes duplicate entries across databases based on title similarity.

Argument	Description	Default
`--db_path`	Database path	`search_conf["db_path"]`
`--iterations`	Iteration numbers to process (e.g. `1 2 3`)	—
`--similarity_threshold`	Title similarity threshold (0.0–1.0)	`0.8`
`--auto_remove`	Remove duplicates without confirmation	flag

`3_get_bibtex.py`

Retrieves BibTeX metadata for the collected articles. Without a web-scraping proxy, Semantic Scholar is recommended. Too many requests to Google Scholar may result in a block.

Argument	Description	Default
`--iteration`	Iteration number	—
`--db_path`	Database path	`search_conf["db_path"]`
`--batch_size`	Batch size for processing	`1`
`--max_workers`	Max parallel workers	`3`
`--parallel`	Disable parallel processing	flag
`--delay`	Delay between requests (seconds)	`1.0`
`--search_method`	Search method	`search_conf["search_method"]`

(Optional) `4_generate_conf_rank.py`

Filters articles based on venue ranking if you wish to restrict the corpus to specific publication venues. Uses the venue rank list from search_conf.json and fetches ranks from CORE and Scimago when needed.

Argument	Description	Default
`--iteration`	Iteration number	—
`--db_path`	Database path	`search_conf["db_path"]`

`5_filter_by_metadata.py`

Filters articles according to metadata attributes (year, venue, online availability, language). Configurable via search_conf.json.

Argument	Description	Default
`--iteration`	Iteration number	required
`--db_path`	Database path	`search_conf["db_path"]`
`--disable_venue_check`	Skip venue check	flag
`--disable_year_check`	Skip year check	flag
`--disable_english_check`	Skip language check	flag
`--disable_download_check`	Skip download availability check	flag

`6_filter_by_title.py`

Performs title-based screening. The user is interactively prompted to decide whether to keep or discard each article, with a brief justification. Users are encouraged to be conservative and only discard clearly irrelevant articles. Optionally, LLM-based screening can be enabled.

Argument	Description	Default
`--iteration`	Iteration number	required
`--db-path`	Database path	`search_conf["db_path"]`
`--rater`	Rater name or ID	required
`--llm`	Use LLM for screening	`False`
`--model`	LLM model (e.g. `gpt-4o`)	`gpt-4o`
`--api-key`	API key for LLM	—

`7_solve_title_disagreements.py`

Resolves disagreements between multiple raters. Presents articles for which raters disagreed, along with their reasoning, and prompts them to reach a consensus decision.

Argument	Description
`--iteration`	Iteration number
`--search_dbs`	One or more database paths (e.g. `db1.db db2.db`)

`8_filter_by_content.py`

Performs content-based screening using the full text (abstract and introduction) of the articles, following the same interaction model as title-based filtering.

Argument	Description	Default
`--iteration`	Iteration number	required
`--db-path`	Database path	`search_conf["db_path"]`
`--rater`	Rater ID	required
`--llm`	Use LLM for screening	`False`
`--model`	LLM model	`gpt-4o`
`--api-key`	API key for LLM	—
`--article_folder`	Folder to store downloaded PDFs (required for LLM content screening)	—

`9_solve_content_disagreements.py`

Resolves rater disagreements from content-based screening. The resulting set after this step marks the end of an iteration.

Argument	Description
`--iteration`	Iteration number
`--search_dbs`	One or more database paths

Iteration

Steps 1 through 9 are repeated until no new articles are discovered.

`10_generate_csv.py`

Produces the final CSV file containing the selected articles and their metadata.

Argument	Description	Default
`--iterations`	Iteration numbers to include (e.g. `1 2 3`)	—
`--db_path`	Database path	`search_conf["db_path"]`
`--output_path`	Output CSV path	`search_conf["csv_path"]`

Additional Analysis Scripts

For post hoc analysis of the final article set, run:

`utils/article_llm_analysis/generate_llm_analysis_conf.py`

The analysis configuration is typically created via the web application, which writes confs/analysis_conf.json with:

articles_folder: Path to store downloaded PDFs
csv_path: Path to the final CSV
seed_file: Seed file for topic modeling
output_path: Output directory for topic modeling results
topics_file: Topics file name
llm_config: Path to LLM configuration

For CLI-only usage, create confs/analysis_conf.json manually with these keys.

`11_download_pdfs.py`

Downloads all article PDFs to a folder for subsequent analysis.

Argument	Description	Default
`--csv_file`	Input CSV with article metadata	`search_conf["csv_path"]`
`--article_folder`	Output folder for PDFs	`analysis_conf["articles_folder"]`

Note: Requires confs/search_conf.json and confs/analysis_conf.json.

Topic Modeling

Topic modeling is run through five scripts using TopicGPT with LangChain LLM support. All scripts expect confs/analysis_conf.json and an LLM config (e.g. utils/article_llm_analysis/llm_config.json). Use --help-detailed on any topic modeling script for full documentation.

`11_topic_modeling_lvl1.py`

Generates high-level topics from the article set. Extracts text from PDFs automatically.

Argument	Description	Default
`pdf_folder`	Folder containing PDFs	—
`--output-dir`	Output directory	`analysis_conf["output_path"]`
`--config`	LLM config file	`paper_analysis/llm_config.json`
`--provider`	`openai`, `gemini`, `anthropic`	`openai`
`--seed-file`	Seed file for topic generation	`analysis_conf["output_path"]/analysis_conf["seed_file"]`
`--prompt-file`	Custom prompt file	—
`--max-workers`	Parallel workers	`4`

`11_topic_modeling_lvl2.py`

Generates more specific sub-topics from the level-1 topics.

Argument	Description	Default
`pdf_folder`	Folder containing PDFs	—
`--output-dir`	Output directory	`analysis_conf["output_path"]`
`--seed-file`	Level-1 topics file (e.g. `topics_lvl1.md`)	`output_dir/topics_lvl1.md`
`--prompt-file`	Custom prompt for level 2	—
`--config`, `--provider`, `--max-workers`	Same as level 1	—

`11_topic_modeling_refine.py`

Merges similar topics and removes overly specific or redundant topics (e.g. those in <1% of articles).

Argument	Description	Default
`pdf_folder`	Folder containing PDFs	—
`--topic-file`	Topic file to refine	required
`--generation-file`	Generation JSON file	required
`--out-file`	Refined topics output	`output_dir/topics_refined.md`
`--updated-file`	Updated generation JSON	`output_dir/refinement.json`
`--no-remove`	Do not remove topics during refinement	flag
`--prompt-file`	Custom refinement prompt	—

`11_topic_modeling_assign.py`

Assigns the generated topics to each article.

Argument	Description	Default
`pdf_folder`	Folder containing PDFs	—
`--topic-file`	Topic file (auto-detected if omitted)	—
`--prompt-file`	Assignment prompt	`utils/prompts/topic_modeling_prompts/assignment.txt`
`--data-file`	Data file (e.g. `data.jsonl`)	`output_dir/data.jsonl`
`--output-file`	Assignments output	`output_dir/assignments.jsonl`

`11_topic_modeling_correct.py`

Corrects hallucinated or erroneous topic assignments.

Argument	Description	Default
`pdf_folder`	Folder containing PDFs	—
`--data-path`	Assignments file (e.g. `assignments.jsonl`)	`output_dir/assignments.jsonl`
`--output-path`	Corrected assignments output	`output_dir/corrected_assignments.jsonl`
`--topic-path`	Topic file	auto-detected
`--prompt-path`	Correction prompt	—

Task Assistant

The task assistant module is run using 11_task_assistant.py. Add prompt files (.txt) under a folder specified in the analysis configuration; the script runs the LLM on each article for each prompt.

Argument	Description	Default
`prompts_folder`	Folder containing `.txt` prompt files	required
`--config`	LLM config file	`paper_analysis/llm_config.json`
`--provider`	`openai`, `gemini`, `anthropic`, `openai-completion`	`openai`
`--output`	Output file for results (JSON)	—
`--pdf-folder`	Folder containing PDFs	—
`--single-pdf`	Process a single PDF	—
`--max-workers`	Parallel workers	`1`

Web Application

Run the web application with:

python app.py

The app is available at http://localhost:5000 and provides a graphical interface for the entire pipeline, including snowball start generation, iteration collection, BibTeX retrieval, screening, and analysis configuration.

Docker

Build and run with Docker:

docker build -t profolaf .
docker run -p 5000:5000 profolaf

On startup, choose between the web application (port 5000) or an interactive shell for running CLI scripts.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ProfOlaf

Prerequisites and Input

Main Snowballing Pipeline

`utils/setup/generate_search_conf.py`

`0_generate_snowball_start.py`

`1_start_iteration.py`

`2_remove_duplicates.py`

`3_get_bibtex.py`

(Optional) `4_generate_conf_rank.py`

`5_filter_by_metadata.py`

`6_filter_by_title.py`

`7_solve_title_disagreements.py`

`8_filter_by_content.py`

`9_solve_content_disagreements.py`

Iteration

`10_generate_csv.py`

Additional Analysis Scripts

`utils/article_llm_analysis/generate_llm_analysis_conf.py`

`11_download_pdfs.py`

Topic Modeling

`11_topic_modeling_lvl1.py`

`11_topic_modeling_lvl2.py`

`11_topic_modeling_refine.py`

`11_topic_modeling_assign.py`

`11_topic_modeling_correct.py`

Task Assistant

Web Application

Docker

About

Uh oh!

Releases

Packages

Contributors 3

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 105 Commits
confs		confs
databases		databases
experiments		experiments
static		static
templates		templates
utils		utils
0_generate_snowball_start.py		0_generate_snowball_start.py
10_generate_csv.py		10_generate_csv.py
11_download_pdfs.py		11_download_pdfs.py
11_task_assistant.py		11_task_assistant.py
11_topic_modeling_assign.py		11_topic_modeling_assign.py
11_topic_modeling_correct.py		11_topic_modeling_correct.py
11_topic_modeling_lvl1.py		11_topic_modeling_lvl1.py
11_topic_modeling_lvl2.py		11_topic_modeling_lvl2.py
11_topic_modeling_refine.py		11_topic_modeling_refine.py
1_start_iteration.py		1_start_iteration.py
2_remove_duplicates.py		2_remove_duplicates.py
3_get_bibtex.py		3_get_bibtex.py
4_generate_conf_rank.py		4_generate_conf_rank.py
5_filter_by_metadata.py		5_filter_by_metadata.py
6_filter_by_title.py		6_filter_by_title.py
7_solve_title_disagreements.py		7_solve_title_disagreements.py
8_filter_by_content.py		8_filter_by_content.py
9_solve_content_disagreements.py		9_solve_content_disagreements.py
Dockerfile		Dockerfile
README.md		README.md
app.py		app.py
banner.webp		banner.webp
logo.png		logo.png
requirements.txt		requirements.txt

sr-lab/ProfOlaf

Folders and files

Latest commit

History

Repository files navigation

ProfOlaf

Prerequisites and Input

Main Snowballing Pipeline

utils/setup/generate_search_conf.py

0_generate_snowball_start.py

1_start_iteration.py

2_remove_duplicates.py

3_get_bibtex.py

(Optional) 4_generate_conf_rank.py

5_filter_by_metadata.py

6_filter_by_title.py

7_solve_title_disagreements.py

8_filter_by_content.py

9_solve_content_disagreements.py

Iteration

10_generate_csv.py

Additional Analysis Scripts

utils/article_llm_analysis/generate_llm_analysis_conf.py

11_download_pdfs.py

Topic Modeling

11_topic_modeling_lvl1.py

11_topic_modeling_lvl2.py

11_topic_modeling_refine.py

11_topic_modeling_assign.py

11_topic_modeling_correct.py

Task Assistant

Web Application

Docker

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Uh oh!

Languages

`utils/setup/generate_search_conf.py`

`0_generate_snowball_start.py`

`1_start_iteration.py`

`2_remove_duplicates.py`

`3_get_bibtex.py`

(Optional) `4_generate_conf_rank.py`

`5_filter_by_metadata.py`

`6_filter_by_title.py`

`7_solve_title_disagreements.py`

`8_filter_by_content.py`

`9_solve_content_disagreements.py`

`10_generate_csv.py`

`utils/article_llm_analysis/generate_llm_analysis_conf.py`

`11_download_pdfs.py`

`11_topic_modeling_lvl1.py`

`11_topic_modeling_lvl2.py`

`11_topic_modeling_refine.py`

`11_topic_modeling_assign.py`

`11_topic_modeling_correct.py`

Packages