
@mmb78 mmb78 commented Jun 24, 2025

MODIFICATION FOR PARALLEL EXECUTION

Minimal changes from v1.5.1 to allow safe execution as a SLURM job array.

1. File locking is used to prevent data corruption when writing to shared CSV files.
2. A central counter file (_progress_counter.txt) tracks the total accepted designs, allowing all processes to stop once the target is reached.

Example SLURM usage:

#SBATCH --array=0-31

Requires the FileLock library: run "pip install filelock"

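The locking pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual patch: the function names are hypothetical, and `FileLock` comes from the third-party `filelock` package mentioned above.

```python
import csv
import os

from filelock import FileLock  # third-party: pip install filelock


def append_row_locked(csv_path, row):
    # Hold an exclusive lock while appending, so parallel workers
    # cannot interleave or clobber each other's writes.
    with FileLock(csv_path + ".lock", timeout=60):
        with open(csv_path, "a", newline="") as f:
            csv.writer(f).writerow(row)


def increment_counter(counter_path="_progress_counter.txt"):
    # Read-modify-write of the shared counter under the same kind of
    # lock; returns the new total of accepted designs.
    with FileLock(counter_path + ".lock", timeout=60):
        count = 0
        if os.path.exists(counter_path):
            with open(counter_path) as f:
                count = int(f.read().strip() or 0)
        count += 1
        with open(counter_path, "w") as f:
            f.write(str(count))
    return count
```

The `.lock` sidecar file is how `filelock` coordinates across processes; the timeout just bounds how long a worker waits if another worker holds the lock.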

mmb78 commented Jun 24, 2025

I ran the 1.5.1 version of BindCraft with SLURM as an array job using the same settings JSON and realized that this is problematic, as the same files (CSV outputs) are potentially being accessed by the parallel processes. While rare, there is a chance that the same file is accessed by two parallel processes at the same time, which crashes one of them. This is a minimal change to the code that handles writing to files, locking the files for access so the parallel processes do not crash. In addition, the change adds one new file that monitors the progress of all workers, so all the parallel scripts stop once the required number of accepted designs is reached. Setting "number_of_final_designs" to 100 (in the JSON) allows the parallel processes to run until 100 accepted designs are reached collectively (not by each individual process). I tested running the code on 32 GPUs for several hours without any crashes.

@cytokineking

Is it possible to do this without the slurm dependency?


mmb78 commented Jun 25, 2025

The SLURM script does not do much more than execute the same script several times (an array job). While I did not specifically test it, the modification I added should allow you to simply run the main BindCraft script multiple times as separate jobs, as explained in the README.

You need to do this first:
conda activate BindCraft
cd /path/to/bindcraft/folder/

and then just start this several times as a background process:
python -u ./bindcraft.py --settings './settings_target/PDL1.json' --filters './settings_filters/default_filters.json' --advanced './settings_advanced/default_4stage_multimer.json'

One should also capture logs (STDOUT) and errors (STDERR) to text files. Those would have to be uniquely named for each process! Just consult how to handle background jobs in Linux.
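For illustration, launching several copies of the same command with uniquely named per-process log files could also be scripted like this. The function name, worker count, and log-file naming are placeholders, not part of the patch:

```python
import subprocess


def launch_workers(cmd, n_workers, log_prefix="bindcraft_worker"):
    # Start n_workers identical background processes, each with its own
    # uniquely named STDOUT/STDERR files so the logs do not clobber
    # each other.
    procs = []
    for i in range(n_workers):
        out = open(f"{log_prefix}_{i}.log", "w")
        err = open(f"{log_prefix}_{i}.err", "w")
        procs.append(subprocess.Popen(cmd, stdout=out, stderr=err))
    return procs
```

For example, the command from above could be started eight times with `launch_workers(["python", "-u", "./bindcraft.py", "--settings", "./settings_target/PDL1.json", "--filters", "./settings_filters/default_filters.json", "--advanced", "./settings_advanced/default_4stage_multimer.json"], 8)`.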

@cytokineking

Great. So this prevents collisions on the .csv files and it also avoids the race condition where multiple processes can rank the files at completion?


mmb78 commented Jun 25, 2025

It creates "locks" for all output files. So if one process is writing to a file, the file gets "locked" and all the other processes have to wait until writing is finished. In addition, there is one file where progress is tracked (just the number of accepted binders found so far). The content of this file is checked at the start of each process, and if the final number was already reached, the calculation does not start and the script stops. The process that finishes the last design does all the ranking. Importantly, I really did not change much in the code! It is only this data-handling issue!
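The stop-and-rank logic described here can be sketched as follows. The function names are hypothetical, and in the actual patch the counter read and update happen under a FileLock, which is omitted here for brevity:

```python
import os


def read_counter(path="_progress_counter.txt"):
    # Current total of accepted designs across all workers (0 if the
    # counter file does not exist yet).
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return int(f.read().strip() or 0)


def should_start_cycle(target, path="_progress_counter.txt"):
    # Checked at the start of each cycle: do not start a new
    # trajectory if the target was already reached.
    return read_counter(path) < target


def record_accepted(target, path="_progress_counter.txt"):
    # Called when a worker accepts a design. Returns True if this
    # worker is the one that reached the target and should therefore
    # run the final ranking; all other workers return False and simply
    # finish their current cycle.
    total = read_counter(path) + 1
    with open(path, "w") as f:
        f.write(str(total))
    return total >= target
```

Because only the worker whose increment reaches (or exceeds) the target gets `True`, the final ranking is triggered exactly once, by the last finisher.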

However, while writing the answer, I realized that I had one bug in my patch: the final ranking and data processing after reaching the desired number of designs would never have been reached, so I provide a fix for this. I also included an example command for starting 8 parallel background processes. I did not test it, though, as I cannot run these without SLURM.

The process that finds the last desired accepted binder will execute the ranking and final processing of all data. All other parallel processes will stop after finishing their current job.
This version fixes the problem of my previous version, where no process would do the final ranking.
@cytokineking

Is it possible to kill the other processes once ranking starts on any given process? It seems that you prevent new trajectories from starting once the final number of designs is reached, but that doesn't address the issue that there could be several ongoing processes, all of which will finish asynchronously and re-rank. This has been the major problem I have had myself with multiple jobs: the ranked designs and the .csv file get badly jumbled. You need logic to pkill all ongoing concurrent BindCraft processes for the same job once any one of them has reached the terminal ranking step.


mmb78 commented Jun 26, 2025

As it stands in the second updated version, the check whether the final number of designs has been reached only happens when a successful design is found, and only then is the ranking done. Writing into the CSV files is done with a "lock", so no other process can touch the protected files (such as mpnn_design_stats.csv) in the meantime.

If another process finishes after this final ranking is done and finds no binder, it starts a new cycle, checks "_progress_counter.txt", which is now at the desired number, and therefore stops without doing any ranking or data processing. If a running process finds a binder in its last calculation (unlikely), it will wait until the ongoing CSV writing and ranking by the other process is done (because of the "lock") and only then add its successful design and perform a new reranking. It seems to me that even this edge scenario, where two running processes find a binder in their last run, would still finish correctly; you'd simply get one extra successful design in the final list.

All processes still running when the last successful design was found will simply finish their ongoing cycle of calculations and write their results (they are not killed right away). I think this is OK, as each cycle does not take that long and even those processes may still find useful designs.


cytokineking commented Jun 26, 2025 via email


mmb78 commented Jun 26, 2025

To be perfectly clear, I did not test this specific scenario (hard to achieve, as it is so unlikely) and I did not go line by line through all the ranking and final-processing functions... but according to Gemini 2.5 Pro, it should run fine, with the final outcome simply being, for example, 101 instead of 100 binders, all properly reranked and named as expected based on their position in the new, longer list. Martin should confirm this if he decides to implement this "patch".

There was one more bug in the final processing step. The order of the final processing steps was rearranged to prevent a crash during the ranking of binders and the generation of outputs.

mmb78 commented Jun 28, 2025

Just want to confirm that this last version finished with no errors for me. I asked for 20 designs and got them ranked. The processes that were still running when the last design was found simply finished their last calculation, did not start another, and terminated normally (so resources were freed). After this, I changed the JSON file to 22 designs and ran the SLURM script again; after some time I got two more, all reranked and saved properly. Then I changed the JSON again to 24 designs and restarted the calculation, and I again have calculations running and adding data/structures to the same output folder with no issues. Interestingly, this time one process found the last binder (24), produced the final outputs, and stopped; a parallel process that was in the middle of its calculation also found two binders, so when that process finished, I had 26, all ranked together.

I think this is quite useful, as you can ask for fewer designs initially, look at what is going on, and then restart later by changing just the JSON file. You'll still have all the data in one place, as if you had asked for more designs at the start.
