diff --git a/docs/source/_static/RNA_kaggle_fig3.png b/docs/source/_static/RNA_kaggle_fig3.png new file mode 100644 index 0000000..eb52e65 Binary files /dev/null and b/docs/source/_static/RNA_kaggle_fig3.png differ diff --git a/docs/source/_static/c2_pseudocycle_png.png b/docs/source/_static/c2_pseudocycle_png.png new file mode 100644 index 0000000..e7fe0a5 Binary files /dev/null and b/docs/source/_static/c2_pseudocycle_png.png differ diff --git a/docs/source/_static/example_4_new.png b/docs/source/_static/example_4_new.png new file mode 100644 index 0000000..4e0e09a Binary files /dev/null and b/docs/source/_static/example_4_new.png differ diff --git a/docs/source/_static/example_5_new.png b/docs/source/_static/example_5_new.png new file mode 100644 index 0000000..fd1c061 Binary files /dev/null and b/docs/source/_static/example_5_new.png differ diff --git a/docs/source/_static/holliday_junc_png.png b/docs/source/_static/holliday_junc_png.png new file mode 100644 index 0000000..39d64f9 Binary files /dev/null and b/docs/source/_static/holliday_junc_png.png differ diff --git a/docs/source/_static/rna_polyhedra.png b/docs/source/_static/rna_polyhedra.png new file mode 100644 index 0000000..4663aed Binary files /dev/null and b/docs/source/_static/rna_polyhedra.png differ diff --git a/docs/source/_static/rna_triple_helix.png b/docs/source/_static/rna_triple_helix.png new file mode 100644 index 0000000..74d8e05 Binary files /dev/null and b/docs/source/_static/rna_triple_helix.png differ diff --git a/docs/source/_static/rna_triple_tip.png b/docs/source/_static/rna_triple_tip.png new file mode 100644 index 0000000..604d170 Binary files /dev/null and b/docs/source/_static/rna_triple_tip.png differ diff --git a/docs/source/index.rst b/docs/source/index.rst index 3491f0a..8331343 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -17,4 +17,5 @@ RFDpoly documentation license_link.rst contributing_link.rst installation_guide.md + rfdpoly_usage_examples.md diff --git a/docs/source/rfdpoly_usage_examples.md b/docs/source/rfdpoly_usage_examples.md new file mode 100644 index 0000000..82862b1 --- /dev/null +++ b/docs/source/rfdpoly_usage_examples.md @@ -0,0 +1,364 @@ +# RFDpoly usage examples + +This document has several examples for how to use RFDpoly for RNA and DNA generation including: +- Unconditional generation of multiple biopolymers together +- Motif scaffolding tasks with multiple polymers +- Control of base-pairing secondary structure to design RNA pseudoknots or (mini) DNA origami + +```{note} +If you come across this error: + ```bash + Could note locate file '/current/working/directory/test_data/DBP035.pdb'. Tried... + ``` + You can fix it via the addition of + ```bash + inference.input_pdb=/path/to/RFDpoly/rf_diffusion/test_data/DBP035.pdb + ``` + to the end of your list of configuration options. +``` + +## Table of Contents +- [Example 1: Unconditional RNA generation](#unconditional-rna-generation) +- [Example 2: Example 2: Unconditional design of one protein chain and two DNA chains](#unconditional-protein-2-dna) +- [Example 3: RNA riboswitch design with conditioning from Eterna puzzle structure string](#rna-riboswitch-conditioning-eterna) +- [Example 4: DNA–protein scaffolding, inpaint two DNA chains](#dna-protein-scaffolding-inpaint-2-dna) +- [Example 5: DNA–protein scaffolding, inpaint two DNA chains and one protein chain](#dna-protein-scaffolding-inpaint-2-dna-1-protein) +- [Example 6: DNA origami with symmetric denoising](#dna-origami-symmetric-denoising) +- [Example 7: RNA design with triple helix](#rna-design-triple-helix) +- [Example 8: Control of RNA tertiary structure with multi-contact specification](#control-rna-tertiary-multi-contact) +- [Example 9: Pseudocyclic symmetry using procedurally generated base-pair patterning](#pseudocyclic-symmetry-base-pair-patterning) +- [Example 10: De novo Holliday junctions using strand exchange](#de-novo-holliday-junctions) +- [Example 11: Sequence specification and sequence design](#sequence-specification-sequence-design) + - [Unconditional design of RNA and protein](#unconditional-design-rna-protein) + +- [Further Reading](#further-reading) + + +(unconditional-rna-generation)= +## Example 1: Unconditional RNA generation +``` +apptainer run --nv /path/to/SE3nv.sif /path/to/RFDpoly/rf_diffusion/run_inference.py \ +--config-name=multi_polymer \ +diffuser.T=50 \ +inference.num_designs=5 \ +'contigmap.contigs=["90"]' \ +contigmap.polymer_chains=['rna'] \ +inference.output_prefix='./demo_outputs/RNA_uncond_standard_settings' \ +inference.ckpt_path=/path/to/RFDpoly/weights/train_session2024-07-08_1720455712_BFF_3.00.pt +``` + +(unconditional-protein-2-dna)= +## Example 2: Unconditional design of one protein chain and two DNA chains +``` +apptainer run --nv /path/to/SE3nv.sif /path/to/RFDpoly/rf_diffusion/run_inference.py \ +--config-name=multi_polymer \ +diffuser.T=50 \ +inference.num_designs=5 \ +"contigmap.contigs=[20\ 20\ 75]" \ +'contigmap.polymer_chains=["dna","dna","protein"]' \ +inference.output_prefix='./demo_outputs/DNA_prot_uncond_standard_settings' +inference.ckpt_path=/path/to/RFDpoly/weights/train_session2024-07-08_1720455712_BFF_3.00.pt +``` + +(rna-riboswitch-conditioning-eterna)= +## Example 3: RNA riboswitch design with conditioning from Eterna puzzle structure string + +Puzzle source: [JR Openknot4 Week6 4RZD](https://eternagame.org/labs/13195459) +``` +((((...............))))(((((((((........))))))))).......(((((((........)))))))(((...........)))..... +``` +Because Hydra cannot use parentheses in command-line arguments, replace “(” and “)” with “5” +and “3”. +``` +apptainer run --nv /path/to/SE3nv.sif /path/to/rf_diffusion/run_inference.py \ +--config-name=multi_polymer \ +diffuser.T=50 \ +inference.num_designs=5 \ +'contigmap.contigs=["100"]' \ +contigmap.polymer_chains=['rna'] \ +scaffoldguided.target_ss_string='5555...............3333555555555........333333333.......5555555........3333333555...........333.....' \ +inference.output_prefix='./demo_outputs/RNA_eterna_cond_standard_settings' +inference.ckpt_path=/path/to/RFDpoly/weights/train_session2024-07-08_1720455712_BFF_3.00.pt +``` +Try this out on more [Eterna puzzles](https://eternagame.org/challenges/11843006)! To get the ss-string for a given puzzle, start playing, then +right-click → “Copy Structure” or choose it from the menu in the upper left corner. +**BUT WAIT!** What if you want to insert multiple pseudoknots into your design or insert them +into specific positions of certain chains? You can provide a list of secondary structure strings as +follows: +``` +scaffoldguided.target_ss_string_list=[\'B1-90:.5555555555555555555..ffffff.33333555....333555....33333333333333333555555..tttttt.333333.\',\'A116-205:.5555555555555555555..ffffff.33333555....333555....33333333333333333555555..tttttt.333333.\'] +``` + +where the ranges specify chains and indices of pseudoknot insertion locations in the output contigs. + +(dna-protein-scaffolding-inpaint-2-dna)= +## Example 4: DNA–protein scaffolding, inpaint two DNA chains + +Image of an example output that highlights each generated chain and connects it to the portion of the contig that describes it. + +The argument `inference.ij_visible` controls which motifs listed in the contigs have their relative +orientations locked during inference (credit to [Dr. David Juergens](https://scholar.google.com/citations?user=k6PhNDQAAAAJ&hl=en) for this intuitive system). In the following +contact map, lowercase letters denote motifs from the input PDB in the order they occur in the contigs +(not to be confused with uppercase chain IDs): + +``` +10,D8-13,6,B8-13,10\ 10,B18-23,6,D18-23,10\ A1-56,0\ C1-56,0 + a. b. c. d. e. f. +``` +Motifs A1-56, B8-13, and B18-23 (b,c,e) form the first DNA+binder group “bce”, while C1-5, +D8-13, and D18-23 (a,d,f) form “adf”. The relative orientations of “bce” and “adf” can now vary +during inference. +``` +/path/to/SE3nv.sif /path/to/rf_diffusion/run_inference.py \ +--config-name=multi_polymer \ +diffuser.T=50 \ +inference.num_designs=5 \ +"contigmap.contigs=['10,D8-13,6,B8-13,10 10,B18-23,6,D18-23,10 A1-56,0 C1-56,0']" \ +inference.ij_visible='bce-adf' \ +"contigmap.polymer_chains=[dna,dna,protein,protein]" \ +inference.input_pdb='path/to/RFDpoly/test_data/combo_DBP009_DBP010_DBP011_with_DNA_v2.pdb' \ +inference.output_prefix='./demo_outputs/DNA_binders_scaffolding_test1_standard_settings' \ +inference.ckpt_path=/path/to/RFDpoly/weights/train_session2024-07-08_1720455712_BFF_3.00.pt +``` + +(dna-protein-scaffolding-inpaint-2-dna-1-protein)= +## Example 5: DNA–protein scaffolding, inpaint two DNA chains and one protein chain + +Image of an example output that highlights each generated chain and connects it to the portion of the contig that describes it. + +Here, contigmap.polymer chains lists only three chains because binding proteins are merged into +one. +``` +/path/to/SE3nv.sif /path/to/rf_diffusion/run_inference.py \ +--config-name=multi_polymer \ +diffuser.T=50 \ +inference.num_designs=5 \ +"contigmap.contigs=['5,D8-13,2,B8-13,5 5,B18-23,2,D18-23,5 A1-52,90,C4-56,0']" \ +inference.ij_visible='bce-adf' \ +contigmap.polymer_chains=[dna,dna,protein] \ +scaffoldguided.target_ss_pairs="['A1-24,B1-24']" \ +inference.input_pdb='path/to/RFDpoly/rf_diffusion/test_data/combo_DBP009_DBP010_DBP011_with_DNA_v2.pdb' \ +inference.output_prefix='./demo_outputs/DNA_binders_scaffolding_test2_standard_settings' \ +inference.ckpt_path=/path/to/RFDpoly/weights/train_session2024-07-08_1720455712_BFF_3.00.pt +``` +The new argument scaffoldguided.target_ss_pairs enforces base-pairing of polymer ranges +(here A1-24 B1-24). Each corresponds to the two full DNA contigs (A,B) in the outputs, each of +length 24. + +(dna-origami-symmetric-denoising)= +## Example 6: DNA origami with symmetric denoising +The scaffoldguided.target ss pairs argument specifies paired ranges in the design; both +ranges in each pair must have equal length. The first is 5→3, the second 3→5. + +Example of an output of this calculation. + +``` +/path/to/SE3nv.sif /path/to/rf_diffusion/run_inference.py \ +--config-name=multi_polymer \ +diffuser.T=50 \ +inference.num_designs=5 \ +"contigmap.contigs=['60 60 60 60']" \ +"contigmap.polymer_chains=[dna,dna,dna,dna]" \ +scaffoldguided.target_ss_pairs="['A1-20,B1-20','A21-40,C21-40','A41-60,D41-60','B21-40,D21-40','B41-60,C41-60','C1-20,D1-20']" \ +inference.symmetry='d2' \ +inference.output_prefix='./demo_outputs/DNA_origami_standard_settings' \ +inference.ckpt_path=/path/to/RFDpoly/weights/train_session2024-07-08_1720455712_BFF_3.00.pt +``` + +(rna-design-triple-helix)= +## Example 7: RNA design with triple helix +Just as in the previous example, we can use base-paired ranges of sequence to control RNA +topology. By default, paired regions are antiparallel, but specific orientations can be assigned (e.g., triple helices with parallel/antiparallel combinations). + +Example of an output of this calculation. + +``` +/path/to/SE3nv.sif /path/to/rf_diffusion/run_inference.py \ +--config-name=multi_polymer \ +diffuser.T=50 \ +inference.num_designs=5 \ +contigmap.contigs=['75'] \ +"contigmap.polymer_chains=[rna]" \ +scaffoldguided.target_ss_pairs="['A5-20,A55-70','A55-70,A30-45']" \ +scaffoldguided.target_ss_pair_ori="[P,A]" \ +inference.output_prefix='./demo_outputs/Triple_helix_test' \ +inference.ckpt_path=/path/to/RFDpoly/weights/train_session2024-07-08_1720455712_BFF_3.00.pt +``` +Each orientation in the list scaffoldguided.target_ss_pair_ori corresponds to the element at +the same index in scaffoldguided.target_ss_pairs. + +(control-rna-tertiary-multi-contact)= +## Example 8: Control of RNA tertiary structure with multi-contact specification + +Example of a the tertiary structure that can be produced with this calculation. + +We can “staple” distal loops in RNA pseudoknots together by specifying regions of multi-base +contacts using scaffoldguided.force_multi_contacts. Secondary-structure strings cannot encode +beyond simple two-base pair configurations, so this feature enables higher-order tertiary interactions. +We can also force loop placement via scaffoldguided.force_loops_list. +``` +/path/to/SE3nv.sif /path/to/rf_diffusion/run_inference.py \ +--config-name=multi_polymer \ +diffuser.T=50 \ +inference.num_designs=5 \ +"contigmap.contigs=['80']" \ +"contigmap.polymer_chains=[rna]" \ +scaffoldguided.target_ss_pairs="['A5-15,A25-35','A45-55,A65-75']" \ +scaffoldguided.force_multi_contacts="['A19,A61,A20','A59,A21,A60']" \ +scaffoldguided.force_loops_list="['A38-42']" \ +inference.output_prefix='./demo_outputs/loop_touch_test' \ +inference.ckpt_path=/path/to/RFDpoly/weights/train_session2024-07-08_1720455712_BFF_3.00.pt +``` + +(pseudocyclic-symmetry-base-pair-patterning)= +## Example 9: Pseudocyclic symmetry using procedurally generated base-pair patterning +Below are arguments used to create pseudo-symmetry in a single chain forming a cyclic-symmetric shape (e.g., C2 pseudocycle). + +Example of a c2 pseudocycle. + +``` +/path/to/SE3nv.sif /path/to/rf_diffusion/run_inference.py --config-name=multi_polymer \ +inference.ckpt_path='/path/to.RFDpoly/weights/train_session2024-07-08_1720455712_BFF_3.00.pt' \ +diffuser.T=50 \ +inference.num_designs=5 \ +"contigmap.contigs=['240']" \ +inference.pseudo_symmetry='c2' \ +inference.n_repeats=2 \ +scaffoldguided.target_ss_pairs="['A4-5,A237-238','A7,A236','A11-12,A230-231','A18-27,A137-146','A44-47,A72-75','A49-44,A64-69','A63-69,A49-55','A72-75,A44-47','A90-99,A197-206','A110-112,A130-132','A115-115,A127-128','A117-118,A124-125','A124-125,A117-118','A127-128,A115-116','A130-132,A110-112','A137-140,A24-27','A141-146,A18-23','A164-168,A191-195','A169-175,A183-189','A183-189,A169-175','A191-195,A164-168','A197-206,A90-99','A210-214,A82-86','A215-219,A77-81','A230-231,A11-12','A236,A7','A237-238,A4-5']" \ +contigmap.polymer_chains=['rna'] \ +inference.output_prefix='./outputs_2025-02-03/pC2_test01__BFF_3.00' \ +``` + +(de-novo-holliday-junctions)= +## Example 10: De novo Holliday junctions using strand exchange + +Example of a designed Holliday junction. + +We can use symmetry and strand exchange to design Holliday-junction-style complexes. Chain +and index specifications in `scaffoldguided.target_ss_pairs` refer to chain IDs and indices in the +output structure defined by contig topology. +``` +/path/to/SE3nv.sif /path/to/rf_diffusion/run_inference.py \ +--config-name=multi_polymer \ +diffuser.T=50 \ +inference.symmetry='c2' \ +contigmap.inpaint_seq=['D60','D57','D58','D11','D15','D19','D8','D61','D18','A9','A59','A62','A8','A4','A5'] \ +inference.num_designs=5 \ +inference.ckpt_path='/path/to/RFDpoly/weights/train_session2024-07-08_1720455712_BFF_3.00.pt' \ +"contigmap.contigs=['A1-61,60,D14-65 15,B6-12,4,F1-8,15 15,E7-14,4,C4-10,15 A1-61,60,D14-65 15,B6-12,4,F1-8,15 15,E7-14,4,C4-10,15']" \ +inference.ij_visible='acf-bde-gil-hjk' \ +"contigmap.polymer_chains=['protein','dna','dna','protein','dna','dna']" \ +scaffoldguided.target_ss_pairs="['B1-10,F40-49','B40-49,F1-10','B16-34,C16-34','E16-34,F16-34','C1-10,E40-49','C40-49,E1-10']" \ +inference.input_pdb='/path/to/RFDpoly/rf_diffusion/test_data/DBP35opt_DBP48.pdb' \ +inference.output_prefix='./outputs_2025-02-03/DBP_scaffolding_test06__BFF_3.00' +``` + +(sequence-specification-sequence-design)= +## Example 11: Sequence specification and sequence design +We can now specify the sequence of our +structures to be whatever we want! +Additionally, the model has been trained to do sequence prediction, so we can decode a sequence during +the denoising trajectory (this allows us to generate outputs with all of the base atoms rendered for +NA design, as well as nice sidechain interactions for protein design). +The default behavior throughout RFdiffusion is to keep the sequence of diffused regions masked +during the trajectory, even if the outputs seem to have residue labels. RoseTTAfold must “see” +sequence labels in order to generate sidechains, so I added a flag, inference.update_seq_t=True, +which allows the model to see either a user-specified sequence or the model's predicted sequence from +the previous timestep. This gives us sidechains, and is a unique addition to RFdiffusion-based tools. +There are two ways to control the sequence info during a trajectory: +- Turn on full-sequence visibility at some timestep towards the end of the trajectory, using inference.show_seq under t=15. +- Gradually decode a random selection of positions at each step below some point in the trajectory, using diffuser.aa_decode_steps=40. +Both methods work nicely, so test them both during your design process. Examples using both methods are shown below. + +(unconditional-design-rna-protein)= +### Example 11A: Unconditional design of RNA and protein +While specifying the RNA sequence, letting the model design the protein sequence, and gradually revealing various sequence positions over the course of the last 40 steps: +``` +/path/to/SE3nv.sif /path/to/rf_diffusion/run_inference.py --config-name=multi_polymer \ +diffuser.T=50 \ +inference.num_designs=3 \ +"contigmap.contigs=['43 20 75']" \ +"contigmap.polymer_chains=['rna','rna','protein']" \ +inference.set_sequence="['A1-43:GGAUGUACUACCAGCUGAUGAGUCCCAAAUAGGACGAAACGCC','B1-20:GGCGUCCUGGUAUCCAAUCC']" \ +inference.update_seq_t=True \ +diffuser.aa_decode_steps=40 \ +inference.ckpt_path='/path/to/RFDpoly/weights/train_session2024-07-08_1720455712_BFF_3.00.pt' \ +inference.output_prefix='./demo_outputs/RNA-prot_seq-spec_and_seq-design_standard_settings' +``` + + + +(further-reading)= +## Further Reading +We can use several symbol pairs to denote which of two bases being paired is 5’-led or 3’-led. Unfortu- +nately, due to Hydra’s constraints on command-line tokens, we cannot use standard DBN conventions +and must exclude the following symbols: (, ), [, ], {, }, <, >. However, the following symbols can be +used instead to control hierarchical pseudoknot specification: + +|5'-led | 3'-led | Explanation| +|-------|--------|--------------------| +|5 | 3 | (5'→3' and 3'→5')| +|i | j | (index convention)| +|f | t | ("from" and "to")| +|b | e | ("begin" and "end")| +|A | a | ("begin" and "end")| + +The code accepts any of these symbols to denote base pairs using the dot-bracket-notation paradigm. +However, Hydra cannot parse parentheses, braces, or brackets in command-line arguments. If your +topology requires multiple distinct open/close pairs (as with pseudoknots), simply replace symbols as +long as they remain distinct. + +### Polymer Contigs +If you’ve gotten this far, you’ve probably already seen the arguments such as: +`contigmap.polymer_chains="['dna','dna','protein']"` +To control polymer classes generated, specify the polymer type for each chain in the contig line; +otherwise, the model makes assumptions that may yield odd results (e.g., threading nucleic backbones +through protein motifs). Future updates may automate this behavior, but for now specify polymer +chains manually. + +### A Note on the Different Contigs +Different configs and their intended use cases: +- `multi_polymer`: main config; handles motif scaffolding and secondary-structure conditioning. +- `RNA_ss_cond`: similar to above, but loads a checkpoint fine-tuned on more RNA samples. +- `RNA_uncond`: older configuration predating SS-conditioning. Produces diverse unconditional +designs but with less designable backbones. + +In practice, `multi_polymer` works best across all polymer classes and tasks — truly a “one-model- +to-rule-them-all” scenario. + +### A Note on Checkpoints +Many checkpoints are available, but the multi polymer config generally suffices. Alternative checkpoints can be found at: +`/rf_diffusion/config/inference` + +### Arguments that Specify Residues +Chain–resi pairs (e.g., “A23”) specify that something should +happen to a given residue. Sometimes we use this to refer to the chain–resi pairing in the input PDB +(for example, in the `contigmap.contigs` argument). Other times we refer to a location within the +newly generated structure (for example, in `scaffoldguided.target_ss_pairs`). + +### A Note on PyMOL's Cartoon Representation +Since RFDpoly is based on the original implementation of RFdiffusion, its outputs lack sidechains in non-motif (designed) regions. This is particularly problematic for nucleic acids: PyMOL refuses to render nucleic acid cartoons if sidechains are missing. Therefore, Andrew recommends inspecting outputs in ChimeraX or using an alternative PyMOL scheme to visualize NA outputs. + +If outputs seem invisible in PyMOL, it's not a bug, just a rendering limitation. + +### Questions? +If you have any questions about RFDpoly, open a Discussion on GitHub or post in the Rosetta Commons Slack (Commons members only). If you have issues running any of the examples, open an Issue on GitHub. \ No newline at end of file