Open
Changes from 69 commits
70 commits
6208a9a
add save_genomes function
Xiangs18 Aug 15, 2024
a2fabf4
fix positional arg #1 is the wrong type bug
Xiangs18 Aug 16, 2024
330d6c2
use batch genome save in GenbankToGenome.py
Xiangs18 Aug 16, 2024
27158ec
make save_genome_mass function internal
Xiangs18 Aug 16, 2024
04aa203
add tests for save_genome_mass function
Xiangs18 Aug 17, 2024
8d672d1
fix bug
Xiangs18 Aug 17, 2024
d1dd768
update release notes && make the dicts in the loop
Xiangs18 Aug 26, 2024
e79c297
remove logging && add NOTE for workspace_datatype
Xiangs18 Aug 26, 2024
35030a0
move set_up_single_params && validate_mass_params into GenomeUtils
Xiangs18 Aug 28, 2024
4b1b88b
remove redundant tests
Xiangs18 Aug 28, 2024
b56d77f
add test to cover the missing line
Xiangs18 Sep 3, 2024
788c7ee
fix params name typo
Xiangs18 Sep 3, 2024
65467ea
add metagenome json file && cover the missing line
Xiangs18 Sep 5, 2024
66af0b7
rm gff_handle_ref
Xiangs18 Sep 5, 2024
fcbb510
add features_handle_ref && protein_handle_ref before upload
Xiangs18 Sep 10, 2024
32f96a0
add boolean flag for validate_genome
Xiangs18 Sep 10, 2024
41eed0b
test validate_genome boolean flag
Xiangs18 Sep 11, 2024
71670bc
1. add pydoc for save_genme_mass; 2. make the dicts in the _save_geno…
Xiangs18 Sep 11, 2024
0cbe155
update release notes && remove tiny files
Xiangs18 Sep 13, 2024
52e280d
add more info and warnings checks in test
Xiangs18 Sep 17, 2024
f018f0e
remove metagenome from test
Xiangs18 Sep 19, 2024
54fedf0
remove unused lib
Xiangs18 Sep 19, 2024
861a50b
fix documentation
Xiangs18 Oct 30, 2024
01a1dc8
add workspace_id in GenomeFileUtil.spec
Xiangs18 Nov 1, 2024
654a5e9
replace formatted string literals by docstring
Xiangs18 May 19, 2025
bda89a0
add check_hidden func and tests
Xiangs18 May 20, 2025
6bb1009
update check info, display prov and data
Xiangs18 May 23, 2025
2f546ac
move info, prov, and data checks to test_utils.py
Xiangs18 May 27, 2025
0d192f7
fix the failed test
Xiangs18 May 28, 2025
a8bced6
clean up genbank_upload_full_test.py file
Xiangs18 May 28, 2025
b690b09
add _get_info_and_ref helper function and refactor _check_info and _c…
Xiangs18 May 30, 2025
c1018df
fix UnboundLocalError
Xiangs18 May 30, 2025
6f5de92
move provenance to test_utils for reuse
Xiangs18 Jun 3, 2025
bd8ea82
display data before pop genbank_handle_ref
Xiangs18 Jun 3, 2025
787c7b9
display data before any processing
Xiangs18 Jun 3, 2025
828edb6
add target genome data to pass in
Xiangs18 Jun 3, 2025
595c78b
add manual handle setup
Xiangs18 Jun 7, 2025
8992544
refactor _set_handle function & delete old json.file
Xiangs18 Jun 10, 2025
306ae55
upload expected_genome_data
Xiangs18 Jun 10, 2025
cf72c07
test md5sum & genome data in test_genomes
Xiangs18 Jun 10, 2025
f2b8f99
clean up test_utils.py & test test_genomes_with_hidden
Xiangs18 Jun 10, 2025
cc5cee1
register node to delete
Xiangs18 Jun 11, 2025
d5ae8a3
add test for g2g_mass with large genome input and refactor test file
Xiangs18 Jun 11, 2025
7022f35
use g2g_mass for testing
Xiangs18 Jun 11, 2025
be96213
update docString in save_genome_mass
Xiangs18 Jul 14, 2025
f624b11
link genbank file instead of JSON file
Xiangs18 Jul 23, 2025
7fb8355
check prov diff
Xiangs18 Jul 23, 2025
79b44fd
update prov
Xiangs18 Jul 23, 2025
6f64bbd
update subactions dict in prov
Xiangs18 Jul 23, 2025
a3eaec2
revert changes on run_tests_within_container.sh
Xiangs18 Jul 23, 2025
f22ca6a
simplify check_hidden function
Xiangs18 Jul 23, 2025
19ccfb5
pass in object_id
Xiangs18 Jul 23, 2025
c306fc6
restore deleted AnnotatedMetagenomeAssembly code
Xiangs18 Jul 25, 2025
948e3b1
1.fix _update_metagenome bug; 2.update prepare_data function; 3.add t…
Xiangs18 Aug 7, 2025
b626756
display genome info
Xiangs18 Aug 7, 2025
5378631
refactor _get_object_type logic
Xiangs18 Aug 8, 2025
ef95de7
add missing param
Xiangs18 Aug 8, 2025
fafc886
display data info
Xiangs18 Aug 8, 2025
e7d0e04
fix missing key bug
Xiangs18 Aug 8, 2025
0ae6e9d
add expected_genome_data
Xiangs18 Aug 8, 2025
6f0f0f7
pop features_handle_ref and protein_handle_ref
Xiangs18 Aug 8, 2025
6abafd6
add missing is_metagenome param in _check_data
Xiangs18 Aug 8, 2025
ab8b829
pass in correct expected_data
Xiangs18 Aug 8, 2025
f5f6aee
finish updating metagenome test
Xiangs18 Aug 8, 2025
d360fda
fix params bug
Xiangs18 Aug 8, 2025
e30e0e1
refactor current genome_size_tests script and setup for mass saves test
Xiangs18 Aug 15, 2025
8dc7b37
run all tests
Xiangs18 Aug 15, 2025
1571b43
Uses a flag (g2g_mass) to determine which method to call inside a pat…
Xiangs18 Aug 25, 2025
ed57557
add mass saves tests in genome_size_tests.py script
Xiangs18 Aug 26, 2025
2eb63c1
simplify getting wsID
Xiangs18 Sep 22, 2025
1 change: 1 addition & 0 deletions GenomeFileUtil.spec
@@ -318,6 +318,7 @@ module GenomeFileUtil {
         returns (MetagenomeSaveResult returnVal) authentication required;

     typedef structure {
+        int workspace_id;
         string workspace;
         string name;
         KBaseGenomes.Genome data;
44 changes: 22 additions & 22 deletions lib/GenomeFileUtil/GenomeFileUtilImpl.py
@@ -69,7 +69,7 @@ class GenomeFileUtil:
     ######################################### noqa
     VERSION = "0.11.7"
     GIT_URL = "git@github.com:kbaseapps/GenomeFileUtil.git"
-    GIT_COMMIT_HASH = "591a19ccf4d1b42f01cc06486654b6d3a8ea08e4"
+    GIT_COMMIT_HASH = "861a50b5ce2d0b8b481fe514b41a6d6b89ee0c96"

     #BEGIN_CLASS_HEADER
     #END_CLASS_HEADER
@@ -831,27 +831,27 @@ def fasta_gff_to_metagenome(self, ctx, params):
     def save_one_genome(self, ctx, params):
         """
         :param params: instance of type "SaveOneGenomeParams" -> structure:
-           parameter "workspace" of String, parameter "name" of String,
-           parameter "data" of type "Genome" (Genome type -- annotated and
-           assembled genome data. Field descriptions: id - string - KBase
-           legacy data ID scientific_name - string - human readable species
-           name domain - string - human readable phylogenetic domain name
-           (eg. "Bacteria") warnings - list of string - genome-level warnings
-           generated in the annotation process genome_tiers - list of string
-           - controlled vocabulary (based on app input and checked by
-           GenomeFileUtil) A list of labels describing the data source for
-           this genome. Allowed values - Representative, Reference,
-           ExternalDB, User Tier assignments based on genome source: * All
-           phytozome - Representative and ExternalDB * Phytozome flagship
-           genomes - Reference, Representative and ExternalDB * Ensembl -
-           Representative and ExternalDB * RefSeq Reference - Reference,
-           Representative and ExternalDB * RefSeq Representative -
-           Representative and ExternalDB * RefSeq Latest or All Assemblies
-           folder - ExternalDB * User Data - User tagged feature_counts - map
-           of string to integer - total counts of each type of feature keys
-           are a controlled vocabulary of - "CDS", "gene", "misc_feature",
-           "misc_recomb", "mobile_element", "ncRNA" - 72,
-           "non_coding_features", "non_coding_genes",
+           parameter "workspace_id" of Long, parameter "workspace" of String,
+           parameter "name" of String, parameter "data" of type "Genome"
+           (Genome type -- annotated and assembled genome data. Field
+           descriptions: id - string - KBase legacy data ID scientific_name -
+           string - human readable species name domain - string - human
+           readable phylogenetic domain name (eg. "Bacteria") warnings - list
+           of string - genome-level warnings generated in the annotation
+           process genome_tiers - list of string - controlled vocabulary
+           (based on app input and checked by GenomeFileUtil) A list of
+           labels describing the data source for this genome. Allowed values
+           - Representative, Reference, ExternalDB, User Tier assignments
+           based on genome source: * All phytozome - Representative and
+           ExternalDB * Phytozome flagship genomes - Reference,
+           Representative and ExternalDB * Ensembl - Representative and
+           ExternalDB * RefSeq Reference - Reference, Representative and
+           ExternalDB * RefSeq Representative - Representative and ExternalDB
+           * RefSeq Latest or All Assemblies folder - ExternalDB * User Data
+           - User tagged feature_counts - map of string to integer - total
+           counts of each type of feature keys are a controlled vocabulary of
+           - "CDS", "gene", "misc_feature", "misc_recomb", "mobile_element",
+           "ncRNA" - 72, "non_coding_features", "non_coding_genes",
            "protein_encoding_gene", "rRNA", "rep_origin", "repeat_region",
            "tRNA" genetic_code - int - An NCBI-assigned taxonomic category
            for the organism See here -
77 changes: 24 additions & 53 deletions lib/GenomeFileUtil/core/GenbankToGenome.py
@@ -22,7 +22,8 @@
 from installed_clients.WorkspaceClient import Workspace
 from GenomeFileUtil.core.GenomeUtils import (
     is_parent, propagate_cds_props_to_gene, warnings, parse_inferences,
-    load_ontology_mappings, set_taxon_data, set_default_taxon_data
+    load_ontology_mappings, set_taxon_data, set_default_taxon_data,
+    set_up_single_params, validate_mass_params
 )

 MAX_MISC_FEATURE_SIZE = 10000
@@ -113,52 +114,16 @@ def __init__(self, config):

     def import_genbank(self, params):
         print('validating parameters')
-        mass_params = self._set_up_single_params(params)
+        mass_params = set_up_single_params(
+            params, _WSNAME, self._validate_params, self.dfu.ws_name_to_id
+        )
         return self._import_genbank_mass(mass_params)[0]

     def import_genbank_mass(self, params):
         print('validating parameters')
-        self._validate_mass_params(params)
+        validate_mass_params(params, self._validate_params)
         return self._import_genbank_mass(params)

-    def _set_up_single_params(self, params):
-        # avoid side effects and keep variables in params unmodfied
-        inputs = dict(params)
-        self._validate_params(inputs)
-        ws_id = self._get_int(inputs.pop(_WSID, None), _WSID)
-        ws_name = inputs.pop(_WSNAME, None)
-        if (bool(ws_id) == bool(ws_name)):  # xnor
-            raise ValueError(f"Exactly one of a '{_WSID}' or a '{_WSNAME}' parameter must be provided")
-        if not ws_id:
-            print(f"Translating workspace name {ws_name} to a workspace ID. Prefer submitting "
-                  + "a workspace ID over a mutable workspace name that may cause race conditions")
-            ws_id = self.dfu.ws_name_to_id(ws_name)
-        mass_params = {_WSID: ws_id, _INPUTS: [inputs]}
-        return mass_params
-
-    def _validate_mass_params(self, params):
-        ws_id = self._get_int(params.get(_WSID), _WSID)
-        if not ws_id:
-            raise ValueError(f"{_WSID} is required")
-        inputs = params.get(_INPUTS)
-        if not inputs or type(inputs) is not list:
-            raise ValueError(f"{_INPUTS} field is required and must be a non-empty list")
-        for i, inp in enumerate(inputs, start=1):
-            if type(inp) is not dict:
-                raise ValueError(f"Entry #{i} in {_INPUTS} field is not a mapping as required")
-            try:
-                self._validate_params(inp)
-            except Exception as e:
-                raise ValueError(f"Entry #{i} in {_INPUTS} field has invalid params: {e}") from e
-
-    def _get_int(self, putative_int, name, minimum=1):
-        if putative_int is not None:
-            if type(putative_int) is not int:
-                raise ValueError(f"{name} must be an integer, got: {putative_int}")
-            if putative_int < minimum:
-                raise ValueError(f"{name} must be an integer >= {minimum}")
-        return putative_int
-
     def _import_genbank_mass(self, params):

         workspace_id = params[_WSID]
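
Note for readers without GenomeUtils.py open: the methods removed in the hunk above, together with the new call sites, suggest roughly what the relocated module-level helpers look like. The following is a minimal sketch reconstructed from that deleted code, not the actual GenomeUtils implementation. The parameter names `ws_name_key`, `validate_params`, and `ws_name_to_id`, and the values assigned to `_WSID` and `_INPUTS`, are assumptions.

```python
_WSID = "workspace_id"  # assumed value of the module constant
_INPUTS = "inputs"      # assumed value of the module constant


def _get_int(putative_int, name, minimum=1):
    """Return the value unchanged after checking it is None or an int >= minimum."""
    if putative_int is not None:
        if type(putative_int) is not int:
            raise ValueError(f"{name} must be an integer, got: {putative_int}")
        if putative_int < minimum:
            raise ValueError(f"{name} must be an integer >= {minimum}")
    return putative_int


def set_up_single_params(params, ws_name_key, validate_params, ws_name_to_id):
    """Wrap single-genome params into the shape the mass import expects."""
    inputs = dict(params)  # shallow copy so the caller's dict is not modified
    validate_params(inputs)
    ws_id = _get_int(inputs.pop(_WSID, None), _WSID)
    ws_name = inputs.pop(ws_name_key, None)
    if bool(ws_id) == bool(ws_name):  # both supplied, or neither
        raise ValueError(
            f"Exactly one of a '{_WSID}' or a '{ws_name_key}' parameter must be provided")
    if not ws_id:
        # the name-to-ID lookup is injected, so no service client is needed here
        ws_id = ws_name_to_id(ws_name)
    return {_WSID: ws_id, _INPUTS: [inputs]}


def validate_mass_params(params, validate_params):
    """Check the workspace ID and every entry of the inputs list."""
    ws_id = _get_int(params.get(_WSID), _WSID)
    if not ws_id:
        raise ValueError(f"{_WSID} is required")
    inputs = params.get(_INPUTS)
    if not inputs or type(inputs) is not list:
        raise ValueError(f"{_INPUTS} field is required and must be a non-empty list")
    for i, inp in enumerate(inputs, start=1):
        if type(inp) is not dict:
            raise ValueError(f"Entry #{i} in {_INPUTS} field is not a mapping as required")
        try:
            validate_params(inp)
        except Exception as e:
            raise ValueError(f"Entry #{i} in {_INPUTS} field has invalid params: {e}") from e
```

Passing the validator and the name lookup as callables is what lets both the GenBank importer and the genome-save path share these helpers, and it also means they can be unit tested with plain lambdas instead of a live workspace.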
@@ -197,6 +162,12 @@ def _import_genbank_mass(self, params):
             # parse genbank file
             self._parse_genbank(genome_obj)

+            # check features
+            self.gi.check_dna_sequence_in_features(genome_obj.genome_data)
+
+            # validate genome
+            genome_obj.genome_data['warnings'] = self.gi.validate_genome(genome_obj.genome_data)
+
Comment on lines +165 to +170
Member
Are there tests for G2G that exercise these code paths?

Contributor Author
Yeah, we have genbank_upload_full_test.py.

Member
And there are tests that cause errors to be thrown from the check / validate methods?

Contributor Author
The check_dna_sequence_in_features function does not raise any errors.

validate_genome will raise an error if the genome size is too large, but I assume this is already covered in genome_size_tests.py by other devs.

https://github.com/kbaseapps/GenomeFileUtil/blob/master/test/utility/genome_size_tests.py

Member
Hold on, I'm confused - these two checks are being added to the import mass function, but genbank_upload_full_test isn't changing. How can it test that these functions are called correctly from the mass function?

Member
You're talking about test_full_sequence, test_partial_sequence, and test_no_sequence_kept?

Contributor Author
Yep, those three

Member
If I put them in files and diff them there are differences. I don't know anything about this file other than what the git history says about it unfortunately

Contributor Author
Refactored the genome_size_tests.py script and added mass save tests for the check_dna_sequence_in_features function.

Member
The changes you made to the tests look good and are valuable in and of themselves, but reading through the tests I'm now not sure they actually exercise the check_dna_sequence_in_features function in a meaningful way. The only way that function does anything detectable is if there is at least one feature in the genome without dna sequence, in which case it will add dna sequences by calling a service to retrieve them. It's not clear to me whether the test genome file has any features that are missing dna sequences. If there aren't any then the 3 size tests will pass even if the line in the mass function that calls check_dna_sequence_in_features is deleted.
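
One way to close that gap, sketched here under stated assumptions, is to assert the call itself rather than rely on its side effects. `FakeImporter` and its `_import_mass` method are hypothetical stand-ins for GenbankToGenome and `_import_genbank_mass`; only the `gi.check_dna_sequence_in_features` and `gi.validate_genome` names come from the diff. A test of this shape fails if either line is deleted from the loop, regardless of whether the test genome has features lacking DNA sequence.

```python
import unittest
from unittest.mock import MagicMock, call


class FakeImporter:
    """Hypothetical stand-in for GenbankToGenome; mirrors only the two added calls."""

    def __init__(self, gi):
        self.gi = gi

    def _import_mass(self, genomes):
        for genome_data in genomes:
            # the two lines under review in the mass import loop
            self.gi.check_dna_sequence_in_features(genome_data)
            genome_data["warnings"] = self.gi.validate_genome(genome_data)
        return genomes


class MassImportCallsChecks(unittest.TestCase):
    def test_checks_called_once_per_genome(self):
        gi = MagicMock()
        gi.validate_genome.return_value = ["some warning"]
        genomes = [{"id": "g1"}, {"id": "g2"}]

        result = FakeImporter(gi)._import_mass(genomes)

        # deleting either call in the loop makes one of these assertions fail
        gi.check_dna_sequence_in_features.assert_has_calls(
            [call(genomes[0]), call(genomes[1])])
        self.assertEqual(gi.validate_genome.call_count, 2)
        self.assertEqual([g["warnings"] for g in result], [["some warning"]] * 2)
```

In the real test suite the same idea would presumably be applied by replacing the `gi` attribute on an actual importer instance with a mock (for example via `unittest.mock.patch.object`), though how that instance is constructed in the tests is not shown in this diff.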

             # gather all objects
             genome_objs.append(genome_obj)

@@ -209,7 +180,6 @@ def _import_genbank_mass(self, params):
         for genome_obj in genome_objs:
             shutil.rmtree(genome_obj.input_directory)

-        # TODO make an internal mass function save_genomes
         results = self._save_genomes(workspace_id, genome_objs)

         # return the result
@@ -227,17 +197,18 @@ def _import_genbank_mass(self, params):
         return details

     def _save_genomes(self, workspace_id, genome_objs):
-        results = [
-            self.gi.save_one_genome(
-                {
-                    'workspace': workspace_id,
-                    'name': genome_obj.genome_name,
-                    'data': genome_obj.genome_data,
-                    "meta": genome_obj.genome_meta,
-                }
-            ) for genome_obj in genome_objs
-        ]
-
+        results = self.gi.save_genome_mass(
+            {
+                "workspace_id": workspace_id,
+                "inputs": [
+                    {
+                        "name": genome_obj.genome_name,
+                        "data": genome_obj.genome_data,
+                        "meta": genome_obj.genome_meta,
+                    } for genome_obj in genome_objs
+                ],
+            }
+        )
         return results

     def _validate_params(self, params):
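
The shape change in `_save_genomes` is from one save call per genome to a single call carrying every genome. A minimal sketch of just the payload construction, assuming a hypothetical `GenomeObj` record with the three attributes the diff reads (`genome_name`, `genome_data`, `genome_meta`); `build_mass_save_params` is an illustrative helper and not part of GenomeFileUtil:

```python
from dataclasses import dataclass, field


@dataclass
class GenomeObj:
    """Hypothetical record holding the three attributes used by _save_genomes."""
    genome_name: str
    genome_data: dict = field(default_factory=dict)
    genome_meta: dict = field(default_factory=dict)


def build_mass_save_params(workspace_id, genome_objs):
    """Build one batch payload: a shared workspace ID plus one entry per genome."""
    return {
        "workspace_id": workspace_id,
        "inputs": [
            {"name": g.genome_name, "data": g.genome_data, "meta": g.genome_meta}
            for g in genome_objs
        ],
    }
```

Note that the removed code passed the numeric ID under the `workspace` key, whereas the batch payload uses `workspace_id`, matching the field this PR adds to the spec.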