Field separators in quoted attributes cause error #212

@zwpwjwtz

I tested this GTF file with gffutils, and an AttributeStringError exception was raised at line 349 of parser.py.
After some exploration I noticed that lines whose attribute values contain ";" (the field separator) are what cause the parser to fail.

Example:
NC_000964.3 RefSeq CDS 410 1747 . + 0 gene_id "BSU_00010"; ...... note "Evidence 1a: Function from experimental evidences in the studied strain; PubMedId: 2167836, 2846289, 12682299, 16120674, 1779750, 28166228; Product type f : factor"; ......
Because the semicolons were treated as field separators first, the sub-attributes ("Evidence 1a", "PubMedId" and "Product type f") were split into separate fields, and the numbers after "PubMedId" were parsed as multiple values of the (wrong) "PubMedId" key. Since dialect["repeated key"] had already been set by the multiple definitions of the "db_xref" field, the exception mentioned above was triggered.
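To make the failure mode concrete, here is a simplified illustration of what happens when the attribute column is split on ";" without regard to quotes. This is plain `str.split`, not gffutils' actual parsing code, and the attribute string only contains the parts of the example line that are not elided above.

```python
# Simplified illustration only; this is NOT gffutils' actual parsing code.
# The string holds only the non-elided parts of the example line.
attrs = (
    'gene_id "BSU_00010"; '
    'note "Evidence 1a: Function from experimental evidences in the studied strain; '
    'PubMedId: 2167836, 2846289, 12682299, 16120674, 1779750, 28166228; '
    'Product type f : factor";'
)

# Naive split: every ";" is treated as a field separator, quoted or not.
pieces = [p.strip() for p in attrs.split(";") if p.strip()]

for piece in pieces:
    print(repr(piece))
```

The single quoted "note" value ends up spread over three pieces, two of which carry an unbalanced quote, and the "PubMedId: ..." piece looks like a key of its own followed by comma-separated values.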

I suggest that quotes be parsed first, before the field separators are located and parsed. Although this may require the parser to behave like a streaming parser rather than a structured one, it guarantees that no content between quotes can escape and contaminate other fields.
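A minimal sketch of what I mean, assuming double quotes are the only quoting character and that quotes are never escaped inside a value. The function name `split_outside_quotes` is hypothetical and not part of gffutils; it only shows the idea of scanning character by character and honouring separators outside quotes only.

```python
def split_outside_quotes(attr_string, sep=";", quote='"'):
    """Split attr_string on sep, ignoring separators inside quoted values.

    Hypothetical helper, not part of gffutils. Assumes quotes are not escaped.
    """
    fields = []
    current = []
    in_quotes = False
    for ch in attr_string:
        if ch == quote:
            # Toggle the quoted state and keep the quote in the field.
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == sep and not in_quotes:
            # A separator outside quotes ends the current field.
            field = "".join(current).strip()
            if field:
                fields.append(field)
            current = []
        else:
            current.append(ch)
    # Keep whatever follows the last separator, if anything.
    tail = "".join(current).strip()
    if tail:
        fields.append(tail)
    return fields


# Only the non-elided parts of the example line above.
attrs = (
    'gene_id "BSU_00010"; '
    'note "Evidence 1a: Function from experimental evidences in the studied strain; '
    'PubMedId: 2167836, 2846289, 12682299, 16120674, 1779750, 28166228; '
    'Product type f : factor";'
)

fields = split_outside_quotes(attrs)
for field in fields:
    print(repr(field))
```

With this approach the example yields two fields, `gene_id` and `note`, and the semicolons inside the quoted note stay part of its value. Escaped quotes or other quoting styles would need extra handling.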
