SMILES should be canonical (unless augmentation is used or other format is specifically requested) #117

@sfluegel05

At the moment, we take "raw" SMILES strings as they come from ChEBI and feed them directly into the tokeniser. While ChEBI might perform some form of standardisation, we can't guarantee that it is applied consistently. More importantly, we also don't standardise SMILES at prediction time for new, unknown inputs, which leads to the following (hypothetical) scenario:

  1. The model is trained on ChEBI molecules which always use [Ge++]
  2. The user puts in [Ge+2]
  3. The model is utterly confused because it has never seen the token [Ge+2] (although the user put in a SMILES that corresponds to a ChEBI molecule)
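
The scenario above can be reproduced with a small sketch. The regex tokeniser here is a simplified stand-in, not the project's actual tokeniser, and `tokenise`, `build_vocabulary` and `unknown_tokens` are hypothetical names used only for illustration:

```python
import re
from typing import List, Set

# Simplified, hypothetical SMILES tokeniser (an assumption, not the real one):
# a bracket atom is one token, Br/Cl stay together, anything else is one character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenise(smiles: str) -> List[str]:
    """Split a SMILES string into tokens."""
    return TOKEN_PATTERN.findall(smiles)


def build_vocabulary(training_smiles: List[str]) -> Set[str]:
    """Collect every token that occurs in the training data."""
    return {token for smiles in training_smiles for token in tokenise(smiles)}


def unknown_tokens(smiles: str, vocabulary: Set[str]) -> List[str]:
    """Return the tokens of `smiles` that were never seen during training."""
    return [token for token in tokenise(smiles) if token not in vocabulary]


# Training data always writes the charge as "++".
vocabulary = build_vocabulary(["[Ge++]"])

# The user writes the same ion as "+2": the token is out of vocabulary.
print(unknown_tokens("[Ge+2]", vocabulary))
print(unknown_tokens("[Ge++]", vocabulary))
```

Both strings describe the same ion, but only the spelling seen during training is in the vocabulary.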

Todo

  • Canonicalise SMILES with RDKit by default
  • Once chebifier has models trained on canonical SMILES, make sure that user-supplied SMILES get the same treatment (this should happen automatically if canonicalisation is a reader feature)
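
A minimal sketch of how canonicalisation could live in the reader, so that training and prediction share one code path. `CanonicalisingReader` and `rdkit_canonical_smiles` are hypothetical names, not existing classes in the code base. The only RDKit calls assumed are `Chem.MolFromSmiles` and `Chem.MolToSmiles`, and the canonicaliser is injectable so that augmentation or another format can replace it:

```python
from typing import Callable, List, Optional


def rdkit_canonical_smiles(smiles: str) -> Optional[str]:
    """Return RDKit's canonical SMILES, or None if RDKit cannot parse the input."""
    from rdkit import Chem  # lazy import: only needed when this default is used

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol)  # canonical output is RDKit's default


class CanonicalisingReader:
    """Hypothetical reader that canonicalises every SMILES before tokenising."""

    def __init__(
        self,
        tokenise: Callable[[str], List[str]],
        canonicalise: Callable[[str], Optional[str]] = rdkit_canonical_smiles,
    ) -> None:
        self.tokenise = tokenise
        self.canonicalise = canonicalise

    def read(self, smiles: str) -> List[str]:
        """Canonicalise, then tokenise; used for training data and user input alike."""
        canonical = self.canonicalise(smiles)
        if canonical is None:
            raise ValueError(f"Could not parse SMILES: {smiles!r}")
        return self.tokenise(canonical)
```

Because the same `read` method handles ChEBI data and user input, different spellings of the same molecule reach the tokeniser in one form, provided the canonicaliser maps them to the same string.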
