SMILES should be canonical (unless augmentation is used or another format is specifically requested) #117

At the moment, we take "raw" SMILES strings as they come from ChEBI and feed them directly into the tokeniser. ChEBI might perform some form of standardisation, but we can't guarantee that it is applied consistently. More importantly, we don't standardise SMILES at prediction time either, so new, unknown SMILES reach the model in whatever notation the user chose. This can lead to the following (hypothetical) scenario:

1. The model is trained on ChEBI molecules which always use `[Ge++]`
2. The user puts in `[Ge+2]`
3. The model is confused because it has never seen this token, even though the user's SMILES corresponds to a ChEBI molecule
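The scenario above can be reproduced with a toy tokeniser. This is only a sketch: the regex, the `<unk>` token and the helper names (`tokenise`, `build_vocab`, `encode`) are made up for illustration and are not the project's actual tokeniser.

```python
import re

# Hypothetical, simplified SMILES tokeniser: bracket atoms are kept as one
# token, "Br"/"Cl" as two-letter atoms, everything else character by character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenise(smiles):
    """Split a SMILES string into tokens."""
    return TOKEN_PATTERN.findall(smiles)


def build_vocab(training_smiles):
    """Assign an integer id to every token seen during training."""
    vocab = {"<unk>": 0}
    for smiles in training_smiles:
        for token in tokenise(smiles):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(smiles, vocab):
    """Map tokens to ids; tokens never seen in training become <unk>."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokenise(smiles)]


# Training data only ever writes the charge as "++".
vocab = build_vocab(["[Ge++]", "CCO"])

seen = encode("[Ge++]", vocab)    # known token
unseen = encode("[Ge+2]", vocab)  # same atom, different notation
print(seen, unseen)
```

Both strings describe the same atom, but only the first one maps to a token the model was trained on.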

## Todo
- Canonicalise SMILES with RDKit by default
- As soon as chebifier has models trained on canonical SMILES, make sure that SMILES from user input get the same treatment (this should happen automatically if canonicalisation is implemented as a reader feature)
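
A minimal sketch of what the canonicalisation step could look like, assuming RDKit's `Chem.MolFromSmiles` / `Chem.MolToSmiles` round trip is an acceptable canonical form. The function name `prepare_smiles` and its `canonical` flag are hypothetical, as is the choice to pass unparsable strings through unchanged; the real reader may want to handle those differently.

```python
try:
    from rdkit import Chem
except ImportError:  # keeps the sketch importable without RDKit
    Chem = None


def prepare_smiles(smiles, canonical=True):
    """Return the string that would be handed to the tokeniser.

    With canonical=False (e.g. when augmentation is used or another format
    is explicitly requested) the input is returned untouched.
    """
    if not canonical:
        return smiles
    if Chem is None:
        raise RuntimeError("RDKit is required for canonicalisation")
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # Assumption: fall back to the raw string if RDKit cannot parse it.
        return smiles
    return Chem.MolToSmiles(mol)


if Chem is not None:
    # Both notations should end up as the same canonical string.
    print(prepare_smiles("[Ge++]"), prepare_smiles("[Ge+2]"))
```

If the same function is called both when reading the training data and when reading user input, the two code paths cannot drift apart.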
