At the moment, we take "raw" SMILES strings as they come from ChEBI and pass them directly to the tokeniser. While ChEBI might perform some form of standardisation, we can't guarantee that it is applied consistently. More importantly, we also don't standardise SMILES at prediction time, when users submit new, unknown molecules. This leads to the following (hypothetical) scenario:
The model is trained on ChEBI molecules which always use [Ge++]
The user puts in [Ge+2]
The model is utterly confused because it has never seen this token (although the user put in a SMILES that corresponds to a ChEBI molecule)
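The scenario above can be illustrated with a toy tokeniser. This is only a sketch: `tokenise`, `build_vocab` and `unknown_tokens` are hypothetical helpers, and the regex is a simplified stand-in for the real tokeniser (it treats each bracket atom as one token, which is the assumption that matters here).

```python
import re

# Hypothetical, simplified SMILES tokeniser: a bracket atom such as [Ge++]
# is one token, Br/Cl are kept together, everything else is one character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenise(smiles):
    """Split a SMILES string into tokens."""
    return TOKEN_PATTERN.findall(smiles)


def build_vocab(training_smiles):
    """Collect every token that occurs in the training SMILES."""
    vocab = set()
    for smiles in training_smiles:
        vocab.update(tokenise(smiles))
    return vocab


def unknown_tokens(smiles, vocab):
    """Return the tokens of `smiles` that the vocabulary has never seen."""
    return [token for token in tokenise(smiles) if token not in vocab]


# Training data only ever writes the charge as "++".
vocab = build_vocab(["[Ge++]", "CCO"])

print(unknown_tokens("[Ge++]", vocab))  # []
print(unknown_tokens("[Ge+2]", vocab))  # ['[Ge+2]']
```

Both strings describe the same ion, but only one of them maps to a token the model was trained on.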
Todo
Canonicalise SMILES with RDKit by default
As soon as chebifier has models trained on canonical SMILES, make sure that SMILES from user input get the same treatment (this should happen automatically if canonicalisation is a reader feature)
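A minimal sketch of what canonicalisation as a reader feature could look like. `SmilesReader`, `preprocess` and `rdkit_canonicalise` are hypothetical names, not the existing reader API; the only RDKit calls used are `Chem.MolFromSmiles` and `Chem.MolToSmiles`. Falling back to the raw string when RDKit cannot parse the input is an assumption, and rejecting such inputs might be preferable.

```python
from typing import Callable, Optional


def rdkit_canonicalise(smiles: str) -> Optional[str]:
    """Return RDKit's canonical SMILES, or None if RDKit cannot parse the input."""
    from rdkit import Chem  # imported lazily so the reader loads without RDKit

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol)  # canonical output is RDKit's default


class SmilesReader:
    """Hypothetical reader that canonicalises SMILES before tokenisation.

    Because training data and user input both pass through `preprocess`,
    they receive the same treatment.
    """

    def __init__(
        self,
        canonicalise: Optional[Callable[[str], Optional[str]]] = rdkit_canonicalise,
    ):
        # Pass canonicalise=None to keep the current "raw SMILES" behaviour.
        self.canonicalise = canonicalise

    def preprocess(self, smiles: str) -> str:
        if self.canonicalise is None:
            return smiles
        canonical = self.canonicalise(smiles)
        # Assumption: keep the raw string if canonicalisation fails.
        return smiles if canonical is None else canonical
```

With this layout, the training pipeline and chebifier would both call `SmilesReader().preprocess(smiles)` before tokenising, so `[Ge++]` and `[Ge+2]` end up as the same string.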