rdkit.Chem.RegistrationHash module

Generate a unique hash code for a molecule based on chemistry. If two molecules are chemically “the same”, they should have the same hash.

Using molhash adds value beyond using SMILES because it:

* Ignores SMILES features that are not chemically meaningful (e.g. atom map numbers)
* Canonicalizes enhanced stereochemistry groups. For example, C[C@H](O)CC |&1:1| and C[C@@H](O)CC |&1:1| have the same molhash
* Canonicalizes S group data (for example, polymer data)
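A minimal sketch of the points above, assuming RDKit is installed. The `mol_hash` helper is hypothetical and exists only for this illustration; it chains `GetMolLayers` and `GetMolHash` with their default arguments.

```python
from rdkit import Chem
from rdkit.Chem import RegistrationHash


def mol_hash(smiles: str) -> str:
    """Hypothetical helper: parse a SMILES/CXSMILES string and return its hash."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse {smiles!r}")
    layers = RegistrationHash.GetMolLayers(mol)
    return RegistrationHash.GetMolHash(layers)


# The two CXSMILES quoted above: the same "and" stereo group,
# drawn with opposite chirality tags on atom 1.
racemate_a = mol_hash("C[C@H](O)CC |&1:1|")
racemate_b = mol_hash("C[C@@H](O)CC |&1:1|")
print("enhanced stereo variants match:", racemate_a == racemate_b)

# Ethanol written with and without atom map numbers.
plain = mol_hash("CCO")
mapped = mol_hash("[CH3:1][CH2:2][OH:3]")
print("atom-mapped variant matches:", plain == mapped)
```

Comparing the hashes rather than the input strings is the point here: the inputs differ textually while describing the same registered substance.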

There are three hash schemes: the default, which uses all layers; one in which stereochemistry is ignored; and one in which tautomers are considered equivalent.

class rdkit.Chem.RegistrationHash.EnhancedStereoUpdateMode(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

ADD_WEIGHTS = 1

REMOVE_WEIGHTS = 2

rdkit.Chem.RegistrationHash.GetMolHash(all_layers, hash_scheme: HashScheme = HashScheme.ALL_LAYERS) → str

Generate a molecular hash using a specified set of layers.

Parameters:

all_layers – a dictionary of layers
hash_scheme – enum encoding information layers for the hash

Returns:

hash for the given scheme constructed from the input layers
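The sketch below illustrates how the `hash_scheme` argument changes the outcome for one set of layers. The choice of 2-hydroxypyridine and 2-pyridone as a tautomer pair is an assumption made for this example, not something taken from the RDKit documentation.

```python
from rdkit import Chem
from rdkit.Chem import RegistrationHash
from rdkit.Chem.RegistrationHash import HashScheme

# Two tautomers of the same compound (example pair chosen for illustration).
hydroxypyridine = Chem.MolFromSmiles("Oc1ccccn1")
pyridone = Chem.MolFromSmiles("O=c1cccc[nH]1")

# Layers are computed once per molecule and can be reused for several schemes.
layers_a = RegistrationHash.GetMolLayers(hydroxypyridine)
layers_b = RegistrationHash.GetMolLayers(pyridone)

# Default scheme: every layer contributes to the hash.
strict_a = RegistrationHash.GetMolHash(layers_a)
strict_b = RegistrationHash.GetMolHash(layers_b)

# Tautomer-insensitive scheme: tautomer-sensitive layers are left out.
scheme = HashScheme.TAUTOMER_INSENSITIVE_LAYERS
loose_a = RegistrationHash.GetMolHash(layers_a, hash_scheme=scheme)
loose_b = RegistrationHash.GetMolHash(layers_b, hash_scheme=scheme)

print("ALL_LAYERS hashes equal:", strict_a == strict_b)
print("TAUTOMER_INSENSITIVE_LAYERS hashes equal:", loose_a == loose_b)
```

Because `GetMolHash` only consumes the layer dictionary, switching schemes does not require recomputing the layers.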

rdkit.Chem.RegistrationHash.GetMolLayers(original_molecule: Mol, data_field_names: Iterable | None = None, escape: str | None = None, cxflag=1089, enable_tautomer_hash_v2=False) → {<HashLayer.NO_STEREO_TAUTOMER_HASH: 5>, <HashLayer.CANONICAL_SMILES: 1>, <HashLayer.ESCAPE: 2>, <HashLayer.SGROUP_DATA: 6>, <HashLayer.TAUTOMER_HASH: 7>, <HashLayer.NO_STEREO_SMILES: 4>, <HashLayer.FORMULA: 3>}

Generate layers of data that could be used to identify a molecule

Parameters:

original_molecule – molecule to obtain canonicalization layers from
data_field_names – optional sequence of names of SGroup DAT fields which will be included in the hash.
escape – optional field which can contain arbitrary information
enable_tautomer_hash_v2 – use v2 of the tautomer hash

Returns:

dictionary of HashLayer enum to calculated hash
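As a rough illustration of the returned dictionary, the sketch below prints every layer for ethanol and shows the effect of the `escape` argument. The escape strings "lot-A" and "lot-B" are arbitrary values invented for this example.

```python
from rdkit import Chem
from rdkit.Chem import RegistrationHash
from rdkit.Chem.RegistrationHash import HashLayer

ethanol = Chem.MolFromSmiles("CCO")

# The escape text is carried along as its own layer.
layers = RegistrationHash.GetMolLayers(ethanol, escape="lot-A")
for layer in HashLayer:
    print(f"{layer.name:>24}: {layers[layer]!r}")

# Same structure, different escape text.
other = RegistrationHash.GetMolLayers(ethanol, escape="lot-B")
same_structure = (
    layers[HashLayer.CANONICAL_SMILES] == other[HashLayer.CANONICAL_SMILES]
)
same_hash = RegistrationHash.GetMolHash(layers) == RegistrationHash.GetMolHash(other)
print("same canonical SMILES:", same_structure)
print("same hash:", same_hash)
```

The escape layer gives a way to keep two chemically identical entries apart, since it is part of the default scheme.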

rdkit.Chem.RegistrationHash.GetNoStereoLayers(mol, enable_tautomer_hash_v2=False)

rdkit.Chem.RegistrationHash.GetStereoTautomerHash(molecule, cxflag=1089, enable_tautomer_hash_v2=False)

class rdkit.Chem.RegistrationHash.HashLayer(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Variables:

CANONICAL_SMILES – RDKit canonical SMILES (excluding enhanced stereo)
ESCAPE – arbitrary other information to be incorporated
FORMULA – a simple molecular formula for the molecule
NO_STEREO_SMILES – RDKit canonical SMILES with all stereo removed
SGROUP_DATA – canonicalization of all SGroups data present
TAUTOMER_HASH – SMILES-like representation for a generic tautomer form
NO_STEREO_TAUTOMER_HASH – the above tautomer hash lacking all stereo

CANONICAL_SMILES = 1

ESCAPE = 2

FORMULA = 3

NO_STEREO_SMILES = 4

NO_STEREO_TAUTOMER_HASH = 5

SGROUP_DATA = 6

TAUTOMER_HASH = 7

class rdkit.Chem.RegistrationHash.HashScheme(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Which hash layers to use when deduplicating molecules

Typically the “ALL_LAYERS” scheme is used, but some users may want the “TAUTOMER_INSENSITIVE_LAYERS” scheme.

Variables:

ALL_LAYERS – most strict hash scheme utilizing all layers
STEREO_INSENSITIVE_LAYERS – excludes stereo sensitive layers
TAUTOMER_INSENSITIVE_LAYERS – excludes tautomer sensitive layers

ALL_LAYERS = (HashLayer.CANONICAL_SMILES, HashLayer.ESCAPE, HashLayer.FORMULA, HashLayer.NO_STEREO_SMILES, HashLayer.NO_STEREO_TAUTOMER_HASH, HashLayer.SGROUP_DATA, HashLayer.TAUTOMER_HASH)

STEREO_INSENSITIVE_LAYERS = (HashLayer.ESCAPE, HashLayer.FORMULA, HashLayer.NO_STEREO_SMILES, HashLayer.NO_STEREO_TAUTOMER_HASH, HashLayer.SGROUP_DATA)

TAUTOMER_INSENSITIVE_LAYERS = (HashLayer.ESCAPE, HashLayer.FORMULA, HashLayer.NO_STEREO_TAUTOMER_HASH, HashLayer.SGROUP_DATA, HashLayer.TAUTOMER_HASH)
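The layer tuples above can be compared programmatically. The following sketch lists which layers each scheme leaves out relative to ALL_LAYERS, then checks the stereo-insensitive scheme on the two enantiomers of 2-butanol. The `dropped` dictionary and the choice of molecule are assumptions made for this illustration.

```python
from rdkit import Chem
from rdkit.Chem import RegistrationHash
from rdkit.Chem.RegistrationHash import HashScheme

# Each scheme's value is the tuple of layers it feeds into the hash.
all_layers = set(HashScheme.ALL_LAYERS.value)
dropped = {
    scheme.name: sorted(layer.name for layer in all_layers - set(scheme.value))
    for scheme in HashScheme
}
for name, missing in dropped.items():
    print(f"{name}: omits {missing}")

# The two enantiomers of 2-butanol, without any enhanced stereo group.
layers_1 = RegistrationHash.GetMolLayers(Chem.MolFromSmiles("C[C@H](O)CC"))
layers_2 = RegistrationHash.GetMolLayers(Chem.MolFromSmiles("C[C@@H](O)CC"))

strict_match = (
    RegistrationHash.GetMolHash(layers_1) == RegistrationHash.GetMolHash(layers_2)
)
scheme = HashScheme.STEREO_INSENSITIVE_LAYERS
stereo_blind_match = (
    RegistrationHash.GetMolHash(layers_1, hash_scheme=scheme)
    == RegistrationHash.GetMolHash(layers_2, hash_scheme=scheme)
)
print("ALL_LAYERS match:", strict_match)
print("STEREO_INSENSITIVE_LAYERS match:", stereo_blind_match)
```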
