Molecular Similarity & Distance¶
Overview¶
"Similarity" and "distance" are mathematical measures used to quantify the likeness or difference between molecules.
To compare molecules, we first need a "description" of each molecule, which is obtained from features or fingerprints (refer to the Featurizers
section). We then apply a mathematical operator to determine the "distance" between these fingerprints.
The process of comparing molecules typically involves:
- Generating a fingerprint for each molecule OR starting with a pre-existing list of fingerprints
- Applying a distance formula to quantify the proximity of two fingerprints
The SimilarityEngine
class manages these steps.
Example
Suppose we want to compare two molecules based on three features:
- Fraction of sp3 carbons
- Number of hydrogen donors
- Number of hydrogen acceptors
We measure these values for each molecule, resulting in a "fingerprint" of [x,y,z]
:
- Molecule 1:
[0.345, 5, 6]
- Molecule 2:
[0.543, 2, 1]
To determine the similarity, we can "plot" these fingerprints in 3D space and calculate the "distance" between these points. If we're interested in similarity rather than distance, we can consider them as inverses of each other:
distance^2 = (x2-x1)^2 + (y2-y1)^2 + (z2-z1)^2
similarity = 1 / distance
Using these rules, the "similarity" of these two molecules is scored as 0.17
.
In this example, we used a basic fingerprint (3 features) and the Euclidean
formula for distance. However, there are numerous ways to generate fingerprints (some with >1000 features!) and calculate distance. The same process and concepts apply to each.
Basic Use¶
We'll use Tanimoto
as an example here, but all methods behave similarly.
1. Select a fingerprint method¶
Choose any fingerprint method from the simmate.toolkit.featurizers
module. Refer to the Featurizers section for all options. For this example, we'll use MorganFingerprint
:
from simmate.toolkit.featurizers import MorganFingerprint
2. Select a similarity metric¶
Choose an appropriate metric for similarity based on the selected fingerprint. The following types of similarity/distance measurements are supported:
CLASS |
---|
Cosine |
Dice |
Euclidean |
Tanimoto |
from simmate.toolkit.similarity import Tanimoto
Warning
For many fingerprints, there is a "logical & correct" choice for the metric to use. If you're unsure, don't guess! Seek help & advice
For instance, the MorganFingerprint
that we selected in step 1 is most effective when used with Tanimoto
.
2. Select a similarity method¶
All SimilarityEngine
subclasses support the following methods:
CLASS |
---|
get_similarity(fingerprint1, fingerprint2) |
get_similarity_series(fingerprint1, [fingerprint2, fingerprint3, fingerprint4, ...]) |
get_similarity_matrix([fingerprint1, fingerprint2, fingerprint3, ...]) |
get_distance(fingerprint1, fingerprint2) |
get_distance_series(fingerprint1, [fingerprint2, fingerprint3, fingerprint4, ...]) |
get_distance_matrix([fingerprint1, fingerprint2, fingerprint3, ...]) |
Suppose we have a single molecule that we want to compare to a set of 1,000 molecules. For this, we'll use get_similarity_series
:
Tanimoto.get_similarity_series
3. Construct the final script¶
Now, let's combine everything for our final script.
from simmate.toolkit import Molecule
from simmate.toolkit.featurizers import MorganFingerprint
from simmate.toolkit.similarity import Tanimoto
# Load query molecule
query_molecule = Molecule.from_smiles(".....")
# Load other molecules
smiles_strs = [......]
molecules = [Molecule.from_smiles(s) for s in smiles_strs]
# Generate fingerprints
query_fingerprint = MorganFingerprint.featurize(query_molecule)
fingerprints = MorganFingerprint.featurize_many(molecules)
# Generate the similarity scores
similarities = Tanimoto.get_similarity_series(
query_fingerprint,
fingerprints,
)