Molecule Featurizers
Overview
Featurizing molecules means generating properties and information about each molecule, such as the number of rings, molecular weight, and number of stereocenters. These features are crucial for exploring and analyzing data, identifying trends, and training machine learning or artificial intelligence models. The Featurizer base class gives every featurizer a common interface for computing these features, whether for one molecule or for many.
Fingerprints
Fingerprints are a specific type of featurizer that generates a 1D array of numerical values for each molecule. Each value represents a specific feature or property of the molecule. Because fingerprints are entirely numerical, they can be used directly in mathematical applications without additional processing.
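As an illustration of using a fingerprint directly in a calculation, the sketch below computes the Tanimoto similarity between two bit vectors. The 8-bit fingerprints and the `tanimoto_similarity` helper are hypothetical and written in plain Python; they are not part of the toolkit, and real fingerprints are usually much longer.

```python
def tanimoto_similarity(fp_a: list[int], fp_b: list[int]) -> float:
    """Tanimoto (Jaccard) similarity between two equal-length bit vectors."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    # bits set in both fingerprints
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    # bits set in at least one fingerprint
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # two all-zero fingerprints are treated as identical in this sketch
    return both / either if either else 1.0


# Hypothetical 8-bit fingerprints, for illustration only
fingerprint_1 = [1, 0, 1, 1, 0, 0, 1, 0]
fingerprint_2 = [1, 0, 0, 1, 0, 1, 1, 0]

similarity = tanimoto_similarity(fingerprint_1, fingerprint_2)
print(similarity)  # 3 shared bits / 5 bits set in either = 0.6
```

No conversion step is needed before the arithmetic, which is the practical benefit of a purely numerical representation.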
Basic API vs. Featurizer API
The Featurizer API and the single-molecule API can both be used to calculate a list of properties. The choice between the two depends on your coding style preference and the scale of your project. The Featurizer API offers parallelization, which can be beneficial for large-scale projects.
Single Molecule ("Basic") API:
my_final_props = {
    "num_rings": [m.num_rings for m in input_molecules],
    "num_stereocenters": [m.num_stereocenters for m in input_molecules],
    "molecular_weight": [m.molecular_weight for m in input_molecules],
}
Featurizer API:
from simmate.toolkit.featurizers import PropertyGrabber

my_final_props = PropertyGrabber.featurize_many(
    molecules=input_molecules,
    properties=["num_rings", "num_stereocenters", "molecular_weight"],
    parallel=True,  # this is the key reason you'd want to use a Featurizer class!
)
Bug
parallel=True has not yet been implemented
Tip
Both APIs yield the same result. The main difference is that the Featurizer API is designed to use Dask for parallelization when parallel=True. If feature generation takes more than 15 minutes for all molecules, we recommend using the Featurizer API. This is typically the case when working with datasets of over 1 million molecules.
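One rough way to check the 15-minute guideline before committing to an API is to time a small sample and extrapolate linearly. The sketch below assumes that per-molecule cost is roughly constant; `estimate_total_seconds` and `count_ring_closure_digits` are hypothetical helpers, and the SMILES strings only stand in for real molecule objects.

```python
import time


def estimate_total_seconds(featurize, sample, total_count):
    """Time `featurize` on a small sample and extrapolate linearly."""
    if not sample:
        raise ValueError("sample must contain at least one item")
    start = time.perf_counter()
    for item in sample:
        featurize(item)
    elapsed = time.perf_counter() - start
    # average seconds per item, scaled up to the full dataset size
    return elapsed / len(sample) * total_count


def count_ring_closure_digits(smiles):
    # Hypothetical stand-in for a real per-molecule feature calculation
    return sum(ch.isdigit() for ch in smiles)


sample = ["c1ccccc1", "CCO", "C1CCCCC1"]
estimated = estimate_total_seconds(
    count_ring_closure_digits, sample, total_count=1_000_000
)

# Apply the 15-minute rule of thumb from the tip above
use_featurizer_api = estimated > 15 * 60
```

A sample this small gives a noisy estimate; in practice a few hundred representative molecules would be a more reasonable basis.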
Warning
Our API is still in the early stages of development and may undergo changes to improve usability.
Usage Guide
All classes that inherit from the Featurizer class can be used in the same way. This guide uses the MorganFingerprint as an example, but you can substitute it with any supported featurizer.
Available Featurizers
The toolkit.featurizers module contains all available featurizers, including:
- PropertyGrabber
- MorganFingerprint
Serial Use
You can featurize molecules one at a time using either the featurize or featurize_many(parallel=False) methods:
from simmate.toolkit.featurizers import MorganFingerprint

# OPTION 1
for molecule in input_molecules:
    fingerprint = MorganFingerprint.featurize(
        molecule=molecule
    )

# OPTION 2
fingerprints = MorganFingerprint.featurize_many(
    molecules=input_molecules,
    parallel=False,
)
Parallel Use
Enable parallelization by using the featurize_many(parallel=True) method:
from simmate.toolkit.featurizers import MorganFingerprint

fingerprints = MorganFingerprint.featurize_many(
    molecules=input_molecules,
    parallel=True,
)
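To illustrate the idea that a parallel run should return the same results, in the same order, as a serial one, here is a sketch that uses Python's standard-library thread pool as a stand-in. This is a swapped-in technique, not Dask and not the toolkit's implementation. `toy_fingerprint` is a hypothetical function over SMILES strings, and threads are unlikely to speed up CPU-bound Python code; the point is only the equivalence of the outputs.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_fingerprint(smiles, n_bits=16):
    """Hypothetical stand-in: set one bit per distinct character in the string."""
    bits = [0] * n_bits
    for ch in set(smiles):
        bits[ord(ch) % n_bits] = 1
    return bits


input_smiles = ["CCO", "c1ccccc1", "CC(=O)O", "N#N"]

# Serial reference: one molecule at a time
serial = [toy_fingerprint(s) for s in input_smiles]

# Executor.map yields results in input order, matching the serial loop
with ThreadPoolExecutor(max_workers=2) as pool:
    threaded = list(pool.map(toy_fingerprint, input_smiles))

print(threaded == serial)
```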
Adding a New Featurizer
To add a new featurizer, you need to:
- Inherit from the Featurizer base class
- Define a featurize method (can be a @classmethod or @staticmethod) that accepts a molecule as a kwarg.
The Featurizer base class then supplies featurize_many and the other shared functionality, building on your featurize method.
For example:
from simmate.toolkit import Molecule
from simmate.toolkit.featurizers.base import Featurizer


class Example(Featurizer):
    """
    An example featurizer
    """

    @staticmethod
    def featurize(
        molecule: Molecule,
        # feel free to add any extra kwargs you'd like
        example_setting: float = 0.123,
    ):
        # use the molecule to generate your feature(s)
        # (placeholder calculation; replace with your own logic)
        calculated_property = molecule.num_rings * example_setting
        return calculated_property
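To show how a base class can derive a many-molecule method from a single-molecule one, here is a self-contained sketch of the pattern. `SketchFeaturizer` and `HeavyAtomCount` are hypothetical stand-ins and do not reproduce the toolkit's actual Featurizer code. Molecules are represented as plain lists of element symbols purely for illustration.

```python
class SketchFeaturizer:
    """Hypothetical stand-in for a featurizer base class (not Simmate's code)."""

    @staticmethod
    def featurize(molecule, **kwargs):
        raise NotImplementedError("subclasses must define featurize")

    @classmethod
    def featurize_many(cls, molecules, parallel=False, **kwargs):
        if parallel:
            # parallel execution is deliberately left out of this sketch
            raise NotImplementedError("parallel mode is not part of this sketch")
        # serial mode: apply the subclass's featurize to each molecule in order,
        # forwarding any extra keyword settings unchanged
        return [cls.featurize(molecule=m, **kwargs) for m in molecules]


class HeavyAtomCount(SketchFeaturizer):
    @staticmethod
    def featurize(molecule, ignore=("H",)):
        # `molecule` is a list of element symbols in this toy example
        return sum(1 for symbol in molecule if symbol not in ignore)


toy_molecules = [["C", "H", "H", "H", "H"], ["O", "H", "H"], ["C", "O", "O"]]

counts = HeavyAtomCount.featurize_many(molecules=toy_molecules)
counts_all = HeavyAtomCount.featurize_many(molecules=toy_molecules, ignore=())
print(counts)      # [1, 1, 3]
print(counts_all)  # [5, 3, 3]
```

The subclass defines only featurize; extra keyword settings such as `ignore` pass through featurize_many unchanged, which is one way a base class can keep per-featurizer options out of the shared code.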