Molecule Featurizers¶
Overview¶
Featurizing molecules means generating properties and information about each molecule, such as the number of rings, molecular weight, and number of stereocenters. These features are crucial for exploring and analyzing data, identifying trends, and training machine learning or artificial intelligence models. The Featurizer base class helps streamline this process.
Fingerprints¶
Fingerprints are a special type of featurizer that generates a 1D array of numerical values for each molecule. Each value represents a specific feature or property of the molecule. Because fingerprints are entirely numerical, they can be used directly in mathematical applications without additional processing.
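As an illustration of why a purely numerical 1D array is convenient, the sketch below compares two fingerprints with the Tanimoto (Jaccard) similarity, a common choice for bit-vector fingerprints. The two 8-bit vectors are made up for this example and were not produced by any featurizer in this toolkit; real fingerprints are typically much longer.

```python
def tanimoto_similarity(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two equal-length bit vectors."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    # count positions where both bits are on, and where at least one is on
    both_on = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either_on = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # two all-zero fingerprints share no features; treat that as 0.0
    return both_on / either_on if either_on else 0.0


# Hypothetical fingerprints (assumed values, for illustration only)
fingerprint_1 = [1, 0, 1, 1, 0, 0, 1, 0]
fingerprint_2 = [1, 0, 0, 1, 0, 1, 1, 0]

similarity = tanimoto_similarity(fingerprint_1, fingerprint_2)
print(similarity)
```

No conversion step is needed before the comparison, which is the practical benefit of fingerprints being numerical from the start.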
Basic API vs. Featurizer API¶
The Featurizer API and the single-molecule API can both be used to calculate a list of properties. The choice between the two depends on your coding style preference and the scale of your project. The Featurizer API offers parallelization, which can be beneficial for large-scale projects.
Single Molecule ("Basic") API:
my_final_props = {
    "num_rings": [m.num_rings for m in input_molecules],
    "num_stereocenters": [m.num_stereocenters for m in input_molecules],
    "molecular_weight": [m.molecular_weight for m in input_molecules],
}
Featurizer API:
from simmate.toolkit.featurizers import PropertyGrabber

my_final_props = PropertyGrabber.featurize_many(
    molecules=input_molecules,
    properties=["num_rings", "num_stereocenters", "molecular_weight"],
    parallel=True,  # this is the key reason you'd want to use a Featurizer class!
)
Bug
parallel=True has not yet been implemented.
Tip
Both APIs yield the same result. The main difference is that the Featurizer API can use Dask for parallelization when parallel=True. If feature generation takes more than 15 minutes for all molecules, we recommend using the Featurizer API. This is typically the case when working with datasets of over 1 million molecules.
Warning
Our API is still in the early stages of development and may undergo changes to improve usability.
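To make the claim that both APIs yield the same result concrete, here is a minimal sketch that does not use Simmate at all. StandInMolecule and grab_properties are hypothetical names invented for this example, and the property values are made up; the sketch only shows that looking up attributes by name produces the same dictionary of lists as the hand-written comprehensions.

```python
from dataclasses import dataclass


@dataclass
class StandInMolecule:
    # hypothetical stand-in for a toolkit Molecule; values below are made up
    num_rings: int
    num_stereocenters: int
    molecular_weight: float


def grab_properties(molecules, properties):
    # build one list per property name, in the same order as the input
    return {name: [getattr(m, name) for m in molecules] for name in properties}


input_molecules = [
    StandInMolecule(num_rings=1, num_stereocenters=0, molecular_weight=78.11),
    StandInMolecule(num_rings=2, num_stereocenters=1, molecular_weight=180.16),
]

# "Basic" style: one comprehension per property
basic_props = {
    "num_rings": [m.num_rings for m in input_molecules],
    "num_stereocenters": [m.num_stereocenters for m in input_molecules],
    "molecular_weight": [m.molecular_weight for m in input_molecules],
}

# Featurizer style: property names passed as a list
featurizer_style_props = grab_properties(
    input_molecules,
    ["num_rings", "num_stereocenters", "molecular_weight"],
)

print(basic_props == featurizer_style_props)
```

The name-based form keeps the list of properties as data, which is what makes it straightforward to hand the same work to a parallel backend later.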
Usage Guide¶
All classes that inherit from the Featurizer class can be used in the same way. This guide uses MorganFingerprint as an example, but you can substitute it with any supported featurizer.
Available Featurizers¶
The toolkit.featurizers module contains all available featurizers, including:
- PropertyGrabber
- MorganFingerprint
Serial Use¶
You can featurize molecules one at a time using either the featurize method or featurize_many(parallel=False):
from simmate.toolkit.featurizers import MorganFingerprint

# OPTION 1
for molecule in input_molecules:
    fingerprint = MorganFingerprint.featurize(
        molecule=molecule
    )

# OPTION 2
fingerprints = MorganFingerprint.featurize_many(
    molecules=input_molecules,
    parallel=False,
)
Parallel Use¶
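Enable parallelization by calling featurize_many with parallel=True. Note that this option is flagged above as not yet implemented, so the call below shows the intended usage: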
from simmate.toolkit.featurizers import MorganFingerprint

fingerprints = MorganFingerprint.featurize_many(
    molecules=input_molecules,
    parallel=True,
)
Adding a New Featurizer¶
To add a new featurizer, you need to:
- Inherit from the Featurizer base class
- Define a featurize method (can be a @classmethod or @staticmethod) that accepts a molecule as a kwarg
The Featurizer will then handle how featurize_many and other features behave.
For example:
from simmate.toolkit import Molecule
from simmate.toolkit.featurizers.base import Featurizer


class Example(Featurizer):
    """
    An example featurizer
    """

    @staticmethod
    def featurize(
        molecule: Molecule,
        # feel free to add any extra kwargs you'd like
        example_setting: float = 0.123,
    ):
        # use the molecule to generate your feature(s); the calculation
        # below is only a placeholder for illustration
        calculated_property = molecule.num_rings * example_setting
        return calculated_property
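To show how a base class can supply featurize_many once a subclass defines featurize, here is a self-contained sketch. SketchFeaturizer and CharacterCount are hypothetical names, and this is not Simmate's actual Featurizer implementation. The parallel branch uses a standard-library thread pool purely as a stand-in for Dask, and the "molecules" are plain SMILES-like strings so the example runs without the toolkit.

```python
from concurrent.futures import ThreadPoolExecutor


class SketchFeaturizer:
    """Hypothetical base class; NOT Simmate's actual Featurizer."""

    @staticmethod
    def featurize(molecule, **kwargs):
        raise NotImplementedError("subclasses must define featurize")

    @classmethod
    def featurize_many(cls, molecules, parallel=False, **kwargs):
        if not parallel:
            # serial path: call featurize once per molecule, in order
            return [cls.featurize(molecule=m, **kwargs) for m in molecules]
        # parallel path: a thread pool stands in for a Dask cluster here
        with ThreadPoolExecutor() as pool:
            futures = [
                pool.submit(cls.featurize, molecule=m, **kwargs)
                for m in molecules
            ]
            # collect results in submission order so output matches input
            return [future.result() for future in futures]


class CharacterCount(SketchFeaturizer):
    """Toy featurizer: counts one character in a SMILES-like string."""

    @staticmethod
    def featurize(molecule, character="C"):
        # 'molecule' is a plain string in this sketch (an assumption)
        return molecule.count(character)


smiles_like = ["CCO", "c1ccccc1", "CC(=O)O"]
serial = CharacterCount.featurize_many(molecules=smiles_like)
threaded = CharacterCount.featurize_many(molecules=smiles_like, parallel=True)
print(serial, threaded)
```

The subclass only writes the single-molecule logic; batching, ordering, and the choice of serial or parallel execution stay in one place on the base class.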