Molecule Featurizers


Overview

Featurizing molecules means generating properties and information about each molecule, such as the number of rings, the molecular weight, and the number of stereocenters. These features are used to explore and analyze data, identify trends, and train machine learning or artificial intelligence models. The Featurizer base class provides a common interface for generating them, either one molecule at a time or across a whole dataset.

Fingerprints

A fingerprint is a special type of featurizer that generates a 1D array of numerical values for each molecule. Each value encodes a specific feature or property of the molecule. Because fingerprints are entirely numerical, they can be used directly in mathematical applications without additional processing.
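
To make the idea of a fixed-length numerical array concrete, here is a minimal toy sketch. It is not the Morgan algorithm and does not use Simmate: the function name toy_fingerprint, the SMILES-substring fragments, and the 16-bit length are all assumptions chosen only for illustration.

```python
from __future__ import annotations

import zlib


def toy_fingerprint(smiles: str, n_bits: int = 16, fragment_size: int = 2) -> list[int]:
    """Hash every substring of `fragment_size` characters into a fixed-length bit vector."""
    bits = [0] * n_bits
    for start in range(len(smiles) - fragment_size + 1):
        fragment = smiles[start : start + fragment_size]
        # crc32 is deterministic across runs, unlike Python's built-in hash() for str
        index = zlib.crc32(fragment.encode("utf-8")) % n_bits
        bits[index] = 1
    return bits


# Molecules of different sizes still map to arrays of the same length
ethanol_bits = toy_fingerprint("CCO")
benzene_bits = toy_fingerprint("c1ccccc1")
```

Every molecule maps to an array of the same length regardless of its size, which is what lets fingerprints be stacked into a matrix and passed directly to numerical code.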


Basic API vs. Featurizer API

Both the Featurizer API and the single-molecule API can calculate a list of properties. Which one to use depends on your preferred coding style and the scale of your project. The Featurizer API is designed to support parallelization, which can help with large-scale projects.

Single Molecule ("Basic") API:

my_final_props = {
    "num_rings": [m.num_rings for m in input_molecules],
    "num_stereocenters": [m.num_stereocenters for m in input_molecules],
    "molecular_weight": [m.molecular_weight for m in input_molecules],
}

Featurizer API:

from simmate.toolkit.featurizers import PropertyGrabber

my_final_props = PropertyGrabber.featurize_many(
    molecules=input_molecules,
    properties=["num_rings", "num_stereocenters", "molecular_weight"],
    parallel=True,  # this is the key reason you'd want to use a Featurizer class!
)

Bug

parallel=True has not yet been implemented

Tip

Both APIs yield the same result. The main difference is that the Featurizer API can use Dask for parallelization when parallel=True. If generating features for all molecules takes more than 15 minutes, we recommend using the Featurizer API. This is typically the case for datasets of more than 1 million molecules.
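
One rough way to check whether a dataset crosses the 15-minute mark is to time a small sample and extrapolate linearly. The sketch below is only an illustration: estimate_total_seconds and count_atoms_placeholder are hypothetical helpers, not part of Simmate, and the placeholder simply stands in for whatever featurizer call you plan to run.

```python
import time


def estimate_total_seconds(featurize, sample, total_count: int) -> float:
    """Time `featurize` on a small sample and extrapolate linearly to `total_count` items."""
    if not sample:
        raise ValueError("sample must contain at least one item")
    start = time.perf_counter()
    for item in sample:
        featurize(item)
    elapsed = time.perf_counter() - start
    return elapsed / len(sample) * total_count


def count_atoms_placeholder(smiles: str) -> int:
    # Hypothetical stand-in for a real featurizer call: counts alphabetic characters
    return sum(character.isalpha() for character in smiles)


sample_smiles = ["CCO", "CCN", "c1ccccc1"]
estimated = estimate_total_seconds(
    count_atoms_placeholder, sample_smiles, total_count=1_000_000
)
# Apply the 15-minute rule of thumb from the tip above
use_featurizer_api = estimated > 15 * 60
```

A linear extrapolation ignores startup costs and variation in molecule size, so treat the estimate as a guide and use a sample that is representative of the full dataset.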

Warning

Our API is still in the early stages of development and may undergo changes to improve usability.


Usage Guide

All classes that inherit from the Featurizer class are used in the same way. This guide uses MorganFingerprint as an example, but you can substitute any supported featurizer.

Available Featurizers

The toolkit.featurizers module contains all available featurizers, including:

  • PropertyGrabber
  • MorganFingerprint

Serial Use

You can featurize molecules serially using either the featurize method or featurize_many(parallel=False):

from simmate.toolkit.featurizers import MorganFingerprint

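# OPTION 1: featurize one molecule at a time and collect the results
fingerprints = []
for molecule in input_molecules:
    fingerprint = MorganFingerprint.featurize(molecule=molecule)
    fingerprints.append(fingerprint)

# OPTION 2: featurize the whole list in a single serial call
fingerprints = MorganFingerprint.featurize_many(
    molecules=input_molecules,
    parallel=False,
)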

Parallel Use

Enable parallelization by calling featurize_many with parallel=True:

from simmate.toolkit.featurizers import MorganFingerprint

fingerprints = MorganFingerprint.featurize_many(
    molecules=input_molecules,
    parallel=True,
)

Adding a New Featurizer

To add a new featurizer, you need to:

  1. Inherit from the Featurizer base class
  2. Define a featurize method (either a @classmethod or a @staticmethod) that accepts molecule as a keyword argument

The Featurizer base class then handles featurize_many and other shared behavior for you.

For example:

from simmate.toolkit import Molecule
from simmate.toolkit.featurizers.base import Featurizer


class Example(Featurizer):
    """
    An example featurizer
    """

    @staticmethod
    def featurize(
        molecule: Molecule,
        # feel free to add any extra kwargs you'd like
        example_setting: float = 0.123,
    ):
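        # use the molecule to generate your feature(s); as a placeholder,
        # this scales the ring count by the example setting
        calculated_property = molecule.num_rings * example_setting
        return calculated_property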
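
To illustrate how a base class can supply featurize_many once a subclass defines featurize, here is a minimal self-contained sketch. It is not Simmate's implementation: SketchFeaturizer and CarbonCounter are hypothetical names, molecules are plain SMILES strings, and a thread pool stands in for Dask.

```python
from concurrent.futures import ThreadPoolExecutor


class SketchFeaturizer:
    """Hypothetical stand-in for a featurizer base class (not Simmate's implementation)."""

    @staticmethod
    def featurize(molecule, **kwargs):
        raise NotImplementedError("subclasses must define featurize")

    @classmethod
    def featurize_many(cls, molecules, parallel: bool = False, **kwargs):
        if not parallel:
            return [cls.featurize(molecule=m, **kwargs) for m in molecules]
        # A thread pool stands in for Dask here; map() preserves input order
        with ThreadPoolExecutor() as pool:
            return list(
                pool.map(lambda m: cls.featurize(molecule=m, **kwargs), molecules)
            )


class CarbonCounter(SketchFeaturizer):
    """Toy featurizer that counts carbon symbols in a SMILES string."""

    @staticmethod
    def featurize(molecule: str, include_aromatic: bool = True) -> int:
        # Naive character count; it would miscount elements such as Cl
        count = molecule.count("C")
        if include_aromatic:
            count += molecule.count("c")
        return count


smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
serial = CarbonCounter.featurize_many(smiles_list)
threaded = CarbonCounter.featurize_many(smiles_list, parallel=True)
```

The subclass only defines the per-molecule logic. Extra keyword arguments such as include_aromatic are forwarded unchanged, and the serial and threaded calls return results in the same order as the input list.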