Basic Database Access¶
Outline of all steps¶
Accessing and analyzing data typically involves the following steps:
- Connect to your database
- Load a specific database table
- Filter data
- Convert data to a desired format
- Modify data via
simmate.toolkit
or pandas.Dataframe
The sections below will guide you on performing each step. But to place everything up-front, your final script may look like this:
# Connect to your database
from simmate.database import connect
# Load a specific database table
from simmate.database.third_parties import MatprojStructure
# Filter data
results = MatprojStructure.objects.filter(
nsites=3,
is_gap_direct=False,
spacegroup=166,
).all()
# Convert data to a desired format
structures = results.to_toolkit()
dataframe = results.to_dataframe()
# Modify data
for structure in structures:
# run your anaylsis/modifications here!
Connect to your database¶
For interactive use, Django settings must be configured before any of these submodules can be imported. This can be done with...
# connect to the database
from simmate.database import connect
# and now you can import tables in this module
from simmate.database.workflow_results import MITStaticEnergy
If the connect
step is not done, you will recieve the following error:
ImproperlyConfigured: Requested setting INSTALLED_APPS, but settings are not
configured. You must either define the environment variable DJANGO_SETTINGS_MODULE
or call settings.configure() before accessing settings.
Load your database table¶
The name of your table will depend on the source you're trying to access. To see the available sources (Materials Project, OQMD, Jarvis, COD), you can explore the contents the database/third_parties module.
Using Materials Project as an example, we can load the table using...
from simmate.database.third_parties import MatprojStructure
Alternatively, if you intend to access data from a specific workflow, there are two methods to access the table. in addition to loading from the workflow_results
module, most workflows have a database_table
attribute that let you access the table as well:
########## METHOD 1 ########
from simmate.workflows.static_energy import mit_workflow
table = mit_workflow.database_table
######## METHOD 2 ########
from simmate.database import connect
from simmate.database.workflow_results import MITStaticEnergy
# The line below shows that these tables are the same! Therfore, use
# whichever method you prefer.
assert table == MITStaticEnergy
Query and filter data¶
To query a table, Simmate inherits methods from Django, which is a web framework for quering massive datasets. It is powerful and efficient, and is therefore used to deliver data to many familiar websites, such as Instagram and Spotify. The key feature of Django that we use is its Object-Relational Mapper (ORM). The ORM allows us to use a simple language for making complex queries to our database. Below, we show some common queries. A full description of all query methods is discussed on Django's query page.
All rows of the database table are available via the objects
attribute:
MITStaticEnergy.objects.all()
All columns of the database table can be printed via the show_columns
methods:
MITStaticEnergy.show_columns()
To filter rows with exact-value matches in a column:
MITStaticEnergy.objects.filter(
nsites=3,
is_gap_direct=False,
spacegroup=166,
).all()
To filter rows based on conditions, chain the column name with two underscores. Conditions supported are listed here, but the most commonly used ones are:
contains
= contains text, case-sensitive queryicontains
= contains text, case-insensitive querygt
= greater thangte
= greater than or equal tolt
= less thanlte
= less than or equal torange
= provides upper and lower bound of valuesisnull
= returnsTrue
if the entry does not exist
An example query with conditional filters:
MITStaticEnergy.objects.filter(
nsites__gte=3, # greater or equal to 3 sites
energy__isnull=False, # the structure DOES have an energy
density__range=(1,5), # density is between 1 and 5
elements__icontains='"C"', # the structure includes the element Carbon
spacegroup__number=167, # the spacegroup number is 167
).all()
Note, for the filtering condition elements__icontains
, we used some odd quotations when querying for carbon: '"C"'
. This is not a typo! The quotes ensure we don't accidentally grab Ca, Cs, Ce, Cl, and so on. This is an issue when you are using SQLite (the default datbase backend). If you are using Postgres, this line can change to the cleaner version elements__contains="C"
.
Convert data to desired format¶
By default, Django returns your query results as a queryset
(or SearchResults
in simmate). This is a list of database objects. It is more useful to convert them to a pandas dataframe or to toolkit objects.
# Gives a pandas dataframe.
df = MITStaticEnergy.objects.filter(...).to_dataframe()
# Gives a list of toolkit Structure objects
df = MITStaticEnergy.objects.filter(...).to_toolkit()
# '...' are the set of filters selected from above.
Modify data¶
To modify and analyze data, see the pandas and simmate.toolkit
documentation for more info.