chembl_downloader
Don't worry about downloading/extracting ChEMBL or versioning - just use chembl_downloader
to write code that knows how to download it and use it automatically.
Installation
$ pip install chembl-downloader
Usage
Download A Specific Version
import chembl_downloader
path = chembl_downloader.download(version='28')
After it's been downloaded and extracted once, it does not need to be downloaded again. It gets stored automatically using pystow in the ~/.data/chembl directory.
We'd like to implement something such that it could load directly into SQLite from the archive, but it appears this is a paid feature.
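The download-once behavior described above can be pictured with a small sketch. This is not the package's actual implementation: ensure_file and fake_fetch are hypothetical names, and the fetch step is a stand-in that writes a placeholder file rather than downloading anything.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def ensure_file(path: Path, fetch: Callable[[Path], None]) -> Path:
    """Call ``fetch`` only when ``path`` is missing, then return ``path``."""
    if not path.is_file():
        path.parent.mkdir(parents=True, exist_ok=True)
        fetch(path)
    return path


# Hypothetical stand-in for the real download-and-extract step.
calls: List[Path] = []


def fake_fetch(target: Path) -> None:
    calls.append(target)
    target.write_text("placeholder for the extracted database")


with tempfile.TemporaryDirectory() as directory:
    target = Path(directory) / "chembl" / "28" / "chembl_28.db"
    first = ensure_file(target, fake_fetch)   # file is missing, so fetch runs
    second = ensure_file(target, fake_fetch)  # file exists, so fetch is skipped
    fetch_count = len(calls)
```

Keying the cached file on the version in its path is what lets several releases sit side by side without re-downloading any of them.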
Download the Latest Version
First, install bioversions, whose job is to look up the latest version of many databases, with pip install bioversions. Then, you can modify the previous code slightly by omitting the version keyword argument:
import chembl_downloader
path = chembl_downloader.download()
The version keyword argument is available for all functions in this package (e.g., connect(), cursor(), and query()), but will be omitted below for brevity.
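The fallback from an explicit version to the latest one can be sketched as follows. This is only an illustration of the idea: resolve_version and fake_lookup_latest are hypothetical names, and the lookup is a hard-coded stand-in for the network query that bioversions would perform.

```python
from typing import Callable, Optional


def resolve_version(version: Optional[str], lookup_latest: Callable[[], str]) -> str:
    """Return the explicit version if given, otherwise ask for the latest one."""
    if version is not None:
        return str(version)
    return lookup_latest()


def fake_lookup_latest() -> str:
    # Hypothetical stand-in; a real lookup would consult bioversions.
    return "29"


pinned = resolve_version("28", fake_lookup_latest)   # explicit version wins
latest = resolve_version(None, fake_lookup_latest)   # falls back to the lookup
```

Pinning the version explicitly keeps an analysis reproducible; omitting it trades that for always using the newest release.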
Automate Connection
Inside the archive is a single SQLite database file. Normally, people manually untar this archive and then do something with the resulting file. Don't do this; it's not reproducible! Instead, the file can be downloaded and a connection can be opened automatically with:
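from contextlib import closing
import chembl_downloader
with chembl_downloader.connect() as conn:
    # sqlite3 cursors are not context managers, so wrap the cursor in closing()
    with closing(conn.cursor()) as cursor:
        cursor.execute(...)  # run your query string
        rows = cursor.fetchall()  # get your results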
The cursor() function provides a convenient wrapper around this operation:
import chembl_downloader
with chembl_downloader.cursor() as cursor:
    cursor.execute(...)  # run your query string
    rows = cursor.fetchall()  # get your results
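To try the connect-then-cursor pattern without downloading anything, here is a minimal sketch against an in-memory SQLite database. The table contents are made up for illustration, and the schema is reduced to two columns. Since sqlite3 cursors do not act as context managers on their own, contextlib.closing is used to close them.

```python
import sqlite3
from contextlib import closing

# In-memory database with a toy table; the rows below are invented for illustration.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE molecule_dictionary (chembl_id TEXT, pref_name TEXT)")
    conn.executemany(
        "INSERT INTO molecule_dictionary VALUES (?, ?)",
        [("CHEMBL_A", "EXAMPLE A"), ("CHEMBL_B", None)],
    )
    with closing(conn.cursor()) as cursor:
        cursor.execute(
            "SELECT chembl_id, pref_name FROM molecule_dictionary "
            "WHERE pref_name IS NOT NULL"
        )
        rows = cursor.fetchall()  # list of tuples, one per matching row
```

The same execute/fetchall calls apply to a cursor obtained from the real database.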
Run a query and get a pandas DataFrame
The most powerful function is query(), which builds on the previous connect() function in combination with pandas.read_sql to make a query and load the results into a pandas DataFrame for any downstream use.
import chembl_downloader
sql = """
SELECT
MOLECULE_DICTIONARY.chembl_id,
MOLECULE_DICTIONARY.pref_name
FROM MOLECULE_DICTIONARY
JOIN COMPOUND_STRUCTURES ON MOLECULE_DICTIONARY.molregno == COMPOUND_STRUCTURES.molregno
WHERE molecule_dictionary.pref_name IS NOT NULL
LIMIT 5
"""
df = chembl_downloader.query(sql)
df.to_csv(..., sep='\t', index=False)
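The combination of a SQLite connection and pandas.read_sql can be tried offline with a toy database. This is a sketch under stated assumptions, not the package's own code: the two tables mimic only the columns used in the query above, and every row is invented.

```python
import sqlite3
from contextlib import closing

import pandas as pd

sql = """
SELECT md.chembl_id, md.pref_name
FROM molecule_dictionary AS md
JOIN compound_structures AS cs ON md.molregno = cs.molregno
WHERE md.pref_name IS NOT NULL
"""

with closing(sqlite3.connect(":memory:")) as conn:
    # Toy schema and rows, made up for illustration only.
    conn.executescript(
        """
        CREATE TABLE molecule_dictionary (molregno INTEGER, chembl_id TEXT, pref_name TEXT);
        CREATE TABLE compound_structures (molregno INTEGER, canonical_smiles TEXT);
        INSERT INTO molecule_dictionary VALUES (1, 'CHEMBL_A', 'EXAMPLE A'), (2, 'CHEMBL_B', NULL);
        INSERT INTO compound_structures VALUES (1, 'CCO'), (2, 'CC');
        """
    )
    df = pd.read_sql(sql, conn)  # run the query and load the result as a DataFrame

# With no path given, to_csv returns the tab-separated text instead of writing a file.
tsv_text = df.to_csv(sep="\t", index=False)
```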
Suggestion 1: use pystow to make a reproducible file path that's portable to other people's machines (e.g., it doesn't have your username in the path).
Suggestion 2: RDKit is now pip-installable with pip install rdkit-pypi, which means most users don't have to muck around with complicated conda environments and configurations. One of the powerful but understated tools in RDKit is the rdkit.Chem.PandasTools module.
Store in a Different Place
If you want to store the data elsewhere using pystow (e.g., I also keep a copy of this file in pyobo), you can use the prefix argument.
import chembl_downloader
# It gets downloaded/extracted to
# ~/.data/pyobo/raw/chembl/29/chembl_29/chembl_29_sqlite/chembl_29.db
path = chembl_downloader.download(prefix=['pyobo', 'raw', 'chembl'])
See the pystow documentation on configuring the storage location further.
The prefix keyword argument is available for all functions in this package (e.g., connect(), cursor(), and query()).
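To make the relationship between the prefix and the final location concrete, the sketch below rebuilds the path from the comment in the example above. expected_db_path is a hypothetical helper, not part of the package, and it assumes the default ~/.data base directory and the directory layout shown for version 29; other versions or a customized pystow configuration may differ.

```python
from pathlib import Path
from typing import Sequence


def expected_db_path(prefix: Sequence[str], version: str) -> Path:
    """Rebuild the layout from the example comment for a given prefix and version."""
    return Path.home().joinpath(
        ".data",
        *prefix,                      # e.g. pyobo/raw/chembl
        version,                      # one directory per release
        f"chembl_{version}",
        f"chembl_{version}_sqlite",
        f"chembl_{version}.db",
    )


path = expected_db_path(["pyobo", "raw", "chembl"], "29")
```

Because the path is built from the home directory rather than a hard-coded username, the same code resolves to the right place on another machine.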
Download via CLI
After installing, run the following CLI command to ensure the data is downloaded and to print its path to stdout:
$ chembl_downloader
Use --test to show two example queries:
$ chembl_downloader --test
Contributing
If you'd like to contribute, there's a submodule called chembl_downloader.queries where you can add an SQL query along with a description of what it does for easy importing.