Write reproducible code for getting and processing ChEMBL

Charles Tapley Hoyt

Last update: Dec 25, 2022

Overview

chembl_downloader

Don't worry about downloading, extracting, or versioning ChEMBL. Just use chembl_downloader to write code that knows how to download the database and use it automatically.

Installation

$ pip install chembl-downloader

Usage

Download A Specific Version

import chembl_downloader

# download() was renamed to download_extract_sqlite() in v0.1.0
path = chembl_downloader.download_extract_sqlite(version='28')

After it's been downloaded and extracted once, later calls reuse the local copy instead of downloading again. It gets stored using pystow automatically in the ~/.data/chembl directory.
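
The "download once, reuse afterwards" behavior can be pictured with a minimal sketch. This is not the package's implementation: ensure_file and fake_fetch are hypothetical names, and the fetch step is faked so the example runs without network access.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def ensure_file(path: Path, fetch: Callable[[Path], None]) -> Path:
    """Return ``path``, calling ``fetch`` to create it only if it is missing."""
    if not path.is_file():
        path.parent.mkdir(parents=True, exist_ok=True)
        fetch(path)  # in real code: download and extract the archive
    return path


calls: List[Path] = []


def fake_fetch(target: Path) -> None:
    """Stand-in for a real download; records the call and writes a dummy file."""
    calls.append(target)
    target.write_text("pretend this is chembl_28.db")


with tempfile.TemporaryDirectory() as directory:
    target = Path(directory) / "chembl" / "28" / "chembl_28.db"
    ensure_file(target, fake_fetch)
    ensure_file(target, fake_fetch)  # the file exists now, so fetch is skipped
    number_of_fetches = len(calls)
```

Keying the cached path on the version is what keeps two versions from overwriting each other.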

We'd like to implement something such that it could load directly into SQLite from the archive, but it appears this is a paid feature.

Download the Latest Version

Releases before v0.4.0 rely on bioversions (install it with pip install bioversions) to look up the latest version of ChEMBL; from v0.4.0 onward, that lookup is implemented in chembl_downloader itself. Either way, you can modify the previous code slightly by omitting the version keyword argument:

import chembl_downloader

# download() was renamed to download_extract_sqlite() in v0.1.0
path = chembl_downloader.download_extract_sqlite()

The version keyword argument is available for all functions in this package (including connect(), cursor(), and query()), but will be omitted below for brevity.

Automate Connection

Inside the archive is a single SQLite database file. Normally, people manually untar the archive and then do something with the resulting file. Don't do this, because it's not reproducible! Instead, the file can be downloaded and a connection can be opened automatically with:

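from contextlib import closing

import chembl_downloader

with chembl_downloader.connect() as conn:
    # sqlite3 cursors are not context managers, so wrap them with closing()
    with closing(conn.cursor()) as cursor:
        cursor.execute(...)  # run your query string
        rows = cursor.fetchall()  # get your results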
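
To make the "untar, then connect" step concrete, here is a sketch that uses only the standard library on a toy archive. It is an illustration under assumptions, not the package's code: extract_sqlite is a hypothetical helper, and the archive layout and table contents are made up.

```python
import sqlite3
import tarfile
import tempfile
from contextlib import closing
from pathlib import Path


def extract_sqlite(archive: Path, directory: Path) -> Path:
    """Extract the first ``.db`` member of a tar archive and return its path."""
    with tarfile.open(archive) as tar:
        member = next(m for m in tar.getmembers() if m.name.endswith(".db"))
        tar.extract(member, path=directory)
    return directory / member.name


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)

    # Build a tiny database and pack it into a .tar.gz, mimicking the real dump
    db_path = tmp_path / "toy.db"
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute("CREATE TABLE molecule_dictionary (chembl_id TEXT)")
        conn.execute("INSERT INTO molecule_dictionary VALUES ('TOY1')")
        conn.commit()
    archive = tmp_path / "toy_sqlite.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(db_path, arcname="toy/toy_sqlite/toy.db")

    # The automated part: locate the database inside the archive and open it
    out = tmp_path / "extracted"
    out.mkdir()
    extracted = extract_sqlite(archive, out)
    with closing(sqlite3.connect(extracted)) as conn:
        rows = conn.execute("SELECT chembl_id FROM molecule_dictionary").fetchall()
```

Searching for the .db member rather than hard-coding its path is one way to cope with folder structures that differ between archives.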

The cursor() function provides a convenient wrapper around this operation:

import chembl_downloader

with chembl_downloader.cursor() as cursor:
    cursor.execute(...)  # run your query string
    rows = cursor.fetchall()  # get your results
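
For readers curious how such a wrapper can be built, here is a minimal sketch with contextlib. The name open_cursor is hypothetical and the example runs against an in-memory database, so it only shows the pattern, not the package's actual code.

```python
import sqlite3
from contextlib import closing, contextmanager
from typing import Iterator


@contextmanager
def open_cursor(path: str) -> Iterator[sqlite3.Cursor]:
    """Yield a cursor, then close both the cursor and its connection."""
    with closing(sqlite3.connect(path)) as conn:
        with closing(conn.cursor()) as cursor:
            yield cursor


with open_cursor(":memory:") as cursor:
    cursor.execute("SELECT 1 + 1")
    rows = cursor.fetchall()
```

Bundling both cleanup steps into one context manager means callers cannot forget to close the connection.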

Run a query and get a pandas DataFrame

The most powerful function is query(), which builds on the previous connect() function and uses pandas.read_sql to run a query and load the results into a pandas DataFrame for any downstream use.

import chembl_downloader

sql = """
SELECT
    MOLECULE_DICTIONARY.chembl_id,
    MOLECULE_DICTIONARY.pref_name
FROM MOLECULE_DICTIONARY
JOIN COMPOUND_STRUCTURES ON MOLECULE_DICTIONARY.molregno = COMPOUND_STRUCTURES.molregno
WHERE MOLECULE_DICTIONARY.pref_name IS NOT NULL
LIMIT 5
"""

df = chembl_downloader.query(sql)
df.to_csv(..., sep='\t', index=False)
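
Since the real database is large, it can help to dry-run the shape of a query against a toy in-memory database first. The sketch below assumes a drastically simplified version of the two tables (only the columns the query touches) and made-up rows; it checks the join and filter logic, nothing more.

```python
import sqlite3
from contextlib import closing

sql = """
SELECT
    MOLECULE_DICTIONARY.chembl_id,
    MOLECULE_DICTIONARY.pref_name
FROM MOLECULE_DICTIONARY
JOIN COMPOUND_STRUCTURES ON MOLECULE_DICTIONARY.molregno = COMPOUND_STRUCTURES.molregno
WHERE MOLECULE_DICTIONARY.pref_name IS NOT NULL
LIMIT 5
"""

# Toy schema and rows: TOY1 has a name and a structure, TOY2 has no name,
# and TOY3 has a name but no structure
setup = """
CREATE TABLE molecule_dictionary (molregno INTEGER, chembl_id TEXT, pref_name TEXT);
CREATE TABLE compound_structures (molregno INTEGER, canonical_smiles TEXT);
INSERT INTO molecule_dictionary VALUES (1, 'TOY1', 'FIRST'), (2, 'TOY2', NULL), (3, 'TOY3', 'THIRD');
INSERT INTO compound_structures VALUES (1, 'CCO'), (2, 'CCN');
"""

with closing(sqlite3.connect(":memory:")) as conn:
    conn.executescript(setup)
    rows = conn.execute(sql).fetchall()
```

Only TOY1 survives: the inner join drops molecules without a structure, and the WHERE clause drops molecules without a preferred name.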

Suggestion 1: use pystow to make a reproducible file path that's portable to other people's machines (e.g., it doesn't have your username in the path).

Suggestion 2: RDKit is now pip-installable with pip install rdkit-pypi (newer releases are published on PyPI as rdkit), which means most users don't have to muck around with complicated conda environments and configurations. One of the powerful but understated tools in RDKit is the rdkit.Chem.PandasTools module.

Store in a Different Place

If you want to store the data elsewhere using pystow (e.g., in pyobo I also keep a copy of this file), you can use the prefix argument.

import chembl_downloader

# It gets downloaded/extracted to 
# ~/.data/pyobo/raw/chembl/29/chembl_29/chembl_29_sqlite/chembl_29.db
# download() was renamed to download_extract_sqlite() in v0.1.0
path = chembl_downloader.download_extract_sqlite(prefix=['pyobo', 'raw', 'chembl'])

See the pystow documentation on configuring the storage location further.

The prefix keyword argument is available for all functions in this package (including connect(), cursor(), and query()).
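
As a rough mental model of how a prefix turns into a folder, consider the sketch below. It is an approximation built on pathlib, not pystow's implementation: storage_directory is a hypothetical helper, and the default base simply mirrors the ~/.data location mentioned above.

```python
from pathlib import Path
from typing import Optional, Sequence


def storage_directory(
    prefix: Sequence[str], version: str, base: Optional[Path] = None
) -> Path:
    """Join a base folder, the prefix parts, and the version into one path."""
    if base is None:
        base = Path.home() / ".data"  # assumed default location
    return base.joinpath(*prefix, version)


directory = storage_directory(["pyobo", "raw", "chembl"], "29")
```

Because the path is derived from the home directory at run time, the same code resolves correctly on someone else's machine.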

Download via CLI

After installing, run the following CLI command to ensure the database is downloaded and to print its path to stdout:

$ chembl_downloader

Use --test to show two example queries

$ chembl_downloader --test

Contributing

If you'd like to contribute, there's a submodule called chembl_downloader.queries where you can add an SQL query along with a description of what it does for easy importing.
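
A contributed query could look something like the sketch below: a small function whose docstring says what the SQL does. The function name, the query, and the toy table are all hypothetical; check the existing entries in chembl_downloader.queries for the conventions actually used there.

```python
import sqlite3
from contextlib import closing


def get_molecules_by_name_sql() -> str:
    """Return SQL for looking up ChEMBL identifiers by preferred name.

    The statement has one ``?`` placeholder for the name.
    """
    return (
        "SELECT chembl_id FROM molecule_dictionary "
        "WHERE pref_name = ? ORDER BY chembl_id"
    )


# Try the query on a toy table before pointing it at the real database
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE molecule_dictionary (chembl_id TEXT, pref_name TEXT)")
    conn.executemany(
        "INSERT INTO molecule_dictionary VALUES (?, ?)",
        [("TOY1", "EXAMPLE"), ("TOY2", "OTHER")],
    )
    matches = conn.execute(get_molecules_by_name_sql(), ("EXAMPLE",)).fetchall()
```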

Comments

Repo status

Dear @cthoyt,

I know that you have multiple responsibilities, but I was wondering whether the current repo is in working condition or whether it is a legacy repo that worked with a specific version of ChEMBL. It would be great if you could add a badge to the repo for the same.

Thank You.

opened by YojanaGadiya 4

Add SQL for getting activities by target

This PR adds some functionality for generating target-based datasets, motivated by https://github.com/PatWalters/yamc/issues/14.

See the notebook here (note that this is pinned with a permalink to the state after merging this PR).

opened by cthoyt 1

Improve ChEBI mapping notebook

This filters out about 10% of the possible ChEMBL - ChEBI curations since ChEBI externally already took care of that

-> move this into biomappings repo

opened by cthoyt 0

Call for additional functionality

What other operations do people commonly want to do with the entire ChEMBL database/SDF file that would be good to wrap (including loading other files released by ChEMBL)?

What other operations like the RDKit supplier exist in other libraries that might be worth wrapping?

@iwatobipen do you have any suggestions?

opened by cthoyt 0

Add functionality for bacting

@egonw are there any bulk SMILES, InChI, or SDF loading operations in bacting that are exposed by pybacting that would be nice to wrap inside this library for full loading of ChEMBL? On the readme, you can see I made a specific function for RDKit's "supplier" that reads an SDF file

opened by cthoyt 3

Releases(v0.4.1)

v0.4.1(Nov 19, 2022)
What's Changed

Add SQL for getting activities by target by @cthoyt in https://github.com/cthoyt/chembl-downloader/pull/8

Improve ChEBI mapping notebook by @cthoyt in https://github.com/cthoyt/chembl-downloader/pull/10

Add UniProt target mapping functions by @cthoyt in https://github.com/cthoyt/chembl-downloader/pull/11

Full Changelog: https://github.com/cthoyt/chembl-downloader/compare/v0.4.0...v0.4.1
v0.4.0(Oct 28, 2022)
This PR does several things:

Removes dependency on bioversions and just implements the code locally

Adds a CLI for generating a statistics table for all versions of ChEMBL

Adds a proper project skeleton (documentation, unit tests, code quality assurance, CI)

Improves SQLite loading in case you delete the compressed data

Notebooks

Adds notebook about drug indications

Adds notebook about mapping to ChEBI

v0.3.0(Mar 19, 2022)
This release adds two new functions:

chembl_downloader.download_monomer_library which gets this file https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/chembl_30_monomer_library.xml for whatever version you specify

chembl_downloader.get_monomer_library_root which does the same as the downloader but also parses the XML for you
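
Working with a parsed XML root typically looks like the sketch below, which uses the standard library on a toy document. The tag names here are invented for illustration and are not claimed to match the real monomer library file.

```python
import xml.etree.ElementTree as ET

# Toy document; the real file's structure and tag names may differ
toy_xml = """
<MonomerDB>
  <Monomer><MonomerID>A</MonomerID></Monomer>
  <Monomer><MonomerID>B</MonomerID></Monomer>
</MonomerDB>
"""

root = ET.fromstring(toy_xml)

# Walk the tree from the root and collect the text of every MonomerID element
monomer_ids = [element.text for element in root.iter("MonomerID")]
```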

Thanks to @iwatobipen and his recent blog post for inspiring this.
v0.2.0(Jan 14, 2022)
New Functions

chembl_downloader.download_fps downloads the pre-computed Morgan fingerprint file

chembl_downloader.download_chemreps downloads the chembl-smiles-inchi-inchikey map

chembl_downloader.get_chemreps_df builds on chembl_downloader.download_chemreps and loads them in a pandas dataframe
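
To give a feel for the identifier map, the sketch below reads a gzipped, tab-separated toy file with the standard library. The column names and the one-row content are assumptions made for illustration; the real file is much larger and its header should be checked before relying on these names.

```python
import csv
import gzip
import io

# Toy stand-in for the map: a header plus one row describing ethanol
toy = (
    "chembl_id\tcanonical_smiles\tstandard_inchi\tstandard_inchi_key\n"
    "TOY1\tCCO\tInChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3\tLFQSCWFLJHTTHZ-UHFFFAOYSA-N\n"
)

# Compress it in memory so the example needs no files on disk
buffer = io.BytesIO()
with gzip.open(buffer, "wt", encoding="utf-8", newline="") as file:
    file.write(toy)
buffer.seek(0)

# Read it back the way a gzipped TSV on disk would be read
with gzip.open(buffer, "rt", encoding="utf-8", newline="") as file:
    records = list(csv.DictReader(file, delimiter="\t"))

smiles_by_id = {record["chembl_id"]: record["canonical_smiles"] for record in records}
```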

Misc

Add isort to code quality checking

Enable a return_version argument on many functions so that they return a tuple including the version, which is useful if you're having it infer the latest version.
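
The return_version idea can be sketched as follows. Everything here is hypothetical: locate_database is not a function in the package, the tuple order (version first) is an assumption, and ASSUMED_LATEST is a placeholder standing in for a real lookup.

```python
from typing import Optional, Tuple, Union

# Placeholder for the result of a real "what is the latest version?" lookup
ASSUMED_LATEST = "31"


def locate_database(
    version: Optional[str] = None, return_version: bool = False
) -> Union[str, Tuple[str, str]]:
    """Return a file name, optionally paired with the version that was resolved."""
    resolved = version if version is not None else ASSUMED_LATEST
    path = f"chembl_{resolved}.db"
    if return_version:
        return resolved, path  # assumed order: version first
    return path


used_version, used_path = locate_database(return_version=True)
```

Recording the resolved version alongside the results makes an analysis that started from "latest" reproducible later.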

v0.1.3(Dec 20, 2021)
This release adds get_substructure_library() for automating the generation of an RDKit substructure library as described in Greg Landrum's RDKit blog post, Some new features in the SubstructLibrary. The following example shows how it can be used to accomplish some of the first tasks presented in the post:

from rdkit import Chem

import chembl_downloader

library = chembl_downloader.get_substructure_library()
query = Chem.MolFromSmarts('[O,N]=C-c:1:c:c:n:c:c:1')
matches = library.GetMatches(query)

Full Changelog: https://github.com/cthoyt/chembl-downloader/compare/v0.1.2...v0.1.3
v0.1.2(Dec 20, 2021)
Add get_assay_sql() function

Full Changelog: https://github.com/cthoyt/chembl-downloader/compare/v0.1.1...v0.1.2
v0.1.1(Aug 5, 2021)

Add more top-level imports for download_sdf(), download_sqlite(), and latest()
v0.1.0(Aug 4, 2021)
Rename download() to download_extract_sqlite() to make room for other download functions

Add supplier() function for loading the SDF dump through RDKit

v0.0.4(Jul 28, 2021)
Update pandas backend for query() function

Improve CLI

v0.0.3(Jul 27, 2021)

Add query() function for automatically generating pandas DataFrames from a given SQL query
v0.0.2(Jul 27, 2021)
Fix bug when version not given

Fix bug where differing folder structures across ChEMBL versions caused problems

v0.0.1(Jul 27, 2021)

Initial release with download(), connect(), and cursor() functions.

Owner

Charles Tapley Hoyt

Bio/cheminformatician, open scientist, maintainer of @pybel and @pykeen, part of @indralab (he/him)
