GeneDisco is a benchmark suite for evaluating active learning algorithms for experimental design in drug discovery.

Overview

GeneDisco: A benchmark for active learning in drug discovery

In vitro cellular experimentation with genetic interventions, using for example CRISPR technologies, is an essential step in early-stage drug discovery and target validation that serves to assess initial hypotheses about causal associations between biological mechanisms and disease pathologies. With billions of potential hypotheses to test, the experimental design space for in vitro genetic experiments is extremely vast, and the available experimental capacity, even at the largest research institutions in the world, pales in comparison to the size of this biological hypothesis space.

GeneDisco (published at ICLR-22) is a benchmark suite for evaluating active learning algorithms for experimental design in drug discovery. GeneDisco contains a curated set of multiple publicly available experimental data sets, as well as open-source implementations of state-of-the-art active learning policies for experimental design and exploration.

GeneDisco ICLR-22 Challenge

Learn more about the GeneDisco challenge for experimental design, aimed at optimally exploring the vast genetic intervention space, here.

Install

pip install genedisco

Use

How to Run the Full Benchmark Suite?

All experiments included in GeneDisco (covering all baselines, acquisition functions, input and target data sets, and multiple seeds) can be executed sequentially, e.g. for an acquisition batch size of 64, 8 active learning cycles, and a bayesian_mlp model, using:

run_experiments \
  --cache_directory=/path/to/genedisco_cache  \
  --output_directory=/path/to/genedisco_output  \
  --acquisition_batch_size=64  \
  --num_active_learning_cycles=8  \
  --max_num_jobs=1

Results are written to the folder at /path/to/genedisco_output, and processed datasets will be cached at /path/to/genedisco_cache (please replace both with your desired paths) for faster startup in future invocations.

Note that due to the number of experiments run by the above command, we recommend execution on a compute cluster.
The GeneDisco codebase also supports execution on slurm compute clusters (the slurm command must be available on the executing node) using the following command, with dependencies in a Python virtualenv available at /path/to/your/virtualenv (please replace with your own virtualenv path):

run_experiments \
  --cache_directory=/path/to/genedisco_cache  \
  --output_directory=/path/to/genedisco_output  \
  --acquisition_batch_size=64  \
  --num_active_learning_cycles=8  \
  --schedule_on_slurm \
  --schedule_children_on_slurm \
  --remote_execution_virtualenv_path=/path/to/your/virtualenv

Other scheduling systems are currently not supported by default.

How to Run A Single Isolated Experiment (One Learning Cycle)?

To run one active learning loop cycle, for example, with the "topuncertain" acquisition function, the "achilles" feature set and the "schmidt_2021_ifng" task, execute the following command:

active_learning_loop  \
    --cache_directory=/path/to/genedisco/genedisco_cache \
    --output_directory=/path/to/genedisco/genedisco_output \
    --model_name="bayesian_mlp" \
    --acquisition_function_name="topuncertain" \
    --acquisition_batch_size=64 \
    --num_active_learning_cycles=8 \
    --feature_set_name="achilles" \
    --dataset_name="schmidt_2021_ifng" 

How to Evaluate a Custom Acquisition Function?

To run a custom acquisition function, set --acquisition_function_name="custom" and --acquisition_function_path to the file path that contains your custom acquisition function.

active_learning_loop  \
    --cache_directory=/path/to/genedisco/genedisco_cache \
    --output_directory=/path/to/genedisco/genedisco_output \
    --model_name="bayesian_mlp" \
    --acquisition_function_name="custom" \
    --acquisition_function_path=/path/to/custom_acquisition_function.py \
    --acquisition_batch_size=64 \
    --num_active_learning_cycles=8 \
    --feature_set_name="achilles" \
    --dataset_name="schmidt_2021_ifng" 

...where "/path/to/custom_acquisition_function.py" contains code for your custom acquisition function corresponding to the BaseBatchAcquisitionFunction interface, e.g.:

import numpy as np
from typing import AnyStr, List
from slingpy import AbstractDataSource
from slingpy.models.abstract_base_model import AbstractBaseModel
from genedisco.active_learning_methods.acquisition_functions.base_acquisition_function import \
    BaseBatchAcquisitionFunction

class RandomBatchAcquisitionFunction(BaseBatchAcquisitionFunction):
    def __call__(self,
                 dataset_x: AbstractDataSource,
                 batch_size: int,
                 available_indices: List[AnyStr],
                 last_selected_indices: List[AnyStr] = None,
                 model: AbstractBaseModel = None,
                 temperature: float = 0.9,
                 ) -> List:
        # Select the next acquisition batch uniformly at random, without
        # replacement, from the indices that have not yet been queried.
        selected = np.random.choice(available_indices, size=batch_size, replace=False)
        return selected

Note that if multiple valid acquisition functions are present in the loaded file, GeneDisco loads the last class that implements BaseBatchAcquisitionFunction.
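
For illustration, here is a minimal sketch of a file containing two valid acquisition functions (both class names are hypothetical); under the rule above, GeneDisco would instantiate FirstKAcquisition, the last class defined:

import numpy as np
from typing import AnyStr, List
from slingpy import AbstractDataSource
from slingpy.models.abstract_base_model import AbstractBaseModel
from genedisco.active_learning_methods.acquisition_functions.base_acquisition_function import \
    BaseBatchAcquisitionFunction

class RandomAcquisition(BaseBatchAcquisitionFunction):
    # Defined first, so this class would NOT be the one loaded.
    def __call__(self,
                 dataset_x: AbstractDataSource,
                 batch_size: int,
                 available_indices: List[AnyStr],
                 last_selected_indices: List[AnyStr] = None,
                 model: AbstractBaseModel = None) -> List:
        return list(np.random.choice(available_indices, size=batch_size, replace=False))

class FirstKAcquisition(BaseBatchAcquisitionFunction):
    # Defined last, so this is the class GeneDisco would instantiate.
    def __call__(self,
                 dataset_x: AbstractDataSource,
                 batch_size: int,
                 available_indices: List[AnyStr],
                 last_selected_indices: List[AnyStr] = None,
                 model: AbstractBaseModel = None) -> List:
        return list(available_indices)[:batch_size]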

Citation

If you reference or use our methodology, code or results in your work, please consider citing:

@inproceedings{mehrjou2022genedisco,
    title={{GeneDisco: A Benchmark for Experimental Design in Drug Discovery}},
    author={Mehrjou, Arash and Soleymani, Ashkan and Jesson, Andrew and Notin, Pascal and Gal, Yarin and Bauer, Stefan and Schwab, Patrick},
    booktitle={{International Conference on Learning Representations (ICLR)}},
    year={2022}
}

License

Authors

Patrick Schwab, GlaxoSmithKline plc
Arash Mehrjou, GlaxoSmithKline plc
Andrew Jesson, University of Oxford
Ashkan Soleymani, MIT

Acknowledgements

PS and AM are employees and shareholders of GlaxoSmithKline plc.

Comments
  • genedisco dependency incompatibility with evalai

    Hey, interesting challenge. Perhaps you could set up your repos with poetry to ensure complete reproducibility?

    When installing your repository using poetry alongside the evalai package, poetry's package management reports an incompatibility:

    Because no versions of evalai match >1.3.14,<2.0.0
    and evalai (1.3.14) depends on requests (2.25.1), evalai (>=1.3.14,<2.0.0) requires requests (2.25.1).
    And because genedisco (rev master) depends on requests (>=2.26.0), evalai (>=1.3.14,<2.0.0) is incompatible with 
    genedisco (rev master). So, because genedisco-challenge depends on both genedisco (branch master) 
    and evalai (^1.3.14), version solving failed.
    
    opened by jgamper 1
  • HitRatio metric

    Add HitRatio as a performance metric computed at every cycle.

    Description:

    Before the AL loop begins, the function prepare_hitratio_evaluation in active_learning_loop.py finds the top mover genes for a certain ratio that it receives as input and saves those genes at output_dir/hitratio_artefacts. This set is found and saved for every random seed that is used in the experiments.

    The constructor of the Slingpy Evaluator is designed for supervised learning tasks: https://github.com/slingpy/slingpy/blob/841651d8d95bf7bfe905cd7e622fac8bbbf8ab71/slingpy/evaluation/evaluator.py#L29

    To support the HitRatio metric, this constructor has to be changed, or a new Evaluator has to be designed specifically for this metric. This PR took the second approach and designed a new Evaluator at genedisco/evaluation/evaluator.py. As this Evaluator has a different constructor, the evaluate_model method of AbstractBaseApplication is overridden in the SingleCycleApplication class so that it can choose which evaluator to use (slingpy's normal evaluator or the newly developed evaluator for HitRatio in GeneDisco).

    The HitRatio metric in genedisco/evaluation/hitratio.py builds a list of the genes chosen at all cycles up to and including the current cycle, and checks how many of them are among the top mover genes (which were calculated and stored before the AL cycles began).
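
    For intuition, here is a minimal NumPy sketch of the metric as described above (function and variable names are illustrative, not the PR's actual code):

    import numpy as np

    def hit_ratio(selected_per_cycle, gene_effect_sizes, top_ratio=0.05):
        # Top movers: genes with the largest absolute effect sizes, computed
        # once before the AL loop begins (top_ratio is the input ratio).
        num_top = int(np.ceil(top_ratio * len(gene_effect_sizes)))
        ranked = sorted(gene_effect_sizes, key=lambda g: abs(gene_effect_sizes[g]), reverse=True)
        top_movers = set(ranked[:num_top])
        # All genes chosen at all cycles up to and including the current one.
        selected_so_far = {gene for cycle in selected_per_cycle for gene in cycle}
        # Fraction of the top movers recovered by the selections so far.
        return len(selected_so_far & top_movers) / len(top_movers)

    # Example: after two cycles, both top movers (g1 and g2) were selected.
    effects = {"g1": 2.0, "g2": -1.8, "g3": 0.1, "g4": 0.05, "g5": 1.5}
    print(hit_ratio([["g1", "g3"], ["g2"]], effects, top_ratio=0.4))  # 1.0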

    opened by amehrjou 0
  • Difficulty reproducing results

    Hi! I imagine this might be a trivial mistake but I'm having some issues making the package work. I'm installing the package in a new environment and running the following:

    active_learning_loop  \
        --cache_directory=/path/to/genedisco/genedisco_cache \
        --output_directory=/path/to/genedisco/genedisco_output \
        --model_name="bayesian_mlp" \
        --acquisition_function_name="topuncertain" \
        --acquisition_batch_size=64 \
        --num_active_learning_cycles=8 \
        --feature_set_name="achilles" \
        --dataset_name="schmidt_2021_ifng" 
    

    However, when opening output/results.pickle, the hit ratio seems to be far off from the figure shown in the paper in Appendix C (a hit ratio of about 0.10). Instead, I'm consistently getting the same hit ratio as random selection of genes (about 0.03 in this case) and have tried several seeds. While the model loss seems to be getting smaller after each training cycle in the torch epoch loop, the errors in the results file seem to be very stable, which leads me to believe the predictions are not being evaluated correctly. This is the output of output/results.pickle:

    [{'HitRatio': 0.004901960784313725,
      'MeanAbsoluteError': 0.15648,
      'RootMeanSquaredError': 0.21979001,
      'SymmetricMeanAbsolutePercentageError': 177.80421150155175,
      'SpearmanRho': 0.003993152761730373},
     {'HitRatio': 0.006535947712418301,
      'MeanAbsoluteError': 0.15606545,
      'RootMeanSquaredError': 0.21957004,
      'SymmetricMeanAbsolutePercentageError': 185.59466626510942,
      'SpearmanRho': 0.004703939468359378},
     {'HitRatio': 0.014705882352941176,
      'MeanAbsoluteError': 0.15653546,
      'RootMeanSquaredError': 0.21988931,
      'SymmetricMeanAbsolutePercentageError': 176.9613241689807,
      'SpearmanRho': -0.015053413954220293},
     {'HitRatio': 0.016339869281045753,
      'MeanAbsoluteError': 0.15699045,
      'RootMeanSquaredError': 0.22019535,
      'SymmetricMeanAbsolutePercentageError': 169.33131150971903,
      'SpearmanRho': -0.01484454339976535},
     {'HitRatio': 0.0196078431372549,
      'MeanAbsoluteError': 0.1561022,
      'RootMeanSquaredError': 0.21959972,
      'SymmetricMeanAbsolutePercentageError': 189.1150129655341,
      'SpearmanRho': -0.0007014284997006996},
     {'HitRatio': 0.021241830065359478,
      'MeanAbsoluteError': 0.15598604,
      'RootMeanSquaredError': 0.2196098,
      'SymmetricMeanAbsolutePercentageError': 186.18022092453444,
      'SpearmanRho': 0.007904273772308604},
     {'HitRatio': 0.027777777777777776,
      'MeanAbsoluteError': 0.15609008,
      'RootMeanSquaredError': 0.21968609,
      'SymmetricMeanAbsolutePercentageError': 185.90767033853314,
      'SpearmanRho': 0.00014303902195985987},
     {'HitRatio': 0.032679738562091505,
      'MeanAbsoluteError': 0.15619111,
      'RootMeanSquaredError': 0.21959607,
      'SymmetricMeanAbsolutePercentageError': 185.43084240240117,
      'SpearmanRho': -0.01098975831586953}]
    

    Some other things:

    the file output/run_results.pickle appears to be empty (I'm not sure whether it should be), containing only the following:

    RunResult(validation_scores=None, test_scores=None, model_path=None)
    

    I have been looking through the code for quite a while but have been unable to find a solution. Any help would be much appreciated, thanks!

    opened by lcamillo 2
  • Data GPU issue

    It seems like the data are on CPU while the model is on GPU. Because of slingpy, we can't easily move the data to the GPU, so I think the easier solution is to add preprocessors to the model to do this (see the sketch below). Let me know if there's an easier way or a better fix from your side.
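
    Something along these lines is what I have in mind: a generic PyTorch wrapper (not slingpy's actual preprocessor API, which would need checking) that moves each batch onto the wrapped model's device before the forward pass:

    import torch
    import torch.nn as nn

    class ToModelDevice(nn.Module):
        # Hypothetical "preprocessor": runs the wrapped model on its own device.
        def __init__(self, model: nn.Module):
            super().__init__()
            self.model = model

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            device = next(self.model.parameters()).device
            return self.model(x.to(device))

    # Works whether or not CUDA is available:
    wrapped = ToModelDevice(nn.Linear(8, 1).to("cuda" if torch.cuda.is_available() else "cpu"))
    print(wrapped(torch.randn(4, 8)).shape)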

    More details about the issue below:

    How to repro: This fails

    run_experiments \
      --cache_directory=/path/to/genedisco_cache  \
      --output_directory=/path/to/genedisco_output  \
      --acquisition_batch_size=64  \
      --num_active_learning_cycles=8  \
      --max_num_jobs=1
    

    with the message

    Traceback (most recent call last):
      File "/blah/bin/run_experiments", line 8, in <module>
        sys.exit(main())
      File "/blah/lib/python3.8/site-packages/genedisco/apps/run_experiments_application.py", line 139, in main
        results = app.run()
      File "/blah/lib/python3.8/site-packages/slingpy/apps/run_policies/abstract_run_policy.py", line 62, in run
        run_results = self._run(**kwargs)
      File "/blah/lib/python3.8/site-packages/slingpy/apps/abstract_base_application.py", line 419, in _run
        run_result = self.run_policy._run(**self.get_params())
      File "/blah/lib/python3.8/site-packages/slingpy/apps/run_policies/local_single_run_policy.py", line 31, in _run
        return self.base_policy_fun(**kwargs)
      File "/blah/lib/python3.8/site-packages/slingpy/apps/abstract_base_application.py", line 436, in run_single
        model = self.train_model()
      File "/blah/lib/python3.8/site-packages/genedisco/apps/run_experiments_application.py", line 128, in train_model
        outputs.append(RunExperimentsApplication.parallel_run_wrapper(arg, self_reference=self))
      File "/blah/lib/python3.8/site-packages/genedisco/apps/run_experiments_application.py", line 95, in parallel_run_wrapper
        app.run()
      File "/blah/lib/python3.8/site-packages/slingpy/apps/run_policies/abstract_run_policy.py", line 62, in run
        run_results = self._run(**kwargs)
      File "/blah/lib/python3.8/site-packages/slingpy/apps/abstract_base_application.py", line 419, in _run
        run_result = self.run_policy._run(**self.get_params())
      File "/blah/lib/python3.8/site-packages/slingpy/apps/run_policies/local_single_run_policy.py", line 31, in _run
        return self.base_policy_fun(**kwargs)
      File "/blah/lib/python3.8/site-packages/slingpy/apps/abstract_base_application.py", line 436, in run_single
        model = self.train_model()
      File "/blah/lib/python3.8/site-packages/genedisco/apps/active_learning_loop.py", line 220, in train_model
        results = app.run().run_result
      File "/blah/lib/python3.8/site-packages/slingpy/apps/run_policies/abstract_run_policy.py", line 62, in run
        run_results = self._run(**kwargs)
      File "/blah/lib/python3.8/site-packages/slingpy/apps/abstract_base_application.py", line 419, in _run
        run_result = self.run_policy._run(**self.get_params())
      File "/blah/lib/python3.8/site-packages/slingpy/apps/run_policies/composite_run_policy.py", line 90, in _run
        result_dicts = list(map(
      File "/blah/lib/python3.8/site-packages/slingpy/apps/run_policies/abstract_run_policy.py", line 97, in run_with_file_output
        run_results_w_metadata = base_policy.run(**kwargs)
      File "/blah/lib/python3.8/site-packages/slingpy/apps/run_policies/composite_run_policy.py", line 76, in run
        run_results = self._run(**kwargs)
      File "/blah/lib/python3.8/site-packages/slingpy/apps/run_policies/composite_run_policy.py", line 90, in _run
        result_dicts = list(map(
      File "/blah/lib/python3.8/site-packages/slingpy/apps/run_policies/abstract_run_policy.py", line 97, in run_with_file_output
        run_results_w_metadata = base_policy.run(**kwargs)
      File "/blah/lib/python3.8/site-packages/slingpy/apps/run_policies/composite_run_policy.py", line 76, in run
        run_results = self._run(**kwargs)
      File "/blah/lib/python3.8/site-packages/slingpy/apps/run_policies/composite_run_policy.py", line 90, in _run
        result_dicts = list(map(
      File "/blah/lib/python3.8/site-packages/slingpy/apps/run_policies/abstract_run_policy.py", line 97, in run_with_file_output
        run_results_w_metadata = base_policy.run(**kwargs)
      File "/blah/lib/python3.8/site-packages/slingpy/apps/run_policies/abstract_run_policy.py", line 62, in run
        run_results = self._run(**kwargs)
      File "/blah/lib/python3.8/site-packages/slingpy/apps/run_policies/local_single_run_policy.py", line 31, in _run
        return self.base_policy_fun(**kwargs)
      File "/blah/lib/python3.8/site-packages/slingpy/apps/abstract_base_application.py", line 436, in run_single
        model = self.train_model()
      File "/blah/lib/python3.8/site-packages/genedisco/apps/single_cycle_application.py", line 246, in train_model
        self.model.fit(self.datasets.training_set_x,
      File "/blah/lib/python3.8/site-packages/genedisco/models/meta_models.py", line 163, in fit
        return self.model.fit(train_x, train_y, validation_set_x, validation_set_y)
      File "/blah/lib/python3.8/site-packages/slingpy/models/torch_model.py", line 171, in fit
        y_pred_i, loss = get_output_and_loss(inputs, labels)
      File "/blah/lib/python3.8/site-packages/slingpy/models/torch_model.py", line 154, in get_output_and_loss
        y_pred_i = self.model(model_inputs)
      File "/blah/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/blah/lib/python3.8/site-packages/genedisco/active_learning_methods/batchbald_redux/consistent_mc_dropout.py", line 45, in forward
        mc_output_BK = self.mc_forward_impl(mc_input_BK)[0]
      File "/blah/lib/python3.8/site-packages/genedisco/models/pytorch_models.py", line 54, in mc_forward_impl
        emb = self.fc1(x)
      File "/blah/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/blah/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 94, in forward
        return F.linear(input, self.weight, self.bias)
      File "/blah/lib/python3.8/site-packages/torch/nn/functional.py", line 1753, in linear
        return torch._C._nn.linear(input, weight, bias)
    RuntimeError: Tensor for argument #2 'mat1' is on CPU, but expected it to be on GPU (while checking arguments for addmm)
    
    opened by ptigas 1
  • GeneDisco breaks if `torch.cuda.is_available()` when running active learning setup

    Great tool and exciting work!

    When running the following (from the starter kit) with torch.cuda.is_available() returning True:

    active_learning_loop  \
        --cache_directory=/path/to/genedisco/genedisco_cache \
        --output_directory=/path/to/genedisco/genedisco_output \
        --model_name="bayesian_mlp" \
        --acquisition_function_name="topuncertain" \
        --acquisition_batch_size=64 \
        --num_active_learning_cycles=8 \
        --feature_set_name="achilles" \
        --dataset_name="schmidt_2021_ifng" 
    

    GeneDisco fails with a RuntimeError: Tensor for argument #1 is on CPU but expected it to be on GPU.

    Note that my cluster supports Python 3.7, not 3.8, so I had to alter the requirements for this package. This could be the source of the issue, but I doubt it in this case. (Side note: is there a reason why 3.8 is required?)

    The problem is in slingpy: specifically here: https://github.com/slingpy/slingpy/blob/7b27a4dae3957d3bcdb2622f288a2fc02022c69c/slingpy/models/torch_model.py#L144

            device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
            self.model = self.model.to(device)
    

    This is where slingpy checks for CUDA and puts the model on the device. However, the data are never put on the device, causing the error.

    The easy fix (verified working), which forgoes the GPU entirely, is simply to remove these two lines.

    In my experience, my GPU (RTX 8000) is ~4x faster than 4 CPUs.

    Fixing this example to use the GPU (for training, though not for prediction), at least in this case, requires a two-line change.
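
    For concreteness, here is a self-contained toy reproduction of the mismatch and the gist of the two-line fix (toy dimensions; not the actual slingpy code):

    import torch
    import torch.nn as nn

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = nn.Linear(32, 1).to(device)  # slingpy moves the model to the device...

    inputs = torch.randn(64, 32)         # ...but the data batch stays on CPU
    inputs = inputs.to(device)           # the gist of the fix: move the batch too
    print(model(inputs).shape)           # no device-mismatch RuntimeError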

    opened by atong01 1
  • Jupyter notebook EDA

    Hi all, thanks for launching the competition. It seems very interesting! I read the paper and quickly checked the code and datasets. If you think it is reasonable, I wanted to suggest that someone in your group make a Jupyter notebook with exploratory data analysis and a simple model example. I used to compete on Kaggle, and usually a notebook going over the data makes the understanding of the problem much clearer, especially for folks who are not from this field. If you want some examples, you can see them here, where people published their code to predict mechanisms of action: https://www.kaggle.com/c/lish-moa/code?competitionId=19988&sortBy=voteCount

    best, Felipe

    opened by fmellomascarenhas 1