# Improving the Accuracy-Memory Trade-Off of Random Forests Via Leaf-Refinement
This is the repository for the paper "Improving the Accuracy-Memory Trade-Off of Random Forests Via Leaf-Refinement". The repository is structured as follows:
- `PyPruning`: This repository contains the implementations of all pruning algorithms and can be installed as a regular Python package and used in other projects. For more information have a look at the Readme file in `PyPruning/Readme.md` and its documentation in `PyPruning/docs`.
- `experiment_runner`: This is a simple package / script which can be used to run multiple experiments in parallel on the same machine or distributed across many different machines. It can also be installed as a regular Python package and used for other projects. For more information have a look at the Readme file in `experiment_runner/Readme.md`.
- `{adult, bank, connect, ..., wine-quality}`: Each folder contains a script `init.sh` which downloads the necessary files and performs pre-processing if necessary (e.g. extracting archives).
- `init_all.sh`: Iterates over all datasets and calls the respective `init.sh` files. Depending on your internet connection this may take some time.
- `environment.yml`: Anaconda environment file which contains all dependencies. For more details see below.
- `LeafRefinement.py`: The implementation of the LeafRefinement method. We initially implemented a more complex method which uses Proximal Gradient Descent to simultaneously learn the weights and refine the leaf nodes. During our experiments we discovered that leaf-refinement by itself was enough and much simpler, so we kept the old code but implemented the `LeafRefinement` class in `LeafRefinement.py` for easier usage (see the sketch after this list).
- `run.py`: The script which executes the experiments. For more details see the examples below.
- `plot_results.py`: The script used to explore and display results. It also creates the plots for the paper.
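To give a flavor of how the class is meant to be used, here is a hypothetical sketch; the import path, constructor, and `fit` signature are assumptions for illustration, so check `LeafRefinement.py` for the actual interface.

```python
# Hypothetical usage sketch -- names and signatures below are assumptions;
# see LeafRefinement.py for the actual interface.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

from LeafRefinement import LeafRefinement  # assumed import path

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Train a base forest whose leaf values will later be refined.
forest = RandomForestClassifier(n_estimators=32, max_leaf_nodes=64, random_state=0)
forest.fit(X, y)

# Refine the leaves of the (possibly pruned) forest on the training data.
refiner = LeafRefinement()   # hypothetical default construction
refiner.fit(X, y, forest)    # hypothetical signature
```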
## Getting everything ready
This git repository contains two submodules, `PyPruning` and `experiment_runner`, which need to be cloned first:

```bash
git clone --recurse-submodules git@github.com:sbuschjaeger/leaf-refinement-experiments.git
```
After the code has been obtained you need to install all dependencies. If you use Anaconda you can simply call

```bash
conda env create -f environment.yml
```
to prepare the environment `LR`. After that you can install the Python packages `PyPruning` and `experiment_runner` via pip:

```bash
pip install -e file:PyPruning
pip install -e file:experiment_runner
```

and finally activate the environment with

```bash
conda activate LR
```
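As a quick sanity check that the installation worked, you can try importing both packages; the module names below are assumed to match the repository names, so adjust them if they differ:

```python
# Minimal import check -- module names are assumed to match the package
# folders; consult the respective Readme files if these imports fail.
import PyPruning
import experiment_runner

print("PyPruning and experiment_runner imported successfully")
```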
Lastly, you will need to get some data. If you are interested in a specific dataset you can use the accompanying `init.sh` script via

```bash
cd ${Dataset}
./init.sh
```

or, if you want to download all datasets, use

```bash
./init_all.sh
```

Depending on your internet connection this may take some time.
## Running experiments
If everything worked as expected you should now be able to run the `run.py` script to prune some ensembles. This script has a decent number of parameters; see further below for a minimal working example.
- `n_jobs`: Number of jobs / threads used for multiprocessing.
- `base`: Base learner used for experiments. Can be {RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier, HeterogenousForest}. Can be a list of arguments for multiple experiments.
- `nl`: Maximum number of leaf nodes (corresponds to scikit-learn's `max_leaf_nodes` parameter).
- `dataset`: Dataset used for the experiment. Can be a list of arguments for multiple experiments.
- `n_estimators`: Number of estimators trained for the base learner.
- `n_prune`: Size of the pruned ensemble. Can be a list of arguments for multiple experiments.
- `xval`: Number of cross-validation runs (default is 5).
- `use_prune`: If set, the script uses a train / prune / test split. If not set, the training data is also used for pruning.
- `timeout`: Maximum number of seconds per run. If the runtime exceeds the provided value, execution is stopped (default is 5400 seconds).
Note that all base ensembles for all cross-validation splits of a dataset are trained before any of the pruning algorithms are used. If you want to evaluate many datasets / hyperparameter configurations in one run, this requires a lot of memory.
To train and prune forests on the `magic` dataset you can, for example, do

```bash
./run.py --dataset magic --n_estimators 256 --n_prune 2 4 8 16 32 64 128 256 --nl 64 128 256 512 1024 --n_jobs 128 --xval 5 --base RandomForestClassifier
```
The results are stored in `${Dataset}/results/${base}/${use_prune}/${date}/results.jsonl`, where `${Dataset}` is the dataset (e.g. `magic`) and `${date}` is the current date and time.
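Since the results are plain JSON-lines files, you can also peek at them directly with pandas. A minimal sketch, where the concrete path (in particular the `${use_prune}` and `${date}` folders) is a placeholder you need to substitute with one from your own run:

```python
# Load a results file and inspect it. The path below is a placeholder --
# replace it with the folder actually created by your run.
import pandas as pd

path = "magic/results/RandomForestClassifier/True/01-01-2022-12:00:00/results.jsonl"
df = pd.read_json(path, lines=True)  # one JSON record per line

print(df.columns)  # see which fields were logged before plotting anything
print(df.head())
```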
In order to reproduce the experiments from the paper you can call:

```bash
./run.py --dataset adult anura bank chess connect eeg elec postures japanese-vowels magic mozilla mnist nomao avila ida2016 satimage --n_estimators 256 --n_prune 2 4 8 16 32 64 128 256 --nl 64 128 256 512 1024 --n_jobs 128 --xval 5 --base RandomForestClassifier
```
**Important:** This call uses 128 threads and requires a decent amount of memory (something in the range of 64 GB) to work.
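To see where the memory goes, here is a back-of-the-envelope count, under the assumption that the list-valued arguments are fully crossed and that each `--nl` value yields its own set of base forests:

```python
# Rough count of base models kept in memory per dataset before pruning starts
# (assumption: --nl and --xval are fully crossed, as the grid above suggests).
nl_values = 5        # --nl 64 128 256 512 1024
xval_folds = 5       # --xval 5
n_estimators = 256   # --n_estimators 256

forests_per_dataset = nl_values * xval_folds             # 25 base forests
trees_per_dataset = forests_per_dataset * n_estimators   # 6400 trees
print(forests_per_dataset, trees_per_dataset)
```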
## Exploring the results
After you have run the experiments you can view the results with the `plot_results.py` script. We recommend using an interactive Python environment for this, such as Jupyter or VSCode with the ability to execute cells, but you should also be able to run the script as-is. The script is fairly well-commented, so please have a look at it for more detailed comments.