A practical ML pipeline for data labeling with experiment tracking using DVC.

Overview

Auto Label Pipeline

A practical ML pipeline for data labeling with experiment tracking using DVC

Goals:

  • Demonstrate reproducible ML
  • Use DVC to build a pipeline and track experiments
  • Automatically clean noisy data labels using Cleanlab cross validation
  • Determine which FastText subword embedding performs better for semi-supervised cluster classification
  • Determine optimal hyperparameters through experiment tracking
  • Prepare casually labeled data for human evaluation

Demo: View Experiments recorded in git branches:

asciicast

The Data

For our working demo, we will purify some of the slightly noisy/dirty labels found in Wikidata people entries for attributes for Employers and Occupations. Our initial data labels have been harvested from a json dump of Wikidata, the Kensho Wikidata dataset, and this notebook script for extracting the data.

Data Input Format

Tab separated CSV files, with the fields:

  • text_data - the item that is to be labeled (single word or short group of words)
  • class_type - the class label
  • context - any text that surrounds the text_data field in situ, or defines the text_data item in other words.
  • count - the number of occurrences of this label; how common it appears in the existing data.

Data Output format

  • (same parameters as the data input plus)
  • date_updated - when the label was updated
  • previous_class_type - the previous class_type label
  • mislabeled_rank - records how low the confidence was prior to a re-label

The Pipeline

  • Fetch
  • Prepare
  • Train
  • Relabel

For details, see the README in the src folder. The pipeline is orchestrated via the dvc.yaml file, and parameterized via params.yaml.

Using/Extending the pipeline

  1. Drop your own CSV files into the data/raw directory
  2. Run the pipeline
  3. Tune settings, embeddings, etc, until no longer amused
  4. Verify your results manually and by submitting data/final/data.csv for human evaluation, using random sampling and drawing heavily from the mislabeled_rank entries.

Project Structure

├── LICENSE
├── README.md
├── data                    # <-- Directory with all types of data
│ ├── final                 # <-- Directory with final data
│ │ ├── class.metrics.csv   # <-- Directory with raw and intermediate data
│ │ └── data.csv            # <-- Pipeline output (not stored in git)
│ ├── interim               # <-- Directory with temporary data
│ │ ├── datafile.0.csv
│ │ └── datafile.1.csv
│ ├── prepared              # <-- Directory with prepared data
│ │ └── data.all.csv
│ └── raw                   # <-- Directory with raw data; populated by pipeline's fetch stage
│     ├── README.md
│     ├── cc.en.300.bin               # <-- Fasttext binary model file, creative commons 
│     ├── crawl-300d-2M-subword.bin   # <-- Fasttext binary model file, common crawl
│     ├── crawl-300d-2M-subword.vec
│     ├── employers.wikidata.csv      # <-- Our initial data, 1 set of class labels 
│     ├── lid.176.ftz
│     └── occupations.wikidata.csv    # <-- Our initial data, 1 set of class labels
├── dvc.lock                # <-- DVC internal state tracking file
├── dvc.yaml                # <-- DVC project configuration file
├── dvc_plots               # <-- Temp directory for DVC plots; not tracked by git
│ └── README.md
├── model
│ ├── class.metrics.csv
│ ├── svm.model.pkl
│ └── train.metrics.json    # <-- Metrics from the pipeline's train stage  
├── mypy.ini
├── params.yaml             # <-- Parameter configuration file for the pipeline
├── reports                 # <-- Directory with metrics output
│ ├── prepare.metrics.json  
│ └── relabel.metrics.json
├── requirements-dev.txt
├── requirements.txt
├── runUnitTests.sh
└── src                     # <-- Directory containing the pipeline's code
    ├── README.md
    ├── fetch.py
    ├── prepare.py
    ├── relabel.py
    ├── train.py
    └── utils.py

Setup

Create environment

conda create --name auto-label-pipeline python=3.9

conda activate auto-label-pipeline

Install requirements

pip install -r requirements.txt

If you're going to modify the source, also install the requirements-dev.txt file


Reproduce the pipeline results locally

dvc repro

View Metrics

dvc metrics show

See also: DVC metrics

Working with Experiments

To see your local experiments:

dvc exp show

Experiments that have been turned into a branches can be referenced directly in commands:

dvc exp diff svc_linear_ex svc_rbf_ex

e.g. to compare experiments:

dvc exp diff [experiment branch name] [experiment branch 2 name]

e.g.:

dvc exp diff svc_linear_ex svc_rbf_ex

dvc exp diff svc_poly_ex svc_rbf_ex

To create an experiment by changing a parameter:

dvc exp run --set-param train.split=0.9 --name my_split_ex

(When promoting an experiment to a branch, DVC does not switch into the branch.)

To save and share your experiment in a branch:

dvc exp branch my_split_ex my_split_ex_branch

See also: DVC Experiments

View plots

Initial Confusion matrix:

dvc plots show model/class.metrics.csv -x actual -y predicted --template confusion

Confusion matrix after relabeling:

dvc plots show data/final/class.metrics.csv -x actual -y predicted --template confusion

See also: DVC plots


Conclusions

  • For relabeling and cleaning, it's important to have more than two labels, and to specifying an UNK label for: unknown; labels spanning multiple groups; or low confidence support.
  • Standardizing the input data formats allow users to flexibly use many different data sources.
  • Language detection is an important part of data cleaning, however problematic because:
    • Modern languages sometimes "borrow" words from other languages (but not just any words!)
    • Language detection models perform inference poorly with limited data, especially just a single word.
    • Normalization utilities, such as unidecode aren't helpful; (the wrong word in more readable letters is still the wrong word).
  • Experimentation parameters often have co-dependencies that would make a simple combinatorial grid search inefficient.

Recommended readings:

  • Confident Learning: Estimating Uncertainty in Dataset Labels by Curtis G. Northcutt, Lu Jiang, Isaac L. Chuang, 31 Oct 2019, arxiv
  • A Simple but tough-to-beat baseline for sentence embeddings by Sanjeev Arora, Yingyu Liang, Tengyu Ma, ICLR 2017, paper
  • Support Vector Clustering by Asa Ben-Hur, David Horn, Hava T. Siegelmann, Vladimir Vapnik, November 2001 Journal of Machine Learning Research 2 (12):125-137, DOI:10.1162/15324430260185565, paper
  • SVM clustering by Winters-Hilt, S., Merat, S. BMC Bioinformatics 8, S18 (2007). link, paper

Note: this repo layout borrows heavily from the Cookie Cutter Data Science Layout If you're not familiar with it, please check it out.

You might also like...
Tracking code for the winner of track 1 in the MMP-Tracking Challenge at ICCV 2021 Workshop.

Tracking Code for the winner of track1 in MMP-Trakcing challenge This repository contains our tracking code for the Multi-camera Multiple People Track

Quadruped-command-tracking-controller - Quadruped command tracking controller (flat terrain)
Quadruped-command-tracking-controller - Quadruped command tracking controller (flat terrain)

Quadruped command tracking controller (flat terrain) Prepare Install RAISIM link

Python package for multiple object tracking research with focus on laboratory animals tracking.
Python package for multiple object tracking research with focus on laboratory animals tracking.

motutils is a Python package for multiple object tracking research with focus on laboratory animals tracking. Features loads: MOTChallenge CSV, sleap

QuickAI is a Python library that makes it extremely easy to experiment with state-of-the-art Machine Learning models.
QuickAI is a Python library that makes it extremely easy to experiment with state-of-the-art Machine Learning models.

QuickAI is a Python library that makes it extremely easy to experiment with state-of-the-art Machine Learning models.

Experiment about Deep Person Re-identification with EfficientNet-v2
Experiment about Deep Person Re-identification with EfficientNet-v2

We evaluated the baseline with Resnet50 and Efficienet-v2 without using pretrained models. Also Resnet50-IBN-A and Efficientnet-v2 using pretrained on ImageNet. We used two datasets: Market-1501 and CUHK03.

An experiment to bait a generalized frontrunning MEV bot

Honeypot 🍯 A simple experiment that: Creates a honeypot contract Baits a generalized fronturnning bot with a unique transaction Analyze bot behaviour

Small-bets - Ergodic Experiment With Python

Ergodic Experiment Based on this video. Run this experiment with this command: p

An experiment on the performance of homemade Q-learning AIs in Agar.io depending on their state representation and available actions
An experiment on the performance of homemade Q-learning AIs in Agar.io depending on their state representation and available actions

Agar.io_Q-Learning_AI An experiment on the performance of homemade Q-learning AIs in Agar.io depending on their state representation and available act

All the essential resources and template code needed to understand and practice data structures and algorithms in python with few small projects to demonstrate their practical application.

Data Structures and Algorithms Python INDEX 1. Resources - Books Data Structures - Reema Thareja competitiveCoding Big-O Cheat Sheet DAA Syllabus Inte

Owner
Todd Cook
Software craftsman
Todd Cook
A data annotation pipeline to generate high-quality, large-scale speech datasets with machine pre-labeling and fully manual auditing.

About This repository provides data and code for the paper: Scalable Data Annotation Pipeline for High-Quality Large Speech Datasets Development (subm

Appen Repos 86 Dec 7, 2022
Tracking Pipeline helps you to solve the tracking problem more easily

Tracking_Pipeline Tracking_Pipeline helps you to solve the tracking problem more easily I integrate detection algorithms like: Yolov5, Yolov4, YoloX,

VNOpenAI 32 Dec 21, 2022
Generic template to bootstrap your PyTorch project with PyTorch Lightning, Hydra, W&B, and DVC.

NN Template Generic template to bootstrap your PyTorch project. Click on Use this Template and avoid writing boilerplate code for: PyTorch Lightning,

Luca Moschella 520 Dec 30, 2022
YoHa - A practical hand tracking engine.

YoHa - A practical hand tracking engine.

null 2k Jan 6, 2023
Image transformations designed for Scene Text Recognition (STR) data augmentation. Published at ICCV 2021 Workshop on Interactive Labeling and Data Augmentation for Vision.

Data Augmentation for Scene Text Recognition (ICCV 2021 Workshop) (Pronounced as "strog") Paper Arxiv Why it matters? Scene Text Recognition (STR) req

Rowel Atienza 152 Dec 28, 2022
Calling Julia from Python - an experiment on data loading

Calling Julia from Python - an experiment on data loading See the slides. TLDR After reading Patrick's blog post, we decided to try to replace C++ wit

Abel Siqueira 8 Jun 7, 2022
code for our paper "Source Data-absent Unsupervised Domain Adaptation through Hypothesis Transfer and Labeling Transfer"

SHOT++ Code for our TPAMI submission "Source Data-absent Unsupervised Domain Adaptation through Hypothesis Transfer and Labeling Transfer" that is ext

null 75 Dec 16, 2022
Automatic labeling, conversion of different data set formats, sample size statistics, model cascade

Simple Gadget Collection for Object Detection Tasks Automatic image annotation Conversion between different annotation formats Obtain statistical info

llt 4 Aug 24, 2022
Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT

CheXbert: Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT CheXbert is an accurate, automated dee

Stanford Machine Learning Group 51 Dec 8, 2022
Joint detection and tracking model named DEFT, or ``Detection Embeddings for Tracking.

DEFT: Detection Embeddings for Tracking DEFT: Detection Embeddings for Tracking, Mohamed Chaabane, Peter Zhang, J. Ross Beveridge, Stephen O'Hara

Mohamed Chaabane 253 Dec 18, 2022