Auto Label Pipeline
A practical ML pipeline for data labeling with experiment tracking using DVC
Goals:
- Demonstrate reproducible ML
- Use DVC to build a pipeline and track experiments
- Automatically clean noisy data labels using Cleanlab cross validation
- Determine which FastText subword embedding performs better for semi-supervised cluster classification
- Determine optimal hyperparameters through experiment tracking
- Prepare casually labeled data for human evaluation
Demo: the recorded experiments are available as git branches (see Working with Experiments below).
The Data
For our working demo, we will clean up some of the slightly noisy/dirty labels found in Wikidata people entries for the Employer and Occupation attributes. Our initial data labels were harvested from a JSON dump of Wikidata, the Kensho Wikidata dataset, and this notebook script for extracting the data.
Data Input Format
Tab separated CSV files, with the fields:

- `text_data` - the item that is to be labeled (single word or short group of words)
- `class_type` - the class label
- `context` - any text that surrounds the `text_data` field in situ, or defines the `text_data` item in other words
- `count` - the number of occurrences of this label; how common it appears in the existing data
Data Output Format

Same fields as the data input, plus:

- `date_updated` - when the label was updated
- `previous_class_type` - the previous `class_type` label
- `mislabeled_rank` - records how low the confidence was prior to a re-label
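For orientation, here is a minimal sketch of loading one of these input files with pandas. The file name is hypothetical, and the sketch assumes the file has no header row:

```python
import pandas as pd

# Hypothetical example file in data/raw/ -- any tab-separated file with these
# four fields should work. If your files include a header row, pass header=0
# and drop the names= argument.
df = pd.read_csv(
    "data/raw/my_labels.csv",
    sep="\t",
    names=["text_data", "class_type", "context", "count"],
)

print(df.head())
print(df["class_type"].value_counts())
```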
The Pipeline
- Fetch
- Prepare
- Train
- Relabel
For details, see the README in the src folder. The pipeline is orchestrated via the dvc.yaml file, and parameterized via params.yaml.
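The Relabel stage presumably implements the "clean noisy labels with Cleanlab cross validation" goal above. The sketch below is a simplified illustration of that idea, not the actual relabel.py; the SVC classifier, the Cleanlab 2.x API, and the variable names are assumptions:

```python
import numpy as np
from cleanlab.filter import find_label_issues          # assumes Cleanlab >= 2.0
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC


def find_noisy_labels(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Return indices of likely-mislabeled rows, worst first.

    X: one embedding vector per text_data item; y: integer-encoded class_type.
    Both are assumptions here -- the real prepare/train stages build them.
    """
    # Out-of-sample predicted probabilities via cross-validation, so each
    # label is judged by a model that never trained on that row.
    pred_probs = cross_val_predict(
        SVC(kernel="rbf", probability=True), X, y, cv=5, method="predict_proba"
    )
    # Indices of likely label issues, ranked by how little the model
    # trusts the given label.
    return find_label_issues(
        labels=y, pred_probs=pred_probs, return_indices_ranked_by="self_confidence"
    )
```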
Using/Extending the pipeline
- Drop your own CSV files into the `data/raw` directory
- Run the pipeline
- Tune settings, embeddings, etc., until no longer amused
- Verify your results manually and by submitting `data/final/data.csv` for human evaluation, using random sampling and drawing heavily from the `mislabeled_rank` entries (see the sketch below)
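One hedged way to draw such an evaluation sample; the column names come from the output format above, while the sample sizes and the ranking direction of `mislabeled_rank` are assumptions:

```python
import pandas as pd

# Build a small human-evaluation batch from the pipeline output:
# a purely random slice plus the rows that were re-labeled with the least
# confidence in their original label.
df = pd.read_csv("data/final/data.csv", sep="\t")

random_rows = df.sample(n=100, random_state=42)            # arbitrary size
relabeled = df[df["mislabeled_rank"].notna()]
# Assumption: a lower mislabeled_rank means lower original confidence.
least_confident = relabeled.sort_values("mislabeled_rank").head(200)

eval_batch = pd.concat([random_rows, least_confident]).drop_duplicates()
eval_batch.to_csv("human_eval_sample.csv", sep="\t", index=False)
```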
Project Structure
├── LICENSE
├── README.md
├── data # <-- Directory with all types of data
│ ├── final # <-- Directory with final data
│ │ ├── class.metrics.csv # <-- Class metrics after relabeling (confusion matrix data)
│ │ └── data.csv # <-- Pipeline output (not stored in git)
│ ├── interim # <-- Directory with temporary data
│ │ ├── datafile.0.csv
│ │ └── datafile.1.csv
│ ├── prepared # <-- Directory with prepared data
│ │ └── data.all.csv
│ └── raw # <-- Directory with raw data; populated by pipeline's fetch stage
│ ├── README.md
│ ├── cc.en.300.bin # <-- fastText binary model file (Common Crawl + Wikipedia)
│ ├── crawl-300d-2M-subword.bin # <-- fastText binary model file (Common Crawl, subword)
│ ├── crawl-300d-2M-subword.vec
│ ├── employers.wikidata.csv # <-- Our initial data, 1 set of class labels
│ ├── lid.176.ftz # <-- fastText language identification model
│ └── occupations.wikidata.csv # <-- Our initial data, 1 set of class labels
├── dvc.lock # <-- DVC internal state tracking file
├── dvc.yaml # <-- DVC project configuration file
├── dvc_plots # <-- Temp directory for DVC plots; not tracked by git
│ └── README.md
├── model
│ ├── class.metrics.csv
│ ├── svm.model.pkl
│ └── train.metrics.json # <-- Metrics from the pipeline's train stage
├── mypy.ini
├── params.yaml # <-- Parameter configuration file for the pipeline
├── reports # <-- Directory with metrics output
│ ├── prepare.metrics.json
│ └── relabel.metrics.json
├── requirements-dev.txt
├── requirements.txt
├── runUnitTests.sh
└── src # <-- Directory containing the pipeline's code
├── README.md
├── fetch.py
├── prepare.py
├── relabel.py
├── train.py
└── utils.py
Setup
Create environment
conda create --name auto-label-pipeline python=3.9
conda activate auto-label-pipeline
Install requirements
pip install -r requirements.txt
If you're going to modify the source, also install the dev dependencies from `requirements-dev.txt`.
Reproduce the pipeline results locally
dvc repro
View Metrics
dvc metrics show
See also: DVC metrics
Working with Experiments
To see your local experiments:
dvc exp show
Experiments that have been turned into branches can be referenced directly in commands, e.g. to compare two experiments:

dvc exp diff [experiment branch 1 name] [experiment branch 2 name]

e.g.:

dvc exp diff svc_linear_ex svc_rbf_ex
dvc exp diff svc_poly_ex svc_rbf_ex
To create an experiment by changing a parameter:
dvc exp run --set-param train.split=0.9 --name my_split_ex
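`--set-param` rewrites the matching entry in params.yaml before running the pipeline; a stage script can then read it back, roughly as in this sketch (apart from `train.split` used above, the exact keys in this project's params.yaml may differ):

```python
import yaml

# Minimal illustration of how a stage script picks up DVC-managed parameters.
with open("params.yaml") as f:
    params = yaml.safe_load(f)

# train.split is the key touched by `--set-param train.split=0.9` above.
split = params["train"]["split"]
print(f"train split: {split}")
```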
To save and share your experiment in a branch:

dvc exp branch my_split_ex my_split_ex_branch

(Note: when promoting an experiment to a branch, DVC does not switch into that branch.)
See also: DVC Experiments
View plots
Initial Confusion matrix:
dvc plots show model/class.metrics.csv -x actual -y predicted --template confusion
Confusion matrix after relabeling:
dvc plots show data/final/class.metrics.csv -x actual -y predicted --template confusion
See also: DVC plots
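The confusion template only needs a CSV with one row per prediction and columns matching the `-x`/`-y` arguments. A hedged sketch of producing such a file (the label values here are made up):

```python
import pandas as pd

# Hypothetical labels -- in the pipeline these come from the evaluation step.
y_true = ["OCCUPATION", "OCCUPATION", "EMPLOYER", "UNK"]
y_pred = ["OCCUPATION", "EMPLOYER", "EMPLOYER", "UNK"]

pd.DataFrame({"actual": y_true, "predicted": y_pred}).to_csv(
    "class.metrics.csv", index=False
)
# `dvc plots show class.metrics.csv -x actual -y predicted --template confusion`
# counts the (actual, predicted) pairs and renders them as a confusion matrix.
```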
Conclusions
- For relabeling and cleaning, it's important to have more than two labels, and to specify an `UNK` label for: unknowns; labels spanning multiple groups; or low-confidence support.
- Standardizing the input data format allows users to flexibly use many different data sources.
- Language detection is an important part of data cleaning, but problematic (see the sketch after this list), because:
  - Modern languages sometimes "borrow" words from other languages (but not just any words!)
  - Language detection models perform inference poorly with limited data, especially on just a single word.
  - Normalization utilities such as `unidecode` aren't helpful (the wrong word in more readable letters is still the wrong word).
- Experimentation parameters often have co-dependencies that would make a simple combinatorial grid search inefficient.
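The language-detection caveat can be reproduced with the lid.176.ftz model that the fetch stage places in data/raw; a minimal sketch, assuming the `fasttext` Python package is installed:

```python
import fasttext

# lid.176.ftz is fastText's compressed language-identification model,
# downloaded into data/raw/ by the fetch stage.
model = fasttext.load_model("data/raw/lid.176.ftz")

# Single words give the model very little signal, so the top prediction is
# often low-confidence or simply wrong -- especially for loanwords.
for word in ["chauffeur", "kindergarten", "maestro"]:
    labels, probs = model.predict(word, k=3)
    print(word, list(zip(labels, probs)))
```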
Recommended readings:
- Confident Learning: Estimating Uncertainty in Dataset Labels by Curtis G. Northcutt, Lu Jiang, Isaac L. Chuang, 31 Oct 2019, arxiv
- A Simple but Tough-to-Beat Baseline for Sentence Embeddings by Sanjeev Arora, Yingyu Liang, Tengyu Ma, ICLR 2017, paper
- Support Vector Clustering by Asa Ben-Hur, David Horn, Hava T. Siegelmann, Vladimir Vapnik, November 2001 Journal of Machine Learning Research 2 (12):125-137, DOI:10.1162/15324430260185565, paper
- SVM clustering by Winters-Hilt, S., Merat, S. BMC Bioinformatics 8, S18 (2007). link, paper
Note: this repo layout borrows heavily from the Cookiecutter Data Science layout. If you're not familiar with it, please check it out.