Upgini: data search library for your machine learning pipelines
Find & deliver relevant external data & features to boost ML accuracy
Overview
Upgini is a Python library for automated data search to boost supervised ML tasks. It enriches your dataset with intelligently crafted features from a broad range of curated data sources, including open and commercial datasets. The search is conducted for any combination of public IDs contained in your tabular dataset: IP, date, etc. Only features that could improve the prediction power of your ML model are returned.
Motivation: for most ML tasks, external data & features boost accuracy significantly better than any hyperparameter tuning. But the lack of automated and time-efficient search tools for external data blocks their massive adoption in ML pipelines.
We want to radically simplify data search and delivery for ML pipelines, to make external data & features a standard approach, just like hyperparameter tuning is for machine learning nowadays.
Awesome features
- binary classification
- multiclass classification
- regression
- time series prediction
- recommender system
Quick start with kaggle example
Pre-built dev environment for quick start
Pre-built dev environments with the kaggle example notebook notebooks/kaggle_example.ipynb right inside your browser:
Jupyter via PyPI
Just install the library from PyPI and read this doc:
!pip install upgini
import upgini
Docker-way
Clone the repo: $ git clone https://github.com/upgini/upgini
or download the upgini git repo locally, then follow the steps below to build a docker container.
Build docker image
- ... from cloned git repo:
docker build -t upgini .
- ...or directly from GitHub:
DOCKER_BUILDKIT=0 docker build -t upgini [email protected]:upgini/upgini.git#main
Run docker image:
docker run -p 8888:8888 upgini
Open http://localhost:8888?token=<your_token> in your browser; the token is printed in the container console output.
Kaggle notebook
Jupyter notebook with a kaggle example: notebooks/kaggle_example.ipynb. The problem being solved is the Kaggle competition Store Item Demand Forecasting Challenge: predict future sales of different goods in different stores based on a 5-year history of sales. The evaluation metric is SMAPE.
The competition dataset was split into train (years 2013-2016) and test (year 2017) parts. FeaturesEnricher was fitted on the train part, and both parts were enriched with external features. Finally, the same ML algorithm was fitted on both the initial and the enriched datasets to compare accuracy, and the enriched ML model achieved a solid improvement of the evaluation metric.
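A minimal sketch of that comparison step, assuming the notebook's train/test splits and enriched dataframes produced by FeaturesEnricher.transform (all variable names here are illustrative, and scikit-learn's GradientBoostingRegressor stands in for the notebook's model):
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
# symmetric mean absolute percentage error - the competition metric
def smape(y_true, y_pred):
    return 100 * np.mean(2 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred)))
# fit the same algorithm on the initial and on the enriched train set
model_initial = GradientBoostingRegressor().fit(train_features, train_target)
model_enriched = GradientBoostingRegressor().fit(enriched_train_features, train_target)
# compare SMAPE on the 2017 hold-out
print("initial: ", smape(test_target.values, model_initial.predict(test_features)))
print("enriched:", smape(test_target.values, model_enriched.predict(enriched_test_features)))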
How it works?
Get access - API key
1. You'll need an API key from the User profile page https://profile.upgini.com
Pass the API key via the api_key parameter in the FeaturesEnricher class constructor, or export it as an environment variable:
... in python
import os
os.environ["UPGINI_API_KEY"] = "your_long_string_api_key_goes_here"
... in bash/zsh
export UPGINI_API_KEY="your_long_string_api_key_goes_here"
Reuse existing labeled training datasets for search
2. To simplify things, you can reuse your existing labeled training datasets "as is" to initiate the search. Under the hood, we'll search for relevant data using:
- search keys from the training dataset, to match records from potential external data sources and features
- labels from the training dataset, to estimate the relevancy of a feature or dataset for your ML task and to calculate metrics
- your existing features from the training dataset columns, to find only those datasets and features that give an accuracy gain on top of them, and to estimate the accuracy uplift (optional)
Just load the training dataset into a pandas dataframe and separate the feature columns from the label column:
import pandas as pd
# labeled training dataset - customer_churn_prediction_train.csv
train_df = pd.read_csv("customer_churn_prediction_train.csv")
train_features = train_df.drop(columns="label")
train_label = train_df["label"]
Choose at least one column as a search key
3. Search key columns will be used to match records from all potential external data sources. Define them at FeaturesEnricher class initialization:
from upgini import FeaturesEnricher, SearchKey
enricher = FeaturesEnricher(
    search_keys={"subscription_activation_date": SearchKey.DATE},
    keep_input=True
)
Search key types we support (more are coming!)
Our team works hard to introduce new search key types; currently we support:
Search Key Meaning Type | Description | Example
---|---|---
SearchKey.EMAIL | email address | [email protected]
SearchKey.HEM | sha256(lowercase(email)) | 0e2dfefcddc929933dcec9a5c7db7b172482814e63c80b8460b36a791384e955
SearchKey.IP | IP address (version 4) | 192.168.0.1
SearchKey.PHONE | phone number, E.164 standard | 443451925138
SearchKey.DATE | date | 2020-02-12 (ISO-8601 standard), 12.02.2020 (non-standard notation)
SearchKey.DATETIME | datetime | 2020-02-12 12:46:18, 12:46:18 12.02.2020, unixtimestamp
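For example, if your dataset contains several of these identifier columns, they can be passed together so more external sources can be matched (column names below are illustrative):
from upgini import FeaturesEnricher, SearchKey
enricher = FeaturesEnricher(
    search_keys={
        "user_email": SearchKey.EMAIL,
        "registration_date": SearchKey.DATE
    }
)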
Requirements for search initialization dataset
We do dataset verification and cleaning under the hood, but there are still some requirements to follow:
- Pandas dataframe representation
- Correct label column types: integers or strings for binary and multiclass labels, floats for regression
- At least one column defined as a search key
- Min size after deduplication by search key column and NAs removal: 1,000 records
- Max size after deduplication by search key column and NAs removal: 1,000,000 records
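A quick sanity check against these size limits, assuming the churn dataset from the earlier snippet with its date column as the search key:
# count records the way the requirements above are phrased:
# drop rows with a missing search key, then deduplicate by the key column
key_col = "subscription_activation_date"
cleaned = train_df.dropna(subset=[key_col]).drop_duplicates(subset=[key_col])
assert 1_000 <= len(cleaned) <= 1_000_000, f"{len(cleaned)} records after cleaning"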
Start your first data search!
4. The main abstraction you interact with is FeaturesEnricher. FeaturesEnricher is a scikit-learn compatible estimator, so you can easily add it into your existing ML pipelines. First, create an instance of the FeaturesEnricher class. Once it's created, call fit to search for relevant datasets & features, then transform to enrich your dataset with features from the search result.
Let's try it out!
import pandas as pd
from upgini import FeaturesEnricher, SearchKey
# load labeled training dataset to initiate search
train_df = pd.read_csv("customer_churn_prediction_train.csv")
train_features = train_df.drop(columns="label")
train_target = train_df["label"]
# now we're going to create a `FeaturesEnricher` instance
# if you still haven't defined the UPGINI_API_KEY env variable - not a problem, you can pass it via `api_key`
enricher = FeaturesEnricher(
    search_keys={"subscription_activation_date": SearchKey.DATE},
    keep_input=True,
    api_key="your_long_string_api_key_goes_here"
)
# everything is ready to fit! For 200k records fitting should take around 10 minutes,
# but don't worry - we'll send an email notification. Accuracy metrics of the trained model
# and uplifts will be shown automatically
enricher.fit(train_features, train_target)
That's all! Once FeaturesEnricher is fitted, any pandas dataframe with exactly the same data schema can be enriched with features from the search results. Use the transform method, and let the magic do the rest:
# load dataset for enrichment
test_df = pd.read_csv("test.csv")
test_features = test_df.drop(columns="target")
# enrich it!
enriched_test_features = enricher.transform(test_features)
enriched_test_features.head()
You can get more details about FeaturesEnricher at runtime using docstrings, for example via help(FeaturesEnricher) or help(FeaturesEnricher.fit).
Optional: find datasets and features that give an accuracy gain on top of your existing data in the ML model
If you already have a trained ML model based on internal features or other external data sources, you can specifically search for new datasets & features that give an accuracy gain only "on top" of them.
Just leave all these existing features in the labeled training dataset, and the Upgini library will automatically use them as a baseline ML model to calculate the accuracy metric uplift. It won't return any features that might not give an accuracy gain to the existing ML model's feature set.
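In practice this just means keeping your current model's feature columns in the dataframe you pass to fit. A minimal sketch (the dataframe layout is illustrative):
# "label" is the target; every other column - existing model features plus the
# search key - stays in, so the search can use them as the baseline
train_features = train_df.drop(columns="label")
enricher.fit(train_features, train_df["label"])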
Optional: check stability of ML accuracy gain from search result datasets & features
You can validate the data quality from your search results on an out-of-time dataset using the eval_set parameter. Let's do that:
# load train dataset
train_df = pd.read_csv("train.csv")
train_features = train_df.drop(columns="target")
train_target = train_df["target"]
# load out-of-time validation dataset
eval_df = pd.read_csv("validation.csv")
eval_features = eval_df.drop(columns="eval_target")
eval_target = eval_df["eval_target"]
# create FeaturesEnricher
enricher = FeaturesEnricher(
    search_keys={"registration_date": SearchKey.DATE},
    keep_input=True
)
# now we fit WITH eval_set parameter to calculate accuracy metrics on OOT dataset.
# the output will contain quality metrics for both the training data set and
# the eval set (validation OOT data set)
enricher.fit(
    train_features,
    train_target,
    eval_set=[(eval_features, eval_target)]
)
Requirements for out-of-time dataset
- Same data schema as for search initialization dataset
- Pandas dataframe representation
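A quick schema check before fitting with an eval_set, using the variable names from the snippet above:
# the OOT dataframe must match the train dataframe column-for-column
assert list(eval_features.columns) == list(train_features.columns), "schema mismatch"
assert (eval_features.dtypes == train_features.dtypes).all(), "dtype mismatch"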
Search dataset validation
We validate and clean the search initialization dataset under the hood.
Accuracy and uplift metrics calculations
We calculate all accuracy metrics and uplifts for non-linear machine learning algorithms, like gradient boosting or neural networks. If your external data consumer is a linear ML algorithm (e.g. logistic regression), you might notice different accuracy metrics after data enrichment.
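If your consumer model is linear, it can be worth re-measuring the gain with that exact model. A rough sketch with scikit-learn, assuming numeric features and the churn example variables from above:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# enrich the train set, then score the linear model on initial vs. enriched features;
# the uplift may differ from the one estimated with a non-linear model during search
enriched_train_features = enricher.transform(train_features)
for name, X in [("initial", train_features), ("enriched", enriched_train_features)]:
    auc = cross_val_score(LogisticRegression(max_iter=1000), X, train_target, scoring="roc_auc").mean()
    print(name, auc)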
Why it's a paid service? Can I use it for free?
The short answer is yes! We have two options for that.
Let us explain. This is a part-time project for our small team, but, as you might know, search is a very infrastructure-intensive service. We pay infrastructure costs for every search request generated on the platform, as we mostly use serverless components under the hood, for both storage and compute.
To cover these running costs we introduced paid plans with a certain number of search requests, which we hope will be affordable for most data scientists & developers in the community.
First option. Participate in beta testing
The service is still in beta, so registered beta testers will get 80 USD in credits for 6 months. Feel free to start with the registration form.
Second option. Share license-free data with community
If you have ANY data which you might consider royalty- and license-free (Open Data) and potentially valuable for supervised ML applications, we'll be happy to give you free individual access in exchange for sharing this data with the community.
Just upload your data sample right from Jupyter. We will check your data sharing proposal and get back to you ASAP:
import pandas as pd
from upgini import SearchKey
from upgini.ads import upload_user_ads
import os
os.environ["UPGINI_API_KEY"] = "your_long_string_api_key_goes_here"
# you can define a custom search key that might not be supported yet - just use the SearchKey.CUSTOM_KEY type
sample_df = pd.read_csv("path_to_data_sample_file")
upload_user_ads("test", sample_df, {
    "city": SearchKey.CUSTOM_KEY,
    "stats_date": SearchKey.DATE
})
Getting Help & Community
Requests and support channels, in preferred order
Please try to create bug reports that are:
- Reproducible. Include steps to reproduce the problem.
- Specific. Include as much detail as possible: which Python version, what environment, etc.
- Unique. Do not duplicate existing opened issues.
- Scoped to a Single Bug. One bug per report.
Contributing
We are a very small team, and this is a part-time project for us, so most probably we won't be able to:
- implement ALL the data delivery and integration interfaces for most common ML stacks and frameworks
- implement ALL data verification and normalization capabilities for different types of search keys (we just started with current 4)
That's where we might need some help from the community!
We'll be happy about every pull request you open and every issue you find that makes this library more awesome. Please note that it might sometimes take us a while to get back to you.
For major changes, please open an issue first to discuss what you would like to change.
Developing
Some convenient ways to start contributing are:
Useful links