A library built upon PyTorch for building embeddings on discrete event sequences using self-supervision

Overview

pytorch-lifestream is a library built upon PyTorch for building embeddings on discrete event sequences using self-supervision. It can process terabyte-size volumes of raw events such as game history events, clickstream data, purchase history, or card transactions.

It supports various methods of self-supervised training, adapted for event sequences:

  • Contrastive Learning for Event Sequences (CoLES)
  • Contrastive Predictive Coding (CPC)
  • Replaced Token Detection (RTD) from ELECTRA
  • Next Sequence Prediction (NSP) from BERT
  • Sequences Order Prediction (SOP) from ALBERT

It supports several types of encoders, including Transformer and RNN. It also supports many types of self-supervised losses.

The following variants of the contrastive losses are supported:

Install from PyPI

pip install pytorch-lifestream

Install from source

# Ubuntu 20.04

sudo apt install python3.8 python3-venv
pip3 install pipenv

pipenv sync  --dev # install packages exactly as specified in Pipfile.lock
pipenv shell
pytest

Demo notebooks

  • Self-supervised training and embeddings for downstream task notebook
  • Self-supervised embeddings in CatBoost notebook
  • Self-supervised training and fine-tuning notebook
  • PySpark and Parquet for data preprocessing notebook

Experiments on public datasets

Experiments using pytorch-lifestream on several public event datasets are available in a separate repo.

Comments
  • torch.stack in def collate_feature_dict

    ptls/data_load/utils.py

    Hello!

    If the dataloader has a feature called target and the dataset length is not a multiple of the batch size, an error is raised on the last batch: "Sizes of tensors must match except in dimension 0". It is caused by the use of torch.stack when processing features whose names start with 'target'.

    opened by Ivanich-spb 11
  • Correct seq_len for feature dict

    rec = {
        'mcc': [0, 1, 2, 3],
        'target_distribution': [0.1, 0.2, 0.4, 0.1, 0.1, 0.0],
    }
    

    How can the correct seq_len be determined? The true length is 4, but the possible lengths are 4 and 6. 'target_distribution' is the wrong field to take the length from: it is not a sequence, it is an array.
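
    One possible way to resolve the ambiguity is sketched below. It is only an illustration, not the library's implementation: the helper name infer_seq_len is hypothetical, and it assumes that fields whose names start with 'target' hold per-record arrays and must be skipped when the length is inferred.

```python
def infer_seq_len(rec, target_prefix='target'):
    """Return the common length of the sequence fields in `rec`.

    Assumption: fields whose names start with `target_prefix` are
    per-record arrays, not per-event sequences, so they are ignored.
    """
    lengths = {
        name: len(value)
        for name, value in rec.items()
        if not name.startswith(target_prefix) and isinstance(value, (list, tuple))
    }
    if not lengths:
        raise ValueError('no sequence fields found')
    if len(set(lengths.values())) != 1:
        raise ValueError(f'inconsistent sequence lengths: {lengths}')
    return next(iter(lengths.values()))


rec = {
    'mcc': [0, 1, 2, 3],
    'target_distribution': [0.1, 0.2, 0.4, 0.1, 0.1, 0.0],
}
print(infer_seq_len(rec))  # 4
```

    Raising an error on inconsistent lengths, rather than picking one silently, keeps a malformed record from producing a wrong seq_len.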

    opened by ivkireev86 1
  • Save categories encodings along with model weights in demos

    The fitted preprocessor and the train-test split must be saved together with the trained model. Otherwise the category encodings may shift and the saved pretrained model will become useless.
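
    A minimal sketch of the idea, using only the standard library. The helpers fit_category_encoding and encode and the file name mcc_encoding.json are hypothetical and stand in for whatever preprocessor the demos actually use; the point is only that the mapping is written to disk and restored unchanged.

```python
import json
import os
import tempfile


def fit_category_encoding(values):
    """Map each distinct category to an integer id; 0 is reserved for unseen values."""
    return {str(v): i for i, v in enumerate(sorted(set(values), key=str), start=1)}


def encode(values, mapping):
    """Encode categories with a fitted mapping; unseen categories become 0."""
    return [mapping.get(str(v), 0) for v in values]


train_mcc = ['5411', '5812', '5411', '4111']  # made-up training categories
mapping = fit_category_encoding(train_mcc)

# Save the mapping next to the model weights and load it back at inference time.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'mcc_encoding.json')
    with open(path, 'w') as f:
        json.dump(mapping, f)
    with open(path) as f:
        restored = json.load(f)

encoded = encode(['5812', '9999'], restored)
print(encoded)  # [3, 0]
```

    Refitting the mapping on different data could assign different ids to the same categories, which is the shift the issue warns about.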

    opened by ivkireev86 1
  • Documentation index

    Prototype of the documentation main page. It has three sections:

    • a description of the library's models
    • a guide on how to use the library
    • how to write your own components

    Each section has a short description and links to detailed pages (to be written later).

    The module description proposes a structure for the library. The plan is to create these modules in the near future and move the corresponding classes from the library into them. Old modules that become empty will be removed. From then on we will follow the scheme described in this document.

    For the review, please check the proposed library structure, the module names, and the descriptive text of the document itself.

    opened by ivkireev86 1
  • Data load refactoring

    • renamed ptls.data_preprocessing to ptls.preprocessing
    • renamed cols_event_time to col_event_time, since only one column is expected
    • preprocessing class split into small parts
    • numerical feature preprocessing removed from preprocessors; numerical features are kept untouched and can be transformed later
    • dataset use cases described
    • ptls.data_load deprecation warnings added
    opened by ivkireev86 0
  • KL cyclostationarity test tools

    The test produces a histogram comparing self-sample similarity with random-sample similarity, which shows compatibility with CoLES.

    Think about tests for other frameworks.

    opened by ivkireev86 0
  • Repair pyspark tests

    def test_dt_to_timestamp():
        spark = SparkSession.builder.getOrCreate()
        df = spark.createDataFrame(data=[
            {'dt': '1970-01-01 00:00:00'},
            {'dt': '2012-01-01 12:01:16'},
            {'dt': '2021-12-30 00:00:00'}
        ])
        df = df.withColumn('ts', dt_to_timestamp('dt'))
        ts = [rec.ts for rec in df.select('ts').collect()]
        assert ts == [0, 1325419276, 1640822400]

    E   assert [-10800, 1325...6, 1640811600] == [0, 1325419276, 1640822400]
    E     At index 0 diff: -10800 != 0
    E     Use -v to get more diff

    ptls_tests/test_preprocessing/test_pyspark/test_event_time.py:16: AssertionError

    def test_datetime_to_timestamp():
        t = DatetimeToTimestamp(col_name_original='dt')
        spark = SparkSession.builder.getOrCreate()
        df = spark.createDataFrame(data=[
            {'dt': '1970-01-01 00:00:00', 'rn': 1},
            {'dt': '2012-01-01 12:01:16', 'rn': 2},
            {'dt': '2021-12-30 00:00:00', 'rn': 3}
        ])
        df = t.fit_transform(df)
        et = [rec.event_time for rec in df.select('event_time').collect()]
        assert et[0] == 0

    E   assert -10800 == 0

    ptls_tests/test_preprocessing/test_pyspark/test_event_time.py:48: AssertionError
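
    The failing values differ from the expected ones by exactly 10800 seconds (3 hours), which would be consistent with the date strings being parsed in a UTC+3 local time zone instead of UTC. That cause is an assumption, not something the report confirms. The sketch below reproduces the arithmetic with the standard library only; it does not use Spark, and to_epoch_seconds is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def to_epoch_seconds(dt_str, tz):
    """Parse 'YYYY-MM-DD HH:MM:SS' as wall-clock time in `tz`, return Unix seconds."""
    naive = datetime.strptime(dt_str, '%Y-%m-%d %H:%M:%S')
    return int(naive.replace(tzinfo=tz).timestamp())


dates = ['1970-01-01 00:00:00', '2012-01-01 12:01:16', '2021-12-30 00:00:00']
utc = timezone.utc
utc_plus_3 = timezone(timedelta(hours=3))  # assumed zone of the failing machine

as_utc = [to_epoch_seconds(d, utc) for d in dates]
as_utc_plus_3 = [to_epoch_seconds(d, utc_plus_3) for d in dates]

print(as_utc)         # [0, 1325419276, 1640822400]
print(as_utc_plus_3)  # [-10800, 1325408476, 1640811600]
```

    If this is the cause, pinning the Spark session time zone to UTC in the tests (for example through the spark.sql.session.timeZone setting) may make the expected values independent of the machine's locale.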

    opened by ikretus 0
  • docs. Development guide (for demo notebooks)

    • add current patterns
    • when model training starts, print the message "model training starts, please wait. See tensorboard to track progress"; use it with enable_progress=False
    documentation user feedback 
    opened by ivkireev86 0
Releases(v0.4.0)
  • v0.4.0(Jul 27, 2022)

    What's Changed

    • Seq encoder refactoring by @ivkireev86 in https://github.com/dllllb/pytorch-lifestream/pull/29
    • regr.task ZILNLoss, RMSE, BucketAccuracy by @ikretus in https://github.com/dllllb/pytorch-lifestream/pull/36
    • lighting modules and nn layers refactoring by @ivkireev86 in https://github.com/dllllb/pytorch-lifestream/pull/34
    • Demo colab by @ivkireev86 in https://github.com/dllllb/pytorch-lifestream/pull/40
    • Fix drop target arrays by @ivkireev86 in https://github.com/dllllb/pytorch-lifestream/pull/42
    • feature naming by @ivkireev86 in https://github.com/dllllb/pytorch-lifestream/pull/43
    • Update abs_module.py by @justalge in https://github.com/dllllb/pytorch-lifestream/pull/37
    • Extended inference demo by @ivkireev86 in https://github.com/dllllb/pytorch-lifestream/pull/45
    • fix import path by @ivkireev86 in https://github.com/dllllb/pytorch-lifestream/pull/46
    • Experiments sync by @ivkireev86 in https://github.com/dllllb/pytorch-lifestream/pull/50
    • Experiments sync by @ivkireev86 in https://github.com/dllllb/pytorch-lifestream/pull/52
    • Target dist by @ikretus in https://github.com/dllllb/pytorch-lifestream/pull/58
    • Data load refactoring by @ivkireev86 in https://github.com/dllllb/pytorch-lifestream/pull/60
    • doc update by @ivkireev86 in https://github.com/dllllb/pytorch-lifestream/pull/62
    • doc update by @ivkireev86 in https://github.com/dllllb/pytorch-lifestream/pull/63

    New Contributors

    • @ikretus made their first contribution in https://github.com/dllllb/pytorch-lifestream/pull/36

    Full Changelog: https://github.com/dllllb/pytorch-lifestream/compare/v0.3.0...v0.4.0

  • v0.3.0(Jun 12, 2022)

    More Pythonic Core API: constructor arguments instead of config objects

    What's Changed

    • cpc params by @justalge in https://github.com/dllllb/pytorch-lifestream/pull/9
    • All modules by @justalge in https://github.com/dllllb/pytorch-lifestream/pull/15
    • Mlm pretrain by @ivkireev86 in https://github.com/dllllb/pytorch-lifestream/pull/13
    • all encoders and get rid of get_loss by @justalge in https://github.com/dllllb/pytorch-lifestream/pull/19
    • init by @justalge in https://github.com/dllllb/pytorch-lifestream/pull/20
    • Documentation index by @ivkireev86 in https://github.com/dllllb/pytorch-lifestream/pull/8
    • Demos api update by @ivkireev86 in https://github.com/dllllb/pytorch-lifestream/pull/18
    • loss output correction by @ivkireev86 in https://github.com/dllllb/pytorch-lifestream/pull/22
    • Test fixes by @ivkireev86 in https://github.com/dllllb/pytorch-lifestream/pull/23
    • readme_demo_link by @ivkireev86 in https://github.com/dllllb/pytorch-lifestream/pull/25
    • init by @justalge in https://github.com/dllllb/pytorch-lifestream/pull/26
    • work without logger by @justalge in https://github.com/dllllb/pytorch-lifestream/pull/7
    • trx_encoder refactoring by @ivkireev86 in https://github.com/dllllb/pytorch-lifestream/pull/28

    Full Changelog: https://github.com/dllllb/pytorch-lifestream/compare/v0.1.2...v0.3.0

Owner
Dmitri Babaev