This is the code used in the paper "Entity Embeddings of Categorical Variables".

Overview

If you want the original version of the code used for the Kaggle competition, please use the Kaggle branch.

To run the code, first download and unzip the train.csv and store.csv files from Kaggle and put them in this folder.
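
A quick check like the following can confirm that both files are in place before running anything. This is only a sketch: the helper name `missing_input_files` is hypothetical and not part of this repository; only the two file names come from the instructions above.

```python
from pathlib import Path

# The two Kaggle files named above; everything else here is illustrative.
REQUIRED_FILES = ("train.csv", "store.csv")


def missing_input_files(folder="."):
    """Return the required CSV files that are not present in `folder`."""
    root = Path(folder)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_input_files()
    if missing:
        print("Missing input files:", ", ".join(missing))
    else:
        print("All input files found.")
```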

If you use Anaconda, you can install the dependencies as in the following example:

conda create --name ee python=3.7 pip
conda activate ee
pip install scikit-learn xgboost tensorflow keras jupyter matplotlib

Please refer to the Keras documentation for more details on how to install Keras.

Next, run the following scripts to extract the csv files and prepare the features:

python3 extract_csv_files.py
python3 prepare_features.py

To run the models:

python3 train_test_model.py

You can analyze the embeddings with plot_embeddings.ipynb. For example, the following shows the learned embedding of German states plotted in 2D and the map of Germany side by side. Considering that the algorithm knows nothing about German geography, the remarkable resemblance between the two demonstrates the power of the algorithm for abductive reasoning. I expect entity embedding will be a very useful tool for studying the relationships among genomes, proteins, drugs, and diseases, and I would love to see its applications in biology and medicine one day.

Figure: Visualization of Entity Embedding of German States in 2D (image: EE_German_States), shown next to a Map of Germany (image: Karte-Deutschland-Fun-Facts-Deutsch).

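
Besides plotting, learned embeddings can also be inspected numerically by asking which categories end up closest to each other. The sketch below uses made-up 2D vectors and a hypothetical helper, `nearest_neighbour`; it is not taken from plot_embeddings.ipynb, and real vectors would come from the trained model's embedding layer.

```python
import math

# Toy 2-D vectors, made up for illustration only; real embeddings would be
# read from the trained model's embedding layer.
toy_embeddings = {
    "A": (0.0, 0.0),
    "B": (1.0, 0.0),
    "C": (0.0, 3.0),
    "D": (4.0, 3.0),
}


def nearest_neighbour(name, embeddings):
    """Return the other category whose vector is closest to `name`."""
    origin = embeddings[name]
    others = (key for key in embeddings if key != name)
    return min(others, key=lambda key: math.dist(origin, embeddings[key]))


for name in toy_embeddings:
    print(name, "->", nearest_neighbour(name, toy_embeddings))
```

If every category turns out to be roughly equidistant from every other one, the embedding has probably not learned much structure.
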
Comments
  • Categorical Variables with long tail

    Thanks for sharing your code. It serves as a good reference to learn from and to better understand the use cases of embeddings.

    Our dataset is also tabular and has high-cardinality columns (~6 million values), such as IP address. In addition, 85% of the IP addresses appear only once in the dataset. What are your thoughts on using this technique to convert such a categorical column into Euclidean space?

    opened by Nithanaroy 8
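
For a long-tailed column like the IP addresses described in the issue above, one common preprocessing step is to map values seen fewer than a few times to a single shared index before learning embeddings, so that all rare values share one vector. This is an assumption about a reasonable approach, not something this repository does. A minimal sketch, with hypothetical names `RARE_TOKEN`, `build_vocabulary` and `encode`:

```python
from collections import Counter

RARE_TOKEN = "<RARE>"  # hypothetical placeholder for rare or unseen values


def build_vocabulary(values, min_count=2):
    """Map each value seen at least `min_count` times to an integer index.

    Index 0 is reserved for the shared rare/unseen bucket.
    """
    counts = Counter(values)
    frequent = sorted(v for v, c in counts.items() if c >= min_count)
    vocab = {RARE_TOKEN: 0}
    for value in frequent:
        vocab[value] = len(vocab)
    return vocab


def encode(values, vocab):
    """Replace each value by its index, falling back to the rare bucket."""
    return [vocab.get(value, vocab[RARE_TOKEN]) for value in values]


# Made-up example data.
ips = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1"]
vocab = build_vocabulary(ips)
print(vocab)
print(encode(ips + ["10.0.0.9"], vocab))
```

The embedding layer then needs only `len(vocab)` rows instead of one row per distinct value.
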
  • Cannot reproduce meaningful embedding

    Dear Entron, I downloaded the Kaggle branch and trained with the test_models.py file (default options: 1 network, train_ratio = 0.97), but I cannot get meaningful embeddings like yours. I attached my state embedding plot, in which all states are at the same distance from each other. Can you tell me why this is? Thank you very much! P.S.: I used Keras 1.2.2 and TensorFlow r0.10 with GPU. I got "Result on validation data: 0.10472426564821177", and I saved the trained model with keras.save() (I could not save it with pickle.dump because of the error described in issue #9).

    opened by vanduc103 8
  • Keras version problem

    Sorry to ask such a question. I am using Keras 2.2.4 and found that 'Merge' cannot be imported from 'keras.layers.core'. I did find 'merge' in 'keras.layers', but unfortunately the following error occurred (see attached image). Thank you very much!

    opened by superguopeng 7
  • when do u decide to dense or embed?

    Hi entron,

    This is really awesome and I learnt a lot from your work. However, I do have a question, and I think this is the most appropriate place to ask.

    Looking at your script, what is the methodology for deciding whether to embed a variable or feed it to a dense layer? For example, you chose to pass variables like promo through a dense layer, etc. Is there a reason those were chosen for a dense layer?

    opened by germayneng 5
  • cannot reproduce "with EE" (with embeddings) paper results using this code

    In the case of XGBoost, I could not find any functions in this repo that use embeddings. As I understand it, entity embeddings are produced by __build_keras_model(), which seems to be used only for deep learning here, even though the paper also shows results for KNN, RF, and XGB in Tables III and IV.

    Helping others to reproduce accuracy gains from the use of embeddings in XGBoost is important because XGBoost baseline accuracy is difficult to improve using any neural network-based method, not just embeddings.

    opened by mirekphd 4
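
The issue above asks how embeddings could feed a non-neural model. One way, sketched here under assumptions and not taken from this repository, is to replace each categorical index by its learned vector and concatenate the vectors into one flat feature row. The helper `embed_rows` and the toy lookup tables are hypothetical; real tables would be the weights of the trained embedding layers.

```python
def embed_rows(rows, embeddings):
    """Replace each categorical index by its embedding vector.

    rows: list of rows, each a list of category indices (one per feature).
    embeddings: one lookup table per feature; table[i] is the vector for
        category index i.
    Returns one flat list of floats per row.
    """
    features = []
    for row in rows:
        flat = []
        for column, index in enumerate(row):
            flat.extend(embeddings[column][index])
        features.append(flat)
    return features


# Made-up tables: feature 0 has 3 categories embedded in 2 dimensions,
# feature 1 has 2 categories embedded in 1 dimension.
toy_tables = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[-1.0], [1.0]],
]
toy_rows = [[0, 1], [2, 0]]
print(embed_rows(toy_rows, toy_tables))
```

The resulting rows could then be passed as the feature matrix to a tree-based or nearest-neighbour estimator.
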
  • Lagged features

    Hi @entron I was wondering if you tried lagged variables? Did they end up just not being that useful?

    What was your experience trying to incorporate classic time series features into this dataset such as sales trends etc?

    Thanks for your insights.

    opened by hamelsmu 4
  • Can't pickle <class 'module'>: attribute lookup module on builtins failed

    Hi entron,

    I ran into a problem when calling pickle.dump on the models.

    Traceback (most recent call last):
      File "train_test_model.py", line 90, in <module>
        pickle.dump(models, f)
    _pickle.PicklingError: Can't pickle <class 'module'>: attribute lookup module on builtins failed

    Could you give me some ideas? Many thanks.

    opened by r20041101 3
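
The message in the issue above says that pickle met an object of class 'module'. One plausible cause, which is an assumption and not confirmed for this repository, is that an object being pickled keeps a module as an attribute. The toy classes below (`ModelWrapper` and `PicklableWrapper`, both hypothetical) reproduce that situation with the standard library and show one workaround: drop the module reference in `__getstate__` and restore it in `__setstate__`.

```python
import math
import pickle


class ModelWrapper:
    """Toy stand-in for a model object that keeps a module as an attribute."""

    def __init__(self):
        self.backend = math      # a module object: this cannot be pickled
        self.weights = [0.1, 0.2]


class PicklableWrapper(ModelWrapper):
    """Same object, but the module reference is dropped when pickling."""

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["backend"]
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.backend = math      # restore the module after loading


try:
    pickle.dumps(ModelWrapper())
except (pickle.PicklingError, TypeError) as error:
    # The exception type and wording depend on the Python version.
    print("pickling failed:", error)

restored = pickle.loads(pickle.dumps(PicklableWrapper()))
print(restored.weights)
```

For Keras models specifically, saving with the model's own `save` method instead of pickle is another option.
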
  • question, not issue, should I use all training data for embedding?

    Should I use all the training data to learn the embedding, then attach the embedded features to train/test, and then train XGB? Or do I need to split the training data into two sets, learn the embedding from one set, and then use the other set to train XGB?

    opened by superfan123 2
  • Categorizations choice

    Hi entron. This is more of a conceptual question. I'm reading your source and docs, and I can see that you have embedded many features individually, but you haven't, for example, combined several of them into a single embedding. By this I mean grouping features that are very similar in nature into one embedding, whose resulting vector would represent the "behaviour" of the group, maybe better than embedding each feature on its own. Did you try something like that? If yes, can you share your conclusions? Thank you.

    opened by prekratko 2
  • CAN NOT FIND "embeddings.pickle" anywhere

    Hi, I'm trying to study your paper, but something is wrong with my code. While trying to locate the error, I found that the file "embeddings.pickle" is referenced everywhere, but I cannot find how to create it. I noticed the comment "# Use plot_embeddings.ipynb to create", but I still cannot work it out. Could you give me a hand? Thanks a lot.

    opened by notingbad 2
  • Why not all features are used

    Hi, thanks for your code. I notice that the features used for training are:

    return [store_open,
            store_index,
            day_of_week,
            promo,
            year,
            month,
            day,
            store_data[store_index - 1]['State']
            ]
    

    but there are lots of other features like in store.csv:

    Assortment | CompetitionDistance | CompetitionOpenSinceMonth | CompetitionOpenSinceYear
    

    Could you share why you are not using these as features? Thanks!

    opened by claudehang 1
  • How to use the embedding on a new categorical data

    Hello,

    I have a general, rudimentary question (sorry in advance).

    I have reviewed (not fully) many parts of the code here. I'd like to test the proposed embedding on new data, but I am not sure where to begin.

    I have simple 2-column data: the first column is the patient ID (assume 1M unique patients) and the second column is the ICD-10 diagnosis code (assume 10K categories). The data contain repeated measurements, meaning that diagnoses can be repeated within a given patient and across many patients.

    I tested Multiple Correspondence Analysis with categorical data from this link, but the results are not very useful.

    Similar to the German States example in the repo, my goal is to perform (unsupervised) dimensionality reduction (such as what you'd see in a denoising autoencoder that minimizes reconstruction error).

    • Where should I start? Do I need to run one-hot encoding beforehand?
    • Which functions should I use after loading my raw data to generate such an embedding?

    Appreciate any words of wisdom you may be able to share.

    opened by isaac2lord 4
  • Keras Reshape() layer

    There is a problem:

    TypeError: __init__() missing 1 required positional argument: 'target_shape'

    and the code is:

    model_store.add(Reshape(dims=(50,)))
    

    dims??? I can't understand this parameter.

    Thank you very much~

    opened by PlayWithSanLei 1
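
The error in the issue above comes from the keyword name: in current Keras the `Reshape` layer takes `target_shape`, not `dims`. A minimal sketch, assuming the Keras bundled with TensorFlow and the functional API; the category count and variable names are illustrative and only the 50-dimensional shape comes from the issue:

```python
from tensorflow import keras

# Illustrative sizes: 1115 categories, 50-dimensional embedding vectors.
n_categories, embedding_dim = 1115, 50

store_input = keras.layers.Input(shape=(1,))
embedded = keras.layers.Embedding(n_categories, embedding_dim)(store_input)
# The Embedding output has shape (batch, 1, 50); Reshape drops the length-1 axis.
flat = keras.layers.Reshape(target_shape=(embedding_dim,))(embedded)
model_store = keras.Model(inputs=store_input, outputs=flat)

print(model_store.output_shape)
```

In the Sequential style used in the issue, the equivalent change would be `model_store.add(Reshape(target_shape=(50,)))`.
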
Owner
Cheng Guo