A Library for Field-aware Factorization Machines

Table of Contents
=================

- What is LIBFFM
- Overfitting and Early Stopping
- Installation
- Data Format
- Command Line Usage
- Examples
- OpenMP and SSE
- Building Windows Binaries
- FAQ


What is LIBFFM
==============

LIBFFM is a library for field-aware factorization machines (FFMs).

Field-aware factorization machines are an effective model for CTR prediction. They have been used to win top-3
positions in the following competitions:

    * Criteo: https://www.kaggle.com/c/criteo-display-ad-challenge

    * Avazu: https://www.kaggle.com/c/avazu-ctr-prediction

    * Outbrain: https://www.kaggle.com/c/outbrain-click-prediction

    * RecSys 2015: http://dl.acm.org/citation.cfm?id=2813511&dl=ACM&coll=DL&CFID=941880276&CFTOKEN=60022934

You can find more information about FFM in the following paper / slides:

    * http://www.csie.ntu.edu.tw/~r01922136/slides/ffm.pdf

    * http://www.csie.ntu.edu.tw/~cjlin/papers/ffm.pdf

    * https://arxiv.org/abs/1701.04099
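
For a quick intuition before reading the paper: an FFM scores an instance by summing field-aware pairwise
interactions, where each feature keeps a separate latent vector per field. The Python sketch below is only an
illustration of that idea (the data layout and names are ours, not LIBFFM's code); see the paper above for the exact
formulation and training procedure.

    # Illustrative sketch of the FFM interaction term (not LIBFFM's actual implementation).
    # Each entry is a (field, feature, value) triple, matching the data format described below.
    # w[feature][field] is assumed to be a latent vector of length k.
    import numpy as np

    def ffm_score(entries, w):
        score = 0.0
        for a in range(len(entries)):
            for b in range(a + 1, len(entries)):
                f1, j1, x1 = entries[a]
                f2, j2, x2 = entries[b]
                # Field-aware: feature j1 uses the vector it keeps for j2's field, and vice versa.
                score += np.dot(w[j1][f2], w[j2][f1]) * x1 * x2
        return score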


Overfitting and Early Stopping
==============================

FFM is prone to overfitting, and the remedy we have so far is early stopping. The run below shows how FFM behaves on
one data set:

    > ffm-train -p va.ffm -l 0.00002 tr.ffm
    iter   tr_logloss   va_logloss
       1      0.49738      0.48776
       2      0.47383      0.47995
       3      0.46366      0.47480
       4      0.45561      0.47231
       5      0.44810      0.47034
       6      0.44037      0.47003
       7      0.43239      0.46952
       8      0.42362      0.46999
       9      0.41394      0.47088
      10      0.40326      0.47228
      11      0.39156      0.47435
      12      0.37886      0.47683
      13      0.36522      0.47975
      14      0.35079      0.48321
      15      0.33578      0.48703


We see that the best validation loss is achieved at the 7th iteration. If we keep training, overfitting begins. It is
worth noting that increasing the regularization parameter does not help:

    > ffm-train -p va.ffm -l 0.0002 -t 50 -s 12 tr.ffm
    iter   tr_logloss   va_logloss
       1      0.50532      0.49905
       2      0.48782      0.49242
       3      0.48136      0.48748
                 ...
      29      0.42183      0.47014
                 ...
      48      0.37071      0.47333
      49      0.36767      0.47374
      50      0.36472      0.47404


To avoid overfitting, we recommend always providing a validation set with the option `-p.' You can use the option
`--auto-stop' to stop at the iteration that reaches the best validation loss:

    > ffm-train -p va.ffm -l 0.00002 --auto-stop tr.ffm
    iter   tr_logloss   va_logloss
       1      0.49738      0.48776
       2      0.47383      0.47995
       3      0.46366      0.47480
       4      0.45561      0.47231
       5      0.44810      0.47034
       6      0.44037      0.47003
       7      0.43239      0.46952
       8      0.42362      0.46999
    Auto-stop. Use model at 7th iteration.
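
The rule behind `--auto-stop' is simple in spirit: remember the model from the iteration with the lowest validation
loss and stop once the loss starts to rise. A minimal Python sketch of that idea (the exact criterion inside LIBFFM
may differ; train_one_iter and evaluate are hypothetical callables supplied by the caller):

    import copy

    def train_with_auto_stop(model, train_one_iter, evaluate, max_iters=15):
        # train_one_iter(model) does one pass over the training data;
        # evaluate(model) returns the log loss on the validation set.
        best_loss, best_model = float("inf"), copy.deepcopy(model)
        for it in range(1, max_iters + 1):
            train_one_iter(model)
            va_loss = evaluate(model)
            if va_loss < best_loss:
                best_loss, best_model = va_loss, copy.deepcopy(model)
            else:
                # Validation loss stopped improving: keep the best model seen so far.
                print("Auto-stop. Use model at %dth iteration." % (it - 1))
                break
        return best_model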


Installation
============

Requirement: a C++11-compatible compiler. We also use OpenMP to provide multi-threading. If OpenMP is not available
on your platform, please refer to the section `OpenMP and SSE.'

- Unix-like systems:
  Type `make' in the command line.

- Windows:
  See `Building Windows Binaries' to compile.



Data Format
===========

The data format of LIBFFM is:

<label> <field1>:<feature1>:<value1> <field2>:<feature2>:<value2> ...
.
.
.

`field' and `feature' should be non-negative integers. See an example `bigdata.tr.txt.'

It is important to understand the difference between `field' and `feature'. For example, if we have a raw data like this:

Click  Advertiser  Publisher
=====  ==========  =========
    0        Nike        CNN
    1        ESPN        BBC

Here, we have 

    * 2 fields: Advertiser and Publisher

    * 4 features: Advertiser-Nike, Advertiser-ESPN, Publisher-CNN, Publisher-BBC

Usually you will need to build two dictionaries, one for fields and one for features, like this:
    
    DictField[Advertiser] -> 0
    DictField[Publisher]  -> 1
    
    DictFeature[Advertiser-Nike] -> 0
    DictFeature[Publisher-CNN]   -> 1
    DictFeature[Advertiser-ESPN] -> 2
    DictFeature[Publisher-BBC]   -> 3

Then, you can generate FFM format data:

    0 0:0:1 1:1:1
    1 0:2:1 1:3:1

Note that because these features are categorical, the values here are all ones.
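
As a concrete illustration, the short Python script below builds the two dictionaries and emits the two FFM lines
shown above (the column names and toy rows are just this example; any consistent integer mapping works):

    # Toy converter from a (Click, Advertiser, Publisher) table to LIBFFM format.
    rows = [
        (0, {"Advertiser": "Nike", "Publisher": "CNN"}),
        (1, {"Advertiser": "ESPN", "Publisher": "BBC"}),
    ]

    dict_field, dict_feature = {}, {}

    def index(d, key):
        # Assign the next unused integer to a key the first time it is seen.
        return d.setdefault(key, len(d))

    for label, row in rows:
        entries = []
        for field_name, value in row.items():
            field = index(dict_field, field_name)
            feature = index(dict_feature, field_name + "-" + value)
            entries.append("%d:%d:1" % (field, feature))  # categorical -> value 1
        print(label, " ".join(entries))

Running it prints exactly the two lines above.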


Command Line Usage
==================

-   `ffm-train'

    usage: ffm-train [options] training_set_file [model_file]

    options:
    -l <lambda>: set regularization parameter (default 0.00002)
    -k <factor>: set number of latent factors (default 4)
    -t <iteration>: set number of iterations (default 15)
    -r <eta>: set learning rate (default 0.2)
    -s <nr_threads>: set number of threads (default 1)
    -p <path>: set path to the validation set
    --quiet: quiet mode (no output)
    --no-norm: disable instance-wise normalization
    --auto-stop: stop at the iteration that achieves the best validation loss (must be used with -p)

    By default we do instance-wise normalization. That is, we normalize the 2-norm of each instance to 1 (see the
    sketch at the end of this section). You can use `--no-norm' to disable this behavior.
    
    A binary file `training_set_file.bin' will be generated to store the data in binary format.

    Because FFM usually needs early stopping for better test performance, we provide the option `--auto-stop' to stop
    at the iteration that achieves the best validation loss. Note that you need to provide a validation set with `-p'
    when you use this option.


-   `ffm-predict'

    usage: ffm-predict test_file model_file output_file
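
For clarity, the instance-wise normalization mentioned under `ffm-train' can be pictured as the small Python sketch
below: rescale each instance's values so that its 2-norm is 1. This is only an illustration (LIBFFM applies an
equivalent scaling internally), and `--no-norm' turns the behavior off:

    import math

    def normalize_instance(entries):
        # entries: list of (field, feature, value) triples for one instance.
        norm = math.sqrt(sum(v * v for _, _, v in entries))
        if norm == 0:
            return entries
        return [(f, j, v / norm) for f, j, v in entries]

For a purely categorical instance with two features of value 1, this divides each value by sqrt(2).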



Examples
========

Download a toy data set from:

    zip: https://drive.google.com/open?id=1HZX7zSQJy26hY4_PxSlOWz4x7O-tbQjt

    tar.gz: https://drive.google.com/open?id=12-EczjiYGyJRQLH5ARy1MXRFbCvkgfPx

This dataset is a 1% subsample of the data from Criteo's challenge.

> tar -xzf libffm_toy.tar.gz

or 

> unzip libffm_toy.zip


> ./ffm-train -p libffm_toy/criteo.va.r100.gbdt0.ffm libffm_toy/criteo.tr.r100.gbdt0.ffm model

train a model using the default parameters


> ./ffm-predict libffm_toy/criteo.va.r100.gbdt0.ffm model output

do prediction


> ./ffm-train -l 0.0001 -k 15 -t 30 -r 0.05 -s 4 --auto-stop -p libffm_toy/criteo.va.r100.gbdt0.ffm libffm_toy/criteo.tr.r100.gbdt0.ffm model

train a model using the following parameters:

    regularization cost = 0.0001
    latent factors = 15
    iterations = 30
    learning rate = 0.05
    threads = 4
    let it auto-stop


OpenMP and SSE
==============

We use OpenMP for parallelization. If OpenMP is not available on your
platform, please comment out the following lines in the Makefile.

    DFLAG += -DUSEOMP
    CXXFLAGS += -fopenmp

Note: Please run `make clean all' if these flags are changed.

We use SSE instructions to perform fast computation. If you do not want to use them, comment out the following line:

    DFLAG += -DUSESSE

Then, run `make clean all'



Building Windows Binaries
=========================

The Windows part is maintained by a different maintainer, so it may not always support the latest version.

The latest version it supports is: v1.21

To build them via command-line tools of Visual C++, use the following steps:

1. Open a DOS command box (or Developer Command Prompt for Visual Studio) and go to the LIBFFM directory. If the
environment variables of VC++ have not been set, type

"C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\amd64\vcvars64.bat"

You may have to modify the above command according to which version of VC++ you have
and where it is installed.

2. Type

nmake -f Makefile.win clean all


FAQ
===

Q: Why do I get the same model size when k = 1 and k = 4?

A: This is because we use SSE instructions. In order to use SSE, the memory needs to be aligned. So even if you assign
   k = 1, we still fill in dummy zeros from k = 2 to 4.
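
   In other words, the padding can be thought of as rounding k up to the SSE width of 4 floats. The Python snippet
   below is only an illustration of that rounding (the actual constant and formula live in ffm.cpp and may differ):

       # Hypothetical illustration of SSE-friendly padding: round k up to a multiple of 4 floats.
       def k_aligned(k, align=4):
           return ((k + align - 1) // align) * align

       assert k_aligned(1) == 4 and k_aligned(4) == 4 and k_aligned(5) == 8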


Q: Why is the logloss slightly different on the same data when I run the program two or more times with
   multi-threading?

A: When there is more than one thread, the program becomes non-deterministic. To make it deterministic, use only one
   thread.


Contributors
============

Yuchin Juan, Wei-Sheng Chin, and Yong Zhuang

For questions, comments, feature requests, or bug reports, please send your email to:

    Yuchin Juan ([email protected])

For Windows related questions, please send your email to:

    Wei-Sheng Chin ([email protected])

Comments
========
  • Segmentation fault

    Hello,

    Thank you for your excellent method, software and description.

    I faced a problem trying to employ libffm in my ML task. I am getting a segmentation fault when using it with the
    cross-validation option. Here are my setup (Ubuntu 13.10) and data:

    ~/libffm$ ./ffm-train -k 5 -t 30 -r 0.03 -v 2 data.txt
    fold   logloss
       0    0.1080
    Segmentation fault (core dumped)

    The data.txt can be downloaded here https://drive.google.com/open?id=0B9HyQ7ZccW4-VFE0VWtxUHF2R3c

    The problem arises only when working with big data files like that one. If you cut it down to 100K lines (it is around 250K lines), everything works fine.

    Regards, Sergey

    opened by skirpichenko 8
  • Train and val data sets both have labels but there is no label in the test data set. How to fill up <label> in the FFM data format?

    Thanks for your amazing libffm.

    When using ffm_predict, I have a problem about how to fill up <label> for the test set.

    Thanks again.

    opened by altmanWang 2
  • “-nan” value appeared during training

    When I was training the model, the first few iterations worked fine but subsequent iterations returned "-nan" for the log losses of training and validating data sets.

    Any ideas what went wrong?


    Sample of the data used for training:

    1 0:400492:1 1:977206:1 2:861366:1 3:223345:1 4:4:0.0 5:5:9567.0 6:6:31835.0 7:7:0.300471105528 8:8:0.0 9:9:0.0 10:35822:1 11:486386:1 12:528723:1 13:662860:1 14:990282:1 15:406964:1 16:698517:1 17:585048:1 18:18:0.38219606197 19:19:0.125217833586 20:20:0.438929013305 21:21:0.216453092359 22:923220:1 23:63477:1 24:216531:1 25:461117:1

    0 0:400492:1 1:203267:1 2:861366:1 3:223345:1 4:4:0.0 5:5:1642.0 6:6:9441.0 7:7:0.173830192674 8:8:0.0 9:9:0.0644 10:709579:1 11:486386:1 12:528723:1 13:662860:1 14:778015:1 15:581435:1 16:698517:1 17:181797:1 18:18:0.581693006318 19:19:0.097000178732 20:20:0.367630745198 21:21:0.182764132116 22:923220:1 23:63477:1 24:216531:1 25:461117:1

    opened by lxjhk 2
  • k_aligned & memory requirements

    1. It would be useful to mention in the README that memory allocation depends on k_aligned, not just k. So changing k from 4 to 5 actually doubles memory requirements.

    2. Is there any particular reason why you align k to a power of 2?

    opened by mpekalski 2
  • ffm-train not found

    Hi, I am trying to use libffm on Ubuntu 16.04. I have C++11 and OpenMP installed via apt-get, downloaded libffm,
    and ran make. I am in the libffm dir, ran the following, and got:

    josh:~/libffm-master$ ffm-train bigdata.tr.txt model
    ffm-train: command not found
    

    When I check the dir you can see it is there

    josh@josh-HP-ZBook-17-G2:~/libffm-master$ dir
    bigdata.te.txt  ffm.cpp  ffm-predict      ffm-train.cpp  README
    bigdata.tr.txt  ffm.h    ffm-predict.cpp  Makefile
    COPYRIGHT   ffm.o    ffm-train    Makefile.win
    

    Any help would be great. Thanks.

    opened by JoshuaC3 2
  • Refactor build scripts

    Changes

    • [x] Add CMakeLists.txt for CLion users.
    • [x] Update Makefile
    • [x] Add description to build macOS binaries.
    • [x] Update .gitignore

    How to build on macOS

    Apple clang (use libomp)

    $ brew install libomp
    $ make OMP_CXXFLAGS="-Xpreprocessor -fopenmp -I$(brew --prefix libomp)/include" OMP_LDFLAGS="-L$(brew --prefix libomp)/lib -lomp"
    

    or cmake

    $ brew install libomp
    $ mkdir build
    $ cd build
    $ cmake \
      -DOpenMP_CXX_FLAGS="-Xpreprocessor -fopenmp -I$(brew --prefix libomp)/include" \
      -DOpenMP_CXX_LIB_NAMES="omp" \
      -DOpenMP_omp_LIBRARY=$(brew --prefix libomp)/lib/libomp.dylib \
      ..
    $ make
    

    See https://cmake.org/cmake/help/latest/module/FindOpenMP.html

    Using gcc (installed by homebrew)

    $ brew install gcc
    $ make CXX="g++-8"
    

    or cmake

    $ brew install gcc
    $ export CXX=g++-8
    $ mkdir build && cd build
    $ cmake ..
    $ make
    

    Disable OpenMP

    $ make USEOMP=OFF
    

    or cmake

    $ mkdir build && cd build
    $ cmake -DUSE_OPENMP=OFF ..
    $ make
    
    opened by c-bata 1
  • viewing the model

    I've used this package a few months ago, and I remember I was able to do `$ head model' and see the model weights. It seems that the model is now encoded somehow (binarized?). Am I correct? Is there a way to see the model as before?

    opened by shgidi 1
  • Does parallel operation of train function in ffm.cpp ensure thread safety?

    Regarding train in ffm.cpp lines 228-375, I have a question on thread safety.

    Below are lines 288-312:

        #if defined USEOMP
        #pragma omp parallel for schedule(static) reduction(+: tr_loss)
        #endif
        for(ffm_int ii = 0; ii < (ffm_int)order.size(); ii++)
        {
            ffm_int i = order[ii];

            ffm_float y = tr->Y[i];

            ffm_node *begin = &tr->X[tr->P[i]];
            ffm_node *end = &tr->X[tr->P[i+1]];

            ffm_float r = R_tr[i];

            ffm_float t = wTx(begin, end, r, *model);

            ffm_float expnyt = exp(-y*t);

            tr_loss += log(1+expnyt);

            ffm_float kappa = -y*expnyt/(1+expnyt);

            wTx(begin, end, r, *model, kappa, param.eta, param.lambda, true);
        }
    

    I'm new to OpenMP parallel operations. I'm curious whether this ensures thread safety with respect to the wTx call
    at the very bottom, `wTx(begin, end, r, *model, kappa, param.eta, param.lambda, true);'. Since wTx with
    do_update = true updates the weights, it seems it could interfere with other threads updating the weights.
    Waiting for a reply.

    opened by heekyungyoon 1
  • fix the numerical problem in the log loss calculation

    When some predictions are very near 0 or 1, they may produce log(0) = -inf. I use epsilon = 1e-15 to limit the range of the predictions (the same as sklearn and all the competitions on Kaggle). The value should be configurable with a command line argument in the future. I also got -nan before using this (like in #11), but I'm not very sure why -nan is produced.

    (BTW, some redundant spaces are auto removed by my editor.)

    opened by ianlini 1
  • Unknown features

    Unknown features (like a new app_id or device_id that was not in the training data) lead to random probabilities (too small or too high). Could you suggest a workaround for using LIBFFM in that case?

    opened by ralovets 1
  • libffm-linear prediction

    Hello,

    I'm trying to use libffm-linear library. Here are my outputs:

    libffm-linear> windows\ffm-train -s 2 -l 0 -k 10 -t 50 -r 0.01 --auto-stop -p test_data.txt train_data.txt model
    iter   tr_logloss   va_logloss
       1      0.25510      0.25017
       2      0.25129      0.24927
       3      0.25070      0.24882
       4      0.25041      0.24843
       5      0.25020      0.24821
       6      0.25005      0.24808
       7      0.24990      0.24801
       8      0.24977      0.24800
       9      0.24968      0.24820
    Auto-stop. Use model at 8th iteration.

    libffm-linear> windows\ffm-predict test_data.txt model output_file
    logloss = 0.34800

    Why does the prediction logloss differ from the validation logloss on the same file?

    opened by gediminaszylius 1
  • How to use tags as features with ffm?

    How can I use tags associated with an item as a field in FFM? In FFM, only one feature for a given field can be turned on. But for tags, we have several features set to "1" for that given field. So, how can tags be used as a field in FFM?

    opened by sumitsidana 0
  • almost no comments in codes

    In the implementation, there are almost no comments, which makes the code hard to read and learn from. C++ code is already harder to read than Python, and the lack of comments makes it much harder for learners. All in all, the implementation is unfriendly. Please add the necessary comments; at the very least, the members of the structs should be commented. Thank you on behalf of everyone.

    opened by lmxhappy 0
  • Java wrapper

    Hello!

    I'm about to finish a generalised wrapper for the "predict" and "ffm_load_model" functions in Java. It would be great if you could review my code and then add it to your library if you deem it fit.

    Thank You

    opened by RochanMehrotra 0
  • make error

    g++ -Wall -O3 -std=c++0x -march=native -fopenmp -DUSESSE -DUSEOMP -c -o ffm.o ffm.cpp
    /tmp/cc2xJsit.s: Assembler messages:
    /tmp/cc2xJsit.s:3277: Error: no such instruction: `vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
    /tmp/cc2xJsit.s:3286: Error: suffix or operands invalid for `vpaddd'
    /tmp/cc2xJsit.s:3598: Error: no such instruction: `vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
    /tmp/cc2xJsit.s:3609: Error: suffix or operands invalid for `vpaddd'
    /tmp/cc2xJsit.s:3949: Error: no such instruction: `vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
    /tmp/cc2xJsit.s:3955: Error: suffix or operands invalid for `vpaddd'
    /tmp/cc2xJsit.s:4273: Error: no such instruction: `vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
    /tmp/cc2xJsit.s:4284: Error: suffix or operands invalid for `vpaddd'

    opened by einvince 0

Releases (v123)
===============

  • v120 (May 28, 2017)

    • Binary model

      In the old version the model was stored in a text file, which was very slow to save and load. To make it faster, we decided to use a binary format.

    • Removed C API support

      In the old version, in order to support a pure C API, the code inside LIBFFM was written in a mixed C++ / C style. This was buggy and ugly, so we decided to stop providing the C API in this version. If you need it, let us know and we will consider writing a wrapper.

    • Remove cross-validation

      FFM has so far been shown to be useful for large-scale categorical data. Because the datasets are usually large, cross-validation takes a very long time. Indeed, we ourselves have never used cross-validation (including when we participated in the Criteo and Avazu contests). We think this function is overkill, so we decided to remove it.

    • Remove in-memory training

      We find that on-disk training has very similar performance to in-memory training while consuming much less memory, so we decided to remove in-memory training and keep only the on-disk version.

    • Support randomization in on-disk mode

      In the previous version, the selection of data points was not randomized in on-disk mode.

    • Binary data file reuse

      Converting a text file to a binary file is slow. In this version you only need to convert once, and we will automatically reuse the binary file afterwards.

    • Add timer

      Now we output the training time.


Related Projects
================
NVIDIA Merlin is an open source library designed to accelerate recommender systems on NVIDIA’s GPUs.

NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.

null 420 Jan 4, 2023
Plex-recommender - Get movie recommendations based on your current PleX library

plex-recommender Description: Get movie/tv recommendations based on your current

null 5 Jul 19, 2022
High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

What is xLearn? xLearn is a high performance, easy-to-use, and scalable machine learning package that contains linear model (LR), factorization machin

Chao Ma 2.8k Feb 12, 2021
High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

What is xLearn? xLearn is a high performance, easy-to-use, and scalable machine learning package that contains linear model (LR), factorization machin

Chao Ma 3k Jan 8, 2023
A Library for Field-aware Factorization Machines

Table of Contents ================= - What is LIBFFM - Overfitting and Early Stopping - Installation - Data Format - Command Line Usage - Examples -

null 1.6k Dec 5, 2022
fastFM: A Library for Factorization Machines

Citing fastFM The library fastFM is an academic project. The time and resources spent developing fastFM are therefore justified by the number of citat

null 1k Dec 24, 2022
fastFM: A Library for Factorization Machines

Citing fastFM The library fastFM is an academic project. The time and resources spent developing fastFM are therefore justified by the number of citat

null 1k Dec 24, 2022
Factorization machines in python

Factorization Machines in Python This is a python implementation of Factorization Machines [1]. This uses stochastic gradient descent with adaptive re

Corey Lynch 892 Jan 3, 2023
MAC address Model Field & Form Field for Django apps

django-macaddress MAC Address model and form fields for Django We use netaddr to parse and validate the MAC address. The tests aren't complete yet. Pa

null 49 Sep 4, 2022
Sparse Beta-Divergence Tensor Factorization Library

NTFLib Sparse Beta-Divergence Tensor Factorization Library Based off of this beta-NTF project this library is specially-built to handle tensors where

Stitch Fix Technology 46 Jan 8, 2022
A Python library for simulating finite automata, pushdown automata, and Turing machines

Automata Copyright 2016-2021 Caleb Evans Released under the MIT license Automata is a Python 3 library which implements the structures and algorithms

Caleb Evans 219 Dec 12, 2022
TensorFlow implementation of an arbitrary order Factorization Machine

This is a TensorFlow implementation of an arbitrary order (>=2) Factorization Machine based on paper Factorization Machines with libFM. It supports: d

Mikhail Trofimov 785 Dec 21, 2022
Neural Factorization of Shape and Reflectance Under An Unknown Illumination

NeRFactor [Paper] [Video] [Project] This is the authors' code release for: NeRFactor: Neural Factorization of Shape and Reflectance Under an Unknown I

Google 283 Jan 4, 2023
TuckER: Tensor Factorization for Knowledge Graph Completion

TuckER: Tensor Factorization for Knowledge Graph Completion This codebase contains PyTorch implementation of the paper: TuckER: Tensor Factorization f

Ivana Balazevic 296 Dec 6, 2022
A PyTorch implementation of a Factorization Machine module in cython.

fmpytorch A library for factorization machines in pytorch. A factorization machine is like a linear model, except multiplicative interaction terms bet

Jack Hessel 167 Jul 6, 2022
Transform-Invariant Non-Negative Matrix Factorization

Transform-Invariant Non-Negative Matrix Factorization A comprehensive Python package for Non-Negative Matrix Factorization (NMF) with a focus on learn

EMD Group 6 Jul 1, 2022
This is REST-API for Indonesian Text Summarization using Non-Negative Matrix Factorization for the algorithm to summarize documents and FastAPI for the framework.

Indonesian Text Summarization Using FastAPI This is REST-API for Indonesian Text Summarization using Non-Negative Matrix Factorization for the algorit

Viqi Nurhaqiqi 2 Nov 3, 2022
Implementation of SSMF: Shifting Seasonal Matrix Factorization

SSMF Implementation of SSMF: Shifting Seasonal Matrix Factorization, Koki Kawabata, Siddharth Bhatia, Rui Liu, Mohit Wadhwa, Bryan Hooi. NeurIPS, 2021

Koki Kawabata 9 Jun 10, 2022
PyTorch framework, for reproducing experiments from the paper Implicit Regularization in Hierarchical Tensor Factorization and Deep Convolutional Neural Networks

Implicit Regularization in Hierarchical Tensor Factorization and Deep Convolutional Neural Networks. Code, based on the PyTorch framework, for reprodu

Asaf 3 Dec 27, 2022
Restricted Boltzmann Machines in Python.

How to Use First, initialize an RBM with the desired number of visible and hidden units. rbm = RBM(num_visible = 6, num_hidden = 2) Next, train the m

Edwin Chen 928 Dec 30, 2022