H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

Overview

H2O


H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Flow notebook/web interface, and works seamlessly with big data technologies like Hadoop and Spark. H2O provides implementations of many popular algorithms such as Generalized Linear Models (GLM), Gradient Boosting Machines (including XGBoost), Random Forests, Deep Neural Networks, Stacked Ensembles, Naive Bayes, Generalized Additive Models (GAM), Cox Proportional Hazards, K-Means, PCA, Word2Vec, as well as a fully automatic machine learning algorithm (H2O AutoML).

H2O is extensible so that developers can add data transformations and custom algorithms of their choice and access them through all of those clients. H2O models can be downloaded and loaded into H2O memory for scoring, or exported into POJO or MOJO format for extremely fast scoring in production. More information can be found in the H2O User Guide.

H2O-3 (this repository) is the third incarnation of H2O, and the successor to H2O-2.

Table of Contents

1. Downloading H2O-3

While most of this README is written for developers who do their own builds, most H2O users just download and use a pre-built version. If you are a Python or R user, the easiest way to install H2O is via PyPI or Anaconda (for Python) or CRAN (for R):

Python

pip install h2o

R

install.packages("h2o")
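After installing, a quick way to verify the client works is to start a local H2O instance from the command line. The snippet below is a minimal sketch only; it assumes python and R are on your PATH and that the default port 54321 is free:

# Python: start (and connect to) a local H2O instance
python -c "import h2o; h2o.init()"

# R: the same sanity check from the R client
R -e 'library(h2o); h2o.init()'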

For the latest stable, nightly, Hadoop (or Spark / Sparkling Water) releases, or the stand-alone H2O jar, please visit: https://h2o.ai/download

More info on downloading & installing H2O is available in the H2O User Guide.

2. Open Source Resources

Most people interact with three or four primary open source resources: GitHub (which you've already found), JIRA (for bug reports and issue tracking), Stack Overflow for H2O code/software-specific questions, and h2ostream (a Google Group / email discussion forum) for questions not suitable for Stack Overflow. There is also a Gitter H2O developer chat group, however for archival purposes & to maximize accessibility, we'd prefer that standard H2O Q&A be conducted on Stack Overflow.

2.1 Issue Tracking and Feature Requests

(Note: There is only one issue tracking system for the project. GitHub issues are not enabled; you must use JIRA.)

You can browse and create new issues in our open source JIRA: http://jira.h2o.ai

  • You can browse and search for issues without logging in to JIRA:
    1. Click the Issues menu
    2. Click Search for issues
  • To create an issue (either a bug or a feature request), please create yourself an account first:
    1. Click the Log In button on the top right of the screen
    2. Click Create an account near the bottom of the login box
    3. Once you have created an account and logged in, use the Create button on the menu to create an issue
    4. Create H2O-3 issues in the PUBDEV project. (Note: Sparkling Water questions should be filed under the SW project.)
  • You can also vote for feature requests and/or other issues. Voting can help H2O prioritize the features that are included in each release.
    1. Go to the H2O JIRA page.
    2. Click Log In to either log in or create an account if you do not already have one.
    3. Search for the feature that you want to prioritize, or create a new feature.
    4. Click on the Vote for this issue link. This is located on the right side of the issue under the People section.

2.2 List of H2O Resources

3. Using H2O-3 Artifacts

Every nightly build publishes R, Python, Java, and Scala artifacts to a build-specific repository. In particular, you can find Java artifacts in the maven/repo directory.

Here is an example snippet of a gradle build file using h2o-3 as a dependency. Replace x, y, z, and nnnn with valid numbers.

// h2o-3 dependency information
def h2oBranch = 'master'
def h2oBuildNumber = 'nnnn'
def h2oProjectVersion = "x.y.z.${h2oBuildNumber}"

repositories {
  // h2o-3 dependencies
  maven {
    url "https://s3.amazonaws.com/h2o-release/h2o-3/${h2oBranch}/${h2oBuildNumber}/maven/repo/"
  }
}

dependencies {
  compile "ai.h2o:h2o-core:${h2oProjectVersion}"
  compile "ai.h2o:h2o-algos:${h2oProjectVersion}"
  compile "ai.h2o:h2o-web:${h2oProjectVersion}"
  compile "ai.h2o:h2o-app:${h2oProjectVersion}"
}

Refer to the latest H2O-3 bleeding edge nightly build page for information about installing nightly build artifacts.

Refer to the h2o-droplets GitHub repository for a working example of how to use Java artifacts with gradle.

Note: Stable H2O-3 artifacts are periodically published to Maven Central (click here to search) but may substantially lag behind H2O-3 Bleeding Edge nightly builds.

4. Building H2O-3

Getting started with H2O development requires JDK 1.7, Node.js, Gradle, Python and R. We use the Gradle wrapper (called gradlew) to ensure up-to-date local versions of Gradle and other dependencies are installed in your development directory.

4.1. Before building

Building H2O requires a properly set up R environment with the required packages and a Python environment with the following packages:

grip
colorama
future
tabulate
requests
wheel

To install these packages, you can use pip or conda. If you have trouble installing these packages on Windows, please follow the Setup on Windows section of this guide.

(Note: It is recommended to use a virtual environment, such as VirtualEnv, to install all packages.)
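For example, a typical setup using pip inside a virtual environment might look like the following sketch (it assumes a recent Python 3; the environment name h2o-build-env is purely illustrative):

# create and activate an isolated Python environment (name is illustrative)
python -m venv h2o-build-env
source h2o-build-env/bin/activate

# install the build-time Python dependencies listed above
pip install grip colorama future tabulate requests wheel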

4.2. Building from the command line (Quick Start)

To build H2O from the repository, perform the following steps.

Recipe 1: Clone fresh, build, skip tests, and run H2O

# Build H2O
git clone https://github.com/h2oai/h2o-3.git
cd h2o-3
./gradlew build -x test

You may encounter problems: e.g. npm missing. Install it:
brew install npm

# Start H2O
java -jar build/h2o.jar

# Point browser to http://localhost:54321

Recipe 2: Clone fresh, build, and run tests (requires a working install of R)

git clone https://github.com/h2oai/h2o-3.git
cd h2o-3
./gradlew syncSmalldata
./gradlew syncRPackages
./gradlew build

Notes:

  • Running tests starts five test JVMs that form an H2O cluster and requires at least 8GB of RAM (preferably 16GB of RAM).
  • Running ./gradlew syncRPackages is supported on Windows, OS X, and Linux, and is strongly recommended but not required. ./gradlew syncRPackages ensures a complete and consistent environment with pre-approved versions of the packages required for tests and builds. The packages can be installed manually, but we recommend setting an ENV variable and using ./gradlew syncRPackages. To set the ENV variable, use the following format (where ${WORKSPACE} can be any path):
mkdir -p ${WORKSPACE}/Rlibrary
export R_LIBS_USER=${WORKSPACE}/Rlibrary

Recipe 3: Pull, clean, build, and run tests

git pull
./gradlew syncSmalldata
./gradlew syncRPackages
./gradlew clean
./gradlew build

Notes

  • We recommend using ./gradlew clean after each git pull.

  • Skip tests by adding -x test at the end of the gradle build command line. Tests typically run for 7-10 minutes on a MacBook Pro laptop with 4 CPUs (8 hyperthreads) and 16 GB of RAM.

  • Syncing smalldata is not required after each pull, but if tests fail due to missing data files, then try ./gradlew syncSmalldata as the first troubleshooting step. Syncing smalldata downloads data files from AWS S3 to the smalldata directory in your workspace. The sync is incremental. Do not check in these files. The smalldata directory is in .gitignore. If you do not run any tests, you do not need the smalldata directory.

  • Running ./gradlew syncRPackages is supported on Windows, OS X, and Linux, and is strongly recommended but not required. ./gradlew syncRPackages ensures a complete and consistent environment with pre-approved versions of the packages required for tests and builds. The packages can be installed manually, but we recommend setting an ENV variable and using ./gradlew syncRPackages. To set the ENV variable, use the following format (where ${WORKSPACE} can be any path):

    mkdir -p ${WORKSPACE}/Rlibrary
    export R_LIBS_USER=${WORKSPACE}/Rlibrary
    

Recipe 4: Just building the docs

./gradlew clean && ./gradlew build -x test && (export DO_FAST=1; ./gradlew dist)
open target/docs-website/h2o-docs/index.html

4.3. Setup on Windows

Step 1: Download and install WinPython.

From the command line, validate that python points to the newly installed package by running which python (or sudo which python). Update the PATH environment variable with the WinPython path.

Step 2: Install required Python packages:
pip install grip 'colorama>=0.3.8' future tabulate wheel
Step 3: Install JDK

Install Java 1.7 and add the appropriate directory (e.g. C:\Program Files\Java\jdk1.7.0_65\bin, which contains java.exe) to PATH in Environment Variables. To make sure the command prompt is detecting the correct Java version, run:

javac -version

The CLASSPATH variable also needs to be set to the lib subfolder of the JDK:

CLASSPATH=/<path>/<to>/<jdk>/lib
Step 4. Install Node.js

Install Node.js and add the installation directory C:\Program Files\nodejs (which must include node.exe and npm.cmd) to PATH if it is not already there.

Step 5. Install R, the required packages, and Rtools:

Install R and add the bin directory to your PATH if not already included.

Install the following R packages: RCurl, jsonlite, statmod, devtools, roxygen2, and testthat.

To install these packages from within an R session:

pkgs <- c("RCurl", "jsonlite", "statmod", "devtools", "roxygen2", "testthat")
for (pkg in pkgs) {
  if (! (pkg %in% rownames(installed.packages()))) install.packages(pkg)
}

Note that libcurl is required for installation of the RCurl R package.

Note that these packages do not cover running tests; they are needed for building H2O only.

Finally, install Rtools, which is a collection of command line tools to facilitate R development on Windows.

NOTE: During Rtools installation, do not install Cygwin.dll.

Step 6. Install Cygwin

NOTE: During installation of Cygwin, deselect the Python packages to avoid a conflict with the Python.org package.

Step 6b. Validate Cygwin

If Cygwin is already installed, remove the Python packages or ensure that Native Python is before Cygwin in the PATH variable.

Step 7. Update or validate the Windows PATH variable to include R, Java JDK, Cygwin.
Step 8. Git Clone h2o-3

If you don't already have a Git client, please install one; the default one can be found here: http://git-scm.com/downloads. Make sure that command prompt support is enabled before the installation.

Download and update the h2o-3 source code:

git clone https://github.com/h2oai/h2o-3
Step 9. Run the top-level gradle build:
cd h2o-3
./gradlew.bat build

If you encounter errors, run again with --stacktrace for more information about missing dependencies.

4.4. Setup on OS X

If you don't have Homebrew, we recommend installing it. It makes package management for OS X easy.

Step 1. Install JDK

Install Java 1.7. To make sure the command prompt is detecting the correct Java version, run:

javac -version
Step 2. Install Node.js:

Using Homebrew:

brew install node

Otherwise, install from the NodeJS website.

Step 3. Install R and the required packages:

Install R and add the bin directory to your PATH if not already included.

Install the following R packages: RCurl, jsonlite, statmod, devtools, roxygen2, and testthat.

To install these packages from within an R session:

pkgs <- c("RCurl", "jsonlite", "statmod", "devtools", "roxygen2", "testthat")
for (pkg in pkgs) {
  if (! (pkg %in% rownames(installed.packages()))) install.packages(pkg)
}

Note that libcurl is required for installation of the RCurl R package.

Note that these packages do not cover running tests; they are needed for building H2O only.

Step 4. Install python and the required packages:

Install python:

brew install python

Install pip package manager:

sudo easy_install pip

Next install required packages:

sudo pip install wheel requests 'colorama>=0.3.8' future tabulate  
Step 5. Git Clone h2o-3

OS X should already have Git installed. To download and update the h2o-3 source code:

git clone https://github.com/h2oai/h2o-3
Step 6. Run the top-level gradle build:
cd h2o-3
./gradlew build

Note: on a typical machine, running all the tests may take a long time (about an hour).

If you encounter errors, run again with --stacktrace for more information about missing dependencies.

4.5. Setup on Ubuntu 14.04

Step 1. Install Node.js
curl -sL https://deb.nodesource.com/setup_0.12 | sudo bash -
sudo apt-get install -y nodejs
Step 2. Install JDK:

Install Java 8. Installation instructions can be found here: JDK installation. To make sure the command prompt is detecting the correct Java version, run:

javac -version
Step 3. Install R and the required packages:

Installation instructions can be found here: R installation. Click “Download R for Linux”, then click “Ubuntu” and follow the given instructions.
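As a rough sketch, once the CRAN repository for Ubuntu is configured as described in those instructions, base R can usually be installed with apt:

# install base R plus the headers needed to build R packages from source
sudo apt-get update
sudo apt-get install -y r-base r-base-dev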

To install the required packages, follow the same instructions as for OS X above.

Note: If the process fails to install RStudio Server on Linux, run one of the following:

sudo apt-get install libcurl4-openssl-dev

or

sudo apt-get install libcurl4-gnutls-dev

Step 4. Git Clone h2o-3

If you don't already have a Git client:

sudo apt-get install git

Download and update the h2o-3 source code:

git clone https://github.com/h2oai/h2o-3
Step 5. Run the top-level gradle build:
cd h2o-3
./gradlew build

If you encounter errors, run again with --stacktrace for more information about missing dependencies.

Make sure that you are not running as root, since bower will reject such a run.

4.6. Setup on Ubuntu 13.10

Step 1. Install Node.js
curl -sL https://deb.nodesource.com/setup_10.x | sudo bash -
sudo apt-get install -y nodejs
Steps 2-4. Follow steps 2-4 for Ubuntu 14.04 (above)

4.7. Setup on CentOS 7

cd /opt
sudo wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/7u79-b15/jdk-7u79-linux-x64.tar.gz"

sudo tar xzf jdk-7u79-linux-x64.tar.gz
cd jdk1.7.0_79

sudo alternatives --install /usr/bin/java java /opt/jdk1.7.0_79/bin/java 2

sudo alternatives --install /usr/bin/jar jar /opt/jdk1.7.0_79/bin/jar 2
sudo alternatives --install /usr/bin/javac javac /opt/jdk1.7.0_79/bin/javac 2
sudo alternatives --set jar /opt/jdk1.7.0_79/bin/jar
sudo alternatives --set javac /opt/jdk1.7.0_79/bin/javac

cd /opt

sudo wget http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-5.noarch.rpm
sudo rpm -ivh epel-release-7-5.noarch.rpm

echo "multilib_policy=best" | sudo tee -a /etc/yum.conf
sudo yum -y update

sudo yum -y install R R-devel git python-pip openssl-devel libxml2-devel libcurl-devel gcc gcc-c++ make openssl-devel kernel-devel texlive texinfo texlive-latex-fonts libX11-devel mesa-libGL-devel mesa-libGL nodejs npm python-devel numpy scipy python-pandas

sudo pip install scikit-learn grip tabulate statsmodels wheel

mkdir ~/Rlibrary
export JAVA_HOME=/opt/jdk1.7.0_79
export JRE_HOME=/opt/jdk1.7.0_79/jre
export PATH=$PATH:/opt/jdk1.7.0_79/bin:/opt/jdk1.7.0_79/jre/bin
export R_LIBS_USER=~/Rlibrary

# install local R packages
R -e 'install.packages(c("RCurl","jsonlite","statmod","devtools","roxygen2","testthat"), dependencies=TRUE, repos="http://cran.rstudio.com/")'

cd
git clone https://github.com/h2oai/h2o-3.git
cd h2o-3

# Build H2O
./gradlew syncSmalldata
./gradlew syncRPackages
./gradlew build -x test

5. Launching H2O after Building

To start the H2O cluster locally, execute the following on the command line:

java -jar build/h2o.jar

A list of available start-up JVM and H2O options (e.g. -Xmx, -nthreads, -ip) is available in the H2O User Guide.
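For example, a start-up command that sets the JVM heap size and a few common H2O options might look like the following (all values are illustrative):

# 4 GB heap, 4 worker threads, bind to a specific IP and the default port
java -Xmx4g -jar build/h2o.jar -nthreads 4 -ip 127.0.0.1 -port 54321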

6. Building H2O on Hadoop

Pre-built H2O-on-Hadoop zip files are available on the download page. Each Hadoop distribution version has a separate zip file in h2o-3.

To build H2O with Hadoop support yourself, first install sphinx for Python: pip install sphinx. Then start the build by entering the following from the top-level h2o-3 directory:

(export BUILD_HADOOP=1; ./gradlew build -x test)
./gradlew dist

This will create a directory called 'target' and generate zip files there. Note that BUILD_HADOOP is the default behavior when the username is jenkins (refer to settings.gradle); otherwise you have to request it, as shown above.
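Each generated zip contains an h2odriver.jar for the corresponding Hadoop distribution; a typical launch from a Hadoop edge node looks roughly like the following sketch (node count, mapper memory, and the HDFS output directory are illustrative):

# launch a 3-node H2O cluster on YARN (arguments are illustrative)
hadoop jar h2odriver.jar -nodes 3 -mapperXmx 6g -output hdfsOutputDirName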

Adding support for a new version of Hadoop

In the h2o-hadoop directory, each Hadoop version has a build directory for the driver and an assembly directory for the fatjar.

You need to:

  1. Add a new driver directory and assembly directory (each with a build.gradle file) in h2o-hadoop
  2. Add these new projects to h2o-3/settings.gradle
  3. Add the new Hadoop version to HADOOP_VERSIONS in make-dist.sh
  4. Add the new Hadoop version to the list in h2o-dist/buildinfo.json

Secure user impersonation

Hadoop supports secure user impersonation through its Java API. A kerberos-authenticated user can be allowed to proxy any username that meets specified criteria entered in the NameNode's core-site.xml file. This impersonation only applies to interactions with the Hadoop API or the APIs of Hadoop-related services that support it (this is not the same as switching to that user on the machine of origin).

Setting up secure user impersonation (for h2o):

  1. Create or find an id to use as proxy which has limited-to-no access to HDFS or related services; the proxy user need only be used to impersonate a user
  2. (Required if not using h2odriver) If you are not using the driver (e.g. you wrote your own code against h2o's API using Hadoop), make the necessary code changes to impersonate users (see org.apache.hadoop.security.UserGroupInformation)
  3. In Ambari/Cloudera Manager, or directly in the NameNode's core-site.xml file, add two or three of the following properties for the user we wish to use as a proxy (replace <proxyusername> with the simple user name, not the fully-qualified principal name):
    • hadoop.proxyuser.<proxyusername>.hosts: the hosts the proxy user is allowed to perform impersonated actions on behalf of a valid user from
    • hadoop.proxyuser.<proxyusername>.groups: the groups an impersonated user must belong to for impersonation to work with that proxy user
    • hadoop.proxyuser.<proxyusername>.users: the users a proxy user is allowed to impersonate
    • Example:

      <property>
        <name>hadoop.proxyuser.myproxyuser.hosts</name>
        <value>host1,host2</value>
      </property>
      <property>
        <name>hadoop.proxyuser.myproxyuser.groups</name>
        <value>group1,group2</value>
      </property>
      <property>
        <name>hadoop.proxyuser.myproxyuser.users</name>
        <value>user1,user2</value>
      </property>
  4. Restart core services such as HDFS & YARN for the changes to take effect

Impersonated HDFS actions can be viewed in the hdfs audit log ('auth:PROXY' should appear in the ugi= field in entries where this is applicable). YARN similarly should show 'auth:PROXY' somewhere in the Resource Manager UI.
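For example, a quick way to confirm that impersonation is in effect is to search the audit log for the proxy marker (the log path below is illustrative and varies by distribution):

# look for impersonated actions in the HDFS audit log (path is illustrative)
grep 'auth:PROXY' /var/log/hadoop-hdfs/hdfs-audit.log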

To use secure impersonation with h2o's Hadoop driver:

Before attempting this, see Risks with secure impersonation, below.

When using the h2odriver (e.g. when running with hadoop jar ...), specify -principal <proxy user kerberos principal>, -keytab <proxy user keytab path>, and -run_as_user <hadoop username to impersonate>, in addition to any other arguments needed. If the configuration was successful, the proxy user will log in and impersonate the -run_as_user as long as that user is allowed by either the users or groups configuration property (configured above); this is enforced by HDFS & YARN, not h2o's code. The driver effectively sets its security context as the impersonated user so all supported Hadoop actions will be performed as that user (e.g. YARN, HDFS APIs support securely impersonated users, but others may not).
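Putting this together, a driver invocation with impersonation enabled might look roughly like the sketch below; the principal, keytab path, impersonated user, and remaining arguments are all illustrative:

# run the driver as the proxy user and impersonate a target user (values are illustrative)
hadoop jar h2odriver.jar \
    -principal myproxyuser/host@EXAMPLE.COM \
    -keytab /path/to/myproxyuser.keytab \
    -run_as_user targetuser \
    -nodes 3 -mapperXmx 6g -output hdfsOutputDirName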

Precautions to take when leveraging secure impersonation

  • The target use case for secure impersonation is applications or services that pre-authenticate a user and then use (in this case) the h2odriver on behalf of that user. H2O's Steam is a perfect example: authenticate the user in the web app over SSL, then impersonate that user when creating the h2o YARN container.
  • The proxy user should have limited permissions in the Hadoop cluster; this means no permissions to access data or make API calls. In this way, if it's compromised it would only have the power to impersonate a specific subset of the users in the cluster and only from specific machines.
  • Use the hadoop.proxyuser.<proxyusername>.hosts property whenever possible or practical.
  • Don't give the proxy user's password or keytab to any user that you don't want to be able to impersonate another user (which is generally every user). The point of impersonation is not to allow users to impersonate each other. See the first bullet for the typical use case.
  • Limit user logon to the machine the proxying is occurring from whenever practical.
  • Make sure the keytab used to log in the proxy user is properly secured and that users can't log in as that id (via su, for instance).
  • Never set hadoop.proxyuser.<proxyusername>.{users,groups} to '*' or 'hdfs', 'yarn', etc. Allowing any user to impersonate hdfs, yarn, or any other important user/group should be done with extreme caution and strongly analyzed before it's allowed.

Risks with secure impersonation

  • The id performing the impersonation can be compromised like any other user id.
  • Setting any hadoop.proxyuser.<proxyusername>.{hosts,groups,users} property to '*' can greatly increase exposure to security risk.
  • When users aren't authenticated before being used with the driver (e.g. like Steam does via a secure web app/API), auditability of the process/system is difficult.

7. Sparkling Water

Sparkling Water combines two open-source technologies: Apache Spark and the H2O Machine Learning platform. It makes H2O’s library of advanced algorithms, including Deep Learning, GLM, GBM, K-Means, and Distributed Random Forest, accessible from Spark workflows. Spark users can select the best features from either platform to meet their Machine Learning needs. Users can combine Spark's RDD API and Spark MLLib with H2O’s machine learning algorithms, or use H2O independently of Spark for the model building process and post-process the results in Spark.

Sparkling Water Resources:

8. Documentation

Documentation Homepage

The main H2O documentation is the H2O User Guide. Visit http://docs.h2o.ai for the top-level introduction to documentation on H2O projects.

Generate REST API documentation

To generate the REST API documentation, use the following commands:

cd ~/h2o-3
cd py
python ./generate_rest_api_docs.py  # to generate Markdown only
python ./generate_rest_api_docs.py --generate_html  --github_user GITHUB_USER --github_password GITHUB_PASSWORD # to generate Markdown and HTML

The default location for the generated documentation is build/docs/REST.

If the build fails, try gradlew clean, then git clean -f.

Bleeding edge build documentation

Documentation for each bleeding edge nightly build is available on the nightly build page.

9. Citing H2O

If you use H2O as part of your workflow in a publication, please cite your H2O resource(s) using the following BibTeX entry:

H2O Software

@Manual{h2o_package_or_module,
    title = {package_or_module_title},
    author = {H2O.ai},
    year = {year},
    month = {month},
    note = {version_information},
    url = {resource_url},
}

Formatted H2O Software citation examples:

H2O Booklets

H2O algorithm booklets are available at the Documentation Homepage.

@Manual{h2o_booklet_name,
    title = {booklet_title},
    author = {list_of_authors},
    year = {year},
    month = {month},
    url = {link_url},
}

Formatted booklet citation examples:

Arora, A., Candel, A., Lanford, J., LeDell, E., and Parmar, V. (Oct. 2016). Deep Learning with H2O. http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/DeepLearningBooklet.pdf.

Click, C., Lanford, J., Malohlava, M., Parmar, V., and Roark, H. (Oct. 2016). Gradient Boosted Models with H2O. http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/GBMBooklet.pdf.

10. Roadmap

H2O 3.34.0.1 - January 2021

  • Extended Isolation Forest Algorithm
  • Uplift Trees
  • Extracting & ranking feature interactions from GBM and XGBoost models
  • RuleFit MOJO, CoxPH MOJO
  • Support for MOJO2 Scoring
  • Grid-Search fault Tolerance
  • Kubernetes Operator
  • Externalized XGBoost on Kubernetes clusters

11. Community

H2O has been built by a great many contributors over the years, both within H2O.ai (the company) and the greater open source community. You can begin to contribute to H2O by answering Stack Overflow questions or filing bug reports. Please join us!

Team & Committers

SriSatish Ambati
Cliff Click
Tom Kraljevic
Tomas Nykodym
Michal Malohlava
Kevin Normoyle
Spencer Aiello
Anqi Fu
Nidhi Mehta
Arno Candel
Josephine Wang
Amy Wang
Max Schloemer
Ray Peck
Prithvi Prabhu
Brandon Hill
Jeff Gambera
Ariel Rao
Viraj Parmar
Kendall Harris
Anand Avati
Jessica Lanford
Alex Tellez
Allison Washburn
Amy Wang
Erik Eckstrand
Neeraja Madabhushi
Sebastian Vidrio
Ben Sabrin
Matt Dowle
Mark Landry
Erin LeDell
Andrey Spiridonov
Oleg Rogynskyy
Nick Martin
Nancy Jordan
Nishant Kalonia
Nadine Hussami
Jeff Cramer
Stacie Spreitzer
Vinod Iyengar
Charlene Windom
Parag Sanghavi
Navdeep Gill
Lauren DiPerna
Anmol Bal
Mark Chan
Nick Karpov
Avni Wadhwa
Ashrith Barthur
Karen Hayrapetyan
Jo-fai Chow
Dmitry Larko
Branden Murray
Jakub Hava
Wen Phan
Magnus Stensmo
Pasha Stetsenko
Angela Bartz
Mateusz Dymczyk
Micah Stubbs
Ivy Wang
Terone Ward
Leland Wilkinson
Wendy Wong
Nikhil Shekhar
Pavel Pscheidl
Michal Kurka
Veronika Maurerova
Jan Sterba
Jan Jendrusak
Sebastien Poirier
Tomáš Frýda

Advisors

Scientific Advisory Council

Stephen Boyd
Rob Tibshirani
Trevor Hastie

Systems, Data, FileSystems and Hadoop

Doug Lea
Chris Pouliot
Dhruba Borthakur

Investors

Jishnu Bhattacharjee, Nexus Venture Partners
Anand Babu Periasamy
Anand Rajaraman
Ash Bhardwaj
Rakesh Mathur
Michael Marks
Egbert Bierman
Rajesh Ambati