Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow

Overview

eXtreme Gradient Boosting

Community | Documentation | Resources | Contributors | Release Notes

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately. The same code runs on major distributed environments (Kubernetes, Hadoop, SGE, MPI, Dask) and can solve problems beyond billions of examples.

License

© Contributors, 2019. Licensed under an Apache-2 license.

Contribute to XGBoost

XGBoost has been developed and used by a group of active community members. Your help is very valuable to make the package better for everyone. Check out the Community Page.

Reference

  • Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. In 22nd SIGKDD Conference on Knowledge Discovery and Data Mining, 2016
  • XGBoost originates from a research project at the University of Washington.

Sponsors

Become a sponsor and get a logo here. See details at Sponsoring the XGBoost Project. The funds are used to defray the cost of continuous integration and testing infrastructure (https://xgboost-ci.net).

Open Source Collective sponsors

Sponsors

[Become a sponsor]

NVIDIA

Backers

[Become a backer]

Other sponsors

The sponsors in this list are donating cloud hours in lieu of cash donation.

Amazon Web Services

Comments
  • Predict error in R as of 1.1.1

    R version: 3.6.1 (Action of the Toes)

    xgboost version: 1.1.1.1

    This error can be produced when attempting to call predict on an xgboost model developed pre-1.0

    Error: Error in predict.xgb.Booster(model, data) : [11:24:23] amalgamation/../src/learner.cc:506: Check failed: mparam_.num_feature != 0 (0 vs. 0) : 0 feature is supplied. Are you using raw Booster interface?

    opened by jrausch12 103
  • [jvm-packages] Scala implementation of the Rabit tracker.

    Motivation

    The Java implementation of RabitTracker in xgboost4j depends on the Python script tracker.py in dmlc-core to handle all socket connections / loggings.

    The reliance on Python code has a few weaknesses:

    • It makes xgboost4j-spark and xgboost4j-flink, which use RabitTracker, more susceptible to random failures on worker nodes due to Python versions.
    • It increases difficulty for debugging tracker-related issues.
    • Since the Python code handles all socket connection logic, it is difficult to handle timeout due to connection issues, and thus the tracker may hang indefinitely if the workers fail to connect due to networking issues.

    To address the above issues, this PR was created to introduce a pure Scala implementation of the RabitTracker, that is interchangeable with the Java implementation at interface level, but with the Python dependency completely removed.

    The implementation was tested in a Spark cluster running on YARN with up to 16 distributed workers. More thorough tests of this PR (local mode, more nodes, etc.) are still WIP.

    Implementation details

    The Scala version of RabitTracker replicates the functionality of the RabitTracker class in tracker.py, that is, to handle incoming connections from Rabit clients of the worker nodes, compute the link map and rank for each given worker, and print tracker logging information.
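To make "compute the link map and rank" more concrete, here is a minimal Python sketch of one way a tracker could assign ranks in connection order and derive a binary-tree link map. The function names, the worker IDs and the tree layout are assumptions for illustration only. The sketch does not reproduce the logic of tracker.py or of the Scala tracker.

```python
def tree_neighbors(rank, n_workers):
    """Return the ranks linked to `rank` in a binary tree rooted at rank 0."""
    neighbors = []
    if rank > 0:
        neighbors.append((rank - 1) // 2)       # parent
    for child in (2 * rank + 1, 2 * rank + 2):  # left and right children
        if child < n_workers:
            neighbors.append(child)
    return neighbors


def build_link_map(worker_ids):
    """Assign ranks in connection order and compute each worker's links."""
    ranks = {worker: rank for rank, worker in enumerate(worker_ids)}
    n_workers = len(worker_ids)
    return {worker: tree_neighbors(rank, n_workers) for worker, rank in ranks.items()}


# Hypothetical worker IDs, listed in the order they connected.
link_map = build_link_map(["w-a", "w-b", "w-c", "w-d"])
```

Each worker only needs the ranks of its parent and children, so the tracker can send every worker a small slice of the map rather than the whole topology.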

    The tracker handles connections in asynchronous and non-blocking fashion using Akka, and resolves the inter-dependency between worker connections properly.

    Timeouts

    The Scala RabitTracker implements timeout logic at multiple entry points.

    • RabitTracker.start() may time out if the tracker fails to bind to a socket address within a certain time limit.
    • RabitTracker.waitFor() may time out if at least one worker fails to connect to the tracker within a certain time limit. This prevents the tracker from hanging forever.
    • RabitTracker.waitFor() may time out after a given maximum execution time limit.
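The connection timeout described above can be illustrated with a small Python sketch using only the standard library. It is not the Akka-based implementation from this PR. The function name `wait_for_workers` and its parameters are hypothetical, and the sketch covers only the "worker fails to connect in time" case.

```python
import socket


def wait_for_workers(server, n_workers, connect_timeout):
    """Accept `n_workers` connections, failing if one takes too long to arrive.

    Raises TimeoutError when no new worker connects within
    `connect_timeout` seconds, so the caller never blocks forever.
    """
    server.settimeout(connect_timeout)
    connections = []
    try:
        while len(connections) < n_workers:
            conn, _addr = server.accept()
            connections.append(conn)
    except socket.timeout:
        for conn in connections:
            conn.close()
        raise TimeoutError(
            f"only {len(connections)} of {n_workers} workers connected"
        )
    return connections


# Demo: nobody connects, so the wait gives up after 0.2 seconds.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
server.listen()
try:
    wait_for_workers(server, n_workers=1, connect_timeout=0.2)
    timed_out = False
except TimeoutError:
    timed_out = True
finally:
    server.close()
```

The timeout here applies to each `accept` call. A limit on total execution time, as in the third bullet, would need a separate deadline check.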

    Checklist

    The following tasks are to be completed:

    • [x] Add options to switch between Python-based tracker and Scala-based tracker in xgboost4j-spark and xgboost4j-flink.
    • [x] Refactoring of RabitTracker.scala to separate the components into different files.
    • [x] Unit tests for individual actors (using akka-testkit).
    • [x] Test with rabit clients (Allreduce, checkpoint, simulated connection issues).
    • [x] Test in production.
    opened by xydrolase 83
  • Model produced in 1.0.0 cannot be loaded into 0.90

    Following the instructions here: https://xgboost.readthedocs.io/en/latest/R-package/xgboostPresentation.html

    > install.packages("drat", repos="https://cran.rstudio.com")
    trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.6/drat_0.1.5.zip'
    Content type 'application/zip' length 87572 bytes (85 KB)
    downloaded 85 KB
    
    package ‘drat’ successfully unpacked and MD5 sums checked
    
    The downloaded binary packages are in
            C:\Users\lee\AppData\Local\Temp\RtmpiE0N3D\downloaded_packages
    > drat:::addRepo("dmlc")
    > install.packages("xgboost", repos="http://dmlc.ml/drat/", type = "source")
    Warning: unable to access index for repository http://dmlc.ml/drat/src/contrib:
      Line starting '<!DOCTYPE html> ...' is malformed!
    Warning message:
    package ‘xgboost’ is not available (for R version 3.6.0) 
    

    It also fails on R 3.6.2 with the same error.

    Note: I would much prefer to use the CRAN version. But models I train on Linux and Mac and save using the saveRDS function don't predict on another system (Windows); they just produce numeric(0). If anyone has any guidelines on how to save an XGBoost model for use on other computers, please let me know. I've tried xgb.save.raw and xgb.load; both produce numeric(0) as well. But on the computer I trained the model on (a month ago), readRDS in R works just fine. Absolutely baffling to me.

    opened by leedrake5 74
  • pip install failure

    root@0c6c17725a7b:/# pip install xgboost
    Downloading/unpacking xgboost
      Could not find a version that satisfies the requirement xgboost (from versions: 0.4a12, 0.4a13)
    Cleaning up...
    No distributions matching the version for xgboost
    Storing debug log for failure in /root/.pip/pip.log

    You can reproduce this in Docker with:

    docker run -it --rm ubuntu:trusty
    apt-get update
    apt-get install python-pip
    pip install xgboost
    

    See also:

    http://stackoverflow.com/questions/32258463/install-xgboost-under-python-failing

    opened by cliveseldon 72
  • OMP: Error #15: Initializing libiomp5.dylib, but found libiomp5.dylib already initialized.

    For bugs or installation issues, please provide the following information. The more information you provide, the more easily we will be able to offer help and advice.

    Environment info

    Operating System: Mac OSX Sierra 10.12.1

    Compiler:

    Package used (python):

    xgboost version used: xgboost 0.6a2

    If you are using python package, please provide

    1. The python version and distribution: Python 2.7.12
    2. The command to install xgboost if you are not installing from source: pip install xgboost

    Steps to reproduce

    1. Run the following script:

        from xgboost import XGBClassifier
        import numpy as np
        import matplotlib.pyplot as plt

        x = np.array([[1, 2], [3, 4]])
        y = np.array([0, 1])
        clf = XGBClassifier(base_score=0.005)
        clf.fit(x, y)
        plt.hist(clf.feature_importances_)

    What have you tried?

    See the error message: "OMP: Error #15: Initializing libiomp5.dylib, but found libiomp5.dylib already initialized. OMP: Hint: This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/."

    I tried:

        import os
        os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'

    It does the job for me, but it is kind of ugly.

    I know it might not be a problem with xgboost, but I'm pretty sure this problem started after I upgraded xgboost using 'pip install xgboost'. I am posting the issue here to see if anyone has had the same problem. I have very little knowledge of OpenMP. Please help!
    Thanks in advance!

    opened by symPhysics 71
  • RMM integration plugin

    Fixes #5861.

    Depends on #5871. Will rebase after #5871 is merged.

    Depends on #5966. Will rebase after #5966 is merged.

    ~Currently, the C++ tests are crashing with an out-of-memory error.~ The OOM has been fixed.

    status: need review 
    opened by hcho3 66
  • [DISCUSSION] Adopting JSON-like format as next-generation model format

    As discussed in #3878 and #3886, we might want a more extensible format for saving XGBoost models.

    For now, my plan is to use the JSONReader and JSONWriter implemented in dmlc-core to add experimental support for saving/loading models to and from JSON files. Because the related code is quite messy and dangerous to change, I want to share my plan, and possibly an early PR, as soon as possible so that someone can point out my mistakes early (there will be mistakes) and we don't duplicate work. :)

    @hcho3

    type: roadmap 
    opened by trivialfis 57
  • XGBoost 0.90 Roadmap

    This thread is to keep track of all the good things that will be included in 0.90 release. It will be updated as the planned release date (~May 1, 2019~ as soon as Spark 2.4.3 is out) approaches.

    • [x] XGBoost will no longer support Python 2.7, since it is reaching its end-of-life soon. This decision was reached in #4379.
    • [x] XGBoost4J-Spark will now require Spark 2.4+, as Spark 2.3 is reaching its end-of-life in a few months (#4377) (https://github.com/dmlc/xgboost/issues/4409)
    • [x] XGBoost4J now supports up to JDK 12 (#4351)
    • [x] Additional optimizations for gpu_hist (#4248, #4283)
    • [x] XGBoost as CMake target; C API example (#4323, #4333)
    • [x] GPU multi-class metrics (#4368)
    • [x] Scikit-learn-like random forest API (#4148)
    • [x] Bugfix: Fix GPU histogram allocation (#4347)
    • [x] [BLOCKING][jvm-packages] fix non-deterministic order within a partition (in the case of an upstream shuffle) on prediction https://github.com/dmlc/xgboost/pull/4388
    • [x] Roadmap: additional optimizations for hist on multi-core Intel CPUs (#4310)
    • [x] Roadmap: hardened Rabit; see RFC #4250
    • [x] Robust handling of missing values in XGBoost4J-Spark https://github.com/dmlc/xgboost/pull/4349
    • [x] External memory with GPU predictor (#4284, #4438)
    • [x] Use feature interaction constraints to narrow split search space (#4341)
    • [x] Re-vamp Continuous Integration pipeline; see RFC #4234
    • [x] Bugfix: AUC, AUCPR metrics should handle weights correctly for learning-to-rank task (#4216)
    • [x] Ignore comments in LIBSVM files (#4430)
    • [x] Bugfix: Fix AUCPR metric for ranking (#4436)
    type: roadmap 
    opened by hcho3 56
  • 1.5.0 Release Candidate

    Roadmap https://github.com/dmlc/xgboost/issues/6846 . Draft of release note: https://github.com/dmlc/xgboost/pull/7271 .

    We are about to release version 1.5.0 of XGBoost. In the next two weeks, we invite everyone to try out the release candidate (RC).

    Feedback period: until the end of October 13, 2021. No new feature will be added to the release; only critical bug fixes will be added.

    @dmlc/xgboost-committer

    Available packages:

    • Python packages:
    pip install xgboost==1.5.0rc1
    
    • R packages:
      Linux x86_64: https://github.com/dmlc/xgboost/releases/download/v1.5.0rc1/xgboost_r_gpu_linux.tar.gz
      Windows x86_64: https://github.com/dmlc/xgboost/releases/download/v1.5.0rc1/xgboost_r_gpu_win64.tar.gz
    R CMD INSTALL ./xgboost_r_gpu_linux.tar.gz
    
    • JVM packages
    Maven

    <dependencies>
      ...
      <dependency>
          <groupId>ml.dmlc</groupId>
          <artifactId>xgboost4j_2.12</artifactId>
          <version>1.5.0-RC1</version>
      </dependency>
      <dependency>
          <groupId>ml.dmlc</groupId>
          <artifactId>xgboost4j-spark_2.12</artifactId>
          <version>1.5.0-RC1</version>
      </dependency>
    </dependencies>
    
    <repositories>
      <repository>
        <id>XGBoost4J Release Repo</id>
        <name>XGBoost4J Release Repo</name>
        <url>https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/release/</url>
      </repository>
    </repositories>
    

    SBT

    libraryDependencies ++= Seq(
      "ml.dmlc" %% "xgboost4j" % "1.5.0-RC1",
      "ml.dmlc" %% "xgboost4j-spark" % "1.5.0-RC1"
    )
    resolvers += ("XGBoost4J Release Repo"
                  at "https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/release/")
    

    Starting from 1.2.0, XGBoost4J-Spark supports training with NVIDIA GPUs. To enable this capability, download artifacts suffixed with -gpu, as follows:

    Maven

    <dependencies>
      ...
      <dependency>
          <groupId>ml.dmlc</groupId>
          <artifactId>xgboost4j-gpu_2.12</artifactId>
          <version>1.5.0-RC1</version>
      </dependency>
      <dependency>
          <groupId>ml.dmlc</groupId>
          <artifactId>xgboost4j-spark-gpu_2.12</artifactId>
          <version>1.5.0-RC1</version>
      </dependency>
    </dependencies>
    
    <repositories>
      <repository>
        <id>XGBoost4J Release Repo</id>
        <name>XGBoost4J Release Repo</name>
        <url>https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/release/</url>
      </repository>
    </repositories>
    

    SBT

    libraryDependencies ++= Seq(
      "ml.dmlc" %% "xgboost4j-gpu" % "1.5.0-RC1",
      "ml.dmlc" %% "xgboost4j-spark-gpu" % "1.5.0-RC1"
    )
    resolvers += ("XGBoost4J Release Repo"
                  at "https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/release/")
    

    TO-DOs

    • [x] Release pip rc package.
    • [x] Test on R-hub.
    • [x] Release R rc package.
    • [x] Release jvm rc packages.

    PRs to be backported

    • [x] Fix gamma negative log likelihood (https://github.com/dmlc/xgboost/pull/7275)
    • [x] Fix verbose_eval in Python cv function. (https://github.com/dmlc/xgboost/pull/7291)
    • [x] Fix weighted samples in multi-class AUC (https://github.com/dmlc/xgboost/pull/7300)
    • [x] Fix prediction with categorical dataframe using sklearn interface. (https://github.com/dmlc/xgboost/pull/7306)
    type: roadmap 
    opened by trivialfis 54
  • [DISCUSSION] Integration with PySpark

    I just noticed that there are some requests for integration with PySpark: http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html

    I also received some emails from users discussing the same topic.

    I would like to initiate a discussion here on whether/when we should start this work.

    @tqchen @terrytangyuan

    type: python 
    opened by CodingCat 53
  • [Roadmap] XGBoost 1.0.0 Roadmap

    @dmlc/xgboost-committer please add your items here by editing this post. Let's ensure that

    • each item is associated with a ticket

    • major design or refactoring work is associated with an RFC before the code is committed

    • blocking issues are marked as blocking

    • breaking changes are marked as breaking

    For other contributors who have no permission to edit the post, please comment here about what you think should be in 1.0.0.

    I have created three new labels: 1.0.0, Blocking, and Breaking.

    • [x] Improve installation experience on Mac OSX (#4477)
    • [x] Remove old GPU objectives.
    • [x] Remove gpu_exact updater (deprecated) #4527
    • [x] Remove multi threaded multi gpu support (deprecated) #4531
    • [x] External memory for gpu and associated dmatrix refactoring #4357 #4354
    • [ ] Spark Checkpoint Performance Improvement (https://github.com/dmlc/xgboost/issues/3946)
    • [x] [BLOCKING] the sync mechanism in hist method in master branch is broken due to the inconsistent shape of tree in different workers (https://github.com/dmlc/xgboost/pull/4716, https://github.com/dmlc/xgboost/issues/4679)
    • [x] Per-node sync slows down distributed training with 'hist' (#4679)
    • [x] Regression tests including binary IO compatibility, output stability, performance regressions.
    type: roadmap 
    opened by CodingCat 52
  • [CI] fix git errors related to directory ownership

    The Test R package on Debian CI job is currently failing on master (build link) and on #8627 (build link) with the following error at the step where the repo is git clone'd.

    ValueError: ('Failed to check git repository status.', CompletedProcess(args=['git', 'clean', '-xdf', '--dry-run'], returncode=128, stdout=b'', stderr=b"fatal: detected dubious ownership in repository at '/__w/xgboost/xgboost'\nTo add an exception for this directory, call:\n\n\tgit config --global --add safe.directory /__w/xgboost/xgboost\n"))

    I suspect this means that that job started getting a newer version of git recently, which contains a patch for the CVE described at https://github.blog/2022-04-12-git-security-vulnerability-announced/.

    This PR proposes a patch that I think will address it. We've been using something similar over in LightGBM for the last 8 months (https://github.com/microsoft/LightGBM/pull/5152) without issue.
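The error message above names the git command that clears the failure. As a rough illustration only (the actual patch in this PR is not shown here), a CI helper script could build and optionally run that command as follows. The function names and the `dry_run` flag are hypothetical.

```python
import subprocess


def safe_directory_command(repo_path):
    """Build the git command that marks `repo_path` as a trusted directory."""
    return ["git", "config", "--global", "--add", "safe.directory", repo_path]


def trust_repository(repo_path, dry_run=True):
    """Run the command, or only return it when `dry_run` is set."""
    command = safe_directory_command(repo_path)
    if not dry_run:
        # Modifies the global git configuration of the current user.
        subprocess.run(command, check=True)
    return command


# Path taken from the error message quoted above.
command = trust_repository("/__w/xgboost/xgboost")
```

The dry-run default keeps the sketch from touching the global git configuration unless that is requested explicitly.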

    opened by jameslamb 0
  • remove unused variables in JSON-parsing code

    Proposes removing two unused variables in src/common/json.cc. Hopefully this will cut a bit of processing time and some unnecessary allocations out of JSON-reading operations.

    Notes for Reviewers

    I found these by running cppcheck over the project's source code.

    cppcheck \
        --force \
        --enable=all \
        --std=c++14 \
        -I include/ \
        -UDEBUG \
        src/ \
    > cppcheck.txt 2>&1
    
    cat cppcheck.txt | grep 'unusedVar'
    

    These are the only two such warnings cppcheck found.

    src/common/json.cc:440:22: style: Unused variable: output [unusedVariable]
    src/common/json.cc:659:15: style: Unused variable: buffer [unusedVariable]

    opened by jameslamb 0
  • Update custom_metric_obj.rst

    For the codeblock given in line 291, in the softprob_obj method definition, the variable 'classes' is not defined. It seems it should be defined as given in the changes.

    opened by mrbaloglu 1
  • Bump rapids-4-spark_2.12 from 21.08.0 to 22.12.0 in /jvm-packages

    Bumps rapids-4-spark_2.12 from 21.08.0 to 22.12.0.

    dependencies 
    opened by dependabot[bot] 0
  • [CI] cudaErrorInsufficientDriver returned from cudaGetDevice

    https://github.com/dmlc/xgboost/blob/c6a8754c62496e43452e6edf49eb0eb89ffcdc70/tests/cpp/helpers.cc#L632

    This line is currently failing in the CI due to a driver issue. It's part of the GTest with RMM. Until the CI machine can be updated with the latest driver, we'll disable the GTest with RMM.

    opened by hcho3 0
  • QuantileDMatrix no longer take libsvm file as an input?

    XGBoost 1.7.2

    dtrain=xgb.QuantileDMatrix('/Users/weitian/tmp/data.test')
    Traceback (most recent call last):
      File "", line 1, in
      File "/Users/weitian/opt/miniconda3/envs/turi/lib/python3.9/site-packages/xgboost/core.py", line 620, in inner_f
        return func(**kwargs)
      File "/Users/weitian/opt/miniconda3/envs/turi/lib/python3.9/site-packages/xgboost/core.py", line 1386, in init
        self._init(
      File "/Users/weitian/opt/miniconda3/envs/turi/lib/python3.9/site-packages/xgboost/core.py", line 1445, in _init
        it.reraise()
      File "/Users/weitian/opt/miniconda3/envs/turi/lib/python3.9/site-packages/xgboost/core.py", line 488, in reraise
        raise exc # pylint: disable=raising-bad-type
      File "/Users/weitian/opt/miniconda3/envs/turi/lib/python3.9/site-packages/xgboost/core.py", line 469, in _handle_exception
        return fn()
      File "/Users/weitian/opt/miniconda3/envs/turi/lib/python3.9/site-packages/xgboost/core.py", line 534, in
        return self._handle_exception(lambda: self.next(input_data), 0)
      File "/Users/weitian/opt/miniconda3/envs/turi/lib/python3.9/site-packages/xgboost/data.py", line 1172, in next
        input_data(**self.kwargs)
      File "/Users/weitian/opt/miniconda3/envs/turi/lib/python3.9/site-packages/xgboost/core.py", line 620, in inner_f
        return func(**kwargs)
      File "/Users/weitian/opt/miniconda3/envs/turi/lib/python3.9/site-packages/xgboost/core.py", line 519, in input_data
        new, cat_codes, feature_names, feature_types = _proxy_transform(
      File "/Users/weitian/opt/miniconda3/envs/turi/lib/python3.9/site-packages/xgboost/data.py", line 1206, in _proxy_transform
        raise TypeError("Value type is not supported for data iterator:" + str(type(data)))
    TypeError: Value type is not supported for data iterator:<class 'str'>

    opened by weitian 3
Releases(v1.7.2)
  • v1.7.2(Dec 8, 2022)

    v1.7.2 (2022 Dec 8)

    This is a patch release for bug fixes.

    • Work with newer thrust and libcudacxx (#8432)

    • Support null value in CUDA array interface namespace. (#8486)

    • Use getsockname instead of SO_DOMAIN on AIX. (#8437)

    • [pyspark] Make QDM optional based on a cuDF check (#8471)

    • [pyspark] sort qid for SparkRanker. (#8497)

    • [dask] Properly await async method client.wait_for_workers. (#8558)

    • [R] Fix CRAN test notes. (#8428)

    • [doc] Fix outdated document [skip ci]. (#8527)

    • [CI] Fix github action mismatched glibcxx. (#8551)

    Artifacts

    You can verify the downloaded packages by running this on your Unix shell:

    echo "<hash> <artifact>" | shasum -a 256 --check
    
    15be5a96e86c3c539112a2052a5be585ab9831119cd6bc3db7048f7e3d356bac  xgboost_r_gpu_linux_1.7.2.tar.gz
    0dd38b08f04ab15298ec21c4c43b17c667d313eada09b5a4ac0d35f8d9ba15d7  xgboost_r_gpu_win64_1.7.2.tar.gz
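
Where `shasum` is not available, the same check can be done with Python's standard library. This is a minimal sketch; the helper names `sha256_of` and `verify` are illustrative and not part of XGBoost.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large artifacts do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hash):
    """Return True when the file's digest matches the published hash."""
    return sha256_of(path) == expected_hash.strip().lower()
```

For example, `verify("xgboost_r_gpu_linux_1.7.2.tar.gz", "15be5a96e86c3c539112a2052a5be585ab9831119cd6bc3db7048f7e3d356bac")` should return True for an intact download of the Linux artifact.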
    
    Source code(tar.gz)
    Source code(zip)
    xgboost_r_gpu_linux_1.7.2.tar.gz(69.59 MB)
    xgboost_r_gpu_win64_1.7.2.tar.gz(84.99 MB)
  • v1.7.1(Nov 3, 2022)

    v1.7.1 (2022 November 3)

    This is a patch release to incorporate the following hotfix:

    • Add back xgboost.rabit for backwards compatibility (#8411)
    Source code(tar.gz)
    Source code(zip)
  • v1.7.0(Oct 31, 2022)

    Note. The source distribution of Python XGBoost 1.7.0 was defective (#8415). Since PyPI does not allow us to replace existing artifacts, we released 1.7.0.post0 version to upload the new source distribution. Everything in 1.7.0.post0 is identical to 1.7.0 otherwise.

    v1.7.0 (2022 Oct 20)

    We are excited to announce the feature-packed XGBoost 1.7 release. The release note first walks through some of the major new features, then summarizes other improvements and language-binding-specific changes.

    PySpark

    XGBoost 1.7 features initial support for PySpark integration. The new interface is adapted from the existing PySpark XGBoost interface developed by Databricks, with additional features like QuantileDMatrix and the rapidsai plugin (GPU pipeline) support. The new Spark XGBoost Python estimators not only benefit from PySpark ml facilities for powerful distributed computing but also enjoy the rest of the Python ecosystem. Users can define a custom objective, callbacks, and metrics in Python and use them with this interface on distributed clusters. The support is labeled as experimental with more features to come in future releases. For a brief introduction please visit the tutorial on XGBoost's documentation page. (#8355, #8344, #8335, #8284, #8271, #8283, #8250, #8231, #8219, #8245, #8217, #8200, #8173, #8172, #8145, #8117, #8131, #8088, #8082, #8085, #8066, #8068, #8067, #8020, #8385)

    Due to its initial support status, the new interface has some limitations; categorical features and multi-output models are not yet supported.

    Development of categorical data support

    More progress on the experimental support for categorical features. In 1.7, XGBoost can handle missing values in categorical features and features a new parameter max_cat_threshold, which limits the number of categories that can be used in the split evaluation. The parameter is enabled when the partitioning algorithm is used and helps prevent over-fitting. Also, the sklearn interface can now accept the feature_types parameter to use data types other than dataframe for categorical features. (#8280, #7821, #8285, #8080, #7948, #7858, #7853, #8212, #7957, #7937, #7934)

    Experimental support for federated learning and new communication collective

    An exciting addition to XGBoost is the experimental federated learning support. The federated learning is implemented with a gRPC federated server that aggregates allreduce calls, and federated clients that train on local data and use existing tree methods (approx, hist, gpu_hist). Currently, this only supports horizontal federated learning (samples are split across participants, and each participant has all the features and labels). Future plans include vertical federated learning (features split across participants), and stronger privacy guarantees with homomorphic encryption and differential privacy. See Demo with NVFlare integration for example usage with nvflare.
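
As a rough illustration of the horizontal setup, the sketch below has two participants build gradient histograms over their own samples and then sums them element-wise, which is the role the note gives to the aggregated allreduce call. It is plain Python with no gRPC. The function names, bin assignments and gradient values are made up for the example and do not come from XGBoost.

```python
def local_histogram(bin_indices, gradients, n_bins):
    """Sum the gradients of local samples that fall into each feature bin."""
    hist = [0.0] * n_bins
    for bin_index, grad in zip(bin_indices, gradients):
        hist[bin_index] += grad
    return hist


def allreduce_sum(histograms):
    """Element-wise sum, standing in for the server-side aggregation."""
    return [sum(values) for values in zip(*histograms)]


# Two participants hold different samples but the same feature and bins.
site_a = local_histogram([0, 1, 1], [0.5, -1.0, 0.25], n_bins=3)
site_b = local_histogram([2, 0], [1.5, -0.5], n_bins=3)
global_hist = allreduce_sum([site_a, site_b])
```

The summed histogram is the same as one built over the pooled samples, so every participant can evaluate the same splits without sharing raw rows.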

    As part of the work, XGBoost 1.7 has replaced the old rabit module with the new collective module as the network communication interface, with added support for runtime backend selection. In previous versions, the backend was defined at compile time and could not be changed once built. In this new release, users can choose between rabit and federated. (#8029, #8351, #8350, #8342, #8340, #8325, #8279, #8181, #8027, #7958, #7831, #7879, #8257, #8316, #8242, #8057, #8203, #8038, #7965, #7930, #7911)

    The feature is available in the public PyPI binary package for testing.

    Quantile DMatrix

    Before 1.7, XGBoost had an internal data structure called DeviceQuantileDMatrix (and its distributed version). We have now extended its support to CPU and renamed it to QuantileDMatrix. This data structure is used for optimizing memory usage for the hist and gpu_hist tree methods. The new feature helps reduce CPU memory usage significantly, especially for dense data. The new QuantileDMatrix can be initialized from both CPU and GPU data, and regardless of where the data comes from, the constructed instance can be used by both the CPU algorithm and GPU algorithm including training and prediction (with some overhead of conversion if the device of the data and the training algorithm don't match). Also, a new parameter ref is added to QuantileDMatrix, which can be used to construct validation/test datasets. Lastly, it's set as default in the scikit-learn interface when a supported tree method is specified by users. (#7889, #7923, #8136, #8215, #8284, #8268, #8220, #8346, #8327, #8130, #8116, #8103, #8094, #8086, #7898, #8060, #8019, #8045, #7901, #7912, #7922)
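
The memory saving comes from storing small bin indices instead of raw feature values. The pure-Python sketch below shows the general idea of quantile binning and of reusing the training cut points for validation data. It is a simplification under assumed names (`quantile_cuts`, `to_bins`) and is not XGBoost's actual sketching algorithm.

```python
import bisect


def quantile_cuts(values, n_bins):
    """Pick up to n_bins - 1 cut points at evenly spaced ranks of the sorted data."""
    ordered = sorted(values)
    cuts = []
    for i in range(1, n_bins):
        cut = ordered[i * len(ordered) // n_bins]
        if not cuts or cut > cuts[-1]:  # skip duplicate cut points
            cuts.append(cut)
    return cuts


def to_bins(values, cuts):
    """Replace each value by the index of the bin it falls into."""
    return [bisect.bisect_right(cuts, value) for value in values]


train = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.6, 0.75]
cuts = quantile_cuts(train, n_bins=4)
train_bins = to_bins(train, cuts)
# Validation data reuses the training cuts, loosely mirroring the role of `ref`.
valid_bins = to_bins([0.05, 0.5, 0.95], cuts)
```

Reusing the training cuts keeps the bin indices of the validation data comparable with those the trees were built on.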

    Mean absolute error

    The mean absolute error (MAE) is a new member of the collection of objectives in XGBoost. It is noteworthy because MAE has a zero Hessian, which is unusual for XGBoost, as XGBoost relies on Newton optimization. Without valid Hessian values, convergence can be slow. As part of the support for MAE, we added line searches to the XGBoost training algorithm to overcome the difficulty of training without valid Hessian values. In the future, we will extend the line search to other objectives where it is appropriate for faster convergence. (#8343, #8107, #7812, #8380)
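
A small Python sketch shows why the Newton step breaks down here. The gradient of the absolute error is the sign of the residual and the Hessian is zero, so a step of the form -sum(grad) / sum(hess) is undefined. For a single constant leaf value, the minimiser of the summed absolute error is the median of the residuals, which the sketch uses in place of a Newton step. The function names are illustrative, and the release note does not say that XGBoost's line search is implemented this way.

```python
import statistics


def mae_grad_hess(pred, label):
    """Gradient and Hessian of |pred - label| with respect to pred."""
    diff = pred - label
    grad = (diff > 0) - (diff < 0)  # sign of the residual: -1, 0 or 1
    hess = 0.0                      # second derivative is zero almost everywhere
    return grad, hess


def leaf_value_without_hessian(preds, labels):
    """Constant shift that minimises the summed absolute error for one leaf."""
    return statistics.median(label - pred for pred, label in zip(preds, labels))


def total_abs_error(preds, labels, shift):
    """Summed absolute error after adding `shift` to every prediction."""
    return sum(abs(label - (pred + shift)) for pred, label in zip(preds, labels))


preds = [0.0, 0.0, 0.0]
labels = [1.0, 2.0, 10.0]
shift = leaf_value_without_hessian(preds, labels)
```

With these values the mean residual would be pulled towards the outlier at 10.0, while the median stays at the middle residual.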

    XGBoost on Browser

    With the help of the pyodide project, you can now run XGBoost on browsers. (#7954, #8369)

    Experimental IPv6 Support for Dask

    With the growing adoption of the new internet protocol, XGBoost joined the club. In the latest release, the Dask interface can be used on IPv6 clusters; see XGBoost's Dask tutorial for details. (#8225, #8234)

    Optimizations

    We have new optimizations for both the hist and gpu_hist tree methods to make XGBoost's training even more efficient.

    • Hist: Hist now supports optional by-column histogram build, which is automatically configured based on various conditions of input data. This helps the XGBoost CPU hist algorithm to scale better with different shapes of training datasets. (#8233, #8259). Also, the build histogram kernel now can better utilize CPU registers (#8218)

    • GPU Hist: GPU hist performance is significantly improved for wide datasets. GPU hist now supports batched node build, which reduces kernel latency and increases throughput. The improvement is particularly significant when growing deep trees with the default depthwise policy. (#7919, #8073, #8051, #8118, #7867, #7964, #8026)

    Breaking Changes

    Breaking changes made in the 1.7 release are summarized below.

    • The grow_local_histmaker updater is removed. This updater is rarely used in practice and has no test. We decided to remove it and have XGBoost focus on other more efficient algorithms. (#7992, #8091)
    • Single precision histogram is removed due to its lack of accuracy caused by significant floating point error. In some cases the error can be difficult to detect due to log-scale operations, which makes the parameter dangerous to use. (#7892, #7828)
    • Deprecated CUDA architectures are no longer supported in the release binaries. (#7774)
    • As part of the federated learning development, the rabit module is replaced with the new collective module. It's a drop-in replacement with added runtime backend selection; see the federated learning section for more details. (#8257)
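
    The accuracy concern behind removing the single precision histogram can be reproduced with a small stdlib-only sketch. It is an illustration, not XGBoost code: a float32 accumulator is emulated by rounding a Python float through struct after every addition, and the helper names are hypothetical.

```python
import math
import struct

def to_float32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(values, single_precision):
    """Sum values left to right, optionally emulating a float32 accumulator."""
    total = 0.0
    for v in values:
        total += v
        if single_precision:
            total = to_float32(total)
    return total

values = [to_float32(0.1)] * 100_000
reference = math.fsum(values)               # correctly rounded sum
err32 = abs(accumulate(values, True) - reference)
err64 = abs(accumulate(values, False) - reference)
print(err32, err64)
```

    The single-precision error grows with the number of accumulated values, which is the situation a histogram bin is in when it sums gradients over many rows.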

    General new features and improvements

    Before diving into package-specific changes, some general new features other than those listed at the beginning are summarized here.

    • Users of DMatrix and QuantileDMatrix can get the data from XGBoost. In previous versions, only getters for meta info like labels are available. The new method is available in Python (DMatrix::get_data) and C. (#8269, #8323)
    • In previous versions, the GPU histogram tree method could generate phantom gradients for missing values due to floating point error. We fixed this error in this release, and XGBoost is much better equipped to handle floating point errors when training on GPU. (#8274, #8246)
    • Parameter validation is no longer experimental. (#8206)
    • C pointer parameters and JSON parameters are rigorously checked. (#8254)
    • Improved handling of JSON model input. (#7953, #7918)
    • Support IBM i OS (#7920, #8178)

    Fixes

    Some noteworthy bug fixes that are not related to specific language binding are listed in this section.

    • Rename misspelled config parameter for pseudo-Huber (#7904)
    • Fix feature weights with nested column sampling. (#8100)
    • Fix loading DMatrix binary in distributed env. (#8149)
    • Force auc.cc to be statically linked for unusual compiler platforms. (#8039)
    • New logic for detecting libomp on macos (#8384).

    Python Package

    • Python 3.8 is now the minimum required Python version. (#8071)

    • More progress on type hint support. Except for the new PySpark interface, the XGBoost module is fully typed. (#7742, #7945, #8302, #7914, #8052)

    • XGBoost now validates the feature names in inplace_predict, which also affects the predict function in scikit-learn estimators as it uses inplace_predict internally. (#8359)

    • Users can now get the data from DMatrix using DMatrix::get_data or QuantileDMatrix::get_data.

    • Show libxgboost.so path in build info. (#7893)

    • Raise import error when using the sklearn module while scikit-learn is missing. (#8049)

    • Use config_context in the sklearn interface. (#8141)

    • Validate features for inplace prediction. (#8359)

    • Pandas dataframe handling is refactored to reduce data fragmentation. (#7843)

    • Support more pandas nullable types (#8262)

    • Remove pyarrow workaround. (#7884)

    • Binary wheel size: We aim to enable as many features as possible in XGBoost's default binary distribution on PyPI (the package installed with pip), but there's an upper limit on the size of the binary wheel. In 1.7, XGBoost reduces the size of the wheel by pruning unused CUDA architectures. (#8179, #8152, #8150)

    • Fixes: Some noteworthy fixes are listed here:

      • Fix the Dask interface with the latest cupy. (#8210)
      • Check cuDF lazily to avoid potential errors with cuda-python. (#8084)
      • Fix potential error in DMatrix constructor on 32-bit platform. (#8369)

    • Maintenance work

      • Linter script is moved from dmlc-core to XGBoost with added support for formatting, mypy, and parallel run, along with some fixes (#7967, #8101, #8216)
      • We now require the use of isort and black for selected files. (#8137, #8096)
      • Code cleanups. (#7827)
      • Deprecate use_label_encoder in XGBClassifier. The label encoder has already been deprecated and removed in the previous version. These changes only affect the indicator parameter (#7822)
      • Remove the use of distutils. (#7770)
      • Refactor and fixes for tests (#8077, #8064, #8078, #8076, #8013, #8010, #8244, #7833)
    • Documents

      • [dask] Fix potential error in demo. (#8079)
      • Improved documentation for the ranker. (#8356, #8347)
      • Indicate lack of py-xgboost-gpu on Windows (#8127)
      • Clarification for feature importance. (#8151)
      • Simplify Python getting started example (#8153)

    R Package

    We summarize improvements for the R package briefly here:

    • Feature info including names and types are now passed to DMatrix in preparation for categorical feature support. (#804)
    • XGBoost 1.7 can now gracefully load old R models from RDS for better compatibility with third-party tuning libraries. (#7864)
    • The R package now can be built with parallel compilation, along with fixes for warnings in CRAN tests. (#8330)
    • Emit error early if DiagrammeR is missing (#8037)
    • Fix R package Windows build. (#8065)

    JVM Packages

    The consistency between JVM packages and other language bindings is greatly improved in 1.7; improvements range from the model serialization format to the default values of hyper-parameters.

    • Java package now supports feature names and feature types for DMatrix in preparation for categorical feature support. (#7966)
    • Models trained by the JVM packages can now be safely used with other language bindings. (#7896, #7907)
    • Users can specify the model format when saving models with a stream. (#7940, #7955)
    • The default value for training parameters is now sourced from XGBoost directly, which helps JVM packages be consistent with other packages. (#7938)
    • Set the correct objective if the user doesn't explicitly set it (#7781)
    • Auto-detection of MUSL is replaced by system properties (#7921)
    • Improved error message for launching tracker. (#7952, #7968)
    • Fix a race condition in parameter configuration. (#8025)
    • [Breaking] timeoutRequestWorkers is now removed. With the support for barrier mode, this parameter is no longer needed. (#7839)
    • Dependencies updates. (#7791, #8157, #7801, #8240)

    Documents

    • Documentation for the C interface is greatly improved and is now displayed on the sphinx document page. Thanks to the breathe project, you can view the C API just like the Python API. (#8300)
    • We now avoid having XGBoost internal text parser in demos and recommend users use dedicated libraries for loading data whenever it's feasible. (#7753)
    • Python survival training demos are now displayed at sphinx gallery. (#8328)
    • Some typos, links, format, and grammar fixes. (#7800, #7832, #7861, #8099, #8163, #8166, #8229, #8028, #8214, #7777, #7905, #8270, #8309, d70e59fef, #7806)
    • Updated winning solution under readme.md (#7862)
    • New security policy. (#8360)
    • GPU document is overhauled as we consider CUDA support to be feature-complete. (#8378)

    Maintenance

    • Code refactoring and cleanups. (#7850, #7826, #7910, #8332, #8204)
    • Reduce compiler warnings. (#7768, #7916, #8046, #8059, #7974, #8031, #8022)
    • Compiler workarounds. (#8211, #8314, #8226, #8093)
    • Dependencies update. (#8001, #7876, #7973, #8298, #7816)
    • Remove warnings emitted in previous versions. (#7815)
    • Small fixes occurred during development. (#8008)

    CI and Tests

    • We overhauled the CI infrastructure to reduce the CI cost and lighten the maintenance burden. Jenkins is replaced with buildkite for better automation, which allows finer control of test runs to reduce overall cost. Also, we refactored some of the existing tests to reduce their runtime, reduced the size of docker images, and removed multi-GPU C++ tests. Lastly, pytest-timeout is added as an optional dependency for running Python tests to keep the test time in check. (#7772, #8291, #8286, #8276, #8306, #8287, #8243, #8313, #8235, #8288, #8303, #8142, #8092, #8333, #8312, #8348)
    • New documents for how to reproduce the CI environment (#7971, #8297)
    • Improved automation for JVM release. (#7882)
    • GitHub Action security-related updates. (#8263, #8267, #8360)
    • Other fixes and maintenance work. (#8154, #7848, #8069, #7943)
    • Small updates and fixes to GitHub action pipelines. (#8364, #8321, #8241, #7950, #8011)
  • v1.7.0rc1(Oct 20, 2022)

    Roadmap: https://github.com/dmlc/xgboost/issues/8282 Release note: https://github.com/dmlc/xgboost/pull/8374

    Release status: https://github.com/dmlc/xgboost/issues/8366

  • v1.6.2(Aug 23, 2022)

    This is a patch release for bug fixes.

    • Remove pyarrow workaround. (#7884)
    • Fix monotone constraint with tuple input. (#7891)
    • Verify shared object version at load. (#7928)
    • Fix LTR with weighted Quantile DMatrix. (#7975)
    • Fix Python package source install. (#8036)
    • Limit max_depth to 30 for GPU. (#8098)
    • Fix compatibility with the latest cupy. (#8129)
    • [dask] Deterministic rank assignment. (#8018)
    • Fix loading DMatrix binary in distributed env. (#8149)
  • v1.6.1(May 9, 2022)

    v1.6.1 (2022 May 9)

    This is a patch release for bug fixes and Spark barrier mode support. The R package is unchanged.

    Experimental support for categorical data

    • Fix segfault when the number of samples is smaller than the number of categories. (https://github.com/dmlc/xgboost/pull/7853)
    • Enable partition-based split for all model types. (https://github.com/dmlc/xgboost/pull/7857)

    JVM packages

    We replaced the old parallelism tracker with spark barrier mode to improve the robustness of the JVM package and fix the GPU training pipeline.

    • Fix GPU training pipeline quantile synchronization. (#7823, #7834)
    • Use barrier mode in spark package. (https://github.com/dmlc/xgboost/pull/7836, https://github.com/dmlc/xgboost/pull/7840, https://github.com/dmlc/xgboost/pull/7845, https://github.com/dmlc/xgboost/pull/7846)
    • Fix shared object loading on some platforms. (https://github.com/dmlc/xgboost/pull/7844)

    Artifacts

    You can verify the downloaded packages by running this on your Unix shell:

    echo "<hash> <artifact>" | shasum -a 256 --check
    
    2633f15e7be402bad0660d270e0b9a84ad6fcfd1c690a5d454efd6d55b4e395b  ./xgboost.tar.gz
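
    Where shasum is not available, the same check can be sketched with Python's standard hashlib module. The function names below are hypothetical helpers written for this illustration; only the file name and the hash come from the listing above.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path, expected_hex):
    """True if the file's SHA-256 digest matches the published hash."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

    For example, verify("./xgboost.tar.gz", "2633f15e7be402bad0660d270e0b9a84ad6fcfd1c690a5d454efd6d55b4e395b") would be expected to return True for an intact download of this release.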
    
    xgboost.tar.gz(2.54 MB)
  • v1.6.0(Apr 16, 2022)

    v1.6.0 (2022 Apr 16)

    After a long period of development, XGBoost v1.6.0 is packed with many new features and improvements. We summarize them in the following sections starting with an introduction to some major new features, then moving on to language binding specific changes including new features and notable bug fixes for that binding.

    Development of categorical data support

    This version of XGBoost features new improvements and full coverage of experimental categorical data support in the Python and C packages with tree models. The hist, approx, and gpu_hist tree methods all now support training with categorical data. Also, partition-based categorical splits are introduced in this release. This split type was first available in LightGBM in the context of gradient boosting. The previous XGBoost release supported one-hot splits, where the splitting criterion is of the form x \in {c}, i.e. the categorical feature x is tested against a single candidate. The new release allows for more expressive conditions: x \in S, where the categorical feature x is tested against multiple candidates. Moreover, it is now possible to use any of the tree algorithms (hist, approx, gpu_hist) when creating categorical splits. For more information, please see our tutorial on categorical data, along with examples linked on that page. (#7380, #7708, #7695, #7330, #7307, #7322, #7705, #7652, #7592, #7666, #7576, #7569, #7529, #7575, #7393, #7465, #7385, #7371, #7745, #7810)
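
    The two split forms can be contrasted with a minimal sketch. It only illustrates the conditions x \in {c} and x \in S; the function names and the color data are hypothetical, and nothing here reflects how XGBoost stores or evaluates splits internally.

```python
def matches_one_hot(value, category):
    """One-hot split: the feature is tested against a single category c."""
    return value == category

def matches_partition(value, category_set):
    """Partition-based split: the feature is tested against a set S."""
    return value in category_set

colors = ["red", "green", "blue", "green", "yellow"]
one_hot = [matches_one_hot(c, "green") for c in colors]
partition = [matches_partition(c, {"green", "blue"}) for c in colors]
print(one_hot)
print(partition)
```

    A single partition-based split can separate {green, blue} from the rest, whereas one-hot splits would need two nodes to express the same condition.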

    In the future, we will continue to improve categorical data support with new features and optimizations. Also, we are looking forward to bringing the feature beyond the Python binding; contributions and feedback are welcome! Lastly, as a result of its experimental status, the behavior might be subject to change, especially the default values of related hyper-parameters.

    Experimental support for multi-output model

    XGBoost 1.6 features initial support for the multi-output model, which includes multi-output regression and multi-label classification. Along with this, the XGBoost classifier has proper support for base margin without the need for the user to flatten the input. In this initial support, XGBoost builds one model for each target, similar to the sklearn meta estimator. For more details, please see our quick introduction.

    (#7365, #7736, #7607, #7574, #7521, #7514, #7456, #7453, #7455, #7434, #7429, #7405, #7381)

    External memory support

    External memory support for both the approx and hist tree methods is considered feature complete in XGBoost 1.6. Building upon the iterator-based interface introduced in the previous version, both hist and approx now iterate over each batch of data during training and prediction. In previous versions, hist concatenated all the batches into an internal representation, which is removed in this version. As a result, users can expect higher scalability in terms of data size but might experience lower performance due to disk IO. (#7531, #7320, #7638, #7372)
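
    The idea of iterating over batches instead of concatenating them can be sketched without XGBoost at all. The code below is a simplified, hypothetical illustration: iter_batches stands in for a data iterator, and a simple count histogram stands in for the statistics gathered during training. It shows that accumulating per batch gives the same result as processing the full data at once while holding only one batch at a time.

```python
def iter_batches(data, batch_size):
    """Hypothetical stand-in for a data iterator yielding one batch at a time."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

def histogram(values, edges):
    """Count values per bin; bin k covers edges[k] <= v < edges[k + 1]."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for k in range(len(counts)):
            if edges[k] <= v < edges[k + 1]:
                counts[k] += 1
                break
    return counts

def histogram_streaming(batch_iter, edges):
    """Accumulate the histogram batch by batch, never joining the batches."""
    total = [0] * (len(edges) - 1)
    for batch in batch_iter:
        for k, c in enumerate(histogram(batch, edges)):
            total[k] += c
    return total

data = [0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.95]
edges = [0.0, 0.25, 0.5, 0.75, 1.0]
full = histogram(data, edges)
streamed = histogram_streaming(iter_batches(data, batch_size=3), edges)
print(full, streamed)
```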

    Rewritten approx

    The approx tree method is rewritten based on the existing hist tree method. The rewrite closes the feature gap between approx and hist and improves the performance. Now the behavior of approx should be more aligned with hist and gpu_hist. Here is a list of user-visible changes:

    • Supports both max_leaves and max_depth.
    • Supports grow_policy.
    • Supports monotonic constraint.
    • Supports feature weights.
    • Use max_bin to replace sketch_eps.
    • Supports categorical data.
    • Faster performance for many of the datasets.
    • Improved performance and robustness for distributed training.
    • Supports prediction cache.
    • Significantly better performance for external memory when depthwise policy is used.

    New serialization format

    Based on the existing JSON serialization format, we introduce UBJSON support as a more efficient alternative. Both formats will be available in the future and we plan to gradually phase out support for the old binary model format. Users can opt to use the different formats in the serialization function by providing the file extension json or ubj. Also, the save_raw function in all supported language bindings gains a new parameter for exporting the model in different formats. Available options are json, ubj, and deprecated; see the document for the language binding you are using for details. Lastly, the default internal serialization format is set to UBJSON, which affects Python pickle and R RDS. (#7572, #7570, #7358, #7571, #7556, #7549, #7416)

    General new features and improvements

    Aside from the major new features mentioned above, some others are summarized here:

    • Users can now access the build information of XGBoost binary in Python and C interface. (#7399, #7553)
    • Auto-configuration of seed_per_iteration is removed. Distributed training should now generate results closer to single node training when sampling is used. (#7009)
    • A new parameter huber_slope is introduced for the Pseudo-Huber objective.
    • During source build, XGBoost can choose cub in the system path automatically. (#7579)
    • XGBoost now honors the CPU counts from CFS, which is usually set in docker environments. (#7654, #7704)
    • The metric aucpr is rewritten for better performance and GPU support. (#7297, #7368)
    • Metric calculation is now performed in double precision. (#7364)
    • XGBoost no longer mutates the global OpenMP thread limit. (#7537, #7519, #7608, #7590, #7589, #7588, #7687)
    • The default behavior of max_leaves and max_depth is now unified. (#7302, #7551)
    • CUDA fat binary is now compressed. (#7601)
    • Deterministic result for evaluation metric and linear model. In previous versions of XGBoost, evaluation results might differ slightly for each run due to parallel reduction for floating-point values, which is now addressed. (#7362, #7303, #7316, #7349)
    • XGBoost now uses double for GPU Hist node sum, which improves the accuracy of gpu_hist. (#7507)
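
    The note above about run-to-run differences in evaluation results comes down to floating-point addition not being associative, so a parallel reduction that groups terms differently can give a different answer. A minimal stdlib sketch of the effect follows. It is only an illustration: math.fsum is used here as an example of an order-independent sum and is not a statement about how XGBoost addresses the issue.

```python
import math

values = [0.1, 0.2, 0.3]

# Two groupings of the same three numbers, as two reduction orders might do.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

# A correctly rounded sum does not depend on the order of its inputs.
forward = math.fsum(values)
backward = math.fsum(reversed(values))

print(left_to_right == right_to_left)
print(forward == backward)
```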

    Performance improvements

    Most of the performance improvements are integrated into other refactors during feature development. The approx tree method should see a significant performance gain for many datasets as mentioned in the previous section, while the hist tree method also enjoys improved performance with the removal of the internal pruner along with some other refactoring. Lastly, gpu_hist no longer synchronizes the device during training. (#7737)

    General bug fixes

    This section lists bug fixes that are not specific to any language binding.

    • The num_parallel_tree is now a model parameter instead of a training hyper-parameter, which fixes model IO with random forest. (#7751)
    • Fixes in CMake script for exporting configuration. (#7730)
    • XGBoost can now handle unsorted sparse input. This includes text file formats like libsvm and scipy sparse matrix where column index might not be sorted. (#7731)
    • Fix tree param feature type, this affects inputs with the number of columns greater than the maximum value of int32. (#7565)
    • Fix external memory with gpu_hist and subsampling. (#7481)
    • Check the number of trees in inplace predict, this avoids a potential segfault when an incorrect value for iteration_range is provided. (#7409)
    • Fix non-stable result in cox regression (#7756)
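
    For readers unfamiliar with what "unsorted sparse input" means in the list above, here is a small sketch using a CSR layout held in plain Python lists. The helper name is hypothetical, and the code does not describe how XGBoost handles such input internally; it only shows a matrix whose column indices are out of order within a row, and the equivalent sorted form.

```python
def sort_csr_indices(indptr, indices, data):
    """Return copies of indices/data with columns sorted within each row."""
    new_indices, new_data = [], []
    for r in range(len(indptr) - 1):
        lo, hi = indptr[r], indptr[r + 1]
        pairs = sorted(zip(indices[lo:hi], data[lo:hi]))
        new_indices.extend(c for c, _ in pairs)
        new_data.extend(v for _, v in pairs)
    return new_indices, new_data

# Two rows; row 0 stores columns 2, 0, 1 and row 1 stores columns 3, 1.
indptr = [0, 3, 5]
indices = [2, 0, 1, 3, 1]
data = [30.0, 10.0, 20.0, 40.0, 50.0]
sorted_indices, sorted_data = sort_csr_indices(indptr, indices, data)
print(sorted_indices, sorted_data)
```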

    Changes in the Python package

    Other than the changes in Dask, the XGBoost Python package gained some new features and improvements along with small bug fixes.

    • Python 3.7 is required as the lowest Python version. (#7682)
    • Pre-built binary wheel for Apple Silicon. (#7621, #7612, #7747) Apple Silicon users will now be able to run pip install xgboost to install XGBoost.
    • MacOS users no longer need to install libomp from Homebrew, as the XGBoost wheel now bundles libomp.dylib library.
    • There are new parameters for users to specify the custom metric with new behavior. XGBoost can now output transformed prediction values when a custom objective is not supplied. See our explanation in the tutorial for details.
    • For the sklearn interface, following the estimator guideline from scikit-learn, all parameters in fit that are not related to input data are moved into the constructor and can be set by set_params. (#6751, #7420, #7375, #7369)
    • Apache arrow format is now supported, which can bring better performance to users' pipeline (#7512)
    • Pandas nullable types are now supported (#7760)
    • A new function get_group is introduced for DMatrix to allow users to get the group information in the custom objective function. (#7564)
    • More training parameters are exposed in the sklearn interface instead of relying on the **kwargs. (#7629)
    • A new attribute feature_names_in_ is defined for all sklearn estimators like XGBRegressor to follow the convention of sklearn. (#7526)
    • More work on Python type hint. (#7432, #7348, #7338, #7513, #7707)
    • Support the latest pandas Index type. (#7595)
    • Fix for Feature shape mismatch error on s390x platform (#7715)
    • Fix using feature names for constraints with multiple groups (#7711)
    • We clarified the behavior of the callback function when it contains mutable states. (#7685)
    • Lastly, there are some code cleanups and maintenance work. (#7585, #7426, #7634, #7665, #7667, #7377, #7360, #7498, #7438, #7667, #7752, #7749, #7751)

    Changes in the Dask interface

    • Dask module now supports user-supplied host IP and port address of scheduler node. Please see introduction and API document for reference. (#7645, #7581)
    • Internal DMatrix construction in dask now honors thread configuration. (#7337)
    • A fix for nthread configuration using the Dask sklearn interface. (#7633)
    • The Dask interface can now handle empty partitions. An empty partition is different from an empty worker: the latter refers to a worker that has no partition of an input dataset, while the former refers to partitions on a worker that have zero size. (#7644, #7510)
    • Scipy sparse matrix is supported as Dask array partition. (#7457)
    • Dask interface is no longer considered experimental. (#7509)

    Changes in the R package

    This section summarizes the new features, improvements, and bug fixes to the R package.

    • load.raw can optionally construct and return a booster. (#7686)
    • Fix parsing decision stump, which affects both transforming text representation to data table and plotting. (#7689)
    • Implement feature weights. (#7660)
    • Some improvements for complying with the CRAN release policy. (#7672, #7661, #7763)
    • Support CSR data for predictions (#7615)
    • Document update (#7263, #7606)
    • New maintainer for the CRAN package (#7691, #7649)
    • Handle non-standard installation of toolchain on macos (#7759)

    Changes in JVM-packages

    Some new features for JVM-packages are introduced for a more integrated GPU pipeline and better compatibility with musl-based Linux. Aside from this, we have a few notable bug fixes.

    • Users can specify the tracker IP address for training, which helps when running XGBoost in restricted network environments. (#7808)
    • Add support for detecting musl-based Linux (#7624)
    • Add DeviceQuantileDMatrix to Scala binding (#7459)
    • Add Rapids plugin support, now more of the JVM pipeline can be accelerated by RAPIDS (#7491, #7779, #7793, #7806)
    • The setters for CPU and GPU are more aligned (#7692, #7798)
    • Control logging for early stopping (#7326)
    • Do not repartition when nWorker = 1 (#7676)
    • Fix the prediction issue for multi:softmax (#7694)
    • Fix for serialization of custom objective and eval (#7274)
    • Update documentation about Python tracker (#7396)
    • Remove jackson from dependency, which fixes CVE-2020-36518. (#7791)
    • Some refactoring to the training pipeline for better compatibility between CPU and GPU. (#7440, #7401, #7789, #7784)
    • Maintenance work. (#7550, #7335, #7641, #7523, #6792, #4676)

    Deprecation

    Other than the changes in the Python package and serialization, we removed some deprecated features in previous releases. Also, as mentioned in the previous section, we plan to phase out the old binary format in future releases.

    • Remove old warning in 1.3 (#7279)
    • Remove label encoder deprecated in 1.3. (#7357)
    • Remove old callback deprecated in 1.3. (#7280)
    • Pre-built binary will no longer support deprecated CUDA architectures including sm35 and sm50. Users can continue to use these platforms with source build. (#7767)

    Documentation

    This section lists some of the general changes to XGBoost's documentation. For language binding specific changes, please visit the related sections.

    • Document is overhauled to use the new RTD theme, along with integration of Python examples using Sphinx gallery. Also, we replaced most of the hard-coded URLs with sphinx references. (#7347, #7346, #7468, #7522, #7530)
    • Small update along with fixes for broken links, typos, etc. (#7684, #7324, #7334, #7655, #7628, #7623, #7487, #7532, #7500, #7341, #7648, #7311)
    • Update document for GPU. [skip ci] (#7403)
    • Document the status of RTD hosting. (#7353)
    • Update document for building from source. (#7664)
    • Add note about CRAN release [skip ci] (#7395)

    Maintenance

    This is a summary of maintenance work that is not specific to any language binding.

    • Add CMake option to use /MD runtime (#7277)
    • Add clang-format configuration. (#7383)
    • Code cleanups (#7539, #7536, #7466, #7499, #7533, #7735, #7722, #7668, #7304, #7293, #7321, #7356, #7345, #7387, #7577, #7548, #7469, #7680, #7433, #7398)
    • Improved tests with better coverage and latest dependency (#7573, #7446, #7650, #7520, #7373, #7723, #7611, #7771)
    • Improved automation of the release process. (#7278, #7332, #7470)
    • Compiler workarounds (#7673)
    • Change shebang used in CLI demo. (#7389)
    • Update affiliation (#7289)

    CI

    Some fixes and updates to XGBoost's CI infrastructure. (#7739, #7701, #7382, #7662, #7646, #7582, #7407, #7417, #7475, #7474, #7479, #7472, #7626)

    xgboost.tar.gz(2.54 MB)
    xgboost_r_gpu_linux_1.6.0.tar.gz(94.00 MB)
    xgboost_r_gpu_win64_1.6.0.tar.gz(120.28 MB)
  • v1.6.0rc1(Mar 30, 2022)

  • v1.5.2(Jan 17, 2022)

    This is a patch release for compatibility with latest dependencies and bug fixes.

    • [dask] Fix asyncio with latest dask and distributed.
    • [R] Fix single sample SHAP prediction.
    • [Python] Update python classifier to indicate support for latest Python versions.
    • [Python] Fix with latest mypy and pylint.
    • Fix indexing type for bitfield, which may affect missing value and categorical data.
    • Fix num_boosted_rounds for linear model.
    • Fix early stopping with linear model.
  • v1.5.1(Nov 23, 2021)

    This is a patch release for compatibility with the latest dependencies and bug fixes. Also, all GPU-compatible binaries are built with CUDA 11.0.

    • [Python] Handle missing values in dataframe with category dtype. (#7331)

    • [R] Fix R CRAN failures about prediction and some compiler warnings.

    • [JVM packages] Fix compatibility with latest Spark (#7438, #7376)

    • Support building with CTK11.5. (#7379)

    • Check user input for iteration in inplace predict.

    • Handle OMP_THREAD_LIMIT environment variable.

    • [doc] Fix broken links. (#7341)

    Artifacts

    You can verify the downloaded packages by running this on your Unix shell:

    echo "<hash> <artifact>" | shasum -a 256 --check
    
    3a6cc7526c0dff1186f01b53dcbac5c58f12781988400e2d340dda61ef8d14ca  xgboost_r_gpu_linux_afb9dfd4210e8b8db8fe03380f83b404b1721443.tar.gz
    6f74deb62776f1e2fd030e1fa08b93ba95b32ac69cc4096b4bcec3821dd0a480  xgboost_r_gpu_win64_afb9dfd4210e8b8db8fe03380f83b404b1721443.tar.gz
    565dea0320ed4b6f807dbb92a8a57e86ec16db50eff9a3f405c651d1f53a259d  xgboost.tar.gz
    
    xgboost.tar.gz(2.29 MB)
    xgboost_r_gpu_linux_afb9dfd4210e8b8db8fe03380f83b404b1721443.tar.gz(80.41 MB)
    xgboost_r_gpu_win64_afb9dfd4210e8b8db8fe03380f83b404b1721443.tar.gz(100.50 MB)
  • v1.5.0(Oct 17, 2021)

    This release comes with many exciting new features and optimizations, along with some bug fixes. We will describe the experimental categorical data support and the external memory interface independently. Package-specific new features will be listed in respective sections.

    Development on categorical data support

    In version 1.3, XGBoost introduced an experimental feature for handling categorical data natively, without one-hot encoding. XGBoost can fit categorical splits in decision trees. (Currently, the generated splits will be of form x \in {v}, where the input is compared to a single category value. A future version of XGBoost will generate splits that compare the input against a list of multiple category values.)

    Most of the other features, including prediction, SHAP value computation, feature importance, and model plotting were revised to natively handle categorical splits. Also, all Python interfaces including native interface with and without quantized DMatrix, scikit-learn interface, and Dask interface now accept categorical data with a wide range of data structures support including numpy/cupy array and cuDF/pandas/modin dataframe. In practice, the following are required for enabling categorical data support during training:

    • Use Python package.
    • Use gpu_hist to train the model.
    • Use JSON model file format for saving the model.

    Once the model is trained, it can be used with most of the features that are available on the Python package. For a quick introduction, see https://xgboost.readthedocs.io/en/latest/tutorials/categorical.html

    Related PRs: (#7011, #7001, #7042, #7041, #7047, #7043, #7036, #7054, #7053, #7065, #7213, #7228, #7220, #7221, #7231, #7306)

    • Next steps

      • Revise the CPU training algorithm to handle categorical data natively and generate categorical splits
      • Extend the CPU and GPU algorithms to generate categorical splits of form x \in S where the input is compared with multiple category values. (#7081)

    External memory

    This release features a brand-new interface and implementation for external memory (also known as out-of-core training). (#6901, #7064, #7088, #7089, #7087, #7092, #7070, #7216). The new implementation leverages the data iterator interface, which is currently used to create DeviceQuantileDMatrix. For a quick introduction, see https://xgboost.readthedocs.io/en/latest/tutorials/external_memory.html#data-iterator . During the development of this new interface, lz4 compression is removed. (#7076). Please note that external memory support is still experimental and not ready for production use yet. All future development will focus on this new interface and users are advised to migrate. (You are using the old interface if you are using a URL suffix to use external memory.)

    New features in Python package

    • Support numpy array interface and all numeric types from numpy in DMatrix construction and inplace_predict (#6998, #7003). XGBoost no longer copies data when the input is a numpy array view.
    • The early stopping callback in Python has a new min_delta parameter to control the stopping behavior (#7137)
    • Python package now supports calculating feature scores for the linear model, which is also available on R package. (#7048)
    • Python interface now supports configuring constraints using feature names instead of feature indices.
    • Typehint support for more Python code including scikit-learn interface and rabit module. (#6799, #7240)
    • Add tutorial for XGBoost-Ray (#6884)
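
    The effect of the min_delta parameter listed above can be pictured with a small standalone sketch. This is not XGBoost's callback: the function name, the "lower is better" assumption, and the exact comparison rule are assumptions made for illustration. An iteration counts as an improvement only when it beats the best score so far by more than min_delta, so small gains no longer reset the patience counter.

```python
def early_stopping_round(scores, rounds, min_delta=0.0):
    """Return the iteration index at which training would stop, or None.

    Assumes lower scores are better (hypothetical rule for this sketch).
    """
    best = float("inf")
    stale = 0  # consecutive iterations without a sufficient improvement
    for i, s in enumerate(scores):
        if best - s > min_delta:
            best = s
            stale = 0
        else:
            stale += 1
            if stale >= rounds:
                return i
    return None

scores = [1.00, 0.90, 0.895, 0.893, 0.892, 0.80]
print(early_stopping_round(scores, rounds=2))
print(early_stopping_round(scores, rounds=2, min_delta=0.01))
```

    With min_delta left at zero every iteration in this toy sequence counts as progress, while a threshold of 0.01 treats the small gains after the second iteration as a plateau and stops early.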

    New features in R package

    • In 1.4 we have a new prediction function in the C API which is used by the Python package. This release revises the R package to use the new prediction function as well. A new parameter iteration_range for the predict function is available, which can be used for specifying the range of trees for running prediction. (#6819, #7126)
    • R package now supports the nthread parameter in DMatrix construction. (#7127)

    New features in JVM packages

    • Support GPU dataframe and DeviceQuantileDMatrix (#7195). Constructing DMatrix with GPU data structures and the interface for quantized DMatrix were first introduced in the Python package and are now available in the xgboost4j package.
    • JVM packages now support saving and getting early stopping attributes. (#7095) Here is a quick example in Java (#7252).

    General new features

    • We now have a pre-built binary package for R on Windows with GPU support. (#7185)
    • CUDA compute capability 86 is now part of the default CMake build configuration with newly added support for CUDA 11.4. (#7131, #7182, #7254)
    • XGBoost can be compiled using system CUB provided by CUDA 11.x installation. (#7232)

    Optimizations

    The performance for both hist and gpu_hist has been significantly improved in 1.5 with the following optimizations:

    • GPU multi-class model training now supports prediction cache. (#6860)
    • GPU histogram building is sped up and the overall training time is 2-3 times faster on large datasets (#7180, #7198). In addition, we removed the parameter deterministic_histogram and now the GPU algorithm is always deterministic.
    • CPU hist has an optimized procedure for data sampling (#6922)
    • More performance optimization in regression and binary classification objectives on CPU (#7206)
    • Tree model dump is now performed in parallel (#7040)

    Breaking changes

    • n_gpus was deprecated in 1.0 release and is now removed.
    • Feature grouping in CPU hist tree method is removed, which was disabled long ago. (#7018)
    • C API for Quantile DMatrix is changed to be consistent with the new external memory implementation. (#7082)

    Notable general bug fixes

    • XGBoost no longer changes the global CUDA device ordinal when gpu_id is specified (#6891, #6987)
    • Fix gamma negative likelihood evaluation metric. (#7275)
    • Fix integer value of verbose_eval for xgboost.cv function in Python. (#7291)
    • Remove extra sync in CPU hist for dense data, which can lead to incorrect tree node statistics. (#7120, #7128)
    • Fix a bug in GPU hist when data size is larger than UINT32_MAX with missing values. (#7026)
    • Fix a thread safety issue in prediction with the softmax objective. (#7104)
    • Fix a thread safety issue in CPU SHAP value computation. (#7050) Please note that all prediction functions in Python are thread-safe.
    • Fix model slicing. (#7149, #7078)
    • Work around a bug in old GCC which can lead to segfault during construction of DMatrix. (#7161)
    • Fix histogram truncation in GPU hist, which can lead to slightly-off results. (#7181)
    • Fix loading GPU linear model pickle files on CPU-only machine. (#7154)
    • Check whether an input value is duplicated when the CPU quantile queue is full (#7091)
    • Fix parameter loading with training continuation. (#7121)
    • Fix CMake interface for exposing C library by specifying dependencies. (#7099)
    • Callback and early stopping are explicitly disabled for the scikit-learn interface random forest estimator. (#7236)
    • Fix compilation error on x86 (32-bit machine) (#6964)
    • Fix CPU memory usage with extremely sparse datasets (#7255)
    • Fix a bug in GPU multi-class AUC implementation with weighted data (#7300)

    Python package

    Other than the items mentioned in the previous sections, there are some Python-specific improvements.

    • Change development release postfix to dev (#6988)
    • Fix early stopping behavior with MAPE metric (#7061)
    • Fixed incorrect feature mismatch error message (#6949)
    • Add predictor to skl constructor. (#7000, #7159)
    • Re-enable feature validation in predict proba. (#7177)
    • The scikit-learn interface regression estimator can now pass the scikit-learn estimator check and is fully compatible with scikit-learn utilities. __sklearn_is_fitted__ is implemented as part of the changes (#7130, #7230)
    • Conform to the latest pylint. (#7071, #7241)
    • Support the latest pandas range index in DMatrix construction. (#7074)
    • Fix DMatrix construction from pandas series. (#7243)
    • Fix typo and grammatical mistake in error message (#7134)
    • [dask] disable work stealing explicitly for training tasks (#6794)
    • [dask] Set dataframe index in predict. (#6944)
    • [dask] Fix prediction on df with latest dask. (#6969)
    • [dask] Fix dask predict on DaskDMatrix with iteration_range. (#7005)
    • [dask] Disallow importing non-dask estimators from xgboost.dask (#7133)

    R package

    Improvements other than new features on R package:

    • Optimization for updating R handles in-place (#6903)
    • Removed the magrittr dependency. (#6855, #6906, #6928)
    • The R package now hides all C++ symbols to avoid conflicts. (#7245)
    • Other maintenance including code cleanups, document updates. (#6863, #6915, #6930, #6966, #6967)

    JVM packages

    Improvements other than new features on JVM packages:

    • Constructors with implicit missing value are deprecated due to confusing behaviors. (#7225)
    • Reduce scala-compiler, scalatest dependency scopes (#6730)
    • Make the Java library loader emit helpful error messages on missing dependencies. (#6926)
    • JVM packages now use the Python tracker in XGBoost instead of dmlc. The one in XGBoost is shared between JVM packages and Python Dask and enjoys better maintenance (#7132)
    • Fix "key not found: train" error (#6842)
    • Fix model loading from stream (#7067)

    General document improvements

    • Overhaul the installation documents. (#6877)
    • A few demos are added for AFT with dask (#6853), callback with dask (#6995), inference in C (#7151), process_type. (#7135)
    • Fix PDF format of document. (#7143)
    • Clarify the behavior of use_rmm. (#6808)
    • Clarify prediction function. (#6813)
    • Improve tutorial on feature interactions (#7219)
    • Add small example for dask sklearn interface. (#6970)
    • Update Python intro. (#7235)
    • Some fixes/updates (#6810, #6856, #6935, #6948, #6976, #7084, #7097, #7170, #7173, #7174, #7226, #6979, #6809, #6796)

    Maintenance

    • Some refactoring around CPU hist, which leads to better performance but is listed under general maintenance tasks:

      • Extract evaluate splits from CPU hist. (#7079)
      • Merge lossguide and depthwise strategies for CPU hist (#7007)
      • Simplify sparse and dense CPU hist kernels (#7029)
      • Extract histogram builder from CPU Hist. (#7152)
    • Others

      • Fix gpu_id with custom objective. (#7015)
      • Fix typos in AUC. (#6795)
      • Use constexpr in dh::CopyIf. (#6828)
      • Update dmlc-core. (#6862)
      • Bump version to 1.5.0 snapshot in master. (#6875)
      • Relax shotgun test. (#6900)
      • Guard against index error in prediction. (#6982)
      • Hide symbols in CI build + hide symbols for C and CUDA (#6798)
      • Persist data in dask test. (#7077)
      • Fix typo in arguments of PartitionBuilder::Init (#7113)
      • Fix typo in src/common/hist.cc BuildHistKernel (#7116)
      • Use upstream URI in distributed quantile tests. (#7129)
      • Include cpack (#7160)
      • Remove synchronization in monitor. (#7164)
      • Remove unused code. (#7175)
      • Fix building on CUDA 11.0. (#7187)
      • Better error message for ncclUnhandledCudaError. (#7190)
      • Add noexcept to JSON objects. (#7205)
      • Improve wording for warning (#7248)
      • Fix typo in release script. [skip ci] (#7238)
      • Relax shotgun test. (#6918)
      • Relax test for decision stump in distributed environment. (#6919)
      • [dask] speed up tests (#7020)

    CI

    • [CI] Rotate access keys for uploading MacOS artifacts from Travis CI (#7253)
    • Reduce Travis environment setup time. (#6912)
    • Restore R cache on github action. (#6985)
    • [CI] Remove stray build artifact to avoid error in artifact packaging (#6994)
    • [CI] Move appveyor tests to action (#6986)
    • Remove appveyor badge. [skip ci] (#7035)
    • [CI] Configure RAPIDS, dask, modin (#7033)
    • Test on s390x. (#7038)
    • [CI] Upgrade to CMake 3.14 (#7060)
    • [CI] Update R cache. (#7102)
    • [CI] Pin libomp to 11.1.0 (#7107)
    • [CI] Upgrade build image to CentOS 7 + GCC 8; require CUDA 10.1 and later (#7141)
    • [dask] Work around segfault in prediction. (#7112)
    • [dask] Remove the workaround for segfault. (#7146)
    • [CI] Fix hanging Python setup in Windows CI (#7186)
    • [CI] Clean up in beginning of each task in Win CI (#7189)
    • Fix travis. (#7237)

    Acknowledgement

    • Contributors: Adam Pocock (@Craigacp), Jeff H (@JeffHCross), Johan Hansson (@JohanWork), Jose Manuel Llorens (@JoseLlorensRipolles), Benjamin Szőke (@Livius90), @ReeceGoding, @ShvetsKS, Robert Zabel (@ZabelTech), Ali (@ali5h), Andrew Ziem (@az0), Andy Adinets (@canonizer), @david-cortes, Daniel Saxton (@dsaxton), Emil Sadek (@esadek), @farfarawayzyt, Gil Forsyth (@gforsyth), @giladmaya, @graue70, Philip Hyunsu Cho (@hcho3), James Lamb (@jameslamb), José Morales (@jmoralez), Kai Fricke (@krfricke), Christian Lorentzen (@lorentzenchr), Mads R. B. Kristensen (@madsbk), Anton Kostin (@masguit42), Martin Petříček (@mpetricek-corp), @naveenkb, Taewoo Kim (@oOTWK), Viktor Szathmáry (@phraktle), Robert Maynard (@robertmaynard), TP Boudreau (@tpboudreau), Jiaming Yuan (@trivialfis), Paul Taylor (@trxcllnt), @vslaykovsky, Bobby Wang (@wbo4958),
    • Reviewers: Nan Zhu (@CodingCat), Adam Pocock (@Craigacp), Jose Manuel Llorens (@JoseLlorensRipolles), Kodi Arfer (@Kodiologist), Benjamin Szőke (@Livius90), Mark Guryanov (@MarkGuryanov), Rory Mitchell (@RAMitchell), @ReeceGoding, @ShvetsKS, Egor Smirnov (@SmirnovEgorRu), Andrew Ziem (@az0), @candalfigomoro, Andy Adinets (@canonizer), Dante Gama Dessavre (@dantegd), @david-cortes, Daniel Saxton (@dsaxton), @farfarawayzyt, Gil Forsyth (@gforsyth), Harutaka Kawamura (@harupy), Philip Hyunsu Cho (@hcho3), @jakirkham, James Lamb (@jameslamb), José Morales (@jmoralez), James Bourbeau (@jrbourbeau), Christian Lorentzen (@lorentzenchr), Martin Petříček (@mpetricek-corp), Nikolay Petrov (@napetrov), @naveenkb, Viktor Szathmáry (@phraktle), Robin Teuwens (@rteuwens), Yuan Tang (@terrytangyuan), TP Boudreau (@tpboudreau), Jiaming Yuan (@trivialfis), @vkuzmin-uber, Bobby Wang (@wbo4958), William Hicks (@wphicks)

    Artifacts

    You can verify the downloaded packages by running this on your unix shell:

    echo "<hash> <artifact>" | shasum -a 256 --check
    
    2c63e8abd3e89795ac9371688daa31109a9514eebd9db06956ba5aa41d0c0e20  xgboost_r_gpu_linux_1.5.0.tar.gz
    8b19f817dcb6b601b0abffa9cf943ee92c3e9a00f56fa3f4fcdfe98cd3777c04  xgboost_r_gpu_win64_1.5.0.tar.gz
    25ee3adb9925d0529575c0f00a55ba42202a1cdb5fdd3fb6484b4088571326a5  xgboost.tar.gz
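
    Where shasum is not available, the same check can be sketched in Python with the standard hashlib module. The helper names sha256_of_file and verify_artifact and the local file path are illustrative assumptions, not part of the release tooling.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_hash):
    """Return True when the file's SHA-256 digest matches the published hash."""
    return sha256_of_file(path) == expected_hash.strip().lower()


if __name__ == "__main__":
    # Hash copied from the list above; the local file location is an assumption.
    expected = "25ee3adb9925d0529575c0f00a55ba42202a1cdb5fdd3fb6484b4088571326a5"
    artifact = Path("xgboost.tar.gz")
    if artifact.exists():
        print("OK" if verify_artifact(artifact, expected) else "FAILED")
    else:
        print(f"{artifact} not found; download it first")
```

    Reading in chunks keeps memory use flat even for the larger GPU artifacts.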
    
    Source code(tar.gz)
    Source code(zip)
    xgboost.tar.gz(2.26 MB)
    xgboost_r_gpu_linux_1.5.0.tar.gz(80.41 MB)
    xgboost_r_gpu_win64_1.5.0.tar.gz(100.50 MB)
  • v1.5.0rc1(Sep 26, 2021)

  • v1.4.2(May 13, 2021)

    This is a patch release for Python package with following fixes:

    • Handle the latest version of cupy.ndarray in inplace_predict. https://github.com/dmlc/xgboost/pull/6933
    • Ensure output array from predict_leaf is (n_samples, ) when there's only 1 tree. 1.4.0 outputs (n_samples, 1). https://github.com/dmlc/xgboost/pull/6889
    • Fix empty dataset handling with multi-class AUC. https://github.com/dmlc/xgboost/pull/6947
    • Handle object type from pandas in inplace_predict. https://github.com/dmlc/xgboost/pull/6927

    You can verify the downloaded source code xgboost.tar.gz by running this on your unix shell:

    echo "3ffd4a90cd03efde596e51cadf7f344c8b6c91aefd06cc92db349cd47056c05a *xgboost.tar.gz" | shasum -a 256 --check
    
    Source code(tar.gz)
    Source code(zip)
    xgboost.tar.gz(2.20 MB)
  • v1.4.1(Apr 20, 2021)

  • v1.4.0(Apr 11, 2021)

    Introduction of pre-built binary package for R, with GPU support

    Starting with release 1.4.0, users now have the option of installing {xgboost} without having to build it from the source. This is particularly advantageous for users who want to take advantage of the GPU algorithm (gpu_hist), as previously they'd have to build {xgboost} from the source using CMake and NVCC. Now installing {xgboost} with GPU support is as easy as: R CMD INSTALL ./xgboost_r_gpu_linux.tar.gz. (#6827)

    See the instructions at https://xgboost.readthedocs.io/en/latest/build.html

    Improvements on prediction functions

    XGBoost has many prediction types including shap value computation and inplace prediction. In 1.4 we overhauled the underlying prediction functions for C API and Python API with a unified interface. (#6777, #6693, #6653, #6662, #6648, #6668, #6804)

    • Starting with 1.4, sklearn interface prediction will use inplace predict by default when input data is supported.
    • Users can use inplace predict with dart booster and enable GPU acceleration just like gbtree.
    • Also all prediction functions with tree models are now thread-safe. Inplace predict is improved with base_margin support.
    • A new set of C predict functions are exposed in the public interface.
    • A user-visible change is a newly added parameter called strict_shape. See https://xgboost.readthedocs.io/en/latest/prediction.html for more details.

    Improvement on Dask interface

    • Starting with 1.4, the Dask interface is considered to be feature-complete, which means all of the models found in the single node Python interface are now supported in Dask, including but not limited to ranking and random forest. Also, the prediction function is significantly faster and supports shap value computation.

      • Most of the parameters found in single node sklearn interface are supported by Dask interface. (#6471, #6591)
      • Implements learning to rank. On the Dask interface, we use the newly added support of query ID to enable group structure. (#6576)
      • The Dask interface has Python type hints support. (#6519)
      • All models can be safely pickled. (#6651)
      • Random forest estimators are now supported. (#6602)
      • Shap value computation is now supported. (#6575, #6645, #6614)
      • Evaluation result is printed on the scheduler process. (#6609)
      • DaskDMatrix (and device quantile dmatrix) now accepts all meta-information. (#6601)
    • Prediction optimization. We enhanced and sped up the prediction function for the Dask interface. See the latest Dask tutorial page in our documentation for an overview of how you can optimize it even further. (#6650, #6645, #6648, #6668)

    • Bug fixes

      • If you are using the latest Dask and distributed where distributed.MultiLock is present, XGBoost supports training multiple models on the same cluster in parallel. (#6743)
      • Fix a bug where XGBoost might use a different client object internally when dask.client is used to launch an async task. (#6722)
    • Other improvements on documents, blogs, tutorials, and demos. (#6389, #6366, #6687, #6699, #6532, #6501)

    Python package

    With changes from Dask and general improvement on prediction, we have made some enhancements on the general Python interface and IO for booster information. Starting from 1.4, booster feature names and types can be saved into the JSON model. Also some model attributes like best_iteration, best_score are restored upon model load. On the sklearn interface, some attributes are now implemented as Python object properties with better documentation.

    • Breaking change: All data parameters in prediction functions are renamed to X for better compliance to sklearn estimator interface guidelines.

    • Breaking change: XGBoost used to generate some pseudo feature names with DMatrix when inputs like np.ndarray don't have column names. The procedure is removed to avoid conflict with other inputs. (#6605)

    • Early stopping with training continuation is now supported. (#6506)

    • Optional imports of Dask and cuDF are now lazy. (#6522)

    • As mentioned in the prediction improvement summary, the sklearn interface uses inplace prediction whenever possible. (#6718)

    • Booster information like feature names and feature types are now saved into the JSON model file. (#6605)

    • All DMatrix interfaces including DeviceQuantileDMatrix and counterparts in Dask interface (as mentioned in the Dask changes summary) now accept all the meta-information like group and qid in their constructor for better consistency. (#6601)

    • Booster attributes are restored upon model load so users don't have to call attr manually. (#6593)

    • On sklearn interface, all models accept base_margin for evaluation datasets. (#6591)

    • Improvements over the setup script including smaller sdist size and faster installation if the C++ library is already built (#6611, #6694, #6565).

    • Bug fixes for Python package:

      • Don't validate feature when number of rows is 0. (#6472)
      • Move metric configuration into booster. (#6504)
      • Calling XGBModel.fit() should clear the Booster by default (#6562)
      • Support _estimator_type. (#6582)
      • [dask, sklearn] Fix predict proba. (#6566, #6817)
      • Restore unknown data support. (#6595)
      • Fix learning rate scheduler with cv. (#6720)
      • Fixes small typo in sklearn documentation (#6717)
      • [python-package] Fix class Booster: feature_types = None (#6705)
      • Fix divide by 0 in feature importance when no split is found. (#6676)

    JVM package

    • [jvm-packages] fix early stopping doesn't work even without custom_eval setting (#6738)
    • fix potential TaskFailedListener's callback won't be called (#6612)
    • [jvm] Add ability to load booster direct from byte array (#6655)
    • [jvm-packages] JVM library loader extensions (#6630)

    R package

    • R documentation: Make construction of DMatrix consistent.
    • Fix R documentation for xgb.train. (#6764)

    ROC-AUC

    We re-implemented the ROC-AUC metric in XGBoost. The new implementation supports multi-class classification and has better support for learning to rank tasks that are not binary. Also, it has a better-defined average on distributed environments with additional handling for invalid datasets. (#6749, #6747, #6797)

    Global configuration

    Starting from 1.4, XGBoost's Python, R and C interfaces support a new global configuration model where users can specify some global parameters. Currently, supported parameters are verbosity and use_rmm. The latter is experimental, see rmm plugin demo and related README file for details. (#6414, #6656)

    Other new features

    • Better handling for input data types that support __array_interface__. For some data types including GPU inputs and scipy.sparse.csr_matrix, XGBoost employs __array_interface__ for processing the underlying data. Starting from 1.4, XGBoost can accept arbitrary array strides (which means column-major is supported) without making data copies, potentially reducing a significant amount of memory consumption. Also version 3 of __cuda_array_interface__ is now supported. (#6776, #6765, #6459, #6675)
    • Improved parameter validation: feeding XGBoost parameters that contain whitespace now triggers an error. (#6769)
    • For Python and R packages, file paths containing the home indicator ~ are supported.
    • As mentioned in the Python changes summary, the JSON model can now save feature information of the trained booster. The JSON schema is updated accordingly. (#6605)
    • Development of categorical data support continues, with newly added support for weighted data and the dart booster. (#6508, #6693)
    • As mentioned in Dask change summary, ranking now supports the qid parameter for query groups. (#6576)
    • DMatrix.slice can now consume a numpy array. (#6368)

    Other breaking changes

    • Aside from the feature name generation, there are 2 breaking changes:
      • Drop saving binary format for memory snapshot. (#6513, #6640)
      • Change default evaluation metric for binary:logitraw objective to logloss (#6647)

    CPU Optimization

    • Aside from the general changes on predict function, some optimizations are applied on CPU implementation. (#6683, #6550, #6696, #6700)
    • Also performance for sampling initialization in hist is improved. (#6410)

    Notable fixes in the core library

    These fixes do not reside in particular language bindings:

    • Fixes for gamma regression. This includes checking for invalid input values, fixes for gamma deviance metric, and better floating point guard for gamma negative log-likelihood metric. (#6778, #6537, #6761)
    • Random forest with gpu_hist might generate low accuracy in previous versions. (#6755)
    • Fix a bug in GPU sketching when data size exceeds limit of 32-bit integer. (#6826)
    • Memory consumption fix for row-major adapters (#6779)
    • Don't estimate sketch batch size when rmm is used. (#6807) (#6830)
    • Fix in-place predict with missing value. (#6787)
    • Re-introduce double buffer in UpdatePosition, to fix perf regression in gpu_hist (#6757)
    • Pass correct split_type to GPU predictor (#6491)
    • Fix DMatrix feature names/types IO. (#6507)
    • Use view for SparsePage exclusively to avoid some data access races. (#6590)
    • Check for invalid data. (#6742)
    • Fix relocatable include in CMakeList (#6734) (#6737)
    • Fix DMatrix slice with feature types. (#6689)

    Other deprecation notices:

    • This release will be the last release to support CUDA 10.0. (#6642)

    • Starting in the next release, the Python package will require Pip 19.3+ due to the use of manylinux2014 tag. Also, CentOS 6, RHEL 6 and other old distributions will not be supported.

    Known issue:

    MacOS build of the JVM packages doesn't support multi-threading out of the box. To enable multi-threading with JVM packages, MacOS users will need to build the JVM packages from the source. See https://xgboost.readthedocs.io/en/latest/jvm/index.html#installation-from-source

    Doc

    • Dedicated page for tree_method parameter is added. (#6564, #6633)
    • [doc] Add FLAML as a fast tuning tool for XGBoost (#6770)
    • Add document for tests directory. [skip ci] (#6760)
    • Fix doc string of config.py to use correct versionadded (#6458)
    • Update demo for prediction. (#6789)
    • [Doc] Document that AUCPR is for binary classification/ranking (#5899)
    • Update the C API comments (#6457)
    • Fix document. [skip ci] (#6669)

    Maintenance: Testing, continuous integration

    • Use CPU input for test_boost_from_prediction. (#6818)
    • [CI] Upload xgboost4j.dll to S3 (#6781)
    • Update dmlc-core submodule (#6745)
    • [CI] Use manylinux2010_x86_64 container to vendor libgomp (#6485)
    • Add conda-forge badge (#6502)
    • Fix merge conflict. (#6512)
    • [CI] Split up main.yml, add mypy. (#6515)
    • [Breaking] Upgrade cuDF and RMM to 0.18 nightlies; require RMM 0.18+ for RMM plugin (#6510)
    • "featue_map" typo changed to "feature_map" (#6540)
    • Add script for generating release tarball. (#6544)
    • Add credentials to .gitignore (#6559)
    • Remove warnings in tests. (#6554)
    • Update dmlc-core submodule and conform to new API (#6431)
    • Suppress hypothesis health check for dask client. (#6589)
    • Fix pylint. (#6714)
    • [CI] Clear R package cache (#6746)
    • Exclude dmlc test on github action. (#6625)
    • Tests for regression metrics with weights. (#6729)
    • Add helper script and doc for releasing pip package. (#6613)
    • Support pylint 2.7.0 (#6726)
    • Remove R cache in github action. (#6695)
    • [CI] Do not mix up stashed executable built for ARM and x86_64 platforms (#6646)
    • [CI] Add ARM64 test to Jenkins pipeline (#6643)
    • Disable s390x and arm64 tests on travis for now. (#6641)
    • Move sdist test to action. (#6635)
    • [dask] Rework base margin test. (#6627)

    Maintenance: Refactor code for legibility and maintainability

    • Improve OpenMP exception handling (#6680)
    • Improve string view to reduce string allocation. (#6644)
    • Simplify Span checks. (#6685)
    • Use generic dispatching routine for array interface. (#6672)

    You can verify the downloaded source code xgboost.tar.gz by running this on your unix shell:

    echo "ff77130a86aebd83a8b996c76768a867b0a6e5012cce89212afc3df4c4ee6b1c *xgboost.tar.gz" | shasum -a 256 --check
    
    Source code(tar.gz)
    Source code(zip)
    xgboost.tar.gz(2.20 MB)
    xgboost_r_gpu_linux_1.4.0.tar.gz(74.32 MB)
  • v1.3.3(Jan 20, 2021)

  • v1.3.2(Jan 13, 2021)

    • Fix compatibility with newer scikit-learn. (https://github.com/dmlc/xgboost/pull/6555)
    • Fix wrong best_ntree_limit in multi-class. (https://github.com/dmlc/xgboost/pull/6569)
    • Ensure that Rabit can be compiled on Solaris (https://github.com/dmlc/xgboost/pull/6578)
    • Fix best_ntree_limit for linear and dart. (https://github.com/dmlc/xgboost/pull/6579)
    • Remove duplicated DMatrix creation in scikit-learn interface. (https://github.com/dmlc/xgboost/pull/6592)
    • Fix evals_result in XGBRanker. (https://github.com/dmlc/xgboost/pull/6594)
    Source code(tar.gz)
    Source code(zip)
  • v1.3.1(Dec 22, 2020)

    • Enable loading model from <1.0.0 trained with objective='binary:logitraw' (#6517)
    • Fix handling of print period in EvaluationMonitor (#6499)
    • Fix a bug in metric configuration after loading model. (#6504)
    • Fix save_best early stopping option (#6523)
    • Remove cupy.array_equal, since it's not compatible with cuPy 7.8 (#6528)

    You can verify the downloaded source code xgboost.tar.gz by running this on your unix shell:

    echo "fd51e844dd0291fd9e7129407be85aaeeda2309381a6e3fc104938b27fb09279 *xgboost.tar.gz" | shasum -a 256 --check
    
    Source code(tar.gz)
    Source code(zip)
    xgboost.tar.gz(2.13 MB)
  • v1.3.0(Dec 9, 2020)

    XGBoost4J-Spark: Exceptions should cancel jobs gracefully instead of killing SparkContext (#6019).

    • By default, exceptions in XGBoost4J-Spark cause the whole SparkContext to shut down, necessitating the restart of the Spark cluster. This behavior is often a major inconvenience.
    • Starting from 1.3.0 release, XGBoost adds a new parameter killSparkContextOnWorkerFailure to optionally prevent killing SparkContext. If this parameter is set, exceptions will gracefully cancel training jobs instead of killing SparkContext.

    GPUTreeSHAP: GPU acceleration of the TreeSHAP algorithm (#6038, #6064, #6087, #6099, #6163, #6281, #6332)

    • SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain predictions of machine learning models. It computes feature importance scores for individual examples, establishing how each feature influences a particular prediction. TreeSHAP is an optimized SHAP algorithm specifically designed for decision tree ensembles.
    • Starting with 1.3.0 release, it is now possible to leverage CUDA-capable GPUs to accelerate the TreeSHAP algorithm. Check out the demo notebook.
    • The CUDA implementation of the TreeSHAP algorithm is hosted at rapidsai/GPUTreeSHAP. XGBoost imports it as a Git submodule.

    New style Python callback API (#6199, #6270, #6320, #6348, #6376, #6399, #6441)

    • The XGBoost Python package now offers a re-designed callback API. The new callback API lets you design various extensions of training in idiomatic Python. In addition, the new callback API allows you to use early stopping with the native Dask API (xgboost.dask). Check out the tutorial and the demo.

    Enable the use of DeviceQuantileDMatrix / DaskDeviceQuantileDMatrix with large data (#6201, #6229, #6234).

    • DeviceQuantileDMatrix can achieve memory saving by avoiding extra copies of the training data, and the saving is bigger for large data. Unfortunately, large data with more than 2^31 elements was triggering integer overflow bugs in CUB and Thrust. Tracking issue: #6228.
    • This release contains a series of work-arounds to allow the use of DeviceQuantileDMatrix with large data:
      • Loop over copy_if (#6201)
      • Loop over thrust::reduce (#6229)
      • Implement the inclusive scan algorithm in-house, to handle large offsets (#6234)
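
    The "loop over" work-arounds share one idea: split the input so that no single library call sees more elements than it can index safely. The sketch below shows that idea in plain Python; it is a conceptual illustration, not the CUDA code used in XGBoost, and chunked_sum is a hypothetical helper.

```python
def chunked_sum(values, max_chunk):
    """Sum a sequence by reducing at most max_chunk elements per call."""
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    total = 0
    for start in range(0, len(values), max_chunk):
        # Each inner reduction sees a bounded number of elements, mirroring
        # the idea of keeping every library call below the overflow limit.
        total += sum(values[start:start + max_chunk])
    return total


data = list(range(10))
print(chunked_sum(data, max_chunk=4))  # 45
```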

    Support slicing of tree models (#6302)

    • Accessing the best iteration of a model after the application of early stopping used to be error-prone, as users had to manually pass the ntree_limit argument to the predict() function.
    • Now we provide a simple interface to slice tree models by specifying a range of boosting rounds. The tree ensemble can be split into multiple sub-ensembles via the slicing interface. Check out an example.
    • In addition, the early stopping callback now supports save_best option. When enabled, XGBoost will save (persist) the model at the best boosting round and discard the trees that were fit subsequent to the best round.

    Weighted subsampling of features (columns) (#5962)

    • It is now possible to sample features (columns) via weighted subsampling, in which features with higher weights are more likely to be selected in the sample. Weighted subsampling allows you to encode domain knowledge by emphasizing a particular set of features in the choice of tree splits. In addition, you can prevent particular features from being used in any splits, by assigning them zero weights.
    • Check out the demo.
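
    The sampling idea itself can be sketched in a few lines of plain Python. This is a conceptual illustration of weighted sampling without replacement, not XGBoost's internal sampler; weighted_feature_sample is a hypothetical helper.

```python
import random


def weighted_feature_sample(weights, k, rng):
    """Draw up to k distinct feature indices, favoring higher weights.

    Features with zero weight are never selected.
    """
    candidates = [i for i, w in enumerate(weights) if w > 0]
    chosen = []
    while candidates and len(chosen) < k:
        pick = rng.choices(
            candidates, weights=[weights[i] for i in candidates], k=1
        )[0]
        chosen.append(pick)
        candidates.remove(pick)
    return chosen


rng = random.Random(42)
weights = [0.0, 1.0, 5.0, 1.0]
counts = [0] * len(weights)
for _ in range(1000):
    for idx in weighted_feature_sample(weights, 2, rng):
        counts[idx] += 1
# Feature 0 (zero weight) is never drawn; feature 2 is drawn most often.
print(counts)
```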

    Improved integration with Dask

    • Support reverse-proxy environment such as Google Kubernetes Engine (#6343, #6475)
    • An XGBoost training job will no longer use all available workers. Instead, it will only use the workers that contain input data (#6343).
    • The new callback API works well with the Dask training API.
    • The predict() and fit() functions of DaskXGBClassifier and DaskXGBRegressor now accept a base margin (#6155).
    • Support more meta data in the Dask API (#6130, #6132, #6333).
    • Allow passing extra keyword arguments as kwargs in predict() (#6117)
    • Fix typo in dask interface: sample_weights -> sample_weight (#6240)
    • Allow empty data matrix in AFT survival, as Dask may produce empty partitions (#6379)
    • Speed up prediction by overlapping prediction jobs in all workers (#6412)

    Experimental support for direct splits with categorical features (#6028, #6128, #6137, #6140, #6164, #6165, #6166, #6179, #6194, #6219)

    • Currently, XGBoost requires users to one-hot-encode categorical variables. This has adverse performance implications, as the creation of many dummy variables results in higher memory consumption and may require fitting deeper trees to achieve equivalent model accuracy.
    • The 1.3.0 release of XGBoost contains experimental support for direct handling of categorical variables in test nodes. Each test node will have a condition of the form feature_value \in match_set, where the match_set on the right hand side contains one or more matching categories. The matching categories in match_set represent the condition for traversing to the right child node. Currently, XGBoost will generate categorical splits with only a single matching category ("one-vs-rest split"). In a future release, we plan to remove this restriction and produce splits with multiple matching categories in match_set.
    • The categorical split requires the use of JSON model serialization. The legacy binary serialization method cannot be used to save (persist) models with categorical splits.
    • Note. This feature is currently highly experimental. Use it at your own risk. See the detailed list of limitations at #5949.
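
    The test-node condition described above can be pictured with a tiny stand-alone sketch. The class CategoricalTestNode and the function goes_right are hypothetical names used for illustration and do not correspond to XGBoost's internal data structures.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class CategoricalTestNode:
    """Hypothetical representation of a test node with a categorical split."""

    feature: str
    match_set: FrozenSet[int]  # categories that send a row to the right child


def goes_right(node: CategoricalTestNode, row: Dict[str, int]) -> bool:
    """Return True when the row's category is in match_set (right child)."""
    return row[node.feature] in node.match_set


# One-vs-rest split: match_set holds a single category.
node = CategoricalTestNode(feature="color", match_set=frozenset({2}))
print(goes_right(node, {"color": 2}))  # True
print(goes_right(node, {"color": 0}))  # False
```

    A split with multiple matching categories would simply use a larger match_set with the same membership test.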

    Experimental plugin for RAPIDS Memory Manager (#5873, #6131, #6146, #6150, #6182)

    • RAPIDS Memory Manager library (rapidsai/rmm) provides a collection of efficient memory allocators for NVIDIA GPUs. It is now possible to use XGBoost with memory allocators provided by RMM, by enabling the RMM integration plugin. With this plugin, XGBoost is now able to share a common GPU memory pool with other applications using RMM, such as the RAPIDS data science packages.
    • See the demo for a working example, as well as directions for building XGBoost with the RMM plugin.
    • The plugin will soon be considered non-experimental, once #6297 is resolved.

    Experimental plugin for oneAPI programming model (#5825)

    • oneAPI is a programming interface developed by Intel aimed at providing one programming model for many types of hardware such as CPU, GPU, FPGA and other hardware accelerators.
    • XGBoost now includes an experimental plugin for using oneAPI for the predictor and objective functions. The plugin is hosted in the directory plugin/updater_oneapi.
    • Roadmap: #5442

    Pickling the XGBoost model will now trigger JSON serialization (#6027)

    • The pickle will now contain the JSON string representation of the XGBoost model, as well as related configuration.

    Performance improvements

    • Various performance improvements on multi-core CPUs
      • Optimize DMatrix build time by up to 3.7x. (#5877)
      • CPU predict performance improvement, by up to 3.6x. (#6127)
      • Optimize CPU sketch allreduce for sparse data (#6009)
      • Thread local memory allocation for BuildHist, leading to speedup up to 1.7x. (#6358)
      • Disable hyperthreading for DMatrix creation (#6386). This speeds up DMatrix creation by up to 2x.
      • Simple fix for static schedule in predict (#6357)
    • Unify thread configuration, to make it easy to utilize all CPU cores (#6186)
    • [jvm-packages] Clean up the way deterministic partitioning is computed (#6033)
    • Speed up JSON serialization by implementing an intrusive pointer class (#6129). It leads to 1.5x-2x performance boost.

    API additions

    • [R] Add SHAP summary plot using ggplot2 (#5882)
    • Modin DataFrame can now be used as input (#6055)
    • [jvm-packages] Add getNumFeature method (#6075)
    • Add MAPE metric (#6119)
    • Implement GPU predict leaf. (#6187)
    • Enable cuDF/cuPy inputs in XGBClassifier (#6269)
    • Document tree method for feature weights. (#6312)
    • Add fail_on_invalid_gpu_id parameter, which will cause XGBoost to terminate upon seeing an invalid value of gpu_id (#6342)
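
    For reference, the MAPE metric listed above (#6119) is conventionally defined as the mean of |label - prediction| / |label|. The sketch below follows that textbook definition; how XGBoost treats sample weights or zero labels is not covered here.

```python
def mape(labels, predictions):
    """Mean absolute percentage error, returned as a fraction (0.1 means 10%)."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    total = sum(abs(y - p) / abs(y) for y, p in zip(labels, predictions))
    return total / len(labels)


score = mape([100.0, 200.0, 400.0], [110.0, 180.0, 400.0])
print(score)
```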

    Breaking: the default evaluation metric for classification is changed to logloss / mlogloss (#6183)

    • The default metric used to be accuracy, and it is not statistically consistent to perform early stopping with the accuracy metric when we are really optimizing the log loss for the binary:logistic objective.
    • For statistical consistency, the default metric for classification has been changed to logloss. Users may choose to preserve the old behavior by explicitly specifying eval_metric.
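
    A small, made-up example shows why accuracy is a poor signal when the objective is log loss: predictions can keep improving in log loss while accuracy stays flat. The probabilities below are invented for illustration only.

```python
import math


def accuracy(labels, probs, threshold=0.5):
    hits = sum((p > threshold) == bool(y) for y, p in zip(labels, probs))
    return hits / len(labels)


def logloss(labels, probs):
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probs))
    return -total / len(labels)


labels = [1, 0, 1, 0]
early_round = [0.60, 0.40, 0.55, 0.45]  # barely on the right side of 0.5
later_round = [0.90, 0.10, 0.85, 0.15]  # much more confident

print(accuracy(labels, early_round), accuracy(labels, later_round))
print(logloss(labels, early_round), logloss(labels, later_round))
```

    Both rounds have the same accuracy, so early stopping on accuracy would see no progress, even though log loss is still decreasing.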

    Breaking: skmaker is now removed (#5971)

    • The skmaker updater was neither documented nor tested.

    Breaking: the JSON model format no longer stores the leaf child count (#6094).

    • The leaf child count field has been deprecated and is not used anywhere in the XGBoost codebase.

    Breaking: XGBoost now requires MacOS 10.14 (Mojave) and later.

    • Homebrew has dropped support for MacOS 10.13 (High Sierra), so we are not able to install the OpenMP runtime (libomp) from Homebrew on MacOS 10.13. Please use MacOS 10.14 (Mojave) or later.

    Deprecation notices

    • The use of LabelEncoder in XGBClassifier is now deprecated and will be removed in the next minor release (#6269). The deprecation is necessary to support multiple types of inputs, such as cuDF data frames or cuPy arrays.
    • The use of certain positional arguments in the Python interface is deprecated (#6365). Users will see deprecation warnings when passing certain function parameters as positional arguments. New code should use keyword arguments as much as possible. We have not yet decided when we will fully require the use of keyword arguments.
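
    One way such a warning could be produced is with a decorator that counts positional arguments. The warn_positional decorator and the fit function below are hypothetical and only sketch the general pattern; they are not the helper used in #6365.

```python
import functools
import warnings


def warn_positional(max_positional):
    """Hypothetical decorator: warn when too many arguments are positional."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if len(args) > max_positional:
                warnings.warn(
                    f"Pass arguments after the first {max_positional} "
                    f"to {func.__name__}() as keywords",
                    FutureWarning,
                    stacklevel=2,
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@warn_positional(max_positional=1)
def fit(data, label=None, weight=None):
    return {"data": data, "label": label, "weight": weight}


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fit([1, 2], [0, 1])        # label passed positionally: warns
    fit([1, 2], label=[0, 1])  # label passed by keyword: silent

print(len(caught))
```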

    Bug-fixes

    • On big-endian arch, swap the byte order in the binary serializer to enable loading models that were produced by a little-endian machine (#5813).
    • [jvm-packages] Fix deterministic partitioning with dataset containing Double.NaN (#5996)
    • Limit tree depth for GPU hist to 31 to prevent integer overflow (#6045)
    • [jvm-packages] Set maxBins to 256 to align with the default value in the C++ code (#6066)
    • [R] Fix CRAN check (#6077)
    • Add back support for scipy.sparse.coo_matrix (#6162)
    • Handle duplicated values in sketching. (#6178)
    • Catch all standard exceptions in C API. (#6220)
    • Fix linear GPU input (#6255)
    • Fix inplace prediction interval. (#6259)
    • [R] allow xgb.plot.importance() calls to fill a grid (#6294)
    • Lazy import dask libraries. (#6309)
    • Deterministic data partitioning for external memory (#6317)
    • Avoid resetting seed for every configuration. (#6349)
    • Fix label errors in graph visualization (#6369)
    • [jvm-packages] fix potential unit test suites aborted issue due to race condition (#6373)
    • [R] Fix warnings from R check --as-cran (#6374)
    • [R] Fix a crash that occurs with noLD R (#6378)
    • [R] Do not convert continuous labels to factors (#6380)
    • [R] remove uses of exists() (#6387)
    • Propagate parameters to the underlying Booster handle from XGBClassifier.set_param / XGBRegressor.set_param. (#6416)
    • [R] Fix R package installation via CMake (#6423)
    • Enforce row-major order in cuPy array (#6459)
    • Fix filtering callable objects in the parameters passed to the scikit-learn API. (#6466)

    Maintenance: Testing, continuous integration, build system

    • [CI] Improve JVM test in GitHub Actions (#5930)
    • Refactor plotting test so that it can run independently (#6040)
    • [CI] Cancel builds on subsequent pushes (#6011)
    • Fix Dask Pytest fixture (#6024)
    • [CI] Migrate linters to GitHub Actions (#6035)
    • [CI] Remove win2016 JVM test from GitHub Actions (#6042)
    • Fix CMake build with BUILD_STATIC_LIB option (#6090)
    • Don't link imported target in CMake (#6093)
    • Work around a compiler bug in MacOS AppleClang 11 (#6103)
    • [CI] Fix CTest by running it in a correct directory (#6104)
    • [R] Check warnings explicitly for model compatibility tests (#6114)
    • [jvm-packages] add xgboost4j-gpu/xgboost4j-spark-gpu module to facilitate release (#6136)
    • [CI] Time GPU tests. (#6141)
    • [R] remove warning in configure.ac (#6152)
    • [CI] Upgrade cuDF and RMM to 0.16 nightlies; upgrade to Ubuntu 18.04 (#6157)
    • [CI] Test C API demo (#6159)
    • Option for generating device debug info. (#6168)
    • Update .gitignore (#6175, #6193, #6346)
    • Hide C++ symbols from dmlc-core (#6188)
    • [CI] Added arm64 job in Travis-CI (#6200)
    • [CI] Fix Docker build for CUDA 11 (#6202)
    • [CI] Move non-OpenMP gtest to GitHub Actions (#6210)
    • [jvm-packages] Fix up build for xgboost4j-gpu, xgboost4j-spark-gpu (#6216)
    • Add more tests for categorical data support (#6219)
    • [dask] Test for data initialization. (#6226)
    • Bump junit from 4.11 to 4.13.1 in /jvm-packages/xgboost4j (#6230)
    • Bump junit from 4.11 to 4.13.1 in /jvm-packages/xgboost4j-gpu (#6233)
    • [CI] Reduce testing load with RMM (#6249)
    • [CI] Build a Python wheel for aarch64 platform (#6253)
    • [CI] Time the CPU tests on Jenkins. (#6257)
    • [CI] Skip Dask tests on ARM. (#6267)
    • Fix a typo in is_arm() in testing.py (#6271)
    • [CI] replace egrep with grep -E (#6287)
    • Support unity build. (#6295)
    • [CI] Mark flaky tests as XFAIL (#6299)
    • [CI] Use separate Docker cache for each CUDA version (#6305)
    • Added USE_NCCL_LIB_PATH option to enable user to set NCCL_LIBRARY during build (#6310)
    • Fix flaky data initialization test. (#6318)
    • Add a badge for GitHub Actions (#6321)
    • Optional find_package for sanitizers. (#6329)
    • Use pytest conventions consistently in Python tests (#6337)
    • Fix missing space in warning message (#6340)
    • Update custom_metric_obj.rst (#6367)
    • [CI] Run R check with --as-cran flag on GitHub Actions (#6371)
    • [CI] Remove R check from Jenkins (#6372)
    • Mark GPU external memory test as XFAIL. (#6381)
    • [CI] Add noLD R test (#6382)
    • Fix MPI build. (#6403)
    • [CI] Upgrade to MacOS Mojave image (#6406)
    • Fix flaky sparse page dmatrix test. (#6417)
    • [CI] Upgrade cuDF and RMM to 0.17 nightlies (#6434)
    • [CI] Fix CentOS 6 Docker images (#6467)
    • [CI] Vendor libgomp in the manylinux Python wheel (#6461)
    • [CI] Hot fix for libgomp vendoring (#6482)

    Maintenance: Clean up and merge the Rabit submodule (#6023, #6095, #6096, #6105, #6110, #6262, #6275, #6290)

    • The Rabit submodule is now maintained as part of the XGBoost codebase.
    • Tests for Rabit are now part of the test suites of XGBoost.
    • Rabit can now be built on the Windows platform.
    • We reformatted various parts of the C++ code with clang-tidy.
    • Public headers of XGBoost no longer depend on Rabit headers.
    • Unused CMake targets for Rabit were removed.
    • Single-point model recovery has been dropped and removed from Rabit, simplifying the Rabit code greatly. The single-point model recovery feature has not been adequately maintained over the years.
    • We removed the parts of Rabit that were not useful for XGBoost.

    Maintenance: Refactor code for legibility and maintainability

    • Unify CPU hist sketching (#5880)
    • [R] fix uses of 1:length(x) and other small things (#5992)
    • Unify evaluation functions. (#6037)
    • Make binary bin search reusable. (#6058)
    • Unify set index data. (#6062)
    • [R] Remove stringi dependency (#6109)
    • Merge extract cuts into QuantileContainer. (#6125)
    • Reduce C++ compiler warnings (#6197, #6198, #6213, #6286, #6325)
    • Cleanup Python code. (#6223)
    • Small cleanup to evaluator. (#6400)

    Usability Improvements, Documentation

    • [jvm-packages] add example to handle missing value other than 0 (#5677)
    • Add DMatrix usage examples to the C API demo (#5854)
    • List DaskDeviceQuantileDMatrix in the doc. (#5975)
    • Update Python custom objective demo. (#5981)
    • Update the JSON model schema to document more objective functions. (#5982)
    • [Python] Fix warning when missing field is not used. (#5969)
    • Fix typo in tracker logging (#5994)
    • Move a warning about empty dataset, so that it's shown for all objectives and metrics (#5998)
    • Fix the instructions for installing the nightly build. (#6004)
    • [Doc] Add dtreeviz as a showcase example of integration with 3rd-party software (#6013)
    • [jvm-packages] [doc] Update install doc for JVM packages (#6051)
    • Fix typo in xgboost.callback.early_stop docstring (#6071)
    • Add cache suffix to the files used in the external memory demo. (#6088)
    • [Doc] Document the parameter kill_spark_context_on_worker_failure (#6097)
    • Fix link to the demo for custom objectives (#6100)
    • Update Dask doc. (#6108)
    • Validate weights are positive values. (#6115)
    • Document the updated CMake version requirement. (#6123)
    • Add demo for DaskDeviceQuantileDMatrix. (#6156)
    • Cosmetic fixes in faq.rst (#6161)
    • Fix error message. (#6176)
    • [Doc] Add list of winning solutions in data science competitions using XGBoost (#6177)
    • Fix a comment in demo to use correct reference (#6190)
    • Update the list of winning solutions using XGBoost (#6192)
    • Consistent style for build status badge (#6203)
    • [Doc] Add info on GPU compiler (#6204)
    • Update the list of winning solutions (#6222, #6254)
    • Add link to XGBoost's Twitter handle (#6244)
    • Fix minor typos in XGBClassifier methods' docstrings (#6247)
    • Add sponsors link to FUNDING.yml (#6252)
    • Group CLI demo into subdirectory. (#6258)
    • Reduce warning messages from gbtree. (#6273)
    • Create a tutorial for using the C API in a C/C++ application (#6285)
    • Update plugin instructions for CMake build (#6289)
    • [doc] make Dask distributed example copy-pastable (#6345)
    • [Python] Add option to use libxgboost.so from the system path (#6362)
    • Fixed few grammatical mistakes in doc (#6393)
    • Fix broken link in CLI doc (#6396)
    • Improve documentation for the Dask API (#6413)
    • Revise misleading exception information: no such param of allow_non_zero_missing (#6418)
    • Fix CLI ranking demo. (#6439)
    • Fix broken links. (#6455)

    Acknowledgement

    Contributors: Nan Zhu (@CodingCat), @FelixYBW, Jack Dunn (@JackDunnNZ), Jean Lescut-Muller (@JeanLescut), Boris Feld (@Lothiraldan), Nikhil Choudhary (@Nikhil1O1), Rory Mitchell (@RAMitchell), @ShvetsKS, Anthony D'Amato (@Totoketchup), @Wittty-Panda, neko (@akiyamaneko), Alexander Gugel (@alexanderGugel), @dependabot[bot], DIVYA CHAUHAN (@divya661), Daniel Steinberg (@dstein64), Akira Funahashi (@funasoul), Philip Hyunsu Cho (@hcho3), Tong He (@hetong007), Hristo Iliev (@hiliev), Honza Sterba (@honzasterba), @hzy001, Igor Moura (@igormp), @jameskrach, James Lamb (@jameslamb), Naveed Ahmed Saleem Janvekar (@janvekarnaveed), Kyle Nicholson (@kylejn27), lacrosse91 (@lacrosse91), Christian Lorentzen (@lorentzenchr), Manikya Bardhan (@manikyabard), @nabokovas, John Quitto-Graham (@nvidia-johnq), @odidev, Qi Zhang (@qzhang90), Sergio Gavilán (@sgavil), Tanuja Kirthi Doddapaneni (@tanuja3), Cuong Duong (@tcuongd), Yuan Tang (@terrytangyuan), Jiaming Yuan (@trivialfis), vcarpani (@vcarpani), Vladislav Epifanov (@vepifanov), Vitalie Spinu (@vspinu), Bobby Wang (@wbo4958), Zeno Gantner (@zenogantner), zhang_jf (@zuston)

    Reviewers: Nan Zhu (@CodingCat), John Zedlewski (@JohnZed), Rory Mitchell (@RAMitchell), @ShvetsKS, Egor Smirnov (@SmirnovEgorRu), Anthony D'Amato (@Totoketchup), @Wittty-Panda, Alexander Gugel (@alexanderGugel), Codecov Comments Bot (@codecov-commenter), Codecov (@codecov-io), DIVYA CHAUHAN (@divya661), Devin Robison (@drobison00), Geoffrey Blake (@geoffreyblake), Mark Harris (@harrism), Philip Hyunsu Cho (@hcho3), Honza Sterba (@honzasterba), Igor Moura (@igormp), @jakirkham, @jameskrach, James Lamb (@jameslamb), Janakarajan Natarajan (@janaknat), Jake Hemstad (@jrhemstad), Keith Kraus (@kkraus14), Kyle Nicholson (@kylejn27), Christian Lorentzen (@lorentzenchr), Michael Mayer (@mayer79), Nikolay Petrov (@napetrov), @odidev, PSEUDOTENSOR / Jonathan McKinney (@pseudotensor), Qi Zhang (@qzhang90), Sergio Gavilán (@sgavil), Scott Lundberg (@slundberg), Cuong Duong (@tcuongd), Yuan Tang (@terrytangyuan), Jiaming Yuan (@trivialfis), vcarpani (@vcarpani), Vladislav Epifanov (@vepifanov), Vincent Nijs (@vnijs), Vitalie Spinu (@vspinu), Bobby Wang (@wbo4958), William Hicks (@wphicks)

  • v1.3.0rc1(Nov 23, 2020)

  • v1.2.1(Oct 14, 2020)

  • v1.2.0(Aug 23, 2020)

    XGBoost4J-Spark now supports the GPU algorithm (#5171)

    • Now XGBoost4J-Spark is able to leverage NVIDIA GPU hardware to speed up training.
    • There is on-going work for accelerating the rest of the data pipeline with NVIDIA GPUs (#5950, #5972).

    XGBoost now supports CUDA 11 (#5808)

    • It is now possible to build XGBoost with CUDA 11. Note that we do not yet distribute pre-built binaries built with CUDA 11; all current distributions use CUDA 10.0.

    Better guidance for persisting XGBoost models in an R environment (#5940, #5964)

    • Users are strongly encouraged to use xgb.save() and xgb.save.raw() instead of saveRDS(). This is so that the persisted models can be accessed with future releases of XGBoost.
    • The previous release (1.1.0) had problems loading models that were saved with saveRDS(). This release adds a compatibility layer to restore access to the old RDS files. Note that this is meant to be a temporary measure; users are advised to stop using saveRDS() and migrate to xgb.save() and xgb.save.raw().

    New objectives and metrics

    • The pseudo-Huber loss reg:pseudohubererror is added (#5647). The corresponding metric is mphe. Right now, the slope is hard-coded to 1.
    • The Accelerated Failure Time objective for survival analysis (survival:aft) is now accelerated on GPUs (#5714, #5716). The survival metrics aft-nloglik and interval-regression-accuracy are also accelerated on GPUs.
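
    For readers unfamiliar with the pseudo-Huber loss, here is a minimal sketch of the standard definition with the slope fixed at 1, together with its first and second derivatives with respect to the prediction. The function names are illustrative and this is not XGBoost's C++ implementation.

```python
import math


def pseudo_huber(pred, label):
    """Pseudo-Huber loss with slope 1: sqrt(1 + r^2) - 1, where r = pred - label."""
    r = pred - label
    return math.sqrt(1.0 + r * r) - 1.0


def pseudo_huber_grad(pred, label):
    # First derivative with respect to pred.
    r = pred - label
    return r / math.sqrt(1.0 + r * r)


def pseudo_huber_hess(pred, label):
    # Second derivative with respect to pred.
    r = pred - label
    return 1.0 / (1.0 + r * r) ** 1.5


for residual in (0.01, 1.0, 100.0):
    print(residual, pseudo_huber(residual, 0.0), pseudo_huber_grad(residual, 0.0))
```

    For small residuals the loss behaves like r^2 / 2, and for large residuals it grows roughly linearly, which makes it less sensitive to outliers than squared error.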

    Improved integration with scikit-learn

    • Added n_features_in_ attribute to the scikit-learn interface to store the number of features used (#5780). This is useful for integrating with some scikit-learn features such as StackingClassifier. See this link for more details.
    • XGBoostError now inherits from ValueError, which conforms to scikit-learn's exception requirement (#5696).

    Improved integration with Dask

    • The XGBoost Dask API now exposes an asynchronous interface (#5862). See the document for details.
    • Zero-copy ingestion of GPU arrays via DaskDeviceQuantileDMatrix (#5623, #5799, #5800, #5803, #5837, #5874, #5901): Previously, the Dask interface had to make 2 data copies: one for concatenating the Dask partition/block into a single block and another for internal representation. To save memory, we introduce DaskDeviceQuantileDMatrix. As long as Dask partitions are resident in the GPU memory, DaskDeviceQuantileDMatrix is able to ingest them directly without making copies. This matrix type wraps DeviceQuantileDMatrix.
    • The prediction function now returns GPU Series type if the input is from Dask-cuDF (#5710). This is to preserve the input data type.

    Robust handling of external data types (#5689, #5893)

    • As we support more and more external data types, the handling logic has proliferated all over the code base and has become hard to keep track of. It also became unclear how missing values and threads are handled. We refactored the Python package code to collect all data handling logic in a central location, and we now have an explicit list of all supported data types.

    Improvements in GPU-side data matrix (DeviceQuantileDMatrix)

    • The GPU-side data matrix now implements its own quantile sketching logic, so that data don't have to be transported back to the main memory (#5700, #5747, #5760, #5846, #5870, #5898). The GK sketching algorithm is also now better documented.
      • We can now load extremely sparse datasets such as URL, although performance is still sub-optimal.
    • The GPU-side data matrix now exposes an iterative interface (#5783), so that users are able to construct a matrix from a data iterator. See the Python demo.

    New language binding: Swift (#5728)

    • Visit https://github.com/kongzii/SwiftXGBoost for more details.

    Robust model serialization with JSON (#5772, #5804, #5831, #5857, #5934)

    • We continue efforts from the 1.0.0 release to adopt JSON as the format to save and load models robustly.
    • JSON model IO is significantly faster and produces smaller model files.
    • Round-trip reproducibility is guaranteed, via the introduction of an efficient float-to-string conversion algorithm known as the Ryū algorithm. The conversion is locale-independent, producing consistent numeric representation regardless of the locale setting of the user's machine.
    • We fixed an issue in loading large JSON files to memory.
    • It is now possible to load a JSON file from a remote source such as S3.
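
    The round-trip property mentioned above can be demonstrated in Python, whose own float formatting (a different algorithm from Ryū) also emits the shortest string that parses back to the same double. The leaf_values key is made up for this example and is not part of the XGBoost JSON schema.

```python
import json

weights = [0.1, 1.0 / 3.0, 2.5e-8, 123456.789]

# Shortest round-trip formatting: every value survives serialization exactly.
text = json.dumps({"leaf_values": weights})
restored = json.loads(text)["leaf_values"]
print(restored == weights)

# Fixed-precision formatting, by contrast, silently loses information.
truncated = float("%.6f" % (1.0 / 3.0))
print(truncated == 1.0 / 3.0)
```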

    Performance improvements

    • CPU hist tree method optimization
      • Skip missing lookup in hist row partitioning if data is dense. (#5644)
      • Specialize training procedures for CPU hist tree method on distributed environment. (#5557)
      • Add single-precision histogram for CPU hist. Previously, the gradient histogram for CPU hist was hard-coded to 64 bits; users can now set the parameter single_precision_histogram to use a 32-bit histogram instead for faster training. (#5624, #5811)
    • GPU hist tree method optimization
      • Remove some unnecessary synchronizations and improve the memory allocation pattern. (#5707)
      • Optimize GPU hist for wide datasets. Previously, for wide datasets the atomic operations were performed on global memory; they can now run on shared memory for faster histogram building. There is a known small regression on GeForce cards with dense data. (#5795, #5926, #5948, #5631)
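
    The trade-off behind single_precision_histogram can be illustrated with a toy accumulation: summing gradients in 32-bit floats is cheaper but drifts slightly from the 64-bit sum. The sketch below emulates 32-bit rounding with the struct module and is not XGBoost's histogram builder.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(gradients, single_precision):
    total = 0.0
    for g in gradients:
        total += g
        if single_precision:
            # Emulate a 32-bit accumulator by rounding after each addition.
            total = to_float32(total)
    return total


gradients = [0.1] * 10_000
sum64 = accumulate(gradients, single_precision=False)
sum32 = accumulate(gradients, single_precision=True)
relative_error = abs(sum32 - sum64) / sum64
print(sum64, sum32, relative_error)
```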

    API additions

    • Support passing fmap to importance plot (#5719). Now importance plot can show actual names of features instead of default ones.
    • Support 64bit seed. (#5643)
    • A new C API XGBoosterGetNumFeature is added for getting number of features in booster (#5856).
    • Feature names and feature types are now stored in C++ core and saved in binary DMatrix (#5858).

    Breaking: The predict() method of DaskXGBClassifier now produces class predictions (#5986). Use predict_proba() to obtain probability predictions.

    • Previously, DaskXGBClassifier.predict() produced probability predictions. This was inconsistent with the behavior of other scikit-learn classifiers, where predict() returns class predictions. We make a breaking change in the 1.2.0 release so that DaskXGBClassifier.predict() now correctly produces class predictions and thus behaves like other scikit-learn classifiers. Furthermore, we introduce the predict_proba() method for obtaining probability predictions, again to be in line with other scikit-learn classifiers.
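
    The relationship between the two outputs follows the usual scikit-learn convention: a class prediction is the index of the largest class probability. The helper below is a plain-Python sketch of that convention and does not involve Dask or XGBoost.

```python
def proba_to_class(proba_rows):
    """Return the index of the largest probability in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in proba_rows]


# What predict_proba() would conceptually return for three rows, three classes.
proba = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.5, 0.3]]

# What predict() would conceptually return for the same rows.
classes = proba_to_class(proba)
print(classes)
```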

    Breaking: Custom evaluation metric now receives raw prediction (#5954)

    • Previously, the custom evaluation metric received a transformed prediction result when used with a classifier. Now the custom metric will receive a raw (untransformed) prediction and will need to transform the prediction itself. See demo/guide-python/custom_softmax.py for an example.
    • This change is to make the custom metric behave consistently with the custom objective, which already receives raw prediction (#5564).
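
    In practice this means a custom metric for a binary classifier has to apply the sigmoid itself before thresholding. The function below is a simplified sketch: it takes plain lists, whereas a real XGBoost custom metric receives the predictions and a DMatrix holding the labels. The metric name is made up.

```python
import math


def sigmoid(margin):
    return 1.0 / (1.0 + math.exp(-margin))


def error_rate_metric(raw_margins, labels):
    """Hypothetical custom metric that expects raw (untransformed) margins."""
    # Transform the raw margins into probabilities before thresholding.
    probs = [sigmoid(m) for m in raw_margins]
    wrong = sum((p > 0.5) != bool(y) for p, y in zip(probs, labels))
    return "custom-error", wrong / len(labels)


name, value = error_rate_metric([2.0, -1.0, 0.3, -0.2], [1, 0, 0, 1])
print(name, value)
```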

    Breaking: XGBoost4J-Spark now requires Spark 3.0 and Scala 2.12 (#5836, #5890)

    • Starting with version 3.0, Spark can manage GPU resources and allocate them among executors.
    • Spark 3.0 dropped support for Scala 2.11 and now only supports Scala 2.12. Thus, XGBoost4J-Spark also only supports Scala 2.12.

    Breaking: XGBoost Python package now requires Python 3.6 and later (#5715)

    • Python 3.6 has many useful features such as f-strings.

    Breaking: XGBoost now adopts the C++14 standard (#5664)

    • Make sure to use a sufficiently modern C++ compiler that supports C++14, such as Visual Studio 2017, GCC 5.0+, and Clang 3.4+.

    Bug-fixes

    • Fix a data race in the prediction function (#5853). As a byproduct, the prediction function now uses a thread-local data store and is thread-safe.
    • Restore capability to run prediction when the test input has fewer features than the training data (#5955). This capability is necessary to support predicting with LIBSVM inputs. The previous release (1.1) had broken this capability, so we restore it in this version with better tests.
    • Fix OpenMP build with CMake for R package, to support CMake 3.13 (#5895).
    • Fix Windows 2016 build (#5902, #5918).
    • Fix edge cases in scikit-learn interface with Pandas input by disabling feature validation. (#5953)
    • [R] Enable weighted learning to rank (#5945)
    • [R] Fix early stopping with custom objective (#5923)
    • Fix NDK Build (#5886)
    • Add missing explicit template specializations for greater portability (#5921)
    • Handle empty rows in data iterators correctly (#5929). This bug affects file loader and JVM data frames.
    • Fix IsDense (#5702)
    • [jvm-packages] Fix wrong method name setAllowZeroForMissingValue (#5740)
    • Fix shape inference for Dask predict (#5989)

    Usability Improvements, Documentation

    • [Doc] Document that CUDA 10.0 is required (#5872)
    • Refactored the command line interface (CLI). The CLI can now handle user errors and print basic documentation. (#5574)
    • Better error handling in Python: use raise from syntax to preserve full stacktrace (#5787).
    • The JSON model dump now has a formal schema (#5660, #5818). This helps prevent the dump_model() function from breaking. See this document to understand the difference between saving and dumping models.
    • Add a reference to the GPU external memory paper (#5684)
    • Document more objective parameters in the R package (#5682)
    • Document the existence of pre-built binary wheels for MacOS (#5711)
    • Remove max.depth in the R gblinear example. (#5753)
    • Added conda environment file for building docs (#5773)
    • Mention dask blog post in the doc, which introduces using Dask with GPU and some internal workings. (#5789)
    • Fix rendering of Markdown docs (#5821)
    • Document new objectives and metrics available on GPUs (#5909)
    • Better message when no GPU is found. (#5594)
    • Remove the use of silent parameter from R demos. (#5675)
    • Don't use masked array in array interface. (#5730)
    • Update affiliation of @terrytangyuan: Ant Financial -> Ant Group (#5827)
    • Move dask tutorial closer to other distributed tutorials (#5613)
    • Update XGBoost + Dask overview documentation (#5961)
    • Show n_estimators in the docstring of the scikit-learn interface (#6041)
    • Fix a typo in a docstring of the scikit-learn interface (#5980)

    Maintenance: testing, continuous integration, build system

    • [CI] Remove CUDA 9.0 from CI (#5674, #5745)
    • Require CUDA 10.0+ in CMake build (#5718)
    • [R] Remove dependency on gendef for Visual Studio builds (fixes #5608) (#5764). This enables building XGBoost with GPU support with R 4.x.
    • [R-package] Reduce duplication in configure.ac (#5693)
    • Bump com.esotericsoftware to 4.0.2 (#5690)
    • Migrate some tests from AppVeyor to GitHub Actions to speed up the tests. (#5911, #5917, #5919, #5922, #5928)
    • Reduce cost of the Jenkins CI server (#5884, #5904, #5892). We now enforce a daily budget via an automated monitor. We also dramatically reduced the workload for the Windows platform, since the cloud VM cost is vastly greater for Windows.
    • [R] Set up automated R linter (#5944)
    • [R] replace uses of T and F with TRUE and FALSE (#5778)
    • Update Docker container 'CPU' (#5956)
    • Simplify CMake build with modern CMake techniques (#5871)
    • Use hypothesis package for testing (#5759, #5835, #5849).
    • Define _CRT_SECURE_NO_WARNINGS to remove unneeded warnings in MSVC (#5434)
    • Run all Python demos in CI, to ensure that they don't break (#5651)
    • Enhance nvtx support (#5636). Now we can use unified timer between CPU and GPU. Also CMake is able to find nvtx automatically.
    • Speed up python test. (#5752)
    • Add helper for generating batches of data. (#5756)
    • Add c-api-demo to .gitignore (#5855)
    • Add option to enable all compiler warnings in GCC/Clang (#5897)
    • Make Python model compatibility test runnable locally (#5941)
    • Add cupy to Windows CI (#5797)
    • [CI] Fix cuDF install; merge 'gpu' and 'cudf' test suite (#5814)
    • Update rabit submodule (#5680, #5876)
    • Force colored output for Ninja build. (#5959)
    • [CI] Assign larger /dev/shm to NCCL (#5966)
    • Add missing Pytest marks to AsyncIO unit test (#5968)
    • [CI] Use latest cuDF and dask-cudf (#6048)
    • Add CMake flag to log C API invocations, to aid debugging (#5925)
    • Fix a unit test on CLI, to handle RC versions (#6050)
    • [CI] Use mgpu machine to run gpu hist unit tests (#6050)
    • [CI] Build GPU-enabled JAR artifact and deploy to xgboost-maven-repo (#6050)

    Maintenance: Refactor code for legibility and maintainability

    • Remove dead code in DMatrix initialization. (#5635)
    • Catch dmlc error by ref. (#5678)
    • Refactor the gpu_hist split evaluation in preparation for batched nodes enumeration. (#5610)
    • Remove column major specialization. (#5755)
    • Remove unused imports in Python (#5776)
    • Avoid including c_api.h in header files. (#5782)
    • Remove unweighted GK quantile, which is unused. (#5816)
    • Add Python binding for rabit ops. (#5743)
    • Implement Empty method for host device vector. (#5781)
    • Remove print (#5867)
    • Enforce tree order in JSON (#5974)

    Acknowledgement

    Contributors: Nan Zhu (@CodingCat), @LionOrCatThatIsTheQuestion, Dmitry Mottl (@Mottl), Rory Mitchell (@RAMitchell), @ShvetsKS, Alex Wozniakowski (@a-wozniakowski), Alexander Gugel (@alexanderGugel), @anttisaukko, @boxdot, Andy Adinets (@canonizer), Ram Rachum (@cool-RR), Elliot Hershberg (@elliothershberg), Jason E. Aten, Ph.D. (@glycerine), Philip Hyunsu Cho (@hcho3), @jameskrach, James Lamb (@jameslamb), James Bourbeau (@jrbourbeau), Peter Jung (@kongzii), Lorenz Walthert (@lorenzwalthert), Oleksandr Kuvshynov (@okuvshynov), Rong Ou (@rongou), Shaochen Shi (@shishaochen), Yuan Tang (@terrytangyuan), Jiaming Yuan (@trivialfis), Bobby Wang (@wbo4958), Zhang Zhang (@zhangzhang10)

    Reviewers: Nan Zhu (@CodingCat), @LionOrCatThatIsTheQuestion, Hao Yang (@QuantHao), Rory Mitchell (@RAMitchell), @ShvetsKS, Egor Smirnov (@SmirnovEgorRu), Alex Wozniakowski (@a-wozniakowski), Amit Kumar (@aktech), Avinash Barnwal (@avinashbarnwal), @boxdot, Andy Adinets (@canonizer), Chandra Shekhar Reddy (@chandrureddy), Ram Rachum (@cool-RR), Cristiano Goncalves (@cristianogoncalves), Elliot Hershberg (@elliothershberg), Jason E. Aten, Ph.D. (@glycerine), Philip Hyunsu Cho (@hcho3), Tong He (@hetong007), James Lamb (@jameslamb), James Bourbeau (@jrbourbeau), Lee Drake (@leedrake5), DougM (@mengdong), Oleksandr Kuvshynov (@okuvshynov), RongOu (@rongou), Shaochen Shi (@shishaochen), Xu Xiao (@sperlingxx), Yuan Tang (@terrytangyuan), Theodore Vasiloudis (@thvasilo), Jiaming Yuan (@trivialfis), Bobby Wang (@wbo4958), Zhang Zhang (@zhangzhang10)

  • v1.2.0rc2(Aug 12, 2020)

  • v1.2.0rc1(Aug 2, 2020)

  • v1.1.1(Jun 7, 2020)

    This patch release applies the following patches to 1.1.0 release:

    • CPU performance improvement in the PyPI wheels (#5720)
    • Fix loading old model. (#5724)
    • Install pkg-config file (#5744)
  • v1.1.0(May 17, 2020)

    Better performance on multi-core CPUs (#5244, #5334, #5522)

    • Poor performance scaling of the hist algorithm on multi-core CPUs has been under investigation (#3810). #5244 concludes the ongoing effort to improve performance scaling on multi-core CPUs, in particular Intel CPUs. Roadmap: #5104
    • #5334 makes steps toward reducing memory consumption for the hist tree method on CPU.
    • #5522 optimizes random number generation for data sampling.

    Deterministic GPU algorithm for regression and classification (#5361)

    • GPU algorithm for regression and classification tasks is now deterministic.
    • Roadmap: #5023. Currently only single-GPU training is deterministic. Distributed training with multiple GPUs is not yet deterministic.

    Improve external memory support on GPUs (#5093, #5365)

    • Starting with the 1.0.0 release, we added support for external memory on GPUs to enable training with larger datasets. Gradient-based sampling (#5093) speeds up the external memory algorithm by intelligently sampling a subset of the training data to copy into the GPU memory. Learn more about out-of-core GPU gradient boosting.
    • GPU-side data sketching now works with data from external memory (#5365).

    Parameter validation: detection of unused or incorrect parameters (#5477, #5569, #5508)

    • Misspelled training parameters are a common user mistake. In previous versions of XGBoost, misspelled parameters were silently ignored. Starting with the 1.0.0 release, XGBoost produces a warning message if there are any unused training parameters. The 1.1.0 release makes parameter validation available to the scikit-learn interface (#5477) and the R binding (#5569).
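
    The idea can be sketched as a check of user-supplied keys against a set of recognized names. The KNOWN_PARAMS set below is a small illustrative subset and the warning text is invented; XGBoost performs this validation internally and its actual message differs.

```python
import warnings

# Illustrative subset only; the real list of parameters is much longer.
KNOWN_PARAMS = {"max_depth", "eta", "subsample", "objective"}


def find_unused_params(params):
    """Return the parameter names that would be ignored, warning about each."""
    unused = sorted(name for name in params if name not in KNOWN_PARAMS)
    for name in unused:
        warnings.warn(f"Parameter '{name}' might not be used", UserWarning)
    return unused


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    unused = find_unused_params({"max_depht": 3, "eta": 0.1})

print(unused)
```

    Here the misspelled max_depht is reported instead of being silently dropped.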

    Thread-safe, in-place prediction method (#5389, #5512)

    • Previously, the prediction method was not thread-safe (#5339). This release adds a new API function inplace_predict() that is thread-safe. It is now possible to serve concurrent requests for prediction using a shared model object.
    • It is now possible to compute prediction in-place for selected data formats (numpy.ndarray / scipy.sparse.csr_matrix / cupy.ndarray / cudf.DataFrame / pd.DataFrame) without creating a DMatrix object.

    Addition of Accelerated Failure Time objective for survival analysis (#4763, #5473, #5486, #5552, #5553)

    • Survival analysis (regression) models the time it takes for an event of interest to occur. The target label is potentially censored, i.e. the label is a range rather than a single number. We added a new objective survival:aft to support survival analysis. Also added is the new API to specify the ranged labels. Check out the tutorial and the demos.
    • GPU support is work in progress (#5714).
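
    A ranged label can be represented as a pair of bounds per row. The sketch below assumes the common convention that an observed event has equal lower and upper bounds, while a right-censored row has an upper bound of +infinity. The to_label_bounds helper is hypothetical; see the tutorial for how the bounds are attached to the training data.

```python
import math


def to_label_bounds(observations):
    """Convert (time, event_observed) pairs into lower and upper label bounds."""
    lower, upper = [], []
    for time, event_observed in observations:
        lower.append(time)
        # Right-censored: the event happens at some unknown time after `time`.
        upper.append(time if event_observed else math.inf)
    return lower, upper


observations = [(5.0, True), (3.0, False), (8.5, True)]
lower, upper = to_label_bounds(observations)
print(lower, upper)
```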

    Improved installation experience on Mac OSX (#5597, #5602, #5606, #5701)

    • It only takes two commands to install the XGBoost Python package: brew install libomp followed by pip install xgboost. The installed XGBoost will use all CPU cores. Even better, starting with this release, we distribute pre-compiled binary wheels targeting Mac OSX. Now the install command pip install xgboost finishes instantly, as it no longer compiles the C++ source of XGBoost. The last three Mac versions (High Sierra, Mojave, Catalina) are supported.
    • R package: the 1.1.0 release fixes the error Initializing libomp.dylib, but found libomp.dylib already initialized (#5701)

    Ranking metrics are now accelerated on GPUs (#5380, #5387, #5398)

    GPU-side data matrix to ingest data directly from other GPU libraries (#5420, #5465)

    • Previously, data in GPU memory had to be copied back to main memory before XGBoost could use it. Starting with the 1.1.0 release, XGBoost provides a dedicated interface (DeviceQuantileDMatrix) that ingests data from GPU memory directly. As a result, XGBoost interoperates better with GPU-accelerated data science libraries such as cuDF, CuPy, and PyTorch.
    • Set device in device dmatrix. (#5596)

    Robust model serialization with JSON (#5123, #5217)

    • We continue efforts from the 1.0.0 release to adopt JSON as the format to save and load models robustly. Refer to the release note for 1.0.0 to learn more.
    • It is now possible to store internal configuration of the trained model (Booster) object in R as a JSON string (#5123, #5217).

    Improved integration with Dask

    • Pass through verbose parameter for dask fit (#5413)
    • Use DMLC_TASK_ID. (#5415)
    • Order the prediction result. (#5416)
    • Honor nthreads from dask worker. (#5414)
    • Enable grid searching with scikit-learn. (#5417)
    • Check non-equal when setting threads. (#5421)
    • Accept other inputs for prediction. (#5428)
    • Fix missing value for scikit-learn interface. (#5435)

    XGBoost4J-Spark: Check number of columns in the data iterator (#5202, #5303)

    • Before, the native layer in XGBoost did not know the number of columns (features) ahead of time and had to guess it by counting feature indices while ingesting data. This method has a failure mode in the distributed setting: if the training data is highly sparse, some features may be completely missing in one or more worker partitions. Thus, one or more workers may deduce an incorrect data shape, leading to crashes or silently wrong models.
    • Enforce correct data shape by passing the number of columns explicitly from the JVM layer into the native layer.
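
The failure mode is easy to reproduce in miniature. The sketch below is plain Python, not XGBoost4J code; the partition layout and both helper functions are hypothetical. It shows two workers inferring different widths from sparse rows, and then the same data checked against an explicitly supplied column count.

```python
# Each worker holds sparse rows as {feature_index: value} dictionaries.
partitions = {
    "worker-0": [{0: 1.0, 3: 2.5}, {1: 0.7}],
    "worker-1": [{0: 0.2}, {2: 1.1}],  # feature 3 never appears here
}


def guess_num_columns(rows):
    """Old approach (sketch): infer the width from the largest index seen."""
    return max((max(row) for row in rows if row), default=-1) + 1


def check_num_columns(rows, num_columns):
    """New approach (sketch): the caller states the width; rows are only checked."""
    for row in rows:
        for index in row:
            if index >= num_columns:
                raise ValueError(f"feature index {index} exceeds {num_columns} columns")
    return num_columns


# Guessing gives each worker a different idea of the data shape.
guessed = {worker: guess_num_columns(rows) for worker, rows in partitions.items()}

# Passing the true width explicitly makes every worker agree.
NUM_COLUMNS = 4
enforced = {worker: check_num_columns(rows, NUM_COLUMNS) for worker, rows in partitions.items()}
```

Here `guessed` disagrees between the two workers, while `enforced` is the same everywhere.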

    Major refactoring of the DMatrix class

    • Continued from 1.0.0 release.
    • Remove update prediction cache from predictors. (#5312)
    • Predict on Ellpack. (#5327)
    • Partial rewrite EllpackPage (#5352)
    • Use ellpack for prediction only when sparsepage doesn't exist. (#5504)
    • RFC: #4354, Roadmap: #5143

    Breaking: XGBoost Python package now requires Pip 19.0 and higher (#5589)

    • If your Linux machine has an old version of Pip, it may attempt to install a source package, leading to a long installation time. This is because we now use the manylinux2010 tag in the binary wheel release. Ensure you have Pip 19.0 or newer by running python3 -m pip -V to check the version. Upgrade Pip with the command:
    python3 -m pip install --upgrade pip
    

    Upgrading to the latest Pip allows us to depend on newer versions of system libraries. TensorFlow also requires Pip 19.0+.

    Breaking: GPU algorithm now requires CUDA 10.0 and higher (#5649)

    • CUDA 10.0 is necessary to make the GPU algorithm deterministic (#5361).

    Breaking: silent parameter is now removed (#5476)

    • Please use verbosity instead.
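
Existing parameter dictionaries can be migrated mechanically. The helper below is a hypothetical sketch and not part of XGBoost. It assumes that `silent=True` corresponds to `verbosity=0` (silent) and `silent=False` to `verbosity=1` (warnings), which may not match the old behaviour exactly in every version.

```python
def migrate_silent(params):
    """Return a copy of `params` with the removed `silent` key replaced by `verbosity`."""
    migrated = dict(params)
    if "silent" in migrated:
        silent = migrated.pop("silent")
        # An explicit verbosity set by the caller takes precedence.
        migrated.setdefault("verbosity", 0 if silent else 1)
    return migrated


old_params = {"max_depth": 4, "silent": True}
new_params = migrate_silent(old_params)
```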

    Breaking: Set output_margin to True for custom objectives (#5564)

    • Custom objectives in both the R and Python interfaces now receive untransformed (raw) prediction outputs.
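
In practice this means a custom objective must apply its own link function before computing gradients. The sketch below illustrates this for logistic loss in plain Python. It works on lists so that it runs without XGBoost, whereas a real Python custom objective receives the predictions together with the training DMatrix; the function names here are illustrative.

```python
import math


def sigmoid(margin):
    """Map a raw margin to a probability."""
    return 1.0 / (1.0 + math.exp(-margin))


def logistic_objective(raw_preds, labels):
    """Gradient and Hessian of log loss with respect to the raw margin."""
    grad, hess = [], []
    for margin, y in zip(raw_preds, labels):
        p = sigmoid(margin)          # raw margin in, so transform it here
        grad.append(p - y)           # first derivative
        hess.append(p * (1.0 - p))   # second derivative
    return grad, hess


# Raw margins, not probabilities: 0.0 corresponds to p = 0.5.
grad, hess = logistic_objective([0.0, 2.0, -2.0], [1.0, 1.0, 0.0])
```

An objective that treated its input as already-transformed probabilities would compute the wrong gradients after this change.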

    Breaking: Makefile is now removed. We use CMake exclusively to build XGBoost (#5513)

    • Exception: the R package uses Autotools, as the CRAN ecosystem has not yet widely adopted CMake.

    Breaking: distcol updater is now removed (#5507)

    • The distcol updater has long been broken, and we currently lack the resources to write a working implementation from scratch.

    Deprecation notices

    • Python 3.5. This release is the last release to support Python 3.5. The following release (1.2.0) will require Python 3.6.
    • Scala 2.11. Currently XGBoost4J supports Scala 2.11. However, if a future release of XGBoost adopts Spark 3, it will not support Scala 2.11, as Spark 3 requires Scala 2.12+. We do not yet know which XGBoost release will adopt Spark 3.

    Known limitations

    • (Python package) When early stopping is activated with early_stopping_rounds at training time, the prediction method (xgb.predict()) behaves in a surprising way. If XGBoost runs for M rounds and chooses iteration N (N < M) as the best iteration, then the prediction method will use M trees by default. To use the best iteration (N trees), users will need to manually take the best iteration field bst.best_iteration and pass it as the ntree_limit argument to xgb.predict(). See #5209 and #4052 for additional context.
    • GPU ranking objective is currently not deterministic (#5561).
    • When training parameter reg_lambda is set to zero, some leaf nodes may be assigned a NaN value. (See discussion.) For now, please set reg_lambda to a nonzero value.
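
The early stopping limitation can be illustrated with a toy simulation. This is plain Python rather than XGBoost code: both functions, the validation errors and the per-tree outputs are made up for this sketch, and rounds are counted as numbers of trees.

```python
def run_with_early_stopping(val_errors, early_stopping_rounds):
    """Stop once the validation error has not improved for
    `early_stopping_rounds` consecutive rounds.

    Returns (rounds_run, best_round), both as numbers of trees.
    """
    best_error = float("inf")
    best_round = 0
    for round_no, err in enumerate(val_errors, start=1):
        if err < best_error:
            best_error, best_round = err, round_no
        elif round_no - best_round >= early_stopping_rounds:
            return round_no, best_round
    return len(val_errors), best_round


def predict(tree_outputs, num_trees=None):
    """Sum the outputs of the first `num_trees` trees; None means all of them."""
    used = tree_outputs if num_trees is None else tree_outputs[:num_trees]
    return sum(used)


val_errors = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38]
rounds_run, best_round = run_with_early_stopping(val_errors, early_stopping_rounds=3)

# One made-up output per tree for a single input row.
tree_outputs = [0.5, 0.25, 0.125, 0.5, 0.5, 0.5][:rounds_run]

default_pred = predict(tree_outputs)                    # uses all M trees
best_pred = predict(tree_outputs, num_trees=best_round)  # uses only the N best trees
```

Training runs for M = 6 rounds, but the best round is N = 3, so the default prediction includes three trees that were added after the validation error stopped improving.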

    Community and Governance

    • The XGBoost Project Management Committee (PMC) is pleased to announce a new committer: Egor Smirnov (@SmirnovEgorRu). He has led a major initiative to improve the performance of XGBoost on multi-core CPUs.

    Bug-fixes

    • Improved compatibility with scikit-learn (#5255, #5505, #5538)
    • Remove f-string, since it's not supported by Python 3.5 (#5330). Note that Python 3.5 support is deprecated and scheduled to be dropped in the upcoming release (1.2.0).
    • Fix the pruner so that it doesn't prune the same branch twice (#5335)
    • Enforce only major version in JSON model schema (#5336). Any major revision of the model schema would bump up the major version.
    • Fix a small typo in sklearn.py that broke multiple eval metrics (#5341)
    • Restore loading model from a memory buffer (#5360)
    • Define lazy isinstance for Python compat (#5364)
    • [R] fixed uses of class() (#5426)
    • Force compressed buffer to be 4 bytes aligned, to keep cuda-memcheck happy (#5441)
    • Remove warning for calling host function (std::max) on a GPU device (#5453)
    • Fix uninitialized value bug in xgboost callback (#5463)
    • Fix model dump in CLI (#5485)
    • Fix out-of-bound array access in WQSummary::SetPrune() (#5493)
    • Ensure that configured dmlc/build_config.h is picked up by Rabit and XGBoost, to fix build on Alpine (#5514)
    • Fix a misspelled method, made in a git merge (#5509)
    • Fix a bug in binary model serialization (#5532)
    • Fix CLI model IO (#5535)
    • Don't use uint for threads (#5542)
    • Fix R interaction constraints to handle more than 100000 features (#5543)
    • [jvm-packages] XGBoost Spark should deal with NaN when parsing evaluation output (#5546)
    • GPU-side data sketching is now aware of query groups in learning-to-rank data (#5551)
    • Fix DMatrix slicing for newly added fields (#5552)
    • Fix configuration status with loading binary model (#5562)
    • Fix build when OpenMP is disabled (#5566)
    • R compatibility patches (#5577, #5600)
    • gpu_hist performance fixes (#5558)
    • Don't set seed on CLI interface (#5563)
    • [R] When serializing model, preserve model attributes related to early stopping (#5573)
    • Avoid rabit calls in learner configuration (#5581)
    • Hide C++ symbols in libxgboost.so when building Python wheel (#5590). This fixes apache/incubator-tvm#4953.
    • Fix compilation on Mac OSX High Sierra (10.13) (#5597)
    • Fix build on big endian CPUs (#5617)
    • Resolve crash due to use of vector<bool>::iterator (#5642)
    • Validate JSON model dump using JSON schema (#5660)

    Performance improvements

    • Wide dataset quantile performance improvement (#5306)
    • Reduce memory usage of GPU-side data sketching (#5407)
    • Reduce span check overhead (#5464)
    • Serialise booster after training to free up GPU memory (#5484)
    • Use the maximum amount of GPU shared memory available to speed up the histogram kernel (#5491)
    • Use non-synchronising scan in Thrust (#5560)
    • Use cudaDeviceGetAttribute() instead of cudaGetDeviceProperties() for speed (#5570)

    API changes

    • Support importing data from a Pandas SparseArray (#5431)
    • HostDeviceVector (vector shared between CPU and GPU memory) now exposes HostSpan interface, to enable access on the CPU side with bound check (#5459)
    • Accept other gradient types for SplitEntry (#5467)

    Usability Improvements, Documentation

    • Add JVM_CHECK_CALL to prevent C++ exceptions from leaking into the JVM layer (#5199)
    • Updated Windows build docs (#5283)
    • Update affiliation of @hcho3 (#5292)
    • Display Sponsor button, link to OpenCollective (#5325)
    • Update docs for GPU external memory (#5332)
    • Add link to GPU documentation (#5437)
    • Small updates to GPU documentation (#5483)
    • Edits on tutorial for XGBoost job on Kubernetes (#5487)
    • Add reference to GPU external memory (#5490)
    • Fix typos (#5346, #5371, #5384, #5399, #5482, #5515)
    • Update Python doc (#5517)
    • Add Neptune and Optuna to list of examples (#5528)
    • Raise error if the number of data weights doesn't match the number of data sets (#5540)
    • Add a note about GPU ranking (#5572)
    • Clarify meaning of training parameter in the C API function XGBoosterPredict() (#5604)
    • Better error handling for situations where existing trees cannot be modified (#5406, #5418). This feature is enabled when process_type is set to update.

    Maintenance: testing, continuous integration, build system

    • Add C++ test coverage for data sketching (#5251)
    • Ignore gdb_history (#5257)
    • Rewrite setup.py. (#5271, #5280)
    • Use scikit-learn in extra dependencies (#5310)
    • Add CMake option to build static library (#5397)
    • [R] changed FindLibR to take advantage of CMake cache (#5427)
    • [R] fixed inconsistency in R -e calls in FindLibR.cmake (#5438)
    • Refactor tests with data generator (#5439)
    • Resolve failing Travis CI (#5445)
    • Update dmlc-core. (#5466)
    • [CI] Use clang-tidy 10 (#5469)
    • De-duplicate code for checking maximum number of nodes (#5497)
    • [CI] Use Ubuntu 18.04 LTS in JVM CI, because 19.04 is EOL (#5537)
    • [jvm-packages] [CI] Create a Maven repository to host SNAPSHOT JARs (#5533)
    • [jvm-packages] [CI] Publish XGBoost4J JARs with Scala 2.11 and 2.12 (#5539)
    • [CI] Use Vault repository to re-gain access to devtoolset-4 (#5589)

    Maintenance: Refactor code for legibility and maintainability

    • Move prediction cache to Learner (#5220, #5302)
    • Remove SimpleCSRSource (#5315)
    • Refactor SparsePageSource, delete cache files after use (#5321)
    • Remove unnecessary DMatrix methods (#5324)
    • Split up LearnerImpl (#5350)
    • Move segment sorter to common (#5378)
    • Move thread local entry into Learner (#5396)
    • Split up test helpers header (#5455)
    • Requires setting leaf stat when expanding tree (#5501)
    • Purge device_helpers.cuh (#5534)
    • Use thrust functions instead of custom functions (#5544)

    Acknowledgement

    Contributors: Nan Zhu (@CodingCat), Rory Mitchell (@RAMitchell), @ShvetsKS, Egor Smirnov (@SmirnovEgorRu), Andrew Kane (@ankane), Avinash Barnwal (@avinashbarnwal), Bart Broere (@bartbroere), Andy Adinets (@canonizer), Chen Qin (@chenqin), Daiki Katsuragawa (@daikikatsuragawa), David Díaz Vico (@daviddiazvico), Darius Kharazi (@dkharazi), Darby Payne (@dpayne), Jason E. Aten, Ph.D. (@glycerine), Philip Hyunsu Cho (@hcho3), James Lamb (@jameslamb), Jan Borchmann (@jborchma), Kamil A. Kaczmarek (@kamil-kaczmarek), Melissa Kohl (@mjkohl32), Nicolas Scozzaro (@nscozzaro), Paul Kaefer (@paulkaefer), Rong Ou (@rongou), Samrat Pandiri (@samratp), Sriram Chandramouli (@sriramch), Yuan Tang (@terrytangyuan), Jiaming Yuan (@trivialfis), Liang-Chi Hsieh (@viirya), Bobby Wang (@wbo4958), Zhang Zhang (@zhangzhang10)

    Reviewers: Nan Zhu (@CodingCat), @LeZhengThu, Rory Mitchell (@RAMitchell), @ShvetsKS, Egor Smirnov (@SmirnovEgorRu), Steve Bronder (@SteveBronder), Nikita Titov (@StrikerRUS), Andrew Kane (@ankane), Avinash Barnwal (@avinashbarnwal), @brydag, Andy Adinets (@canonizer), Chandra Shekhar Reddy (@chandrureddy), Chen Qin (@chenqin), Codecov (@codecov-io), David Díaz Vico (@daviddiazvico), Darby Payne (@dpayne), Jason E. Aten, Ph.D. (@glycerine), Philip Hyunsu Cho (@hcho3), James Lamb (@jameslamb), @johnny-cat, Mu Li (@mli), Mate Soos (@msoos), @rnyak, Rong Ou (@rongou), Sriram Chandramouli (@sriramch), Toby Dylan Hocking (@tdhock), Yuan Tang (@terrytangyuan), Oleksandr Pryimak (@trams), Jiaming Yuan (@trivialfis), Liang-Chi Hsieh (@viirya), Bobby Wang (@wbo4958)

  • v1.1.0rc2(May 4, 2020)

  • v1.1.0rc1(Apr 24, 2020)

  • v1.0.2(Mar 4, 2020)

    This patch release applies the following patches to the 1.0.0 release:

    • Fix a small typo in sklearn.py that broke multiple eval metrics (#5341)
    • Restore loading model from buffer. (#5360)
    • Use type name for data type check. (#5364)
  • v1.0.1(Feb 21, 2020)

    This release is identical to the 1.0.0 release, except that it fixes a small bug that rendered 1.0.0 incompatible with Python 3.5. See #5328.

Owner
Distributed (Deep) Machine Learning Community
A Community of Awesome Machine Learning Projects
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

CatBoost is a machine learning method based on gradient boosting over decision tree

CatBoost 6.9k Jan 5, 2023
Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray

A unified Data Analytics and AI platform for distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray What is Analytics Zoo? Analytics Zo

null 2.5k Dec 28, 2022
An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library.

Ray provides a simple, universal API for building distributed applications. Ray is packaged with the following libraries for accelerating machine lear

null 23.3k Dec 31, 2022
Kubeflow is a machine learning (ML) toolkit that is dedicated to making deployments of ML workflows on Kubernetes simple, portable, and scalable.

SDK: Overview of the Kubeflow pipelines service Kubeflow is a machine learning (ML) toolkit that is dedicated to making deployments of ML workflows on

Kubeflow 3.1k Jan 6, 2023
Automatically build ARIMA, SARIMAX, VAR, FB Prophet and XGBoost Models on Time Series data sets with a Single Line of Code. Now updated with Dask to handle millions of rows.

Auto_TS: Auto_TimeSeries Automatically build multiple Time Series models using a Single Line of Code. Now updated with Dask. Auto_timeseries is a comp

AutoViz and Auto_ViML 519 Jan 3, 2023
Houseprices - Predict sales prices and practice feature engineering, RFs, and gradient boosting

House Prices - Advanced Regression Techniques Predicting House Prices with Machine Learning This project is build to enhance my knowledge about machin

null 1 Jan 1, 2022
Uber Open Source 1.6k Dec 31, 2022
BigDL: Distributed Deep Learning Framework for Apache Spark

BigDL: Distributed Deep Learning on Apache Spark What is BigDL? BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can w

null 4.1k Jan 9, 2023
Distributed Deep learning with Keras & Spark

Elephas: Distributed Deep Learning with Keras & Spark Elephas is an extension of Keras, which allows you to run distributed deep learning models at sc

Max Pumperla 1.6k Dec 29, 2022
XGBoost-Ray is a distributed backend for XGBoost, built on top of distributed computing framework Ray.

XGBoost-Ray is a distributed backend for XGBoost, built on top of distributed computing framework Ray.

null 92 Dec 14, 2022
Microsoft Machine Learning for Apache Spark

Microsoft Machine Learning for Apache Spark MMLSpark is an ecosystem of tools aimed towards expanding the distributed computing framework Apache Spark

Microsoft Azure 3.9k Dec 30, 2022
mlpack: a scalable C++ machine learning library

a fast, flexible machine learning library Download: current stable version (3.4.2) mlpack

mlpack 4.2k Jan 1, 2023
High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

What is xLearn? xLearn is a high performance, easy-to-use, and scalable machine learning package that contains linear model (LR), factorization machin

Chao Ma 3k Jan 8, 2023
TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.

TensorFlowOnSpark TensorFlowOnSpark brings scalable deep learning to Apache Hadoop and Apache Spark clusters. By combining salient features from the T

Yahoo 3.8k Jan 4, 2023
[DEPRECATED] Tensorflow wrapper for DataFrames on Apache Spark

TensorFrames (Deprecated) Note: TensorFrames is deprecated. You can use pandas UDF instead. Experimental TensorFlow binding for Scala and Apache Spark

Databricks 757 Dec 31, 2022
Spark development environment for k8s

Local Spark Dev Env with Docker Development environment for k8s. Using the spark-operator image to ensure it will be the same environment. Start conta

Otacilio Filho 18 Jan 4, 2022
Code base of KU AIRS: SPARK Autonomous Vehicle Team

KU AIRS: SPARK Autonomous Vehicle Project Check this link for the blog post describing this project and the video of SPARK in simulation and on parkou

Mehmet Enes Erciyes 1 Nov 23, 2021
MooGBT is a library for Multi-objective optimization in Gradient Boosted Trees.

MooGBT is a library for Multi-objective optimization in Gradient Boosted Trees. MooGBT optimizes for multiple objectives by defining constraints on sub-objective(s) along with a primary objective. The constraints are defined as upper bounds on sub-objective loss function. MooGBT uses a Augmented Lagrangian(AL) based constrained optimization framework with Gradient Boosted Trees, to optimize for multiple objectives.

Swiggy 66 Dec 6, 2022
🎛 Distributed machine learning made simple.

🎛 lazycluster Distributed machine learning made simple. Use your preferred distributed ML framework like a lazy engineer.

Machine Learning Tooling 44 Nov 27, 2022