Python module for performing linear regression for data with measurement errors and intrinsic scatter

Overview

Linear regression for data with measurement errors and intrinsic scatter (BCES)

Python module for performing robust linear regression on (X,Y) data points where both X and Y have measurement errors.

The fitting method is bivariate correlated errors and intrinsic scatter (BCES) and follows the description given in Akritas & Bershady 1996, ApJ. Some of the advantages of BCES regression compared to ordinary least squares fitting (quoted from Akritas & Bershady 1996):

  • it allows for measurement errors on both variables
  • it permits the measurement errors for the two variables to be dependent
  • it permits the magnitudes of the measurement errors to depend on the measurements
  • other "symmetric" lines such as the bisector and the orthogonal regression can be constructed.

To understand how to perform the regression and interpret the results, please read the paper.

Installation

Using pip:

pip install bces

If that does not work, you can install it using the setup.py script:

python setup.py install

You may need to run the last command with sudo.

Alternatively, if you plan to modify the source, install the package with a symlink so that changes to the source files are immediately available:

python setup.py develop

Usage

import bces.bces as BCES
a,b,aerr,berr,covab=BCES.bcesp(x,xerr,y,yerr,cov)

Arguments:

  • x,y : 1D data arrays
  • xerr,yerr: measurement errors affecting x and y, 1D arrays
  • cov : covariance between the measurement errors, 1D array

If you have no reason to believe that your measurement errors are correlated (which is usually the case), you can provide an array of zeroes as input for cov:

cov = numpy.zeros_like(x)
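Putting the pieces together, here is a minimal end-to-end sketch; the synthetic data, seed, and error sizes below are illustrative assumptions, not part of the package:

import numpy as np
import bces.bces as BCES

# Synthetic straight-line data, y = 2x + 1 plus noise (illustrative values)
rng = np.random.default_rng(42)
x = np.linspace(0., 10., 50)
y = 2.*x + 1. + rng.normal(0., 0.5, x.size)
xerr = np.full_like(x, 0.2)   # measurement errors on x
yerr = np.full_like(x, 0.5)   # measurement errors on y
cov = np.zeros_like(x)        # assume uncorrelated measurement errors

a, b, aerr, berr, covab = BCES.bcesp(x, xerr, y, yerr, cov)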

Output:

  • a,b : best-fit parameters of the linear regression such that y = ax + b
  • aerr,berr : the standard deviations in a,b
  • covab : the covariance between a and b (e.g. for plotting confidence bands)

Each element of the arrays a, b, aerr, berr and covab corresponds to the result of one of the four BCES lines: y|x, x|y, bisector and orthogonal, as detailed in the table below. Please read the original BCES paper to understand what these different lines mean.

Element  Method      Description
0        y|x         Assumes x as the independent variable
1        x|y         Assumes y as the independent variable
2        bisector    Line that bisects the y|x and x|y lines. This approach is self-inconsistent; do not use it (cf. Hogg, D. et al. 2010, arXiv:1008.4686).
3        orthogonal  Orthogonal least squares: the line that minimizes orthogonal distances. Use when it is not clear which variable should be treated as the independent one.

By default, bcesp runs in parallel with bootstrapping.
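For instance, a short sketch of extracting the y|x fit (element 0 in the table above) from the outputs of the call shown earlier:

# y|x is element 0 of each output array (see table above)
slope, slope_err = a[0], aerr[0]
intercept, intercept_err = b[0], berr[0]
print(f"y = ({slope:.2f} +/- {slope_err:.2f}) x + ({intercept:.2f} +/- {intercept_err:.2f})")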

Examples

bces-example.ipynb is a Jupyter notebook with a practical, step-by-step example of how to use BCES to perform regression on data with uncertainties on x and y. It also illustrates how to plot the confidence band for a fit.

If you have suggestions of more examples, feel free to add them.

Running Tests

To test your installation, run the following command inside the BCES directory:

pytest -v

Requirements

See requirements.txt.

Citation

If you end up using this code in your paper, you are morally obliged to cite the following works:

1. The original BCES paper: Akritas, M. G., & Bershady, M. A. 1996, ApJ, 470, 706
2. Nemmen, R. et al. 2012, Science, 338, 1445

I spent considerable time writing this code, making sure it is correct and user-friendly, so I would appreciate your citation of the second paper in the above list as a token of gratitude.

If you are really happy with the code, you can buy me a beer.

Misc.

This Python module is inspired by the (much faster) Fortran routine originally written by Akritas et al. I wrote it because I wanted something more portable and easier to use, trading off speed.

For a general tutorial on how to (and how not to) perform linear regression, please read this paper: Hogg, D. et al. 2010, arXiv:1008.4686. In particular, please refrain from using the bisector method.

If you want to plot confidence bands for your fits, have a look at the nmmn package (in particular, the modules nmmn.plots.fitconf and nmmn.stats).
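If you would rather compute a band yourself, standard error propagation on y = ax + b gives var(y) = x^2 var(a) + var(b) + 2x cov(a,b). Below is a minimal sketch using the covab output described earlier; the function name is mine, not part of the package:

import numpy as np

def confidence_band(xgrid, a, b, aerr, berr, covab):
    """1-sigma band for y = a*x + b via error propagation:
    var(y) = x^2 var(a) + var(b) + 2 x cov(a,b)."""
    y = a*xgrid + b
    sigma = np.sqrt((xgrid*aerr)**2 + berr**2 + 2.*xgrid*covab)
    return y - sigma, y + sigma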

Bayesian linear regression

There are a couple of Bayesian approaches to perform linear regression which can be more powerful than BCES, some of which are described below.

A Gibbs Sampler for Multivariate Linear Regression: R code, arXiv:1509.00908. Linear regression in the fairly general case with errors in X and Y, errors may be correlated, intrinsic scatter. The prior distribution of covariates is modeled by a flexible mixture of Gaussians. This is an extension of the very nice work by Brandon Kelly (Kelly, B. 2007, ApJ).

LIRA: A Bayesian approach to linear regression in astronomy: R code, arXiv:1509.05778. Bayesian hierarchical modelling of data with heteroscedastic and possibly correlated measurement errors and intrinsic scatter. The method fully accounts for time evolution: the slope, the normalization, and the intrinsic scatter of the relation can evolve with redshift. The intrinsic distribution of the independent variable is approximated using a mixture of Gaussian distributions whose means and standard deviations depend on time. The method can address scatter in the measured independent variable (a kind of Eddington bias), selection effects in the response variable (Malmquist bias), and departure from linearity in the form of a knee.

AstroML: Machine Learning and Data Mining for Astronomy. Python example of a linear fit to data with correlated errors in x and y using AstroML. In the literature, this is often referred to as total least squares or errors-in-variables fitting.

Todo

If you have improvements to the code, suggestions of examples, ways of speeding up the code, etc., feel free to submit a pull request.

  • implement weighted least squares (WLS)
  • implement unit tests: bces
  • implement unit tests: bootstrap

Visit the author's web page and/or follow him on Twitter (@nemmen).


Copyright (c) 2021, Rodrigo Nemmen. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Comments
  • Installation from PyPI fails


    Trying to install from PyPI fails with a missing LICENSE:

    Collecting bces
      Using cached https://files.pythonhosted.org/packages/a0/1e/2ac48fe352febd8924db926b34c3280252dfda936fd2b5bd12688cb6c571/bces-1.0.tar.gz
        Complete output from command python setup.py egg_info:
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/tmp/pip-install-ECRe_1/bces/setup.py", line 6, in <module>
            with open('LICENSE') as f:
        IOError: [Errno 2] No such file or directory: 'LICENSE'
        
        ----------------------------------------
    Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-ECRe_1/bces/
    

    But installing from git works.

    I think you have to declare the files you want included via package_data in setup.py, or in MANIFEST.in.

    help wanted 
    opened by astrobarn 5
  • fast bootstrapping with numpy vectorization


    Fast bootstrapping using numpy vectorization. (Optional) bias correction on the slope and intercept uncertainties using the bias-corrected and accelerated (BCa) confidence interval method from Efron, An Introduction to the Bootstrap. Chapman & Hall 1993.

    opened by evangroopman 3
  • Add __init__.py?


    I'm not sure if it was just my dodgy install, but I couldn't actually use this as a module until I created an "__init__.py" file in the directory with the line "from BCES.bces import *".

    opened by hposborn 2
  • Unused variable + multiple RuntimeWarning


    I'm testing the code on some random data:

    # Imports needed to run this snippet
    import numpy as np
    from bces.bces import bcesboot

    def getData(c):
        """Initiate random data."""
        import random
        random.seed(9001)
        np.random.seed(117)
        x = np.array([0, 1, 2, 3, 4, 5])
        y = np.array([i**2 + random.random() for i in x])
        xerr = c * np.array([random.random() for i in x])
        yerr = c * np.array([random.random() for i in x])
        return x, y, xerr, yerr
    
    x, y, xerr, yerr = getData(1.)
    cov = np.zeros(len(x))
    N_boot = 1000
    a, b, erra, errb, covab = bcesboot(x, xerr, y, yerr, cov, N_boot)
    

    Two things I noticed:

    1. There is an unused covar_ba variable defined in bces().
    2. If I use bcesboot() with N_boot=1000 I get no warnings, but if I increase it to N_boot=10000 (or more) I get multiple RuntimeWarnings:
    Bootstrapping progress:
    bces_orig.py:54: RuntimeWarning: divide by zero encountered in double_scalars
      a[1] = (y2.var() - sig22var)/(covar_y1y2 - sig12var)	# X|Y
    bces_orig.py:55: RuntimeWarning: invalid value encountered in double_scalars
      a[2] = ( a[0]*a[1] - 1.0 + numpy.sqrt((1.0 + a[0]**2)*(1.0 + a[1]**2)) ) / (a[0]+a[1])	# bisector
    bces_orig.py:60: RuntimeWarning: divide by zero encountered in double_scalars
      a[3] = 0.5*((a[1]-(1./a[0])) + sign*numpy.sqrt(4.+(a[1]-(1./a[0]))**2))	# orthogonal
    bces_orig.py:60: RuntimeWarning: invalid value encountered in double_scalars
      a[3] = 0.5*((a[1]-(1./a[0])) + sign*numpy.sqrt(4.+(a[1]-(1./a[0]))**2))	# orthogonal
    bces_orig.py:68: RuntimeWarning: invalid value encountered in subtract
      xi.append(	( (y2-y2.mean()) * (y2-a[1]*y1-b[1]) - y2err**2 ) / covar_y1y2	)	# X|Y
    bces_orig.py:69: RuntimeWarning: invalid value encountered in multiply
      xi.append(	xi[0] * (1.+a[1]**2)*a[2] / ((a[0]+a[1])*numpy.sqrt((1.+a[0]**2)*(1.+a[1]**2))) + xi[1] * (1.+a[0]**2)*a[2] / ((a[0]+a[1])*numpy.sqrt((1.+a[0]**2)*(1.+a[1]**2)))	)	# bisector
    bces_orig.py:70: RuntimeWarning: divide by zero encountered in double_scalars
      xi.append(	xi[0] * a[3]/(a[0]**2*numpy.sqrt(4.+(a[1]-1./a[0])**2)) + xi[1]*a[3]/numpy.sqrt(4.+(a[1]-1./a[0])**2)	)	# orthogonal
    bces_orig.py:70: RuntimeWarning: invalid value encountered in double_scalars
      xi.append(	xi[0] * a[3]/(a[0]**2*numpy.sqrt(4.+(a[1]-1./a[0])**2)) + xi[1]*a[3]/numpy.sqrt(4.+(a[1]-1./a[0])**2)	)	# orthogonal
    bces_orig.py:176: RuntimeWarning: invalid value encountered in double_scalars
      erra[i]=numpy.sqrt( 1./(nsim-1) * ( numpy.sum(am[:,i]**2)-nsim*(am[:,i].mean())**2 ))
    bces_orig.py:177: RuntimeWarning: invalid value encountered in double_scalars
      errb[i]=numpy.sqrt( 1./(nsim-1) * ( numpy.sum(bm[:,i]**2)-nsim*(bm[:,i].mean())**2 ))
    bces_orig.py:178: RuntimeWarning: invalid value encountered in double_scalars
      covab[i]=1./(nsim-1) * ( numpy.sum(am[:,i]*bm[:,i])-nsim*am[:,i].mean()*bm[:,i].mean() )
    
    opened by Gabriel-p 2
  • Example 2 in Jupyter tutorial does not compile


    Multiple minor bugs, but even when the minor bugs are solved, bces.bcesp() returns an error

    Traceback error:

    """ Traceback (most recent call last): File "/Users/dangeles/anaconda3/lib/python3.5/multiprocessing/pool.py", line 119, in worker result = (True, func(*args, **kwds)) File "/Users/dangeles/anaconda3/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar return list(map(*args)) File "/Users/dangeles/my_repos/BCES/bces.py", line 224, in ab for i in range(nsim): TypeError: 'float' object cannot be interpreted as an integer """

    opened by dangeles 2
Owner
Rodrigo Nemmen
Professor of Astronomy & Astrophysics