Efficient matrix representations for working with tabular data


Installation

Simply install via conda-forge!

conda install -c conda-forge tabmat
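
The package is also published on PyPI, so a pip install should work as well (though, as some of the comments below note, source builds via pip can hit jemalloc-related import errors):

pip install tabmat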

Use case

TL;DR: We provide matrix classes for efficiently building statistical algorithms with data that is partially dense, partially sparse, and partially categorical.

Data used in economics, actuarial science, and many other fields is often tabular, containing rows and columns. It also tends to have several other properties in common:

  • It often is very sparse.
  • It often contains a mix of dense and sparse columns.
  • It often contains categorical data, processed into many columns of indicator values created by "one-hot encoding."

High-performance statistical applications often require fast computation of certain operations, such as the following (a short code sketch appears after the list):

  • Computing sandwich products of the data, transpose(X) @ diag(d) @ X. A sandwich product shows up in the solution to weighted least squares, as well as in the Hessian of the likelihood in generalized linear models such as Poisson regression.
  • Matrix-vector products, possibly on only a subset of the rows or columns. For example, when limiting computation to an "active set" in an L1-penalized coordinate descent implementation, we may only need to compute a matrix-vector product on a small subset of the columns.
  • Computing all operations on standardized predictors which have mean zero and standard deviation one. This helps with numerical stability and optimizer efficiency in a wide range of machine learning algorithms.
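
As a rough sketch, here is what these operations look like through tabmat's interface (the matvec, transpose_matvec, and sandwich methods are described below; the sandwich(d, rows, cols) signature matches the tracebacks quoted later on this page):

import numpy as np
import tabmat

rng = np.random.default_rng(0)
X = tabmat.DenseMatrix(rng.standard_normal((1000, 5)))
d = rng.uniform(size=1000)  # e.g. GLM working weights
v = np.ones(5)

# Sandwich product: transpose(X) @ diag(d) @ X, without materializing diag(d).
H = X.sandwich(d)

# Matrix-vector products, in both directions.
Xv = X.matvec(v)
Xt_d = X.transpose_matvec(d)

# Restrict the sandwich product to an "active set" of columns.
# (int32 index arrays are an assumption about the expected dtype.)
H_sub = X.sandwich(d, rows=None, cols=np.arange(2, dtype=np.int32))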

This library and its design

We designed this library with the above use cases in mind. We built it first for estimating generalized linear models, but we expect it to be useful in a variety of econometric and statistical applications. The library was born out of our need for speed, and its unified API is motivated by the desire to work with a single, consistent matrix interface inside our statistical algorithms.

Design principles:

  • Speed and memory efficiency are paramount.
  • You don't need to sacrifice functionality by using this library: DenseMatrix and SparseMatrix subclass np.ndarray and scipy.sparse.csc_matrix respectively, and inherit behavior from those classes wherever it is not improved on.
  • As much as possible, the syntax follows NumPy, and dimension-reducing operations (like sum) return NumPy arrays, following NumPy's conventions about the dimensions of results. The aim is to make these classes as close as possible to being drop-in replacements for numpy.ndarray. This is not always possible, however, due to the differing APIs of numpy.ndarray and scipy.sparse.
  • Other operations, such as toarray, mimic Scipy sparse syntax.
  • All matrix classes support matrix-vector products, sandwich products, and getcol.

Individual subclasses may support significantly more operations.
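
A quick sketch of this shared interface in practice (behavior as described above; details may vary by version):

import numpy as np
import tabmat

X = tabmat.DenseMatrix(np.arange(6, dtype=np.float64).reshape(3, 2))

X.sum(axis=0)           # dimension-reducing op, returns a plain np.ndarray
X.toarray()             # SciPy-sparse-style conversion to a dense array
X.getcol(0)             # available on every matrix class
X.matvec(np.ones(2))    # matrix-vector product
X.sandwich(np.ones(3))  # transpose(X) @ diag(d) @ X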

Matrix types

  • DenseMatrix represents dense matrices, subclassing numpy.ndarray. It additionally supports the methods getcol, toarray, sandwich, standardize, and unstandardize.
  • SparseMatrix represents column-major sparse data, subclassing scipy.sparse.csc_matrix. It additionally supports methods sandwich and standardize.
  • CategoricalMatrix represents one-hot encoded categorical matrices. Because all the non-zeros in these matrices are ones and because each row has only one non-zero, the data can be represented and multiplied much more efficiently than a generic sparse matrix.
  • SplitMatrix represents matrices with dense, sparse, and categorical parts, allowing for a significant speedup in matrix multiplications (see the sketch after this list).
  • StandardizedMatrix efficiently and sparsely represents a matrix that has had its columns standardized to have mean zero and variance one. Even if the underlying matrix is sparse, such a standardized matrix will be dense. However, by storing the scaling and shifting factors separately, StandardizedMatrix retains the sparsity of the original matrix.
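
As a sketch, a mixed matrix can be assembled directly from the component classes, or built from a DataFrame with the from_pandas constructor (whose column-splitting heuristics are not shown here):

import numpy as np
import pandas as pd
import tabmat
from scipy import sparse

rng = np.random.default_rng(0)
n = 100

dense_part = tabmat.DenseMatrix(rng.standard_normal((n, 3)))
sparse_part = tabmat.SparseMatrix(
    sparse.random(n, 5, density=0.1, format="csc", random_state=0)
)
cat_part = tabmat.CategoricalMatrix(pd.Categorical(rng.integers(0, 4, size=n)))

# Combine the parts column-wise into one matrix with a unified API.
X = tabmat.SplitMatrix([dense_part, sparse_part, cat_part])

# Or let tabmat choose representations for a DataFrame's columns.
df = pd.DataFrame({
    "x": rng.standard_normal(n),
    "g": pd.Categorical(rng.integers(0, 4, size=n)),
})
X2 = tabmat.from_pandas(df)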

(Benchmark figure: performance on a wide data set.)

Benchmarks

See here for detailed benchmarking.

API documentation

See here for detailed API documentation.

Comments
  • Poor performance on narrow sparse matrices

    I've been investigating problems where our MKL-based sparse matrices are massively underperforming scipy.sparse. For example:

      operation           storage  memory        time
    0    matvec  scipy.sparse csc       0  0.00211215
    1    matvec  quantcore.matrix       0   0.0266283
    

    This is a matrix with 3e6 rows and 3 columns.

    It seems like having a small number of columns makes MKL perform quite poorly. I'm not sure why that's the case, but it may be worth adding a check and falling back to scipy.sparse in narrow cases like this. These narrow cases may actually be the dominant use case for sparse matrices, because a sparse block will typically be a small component of a SplitMatrix.
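
    A rough reproduction sketch (timings are illustrative; the import name is tabmat here, formerly quantcore.matrix, and newer releases no longer use MKL — see release 2.0.0 below):

    import time

    import numpy as np
    from scipy import sparse
    import tabmat

    # Tall and narrow: 3e6 rows, 3 columns.
    X_csc = sparse.random(3_000_000, 3, density=0.01, format="csc", random_state=0)
    X_tm = tabmat.SparseMatrix(X_csc)
    v = np.ones(3)

    t0 = time.perf_counter()
    _ = X_csc @ v
    t1 = time.perf_counter()
    _ = X_tm.matvec(v)
    t2 = time.perf_counter()

    print(f"scipy.sparse csc matvec: {t1 - t0:.6f}s")
    print(f"tabmat matvec:           {t2 - t1:.6f}s")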

    help wanted 
    opened by tbenthompson 11
  • Swap n_rows with n_cols in matvec

    This might fix https://github.com/Quantco/quantcore.glm/issues/323. I think we were passing the number of rows into matvec when we meant to pass the number of columns. But maybe I'm misunderstanding what's going on.

    The function signature for matvec is

    https://github.com/Quantco/quantcore.matrix/blob/9ef54c6cb21e8d8063c0968fe47c300b79d3af4b/src/quantcore/matrix/ext/categorical.pyx#L61-L62

    but previously we were passing in the number of rows as the last argument.

    opened by jtilly 6
  • Build script in PyPI source version uses default `jemalloc`

    I see the build script for Linux uses jemalloc configured with --disable-tls: "./autogen.sh --disable-cxx --with-jemalloc-prefix=local --with-install-suffix=local --disable-tls --disable-initial-exec-tls",

    However, the source distribution on PyPI doesn't run that script when installed through pip, relying instead on whatever jemalloc it finds when it tries to compile. If, for example, one tries to install tabmat from source through pip, it will later fail to import, complaining about an error with jemalloc:

    cannot allocate memory in static TLS block
    
    opened by david-cortes 5
  • BUG: cannot allocate memory in static TLS block when installing through pip

    The installation via conda-forge was getting stuck in the "Solving environment" step, so I tried to install with pip, given that the package is available on PyPI. pip install glum runs in seconds, but then I am unable to import anything from it, with the following error:

    In [1]: from glum import GeneralizedLinearRegressor
    ---------------------------------------------------------------------------
    ImportError                               Traceback (most recent call last)
    <ipython-input-1-0284693fe484> in <module>
    ----> 1 from glum import GeneralizedLinearRegressor
    
    ~/anaconda3/lib/python3.9/site-packages/glum/__init__.py in <module>
          1 import pkg_resources
          2 
    ----> 3 from ._distribution import TweedieDistribution
          4 from ._glm import GeneralizedLinearRegressor
          5 from ._glm_cv import GeneralizedLinearRegressorCV
    
    ~/anaconda3/lib/python3.9/site-packages/glum/_distribution.py in <module>
          6 import numpy as np
          7 from scipy import sparse, special
    ----> 8 from tabmat import MatrixBase, StandardizedMatrix
          9 
         10 from ._functions import (
    
    ~/anaconda3/lib/python3.9/site-packages/tabmat/__init__.py in <module>
    ----> 1 from .categorical_matrix import CategoricalMatrix
          2 from .constructor import from_csc, from_pandas
          3 from .dense_matrix import DenseMatrix
          4 from .matrix_base import MatrixBase
          5 from .sparse_matrix import SparseMatrix
    
    ~/anaconda3/lib/python3.9/site-packages/tabmat/categorical_matrix.py in <module>
        171 from .ext.split import sandwich_cat_cat, sandwich_cat_dense
        172 from .matrix_base import MatrixBase
    --> 173 from .sparse_matrix import SparseMatrix
        174 from .util import (
        175     check_matvec_out_shape,
    
    ~/anaconda3/lib/python3.9/site-packages/tabmat/sparse_matrix.py in <module>
          4 from scipy import sparse as sps
          5 
    ----> 6 from .ext.sparse import (
          7     csc_rmatvec,
          8     csc_rmatvec_unrestricted,
    
    ImportError: /home/mathurin/anaconda3/lib/python3.9/site-packages/tabmat/ext/../../tabmat.libs/libjemalloclocal-691a3dac.so.2: cannot allocate memory in static TLS block
    

    Googling did not help. Is there a way to make the pip-installed version work?

    opened by mathurinm 5
  • Improvements to SplitMatrix

    • Allow SplitMatrix to be constructed from another SplitMatrix.
    • Allow inputs of SplitMatrix to be 1-d
    • Implement __getitem__ for column subset
    • Also had to implement column subsetting for CategoricalMatrix
    • __repr__ uses the __repr__ method of components instead of str()

    ToDo:

    • [ ] FIX BUG WITH _split_col_subsets (first confirm that it's a bug)
    • [ ] Add testing for new features

    Checklist

    • [ ] Added a CHANGELOG.rst entry
    opened by MarcAntoineSchmidtQC 5
  • Enable dropping one column from a CategoricalMatrix?

    Currently, CategoricalMatrix does not provide an easy way to drop a column. We are required to include a category for every row in the dataset, but in an unregularized setting, it is sometimes nice to drop one column.

    Something sort of like this is already implemented via the cols parameter to the matrix-vector and sandwich functions (see the sketch below).
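
    For reference, a minimal sketch of that cols-based restriction (the sandwich(d, rows, cols) signature is taken from the tracebacks elsewhere on this page; int32 indices are an assumption about the expected dtype):

    import numpy as np
    import pandas as pd
    import tabmat

    cat = tabmat.CategoricalMatrix(pd.Categorical(["a", "b", "c", "a", "b"]))
    d = np.ones(cat.shape[0])

    # Sandwich over every column except the first, i.e. "drop" category "a".
    cols = np.arange(1, cat.shape[1], dtype=np.int32)
    H = cat.sandwich(d, rows=None, cols=cols)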

    question on hold 
    opened by tbenthompson 5
  • BUG: segfault when fitting a GeneralizedLinearRegressor

    Requirements: pip install libsvmdata

    The following script gives me a segfault:

    from libsvmdata import fetch_libsvm
    from glum import GeneralizedLinearRegressor
    
    X, y = fetch_libsvm("rcv1.binary")
    clf = GeneralizedLinearRegressor(alpha=0.01, fit_intercept=False,
                                     family="gaussian")
    clf.fit(X, y)
    

    Output:

    In [1]: %run glum_segfault.py
    Dataset: rcv1.binary
    [1]    271745 segmentation fault (core dumped)  ipython
    

    I'm using glum 2.0.3

    @qb3

    opened by mathurinm 4
  • Bump pypa/cibuildwheel from 2.2.2 to 2.3.0

    Bumps pypa/cibuildwheel from 2.2.2 to 2.3.0.

    Release notes

    Sourced from pypa/cibuildwheel's releases.

    v2.3.0

    • 📈 cibuildwheel now defaults to manylinux2014 image for linux builds, rather than manylinux2010. If you want to stick with manylinux2010, it's simple to set this using the image options. (#926)
    • ✨ You can now pass environment variables from the host machine into the Docker container during a Linux build. Check out the docs for CIBW_ENVIRONMENT_PASS_LINUX for the details. (#914)
    • ✨ Added support for building PyPy 3.8 wheels. (#881)
    • ✨ Added support for building Windows arm64 CPython wheels on a Windows arm64 runner. We can't test this in CI yet, so for now, this is experimental. (#920)
    • 📚 Improved the deployment documentation (#911)
    • 🛠 Changed the escaping behaviour inside cibuildwheel's option placeholders e.g. {project} in before_build or {dest_dir} in repair_wheel_command. This allows bash syntax like ${SOME_VAR} to passthrough without being interpreted as a placeholder by cibuildwheel. See this section in the docs for more info. (#889)
    • 🛠 Pip updated to 21.3, meaning it now defaults to in-tree builds again. If this causes an issue with your project, setting environment variable PIP_USE_DEPRECATED=out-of-tree-build is available as a temporary flag to restore the old behaviour. However, be aware that this flag will probably be removed soon. (#881)
    • 🐛 You can now access the current Python interpreter using python3 within a build on Windows (#917)

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies 
    opened by dependabot[bot] 4
  • Use a namespaced version of `jemalloc`

    We are currently observing issues when using quantcore.matrix in conjunction with onnx and onnxruntime on macOS. The call python -c 'import onnx; import quantcore.matrix.ext.dense; import onnxruntime' fails with a bus error or segfault, whereas the call DYLD_INSERT_LIBRARIES=$CONDA_PREFIX/lib/libjemalloc.dylib python -c 'import onnx; import quantcore.matrix.ext.dense; import onnxruntime' passes just fine. This suggests that using an un-namespaced jemalloc is problematic here, as the following traceback shows:

    collecting ... Process 6259 stopped
    * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x4efffffff7)
        frame #0: 0x000000013c0f3704 libjemalloc.2.dylib`je_free_default + 240
    libjemalloc.2.dylib`je_free_default:
    ->  0x13c0f3704 <+240>: str    x20, [x8, w9, sxtw #3]
        0x13c0f3708 <+244>: ldr    w8, [x19, #0x200]
        0x13c0f370c <+248>: sub    w9, w8, #0x1              ; =0x1
        0x13c0f3710 <+252>: str    w9, [x19, #0x200]
    Target 0: (python) stopped.
    (lldb) bt
    * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x4efffffff7)
      * frame #0: 0x000000013c0f3704 libjemalloc.2.dylib`je_free_default + 240
        frame #1: 0x0000000142745010 onnxruntime_pybind11_state.so`std::__1::__hash_table<std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > >, std::__1::__unordered_map_hasher<std::__1::type_index, std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > >, pybind11::detail::type_hash, pybind11::detail::type_equal_to, true>, std::__1::__unordered_map_equal<std::__1::type_index, std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > >, pybind11::detail::type_equal_to, pybind11::detail::type_hash, true>, std::__1::allocator<std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > > > >::__rehash(unsigned long) + 76
        frame #2: 0x0000000142744dd0 onnxruntime_pybind11_state.so`std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > >, void*>*>, bool> std::__1::__hash_table<std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > >, std::__1::__unordered_map_hasher<std::__1::type_index, std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > >, pybind11::detail::type_hash, pybind11::detail::type_equal_to, true>, std::__1::__unordered_map_equal<std::__1::type_index, std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > >, pybind11::detail::type_equal_to, pybind11::detail::type_hash, true>, std::__1::allocator<std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > > > >::__emplace_unique_key_args<std::__1::type_index, std::__1::piecewise_construct_t const&, std::__1::tuple<std::__1::type_index const&>, std::__1::tuple<> >(std::__1::type_index const&, std::__1::piecewise_construct_t const&, std::__1::tuple<std::__1::type_index const&>&&, std::__1::tuple<>&&) + 480
        frame #3: 0x00000001427427dc onnxruntime_pybind11_state.so`pybind11::detail::generic_type::initialize(pybind11::detail::type_record const&) + 396
        frame #4: 0x0000000142751688 onnxruntime_pybind11_state.so`pybind11::class_<onnxruntime::ExecutionOrder>::class_<>(pybind11::handle, char const*) + 140
        frame #5: 0x00000001427513f8 onnxruntime_pybind11_state.so`pybind11::enum_<onnxruntime::ExecutionOrder>::enum_<>(pybind11::handle const&, char const*) + 52
        frame #6: 0x000000014272b5c8 onnxruntime_pybind11_state.so`onnxruntime::python::addObjectMethods(pybind11::module_&, onnxruntime::Environment&) + 296
        frame #7: 0x0000000142734e68 onnxruntime_pybind11_state.so`PyInit_onnxruntime_pybind11_state + 340
        frame #8: 0x000000010019f994 python`_imp_create_dynamic + 2412
        frame #9: 0x00000001000b40f8 python`cfunction_vectorcall_FASTCALL + 208
        frame #10: 0x000000010016bfd8 python`_PyEval_EvalFrameDefault + 30088
    

    My suggestion would be to add an output to the jemalloc-feedstock as described in https://github.com/conda-forge/jemalloc-feedstock/issues/23 that comes with a prefixed version of the library.

    opened by xhochy 4
  • Bump google-github-actions/setup-gcloud from 0.2.0 to 0.2.1

    Bumps google-github-actions/setup-gcloud from 0.2.0 to 0.2.1.

    Release notes

    Sourced from google-github-actions/setup-gcloud's releases.

    setup-gcloud v0.2.1

    Bug Fixes

    Changelog

    Sourced from google-github-actions/setup-gcloud's changelog.

    0.2.1 (2021-02-12)

    Bug Fixes

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

    Dependabot will merge this PR once CI passes on it, as requested by @xhochy.


    dependencies 
    opened by dependabot[bot] 4
  • Update linter

    Updating the flake8 config to match the new flake8 config from glm_benchmarks.

    Changes:

    • changed linter according to the issue
    • added simple docstrings to public functions (most functions were in the main matrix classes)
    • preceded function names with underscores if the functions were only being used internally
    • added “no docstrings in magic function” flake8 error to list of ignores (didn’t seem helpful)
    • added # noqa in places where flake8 errors were just creating issues in unhelpful places

    Closes #45

    Checklist

    • [ ] Added a CHANGELOG.rst entry
    opened by MargueriteBastaQC 4
  • Bump pypa/cibuildwheel from 2.11.3 to 2.11.4

    Bumps pypa/cibuildwheel from 2.11.3 to 2.11.4.

    Release notes

    Sourced from pypa/cibuildwheel's releases.

    v2.11.4

    • 🐛 Fix a bug that caused missing wheels on Windows when a test was skipped using CIBW_TEST_SKIP (#1377)
    • 🛠 Updates CPython 3.11 to 3.11.1 (#1371)
    • 🛠 Updates PyPy to 7.3.10, except on macOS which remains on 7.3.9 due to a bug on that platform. (#1371)
    • 📚 Added a reference to abi3audit to the docs (#1347)
    Commits
    • 27fc88e Bump version: v2.11.4
    • a7e9ece Merge pull request #1371 from pypa/update-dependencies-pr
    • b9a3ed8 Update cibuildwheel/resources/build-platforms.toml
    • 3dcc2ff fix: not skipping the tests stops the copy (Windows ARM) (#1377)
    • 1c9ec76 Merge pull request #1378 from pypa/henryiii-patch-3
    • 22b433d Merge pull request #1379 from pypa/pre-commit-ci-update-config
    • 98fdf8c [pre-commit.ci] pre-commit autoupdate
    • cefc5a5 Update dependencies
    • e53253d ci: move to ubuntu 20
    • e9ecc65 [pre-commit.ci] pre-commit autoupdate (#1374)
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    dependencies 
    opened by dependabot[bot] 0
  • Cannot sandwich SplitMatrix with non-owned array

    This throws an error:

    import numpy as np
    import tabmat
    from scipy.sparse import csc_matrix
    
    rng = np.random.default_rng(seed=123)
    X = rng.standard_normal(size=(100,20))
    Xd = tabmat.DenseMatrix(X[:,:10])
    Xs = tabmat.SparseMatrix(csc_matrix(X[:,10:]))
    Xm = tabmat.SplitMatrix([Xd, Xs])
    Xm.sandwich(np.ones(X.shape[0]))
    
    ---------------------------------------------------------------------------
    Exception                                 Traceback (most recent call last)
    <ipython-input-2-91ba52e4f568> in <module>
          8 Xs = tabmat.SparseMatrix(csc_matrix(X[:,10:]))
          9 Xm = tabmat.SplitMatrix([Xd, Xs])
    ---> 10 Xm.sandwich(np.ones(X.shape[0]))
    
    ~/anaconda3/envs/py39/lib/python3.9/site-packages/tabmat/split_matrix.py in sandwich(self, d, rows, cols)
        287             idx_i = subset_cols_indices[i]
        288             mat_i = self.matrices[i]
    --> 289             res = mat_i.sandwich(d, rows, subset_cols[i])
        290             if isinstance(res, sps.dia_matrix):
        291                 out[(idx_i, idx_i)] += np.squeeze(res.data)
    
    ~/anaconda3/envs/py39/lib/python3.9/site-packages/tabmat/dense_matrix.py in sandwich(self, d, rows, cols)
         62         d = np.asarray(d)
         63         rows, cols = setup_restrictions(self.shape, rows, cols)
    ---> 64         return dense_sandwich(self, d, rows, cols)
         65 
         66     def _cross_sandwich(
    
    src/tabmat/ext/dense.pyx in tabmat.ext.dense.dense_sandwich()
    
    Exception: 
    

    Compare against this:

    Xd = tabmat.DenseMatrix(X[:,:10].copy())
    Xs = tabmat.SparseMatrix(csc_matrix(X[:,10:]))
    Xm = tabmat.SplitMatrix([Xd, Xs])
    Xm.sandwich(np.ones(X.shape[0]))
    

    (No error)

    opened by david-cortes 0
  • tabmat has no attribute __version__

    I find it convenient to be able to check the version directly inside a Python shell.

    In [1]: import tabmat
    
    In [2]: tabmat.__version__
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-2-aae82a909ca3> in <module>
    ----> 1 tabmat.__version__
    
    AttributeError: module 'tabmat' has no attribute '__version__'
    

    This is tabmat 3.0.7, installed from PyPI.

    opened by mathurinm 3
  • Support initializing matrices with Patsy?

    I think we've discussed this, but I don't remember the conclusion and can't find an issue now.

    We recommend from_pandas as the way "most users" should construct tabmat objects. from_pandas then guesses which columns should be treated as categorical. I think it would be really nice to have Patsy-like formulas as an alternative, since

    1. R users (including many economists) like using formulas, and
    2. It's easy to infer from a Patsy formula which columns are categorical, which are sparse (generally interactions with categoricals), and which are dense (everything else), so this could remove some of the guesswork from tabmat and improve performance.

    I'm not sure how feasible this would be, since Patsy is a sizable library that allows for fairly sophisticated formulas and it would be quite an endeavor to replicate all of the functionality. A few ways of doing this would be

    1. Don't change any code, but document how Patsy can already be used to construct a dataframe that can then be passed to tabmat / glum. Warn that this involves creating a large dense matrix as an intermediate. See Twitter discussion: https://twitter.com/esantorella22/status/1447980727820296198
    2. Have tabmat call patsy.dmatrix with "return_type = 'dataframe'", then call tabmat.from_pandas on the resulting pd.DataFrame (a sketch follows below). That would not be any more efficient than (1), but it would save the user a little typing and the need to install patsy themselves. On the down side, it adds a dependency and may force creation of a very large dense matrix.
    3. Support very simple patsy-like formulas without having patsy as a dependency or reproducing its full functionality. That would allow the user to designate which columns should be treated as categorical in a more natural way. See Twitter discussion: https://twitter.com/esantorella22/status/1447981081358184461
    4. Make it so that any Patsy formula can be used to create a tabmat object -- I'm not sure how. Might be hard.
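
    For option 2, the glue code could look roughly like this (from_formula is a hypothetical name; patsy.dmatrix and tabmat.from_pandas are the real calls):

    import pandas as pd
    import patsy
    import tabmat

    def from_formula(formula: str, data: pd.DataFrame):
        """Hypothetical helper: materialize the patsy design matrix as a
        DataFrame, then let tabmat.from_pandas pick representations.
        Note: this forces a potentially large dense intermediate."""
        design = patsy.dmatrix(formula, data, return_type="dataframe")
        return tabmat.from_pandas(design)

    # Usage sketch (right-hand-side formula only):
    # X = from_formula("x1 + C(group)", df)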
    opened by esantorella 2
Releases (latest: 3.1.2)
  • 3.1.2(Jul 1, 2022)

  • 3.1.1(Jul 1, 2022)

    3.1.1 - 2022-07-01

    Other changes:

    • Add Python 3.10 support to CI (remove Python 3.6).
    • We are now building the wheel for PyPI without -march=native to make it more portable across architectures.
  • 3.1.0(Mar 7, 2022)

  • 3.0.8(Jan 3, 2022)

  • 3.0.7(Nov 23, 2021)

  • 3.0.6(Nov 12, 2021)

    Bug fix

    • We fixed a bug in SplitMatrix.matvec, where incorrect matrix-vector products were computed when a SplitMatrix did not contain any dense components.
  • 3.0.5(Nov 5, 2021)

    Other changes

    • We are now specifying the run time dependencies in setup.py, so that missing dependencies are automatically installed from PyPI when installing tabmat via pip.
  • 3.0.4(Nov 3, 2021)

  • 3.0.3(Oct 15, 2021)

  • 3.0.2(Oct 15, 2021)

  • 3.0.1(Oct 8, 2021)

    3.0.1 - 2021-10-07

    Bug fix

    • The license was mistakenly left as proprietary. Corrected to BSD-3-Clause.

    Other changes

    • ReadTheDocs integration.
    • CONTRIBUTING.md
    • Correct pyproject.toml to work with PEP-517
  • 3.0.0(Oct 7, 2021)

    3.0.0 - 2021-10-07

    It's public! Yay!

    Breaking changes:

    • The package has been renamed to tabmat. CELEBRATE!
    • The one_over_var_inf_to_val function has been made private.
    • The csc_to_split function has been renamed to tabmat.from_csc to match the tabmat.from_pandas function.
    • The tabmat.MatrixBase.get_col_means and tabmat.MatrixBase.get_col_stds methods have been made private.
    • The cross_sandwich method has also been made private.

    Bug fixes:

    • StandardizedMatrix.transpose_matvec was giving the wrong answer when the out parameter was provided. This is now fixed.
    • SplitMatrix.__repr__ now calls the __repr__ method of component matrices instead of __str__.

    Other changes:

    • Optimized tabmat.SparseMatrix.matvec and tabmat.SparseMatrix.transpose_matvec for the case when rows and cols are None.
    • Implemented CategoricalMatrix.__rmul__
    • Reorganized the documentation and updated the text to match the current API.
    • Enable indexing the rows of a CategoricalMatrix. Previously CategoricalMatrix.__getitem__ only supported column indexing.
    • Allow creating a SplitMatrix from a list of any MatrixBase objects including another SplitMatrix.
    • Reduced memory usage in tabmat.SplitMatrix.matvec.
  • 2.0.3(Jul 15, 2021)

    2.0.3 - 2021-07-15

    Bug fix:

    • In SplitMatrix.sandwich, when a col subset was specified, incorrect output was produced if the components of the indices array were not sorted. SplitMatrix.__init__ now checks for sorted indices and maintains sorted index lists when combining matrices.

    Other changes:

    • SplitMatrix.__init__ now filters out any empty matrices.
    • StandardizedMatrix.sandwich passes rows=None and cols=None onwards to the underlying matrix instead of replacing them with full arrays of indices. This should improve performance slightly.
    • SplitMatrix.__repr__ now includes the type of the underlying matrix objects in the string output.
  • 2.0.2(Jun 24, 2021)

  • 2.0.1(Jun 20, 2021)

  • 2.0.0(Jun 17, 2021)

    2.0.0 - 2021-06-17

    Breaking changes:

    We renamed several public functions to make them private. These include functions in quantcore.matrix.benchmark that are unlikely to be used outside of this package, as well as

    • quantcore.matrix.dense_matrix._matvec_helper
    • quantcore.matrix.sparse_matrix._matvec_helper
    • quantcore.matrix.split_matrix._prepare_out_array

    Other changes:

    • We removed the dependency on sparse_dot_mkl. We now use scipy.sparse.csr_matvec instead of sparse_dot_mkl.dot_product_mkl on all platforms, because the latter suffered from poor performance, especially on narrow problems. This also means that we removed the function quantcore.matrix.sparse_matrix._dot_product_maybe_mkl.
    • We updated the pre-commit hooks and made sure the code is in line with the new hooks.
  • 1.0.6(Apr 26, 2021)

  • 1.0.5(Apr 26, 2021)

  • 1.0.3(Apr 22, 2021)

    Bug fixes:

    • Added a check in SplitMatrix.__init__ that matrices are two-dimensional.
    • Replaced np.int with np.int64 where appropriate, due to NumPy's deprecation of np.int.
  • 1.0.2(Apr 20, 2021)

  • 1.0.1(Nov 25, 2020)

    Bug fixes:

    • Handling for nulls when setting up a CategoricalMatrix
    • Fixes to make several functions work with both row and col restrictions and out

    Other changes:

    • Added various tests and documentation improvements
  • 1.0.0(Nov 11, 2020)

    Breaking change:

    • Rename dot to matvec. Our dot function supports matrix-vector multiplication for every subclass, but only supports matrix-matrix multiplication for some. We therefore rename it to matvec in line with other libraries.

    Bug fix:

    • Fix a bug in matvec for categorical components when the number of categories exceeds the number of rows.