Efficient matrix representations for working with tabular data

Last update: Dec 14, 2022

Overview

Efficient matrix representations for working with tabular data

Installation

Simply install via conda-forge!

conda install -c conda-forge tabmat

Use case

TL;DR: We provide matrix classes for efficiently building statistical algorithms with data that is partially dense, partially sparse and partially categorical.

Data used in economics, actuarial science, and many other fields is often tabular, containing rows and columns. Such data frequently has further properties in common:

It often is very sparse.
It often contains a mix of dense and sparse columns.
It often contains categorical data, processed into many columns of indicator values created by "one-hot encoding."

High-performance statistical applications often require fast computation of certain operations, such as

Computing sandwich products of the data, transpose(X) @ diag(d) @ X. A sandwich product shows up in the solution to weighted least squares, as well as in the Hessian of the likelihood in generalized linear models such as Poisson regression.
Matrix-vector products, possibly on only a subset of the rows or columns. For example, when limiting computation to an "active set" in an L1-penalized coordinate descent implementation, we may only need to compute a matrix-vector product on a small subset of the columns.
Computing all operations on standardized predictors which have mean zero and standard deviation one. This helps with numerical stability and optimizer efficiency in a wide range of machine learning algorithms.
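To make the sandwich product concrete, here is a minimal sketch in plain NumPy (not this library's implementation). The helper name sandwich and the random test data are assumptions chosen for illustration. It shows that transpose(X) @ diag(d) @ X can be computed by scaling the rows of X, so the n-by-n diagonal matrix never has to be built.

```python
import numpy as np


def sandwich(X, d):
    """Compute X.T @ diag(d) @ X without forming the n-by-n diagonal matrix."""
    # Multiplying row i of X by d[i] is equivalent to diag(d) @ X.
    return X.T @ (X * d[:, None])


rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))       # illustrative data, 1000 rows, 5 columns
d = rng.uniform(0.5, 2.0, size=1000)     # illustrative weights

fast = sandwich(X, d)
naive = X.T @ np.diag(d) @ X             # reference: builds a 1000 x 1000 matrix

print(np.allclose(fast, naive))
```

The naive version allocates memory quadratic in the number of rows, which is why specialized routines for this product matter on long tables.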

This library and its design

We designed this library with the above use cases in mind. We built this library first for estimating generalized linear models, but expect it will be useful in a variety of econometric and statistical use cases. This library was born out of our need for speed, and its unified API is motivated by the desire to work with a single matrix interface inside our statistical algorithms.

Design principles:

Speed and memory efficiency are paramount.
You don't need to sacrifice functionality by using this library: DenseMatrix and SparseMatrix subclass np.ndarray and scipy.sparse.csc_matrix respectively, and inherit behavior from those classes wherever it is not improved on.
As much as possible, syntax follows NumPy syntax, and dimension-reducing operations (like sum) return NumPy arrays, following NumPy conventions about the dimensions of results. The aim is to make these classes as close as possible to being drop-in replacements for numpy.ndarray. This is not always possible, however, due to the differing APIs of numpy.ndarray and scipy.sparse.
Other operations, such as toarray, mimic Scipy sparse syntax.
All matrix classes support matrix-vector products, sandwich products, and getcol.

Individual subclasses may support significantly more operations.
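As an illustration of the differing APIs mentioned above, the following sketch uses only NumPy and SciPy (no classes from this library) to show how a column sum differs between numpy.ndarray and scipy.sparse.csc_matrix. The variable names are illustrative.

```python
import numpy as np
from scipy import sparse

dense = np.arange(6, dtype=float).reshape(3, 2)
sp = sparse.csc_matrix(dense)

dense_sum = dense.sum(axis=0)            # 1-d result, shape (2,)
sparse_sum = sp.sum(axis=0)              # 2-d result, shape (1, 2)

# Bringing the sparse result to the NumPy convention takes an extra step.
flat = np.asarray(sparse_sum).ravel()

print(dense_sum.shape, sparse_sum.shape, flat.shape)
```

Returning plain NumPy arrays with NumPy's dimensions from reductions avoids having to special-case this in downstream code.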

Matrix types

DenseMatrix represents dense matrices, subclassing numpy.ndarray. It additionally supports the methods getcol, toarray, sandwich, standardize, and unstandardize.
SparseMatrix represents column-major sparse data, subclassing scipy.sparse.csc_matrix. It additionally supports methods sandwich and standardize.
CategoricalMatrix represents one-hot encoded categorical matrices. Because all the non-zeros in these matrices are ones and because each row has only one non-zero, the data can be represented and multiplied much more efficiently than a generic sparse matrix.
SplitMatrix represents matrices with dense, sparse and categorical parts, allowing for a significant speedup in matrix multiplications.
StandardizedMatrix efficiently and sparsely represents a matrix that has had its columns normalized to have mean zero and variance one. Even if the underlying matrix is sparse, such a normalized matrix will be dense. However, by storing the scaling and shifting factors separately, StandardizedMatrix retains the original matrix sparsity.
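The two ideas behind CategoricalMatrix and StandardizedMatrix can be sketched with NumPy and SciPy alone. This is not the library's code: all names and data below are assumptions for illustration. The first part applies the shift and scale factors to the vector instead of to the matrix, so the sparse matrix is never densified. The second part shows that multiplying a one-hot matrix by a vector reduces to an index lookup.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Standardized matvec on a sparse matrix without densifying it.
X = sparse.random(200, 6, density=0.1, format="csc", random_state=0)
means = np.asarray(X.mean(axis=0)).ravel()
sq_means = np.asarray(X.multiply(X).mean(axis=0)).ravel()
stds = np.sqrt(sq_means - means**2)      # column standard deviations

v = rng.standard_normal(6)
# ((X - means) / stds) @ v  ==  X @ (v / stds) - means @ (v / stds)
scaled_v = v / stds
lazy = X @ scaled_v - means @ scaled_v
dense_result = ((X.toarray() - means) / stds) @ v   # reference, dense

# One-hot matvec as a lookup: each row has exactly one non-zero, equal to one.
codes = rng.integers(0, 4, size=200)     # category index of each row
w = rng.standard_normal(4)
cat_matvec = w[codes]
one_hot = np.eye(4)[codes]               # reference, explicit indicator columns

print(np.allclose(lazy, dense_result), np.allclose(cat_matvec, one_hot @ w))
```

In both cases the reference computation touches every entry of a dense matrix, while the sketch only touches the stored non-zeros or one integer code per row.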

Benchmarks

See here for detailed benchmarking.

API documentation

See here for detailed API documentation.

Comments

Poor performance on narrow sparse matrices.
I've been investigating problems where our MKL-based sparse matrices are massively underperforming scipy.sparse. For example:

   operation  storage           memory  time
0  matvec     scipy.sparse csc  0       0.00211215
1  matvec     quantcore.matrix  0       0.0266283

This is a matrix with 3e6 rows and 3 columns

It seems like having a small number of columns makes MKL perform quite poorly. I'm not sure why that's the case. But, it may be worth having a check and just falling back to scipy.sparse in narrow cases like this. This kind of narrow case may actually be the dominant use case for sparse matrices because they will be a small component of a SplitMatrix.
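A rough way to reproduce the scipy.sparse side of this comparison is sketched below. It only times SciPy's CSC matvec and does not exercise the MKL-based path. The function name time_narrow_matvec, the density, and the default row count (smaller than the 3e6 rows above, to keep the run short) are assumptions.

```python
import timeit

import numpy as np
from scipy import sparse


def time_narrow_matvec(n_rows=300_000, n_cols=3, density=0.1, repeats=5):
    """Return the best wall-clock time in seconds of X @ v for a narrow CSC matrix."""
    X = sparse.random(n_rows, n_cols, density=density, format="csc", random_state=0)
    v = np.ones(n_cols)
    # Take the minimum over several runs to reduce noise from other processes.
    return min(timeit.repeat(lambda: X @ v, number=1, repeat=repeats))


elapsed = time_narrow_matvec()
print(f"scipy.sparse CSC matvec: {elapsed:.6f} s")
```

Passing n_rows=3_000_000 should bring the setup closer to the case described here, at the cost of a longer run.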
help wanted
opened by tbenthompson 11
Swap n_rows with n_cols in matvec

This might fix https://github.com/Quantco/quantcore.glm/issues/323. I think we were passing in the number of rows into matvec when we mean to pass in the number of columns. But maybe I'm misunderstanding what's going on.

The function signature for matvec is

https://github.com/Quantco/quantcore.matrix/blob/9ef54c6cb21e8d8063c0968fe47c300b79d3af4b/src/quantcore/matrix/ext/categorical.pyx#L61-L62

but before we were passing in the number of rows as last argument.

opened by jtilly 6
Build script in PyPI source version uses default `jemalloc`
I see the build script for linux uses jemalloc with disable-tls: "./autogen.sh --disable-cxx --with-jemalloc-prefix=local --with-install-suffix=local --disable-tls --disable-initial-exec-tls",

However, the source distribution in PyPI doesn't run that script when installing it through pip, relying instead on whatever jemalloc it finds when it tries to compile. If, for example, one tries to install tabmat from source through pip, it will later on fail to import, complaining about an error with jemalloc:

cannot allocate memory in static TLS block
opened by david-cortes 5

BUG: cannot allocate memory in static TLS block when installing through pip

The installation via conda-forge was getting stuck in the "Solving environment" part, so I tried to install with pip, given that the package is available on PyPI. pip install glum runs in seconds, but then I am unable to import anything from it, with the following error:

In [1]: from glum import GeneralizedLinearRegressor
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-0284693fe484> in <module>
----> 1 from glum import GeneralizedLinearRegressor

~/anaconda3/lib/python3.9/site-packages/glum/__init__.py in <module>
      1 import pkg_resources
      2 
----> 3 from ._distribution import TweedieDistribution
      4 from ._glm import GeneralizedLinearRegressor
      5 from ._glm_cv import GeneralizedLinearRegressorCV

~/anaconda3/lib/python3.9/site-packages/glum/_distribution.py in <module>
      6 import numpy as np
      7 from scipy import sparse, special
----> 8 from tabmat import MatrixBase, StandardizedMatrix
      9 
     10 from ._functions import (

~/anaconda3/lib/python3.9/site-packages/tabmat/__init__.py in <module>
----> 1 from .categorical_matrix import CategoricalMatrix
      2 from .constructor import from_csc, from_pandas
      3 from .dense_matrix import DenseMatrix
      4 from .matrix_base import MatrixBase
      5 from .sparse_matrix import SparseMatrix

~/anaconda3/lib/python3.9/site-packages/tabmat/categorical_matrix.py in <module>
    171 from .ext.split import sandwich_cat_cat, sandwich_cat_dense
    172 from .matrix_base import MatrixBase
--> 173 from .sparse_matrix import SparseMatrix
    174 from .util import (
    175     check_matvec_out_shape,

~/anaconda3/lib/python3.9/site-packages/tabmat/sparse_matrix.py in <module>
      4 from scipy import sparse as sps
      5 
----> 6 from .ext.sparse import (
      7     csc_rmatvec,
      8     csc_rmatvec_unrestricted,

ImportError: /home/mathurin/anaconda3/lib/python3.9/site-packages/tabmat/ext/../../tabmat.libs/libjemalloclocal-691a3dac.so.2: cannot allocate memory in static TLS block

Googling did not help. Is there a way to make the pip-installed version work?

opened by mathurinm 5

Improvements to SplitMatrix
Allow SplitMatrix to be constructed from another SplitMatrix.

Allow inputs of SplitMatrix to be 1-d

Implement __getitem__ for column subset

Also had to implement column subsetting for CategoricalMatrix

__repr__ uses the __repr__ method of components instead of str()

ToDo:

[ ] FIX BUG WITH _split_col_subsets (first confirm that it's a bug)

[ ] Add testing for new features

Checklist

[ ] Added a CHANGELOG.rst entry
opened by MarcAntoineSchmidtQC 5
Enable dropping one column from a CategoricalMatrix?

Currently, CategoricalMatrix does not provide an easy way to drop a column. We are required to include a category for every row in the dataset, but in an unregularized setting, it is nice to sometimes drop one column.

Something sort of like this is already implemented by the cols parameter to the matrix vector and sandwich functions.
question on hold

opened by tbenthompson 5

BUG: segfault when fitting a GeneralizedLinearRegressor

Requirements: pip install libsvmdata

The following script gives me a segfault:

from libsvmdata import fetch_libsvm
from glum import GeneralizedLinearRegressor

X, y = fetch_libsvm("rcv1.binary")
clf = GeneralizedLinearRegressor(alpha=0.01, fit_intercept=False,
                                 family="gaussian")
clf.fit(X, y)

Output:

In [1]: %run glum_segfault.py
Dataset: rcv1.binary
[1]    271745 segmentation fault (core dumped)  ipython

I'm using glum 2.0.3

@qb3

opened by mathurinm 4

Bump pypa/cibuildwheel from 2.2.2 to 2.3.0
Bumps pypa/cibuildwheel from 2.2.2 to 2.3.0.

Release notes

Sourced from pypa/cibuildwheel's releases.

v2.3.0

📈 cibuildwheel now defaults to manylinux2014 image for linux builds, rather than manylinux2010. If you want to stick with manylinux2010, it's simple to set this using the image options. (#926)

✨ You can now pass environment variables from the host machine into the Docker container during a Linux build. Check out the docs for CIBW_ENVIRONMENT_PASS_LINUX for the details. (#914)

✨ Added support for building PyPy 3.8 wheels. (#881)

✨ Added support for building Windows arm64 CPython wheels on a Windows arm64 runner. We can't test this in CI yet, so for now, this is experimental. (#920)

📚 Improved the deployment documentation (#911)

🛠 Changed the escaping behaviour inside cibuildwheel's option placeholders e.g. {project} in before_build or {dest_dir} in repair_wheel_command. This allows bash syntax like ${SOME_VAR} to passthrough without being interpreted as a placeholder by cibuildwheel. See this section in the docs for more info. (#889)

🛠 Pip updated to 21.3, meaning it now defaults to in-tree builds again. If this causes an issue with your project, setting environment variable PIP_USE_DEPRECATED=out-of-tree-build is available as a temporary flag to restore the old behaviour. However, be aware that this flag will probably be removed soon. (#881)

🐛 You can now access the current Python interpreter using python3 within a build on Windows (#917)

Changelog

Sourced from pypa/cibuildwheel's changelog.

v2.3.0

26 November 2021

Commits

f717468 Bump version: v2.3.0

7b223b4 Merge pull request #938 from pypa/henryiii-patch-2

716b784 Update dependencies (#929)

a02eeaa Update README.md

e519566 Update README.md

c33639c docs: note about musllinux mess

594d89f add support for building wheels for windows on arm64 (#920)

59b157f Merge pull request #926 from mattip/manylinux2014

6ef2ec3 docs: add psycopg 3 to the projects page (#932)

6c2e1cc Merge pull request #928 from ofek/patch-1

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 4

Use a namespaced version of `jemalloc`

We are currently observing issues when using quantcore.matrix in conjunction with onnx and onnxruntime on MacOS. The call to python -c 'import onnx; import quantcore.matrix.ext.dense; import onnxruntime' fails with a bus error or segfault whereas the call DYLD_INSERT_LIBRARIES=$CONDA_PREFIX/lib/libjemalloc.dylib python -c 'import onnx; import quantcore.matrix.ext.dense; import onnxruntime' passes just fine. This indicates that using an unnamespaced jemalloc may be problematic here as the following traceback indicates:

collecting ... Process 6259 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x4efffffff7)
    frame #0: 0x000000013c0f3704 libjemalloc.2.dylib`je_free_default + 240
libjemalloc.2.dylib`je_free_default:
->  0x13c0f3704 <+240>: str    x20, [x8, w9, sxtw #3]
    0x13c0f3708 <+244>: ldr    w8, [x19, #0x200]
    0x13c0f370c <+248>: sub    w9, w8, #0x1              ; =0x1
    0x13c0f3710 <+252>: str    w9, [x19, #0x200]
Target 0: (python) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x4efffffff7)
  * frame #0: 0x000000013c0f3704 libjemalloc.2.dylib`je_free_default + 240
    frame #1: 0x0000000142745010 onnxruntime_pybind11_state.so`std::__1::__hash_table<std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > >, std::__1::__unordered_map_hasher<std::__1::type_index, std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > >, pybind11::detail::type_hash, pybind11::detail::type_equal_to, true>, std::__1::__unordered_map_equal<std::__1::type_index, std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > >, pybind11::detail::type_equal_to, pybind11::detail::type_hash, true>, std::__1::allocator<std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > > > >::__rehash(unsigned long) + 76
    frame #2: 0x0000000142744dd0 onnxruntime_pybind11_state.so`std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > >, void*>*>, bool> std::__1::__hash_table<std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > >, std::__1::__unordered_map_hasher<std::__1::type_index, std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > >, pybind11::detail::type_hash, pybind11::detail::type_equal_to, true>, std::__1::__unordered_map_equal<std::__1::type_index, std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > >, pybind11::detail::type_equal_to, pybind11::detail::type_hash, true>, std::__1::allocator<std::__1::__hash_value_type<std::__1::type_index, std::__1::vector<bool (*)(_object*, void*&), std::__1::allocator<bool (*)(_object*, void*&)> > > > >::__emplace_unique_key_args<std::__1::type_index, std::__1::piecewise_construct_t const&, std::__1::tuple<std::__1::type_index const&>, std::__1::tuple<> >(std::__1::type_index const&, std::__1::piecewise_construct_t const&, std::__1::tuple<std::__1::type_index const&>&&, std::__1::tuple<>&&) + 480
    frame #3: 0x00000001427427dc onnxruntime_pybind11_state.so`pybind11::detail::generic_type::initialize(pybind11::detail::type_record const&) + 396
    frame #4: 0x0000000142751688 onnxruntime_pybind11_state.so`pybind11::class_<onnxruntime::ExecutionOrder>::class_<>(pybind11::handle, char const*) + 140
    frame #5: 0x00000001427513f8 onnxruntime_pybind11_state.so`pybind11::enum_<onnxruntime::ExecutionOrder>::enum_<>(pybind11::handle const&, char const*) + 52
    frame #6: 0x000000014272b5c8 onnxruntime_pybind11_state.so`onnxruntime::python::addObjectMethods(pybind11::module_&, onnxruntime::Environment&) + 296
    frame #7: 0x0000000142734e68 onnxruntime_pybind11_state.so`PyInit_onnxruntime_pybind11_state + 340
    frame #8: 0x000000010019f994 python`_imp_create_dynamic + 2412
    frame #9: 0x00000001000b40f8 python`cfunction_vectorcall_FASTCALL + 208
    frame #10: 0x000000010016bfd8 python`_PyEval_EvalFrameDefault + 30088

My suggestion would be to add an output to the jemalloc-feedstock as described in https://github.com/conda-forge/jemalloc-feedstock/issues/23 that comes with a prefixed version of the library.

opened by xhochy 4

Bump google-github-actions/setup-gcloud from 0.2.0 to 0.2.1
Bumps google-github-actions/setup-gcloud from 0.2.0 to 0.2.1.

Release notes

Sourced from google-github-actions/setup-gcloud's releases.

setup-gcloud v0.2.1

Bug Fixes

Update action names (#250) (#251) (95e2d15)

Changelog

Sourced from google-github-actions/setup-gcloud's changelog.

0.2.1 (2021-02-12)

Commits

daadedc chore: release 0.2.1 (#252)

2eacbe9 chore: Create CODEOWNERS (#268)

3054e12 doc: update example workflows (#255)

99f25f1 fix: point users to specific example workflow (#254)

95e2d15 fix: Update action names (#250) (#251)

4a93bb3 correct typo in readme (#247)

b1f8896 fix: add credentials_file_path input to root action.yaml (#246)

aceed99 fix: major/minor tags (#242)

0975df4 docs: update README (#241)

dc4636f fix: action.yml description length (#237)

See full diff in compare view

Dependabot will merge this PR once CI passes on it, as requested by @xhochy.

dependencies
opened by dependabot[bot] 4
Update linter
Updating the flake8 config to match the new flake8 config from glm_benchmarks.

Changes:

changed linter according to the issue

added simple docstrings to public functions (most functions were in the main matrix classes)

preceded function names with underscores if the functions were only being used internally

added “no docstrings in magic function” flake8 error to list of ignores (didn’t seem helpful)

added # noqa in places where flake8 errors were just creating issues in unhelpful places

Closes #45

Checklist

[ ] Added a CHANGELOG.rst entry
opened by MargueriteBastaQC 4
Bump pypa/cibuildwheel from 2.11.3 to 2.11.4
Bumps pypa/cibuildwheel from 2.11.3 to 2.11.4.

Release notes

Sourced from pypa/cibuildwheel's releases.

v2.11.4

🐛 Fix a bug that caused missing wheels on Windows when a test was skipped using CIBW_TEST_SKIP (#1377)

🛠 Updates CPython 3.11 to 3.11.1 (#1371)

🛠 Updates PyPy to 7.3.10, except on macOS which remains on 7.3.9 due to a bug. (#1371)

📚 Added a reference to abi3audit to the docs (#1347)

Changelog

Sourced from pypa/cibuildwheel's changelog.

v2.11.4

24 Dec 2022

🐛 Fix a bug that caused missing wheels on Windows when a test was skipped using CIBW_TEST_SKIP (#1377)

🛠 Updates CPython 3.11 to 3.11.1 (#1371)

🛠 Updates PyPy to 7.3.10, except on macOS which remains on 7.3.9 due to a bug on that platform. (#1371)

📚 Added a reference to abi3audit to the docs (#1347)

Commits

27fc88e Bump version: v2.11.4

a7e9ece Merge pull request #1371 from pypa/update-dependencies-pr

b9a3ed8 Update cibuildwheel/resources/build-platforms.toml

3dcc2ff fix: not skipping the tests stops the copy (Windows ARM) (#1377)

1c9ec76 Merge pull request #1378 from pypa/henryiii-patch-3

22b433d Merge pull request #1379 from pypa/pre-commit-ci-update-config

98fdf8c [pre-commit.ci] pre-commit autoupdate

cefc5a5 Update dependencies

e53253d ci: move to ubuntu 20

e9ecc65 [pre-commit.ci] pre-commit autoupdate (#1374)

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 0

Cannot sandwich SplitMatrix with non-owned array

This throws an error:

import numpy as np
import tabmat
from scipy.sparse import csc_matrix

rng = np.random.default_rng(seed=123)
X = rng.standard_normal(size=(100,20))
Xd = tabmat.DenseMatrix(X[:,:10])
Xs = tabmat.SparseMatrix(csc_matrix(X[:,10:]))
Xm = tabmat.SplitMatrix([Xd, Xs])
Xm.sandwich(np.ones(X.shape[0]))

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-2-91ba52e4f568> in <module>
      8 Xs = tabmat.SparseMatrix(csc_matrix(X[:,10:]))
      9 Xm = tabmat.SplitMatrix([Xd, Xs])
---> 10 Xm.sandwich(np.ones(X.shape[0]))

~/anaconda3/envs/py39/lib/python3.9/site-packages/tabmat/split_matrix.py in sandwich(self, d, rows, cols)
    287             idx_i = subset_cols_indices[i]
    288             mat_i = self.matrices[i]
--> 289             res = mat_i.sandwich(d, rows, subset_cols[i])
    290             if isinstance(res, sps.dia_matrix):
    291                 out[(idx_i, idx_i)] += np.squeeze(res.data)

~/anaconda3/envs/py39/lib/python3.9/site-packages/tabmat/dense_matrix.py in sandwich(self, d, rows, cols)
     62         d = np.asarray(d)
     63         rows, cols = setup_restrictions(self.shape, rows, cols)
---> 64         return dense_sandwich(self, d, rows, cols)
     65 
     66     def _cross_sandwich(

src/tabmat/ext/dense.pyx in tabmat.ext.dense.dense_sandwich()

Exception:

Compare against this:

Xd = tabmat.DenseMatrix(X[:,:10].copy())
Xs = tabmat.SparseMatrix(csc_matrix(X[:,10:]))
Xm = tabmat.SplitMatrix([Xd, Xs])
Xm.sandwich(np.ones(X.shape[0]))

(No error)
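The difference between the two snippets can be inspected with NumPy's array flags. The sketch below uses only NumPy and makes no claim about what dense_sandwich checks internally. It shows that the slice is a non-contiguous view that does not own its data, while the copy is contiguous and owns it.

```python
import numpy as np

rng = np.random.default_rng(seed=123)
X = rng.standard_normal(size=(100, 20))

view = X[:, :10]           # column slice: shares memory with X
copy = X[:, :10].copy()    # independent array with its own buffer

view_flags = (view.flags["OWNDATA"], view.flags["C_CONTIGUOUS"])
copy_flags = (copy.flags["OWNDATA"], copy.flags["C_CONTIGUOUS"])
print(view_flags, copy_flags)

# np.ascontiguousarray is an alternative to .copy() that only copies when needed.
fixed = np.ascontiguousarray(view)
print(fixed.flags["C_CONTIGUOUS"], np.shares_memory(fixed, X))
```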

opened by david-cortes 0

tabmat has no attribute version

I find it convenient to be able to check directly inside a python shell.

In [1]: import tabmat

In [2]: tabmat.__version__
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-aae82a909ca3> in <module>
----> 1 tabmat.__version__

AttributeError: module 'tabmat' has no attribute '__version__'

this is tabmat 3.0.7 installed via PyPI
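As a possible workaround until the attribute exists, the installed version can be read from the package metadata with the standard library (Python 3.8+). The helper name installed_version is an assumption for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if tabmat is installed, otherwise None.
print(installed_version("tabmat"))
```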

opened by mathurinm 3

Support initializing matrices with Patsy?
I think we've discussed this, but I don't remember the conclusion and can't find an issue now.

We recommend from_pandas as the way "most users" should construct tabmat objects. from_pandas then guesses which columns should be treated as categorical. I think it would be really nice to have Patsy-like formulas as an alternative, since

R users (including many economists) like using formulas, and

It's easy to infer from a Patsy formula which columns are categorical, which are sparse (generally interactions with categoricals), and which are dense (everything else), so this could remove some of the guesswork from tabmat and improve performance.

I'm not sure how feasible this would be, since Patsy is a sizable library that allows for fairly sophisticated formulas and it would be quite an endeavor to replicate all of the functionality. A few ways of doing this would be

1. Don't change any code, but document how Patsy can already be used to construct a dataframe that can then be passed to tabmat / glum. Warn that this involves creating a large dense matrix as an intermediate. See Twitter discussion: https://twitter.com/esantorella22/status/1447980727820296198

2. Have tabmat call patsy.dmatrix with "return_type = 'dataframe'", then call tabmat.from_pandas on the resulting pd.DataFrame. That would not be any more efficient than (1), but would just save the user a little typing and the need to install patsy. On the down side, it adds a dependency and may force creation of a very large dense matrix.

3. Support very simple patsy-like formulas without having patsy as a dependency or reproducing its full functionality. That would allow the user to designate which columns should be treated as categorical in a more natural way. See Twitter discussion: https://twitter.com/esantorella22/status/1447981081358184461

4. Make it so that any Patsy formula can be used to create a tabmat object -- I'm not sure how. Might be hard.
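To give a sense of what the "very simple patsy-like formulas" option could look like, here is a hypothetical sketch in pure Python. It is not part of tabmat or Patsy. The function parse_simple_formula and the restriction to additive terms of the form name or C(name) are assumptions made for illustration.

```python
import re


def parse_simple_formula(formula):
    """Split 'y ~ a + C(b) + c' into (response, numeric_terms, categorical_terms)."""
    response, rhs = (part.strip() for part in formula.split("~", 1))
    numeric, categorical = [], []
    for term in (t.strip() for t in rhs.split("+")):
        match = re.fullmatch(r"C\((\w+)\)", term)
        if match:
            categorical.append(match.group(1))   # C(name) marks a categorical
        elif re.fullmatch(r"\w+", term):
            numeric.append(term)                 # bare name: numeric column
        else:
            raise ValueError(f"unsupported term: {term!r}")
    return response, numeric, categorical


result = parse_simple_formula("claims ~ age + C(region) + exposure")
print(result)
```

The list of categorical names could then be used to mark those columns explicitly before constructing the matrix, rather than leaving that to guesswork. Interactions and transformations are deliberately out of scope here.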
opened by esantorella 2

Releases(3.1.2)

3.1.2(Jul 1, 2022)
3.1.2 - 2022-07-01

Other changes:

Next attempt to build wheel for PyPI without --march=native.

3.1.1(Jul 1, 2022)
3.1.1 - 2022-07-01

Other changes:

Add Python 3.10 support to CI (remove Python 3.6).

We are now building the wheel for PyPI without --march=native to make it more portable across architectures.

3.1.0(Mar 7, 2022)
New features

tabmat.CategoricalMatrix now accepts a drop_first argument. This allows the user to drop the first column of a CategoricalMatrix to avoid multicollinearity problems in unregularized models.

tabmat.StandardizedMatrix and tabmat.MatrixBase now support the multiply method.

3.0.8(Jan 3, 2022)
Bug fix

Always use 64bit integers for indexing in tabmat.ext.sparse.sparse_sandwich to avoid segmentation faults on very wide problems.
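The release note does not spell out the mechanism, but one plausible illustration of why 32-bit indices are risky on very wide problems is integer wraparound when computing positions in a large output. The sketch below uses NumPy only. The column count and the flat-index formula are assumptions, not taken from sparse_sandwich.

```python
import numpy as np

n_cols = 50_000
row = np.array([n_cols - 1], dtype=np.int32)

# Flat position of the last entry of an n_cols-by-n_cols result.
# With 32-bit arithmetic the product exceeds 2**31 - 1 and wraps silently.
flat32 = row * np.int32(n_cols) + np.int32(n_cols - 1)
flat64 = row.astype(np.int64) * n_cols + (n_cols - 1)

print(flat32[0], flat64[0])
```

A negative or otherwise wrong index used for memory access in compiled code can lead to exactly the kind of segmentation fault described here.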

3.0.7(Nov 23, 2021)
Bug fix

Disable the use of static TLS in the Linux wheels to avoid issues with too small TLS on some distributions.

3.0.6(Nov 12, 2021)
Bug fix

We fixed a bug in SplitMatrix.matvec, where incorrect matrix vector products were computed when a SplitMatrix did not contain any dense components.

3.0.5(Nov 5, 2021)
Other changes

We are now specifying the run time dependencies in setup.py, so that missing dependencies are automatically installed from PyPI when installing tabmat via pip.

3.0.4(Nov 3, 2021)
3.0.4 - 2021-11-03

Other changes

tabmat is now available on PyPI and will be automatically updated when a new release is published.

3.0.3(Oct 15, 2021)
Bug fix

We now support xsimd>=8 and support alternative jemalloc installations.

3.0.2(Oct 15, 2021)
Bug fix

Allow linking to an alternatively suffixed jemalloc installation to work around #113.

3.0.1(Oct 8, 2021)
3.0.1 - 2021-10-07

Bug fix

The license was mistakenly left as proprietary. Corrected to BSD-3-Clause.

Other changes

ReadTheDocs integration.

CONTRIBUTING.md

Correct pyproject.toml to work with PEP-517

3.0.0(Oct 7, 2021)
3.0.0 - 2021-10-07

It's public! Yay!

Breaking changes:

The package has been renamed to tabmat. CELEBRATE!

The one_over_var_inf_to_val function has been made private.

The csc_to_split function has been re-named to tabmat.from_csc to match the tabmat.from_pandas function.

The tabmat.MatrixBase.get_col_means and tabmat.MatrixBase.get_col_stds methods have been made private.

The cross_sandwich method has also been made private.

Bug fixes:

StandardizedMatrix.transpose_matvec was giving the wrong answer when the out parameter was provided. This is now fixed.
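For context, a transpose matrix-vector product on standardized columns can be computed from the unstandardized matrix, which is what makes it possible to keep a sparse matrix sparse. The NumPy sketch below only illustrates that identity under the usual definition (subtract the column mean, divide by the column standard deviation). It is not tabmat's code and does not model the out parameter.

```python
import numpy as np

# Small, mostly-zero matrix standing in for sparse data.
X = np.array([
    [0.0, 1.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
])
v = np.array([1.0, -2.0, 0.5, 3.0, -1.0, 2.0])

col_means = X.mean(axis=0)
col_stds = X.std(axis=0)

# Explicit route: materialize the standardized (dense) matrix.
X_std = (X - col_means) / col_stds
explicit = X_std.T @ v

# Lazy route: work with X directly and correct for the shift and scale.
# (X_std.T @ v)[j] = ((X.T @ v)[j] - mean[j] * sum(v)) / std[j]
lazy = (X.T @ v - col_means * v.sum()) / col_stds

print(np.allclose(explicit, lazy))  # True
```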

SplitMatrix.__repr__ now calls the __repr__ method of component matrices instead of __str__.

Other changes:

Optimized tabmat.SparseMatrix.matvec and tabmat.SparseMatrix.transpose_matvec for the case where rows and cols are None.

Implemented CategoricalMatrix.__rmul__

Reorganized the documentation and updated the text to match the current API.

Enable indexing the rows of a CategoricalMatrix. Previously CategoricalMatrix.__getitem__ only supported column indexing.

Allow creating a SplitMatrix from a list of any MatrixBase objects including another SplitMatrix.

Reduced memory usage in tabmat.SplitMatrix.matvec.

2.0.3(Jul 15, 2021)
Bug fix:

In SplitMatrix.sandwich, when a col subset was specified, incorrect output was produced if the components of the indices array were not sorted. SplitMatrix.__init__ now checks for sorted indices and maintains sorted index lists when combining matrices.
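For reference, a sandwich product with a column subset, and its assembly from column blocks, can be sketched in NumPy as follows. This is an illustration only, not tabmat's implementation; the helper name sandwich and the block layout are assumptions. It shows that each block of the result has to be written to the positions given by its index arrays, which is the kind of bookkeeping that goes wrong when indices are handled inconsistently.

```python
import numpy as np

def sandwich(X, d, cols=None):
    """Compute X[:, cols].T @ diag(d) @ X[:, cols] without forming diag(d)."""
    if cols is not None:
        X = X[:, cols]
    # Scaling the rows of X by d is equivalent to multiplying by diag(d).
    return X.T @ (d[:, None] * X)

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 4))
d = rng.uniform(0.5, 1.5, size=8)

full = sandwich(X, d)

# A column subset gives the matching sub-block of the full product.
cols = np.array([0, 3])
restricted = sandwich(X, d, cols)
print(np.allclose(restricted, full[np.ix_(cols, cols)]))  # True

# Assembling the full product from two column blocks: every pair of
# blocks fills the output positions named by its index arrays.
indices = [np.array([0, 2]), np.array([1, 3])]
assembled = np.empty((4, 4))
for idx_a in indices:
    for idx_b in indices:
        block = X[:, idx_a].T @ (d[:, None] * X[:, idx_b])
        assembled[np.ix_(idx_a, idx_b)] = block
print(np.allclose(assembled, full))  # True
```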

Other changes:

SplitMatrix.__init__ now filters out any empty matrices.

StandardizedMatrix.sandwich passes rows=None and cols=None onwards to the underlying matrix instead of replacing them with full arrays of indices. This should improve performance slightly.

SplitMatrix.__repr__ now includes the type of the underlying matrix objects in the string output.

2.0.2(Jun 24, 2021)
Bug fix:

Sparse matrices now accept 64-bit indices on Windows.

2.0.1(Jun 20, 2021)
Bug fix:

Split matrices now also work on Windows.

2.0.0(Jun 17, 2021)
Breaking changes:

We renamed several public functions to make them private. These include functions in quantcore.matrix.benchmark that are unlikely to be used outside of this package as well as

quantcore.matrix.dense_matrix._matvec_helper

quantcore.matrix.sparse_matrix._matvec_helper

quantcore.matrix.split_matrix._prepare_out_array

Other changes:

We removed the dependency on sparse_dot_mkl. We now use scipy.sparse.csr_matvec instead of sparse_dot_mkl.dot_product_mkl on all platforms, because the latter suffered from poor performance, especially on narrow problems. This also means that we removed the function quantcore.matrix.sparse_matrix._dot_product_maybe_mkl.

We updated the pre-commit hooks and made sure the code is in line with the new hooks.
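As a rough illustration of the SciPy route, the sketch below multiplies a tall, narrow CSR matrix by a vector using SciPy's public @ operator. The matrix shape and density are arbitrary assumptions, and the sketch does not call the lower-level routine named above directly.

```python
import numpy as np
from scipy import sparse

# A tall, narrow sparse matrix: many rows, few columns.
A = sparse.random(1_000, 5, density=0.05, format="csr", random_state=0)
v = np.arange(5, dtype=float)

# Sparse matrix times dense vector; the result is a dense 1-D array.
result = A @ v

print(result.shape)  # (1000,)
```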

1.0.6(Apr 26, 2021)

Windows releases 🚀
1.0.5(Apr 26, 2021)

Fix Windows CI upload.
1.0.3(Apr 22, 2021)
Bug fixes:

Added a check in SplitMatrix.__init__ that matrices are two-dimensional.

Replace np.int with np.int64 where appropriate due to NumPy deprecation of np.int.
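A minimal sketch of this kind of replacement is shown below. np.int was only an alias for the built-in int; it was deprecated in NumPy 1.20 and removed in NumPy 1.24. The variable name is illustrative.

```python
import numpy as np

# Deprecated spelling (an alias for the built-in int):
#     indices = np.zeros(3, dtype=np.int)
# Explicit fixed-width replacement:
indices = np.zeros(3, dtype=np.int64)

print(indices.dtype)  # int64
```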

1.0.2(Apr 20, 2021)
Added Python 3.9 support.

Use scipy.sparse dot product when MKL isn't available.

1.0.1(Nov 25, 2020)
Bug fixes:

Handling for nulls when setting up a CategoricalMatrix

Fixes to make several functions work with both row and col restrictions and out

Other changes:

Added various tests and documentation improvements

1.0.0(Nov 11, 2020)
Breaking change:

Rename dot to matvec. Our dot function supports matrix-vector multiplication for every subclass, but only supports matrix-matrix multiplication for some. We therefore rename it to matvec in line with other libraries.
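The operation behind the name can be sketched in plain NumPy as a matrix-vector product, optionally restricted to a subset of columns. The function below is a hypothetical stand-in, not the library's method. In particular, the convention that the vector keeps its full length while cols selects the entries to use is an assumption of this sketch.

```python
import numpy as np

def matvec(X, v, cols=None):
    """Compute X[:, cols] @ v[cols]; with cols=None this is X @ v."""
    if cols is None:
        return X @ v
    return X[:, cols] @ v[cols]

X = np.arange(12, dtype=float).reshape(3, 4)
v = np.array([1.0, 0.0, -1.0, 2.0])

full = matvec(X, v)
active = matvec(X, v, cols=np.array([0, 3]))

print(full.tolist())    # [4.0, 12.0, 20.0]
print(active.tolist())  # [6.0, 18.0, 30.0]
```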

Bug fix:

Fix a bug in matvec for categorical components when the number of categories exceeds the number of rows.
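To make the affected case concrete, the sketch below computes a matrix-vector product for a one-hot encoded categorical column with more categories than rows. It is a NumPy illustration under assumed toy data, not the library's implementation. Each row of a one-hot matrix has a single 1, so the product reduces to looking up one entry of the vector per row.

```python
import numpy as np

# Three rows, five categories: more categories than rows.
codes = np.array([4, 0, 4])
n_categories = 5
v = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# The product with v picks out the entry of v for each row's category.
fast = v[codes]

# Reference computation with an explicit one-hot matrix.
one_hot = np.zeros((len(codes), n_categories))
one_hot[np.arange(len(codes)), codes] = 1.0
reference = one_hot @ v

print(fast.tolist())  # [50.0, 10.0, 50.0]
```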


Owner

QuantCo

GitHub https://tabmat.readthedocs.io/
