Select, weight and analyze complex sample data

Overview

Sample Analytics


In large-scale surveys, samples are often selected using complex random mechanisms, and estimates derived from such samples must reflect that random mechanism. Samplics is a Python package that implements a set of sampling techniques for complex survey designs. These survey sampling techniques are organized into the following four subpackages.

Sampling provides a set of random selection techniques used to draw a sample from a population. It also provides procedures for calculating sample sizes. The sampling subpackage contains:

  • Sample size calculation and allocation: Wald and Fleiss methods for proportions.
  • Equal probability of selection: simple random sampling (SRS) and systematic selection (SYS).
  • Probability proportional to size (PPS): Systematic, Brewer's method, Hanurav-Vijayan method, Murphy's method, and Rao-Sampford's method.
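For intuition, the equal-probability techniques listed above can be sketched with plain NumPy: simple random sampling draws units with equal probability, while systematic selection walks through the frame with a random start and a fixed step. The snippet below is only a conceptual sketch with made-up frame and sample sizes, not samplics' own implementation (the samplics selection API is shown further down).

import numpy as np

rng = np.random.default_rng(123)
frame_size, n = 1000, 50

# Simple random sampling (SRS) without replacement: every unit has probability n/N
srs_indices = rng.choice(frame_size, size=n, replace=False)

# Systematic selection (SYS): a random start followed by a fixed step through the frame
step = frame_size / n
start = rng.uniform(0, step)
sys_indices = np.floor(start + step * np.arange(n)).astype(int)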

Weighting provides the procedures for adjusting sample weights. More specifically, the weighting subpackage allows the following:

  • Weight adjustment due to nonresponse
  • Weight poststratification, calibration, and normalization
  • Weight replication, i.e. Bootstrap, BRR, and Jackknife

Estimation provides methods for estimating the parameters of interest with uncertainty measures that are consistent with the sampling design. The estimation subpackage implements the following types of estimation methods:

  • Taylor-based, also called linearization methods
  • Replication-based estimation, i.e. Bootstrap, BRR, and Jackknife
  • Regression-based, e.g. generalized regression (GREG)

Small Area Estimation (SAE). When the sample size is not large enough to produce reliable and stable domain-level estimates, SAE techniques can be used to model the variable of interest and produce domain-level estimates. This subpackage provides area-level and unit-level SAE methods.

For more details, visit https://samplics.readthedocs.io/en/latest/

Usage

Let's assume that we have a population and we would like to select a sample from it. The goal is to calculate the sample size for an expected proportion of 0.80 with a desired precision (half-width of the confidence interval) of 0.10.

from samplics.sampling import SampleSize

sample_size = SampleSize(parameter="proportion")
sample_size.calculate(target=0.80, half_ci=0.10)
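Under the hood, the Wald approach reduces to the classical formula n = deff · z² · p(1 − p) / e², where p is the expected proportion, e the desired precision, and z the normal quantile for the chosen confidence level. The snippet below is a minimal sketch of that arithmetic using SciPy; it illustrates the formula rather than samplics' exact rounding and defaults.

import math
from scipy.stats import norm

p, e, deff = 0.80, 0.10, 1.0   # expected proportion, half CI width, design effect
z = norm.ppf(1 - 0.05 / 2)     # 95% confidence level
n = math.ceil(deff * z**2 * p * (1 - p) / e**2)
print(n)  # roughly 62 for these inputs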

Furthermore, the population is located in four natural regions, i.e. North, South, East, and West. We may want to calculate sample sizes based on region-specific requirements, e.g. expected proportions, desired precisions, and associated design effects.

from samplics.sampling import SampleSize

sample_size = SampleSize(parameter="proportion", method="wald", stratification=True)

expected_proportions = {"North": 0.95, "South": 0.70, "East": 0.30, "West": 0.50}
half_ci = {"North": 0.30, "South": 0.10, "East": 0.15, "West": 0.10}
deff = {"North": 1, "South": 1.5, "East": 2.5, "West": 2.0}

sample_size = SampleSize(parameter = "proportion", method="Fleiss", stratification=True)
sample_size.calculate(target=expected_proportions, half_ci=half_ci, deff=deff)

To select a sample of primary sampling units using a PPS method, we can use code similar to the snippet below. Note that we first use the datasets module to import the example dataset.

# First we import the example dataset
from samplics.datasets import load_psu_frame
psu_frame_dict = load_psu_frame()
psu_frame = psu_frame_dict["data"]

# Code for the sample selection
from samplics.sampling import SampleSelection

psu_sample_size = {"East":3, "West": 2, "North": 2, "South": 3}
pps_design = SampleSelection(
   method="pps-sys",
   stratification=True,
   with_replacement=False
   )

psu_frame["psu_prob"] = pps_design.inclusion_probs(
   psu_frame["cluster"],
   psu_sample_size,
   psu_frame["region"],
   psu_frame["number_households_census"]
   )
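For intuition, systematic PPS selection within one stratum amounts to walking through the cumulated measures of size with a random start and a fixed step: PSUs with larger sizes are more likely to contain a selection point. The NumPy snippet below is a conceptual sketch with made-up sizes, not the samplics implementation.

import numpy as np

rng = np.random.default_rng(42)
mos = np.array([120, 450, 80, 300, 220, 510, 95, 260])  # measure of size per PSU
n = 3                                                    # PSUs to select in this stratum

cum_mos = np.cumsum(mos)
step = cum_mos[-1] / n
start = rng.uniform(0, step)
hits = start + step * np.arange(n)            # selection points along the cumulated sizes
selected = np.searchsorted(cum_mos, hits)     # PSU index containing each selection point
print(selected)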

The initial weighting step is to obtain the design sample weights. Below, we show a simple two-stage sampling design.

import pandas as pd

from samplics.datasets import load_psu_sample, load_ssu_sample
from samplics.weighting import SampleWeight

# Load PSU sample data
psu_sample_dict = load_psu_sample()
psu_sample = psu_sample_dict["data"]

# Load SSU sample data
ssu_sample_dict = load_ssu_sample()
ssu_sample = ssu_sample_dict["data"]

full_sample = pd.merge(
    psu_sample[["cluster", "region", "psu_prob"]],
    ssu_sample[["cluster", "household", "ssu_prob"]],
    on="cluster"
)

full_sample["inclusion_prob"] = full_sample["psu_prob"] * full_sample["ssu_prob"]
full_sample["design_weight"] = 1 / full_sample["inclusion_prob"]

To adjust the design sample weight for nonresponse, we can use code similar to:

import numpy as np

from samplics.weighting import SampleWeight

# Simulate response
np.random.seed(7)
full_sample["response_status"] = np.random.choice(
    ["ineligible", "respondent", "non-respondent", "unknown"],
    size=full_sample.shape[0],
    p=(0.10, 0.70, 0.15, 0.05),
)
# Map the generic samplics status codes to the custom response statuses
status_mapping = {
   "in": "ineligible",
   "rr": "respondent",
   "nr": "non-respondent",
   "uk":"unknown"
   }
# Adjust the sample weights for nonresponse
full_sample["nr_weight"] = SampleWeight().adjust(
   samp_weight=full_sample["design_weight"],
   adjust_class=full_sample["region"],
   resp_status=full_sample["response_status"],
   resp_dict=status_mapping
   )
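Conceptually, the nonresponse adjustment redistributes the weights of eligible non-respondents to respondents within each adjustment class so that class-level weight totals are preserved. The pandas sketch below illustrates that simplified idea; the nr_weight_rough column name is just for this illustration, and the finer handling of unknown-eligibility cases that samplics applies is ignored, so this is not the package's exact algorithm.

# Simplified illustration: within each region, inflate respondents' weights so they
# also carry the weight of the eligible non-respondents
eligible = full_sample["response_status"].isin(["respondent", "non-respondent"])
respondent = full_sample["response_status"] == "respondent"

class_totals = full_sample[eligible].groupby("region")["design_weight"].sum()
resp_totals = full_sample[respondent].groupby("region")["design_weight"].sum()
factors = class_totals / resp_totals

full_sample["nr_weight_rough"] = np.where(
    respondent,
    full_sample["design_weight"] * full_sample["region"].map(factors),
    0.0,
)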

To estimate population parameters using Taylor-based and replication-based methods, we can use code similar to:

# Taylor-based
from samplics.datasets import load_nhanes2

nhanes2_dict = load_nhanes2()
nhanes2 = nhanes2_dict["data"]

from samplics.estimation import TaylorEstimator

zinc_mean_str = TaylorEstimator("mean")
zinc_mean_str.estimate(
    y=nhanes2["zinc"],
    samp_weight=nhanes2["finalwgt"],
    stratum=nhanes2["stratid"],
    psu=nhanes2["psuid"],
    remove_nan=True,
)

# Replicate-based
from samplics.datasets import load_nhanes2brr

nhanes2brr_dict = load_nhanes2brr()
nhanes2brr = nhanes2brr_dict["data"]

from samplics.estimation import ReplicateEstimator

ratio_wgt_hgt = ReplicateEstimator("brr", "ratio").estimate(
    y=nhanes2brr["weight"],
    samp_weight=nhanes2brr["finalwgt"],
    x=nhanes2brr["height"],
    rep_weights=nhanes2brr.loc[:, "brr_1":"brr_32"],
    remove_nan=True,
)
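To see what the replication machinery is doing, the sketch below recomputes the weight/height ratio with each BRR replicate weight column and takes the variance as the average squared deviation from the full-sample estimate. It is a simplified illustration (without a Fay adjustment), not the exact samplics computation.

import numpy as np

# Keep only records with both variables observed, mirroring remove_nan=True above
nhanes2brr_cc = nhanes2brr.dropna(subset=["weight", "height"])

def weighted_ratio(wgt):
    return np.sum(wgt * nhanes2brr_cc["weight"]) / np.sum(wgt * nhanes2brr_cc["height"])

theta_full = weighted_ratio(nhanes2brr_cc["finalwgt"])
rep_cols = nhanes2brr_cc.loc[:, "brr_1":"brr_32"]
theta_reps = np.array([weighted_ratio(rep_cols[col]) for col in rep_cols.columns])

# BRR variance: average squared deviation of replicate estimates from the full estimate
brr_variance = np.mean((theta_reps - theta_full) ** 2)
brr_std_error = np.sqrt(brr_variance)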

To predict small area parameters, we can use code similar to:

import numpy as np
import pandas as pd

# Area-level basic method
from samplics.datasets import load_expenditure_milk

milk_exp_dict = load_expenditure_milk()
milk_exp = milk_exp_dict["data"]

from samplics.sae import EblupAreaModel

fh_model_reml = EblupAreaModel(method="REML")
fh_model_reml.fit(
    yhat=milk_exp["direct_est"],
    X=pd.get_dummies(milk_exp["major_area"], drop_first=True),
    area=milk_exp["small_area"],
    error_std=milk_exp["std_error"],
    intercept=True,
    tol=1e-8,
)
fh_model_reml.predict(
    X=pd.get_dummies(milk_exp["major_area"], drop_first=True),
    area=milk_exp["small_area"],
    intercept=True,
)

# Unit-level basic method
from samplics.datasets import load_county_crop, load_county_crop_means

# Load County Crop sample data
countycrop_dict = load_county_crop()
countycrop = countycrop_dict["data"]
# Load County Crop Area Means sample data
countycropmeans_dict = load_county_crop_means()
countycrop_means = countycropmeans_dict["data"]

from samplics.sae import EblupUnitModel

eblup_bhf_reml = EblupUnitModel()
eblup_bhf_reml.fit(
    countycrop["corn_area"],
    countycrop[["corn_pixel", "soybeans_pixel"]],
    countycrop["county_id"],
)
eblup_bhf_reml.predict(
    Xmean=countycrop_means[["ave_corn_pixel", "ave_soybeans_pixel"]],
    area=np.linspace(1, 12, 12),
)
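For intuition, the area-level (Fay-Herriot) EBLUP used above is a shrinkage estimator: each area's prediction combines its direct estimate and a regression-synthetic estimate with weight gamma = sigma_u^2 / (sigma_u^2 + sigma_e^2). The snippet below sketches that combination with made-up numbers; it is not how predictions are extracted from a fitted EblupAreaModel.

# Made-up inputs for a single small area
direct_est = 105.0    # direct survey estimate for the area
synthetic_est = 98.0  # regression-synthetic estimate (x_d' beta)
sigma2_u = 16.0       # estimated area-level (model) variance
sigma2_e = 25.0       # known sampling variance of the direct estimate

gamma = sigma2_u / (sigma2_u + sigma2_e)  # shrinkage factor in [0, 1]
eblup = gamma * direct_est + (1 - gamma) * synthetic_est
print(round(gamma, 2), round(eblup, 1))  # the more precise the direct estimate, the larger gamma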

Installation

pip install samplics

Python 3.7 or newer is required, and the main dependencies are numpy, pandas, scipy, and statsmodels.

Contribution

If you would like to contribute to the project, please read contributing to samplics.

License

MIT

Contact

Created by Mamadou S. Diallo. Feel free to contact me!

Comments
  • Estimation problem: 'division by 0'

    Hi. First of all, thanks! This is a very useful package. I have a problem, and I think it's caused by the sample size in a group and the estimation method: it returns a "division by 0" error. The relevant part is shown in the attached screenshot.

    I think it has something to do with the sample size in each stratum (see the attached screenshot). When I drop stratum 3, which has just 1 observation, it works fine.

    Thanks!

    opened by BArFinrod 7
  • Add datasets referenced in tutorial notebooks as part of the repo

    I was trying to follow the examples in the tutorials, and noticed that several of the datasets referenced weren't available. This made it tough to replicate the notebook outputs and follow along.

    For example, I wasn't able to find any of the following datasets when searching through the repo:

    • psu_frame.csv
    • expenditure_on_milk.csv
    • countycropareas.csv
    • countycropareas_means.csv
    • nhanes2f.csv
    • nmihs_bs.csv
    • nhanes2brr.csv
    • nhanes2jknife.csv

    Perhaps a nice way of making these available to users would be to create a datasets module like they do in the sklearn library (example):

    from samplics.datasets import load_psu_frame
    
    psu_frame = load_psu_frame()
    psu_frame.head(25)
    

    Thanks for all your hard work on this project!

    enhancement 
    opened by rchew 2
  • Documentation outdated in README and ReadTheDocs

    Followed instructions to pip install samplics (samplics==0.3.10) and then tried some of the example code snippets from the project README page. It seems like there are newly supported features that are not yet added to the version on pip.

    For example, the following throws TypeError: calculate() got an unexpected keyword argument 'precision'

    import samplics
    from samplics.sampling import SampleSize
    
    sample_size = SampleSize(parameter = "proportion")
    sample_size.calculate(target=0.80, precision=0.10)
    

    And the following references a local file ("psu_frame.csv") which users do not appear to have access to in the way presented, though it appears you can import the reference dataset using from samplics.datasets import load_psu_frame.

    import samplics
    from samplics.sampling import SampleSelection
    
    psu_frame = pd.read_csv("psu_frame.csv")
    psu_sample_size = {"East":3, "West": 2, "North": 2, "South": 3}
    pps_design = SampleSelection(
       method="pps-sys",
       stratification=True,
       with_replacement=False
       )
    
    frame["psu_prob"] = pps_design.inclusion_probs(
       psu_frame["cluster"],
       psu_sample_size,
       psu_frame["region"],
       psu_frame["number_households_census"]
       )
    

    As another general note, I would also add code to the examples to import all the libraries needed to run each code snippet. For example, in the snippet above, I would add import pandas as pd at the top if you're going to use pandas' read_csv function.

    opened by rchew 1
  • minor issues w/ writing

    Dear @MamadouSDiallo,

    To sign off on "Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?", I recommend you fix all the minor grammatical issues. One idea is to use Grammarly. Just take a close look.

    opened by soodoku 1
  • comm. guidelines

    Dear @MamadouSDiallo,

    I couldn't find the following:

    Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support
    

    Check other JOSS software to see what you can add to the README.

    opened by soodoku 1
  • testing

    Dear @MamadouSDiallo,

    From JOSS: it would be good to have some validation around correctness.

    One idea: confirm some outputs with Lumley's survey sampling package?

    opened by soodoku 1
  • fix use of some deprecated things

    sample_size = SampleSize(parameter = "proportion")
    sample_size.calculate(target=0.80, precision=0.10)
    
    FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
      import pandas.util.testing as tm
    
    opened by soodoku 1
  • LinAlgError: Singular matrix when running EBLUP Area Model

    Hi, I have a problem when running the EBLUP Area Model on my own dataset. It always fails with a Singular Matrix error, even though my auxiliary variables don't have strong collinearity. The error is shown in the attached screenshot.

    I have tried several combinations of auxiliary variables, but it always returns this error. Can you explain what causes it? Thanks

    opened by eki1381 0
  • Enhancements for sample estimation - response rate and sampling

    Currently there isn't a way to account for response rate in the stratified sampling estimate per stratum (the interface is there but not the functionality; I tested it during the AAPOR meeting). I'd like to account for response rate in the individual strata, and perhaps do this in a more flexible manner if possible. This would allow our team to use it for our sampling plan: we stratify by geographical region, and the response rates, for which we have historical data, differ by region.

    opened by quillan86 0
  • "cannot reshape array" error message with crosstabs containing 0-value cells (samplics 0.3.12 and 0.3.13)

    Hi, and thanks for this very useful library!

    I encountered a problem when trying to perform a crosstab analysis with it, and it took me some time to understand that the problem comes from contingency tables containing zeros.

    It's simpler to explain with an example, so consider the following made-up weighted dataset, with 0 individuals at the intersection "Man/Other nationality":

    import pandas
    from samplics.categorical import Tabulation, CrossTabulation
    df= pandas.DataFrame(data=
                            [["Woman", "European"]]*100 + \
                            [["Woman", "American"]]* 35 + \
                            [["Woman", "Other"]]*93 + \
                            [["Man", "European"]]*150 + \
                            [["Man", "American"]]*77,
                         columns=["Gender", "Nationality"])
    df["weights"] = [1, 0.3, 8, 3, 0.7]  * 91
    #Let's preview the data
    print(df.head(3).append(df.tail(3)))
    

        Gender Nationality  weights
    0    Woman    European      1.0
    1    Woman    European      0.3
    2    Woman    European      8.0
    ..     ...         ...      ...
    452    Man    American      8.0
    453    Man    American      3.0
    454    Man    American      0.7

    If you crosstab the data, you'll notice there are 0 men of "Other" nationality:

    pandas.crosstab(df["Nationality"],
                    df["Gender"])
    

    Gender       Man  Woman
    Nationality
    American      77     35
    European     150    100
    Other          0     93

    If I try to use samplics with it:

    crosstab_samplics = CrossTabulation("count")
    crosstab_samplics.tabulate(
        vars=df[["Gender", "Nationality"]],
        samp_weight=df["weights"],
        remove_nan=True,
    )
    

    it throws the following error:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Users\me\env\lib\site-packages\samplics\categorical\tabulation.py", line 430, in tabulate
        - cell_est.reshape(vars_levels.shape[0], 1)
    ValueError: cannot reshape array of size 5 into shape (6,1)
    
    

    The problem disappears if I slightly change the dataset so the crosstab no longer contains any zeros. I observed this problem with various datasets, so I'm almost certain it comes from these zero cells.

    Is it supposed to throw an error like that?

    I'm not certain if 1) the error simply comes from the fact that it's statistically incorrect to perform analysis with weights on crosstables containing zeros, or 2) if it's a case that the library doesn't take into account for the moment. If it's scenario 1), it might be useful to throw a more specific error message.

    Anyway, here is my configuration, if it can help:

    • Windows 10, Python 3.10.2 (tags/v3.10.2:a58ebcc), [MSC v.1929 64 bit (AMD64)]
    • the code is executed in a virtual environment
    • the following possibly relevant packages are installed:
      • numpy 1.22.3,
      • pandas 1.4.1,
      • statsmodels 0.13.2,
      • matplotlib 3.5.1,
      • scipy 1.8.0

    NB: for some reason, pip install samplics --upgrade won't upgrade samplics from 0.3.12 to 0.3.13, so I had to install the newest version directly from the repo with pip install git+https://github.com/samplics-org/samplics.git . But anyway, the "cannot reshape array" message I encountered occurs in both versions, 0.3.12 and 0.3.13.

    Thanks again for this library!

    opened by jeanbaptisteb 10
  • RuntimeWarning: divide by zero encountered in log   ll_term1 = np.log(np.linalg.det(V))

    Excuse me, I'm getting an error in the EBLUP calculation. The error message is as follows:

    RuntimeWarning: divide by zero encountered in log ll_term1 = np.log(np.linalg.det(V)).

    And this is my script:

    area = data_sae["id"]
    yhat = data_sae["cons"]
    X = data_sae[["ntl", "feat_1", "feat_2", "feat_3"]]
    sigma_e = data_sae["se"]

    fh_model_reml = EblupAreaModel(method="ML")
    fh_model_reml.fit(
        yhat=yhat,
        X=X,
        area=area,
        error_std=sigma_e,
        intercept=True,
        tol=1e-8,
    )

    This is a sample of my data:

                 id  centroid_lon  centroid_lat          cons        ntl    feat_1    feat_2    feat_3            se
    181  3404070003    110.399345     -7.755213  2.336299e+06  13.000000 -0.657062  0.256064 -0.718757  3.782321e+05
    68   3402080003    110.367601     -7.901088  7.120323e+05   3.570000 -0.365713  0.222525 -1.224635  8.464960e+04
    57   3402050003    110.322569     -7.928683  7.259114e+05   2.400000 -0.515885  0.689977 -1.196988  2.273737e-11

    opened by rifkigst 0
  • cluster sample size

    Add functionality to the SampleSize class to handle sample size calculation for clusters; for example, to get the sample size for enumeration areas (EAs) or households when the final sampling unit is the person.

    opened by MamadouSDiallo 0