Specification language for generating Generalized Linear Models (with or without mixed effects) from conceptual models

Overview

tisane

Tisane: Authoring Statistical Models via Formal Reasoning from Conceptual and Data Relationships

TL;DR: Analysts can use Tisane to author generalized linear models with or without mixed effects. Tisane infers statistical models from variable relationships (from domain knowledge) that analysts specify. By doing so, Tisane helps analysts avoid common threats to external and statistical conclusion validity. Analysts do not need to be statistical experts!

Jump to see a tutorial here or see some examples here. Below, we provide an overview of the API and language primitives.


Tisane provides (i) a graph specification language for expressing relationships between variables and (ii) an interactive query and compilation process for inferring a valid statistical model from a set of variables in the graph.

Graph specification language

Variables

There are three types of variables: (i) Units, (ii) Measures, and (iii) SetUp, or environmental, variables.

  • Unit types represent entities that are observed (observed units in the experimental design literature) or the recipients of experimental conditions (experimental units).
    # There are 386 adults participating in a study on weight loss.
    adult = ts.Unit("member", cardinality=386)
  • Measure types represent attributes of units that are proxies of underlying constructs. Measures can have one of the following data types: numeric, nominal, or ordinal. Numeric measures have values that lie on an interval or ratio scale. Nominal measures are categorical variables without an ordering between categories. Ordinal measures are categorical variables with an ordering between categories.
    # Adults have motivation levels.
    motivation_level = adult.ordinal("motivation", order=[1, 2, 3, 4, 5, 6])
    # Adults have pounds lost.
    pounds_lost = adult.numeric("pounds_lost")
    # Adults have one of four racial identities in this study.
    race = adult.nominal("race group", cardinality=4)
  • SetUp types represent study or experimental settings that are global and unrelated to any of the units involved. For example, time is often an environmental variable that differentiates repeated measures but is neither a unit nor a measure.
    # Researchers collected 12 weeks of data in this study.
    week = ts.SetUp("Week", order=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

Design rationale: We derived this type system from how other software tools focused on study design separate their concerns.

Relationships between variables

Analysts can use Tisane to express (i) conceptual and (ii) data measurement relationships between variables.

There are three different types of conceptual relationships.

  • A variable can cause another variable. (e.g., motivation_level.causes(pounds_lost))
  • A variable can be associated with another variable. (e.g., race.associates_with(pounds_lost))
  • One or more variables can moderate the effect of a variable on another variable. (e.g., age.moderates(moderator=[motivation_level], on=pounds_lost)) Currently, a variable, V1, can have a moderated relationship with a variable, V2, without also having a causal or associative relationship with V2.

Internally, Tisane constructs a graph representing these variables and their relationships with one another. The graph representation is useful for inferring statistical models (next section).

For example, the graph below represents the above relationships. Rectangular nodes are units. Elliptical nodes are measures and set-up variables. The colored node is the dependent variable in the query. The dotted edges connect units to their measures. The solid edges represent conceptual relationships, as labeled.

[Figure: A graph representation created using DOT]

[Figure: A graph representation created using TikZ]
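
As a rough illustration of the graph described above, the following sketch stores typed edges in plain Python. The class `RelationshipGraph`, its methods, and the edge-type strings are hypothetical names chosen for this example; they are not Tisane's internal classes.

```python
from collections import defaultdict


class RelationshipGraph:
    """Hypothetical stand-in for an internal graph; not Tisane's actual class."""

    def __init__(self):
        # Maps a source variable name to a list of (target, edge_type) pairs.
        self.edges = defaultdict(list)

    def add_edge(self, source, target, edge_type):
        self.edges[source].append((target, edge_type))

    def neighbors(self, source, edge_type):
        """Return targets reachable from `source` over one edge of `edge_type`."""
        return [t for t, kind in self.edges[source] if kind == edge_type]


graph = RelationshipGraph()
# Data measurement relationships: a unit "has" its measures (dotted edges).
for measure in ("motivation", "pounds_lost", "race group"):
    graph.add_edge("member", measure, "has")
# Conceptual relationships (solid edges).
graph.add_edge("motivation", "pounds_lost", "causes")
# An association has no direction, so this sketch stores it both ways.
graph.add_edge("race group", "pounds_lost", "associates_with")
graph.add_edge("pounds_lost", "race group", "associates_with")

print(graph.neighbors("member", "has"))
print(graph.neighbors("motivation", "causes"))
```

Keeping the edge type on every edge lets later steps ask targeted questions, such as "which variables cause the dependent variable?", without a separate structure per relationship.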

Interactive query and compilation

Analysts query the relationships they have specified (technically, the internal graph representation) for a statistical model. For each query, analysts must specify (i) a dependent variable to explain using (ii) a set of independent variables.

    design = ts.Design(dv=pounds_lost, ivs=[treatment_approach, motivation_level]).assign_data(df)
    ts.infer_statistical_model_from_design(design=design)

Query validation: To be a valid query, Tisane verifies that the dependent variable does not cause an independent variable. It would be conceptually incorrect to explain a cause from an effect.
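
The validation rule can be sketched as a reachability check over `causes` edges. This is a minimal sketch, not Tisane's implementation: the function names are hypothetical, and following chains of causes (rather than only direct edges) is an assumption, since the text states only that the dependent variable must not cause an independent variable.

```python
def reaches(causes, source, target):
    """Return True if `target` can be reached from `source` via causes edges."""
    stack = list(causes.get(source, ()))
    seen = set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(causes.get(node, ()))
    return False


def validate_query(causes, dv, ivs):
    """Raise ValueError if the dependent variable causes any independent variable."""
    offending = [iv for iv in ivs if reaches(causes, dv, iv)]
    if offending:
        raise ValueError(f"{dv!r} causes {offending}; cannot explain a cause from its effect")


# Hypothetical conceptual model: motivation causes pounds_lost.
causes = {"motivation": ["pounds_lost"]}

validate_query(causes, dv="pounds_lost", ivs=["motivation"])  # passes silently

try:
    validate_query(causes, dv="motivation", ivs=["pounds_lost"])
except ValueError as error:
    print("invalid query:", error)
```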

Interaction model

A key aspect of Tisane that distinguishes it from other systems, such as Tea, is the importance of user interaction in guiding the statistical model that is inferred as output and ultimately fit.

Tisane generates a space of candidate statistical models and asks analysts disambiguation questions for (i) including additional main or interaction effects and, if applicable, correlating (or uncorrelating) random slopes and random intercepts as well as (ii) selecting among viable family/link function pairs.

To help analysts, Tisane provides text explanations and visualizations. For example, to show possible family functions, Tisane simulates data to fit a family function, visualizes it on top of a histogram of the analyst's data, and explains to the analyst how to use the visualization to compare family functions.

Statistical model inference

After validating a query, Tisane traverses the internal graph representation in order to generate candidate generalized linear models with or without mixed effects. A generalized linear model consists of a model effects structure and a family/link function pair.

Model effects structure

Tisane generates candidate main effects, interaction effects, and, if applicable, random effects based on analysts' expressed relationships.

  • Tisane aims to direct analysts' attention to variables, especially possible confounders, that the analyst may have overlooked. When generating main effects candidates, Tisane looks for other variables in the graph that may exert causal influence on the dependent variable and are related to the input independent variables.
  • Tisane aims to represent conceptual relationships between variables accurately. Based on the main effects analysts choose to include in their output statistical model, Tisane suggests interaction effects to include. Tisane relies on the moderate relationships analysts specified in their input program to infer interaction effects.
  • Tisane aims to increase the generalizability of statistical analyses and results by automatically detecting the need for and including random effects. Tisane follows the guidelines outlined in [] and [] to generate the maximal random effects structure.

INFERENCE.md explains all inference rules in greater detail.
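
To make the first two bullets above concrete, here is a simplified sketch of how main-effect and interaction candidates could be derived from expressed relationships. It is one possible reading of the prose, not the rules in INFERENCE.md: the function names are hypothetical, "exerts causal influence" is taken to mean a direct `causes` edge, and the edge from `age` to `motivation` is an assumption made for the example. Random effects are omitted.

```python
def related(causes, associations, a, b):
    """True if a and b share a causes edge (either direction) or an association."""
    return (
        b in causes.get(a, ())
        or a in causes.get(b, ())
        or frozenset((a, b)) in associations
    )


def candidate_main_effects(causes, associations, dv, ivs):
    """Variables outside the query that cause `dv` and relate to some IV."""
    candidates = []
    for variable, effects in causes.items():
        if variable == dv or variable in ivs:
            continue
        if dv in effects and any(related(causes, associations, variable, iv) for iv in ivs):
            candidates.append(variable)
    return sorted(candidates)


def candidate_interactions(moderations, dv, main_effects):
    """Moderation statements on `dv` whose variables are all chosen main effects."""
    chosen = set(main_effects)
    return [
        tuple(sorted(variables))
        for variables, target in moderations
        if target == dv and set(variables) <= chosen
    ]


# Hypothetical conceptual model (the age -> motivation edge is assumed).
causes = {"motivation": ["pounds_lost"], "age": ["pounds_lost", "motivation"]}
associations = {frozenset(("race group", "pounds_lost"))}
moderations = [(("age", "motivation"), "pounds_lost")]

extra = candidate_main_effects(causes, associations, "pounds_lost", ["motivation"])
print(extra)
print(candidate_interactions(moderations, "pounds_lost", ["motivation"] + extra))
```

In this sketch an interaction is only suggested once all of its variables have been chosen as main effects, mirroring the statement that interaction suggestions depend on the main effects the analyst includes.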

Family/link function

Family and link functions depend on the data types of dependent variables and their distributions.

Based on the data type of the dependent variable, Tisane suggests matched pairs of possible family and link functions to consider. Tisane ensures that analysts consider only valid pairs of family and link functions.
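
The idea of offering only matched family/link pairs can be sketched with two lookup tables. The links listed per family are standard choices in common GLM software; the mapping from data type to families, the data type labels, and the function name are assumptions for illustration and may differ from what Tisane actually suggests.

```python
# Links that are standard choices for each family in common GLM software.
VALID_LINKS = {
    "Gaussian": ["identity", "log", "inverse"],
    "Gamma": ["inverse", "log", "identity"],
    "Poisson": ["log", "identity", "sqrt"],
    "Binomial": ["logit", "probit", "cloglog"],
}

# ASSUMPTION: which families to offer per data type; Tisane's table may differ.
FAMILIES_BY_DTYPE = {
    "numeric": ["Gaussian", "Gamma", "Poisson"],
    "binary nominal": ["Binomial"],
}


def suggest_family_link_pairs(dtype):
    """Return every (family, link) pair considered valid for `dtype`."""
    if dtype not in FAMILIES_BY_DTYPE:
        raise ValueError(f"no families listed for data type {dtype!r}")
    return [
        (family, link)
        for family in FAMILIES_BY_DTYPE[dtype]
        for link in VALID_LINKS[family]
    ]


for family, link in suggest_family_link_pairs("binary nominal"):
    print(family, link)
```

Because pairs are generated from the tables rather than typed freely, a mismatched combination such as a Gaussian family with a logit link never appears among the options.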


Limitations

  • Tisane is designed for researchers or analysts who are domain experts and can accurately express their domain knowledge and data measurement/collection details using the Tisane graph specification language. We performed an initial evaluation of the expressive coverage of Tisane's language and found that it is useful for expressing a breadth of study designs common in HCI.

Benefits

Tisane helps analysts avoid common threats to statistical conclusion and external validity.

Specifically, Tisane helps analysts avoid

  • violations of GLM assumptions, by inferring random effects and plausible family and link functions
  • fishing and false discovery due to conceptually incomplete statistical models
  • interaction of the causal relationships with units and interaction of the causal relationships with settings, both of which arise from not modeling the appropriate clusters (non-independence of observations) as random effects

These are four of the 37 threats to validity Shadish, Cook, and Campbell outline across internal, external, statistical conclusion, and construct validity [1].


Examples

Check out examples here!

References

[1] Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.

Comments
  • Graph created from a Design is empty

    For example, if we take the following code:

        def test_more_complex(self):
            student = ts.Unit(
                "Student", attributes=[]
            )  # object type, specify data types through object type
            race = student.nominal("Race", cardinality=5, exactly=1)  # proper OOP
            ses = student.numeric("SES")
            test_score = student.nominal("Test score")
            tutoring = student.nominal("treatment")
            race.associates_with(test_score)
            student.associates_with(test_score)
            race.moderate(ses, on=test_score)
            design = ts.Design(dv=test_score, ivs=[race, ses])
            gr = design.graph
            print(gr.get_nodes())
            self.assertTrue(gr.has_variable(test_score))
    

    the print will print an empty list, and the assertion will fail. This seems to be because graph.py requires relationships from tisane.og_variable instead of from tisane.variable, and tisane.design calls tisane.graph.Graph.add_relationship to add edges:

    # from tisane/graph.py
        def add_relationship(
            self, relationship: Union[Has, Treatment, Nest, Associate, Cause, Moderate]
        ):
            if isinstance(relationship, Has):
                identifier = relationship.variable
                measure = relationship.measure
                repetitions = relationship.repetitions
                self.has(identifier, measure, relationship, repetitions)
            elif isinstance(relationship, Treatment):
                identifier = relationship.unit
                treatment = relationship.treatment
                repetitions = relationship.num_assignments
                self.treat(unit=identifier, treatment=treatment, treatment_obj=relationship)
            # ...
    

    The types for relationship are imported from tisane.og_variable, which means that none of the relationships are added as edges.

    bug fixed 
    opened by audreyseo 2
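
The failure mode described in the issue above can be reproduced without Tisane. In this sketch, two throwaway modules stand in for `tisane.variable` and `tisane.og_variable`; each defines its own `Cause` class, and the simplified `add_relationship` function is a hypothetical reduction of the method quoted in the issue.

```python
import types


def make_module(name):
    """Create a throwaway module that defines its own `Cause` class."""
    module = types.ModuleType(name)
    exec("class Cause:\n    pass\n", module.__dict__)
    return module


variable = make_module("variable")        # stands in for tisane.variable
og_variable = make_module("og_variable")  # stands in for tisane.og_variable


def add_relationship(relationship, edges):
    """Simplified dispatcher that only knows the og_variable classes."""
    if isinstance(relationship, og_variable.Cause):
        edges.append(relationship)
    # No else branch: an unrecognized type is dropped without an error.


edges = []
relationship = variable.Cause()
add_relationship(relationship, edges)

print(isinstance(relationship, variable.Cause))     # True
print(isinstance(relationship, og_variable.Cause))  # False
print(edges)                                        # []
```

Two classes with the same name in different modules are unrelated types, so every `isinstance` branch falls through and the graph stays empty. An `else` branch that raises `TypeError` would surface this kind of mismatch immediately.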
  • Fixes to graph visualization in tikz + dot!

    • causes and associates edges should be the same style (solid) and different from has (dotted), nests (dashed)
    • dependent variables should be filled in with "light-ish" grey (i.e., grey!30) --> def get_causes_associates_tikz_graph(self) --> def get_causes_associates_tikz_graph(self, dv: AbstractVariable)
    opened by emjun 1
  • Try SAT and ASP formulation of StatisticalModel Selection problem

    12.03: Start trying SAT formulation using Z3

    TODO:

    • Try ASP -- see which one is easy, etc.

    Other:

    • [12.03] talked w Alan Borning about possible solvers and logical formulations --> SAT seems to make a lot of sense, finite domain constraints
    opened by emjun 1
  • Exp graph inference

    • Rewrite graph inference rules to not use SMT
    • Add tests for using graph to infer main, interaction, and random effects and family/link functions
    • Add/debug functionality for constructing graphs from interaction/moderation effects that are specified in the language (see https://github.com/emjun/tisane/commit/a541ba5db952fd4856fc75943e811206e04b5889)
    opened by emjun 0
  • Exp api

    This branch introduces API redesigns/re-implementation, specifically:

    • Introduce three data types: Units, Measures, and SetUp variables
    • Updated the API to follow more OOP conventions so that measures must be declared through a unit. This enables us to enforce more valid programs/declarations at the API level, effectively enforcing automatically/removing the need to manually check for data measurement relationships between units and measures (was previously another step of compilation)
    • Add more tests for the new API
    • Start implementing new graph inference rules (this should have been a separate branch. Further development will occur on a separate branch.)
    • Update README to reflect the above changes.
    opened by emjun 0
  • Kb experimental

    KB experimental branch: Tried ASP and SMT implementations in parallel. Decided to stick with SMT implementation. Passes tests. Iterated/revised KB multiple times.

    Done experimenting with Knowledge Base implementations for now. Time to experiment with API.

    opened by emjun 0
  • When update concepts, not reflected in ConceptGraph

    This was a bug.

    Seems that issue was that we were deepcopying graphs and nodes when updating the graphs, so the references to objects in the test cases were "older" objects that we updated but were not in the ConceptGraphs. I removed deepcopies in the latest commit.

    bug fixed 
    opened by emjun 0
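
The deep-copy problem described in the issue above comes down to object identity. The sketch below uses a hypothetical `Concept` class and plain lists in place of a ConceptGraph to show why updates made through the original reference never reach a deep-copied node.

```python
import copy


class Concept:
    """Hypothetical node type; stands in for whatever the graph stores."""

    def __init__(self, name):
        self.name = name
        self.dtype = None


concept = Concept("SES")

# A graph that deep-copies on insert holds a separate object.
nodes_copied = [copy.deepcopy(concept)]
# A graph that stores the reference shares the object with the caller.
nodes_shared = [concept]

# The test case updates the object it still holds.
concept.dtype = "numeric"

print(nodes_copied[0].dtype)       # None
print(nodes_shared[0].dtype)       # numeric
print(nodes_copied[0] is concept)  # False
```

Storing references, as the fix in the issue did by removing the deep copies, keeps the caller's objects and the graph's nodes in sync.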
  • RFC on strategy/design doc for Tisane R

    Goal: Create an R version of Tisane

    Considerations:

    • Keep user-facing API as R-idiomatic as possible
    • Reduce duplicate maintenance effort, so that changes to Tisane in Python also improve/update the Tisane R package.

    Pipeline at 10,000 ft

    R API (user input code) --> Python script --> JSON --> Tisane GUI --> R Code (output statistical modeling code)

    Note: The new part is R API --> Python script. The rest is already how Tisane (Python implementation) works. Put another way, the goal is to "transpile" R into Python.

    How to compile/transpile R into Python?

    • Strategy 1: Build up internal graph IR in R, traverse graph to produce Python code
    • Strategy 2: Parse R script into AST, traverse AST, generate Python code from AST
    • Strategy 3: Build up internal graph IR in R, output graph IR in some format (maybe DOT or something like that), read in graph output, write Python code from graph

    In all of these: Key thing is to control Python script execution through a bash script, which we can call from R.

    Current/next steps

    As of January 18, 2022: I opt for Strategy 1 first because (i) I suspect the syntax of Tisane is likely to change more than the graph IR and (ii) outputting the graph to read it back in might not be necessary.

    TODOS related to Tisane R:

    • @emjun: write R API --> Python script
    • @audreyseo: revise code generation (Tisane GUI --> code) (see related issue)
    rfc 
    opened by emjun 2
  • [RFC] Causes vs. Associates_with

    Tisane currently provides two types of conceptual relationships: causes and associates_with. This doc covers when and how to use these verbs.

    If a user provides associates_with, we walk them through possible association patterns to identify the underlying causal relationships. In other words, associates_with indicates a need for disambiguation to compile to a series of causes statements.

    To do this well, we need to resolve two competing interests: causal accuracy and usability. Prioritizing causal accuracy, the system should help an analyst distinguish and choose among an exhaustive list of possible causal situations. However, doing so may be unusable because the task of differentiating among numerous possible causal situations may be unrealistic for analysts unfamiliar with causality. These concerns do not seem insurmountable.

    With an infinite number of hidden variables, there are an infinite number of possible causal relationships. We could restrict the number of hidden variables an analyst considers. This decision compromises causal accuracy for usability. If we had a justifiable cap on hidden variables, it may be worthwhile to take this approach.

    Another perspective: If the goal is to translate each associates_with into a set of causes, why provide associates_with at all?

    The primary reason I wanted to provide both was because of the following:

    • Analysts are sometimes unsure about the causal edges in their conceptual models. This uncertainty can be due to their own lack of knowledge or because the relationships are hypothesized but not known and now the analysts want to see if data supports the hypothesized relationships.
    • There may be a lack of definitive evidence in a domain about some causal edges and paths (that may involve multiple variables).

    In all these cases, it seems important to acknowledge what is known, what is hypothesized/the focus of inquiry, and what is asserted for the scope of the analysis. (accurate documentation, transparency)

    In the current version of Tisane, analysts can express any relationships they might know or are probing into using causes. If analysts do not want to assert any causal relationships due to a perceived lack of evidence in their field, they should use associates_with. Whenever possible, analysts should use causes instead of associates_with.

    Tisane's model inference process makes arguably less useful covariate selection recommendations based on associates_with relationships. Tisane looks for variables that have associates_with relationships with both one of the IVs and the DV. Tisane suggests these variables as covariates with caution, including a warning in the Tisane GUI and a tooltip explaining to analysts that associates_with edges may have additional causal confounders that are not specified or detectable with the current specification.

    For the causes relationships, Tisane uses the disjunctive criteria, developed for settings where researchers may be uncertain about their causal models, to recommend possible confounders as covariates.

    We assume that the set of IVs an end-user provides in their query are the ones they are most interested in and want to treat as exposures.

    What happens if the initial choice of variables could lead to confusion in interpretation of results? We currently treat each IV as a separate exposure and combine all confounders into one model. In some cases, this may lead to interpretation confusion. For example, if the model includes two variables on the same causal path, one of the variables may appear to have no effect on the outcome even if it does (due to d-separation). We currently expect analysts to be aware of this and to interpret their results in light of their variable selection choices. In their input queries, analysts should include only the variables they care most about.

    Moving forward

    I would like to see the following (working list, no priority given yet):

    • Tisane: Separate out use cases and provide language constructs for each: lack of knowledge vs. hypothesized causal edges vs. lack of definitive evidence in the domain.
      • Language design: Remove associates with, require only causes
      • Provide a "gallery" or "library" of canonical graph shapes/statements they could adapt.
      • Allow for inclusion of hidden variables?
      • Generate multiple linear models to verify the input DAG/mechanism validation.
      • Enforce variable selection that guarantees accurate inference, not just DAG/mechanism validation.
    • A question I keep coming back to: How usable is causal modeling to non-experts and how can we make it more usable to them?
      • Study/find out what makes stating "causes" statements difficult for researchers and how to constructively support their skepticism rather than allowing them to avoid formalizing their knowledge.

    Implementation changes:

    • [BIG] Thomas R. had some hesitation about the theoretical soundness of the disjunctive criteria. He did not expand much, but I hope to meet with him early in the winter quarter to discuss.
    • I could re-implement Tisane in R so that it uses dagitty under the hood. Would have to see how to use dagitty under the hood in Python.
    • In R, I've never created a widget/plug in, but I can look into how to do that.
    • Both R and Python versions could use a code-only interface, not having to rely on the GUI.

    Follow-up work/Paper ideas:

    • Eval and improve conceptual modeling language
    • Eval Tisane vs. R
    rfc 
    opened by emjun 2
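
The RFC above says Tisane uses the disjunctive criteria to recommend confounders for `causes` relationships. As commonly stated, that criterion adjusts for pre-exposure covariates that cause the exposure, the outcome, or both. The sketch below is one reading of it, not Tisane's implementation: it assumes causes are followed transitively and that anything downstream of the exposure is excluded. The function names, the `exercise` variable, and the example edges are hypothetical.

```python
def ancestors(parents, node):
    """Return every variable that causes `node`, directly or through a chain."""
    seen = set()
    stack = list(parents.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def disjunctive_cause_covariates(parents, exposure, outcome):
    """Causes of the exposure or the outcome, minus anything the exposure causes."""
    causes_of_either = ancestors(parents, exposure) | ancestors(parents, outcome)
    # Variables caused by the exposure (e.g. mediators) are not pre-exposure.
    downstream = {v for v in causes_of_either if exposure in ancestors(parents, v)}
    return sorted(causes_of_either - downstream - {exposure, outcome})


# Hypothetical model, mapping each variable to its direct causes:
# age -> motivation -> exercise -> pounds_lost, plus age -> pounds_lost
# and motivation -> pounds_lost.
parents = {
    "pounds_lost": ["motivation", "exercise", "age"],
    "exercise": ["motivation"],
    "motivation": ["age"],
}

print(disjunctive_cause_covariates(parents, "motivation", "pounds_lost"))
```

Here `age` is recommended because it causes both the exposure and the outcome, while `exercise` is left out because it lies on the path from `motivation` to `pounds_lost`.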
  • Moderation on Nominal Not Working

    import tisane as ts
    import pandas as pd
    import os
    
    
    FILE_NAME = "schools.csv"
    
    dir = os.path.dirname(__file__)
    df = pd.read_csv(os.path.join(dir, FILE_NAME))
    
    
    # Initialize Units
    school = ts.Unit("schid", cardinality=10)
    student = ts.Unit("stuid", cardinality=96)
    
    
    homework = student.ordinal("homework", order=[0, 1, 2, 3, 4, 5, 6, 7])
    
    # school variables
    school_size = school.ordinal("scsize", order=[2,3,4,6])
    school_region = school.nominal("region")
    school_type = school.ordinal("sctype", order=[1, 4])
    public = school.nominal("public")
    
    # Define relationships
    public.causes(homework)
    public.moderates(school_region, on=homework)
    school_size.associates_with(homework)
    school_type.associates_with(homework)
    
    design = ts.Design(dv=school_size, ivs=[school_region]).assign_data(df)
    
    ts.infer_statistical_model_from_design(design=design)
    
    
    opened by shreyashnigam 1
  • Explanation of effect parameter to causes is confusing

    In API_OVERVIEW.md, the description of the effect parameter to causes is effect: tisane.variable.AbstractVariable -- the cause data variable. Is effect not supposed to be the result of the cause-ing variable?

    documentation 
    opened by audreyseo 0
Owner
Eunice Jun
PhD student in computer science at University of Washington. Human-computer interaction, statistical analysis, programming languages, all things data.