Soda SQL

Data testing, monitoring, and profiling for SQL-accessible data.

Overview

What does Soda SQL do?

Soda SQL allows you to:

  • Stop your pipeline when bad data is detected
  • Extract metrics and column profiles through highly efficient SQL
  • Take full control over metrics and queries through declarative configuration files

Why Soda SQL?

To protect the consumers of your data against silent data issues, it is best practice to profile and test your data:

  • as it lands in your warehouse,
  • after every important data processing step, and
  • right before consumption.

This way you prevent the delivery of bad data to downstream consumers. You will spend less time firefighting and gain a better reputation.

How does Soda SQL work?

Soda SQL is a Command Line Interface (CLI) and a Python library to measure and test your data using SQL.

As input, Soda SQL uses YAML configuration files that include:

  • SQL connection details
  • What metrics to compute
  • What tests to run on the measurements

Based on those configuration files, Soda SQL performs scans. A scan executes all measurements and runs all tests associated with one table. Typically, a scan is executed after new data has arrived. All soda-sql configuration files can be checked into your version control system as part of your pipeline code.
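
A scan can be launched from the command line or from Python. Below is a minimal programmatic sketch based on the ScanBuilder API that also appears in the issue reports further down this page; the file paths are hypothetical, and has_test_failures() is assumed from the soda-sql programmatic-scan docs.

from sodasql.scan.scan_builder import ScanBuilder

# Point the builder at the warehouse connection and the table's scan file.
scan_builder = ScanBuilder()
scan_builder.warehouse_yml_file = 'warehouse.yml'
scan_builder.scan_yml_file = 'tables/customers.yml'

# Build and execute the scan, then stop the pipeline on test failures.
scan = scan_builder.build()
scan_result = scan.execute()
if scan_result.has_test_failures():
    raise RuntimeError('Scan has test failures, stopping the pipeline')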

Want to try Soda SQL? Head over to our '5 minute tutorial' and get started straight away!

"Show me the money"

Simple metrics and tests can be configured in YAML configuration files called scan.yml. An example of the contents of such a file:

metrics:
    - row_count
    - missing_count
    - missing_percentage
    - values_count
    - values_percentage
    - valid_count
    - valid_percentage
    - invalid_count
    - invalid_percentage
    - min
    - max
    - avg
    - sum
    - min_length
    - max_length
    - avg_length
columns:
    ID:
        metrics:
            - distinct
            - duplicate_count
        valid_format: uuid
        tests:
            - duplicate_count == 0
    CATEGORY:
        missing_values:
            - N/A
            - No category
        tests:
            - missing_percentage < 3
    SIZE:
        tests:
            - max - min < 20

Metrics aren't limited to the ones defined by Soda SQL. You can create your own custom SQL metric definitions with a simple YAML file:

sql: |
    SELECT sum(volume) as total_volume_us
    FROM CUSTOMER_TRANSACTIONS
    WHERE country = 'US'
tests:
    - total_volume_us > 5000

Based on these configuration files, Soda SQL will scan your data each time new data arrives, like this:

$ soda scan ./soda/metrics my_warehouse my_dataset
Soda 1.0 scan for dataset my_dataset on prod my_warehouse
  | SELECT column_name, data_type, is_nullable
  | FROM information_schema.columns
  | WHERE lower(table_name) = 'customers'
  |   AND table_catalog = 'datasource.database'
  |   AND table_schema = 'datasource.schema'
  - 0.256 seconds
Found 4 columns: ID, NAME, CREATE_DATE, COUNTRY
  | SELECT
  |  COUNT(*),
  |  COUNT(CASE WHEN ID IS NULL THEN 1 END),
  |  COUNT(CASE WHEN ID IS NOT NULL AND ID regexp '\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' THEN 1 END),
  |  MIN(LENGTH(ID)),
  |  AVG(LENGTH(ID)),
  |  MAX(LENGTH(ID))
  | FROM customers
  - 0.557 seconds
row_count : 23543
missing   : 23
invalid   : 0
min_length: 9
avg_length: 9
max_length: 9

...more queries...

47 measurements computed
23 tests executed
All is good. No tests failed. Scan took 23.307 seconds

The next step is to add Soda SQL scans to your favorite data pipeline orchestration solution (a minimal Airflow sketch follows the list):

  • Airflow
  • AWS Glue
  • Prefect
  • Dagster
  • Fivetran
  • Matillion
  • Luigi
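
As an illustration, here is a minimal Airflow sketch; the DAG id, task names, schedule, and file paths are hypothetical, and the same pattern (run soda scan, fail the task on a non-zero exit code) applies to the other orchestrators.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id='soda_sql_scan_example',
    start_date=datetime(2022, 1, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:
    # Placeholder for the task that loads or transforms the data.
    ingest = BashOperator(
        task_id='ingest_new_data',
        bash_command="echo 'load new data here'",
    )
    # soda scan exits non-zero when tests fail, which fails this task
    # and stops the downstream pipeline.
    scan = BashOperator(
        task_id='soda_scan_customers',
        bash_command='soda scan warehouse.yml tables/customers.yml',
    )
    ingest >> scan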

If you like the goals of this project, encourage us! Star sodadata/soda-sql on GitHub.

Next, head over to our '5 minute tutorial' and get your first project going!

Comments
  • Snowflake dialect implementation issue

    Hi guys,

    I've tried using SodaSQL with Snowflake, but I've had an issue with soda init.

    First, I've run soda create ., which created the default warehouse.yml. I modified it to look like the following:

    name: .
    connection:
      type: snowflake
      username: MYUSERNAME
      password: env_var(SNOWFLAKE_PWD)
      account: MYACCOUNTNAME
      database: MYDB
      warehouse: MYWH
      schema: MYSCHEMA
    

    In ~/.soda/env_vars.yml I have:

    .:
      SNOWFLAKE_PWD: '<my secret password>'
    

    When I run soda init . I get the following error:

    /Users/alessandro.lollo/.pyenv/versions/3.8.3/envs/playground/lib/python3.8/site-packages/pandas/compat/__init__.py:120: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
      warnings.warn(msg)
    /Users/alessandro.lollo/.pyenv/versions/3.8.3/envs/playground/lib/python3.8/site-packages/snowflake/connector/options.py:78: UserWarning: You have an incompatible version of 'pyarrow' installed (2.0.0), please install a version that adheres to: 'pyarrow<0.18.0,>=0.17.0; extra == "pandas"'
      warn_incompatible_dep('pyarrow', _installed_pyarrow_version.version, _expected_pyarrow_version)
      | Querying warehouse for tables
      | Exception: TODO override and implement this abstract method
    Traceback (most recent call last):
      File "/Users/alessandro.lollo/.pyenv/versions/3.8.3/envs/playground/lib/python3.8/site-packages/sodasql/cli/cli.py", line 196, in init
        warehouse.dialect.sql_tables_metadata_query())
      File "/Users/alessandro.lollo/.pyenv/versions/3.8.3/envs/playground/lib/python3.8/site-packages/sodasql/scan/dialect.py", line 98, in sql_tables_metadata_query
        raise RuntimeError('TODO override and implement this abstract method')
    RuntimeError: TODO override and implement this abstract method
    

    Apart from the pyarrow and lzma warnings, it seems that the sql_tables_metadata_query method is not implemented for Snowflake. I have not looked into the SodaSQL code, though.

    opened by AlessandroLollo 11
  • [Telemetry] Add anonymous usage and profiling statistics collection

    Goal

    To better understand the usage, adoption, and performance of our tool, we should collect usage and performance statistics. With those, we want to achieve the following goals:

    • Pro-actively track bugs and find and fix errors to get ahead of issues faster
    • Manage performance to keep our tools fast and secure
    • Troubleshoot issues, speed up resolution, and minimize any downtime for users
    • Understand how users use our tools to improve user flow and experience

    Those statistics must be strictly anonymous, will not pertain to any private information, and will be limited to tracking events that emanate from the tool itself.

    What do we consider private and will never track?

    • Personally identifiable information (names, email addresses, credentials, geolocation, IP, etc.)
    • Organizations' or Users' Proprietary information (e.g. name of checks, datasets, test data, or test results content)

    Target Release: v2.2.0

    Requirements

    • Users should be able to opt out at any point via an easy-to-configure switch (e.g. a YAML variable in config.yaml)
    • Documentation pages are updated with reference to this GitHub issue along with goal and rationale digest and how to opt-out of the tool.
    • Telemetry events should not significantly hurt the performance of the tool (i.e. we need to make sure we use an async or non-blocking framework/API for that).

    Proposed Solution

    Event Collection Framework

    Two generally accepted frameworks come to mind:

    • Open Telemetry: built on open standards, easy to configure, suitable for tracing of performance (traces and spans) with high granularity.

    • Snowplow: open source option, used by a lot of organizations and reliable; a little heavier setup, and only made for behavioral events, so tracing around performance would have to be shoehorned into their data models.

    Event Schema & Attributes

    Names and implementations are just ideas. Let's challenge those in this issue and we can update this once we're set.

    • user_cookie_id: we could generate a UUID and store it in ~/.soda/config.yaml or in a different file in that config folder.
    • command(_ran): we could get it as a trace name or just add it as an attribute in the attributes key, value object. We should be able to capture this from click or API method name.
    • command_options: these should be flags used if we're talking about a CLI invocation. Can be captured from click but we might want to find a more internal way of capturing it from method signatures of class attributes so that we are ready for API-driven usage in the future.
    • version: which version of soda_sql is this coming from: https://github.com/sodadata/soda-sql/blob/main/core/sodasql/version.py or whatever else makes sense.
    • datasource_type: other names could be dialect_id but it lends itself less to soda-spark or other data source types so maybe datasource_type is fine.
    • datasource_id: given the importance of knowing how many projects and teams are using the tool, we should add a datasource_id, which should be a hash of the connection parameter that best uniquely identifies a group of people working on the same project.
      • Postgres, Redshift, Spark, Presto: host
      • Snowflake: account
      • BigQuery: project
      • etc. to be determined on a dialect/supported data source basis.
      • see a similar rationale from dbt

    ~- run_status: the alternative name could be invocation_status and would allow us to know if a scan, analyze, etc. executed successfully or not (exception types should be logged as well).~
    ~- metrics_count: number of metrics collected/derived in a scan.~
    ~- metrics_invoked: which metrics people are using. NOTE: we should not log custom metrics derivation code (i.e. the SQL query of a custom metric or any formula, as this is considered proprietary information). When users use custom_sql we should just know that they used a custom metric.~
    ~- tests_count: count of tests evaluated.~
    ~- tests_invoked: same reasoning as the metrics list: if the tests are built-in and do not reveal any proprietary logic we should have a list; if not, let's leave it out.~

    • sql_metrics_count: see comment about metrics_count. SQL metrics contain proprietary business logic so we absolutely do not want to track the SQL query itself but just the fact that a custom metric was used.❓
    • historic_metrics_count
    • architecture: we can use the platform or os standard libraries to get the architecture we're running on (see the sketch after this list)
    • operating_system: see above.
    • python_version
    • python_implementation
    • invocation_start: span start/end in the OT framework
    • invocation_end: same as above.
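
    A minimal sketch of how the environment attributes above could be collected with just the Python standard library (attribute names follow the proposals in this issue and are not final):

    import platform
    import uuid

    # Anonymous, non-identifying attributes only; the cookie id would be
    # generated once and persisted, e.g. in ~/.soda/config.yaml.
    telemetry_attributes = {
        'user_cookie_id': str(uuid.uuid4()),
        'architecture': platform.machine(),            # e.g. 'x86_64', 'arm64'
        'operating_system': platform.system(),         # e.g. 'Linux', 'Darwin'
        'python_version': platform.python_version(),
        'python_implementation': platform.python_implementation(),
    }
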
    soda-sql 
    opened by bastienboutonnet 11
  • Update setup.py with git based build/release tagging

    Currently, building the local version uses a hardcoded __dev__ as the version; it would be better to use the git SHA/ref as the version (a sketch follows the checklist below).

    • [ ] update setup.py to pickup the git ref and set SODA_SQL_VERSION
    • [ ] Update documentation on how to build/install locally
    • [ ] Make sure that CI/CD uses the same process
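
    A sketch of what this could look like in setup.py (the helper name and the fallback value are hypothetical):

    import os
    import subprocess

    def get_version() -> str:
        # Prefer an explicitly provided version, e.g. set by CI/CD.
        env_version = os.environ.get('SODA_SQL_VERSION')
        if env_version:
            return env_version
        # Otherwise derive the version from the current git tag/SHA.
        try:
            return (
                subprocess.check_output(['git', 'describe', '--tags', '--always'])
                .decode()
                .strip()
            )
        except (OSError, subprocess.CalledProcessError):
            return 'dev'
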
    opened by vijaykiran 9
  • Split soda-sql in separate modules

    Folder structure to be discussed

    Libraries use dashes in the names, no underscores. Folder names as indicated below all use underscores.

    + src
        + sodasql (why was this again?)
            + setup.py
        + sodasql_cloud
            + setup.py
            + sodasql_cloud
            + tests
        + sodasql_scan
            + setup.py
            + sodasql_scan
            + tests
        + sodasql_cli
            + setup.py
            + sodasql_cli
            + tests
        + sodasql_airflow
        + connectors
            + sodasql_postgres
                + setup.py
                + sodasql_postgres
                + tests
            + sodasql_snowflake
            + sodasql_athena
            + sodasql_redshift
            + sodasql_bigquery
            + sodasql_spark
    

    Also, before starting I would like to understand how we'll deal with versioning the individual libs and how this impacts the release process.

    TODO investigate if dialects can be extracted. Passing data from the core lib to dialects is no problem, but the dialects should not invoke methods on core stuff; that would cause circular dependencies. To be investigated before starting this.

    enhancement soda-sql 
    opened by tombaeyens 9
  • soda-sql doesn't show the test names of the tests executed during a scan.

    Hi team,

    Soda SQL doesn't show the names of the tests executed during a scan. It does so for measurements, though. Please see the output below:

    (.venv) mathissedestrooper@Mathisses-MacBook-Pro soda-sql-project % soda scan ./snowflake-soda-warehouse ORDERS
      | 2.0.0b5
      | Scanning ORDERS in ./snowflake-soda-warehouse ...
      | sql_metric_path scan.yml does not exist
      | Executing SQL query: (...)
      | missing_count(O_ORDERDATE) = 0
      | values_percentage(O_ORDERDATE) = 100.0
      | 62 measurements computed
      | 1 tests executed
      | All is good. No tests failed.

    It would be handy to show the test names and results instead of just showing x tests executed and x failed. This would allow the user to immediately see what's wrong and act upon it accordingly.

    Thanks!
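
    For reference, the per-test details are already available from the programmatic API; a sketch (file paths are hypothetical; the test.title and passed fields match the TestResult repr shown in a later issue on this page):

    from sodasql.scan.scan_builder import ScanBuilder

    scan_builder = ScanBuilder()
    scan_builder.warehouse_yml_file = 'warehouse.yml'
    scan_builder.scan_yml_file = 'tables/orders.yml'
    scan_result = scan_builder.build().execute()

    # Print each test's title and outcome.
    for test_result in scan_result.test_results:
        status = 'PASSED' if test_result.passed else 'FAILED'
        print(f'{test_result.test.title}: {status}')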

    good first issue soda-sql 
    opened by mathissedestrooper 8
  • Allow users to ingest dbt test results into Soda Cloud via soda-sql

    Context

    Users who use dbt have spent time crafting tests in their transformations. Because we don't want users to repeat themselves in order to benefit from Soda Cloud's visibility and incident management, we plan to:

    • Ingest dbt test results into Soda Cloud.
    • Provide clear indications of whether the tests are maintained in soda-sql or ingested from dbt
    • Allow collaboration and reporting to be done on those results just as if they were maintained from soda-sql

    User Flow

    Users should be able to call a command such as soda ingest --tool dbt, provide soda-sql with paths to the dbt artifacts, and have their tests results show up in the Soda Cloud UI. We will have to be clear in our documentation that ingestion should be orchestrated once dbt has been invoked and the transformations have completed. Soda-sql (at least in this first version) will not integrate with the core dbt API. This means it will not orchestrate a dbt run or be called by dbt. Those scenarios may be revisited in further iterations of our integration.

    We may also consider creating a new soda command such as soda capture, which would run soda scan and follow it with soda ingest, or a flag on soda scan such as --ingest dbt. Up for discussion and TBD once the minimal isolated approach is working. This is sugar-coating at this point.

    Overall Approach

    In this v1, we want to keep it simple. dbt generates two valuable artifacts: manifest.json and run_results.json. When it comes to test results, we can parse the results from the run_results.json and use the info in the manifest.json to gain knowledge about the table (dataset in Soda's language) that it maps onto.

    Once we have parsed all the info we need (see below) we can compose a TestObject similar (but not exactly) to the soda-sql TestObject that is then exposed to the Soda Cloud backend API. The UI will show the ingested tests on the dataset view, those tests will be clearly identified as external and display a dbt logo or whatever UI choice makes sense.
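
    A minimal sketch of that parsing step (paths assume dbt's default target/ folder; in recent dbt versions compiled_sql is renamed compiled_code):

    import json

    # Load the two dbt artifacts produced by a dbt test/build run.
    with open('target/run_results.json') as f:
        run_results = json.load(f)
    with open('target/manifest.json') as f:
        manifest = json.load(f)

    for result in run_results['results']:
        unique_id = result['unique_id']  # e.g. 'test.my_project.not_null_orders_id'
        node = manifest['nodes'].get(unique_id, {})
        if node.get('resource_type') == 'test':
            # status is pass/fail/error/skipped; compiled_sql is the query dbt ran.
            print(unique_id, result['status'], node.get('compiled_sql'))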

    Specific Requirements

    We will want to parse and represent the following information from dbt test results:

    • test_run_at: timestamp of when the test was run.
    • test_result: whether the test is passing or failing.
    • severity: if this info is available (I know it is for freshness but maybe not for all).
    • info that allows us to generate a fully qualified table name:
      • database
      • schema
      • table
    • tool_name: in this case it would be dbt (this information will be used by the Soda Cloud UI to display the test provenance and govern available attributes and details)
    • compiled_sql: dbt should give us access to the compiled sql snippet that it ran for the test. We may want to parse this as we can show it in the Soda Cloud UI and it is valuable information that users can use as a starting point for investigation and issue resolution. If users have enabled store_failures we should be able to obtain the select statement where they can visualise those failures. If this is possible we could consider adding those to the requirements. More info on this can be found here: https://docs.getdbt.com/docs/building-a-dbt-project/tests#storing-test-failures

    Conscious Limitations

    • v1 will only support "generic tests": this means dbt built-ins as well as the ones implemented via macros and added to models or columns of those models.
    • v1 will not support "bespoke tests" unless they follow the best practice of {{ ref() }} in their declaration, in which case the mapping should be available in a structured way just like any other tests (manifest.json and run_results.json).
    • v1 will not attempt to capture results from store_failures (this may be something we offer as a stretch goal, as it fits our failed_rows model; see the last point of Specific Requirements).
    • v1 will be limited to ingesting results from dbt-core (OSS). A future iteration will know how to talk to the dbt Cloud API to leverage similar information.
    soda-sql 
    opened by bastienboutonnet 8
  • Auto-complete soda CLI commands

    Is your feature request related to a problem? Please describe. I'm constantly trying to auto-complete soda commands by pressing "tab".

    Describe the solution you'd like Auto-complete soda commands.

    Additional context This is definitely not a crucial feature request but could be a nice-to-have in the future 😃

    Soda SQL Version: Soda CLI version 2.0.0b21

    enhancement soda-sql 
    opened by mathissedestrooper 7
  • BigQuery documentation is misleading

    Describe the bug Our documentation currently says:

    name: my_bigquery_project
    connection:
        type: bigquery
        account_info: <PATH TO YOUR BIGQUERY ACCOUNT INFO JSON FILE>
        dataset: sodasql
    ...
    

    but our code expects

    1. the property is named account_info_json, and
    2. that property should contain the JSON object itself, not the location of it

    OS:
    Python Version:
    Soda SQL Version: main
    Warehouse Type: BigQuery

    bug good first issue soda-sql 
    opened by milanaleksic 7
  • ImportError when running soda from CLI

    Describe the bug When running soda from the CLI I get the following traceback:

    Traceback (most recent call last):
      File "/usr/lib/command-not-found", line 28, in <module>
        from CommandNotFound import CommandNotFound
      File "/usr/lib/python3/dist-packages/CommandNotFound/CommandNotFound.py", line 19, in <module>
        from CommandNotFound.db.db import SqliteDatabase
      File "/usr/lib/python3/dist-packages/CommandNotFound/db/db.py", line 5, in <module>
        import apt_pkg
    ImportError: libapt-pkg.so.6.0: cannot open shared object file: No such file or directory
    

    To Reproduce Steps to reproduce the behavior:

    1. sudo apt-get install g++ unixodbc-dev python3-dev libssl-dev libffi-dev
    2. sudo apt-get install build-essential cargo
    3. pip3 install soda-sql-postgresql
    4. soda

    Context Running Vagrant+Virtualbox with the attached vagrantfile Vagrantfile.zip

    OS: Ubuntu 18.04.5 LTS
    Python Version: 3.8.11
    Soda SQL Version: unable to get info due to error; should be current as of 07/08/21
    Warehouse Type: PostgreSQL

    bug soda-sql 
    opened by mando222 7
  • Issue with the pyodbc dependency when installing on a Mac with an M1 processor

    Describe the bug There is an issue when installing on a Mac with the M1 processor and it seems to be related to the pyodbc dependency. It fails with this error message: clang: error: linker command failed with exit code 1 (use -v to see invocation) error: command '/usr/bin/clang++' failed with exit code 1

    The issue with pyodbc is described here: https://github.com/mkleehammer/pyodbc/issues/846

    I have tested the workarounds mentioned but I am still unable to get past the pyodbc issue.

    To Reproduce Steps to reproduce the behavior:

    1. Just run: pip3 install soda-sql

    Context Include your scan.yml or warehouse.yml when relevant

    OS: macOS Big Sur 11.2.3
    Python Version: 3.9.2
    Soda SQL Version: 2.0.0b25
    Warehouse Type:

    bug soda-sql 
    opened by albinkjellin 6
  • [bug] Analyze does not push down the LIMIT clause and results in a full scan

    Describe the bug In the analyze phase, the DatasetAnalyzer runs some count queries using a LIMIT clause to avoid a full scan. But when I check the query plan, I see that the limit is applied after the results are calculated, which causes a big performance issue for big tables. The limit should be applied before executing the count/sum operators.

    To Reproduce Steps to reproduce the behavior:

    1. run soda analyze
    2. Run explain from a sql client for the query generated by the analyzer.

    This is the current plan: (for demodata dataset)

    Limit  (cost=2.30..2.31 rows=1 width=16)
    --
    ->  Aggregate  (cost=2.30..2.31 rows=1 width=16)
    ->  Seq Scan on demodata  (cost=0.00..1.65 rows=65 width=13)
    

    It should be:

    Aggregate  (cost=0.05..0.06 rows=1 width=16)
    --
    ->  Limit  (cost=0.00..0.03 rows=1 width=1569)
    ->  Seq Scan on demodata  (cost=0.00..1.65 rows=65 width=1569)
    
    

    One way of fixing the problem is changing FROM demodata limit 1000 to from (select * from demodata limit 1) T;

    Context Include your scan.yml or warehouse.yml when relevant

    OS:
    Python Version:
    Soda SQL Version: 2.0.0b15
    Warehouse Type: Redshift + PostgreSQL

    bug soda-sql 
    opened by mmigdiso 6
  • Incompatible with `Airflow` due to `MarkupSafe` conflict

    Describe the bug soda-core 2.2.2 is incompatible with Airflow 2.4.1 due to a MarkupSafe version conflict:

    ... soda-sql-bigquery (>=2.1.6,<3.0.0) requires markupsafe (2.0.1).
          And because apache-airflow (2.4.1) requires MarkupSafe (>=2.1.1) (2), apache-airflow (2.4.1) is incompatible with soda-sql-bigquery (>=2.1.6,<3.0.0)
    

    To Reproduce Steps to reproduce the behavior:

    1. Create pyproject.toml with dependencies on:
       • airflow = "2.4.1"
       • soda-sql-bigquery = "^2.1.6"
       • any other soda-sql-xxx with the same version
    2. Run poetry update soda-sql-bigquery ...
    3. Observe the conflict

    OS: macOS
    Python Version: 3.9
    Soda SQL Version: 2.2.2

    bug soda-sql 
    opened by KulykDmytro 0
  • #208: fix soda analyze error on mysql

    This is for fixing the errors in #208.

    The error happens when a column name is a MySQL keyword in lower case, for example a column named 'key'.

    With this fix, the issue appears to be resolved for me.

    opened by hgbink 1
  • Soda Core | checks.yml for all Tables in Database

    Goal: run checks.yml on all Tables in Database, implicitly / dynamically (not naming 100s of Tables).

    Following Soda's quick start, I've completed sections:

    • Install Soda Core
    • Connect Soda Core to a data source - configuration.yml

    Now I'm following Write a check and run a scan - checks.yml.


    Problem

    However, the documentation only gives examples that check one Table at a time.

    4 Checks

    1. Sum of Tables (in Database)
    2. Sum of Columns (across all Tables, in Database)
    3. Sum of Tables' descriptions exist
    4. Sum of Columns' descriptions exist

    Queries return a COUNT().


    So far, checks.yml:

    # checks for MY_DATABASE:
      sql_metrics:
      name: num_tables, num_columns
        sum_tables query: |
          SELECT COUNT(*)
          FROM information_schema.tables
          WHERE table_schema = '*';
        sum_columns query: |
          SELECT COUNT(*)
          FROM information_schema.columns
          WHERE table_name = '*';
        sum_tables_descriptions query: |
          -- SQL
        sum_columns_descriptions query: |
          -- SQL
    
    bug soda-sql 
    opened by danielbellhv 2
  • Applying a filter to a row_count 'same as' comparison aggregation

    Describe the bug

    Applying a filter to a row_count check only applies to the metric, not the external column we're evaluating against.

    I can see why this is expected behaviour, but I would like a way to optionally apply the filter to the comparison aggregation.

    To Reproduce

    Steps to reproduce the behavior:

    1. Create a new test in scan.yml
    filter foo [hourly]:
      where: from_iso8601_timestamp(time) between DATE_ADD('hour', -1, NOW()) AND NOW()
      
    checks for foo [hourly]:
      - row_count same as bar
    
    2. Run soda scan with the -V verbose flag

    3. Observe that the "otherRowCount" SQL doesn't include the filter and so counts all rows:

    Soda Core 3.0.4
    Reading configuration file "configuration.yml"
    Reading SodaCL file "checks_foo.yml"
    Scan execution starts
    Query datalake_dev.foo[hourly].aggregation[0]:
    SELECT 
      COUNT(*) 
    FROM datalake.foo
    WHERE from_iso8601_timestamp(time) between DATE_ADD('hour', -1, NOW()) AND NOW()
    Query datalake_dev.bar.aggregation[0]:
    SELECT 
      COUNT(*) 
    FROM datalake.bar
    Scan summary:
    2/2 queries OK
      datalake_dev.foo[hourly].aggregation[0] [OK] 0:00:01.819557
      datalake_dev.bar.aggregation[0] [OK] 0:00:01.262278
    1/1 check FAILED: 
        foo [hourly] in datalake_dev
          row_count same as bar [FAILED]
            value: -12047
            rowCount: 9
            otherRowCount: 12056
    Oops! 1 failures. 0 warnings. 0 errors. 0 pass.
    
    
    Context
    As above

    OS: MAC
    Python Version: 3.8.9
    Soda SQL Version: 3.0.4
    Warehouse Type: AWS Athena
    
    bug soda-sql 
    opened by paulskipprhudson 2
  • test error on scanning mysql table

    Describe the bug I ran a test against a MySQL table and this error occurred:

    [TestResult(test=Test(id='{"expression":"row_count > 0"}', title='test(row_count > 0)', expression='row_count > 0', metrics=['row_count'], column=None, source='soda-sql'), passed=False, skipped=False, values=None, error=NameError("name 'row_count' is not defined"), group_values=None), TestResult(test=Test(id='{"column":"bus_no","expression":"invalid_percentage <= 20"}', title='column(bus_no) test(invalid_percentage <= 20)', expression='invalid_percentage <= 20', metrics=['invalid_percentage'], column='bus_no', source='soda-sql'), passed=False, skipped=False, values=None, error=NameError("name 'invalid_percentage' is not defined"), group_values=None), TestResult(test=Test(id='{"column":"incident_number","expression":"invalid_percentage == 0"}', title='column(incident_number) test(invalid_percentage == 0)', expression='invalid_percentage == 0', metrics=['invalid_percentage'], column='incident_number', source='soda-sql'), passed=False, skipped=False, values=None, error=NameError("name 'invalid_percentage' is not defined"), group_values=None), TestResult(test=Test(id='{"column":"incident_number","expression":"missing_count == 0"}', title='column(incident_number) test(missing_count == 0)', expression='missing_count == 0', metrics=['missing_count'], column='incident_number', source='soda-sql'), passed=False, skipped=False, values=None, error=NameError("name 'missing_count' is not defined"), group_values=None), TestResult(test=Test(id='{"column":"school_year","expression":"invalid_percentage == 0"}', title='column(school_year) test(invalid_percentage == 0)', expression='invalid_percentage == 0', metrics=['invalid_percentage'], column='school_year', source='soda-sql'), passed=False, skipped=False, values=None, error=NameError("name 'invalid_percentage' is not defined"), group_values=None), TestResult(test=Test(id='{"column":"schools_serviced","expression":"invalid_percentage <= 15"}', title='column(schools_serviced) test(invalid_percentage <= 15)', expression='invalid_percentage <= 15', metrics=['invalid_percentage'], column='schools_serviced', source='soda-sql'), passed=False, skipped=False, values=None, error=NameError("name 'invalid_percentage' is not defined"), group_values=None)]
    Test error for "row_count > 0": name 'row_count' is not defined
    Test error for "invalid_percentage <= 20": name 'invalid_percentage' is not defined
    Test error for "invalid_percentage == 0": name 'invalid_percentage' is not defined
    Test error for "missing_count == 0": name 'missing_count' is not defined
    Test error for "invalid_percentage == 0": name 'invalid_percentage' is not defined
    Test error for "invalid_percentage <= 15": name 'invalid_percentage' is not defined
    

    To Reproduce I run this python code:

    from sodasql.scan.scan_builder import ScanBuilder
    
    
    scan_builder = ScanBuilder()
    scan_builder.scan_yml_file = 'tables/breakdowns.yml'
    scan_builder.warehouse_yml_file = 'warehouse.yml'
    scan = scan_builder.build()
    scan_result = scan.execute()
    
    print(scan_result.test_results)
    

    And if I run this command:

     soda scan warehouse.yml tables/breakdowns.yml
    

    I get this:

     | 2.2.1
      | Scanning tables/breakdowns.yml ...
      | There is no value specified for valid_values for column incident_number
      | There is no value specified for valid_min for column incident_number
      | There is no value specified for valid_max for column incident_number
      | There is no value specified for valid_values for column school_year
      | There is no value specified for valid_min for column school_year
      | There is no value specified for valid_max for column school_year
      | There is no value specified for valid_values for column bus_no
      | There is no value specified for valid_min for column bus_no
      | There is no value specified for valid_max for column bus_no
      | There is no value specified for valid_values for column schools_serviced
      | There is no value specified for valid_min for column schools_serviced
      | There is no value specified for valid_max for column schools_serviced
      | # _do_auth(): user: mysql
      | # _do_auth(): self._auth_plugin: 
      | # _do_auth(): user: mysql
      | # _do_auth(): password: mysql
      | new_auth_plugin: caching_sha2_password
      | No Soda Cloud account configured
      | Executing SQL query: 
    SELECT column_name, data_type, is_nullable 
    FROM information_schema.columns 
    WHERE lower(table_name) = 'breakdowns' 
      AND table_schema = 'new_york'
      | SQL took 0:00:00.007393
      |   boro (text) 
      |   breakdown_or_running_late (text) 
      |   bus_company_name (text) 
      |   bus_no (text) 
      |   busbreakdown_id (int) 
      |   created_on (text) 
      |   has_contractor_notified_parents (text) 
      |   has_contractor_notified_schools (text) 
      |   have_you_alerted_opt (text) 
      |   how_long_delayed (text) 
      |   incident_number (int) 
      |   informed_on (text) 
      |   last_updated_on (text) 
      |   number_of_students_on_the_bus (int) 
      |   occured_on (text) 
      |   reason (text) 
      |   route_number (text) 
      |   run_type (text) 
      |   school_age_or_prek (text) 
      |   school_year (text) 
      |   schools_serviced (text) 
      | 21 columns:
      | Query measurement: schema = [{'name': 'boro', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'breakdown_or_running_late', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'bus_company_name', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'bus_no', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'number', 'semanticType': 'number'}, {'name': 'busbreakdown_id', 'type': 'int', 'dataType': 'int', 'nullable': True, 'logicalType': 'number', 'semanticType': 'number'}, {'name': 'created_on', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'has_contractor_notified_parents', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'has_contractor_notified_schools', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'have_you_alerted_opt', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'how_long_delayed', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'incident_number', 'type': 'int', 'dataType': 'int', 'nullable': True, 'logicalType': 'number', 'semanticType': 'number'}, {'name': 'informed_on', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'last_updated_on', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'number_of_students_on_the_bus', 'type': 'int', 'dataType': 'int', 'nullable': True, 'logicalType': 'number', 'semanticType': 'number'}, {'name': 'occured_on', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'reason', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'route_number', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'run_type', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'school_age_or_prek', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'school_year', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'time', 'semanticType': 'time'}, {'name': 'schools_serviced', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'number', 'semanticType': 'number'}]
      | Exception during aggregation query
    Traceback (most recent call last):
      File "/Users/mkoloev/Desktop/soda_projects/soda_mysql_2/venv/lib/python3.9/site-packages/sodasql/scan/scan.py", line 254, in _query_aggregations
        fields.append(dialect.sql_expr_count_conditional(scan_column.non_missing_condition, column_name))
    TypeError: sql_expr_count_conditional() takes 2 positional arguments but 3 were given
      | Test error for "row_count > 0": name 'row_count' is not defined
      | Test error for "invalid_percentage <= 20": name 'invalid_percentage' is not defined
      | Test error for "invalid_percentage == 0": name 'invalid_percentage' is not defined
      | Test error for "missing_count == 0": name 'missing_count' is not defined
      | Test error for "invalid_percentage == 0": name 'invalid_percentage' is not defined
      | Test error for "invalid_percentage <= 15": name 'invalid_percentage' is not defined
      | Executed 1 queries in 0:00:00.021183
      | Scan summary ------
      | 1 measurements computed
      | 6 tests executed
      | 6 of 6 tests failed:
      |   Test test(row_count > 0) failed with measurements null
      |   Test column(bus_no) test(invalid_percentage <= 20) failed with measurements null
      |   Test column(incident_number) test(invalid_percentage == 0) failed with measurements null
      |   Test column(incident_number) test(missing_count == 0) failed with measurements null
      |   Test column(school_year) test(invalid_percentage == 0) failed with measurements null
      |   Test column(schools_serviced) test(invalid_percentage <= 15) failed with measurements null
      | Errors occurred!
      |   [error] Exception during aggregation query
      |   [test_execution_error] Test "row_count > 0" failed
      |   [test_execution_error] Test "invalid_percentage <= 20" failed
      |   [test_execution_error] Test "invalid_percentage == 0" failed
      |   [test_execution_error] Test "missing_count == 0" failed
      |   [test_execution_error] Test "invalid_percentage == 0" failed
      |   [test_execution_error] Test "invalid_percentage <= 15" failed
      | Exiting with code 1
      | Starting new HTTPS connection (1): collect.soda.io:443
      | https://collect.soda.io:443 "POST /v1/traces HTTP/1.1" 200 0
    
    

    Context warehouse.yml:

    name: mysql
    connection:
      type: mysql
      host: localhost
      port: 3306
      username: mysql
      password: mysql
      database: new_york
    

    tables/breakdowns.yml:

    table_name: breakdowns
    metrics:
      - row_count
      - missing_count
      - missing_percentage
      - values_count
      - values_percentage
      - invalid_count
      - invalid_percentage
      - valid_count
      - valid_percentage
      - avg_length
      - max_length
      - min_length
      - avg
      - sum
      - max
      - min
      - stddev
      - variance
    tests:
      - row_count > 0
    columns:
      incident_number:
        valid_format: number_whole
        tests:
          - invalid_percentage == 0
          - missing_count == 0
      school_year:
        valid_format: date_inverse
        tests:
          - invalid_percentage == 0
      bus_no:
        valid_format: number_whole
        tests:
          - invalid_percentage <= 20
      schools_serviced:
        valid_format: number_whole
        tests:
          - invalid_percentage <= 15
    

    OS: macOS Monterey 12.2.1
    Python Version: 3.9.5
    Soda SQL Version: 2.2.1
    Warehouse Type: MySQL

    However, it works fine with Postgres.

    bug soda-sql 
    opened by mrkoloev 1
  • Arithmetic Overflow Error in SQL Server

    Describe the bug

    Getting Overflow error for count validation.

    I guess it is related to the query being run to perform the calculations. In SQL Server, regular COUNT and SUM lead to an overflow error when not applied to bigint; COUNT_BIG, for example, should be used in place of COUNT.

    To Reproduce Steps to reproduce the behavior:

    1. Create a new test in tables/something.yml for a big table
    2. Run soda scan tables/something.yml

    Context

    table_name: big_table
    metrics:
      - row_count
      - missing_count
      - values_count
      - sum
    tests:
      - row_count > 0
    

    OS:
    Python Version: Latest
    Soda SQL Version: Latest
    Warehouse Type: SQL Server

    bug soda-sql 
    opened by lucas-lm 0
Releases: v2.1.3
Owner: Soda Data Monitoring
HTTP(s) "monitoring" webpage via FastAPI+Jinja2. Inspired by https://github.com/RaymiiOrg/bash-http-monitoring

python-http-monitoring HTTP(s) "monitoring" powered by FastAPI+Jinja2+aiohttp. Inspired by bash-http-monitoring. Installation can be done with pipenv

itzk 39 Aug 26, 2022
changedetection.io - The best and simplest self-hosted website change detection monitoring service

changedetection.io - The best and simplest self-hosted website change detection monitoring service. An alternative to Visualping, Watchtower etc. Designed for simplicity - the main goal is to simply monitor which websites had a text change. Open source web page change detection.

null 7.3k Jan 1, 2023
A demo of Prometheus+Grafana for monitoring an ML model served with FastAPI.

ml-monitoring Jeremy Jordan This repository provides an example setup for monitoring an ML system deployed on Kubernetes.

Jeremy Jordan 176 Jan 1, 2023
Ransomware leak site monitoring

RansomWatch RansomWatch is a ransomware leak site monitoring tool. It will scrape all of the entries on various ransomware leak sites, store the data

Zander Work 278 Dec 31, 2022
A Prometheus exporter for monitoring & analyzing Grafana Labs' technical documentation

grafana-docs-exporter A Prometheus exporter for monitoring & analyzing Grafana Labs' technical documentation Here is the public endpoint.

Matt Abrams 5 May 2, 2022
Scout: an open-source version of the monitoring tool

Badger Scout Scout is an open-source version of the monitoring tool used by Badg

Badger Finance 2 Jan 13, 2022
dash-manufacture-spc-dashboard is a dashboard for monitoring read-time process quality along manufacture production line

In our solution based on plotly, dash and influxdb, the user will firstly generate the specifications for different robots, and then a wide range of interactive visualizations for different machines for machine power, machine cost, and total cost based on the energy time and total energy getting dynamically from sensors. If a threshold is met, the alert email is generated for further operation.

Dequn Teng 1 Feb 13, 2022
Log processor for nginx or apache that extracts user and user sessions and calculates other types of useful data for bot detection or traffic analysis

Log processor for nginx or apache that extracts user and user sessions and calculates other types of useful data for bot detection or traffic analysis

David Puerta Martín 1 Nov 11, 2021
Pretty-print tabular data in Python, a library and a command-line utility. Repository migrated from bitbucket.org/astanin/python-tabulate.

python-tabulate Pretty-print tabular data in Python, a library and a command-line utility. The main use cases of the library are: printing small table

Sergey Astanin 1.5k Jan 6, 2023
Display tabular data in a visually appealing ASCII table format

PrettyTable Installation Install via pip: python -m pip install -U prettytable Install latest development version: python -m pip install -U git+https

Jazzband 924 Jan 5, 2023
ClusterMonitor - a very simple python script which monitors and records the CPU and RAM consumption of submitted cluster jobs

ClusterMonitor A very simple python script which monitors and records the CPU and RAM consumption of submitted cluster jobs. Usage To start recording

null 23 Oct 4, 2021
Monitor and log Network and Disks statistics in MegaBytes per second.

iometrics Monitor and log Network and Disks statistics in MegaBytes per second. Install pip install iometrics Usage Pytorch-lightning integration from

Leo Gallucci 17 May 3, 2022
Robust and effective logging for Python 2 and 3.

Robust and effective logging for Python 2 and 3.

Chris Hager 1k Jan 4, 2023
A Fast, Extensible Progress Bar for Python and CLI

tqdm tqdm derives from the Arabic word taqaddum (تقدّم) which can mean "progress," and is an abbreviation for "I love you so much" in Spanish (te quie

tqdm developers 23.7k Jan 1, 2023
Rich is a Python library for rich text and beautiful formatting in the terminal.

Rich 中文 readme • lengua española readme • Läs på svenska Rich is a Python library for rich text and beautiful formatting in the terminal. The Rich API

Will McGugan 41.5k Jan 7, 2023
Progressbar 2 - A progress bar for Python 2 and Python 3 - "pip install progressbar2"

Text progress bar library for Python. Travis status: Coverage: Install The package can be installed through pip (this is the recommended method): pip

Rick van Hattem 795 Dec 18, 2022
Pretty and useful exceptions in Python, automatically.

better-exceptions Pretty and more helpful exceptions in Python, automatically. Usage Install better_exceptions via pip: $ pip install better_exception

Qix 4.3k Dec 29, 2022
A new kind of Progress Bar, with real time throughput, eta and very cool animations!

alive-progress :) A new kind of Progress Bar, with real-time throughput, eta and very cool animations! Ever found yourself in a remote ssh session, do

Rogério Sampaio de Almeida 4k Dec 30, 2022
Stand-alone parser for User Access Logging from Server 2012 and newer systems

KStrike Stand-alone parser for User Access Logging from Server 2012 and newer systems BriMor Labs KStrike This script will parse data from the User Ac

BriMor Labs 69 Nov 1, 2022