Library for exploring and validating machine learning data

Overview

TensorFlow Data Validation


TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets).
  • Automated data-schema generation to describe expectations about data, such as required values, ranges, and vocabularies.
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify issues such as missing features, out-of-range values, or wrong feature types.
  • An anomalies viewer so that you can see which features have anomalies and learn more in order to correct them.

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.
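
As a quick orientation, the snippet below is a minimal sketch of that workflow using TFDV's public API; the TFRecord paths are placeholders, and the visualize/display helpers render inside a notebook.

import tensorflow_data_validation as tfdv

# Compute summary statistics over a TFRecord file of tf.train.Example records.
train_stats = tfdv.generate_statistics_from_tfrecord(data_location='train.tfrecord')
tfdv.visualize_statistics(train_stats)

# Infer an initial schema that captures expected types, domains, and presence.
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema)

# Compute statistics for evaluation data and validate them against the schema.
eval_stats = tfdv.generate_statistics_from_tfrecord(data_location='eval.tfrecord')
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)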

Caution: TFDV may be backwards incompatible before version 1.0.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation
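
To verify the installation, you can run a quick smoke test that computes statistics for a small in-memory DataFrame; this is only a minimal sketch, and the DataFrame contents are arbitrary.

import pandas as pd
import tensorflow_data_validation as tfdv

# A tiny DataFrame just to exercise the statistics pipeline end to end.
df = pd.DataFrame({'age': [25, 32, 47], 'city': ['NYC', 'SF', 'NYC']})
stats = tfdv.generate_statistics_from_dataframe(df)
print(stats.datasets[0].num_examples)  # expected output: 3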

Nightly Packages

TFDV also hosts nightly packages at https://pypi-nightly.tensorflow.org on Google Cloud. To install the latest nightly package, please use the following command:

pip install -i https://pypi-nightly.tensorflow.org/simple tensorflow-data-validation

This will install the nightly packages for the major dependencies of TFDV such as TFX Basic Shared Libraries (TFX-BSL) and TensorFlow Metadata (TFMD).

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

First, install docker and docker-compose by following their respective installation instructions (docker; docker-compose).

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {35, 36, 37, 38}.

A wheel will be produced under dist/.

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

The TFDV wheel is Python-version dependent -- to build the pip package that works for a specific Python version, use that Python binary to run:

python setup.py bdist_wheel

You can find the generated .whl file in the dist subdirectory.

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.14.6 (Mojave) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Notable Dependencies

TensorFlow is required.

Apache Beam is required; it is how TFDV supports efficient distributed computation. By default, Apache Beam runs in local mode but can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.
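
For example, generate_statistics_from_tfrecord accepts Beam pipeline options, so the same call can run locally with the direct runner or be handed to a distributed runner such as Dataflow. The sketch below uses a placeholder input path; a fuller Dataflow setup (project, staging and temp locations) appears in one of the comments further down this page.

import tensorflow_data_validation as tfdv
from apache_beam.options.pipeline_options import PipelineOptions

# Local execution with Beam's DirectRunner; switch to '--runner=DataflowRunner'
# (plus the usual GCP project/staging/temp options) for distributed execution.
options = PipelineOptions(['--runner=DirectRunner'])
stats = tfdv.generate_statistics_from_tfrecord(
    data_location='gs://my-bucket/train.tfrecord',
    pipeline_options=options)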

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized numpy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

tensorflow-data-validation | apache-beam[gcp] | pyarrow | tensorflow | tensorflow-metadata | tensorflow-transform | tfx-bsl
-- | -- | -- | -- | -- | -- | --
GitHub master | 2.27.0 | 2.0.0 | nightly (1.x/2.x) | 0.27.0 | n/a | 0.27.0
0.27.0 | 2.27.0 | 2.0.0 | 1.15 / 2.4 | 0.27.0 | n/a | 0.27.0
0.26.0 | 2.25.0 | 0.17.0 | 1.15 / 2.3 | 0.26.0 | 0.26.0 | 0.26.0
0.25.0 | 2.25.0 | 0.17.0 | 1.15 / 2.3 | 0.25.0 | 0.25.0 | 0.25.0
0.24.1 | 2.24.0 | 0.17.0 | 1.15 / 2.3 | 0.24.0 | 0.24.1 | 0.24.1
0.24.0 | 2.23.0 | 0.17.0 | 1.15 / 2.3 | 0.24.0 | 0.24.0 | 0.24.0
0.23.1 | 2.24.0 | 0.17.0 | 1.15 / 2.3 | 0.23.0 | 0.23.0 | 0.23.0
0.23.0 | 2.23.0 | 0.17.0 | 1.15 / 2.3 | 0.23.0 | 0.23.0 | 0.23.0
0.22.2 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.1
0.22.1 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.1
0.22.0 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.0
0.21.5 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.1 | 0.21.3
0.21.4 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.1 | 0.21.3
0.21.2 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0
0.21.1 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0
0.21.0 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0
0.15.0 | 2.16.0 | 0.14.0 | 1.15 / 2.0 | 0.15.0 | 0.15.0 | 0.15.0
0.14.1 | 2.14.0 | 0.14.0 | 1.14 | 0.14.0 | 0.14.0 | n/a
0.14.0 | 2.14.0 | 0.14.0 | 1.14 | 0.14.0 | 0.14.0 | n/a
0.13.1 | 2.11.0 | n/a | 1.13 | 0.12.1 | 0.13.0 | n/a
0.13.0 | 2.11.0 | n/a | 1.13 | 0.12.1 | 0.13.0 | n/a
0.12.0 | 2.10.0 | n/a | 1.12 | 0.12.1 | 0.12.0 | n/a
0.11.0 | 2.8.0 | n/a | 1.11 | 0.9.0 | 0.11.0 | n/a
0.9.0 | 2.6.0 | n/a | 1.9 | n/a | n/a | n/a

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.

Comments
  • Slow performance when computing stats for moderately large data set

    Hi,

    I like this project a lot, and thanks for releasing it! I see the potential to save a lot of time when I first receive new datasets. However, I have issues with performance.

    OS: the notebook is running in a Docker container based on https://hub.docker.com/r/tensorflow/tensorflow/.

    Hardware:

    GPUs: 16X NVIDIA® Tesla V100
    GPU Memory: 512 GB total
    CPU: Dual Intel Xeon Platinum 8168, 2.7 GHz, 24 cores
    System Memory: 1.5 TB
    

    It takes me >8 hours (30900 s) to compute the statistics for a dataset of ~100 files, with file sizes from 0.5 MB to 300 MB and a median of 70 MB. It's true that the Docker container introduces some overhead, but given the specs of my hardware, I think it's too much. Any tips on how to speed up computations, without changing hardware (i.e., no cloud)? For example, if there were an option to compute statistics in a dataframe, rather than in a protocol buffer, one could use Modin to speed up pandas computations with minimal changes to code.

    PS: if I use a GPU container instead, with

    $ docker run --runtime=nvidia -it -p 8888:8888 tensorflow/tensorflow:latest-gpu

    should I see a speedup?

    stat:awaiting response type:bug 
    opened by AndreaPi 12
  • Statistics visualization doesn't work in Firefox

    tfdv.visualize_statistics(train_stats) displays nothing except a line. When I print train_stats, it does contain the stats I wanted, so I think the steps before tfdv.visualize_statistics(train_stats) are all normal. My browser is Firefox. Where is the problem?

    stat:awaiting response type:support 
    opened by advancera 12
  • TFDV does not catch out-of-domain values for categorical ints

    The domain of a categorical int feature is included in my schema as a string_domain (generated by using feature.int_domain.is_categorical = True).

    However, when I try to run tfdv.validate_instance() on an example with an out-of-domain value for a categorical int, TFDV doesn't generate any anomalies.

    Here's a Colab to reproduce.

    stat:awaiting response type:bug 
    opened by kennysong 11
  • Python 3 Support

    Currently, TFDV requires Python 2.7 due to dependency on Apache Beam, which is not compatible with Python 3 yet.

    But Python 3 support should be available very soon, as Apache Beam is almost Python 3 ready.

    Announcement 
    opened by paulgc 11
  • pip install tensorflow-data-validation fails on OS X El Capitan (10.11.6)

    pip install tensorflow-data-validation
    Collecting tensorflow-data-validation
      Could not find a version that satisfies the requirement tensorflow-data-validation (from versions: )
    No matching distribution found for tensorflow-data-validation

    type:build/install 
    opened by wjarek2 10
  • [DataflowRuntimeException] ImportError: No module named tfdv.statistics.stats_impl

    Context

    When running tfdv.generate_statistics_from_tfrecord on Dataflow, the job gets submitted successfully to the cluster but I get a: ImportError: No module named tensorflow_data_validation.statistics.stats_impl during the job unpickling phase in the Dataflow worker

    Error trace

    ---------------------------------------------------------------------------
    DataflowRuntimeException                  Traceback (most recent call last)
    <ipython-input-23-8f1147effd88> in <module>()
         16 # for more options about stats, run `?tfdv.generate_statistics_from_tfrecord`
         17 tfdv.generate_statistics_from_tfrecord(TFRECORDS_PATH, 
    ---> 18                                        pipeline_options=pipeline_options)
    
    /Users/romain/dev/venv/lib/python2.7/site-packages/tensorflow_data_validation/utils/stats_gen_lib.pyc in generate_statistics_from_tfrecord(data_location, output_path, stats_options, pipeline_options)
         86             shard_name_template='',
         87             coder=beam.coders.ProtoCoder(
    ---> 88                 statistics_pb2.DatasetFeatureStatisticsList)))
         89   return load_statistics(output_path)
         90 
    
    /Users/romain/dev/venv/lib/python2.7/site-packages/apache_beam/pipeline.pyc in __exit__(self, exc_type, exc_val, exc_tb)
        421   def __exit__(self, exc_type, exc_val, exc_tb):
        422     if not exc_type:
    --> 423       self.run().wait_until_finish()
        424 
        425   def visit(self, visitor):
    
    /Users/romain/dev/venv/lib/python2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.pyc in wait_until_finish(self, duration)
       1164         raise DataflowRuntimeException(
       1165             'Dataflow pipeline failed. State: %s, Error:\n%s' %
    -> 1166             (self.state, getattr(self._runner, 'last_error_msg', None)), self)
       1167     return self.state
       1168 
    
    DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 642, in do_work
        work_executor.execute()
      File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 130, in execute
        test_shuffle_sink=self._test_shuffle_sink)
      File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 104, in create_operation
        is_streaming=False)
      File "apache_beam/runners/worker/operations.py", line 636, in apache_beam.runners.worker.operations.create_operation
        op = create_pgbk_op(name_context, spec, counter_factory, state_sampler)
      File "apache_beam/runners/worker/operations.py", line 482, in apache_beam.runners.worker.operations.create_pgbk_op
        return PGBKCVOperation(step_name, spec, counter_factory, state_sampler)
      File "apache_beam/runners/worker/operations.py", line 538, in apache_beam.runners.worker.operations.PGBKCVOperation.__init__
        fn, args, kwargs = pickler.loads(self.spec.combine_fn)[:3]
      File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 246, in loads
        return dill.loads(s)
      File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 316, in loads
        return load(file, ignore)
      File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 304, in load
        obj = pik.load()
      File "/usr/lib/python2.7/pickle.py", line 864, in load
        dispatch[key](self)
      File "/usr/lib/python2.7/pickle.py", line 1096, in load_global
        klass = self.find_class(module, name)
      File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 465, in find_class
        return StockUnpickler.find_class(self, module, name)
      File "/usr/lib/python2.7/pickle.py", line 1130, in find_class
        __import__(module)
    ImportError: No module named tensorflow_data_validation.statistics.stats_impl
    

    What code did I run?

    !pip install -U tensorflow \
                    tensorflow-data-validation \
                    apache-beam[gcp]
    
    import tensorflow_data_validation as tfdv
    from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, SetupOptions
    
    # Create and set your PipelineOptions.
    pipeline_options = PipelineOptions()
    
    # For Cloud execution, set the Cloud Platform project, job_name,
    # staging location, temp_location and specify DataflowRunner.
    google_cloud_options = pipeline_options.view_as(GoogleCloudOptions)
    google_cloud_options.project = PROJECT_ID
    google_cloud_options.job_name = JOB_NAME
    google_cloud_options.staging_location = GCS_STAGING_LOCATION
    google_cloud_options.temp_location = GCS_TMP_LOCATION
    pipeline_options.view_as(StandardOptions).runner = 'DataflowRunner'
        
    tfdv.generate_statistics_from_tfrecord(TFRECORDS_PATH, 
                                           pipeline_options=pipeline_options)
    

    Pip trace

    Requirement already up-to-date: tensorflow in /Users/romain/dev/venv/lib/python2.7/site-packages (1.12.0)
    Requirement already up-to-date: tensorflow-data-validation in /Users/romain/dev/venv/lib/python2.7/site-packages (0.11.0)
    Requirement already up-to-date: apache-beam[gcp] in /Users/romain/dev/venv/lib/python2.7/site-packages (2.8.0)
    Requirement already satisfied, skipping upgrade: enum34>=1.1.6 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.1.6)
    Requirement already satisfied, skipping upgrade: keras-preprocessing>=1.0.5 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.0.5)
    Requirement already satisfied, skipping upgrade: wheel in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (0.31.1)
    Requirement already satisfied, skipping upgrade: astor>=0.6.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (0.7.1)
    Requirement already satisfied, skipping upgrade: backports.weakref>=1.0rc1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.0.post1)
    Requirement already satisfied, skipping upgrade: mock>=2.0.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (2.0.0)
    Requirement already satisfied, skipping upgrade: tensorboard<1.13.0,>=1.12.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.12.0)
    Requirement already satisfied, skipping upgrade: termcolor>=1.1.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.1.0)
    Requirement already satisfied, skipping upgrade: protobuf>=3.6.1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (3.6.1)
    Requirement already satisfied, skipping upgrade: gast>=0.2.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (0.2.0)
    Requirement already satisfied, skipping upgrade: absl-py>=0.1.6 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (0.3.0)
    Requirement already satisfied, skipping upgrade: grpcio>=1.8.6 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.13.0)
    Requirement already satisfied, skipping upgrade: six>=1.10.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.10.0)
    Requirement already satisfied, skipping upgrade: keras-applications>=1.0.6 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.0.6)
    Requirement already satisfied, skipping upgrade: numpy>=1.13.3 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.14.0)
    Requirement already satisfied, skipping upgrade: IPython<6,>=5.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow-data-validation) (5.7.0)
    Requirement already satisfied, skipping upgrade: tensorflow-metadata<0.10,>=0.9 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow-data-validation) (0.9.0)
    Requirement already satisfied, skipping upgrade: tensorflow-transform<0.12,>=0.11 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow-data-validation) (0.11.0)
    Requirement already satisfied, skipping upgrade: pandas<1,>=0.18 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow-data-validation) (0.22.0)
    Requirement already satisfied, skipping upgrade: oauth2client<5,>=2.0.1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (4.1.3)
    Requirement already satisfied, skipping upgrade: dill<=0.2.8.2,>=0.2.6 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.2.8.2)
    Requirement already satisfied, skipping upgrade: pydot<1.3,>=1.2.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (1.2.4)
    Requirement already satisfied, skipping upgrade: pyyaml<4.0.0,>=3.12 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (3.12)
    Requirement already satisfied, skipping upgrade: pyvcf<0.7.0,>=0.6.8 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.6.8)
    Requirement already satisfied, skipping upgrade: typing<3.7.0,>=3.6.0; python_version < "3.5.0" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (3.6.4)
    Requirement already satisfied, skipping upgrade: avro<2.0.0,>=1.8.1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (1.8.2)
    Requirement already satisfied, skipping upgrade: future<1.0.0,>=0.16.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.16.0)
    Requirement already satisfied, skipping upgrade: fastavro<0.22,>=0.21.4 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.21.13)
    Requirement already satisfied, skipping upgrade: crcmod<2.0,>=1.7 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (1.7)
    Requirement already satisfied, skipping upgrade: httplib2<=0.11.3,>=0.8 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.11.3)
    Requirement already satisfied, skipping upgrade: futures<4.0.0,>=3.1.1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (3.2.0)
    Requirement already satisfied, skipping upgrade: hdfs<3.0.0,>=2.1.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (2.1.0)
    Requirement already satisfied, skipping upgrade: pytz<=2018.4,>=2018.3 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (2018.4)
    Requirement already satisfied, skipping upgrade: google-apitools<=0.5.20,>=0.5.18; extra == "gcp" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.5.20)
    Requirement already satisfied, skipping upgrade: proto-google-cloud-pubsub-v1==0.15.4; extra == "gcp" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.15.4)
    Requirement already satisfied, skipping upgrade: googledatastore==7.0.1; python_version < "3.0" and extra == "gcp" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (7.0.1)
    Requirement already satisfied, skipping upgrade: google-cloud-bigquery==0.25.0; extra == "gcp" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.25.0)
    Requirement already satisfied, skipping upgrade: google-cloud-pubsub==0.26.0; extra == "gcp" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.26.0)
    Requirement already satisfied, skipping upgrade: proto-google-cloud-datastore-v1<=0.90.4,>=0.90.0; extra == "gcp" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.90.4)
    Requirement already satisfied, skipping upgrade: funcsigs>=1; python_version < "3.3" in /Users/romain/dev/venv/lib/python2.7/site-packages (from mock>=2.0.0->tensorflow) (1.0.2)
    Requirement already satisfied, skipping upgrade: pbr>=0.11 in /Users/romain/dev/venv/lib/python2.7/site-packages (from mock>=2.0.0->tensorflow) (1.10.0)
    Requirement already satisfied, skipping upgrade: werkzeug>=0.11.10 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorboard<1.13.0,>=1.12.0->tensorflow) (0.14.1)
    Requirement already satisfied, skipping upgrade: markdown>=2.6.8 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorboard<1.13.0,>=1.12.0->tensorflow) (2.6.11)
    Requirement already satisfied, skipping upgrade: setuptools in /Users/romain/dev/venv/lib/python2.7/site-packages (from protobuf>=3.6.1->tensorflow) (39.1.0)
    Requirement already satisfied, skipping upgrade: h5py in /Users/romain/dev/venv/lib/python2.7/site-packages (from keras-applications>=1.0.6->tensorflow) (2.8.0)
    Requirement already satisfied, skipping upgrade: simplegeneric>0.8 in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (0.8.1)
    Requirement already satisfied, skipping upgrade: pygments in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (2.2.0)
    Requirement already satisfied, skipping upgrade: backports.shutil-get-terminal-size; python_version == "2.7" in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (1.0.0)
    Requirement already satisfied, skipping upgrade: pexpect; sys_platform != "win32" in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (4.6.0)
    Requirement already satisfied, skipping upgrade: prompt-toolkit<2.0.0,>=1.0.4 in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (1.0.15)
    Requirement already satisfied, skipping upgrade: decorator in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (4.3.0)
    Requirement already satisfied, skipping upgrade: pickleshare in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (0.7.4)
    Requirement already satisfied, skipping upgrade: appnope; sys_platform == "darwin" in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (0.1.0)
    Requirement already satisfied, skipping upgrade: traitlets>=4.2 in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (4.3.2)
    Requirement already satisfied, skipping upgrade: pathlib2; python_version == "2.7" or python_version == "3.3" in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (2.3.2)
    Requirement already satisfied, skipping upgrade: googleapis-common-protos in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow-metadata<0.10,>=0.9->tensorflow-data-validation) (1.5.3)
    Requirement already satisfied, skipping upgrade: python-dateutil in /Users/romain/dev/venv/lib/python2.7/site-packages (from pandas<1,>=0.18->tensorflow-data-validation) (2.7.3)
    Requirement already satisfied, skipping upgrade: rsa>=3.1.4 in /Users/romain/dev/venv/lib/python2.7/site-packages (from oauth2client<5,>=2.0.1->apache-beam[gcp]) (3.4.2)
    Requirement already satisfied, skipping upgrade: pyasn1>=0.1.7 in /Users/romain/dev/venv/lib/python2.7/site-packages (from oauth2client<5,>=2.0.1->apache-beam[gcp]) (0.1.9)
    Requirement already satisfied, skipping upgrade: pyasn1-modules>=0.0.5 in /Users/romain/dev/venv/lib/python2.7/site-packages (from oauth2client<5,>=2.0.1->apache-beam[gcp]) (0.0.8)
    Requirement already satisfied, skipping upgrade: pyparsing>=2.1.4 in /Users/romain/dev/venv/lib/python2.7/site-packages (from pydot<1.3,>=1.2.0->apache-beam[gcp]) (2.1.10)
    Requirement already satisfied, skipping upgrade: docopt in /Users/romain/dev/venv/lib/python2.7/site-packages (from hdfs<3.0.0,>=2.1.0->apache-beam[gcp]) (0.6.2)
    Requirement already satisfied, skipping upgrade: requests>=2.7.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from hdfs<3.0.0,>=2.1.0->apache-beam[gcp]) (2.11.1)
    Requirement already satisfied, skipping upgrade: fasteners>=0.14 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-apitools<=0.5.20,>=0.5.18; extra == "gcp"->apache-beam[gcp]) (0.14.1)
    Requirement already satisfied, skipping upgrade: google-cloud-core<0.26dev,>=0.25.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-cloud-bigquery==0.25.0; extra == "gcp"->apache-beam[gcp]) (0.25.0)
    Requirement already satisfied, skipping upgrade: gapic-google-cloud-pubsub-v1<0.16dev,>=0.15.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-cloud-pubsub==0.26.0; extra == "gcp"->apache-beam[gcp]) (0.15.4)
    Requirement already satisfied, skipping upgrade: ptyprocess>=0.5 in /Users/romain/dev/venv/lib/python2.7/site-packages (from pexpect; sys_platform != "win32"->IPython<6,>=5.0->tensorflow-data-validation) (0.5.2)
    Requirement already satisfied, skipping upgrade: wcwidth in /Users/romain/dev/venv/lib/python2.7/site-packages (from prompt-toolkit<2.0.0,>=1.0.4->IPython<6,>=5.0->tensorflow-data-validation) (0.1.7)
    Requirement already satisfied, skipping upgrade: ipython-genutils in /Users/romain/dev/venv/lib/python2.7/site-packages (from traitlets>=4.2->IPython<6,>=5.0->tensorflow-data-validation) (0.2.0)
    Requirement already satisfied, skipping upgrade: scandir; python_version < "3.5" in /Users/romain/dev/venv/lib/python2.7/site-packages (from pathlib2; python_version == "2.7" or python_version == "3.3"->IPython<6,>=5.0->tensorflow-data-validation) (1.7)
    Requirement already satisfied, skipping upgrade: monotonic>=0.1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from fasteners>=0.14->google-apitools<=0.5.20,>=0.5.18; extra == "gcp"->apache-beam[gcp]) (1.5)
    Requirement already satisfied, skipping upgrade: google-auth-httplib2 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-cloud-core<0.26dev,>=0.25.0->google-cloud-bigquery==0.25.0; extra == "gcp"->apache-beam[gcp]) (0.0.3)
    Requirement already satisfied, skipping upgrade: google-auth<2.0.0dev,>=0.4.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-cloud-core<0.26dev,>=0.25.0->google-cloud-bigquery==0.25.0; extra == "gcp"->apache-beam[gcp]) (1.1.1)
    Requirement already satisfied, skipping upgrade: grpc-google-iam-v1<0.12dev,>=0.11.1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from gapic-google-cloud-pubsub-v1<0.16dev,>=0.15.0->google-cloud-pubsub==0.26.0; extra == "gcp"->apache-beam[gcp]) (0.11.4)
    Requirement already satisfied, skipping upgrade: google-gax<0.16dev,>=0.15.7 in /Users/romain/dev/venv/lib/python2.7/site-packages (from gapic-google-cloud-pubsub-v1<0.16dev,>=0.15.0->google-cloud-pubsub==0.26.0; extra == "gcp"->apache-beam[gcp]) (0.15.16)
    Requirement already satisfied, skipping upgrade: cachetools>=2.0.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-auth<2.0.0dev,>=0.4.0->google-cloud-core<0.26dev,>=0.25.0->google-cloud-bigquery==0.25.0; extra == "gcp"->apache-beam[gcp]) (2.0.1)
    Requirement already satisfied, skipping upgrade: ply==3.8 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-gax<0.16dev,>=0.15.7->gapic-google-cloud-pubsub-v1<0.16dev,>=0.15.0->google-cloud-pubsub==0.26.0; extra == "gcp"->apache-beam[gcp]) (3.8)
    
    stat:awaiting response type:support 
    opened by yonromai 9
  • infer_schema(..., infer_feature_shape=True) should parse VarLenFeature to SparseFeature

    If I understand things correctly, the desired behavior of infer_schema(..., infer_feature_shape=True) would be to infer feature shapes for FixedLenFeatures while parsing VarLenFeatures to SparseFeatures in the schema. However, my VarLenFeature currently gets parsed as a Feature with dim=0. Am I understanding this properly / am I doing something wrong?

    stat:awaiting response type:support 
    opened by schmidt-jake 8
  • TFDV==0.14.0 Wheel Fails Integrity Check

    It appears TFDV==0.14.0 has a bad hash value and is failing the integrity check done by the resolver, pex:

    Exception message: Bad hash for file 'tensorflow_data_validation/pywrap/_pywrap_tensorflow_data_validation.so'.

    It has been reported (https://github.com/pypa/pip/issues/4705) that pip does not check the file hashes for integrity, but other resolvers such as Pex do, which leads to failures. Thus, it appears there is some mutation after the wheel is built?

    Minimal repro:

    pip install pex
    pex tensorflow-data-validation==0.14.0 -o tfdv.pex
    
    stat:awaiting response type:support 
    opened by jhamet93 8
  • tfdv manylinux pypi packages are built/linked on too new of a platform for general compatibility

    When we attempt to use the current manylinux bdist from PyPI (tensorflow_data_validation-0.13.1-cp36-cp36m-manylinux1_x86_64.whl) on a CentOS 7 machine, we see the following ImportError:

    ---------------------------------------------------------------------------
    ImportError                               Traceback (most recent call last)
    <ipython-input-2-1ad020593972> in <module>()
    1 import pkg_resources, importlib
    2 importlib.reload(pkg_resources)
    ----> 3 import tensorflow_data_validation as tfdv
     
    /var/lib/mesos/slaves/8bfbe6e2-3bf7-4b49-90c3-15e4be759186-S218/frameworks/201104070004-0000002563-0000/executors/thermos-kwilson-devel-pycx-notebook-0-e9d46056-d500-4c44-9970-ebab4e39f006/runs/18847da2-8502-4393-a01d-c5bfc264f405/sandbox/.pex/install/tensorflow_data_validation-0.13.1-cp36-cp36m-manylinux1_x86_64.whl.840798a46d57eb5c1ed0f639d3f47149480121e9/tensorflow_data_validation-0.13.1-cp36-cp36m-manylinux1_x86_64.whl/tensorflow_data_validation/__init__.py in <module>()
    19
    20 # Import validation API.
    ---> 21 from tensorflow_data_validation.api.validation_api import infer_schema
    22 from tensorflow_data_validation.api.validation_api import validate_instance
    23 from tensorflow_data_validation.api.validation_api import validate_statistics
     
    /var/lib/mesos/slaves/8bfbe6e2-3bf7-4b49-90c3-15e4be759186-S218/frameworks/201104070004-0000002563-0000/executors/thermos-kwilson-devel-pycx-notebook-0-e9d46056-d500-4c44-9970-ebab4e39f006/runs/18847da2-8502-4393-a01d-c5bfc264f405/sandbox/.pex/install/tensorflow_data_validation-0.13.1-cp36-cp36m-manylinux1_x86_64.whl.840798a46d57eb5c1ed0f639d3f47149480121e9/tensorflow_data_validation-0.13.1-cp36-cp36m-manylinux1_x86_64.whl/tensorflow_data_validation/api/validation_api.py in <module>()
    26 import tensorflow as tf
    27 from tensorflow_data_validation import types
    ---> 28 from tensorflow_data_validation.pywrap import pywrap_tensorflow_data_validation
    29 from tensorflow_data_validation.statistics import stats_impl
    30 from tensorflow_data_validation.statistics import stats_options
     
    /var/lib/mesos/slaves/8bfbe6e2-3bf7-4b49-90c3-15e4be759186-S218/frameworks/201104070004-0000002563-0000/executors/thermos-kwilson-devel-pycx-notebook-0-e9d46056-d500-4c44-9970-ebab4e39f006/runs/18847da2-8502-4393-a01d-c5bfc264f405/sandbox/.pex/install/tensorflow_data_validation-0.13.1-cp36-cp36m-manylinux1_x86_64.whl.840798a46d57eb5c1ed0f639d3f47149480121e9/tensorflow_data_validation-0.13.1-cp36-cp36m-manylinux1_x86_64.whl/tensorflow_data_validation/pywrap/pywrap_tensorflow_data_validation.py in <module>()
    26                 fp.close()
    27             return _mod
    ---> 28     _pywrap_tensorflow_data_validation = swig_import_helper()
    29     del swig_import_helper
    30 else:
     
    /var/lib/mesos/slaves/8bfbe6e2-3bf7-4b49-90c3-15e4be759186-S218/frameworks/201104070004-0000002563-0000/executors/thermos-kwilson-devel-pycx-notebook-0-e9d46056-d500-4c44-9970-ebab4e39f006/runs/18847da2-8502-4393-a01d-c5bfc264f405/sandbox/.pex/install/tensorflow_data_validation-0.13.1-cp36-cp36m-manylinux1_x86_64.whl.840798a46d57eb5c1ed0f639d3f47149480121e9/tensorflow_data_validation-0.13.1-cp36-cp36m-manylinux1_x86_64.whl/tensorflow_data_validation/pywrap/pywrap_tensorflow_data_validation.py in swig_import_helper()
    22         if fp is not None:
    23             try:
    ---> 24                 _mod = imp.load_module('_pywrap_tensorflow_data_validation', fp, pathname, description)
    25             finally:
    26                 fp.close()
     
    /opt/ee/python/3.6/lib/python3.6/imp.py in load_module(name, file, filename, details)
    241                 return load_dynamic(name, filename, opened_file)
    242         else:
    --> 243             return load_dynamic(name, filename, file)
    244     elif type_ == PKG_DIRECTORY:
    245         return load_package(name, filename)
     
    /opt/ee/python/3.6/lib/python3.6/imp.py in load_dynamic(name, path, file)
    341         spec = importlib.machinery.ModuleSpec(
    342             name=name, loader=loader, origin=path)
    --> 343         return _load(spec)
    344
    345 else:
     
    ImportError: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /var/lib/mesos/slaves/8bfbe6e2-3bf7-4b49-90c3-15e4be759186-S218/frameworks/201104070004-0000002563-0000/executors/thermos-kwilson-devel-pycx-notebook-0-e9d46056-d500-4c44-9970-ebab4e39f006/runs/18847da2-8502-4393-a01d-c5bfc264f405/sandbox/.pex/install/tensorflow_data_validation-0.13.1-cp36-cp36m-manylinux1_x86_64.whl.840798a46d57eb5c1ed0f639d3f47149480121e9/tensorflow_data_validation-0.13.1-cp36-cp36m-manylinux1_x86_64.whl/tensorflow_data_validation/pywrap/_pywrap_tensorflow_data_validation.so)
    

    ldd reveals a linking issue on the inner .so:

    $ ldd .pex/install/tensorflow_data_validation-0.13.1-cp36-cp36m-manylinux1_x86_64.whl.840798a46d57eb5c1ed0f639d3f47149480121e9/tensorflow_data_validation-0.13.1-cp36-cp36m-manylinux1_x86_64.whl/tensorflow_data_validation/pywrap/_pywrap_tensorflow_data_validation.so
    .pex/install/tensorflow_data_validation-0.13.1-cp36-cp36m-manylinux1_x86_64.whl.840798a46d57eb5c1ed0f639d3f47149480121e9/tensorflow_data_validation-0.13.1-cp36-cp36m-manylinux1_x86_64.whl/tensorflow_data_validation/pywrap/_pywrap_tensorflow_data_validation.so: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by .pex/install/tensorflow_data_validation-0.13.1-cp36-cp36m-manylinux1_x86_64.whl.840798a46d57eb5c1ed0f639d3f47149480121e9/tensorflow_data_validation-0.13.1-cp36-cp36m-manylinux1_x86_64.whl/tensorflow_data_validation/pywrap/_pywrap_tensorflow_data_validation.so)
    .pex/install/tensorflow_data_validation-0.13.1-cp36-cp36m-manylinux1_x86_64.whl.840798a46d57eb5c1ed0f639d3f47149480121e9/tensorflow_data_validation-0.13.1-cp36-cp36m-manylinux1_x86_64.whl/tensorflow_data_validation/pywrap/_pywrap_tensorflow_data_validation.so: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by .pex/install/tensorflow_data_validation-0.13.1-cp36-cp36m-manylinux1_x86_64.whl.840798a46d57eb5c1ed0f639d3f47149480121e9/tensorflow_data_validation-0.13.1-cp36-cp36m-manylinux1_x86_64.whl/tensorflow_data_validation/pywrap/_pywrap_tensorflow_data_validation.so)
            linux-vdso.so.1 =>  (0x00007ffff1ebc000)
            libdl.so.2 => /lib64/libdl.so.2 (0x00007f06a4570000)
            libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f06a4354000)
            libm.so.6 => /lib64/libm.so.6 (0x00007f06a4052000)
            libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f06a3d4b000)
            libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f06a3b35000)
            libc.so.6 => /lib64/libc.so.6 (0x00007f06a3768000)
            /lib64/ld-linux-x86-64.so.2 (0x00007f06a4e42000)
    

    which points to being built/linked on a newer system than is compatible with this configuration:

    sh-4.2$ cat /etc/redhat-release
    CentOS release 7.6.1810 (Core)
    sh-4.2$ rpm -q glibc libstdc++
    glibc-2.17-260.el7_6.3.x86_64
    libstdc++-4.8.5-36.el7.x86_64
    

    Thus I'm fairly sure these binaries aren't actually manylinux (or even broadly CentOS 7) compatible.

    stat:awaiting tensorflower type:build/install 
    opened by kwlzn 8
  • Issue with importing TFDV 0.21.2 on MacOS 10.12.6(Sierra)

    I am running into the following error, both in PyCharm and in a virtualenv, when I try to import tensorflow_data_validation.

    Error importing tfx_bsl_extension.arrow.array_util. Some tfx_bsl functionalities are not available
    Error importing tfx_bsl_extension.arrow.table_util. Some tfx_bsl functionalities are not available
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/ky/tfdvtest/lib/python3.7/site-packages/tensorflow_data_validation/__init__.py", line 27, in <module>
        from tensorflow_data_validation.api.validation_api import infer_schema
      File "/Users/ky/tfdvtest/lib/python3.7/site-packages/tensorflow_data_validation/api/validation_api.py", line 31, in <module>
        from tensorflow_data_validation.pywrap import pywrap_tensorflow_data_validation
      File "/Users/ky/tfdvtest/lib/python3.7/site-packages/tensorflow_data_validation/pywrap/pywrap_tensorflow_data_validation.py", line 28, in <module>
        _pywrap_tensorflow_data_validation = swig_import_helper()
      File "/Users/ky/tfdvtest/lib/python3.7/site-packages/tensorflow_data_validation/pywrap/pywrap_tensorflow_data_validation.py", line 24, in swig_import_helper
        _mod = imp.load_module('_pywrap_tensorflow_data_validation', fp, pathname, description)
      File "/Users/ky/tfdvtest/lib/python3.7/imp.py", line 242, in load_module
        return load_dynamic(name, filename, file)
      File "/Users/ky/tfdvtest/lib/python3.7/imp.py", line 342, in load_dynamic
        return _load(spec)
    ImportError: dlopen(/Users/ky/tfdvtest/lib/python3.7/site-packages/tensorflow_data_validation/pywrap/_pywrap_tensorflow_data_validation.so, 2): Symbol not found: ____chkstk_darwin
      Referenced from: /Users/ky/tfdvtest/lib/python3.7/site-packages/tensorflow_data_validation/pywrap/_pywrap_tensorflow_data_validation.so
      Expected in: /usr/lib/libSystem.B.dylib in /Users/ky/tfdvtest/lib/python3.7/site-packages/tensorflow_data_validation/pywrap/_pywrap_tensorflow_data_validation.so

    I have been using all the versions of tensorflow(2.1), apache-beam(2.17), and pyarrow(0.15.0) that are recommended as compatible. I am also using tfx-bsl(0.21.2). Has anyone run into this issue before?

    stat:awaiting tensorflower type:build/install 
    opened by KevsProjects 7
  • List datatype (multiple category feature type) not supported

    The tfdv.generate_statistics_from_dataframe(...) will fail with the following error:

    ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
    

    when one of the columns contains a list (e.g., a list of strings).

    type:support 
    opened by wsuchy 7
  • The latest numpy release 1.24.0 broke TFDV

    TFDV allows 'numpy>=1.16,<2'. However, the latest numpy version 1.24.0 breaks TFDV. I encountered the following error in a TFDV-related component via TFX.

    ...
    venv/lib/python3.8/site-packages/tfx/components/__init__.py:22: in <module>
        from tfx.components.example_validator.component import ExampleValidator
    venv/lib/python3.8/site-packages/tfx/components/example_validator/component.py:20: in <module>
        from tfx.components.example_validator import executor
    venv/lib/python3.8/site-packages/tfx/components/example_validator/executor.py:20: in <module>
        import tensorflow_data_validation as tfdv
    venv/lib/python3.8/site-packages/tensorflow_data_validation/__init__.py:18: in <module>
        from tensorflow_data_validation.api.stats_api import GenerateStatistics
    venv/lib/python3.8/site-packages/tensorflow_data_validation/api/stats_api.py:50: in <module>
        from tensorflow_data_validation.statistics import stats_impl
    venv/lib/python3.8/site-packages/tensorflow_data_validation/statistics/stats_impl.py:28: in <module>
        from tensorflow_data_validation.statistics.generators import image_stats_generator
    venv/lib/python3.8/site-packages/tensorflow_data_validation/statistics/generators/image_stats_generator.py:99: in <module>
        class TfImageDecoder(ImageDecoderInterface):
    venv/lib/python3.8/site-packages/tensorflow_data_validation/statistics/generators/image_stats_generator.py:146: in TfImageDecoder
        def get_formats(self, values: List[np.object]) -> np.ndarray:
    venv/lib/python3.8/site-packages/numpy/__init__.py:284: in __getattr__
        raise AttributeError("module {!r} has no attribute "
    E   AttributeError: module 'numpy' has no attribute 'object'
    

    Python version: 3.8.12
    TFX version: 1.6.2
    TFDV version: 1.6.0
    numpy version: 1.24.0

    stat:awaiting response type:feature 
    opened by daikeshi 3
  • Using tfdv to validate text based data

    Hi,

    I have been searching online for whether tfdv could be used to validate data that contains text, for instance a dataset with sentences that have to be mapped to labels. I could not find any really useful tutorials, as the ones that I could find only cover numerical data about the dataset, for instance heights, weights, etc.

    After looking around in the data-validation package I have found a couple of files that seem to be related to this. https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/statistics/generators/natural_language_stats_generator.py And https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/statistics/generators/natural_language_domain_inferring_stats_generator.py

    Furthermore on the Tensorflow website about the StatsOptions class I found the following: https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/StatsOptions

    Arguments | Description
    -- | --
    enable_semantic_domain_stats | If True statistics for semantic domains are generated (e.g: image, text domains).
    semantic_domain_stats_sample_rate | An optional sampling rate for semantic domain statistics. If specified, semantic domain statistics is computed over a sample.
    vocab_paths | An optional dictionary mapping vocab names to paths. Used in the schema when specifying a NaturalLanguageDomain. The paths can either be to GZIP-compressed TF record files that have a tfrecord.gz suffix or to text files.

    These arguments and files do indicate that tfdv can be used to analyze and validate data that would be used in NLP / Text classification type problems.

    However, it is unclear to me how one would go about using these features to validate text-based data. I have enabled the enable_semantic_domain_stats argument, and this does give information like sequence length, etc. However, how would one extend this and validate vocabularies for known/unknown word ratios, etc.?

    Any tips or thoughts are highly appreciated! Kind Regards, Caspar

    type:docs stat:awaiting tensorflower 
    opened by Capsar 0
  • Issue using `allowlist_features` and `denylist_features` in `visualize_statistics`

    Overview

    I'm having issues specifying the features to include/exclude when visualizing stats in TFDV. It seems like the allowlist_features and denylist_features require a tensorflow_data_validation.types.FeaturePath object, which took a bit to figure out how to construct. This doesn't seem that user friendly -- was it intended to allow a list of strings to be passed?

    Code to reproduce

    I can reproduce the problem in the public colab example. In the "Compute and Visualize Statistics" section of the above notebook, update the visualize_statistics call to be: tfdv.visualize_statistics(train_stats, denylist_features=['pickup_community_area']). The first feature shouldn't exist in the visualized example (if I'm calling this correctly).

    Workaround code

    To make this work, I have to manually construct a tensorflow_data_validation.types.FeaturePath object. Perhaps it would be better to do the filter comparison on each feature's path string?

    # Show string name of feature
    first_feat = train_stats.datasets[0].features[0]
    print(first_feat.path)
    
    # Construct necessary object to make `allowlist_feature` filter work
    from tensorflow_data_validation import types
    print(types.FeaturePath.from_proto(first_feat.path))
    
    # docs-infra: no-execute
    tfdv.visualize_statistics(train_stats, allowlist_features=[types.FeaturePath.from_proto(first_feat.path)])
    
    stat:awaiting tensorflower type:bug
    opened by wronk 0
  • Model Unit Testing feature

    Hi, I recently checked the TensorFlow Data Validation paper (https://mlsys.org/Conferences/2019/doc/2019/167.pdf). First of all, thanks for publishing the paper and open-sourcing this project.

    But I cannot find in this project any feature similar to the Model Unit Testing module mentioned in section 5. Are there any plans to open-source the model unit testing module, or is it already open-sourced?

    stat:awaiting tensorflower type:feature 
    opened by jeongukjae 0
Releases (v1.12.0)
  • v1.12.0(Dec 8, 2022)

    Major Features and Improvements

    • N/A

    Bug Fixes and Other Changes

    • TFDV is now tested against macOS 12.5 (Monterey).

    Known Issues

    • N/A

    Breaking Changes

    • Depends on tensorflow>=2.11,<3
    • Depends on tfx-bsl>=1.12.0,<1.13.0.
    • Depends on tensorflow-metadata>=1.12.0,<1.13.0.

    Deprecations

    • N/A
    Source code(tar.gz)
    Source code(zip)
  • v1.11.0(Nov 16, 2022)

    Major Features and Improvements

    • This is the last version that supports TensorFlow 1.15.x. TF 1.15.x support will be removed in the next version. Please check the TF2 migration guide to migrate to TF2.

    • Add a custom_validate_statistics function to the validation API, and support passing custom validations to validate_statistics. Note that custom validation is not supported on Windows.

    Bug Fixes and Other Changes

    • Fix bug in implementation of semantic_domain_stats_sample_rate.

    • Add beam metrics on string length

    • Determine whether to calculate string statistics based on the is_categorical field in the schema string domain.

    • Histogram counts should now be more accurate for distributions with few distinct values or frequent individual values.

    • Nested list length histogram counts are no longer based on the number of values one up in the nested list hierarchy.

    • Support using jensen-shannon divergence to detect drift and skew for string and categorical features.

    • get_drift_skew_dataframe now includes a threshold column.

    • Adds support for NormalizedAbsoluteDifference comparator.

    • Depends on tensorflow>=1.15.5,<2 or tensorflow>=2.10,<3

    • Depends on joblib>=1.2.0.

    Known Issues

    • N/A

    Breaking Changes

    • Histogram semantics are slightly changed, so that buckets include their upper bound instead of their lower bound. STANDARD histograms will no longer generate buckets that contain infinite and finite endpoints together.
    • Introduces StatsOptions.use_sketch_based_topk_uniques replacing experimental_use_sketch_based_topk_uniques. The latter option can still be written, but not read.

    Deprecations

    • N/A
    Source code(tar.gz)
    Source code(zip)
  • v1.10.0(Aug 29, 2022)

    Major Features and Improvements

    • N/A

    Bug Fixes and Other Changes

    • Skew pipeline supports counting pairs of feature values in base/test.
    • Depends on apache-beam[gcp]>=2.40,<3.
    • Depends on pyarrow>=6,<7.
    • Depends on tfx-bsl>=1.10.1,<1.11.0.
    • Depends on tensorflow-metadata>=1.10.0,<1.11.0.

    Known Issues

    • N/A

    Breaking Changes

    • N/A

    Deprecations

    • N/A
    Source code(tar.gz)
    Source code(zip)
  • v1.9.0(Jun 29, 2022)

    Major Features and Improvements

    • N/A

    Bug Fixes and Other Changes

    • Depends on tensorflow>=1.15.5,<2 or tensorflow>=2.9,<3
    • Depends on tfx-bsl>=1.9.0,<1.10.0.
    • Depends on tensorflow-metadata>=1.9.0,<1.10.0.

    Known Issues

    • N/A

    Breaking Changes

    • Some fields in feature skew results proto changed names to be more generic.

    Deprecations

    • N/A
    Source code(tar.gz)
    Source code(zip)
  • v1.8.0(May 16, 2022)

    Major Features and Improvements

    • From this version we will be releasing python 3.9 wheels.

    Bug Fixes and Other Changes

    • Adds get_statistics_html to the public API.
    • Fixes several incorrect type annotations.
    • Schema inference handles derived features.
    • StatsOptions.to_json now raises an error if it encounters unsupported options.
    • Depends on apache-beam[gcp]>=2.38,<3.
    • Depends on tensorflow>=1.15.5,!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,<3.
    • Depends on tensorflow-metadata>=1.8.0,<1.9.0.
    • Depends on tfx-bsl>=1.8.0,<1.9.0.

    Known Issues

    • N/A

    Breaking Changes

    • N/A

    Deprecations

    • N/A
    Source code(tar.gz)
    Source code(zip)
  • v1.7.0(Mar 2, 2022)

    Major Features and Improvements

    • Adds the DetectFeatureSkew PTransform to the public API, which can be used to detect feature skew between training and serving examples.
    • Uses sketch-based top-k/uniques in TFDV in-memory mode.

    Bug Fixes and Other Changes

    • Fixes a bug in load_statistics that would cause failure when reading binary protos.
    • Depends on pyfarmhash>=0.2,<0.4.
    • Depends on tensorflow>=1.15.5,!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,<3.
    • Depends on tensorflow-metadata>=1.7.0,<1.8.0.
    • Depends on tfx-bsl>=1.7.0,<1.8.0.
    • Depends on apache-beam[gcp]>=2.36,<3.
    • Updated the documentation for CombinerStatsGenerator to clarify that the first accumulator passed to merge_accumulators may be modified.
    • Added compression type detection when reading csv header.
    • Detection of invalid utf8 strings now works regardless of relative frequency.

    Known Issues

    • N/A

    Breaking Changes

    • N/A

    Deprecations

    • N/A
    Source code(tar.gz)
    Source code(zip)
  • v1.6.0(Jan 21, 2022)

    Major Features and Improvements

    • Introduces a convenience wrapper for handling indexed access to statistics protos.
    • String features are checked for UTF-8 validity, and the number of invalid strings is reported as invalid_utf8_count.

    Bug Fixes and Other Changes

    • Depends on numpy>=1.16,<2.
    • Depends on absl-py>=0.9,<2.0.0.
    • Depends on tensorflow>=1.15.5,!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,<3.
    • Depends on tensorflow-metadata>=1.6.0,<1.7.0.
    • Depends on tfx-bsl>=1.6.0,<1.7.0.
    • Depends on apache-beam[gcp]>=2.35,<3.

    Known Issues

    • N/A

    Breaking Changes

    • N/A

    Deprecations

    • N/A
    Source code(tar.gz)
    Source code(zip)
  • v1.5.0(Dec 1, 2021)

    Major Features and Improvements

    • N/A

    Bug Fixes and Other Changes

    • BasicStatsGenerator is now responsible for setting the global num_examples. This field will no longer be populated at the DatasetFeatureStatistics level if default generators are disabled.
    • Depends on apache-beam[gcp]>=2.34,<3.
    • Depends on tensorflow>=1.15.2,!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,<3.
    • Depends on tensorflow-metadata>=1.5.0,<1.6.0.
    • Depends on tfx-bsl>=1.5.0,<1.6.0.

    Known Issues

    • N/A

    Breaking Changes

    • N/A

    Deprecations

    • N/A
    Source code(tar.gz)
    Source code(zip)
  • v1.4.0(Oct 27, 2021)

    Major Features and Improvements

    • Float features can now be analyzed as categorical for the purposes of top-k and unique count using experimental sketch based generators.
    • Support SQL based slicing in TFDV. This would enable slicing (using SQL) in TFX OSS and Dataflow environments. SQL based slicing is currently not supported on Windows.

    Bug Fixes and Other Changes

    • Variance calculations have been updated to be more numerically stable for large datasets or large magnitude numeric data.
    • When running per-example validation against a schema, output of validate_examples_in_tfrecord and validate_examples_in_csv now optionally return samples of anomalous examples.
    • Changes to source code ensures that it can now work with pyarrow>=3.
    • Add load_anomalies_binary utility function.
    • Merge two accumulators at a time instead of batching.
    • BasicStatsGenerator is now responsible for setting FeatureNameStatistics.Type. Previously it was possible for a top-k generator and BasicStatsGenerator to set different types for categorical numeric features with physical type STRING.
    • Depends on pyarrow>=1,<6.
    • Depends on tensorflow-metadata>=1.4,<1.5.
    • Depends on tfx-bsl>=1.4,<1.5.

    Known Issues

    • N/A

    Breaking Changes

    • N/A

    Deprecations

    • Deprecated python 3.6 support.
    Source code(tar.gz)
    Source code(zip)
  • v1.3.0(Sep 20, 2021)

    Major Features and Improvements

    • N/A

    Bug Fixes and Other Changes

    • Fixed bug in JensenShannonDivergence calculation affecting comparisons of histograms that each contain a single value.
    • Fixed bug in dataset constraints validation that caused failures with very large numbers of examples.
    • Fixed a bug wherein slicing on a feature missing from some batches could produce slice keys derived from a different feature.
    • Depends on apache-beam[gcp]>=2.32,<3.
    • Depends on tensorflow>=1.15.2,!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,<3.
    • Depends on tfx-bsl>=1.3,<1.4.

    Known Issues

    • N/A

    Breaking Changes

    • N/A

    Deprecations

    • N/A
    Source code(tar.gz)
    Source code(zip)
  • v1.2.0(Jul 28, 2021)

    Major Features and Improvements

    • Added statistics/generators/mutual_information.py. It estimates AMI using a knn estimation. It differs from sklearn_mutual_information.py in that this supports multivalent features/labels (by encoding) and multivariate features/labels. The plan is to deprecate sklearn_mutual_information.py in the future.
    • Fixed NonStreamingCustomStatsGenerator to respect max_batches_per_partition.

    Bug Fixes and Other Changes

    • Depends on 'scikit-learn>=0.23,<0.24' ("mutual-information" extra only)
    • Depends on 'scipy>=1.5,<2' ("mutual-information" extra only)
    • Depends on apache-beam[gcp]>=2.31,<3.
    • Depends on tensorflow-metadata>=1.2,<1.3.
    • Depends on tfx-bsl>=1.2,<1.3.

    Known Issues

    • N/A

    Breaking Changes

    • N/A

    Deprecations

    • N/A
    Source code(tar.gz)
    Source code(zip)
    tensorflow_data_validation-1.2.0-cp36-cp36m-macosx_10_9_x86_64.whl(1.43 MB)
    tensorflow_data_validation-1.2.0-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.1.whl(1.32 MB)
    tensorflow_data_validation-1.2.0-cp36-cp36m-win_amd64.whl(1.11 MB)
    tensorflow_data_validation-1.2.0-cp37-cp37m-macosx_10_9_x86_64.whl(1.43 MB)
    tensorflow_data_validation-1.2.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl(1.32 MB)
    tensorflow_data_validation-1.2.0-cp37-cp37m-win_amd64.whl(1.11 MB)
    tensorflow_data_validation-1.2.0-cp38-cp38-macosx_10_9_x86_64.whl(1.43 MB)
    tensorflow_data_validation-1.2.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl(1.32 MB)
    tensorflow_data_validation-1.2.0-cp38-cp38-win_amd64.whl(1.11 MB)
  • v1.1.1(Jul 26, 2021)

    Major Features and Improvements

    • N/A

    Bug Fixes and Other Changes

    • Depends on google-cloud-bigquery>=1.28.0,<2.21.
    • Depends on tfx-bsl>=1.1.1,<1.2.
    • Fixed an error when using tfdv.experimental_get_feature_value_slicer with pandas==1.3.0.
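
    A minimal sketch of feature-value slicing with the experimental slicer (feature name and data path are hypothetical):

      import tensorflow_data_validation as tfdv

      # Slice statistics by every value of the (hypothetical) 'country' feature.
      slice_fn = tfdv.experimental_get_feature_value_slicer(features={'country': None})
      options = tfdv.StatsOptions(experimental_slicing_functions=[slice_fn])
      stats = tfdv.generate_statistics_from_tfrecord(
          data_location='/path/to/examples.tfrecord',
          stats_options=options)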

    Known Issues

    • N/A

    Breaking Changes

    • N/A

    Deprecations

    • N/A
    Source code(tar.gz)
    Source code(zip)
  • v1.0.0(May 24, 2021)

    Major Features and Improvements

    • N/A

    Bug Fixes and Other Changes

    • Increased the threshold beyond which a string feature value is considered "large" by the experimental sketch-based top-k/unique generator to 1024.
    • Added normalized AMI to sklearn mutual information generator.
    • Depends on apache-beam[gcp]>=2.29,<3.
    • Depends on tensorflow>=1.15.2,!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,<3.
    • Depends on tensorflow-metadata>=1.0,<1.1.
    • Depends on tfx-bsl>=1.0,<1.1.

    Known Issues

    • N/A

    Breaking Changes

    • N/A

    Deprecations

    • Removed the following deprecated symbols. Their deprecation was announced in 0.30.0.
    • tfdv.validate_instance
    • tfdv.lift_stats_generator
    • tfdv.partitioned_stats_generator
    • tfdv.get_feature_value_slicer
    • Removed parameter compression_type in tfdv.generate_statistics_from_tfrecord
    Source code(tar.gz)
    Source code(zip)
    tensorflow_data_validation-1.0.0-cp36-cp36m-macosx_10_9_x86_64.whl(1.41 MB)
    tensorflow_data_validation-1.0.0-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl(1.29 MB)
    tensorflow_data_validation-1.0.0-cp36-cp36m-win_amd64.whl(1.09 MB)
    tensorflow_data_validation-1.0.0-cp37-cp37m-macosx_10_9_x86_64.whl(1.41 MB)
    tensorflow_data_validation-1.0.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl(1.29 MB)
    tensorflow_data_validation-1.0.0-cp37-cp37m-win_amd64.whl(1.09 MB)
    tensorflow_data_validation-1.0.0-cp38-cp38-macosx_10_9_x86_64.whl(1.41 MB)
    tensorflow_data_validation-1.0.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl(1.29 MB)
    tensorflow_data_validation-1.0.0-cp38-cp38-win_amd64.whl(1.09 MB)
  • v0.26.1(May 10, 2021)

    Major Features and Improvements

    • N/A

    Bug Fixes and Other Changes

    • Depends on apache-beam[gcp]>=2.25,!=2.26.*,<2.29.

    Known Issues

    • N/A

    Breaking Changes

    • N/A

    Deprecations

    • N/A
    Source code(tar.gz)
    Source code(zip)
  • v0.30.0(Apr 21, 2021)

    Major Features and Improvements

    • This is the last version before TFDV 1.0. Starting with 1.0, all TFDV public APIs (i.e. symbols in the root __init__.py) will be subject to semantic versioning. We are deprecating some public APIs in this version; they will be removed in 1.0.

    • The sketch-based top-k/unique stats generator is now able to detect invalid UTF-8 sequences and large text values and replace them with a placeholder, so it no longer suffers from the memory issues usually caused by image or large-text features in the data. Note that this generator is not yet used by default.

    • Added StatsOptions.experimental_use_sketch_based_topk_uniques, which enables the sketch-based top-k/unique stats generator (see the sketch below).
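
    A minimal sketch of enabling the sketch-based generator through StatsOptions (the data path is hypothetical):

      import tensorflow_data_validation as tfdv

      # Opt in to the sketch-based top-k/unique stats generator.
      options = tfdv.StatsOptions(experimental_use_sketch_based_topk_uniques=True)
      stats = tfdv.generate_statistics_from_tfrecord(
          data_location='/path/to/examples.tfrecord',
          stats_options=options)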

    Bug Fixes and Other Changes

    • Fixed bug in display_schema that caused domains not to be displayed.
    • Modified how get_schema_dataframe outputs numeric domains.
    • Anomalies previously (un)classified as UNKNOWN_TYPE now trigger more specific anomaly types: INVALID_DOMAIN_SPECIFICATION and MULTIPLE_REASONS.
    • Depends on tensorflow-metadata>=0.30,<0.31.
    • Depends on tfx-bsl>=0.30,<0.31.

    Known Issues

    • N/A

    Breaking Changes

    • N/A

    Deprecations

    • tfdv.LiftStatsGenerator will be removed from the public API in the next version. To enable that generator, supply StatsOptions.label_feature.
    • tfdv.NonStreamingCustomStatsGenerator will be removed from the public API in the next version. You may continue to import it from TFDV, but it will not be subject to compatibility guarantees.
    • tfdv.validate_instance will be removed from the public API in the next version. You may continue to import it from TFDV, but it will not be subject to compatibility guarantees.
    • Removed tfdv.DecodeCSV, tfdv.DecodeTFExample (deprecated in 0.27).
    • Removed feature_whitelist in tfdv.StatsOptions (deprecated in 0.28). Use feature_allowlist instead.
    • tfdv.get_feature_value_slicer is deprecated. tfdv.experimental_get_feature_value_slicer is introduced as a replacement. TFDV is likely to have a different slicing functionality post 1.0, which may not be compatible with the current slicers.
    • StatsOptions.slicing_functions is deprecated. StatsOptions.experimental_slicing_functions is introduced as a replacement.
    • tfdv.WriteStatisticsToText is removed (deprecated in 0.25.0).
    • Parameter compression_type in tfdv.generate_statistics_from_tfrecord is deprecated. The compression type is currently automatically determined.
    Source code(tar.gz)
    Source code(zip)
    tensorflow_data_validation-0.30.0-cp36-cp36m-macosx_10_9_x86_64.whl(1.40 MB)
    tensorflow_data_validation-0.30.0-cp36-cp36m-manylinux2010_x86_64.whl(1.28 MB)
    tensorflow_data_validation-0.30.0-cp36-cp36m-win_amd64.whl(1.08 MB)
    tensorflow_data_validation-0.30.0-cp37-cp37m-macosx_10_9_x86_64.whl(1.40 MB)
    tensorflow_data_validation-0.30.0-cp37-cp37m-manylinux2010_x86_64.whl(1.28 MB)
    tensorflow_data_validation-0.30.0-cp37-cp37m-win_amd64.whl(1.08 MB)
    tensorflow_data_validation-0.30.0-cp38-cp38-macosx_10_9_x86_64.whl(1.40 MB)
    tensorflow_data_validation-0.30.0-cp38-cp38-manylinux2010_x86_64.whl(1.29 MB)
    tensorflow_data_validation-0.30.0-cp38-cp38-win_amd64.whl(1.08 MB)
  • v0.28.0(Feb 24, 2021)

    Major Features and Improvements

    • Added anomaly detection for the maximum byte size of images.

    Bug Fixes and Other Changes

    • Depends on numpy>=1.16,<1.20.
    • Fixed a bug that affected all CombinerFeatureStatsGenerators.
    • Allow for bytes type in get_feature_value_slicer in addition to Text and int.
    • Fixed a bug that caused TFDV to improperly infer a fixed shape when tfdv.infer_schema and tfdv.update_schema were called with infer_feature_shape=True.
    • Deprecated parameter infer_feature_shape of function tfdv.update_schema. If a schema feature has a pre-defined shape, tfdv.update_schema will always validate it. Otherwise, it will not try to add a shape.
    • Deprecated tfdv.StatsOptions.feature_whitelist and added feature_allowlist as a replacement. The former will be removed in the next release.
    • Added get_schema_dataframe and get_anomalies_dataframe utility functions (see the sketch after this list).
    • Depends on apache-beam[gcp]>=2.28,<3.
    • Depends on tensorflow-metadata>=0.28,<0.29.
    • Depends on tfx-bsl>=0.28.1,<0.29.
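
    A minimal sketch of the new dataframe utilities (the data path is hypothetical; the exact shape of the schema dataframe output may differ by version):

      import tensorflow_data_validation as tfdv

      stats = tfdv.generate_statistics_from_tfrecord(data_location='/path/to/examples.tfrecord')
      schema = tfdv.infer_schema(stats, infer_feature_shape=True)
      anomalies = tfdv.validate_statistics(stats, schema)

      schema_df = tfdv.get_schema_dataframe(schema)            # pandas view(s) of the schema
      anomalies_df = tfdv.get_anomalies_dataframe(anomalies)   # pandas view of detected anomalies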

    Known Issues

    • N/A

    Breaking Changes

    • N/A

    Deprecations

    • N/A
    Source code(tar.gz)
    Source code(zip)
    tensorflow_data_validation-0.28.0-cp36-cp36m-macosx_10_9_x86_64.whl(2.93 MB)
    tensorflow_data_validation-0.28.0-cp36-cp36m-manylinux2010_x86_64.whl(1.28 MB)
    tensorflow_data_validation-0.28.0-cp36-cp36m-win_amd64.whl(1.08 MB)
    tensorflow_data_validation-0.28.0-cp37-cp37m-macosx_10_9_x86_64.whl(2.93 MB)
    tensorflow_data_validation-0.28.0-cp37-cp37m-manylinux2010_x86_64.whl(1.28 MB)
    tensorflow_data_validation-0.28.0-cp37-cp37m-win_amd64.whl(1.08 MB)
    tensorflow_data_validation-0.28.0-cp38-cp38-macosx_10_9_x86_64.whl(2.93 MB)
    tensorflow_data_validation-0.28.0-cp38-cp38-manylinux2010_x86_64.whl(1.28 MB)
    tensorflow_data_validation-0.28.0-cp38-cp38-win_amd64.whl(1.08 MB)
  • v0.27.0(Jan 28, 2021)

    Major Features and Improvements

    • Performance improvement to BasicStatsGenerator.

    Bug Fixes and Other Changes

    • Added a compact() and setup() interface to CombinerStatsGenerator, CombinerFeatureStatsWrapperGenerator, BasicStatsGenerator, CompositeStatsGenerator, and ConstituentStatsGenerator.
    • Stopped depending on tensorflow-transform.
    • Depends on apache-beam[gcp]>=2.27,<3.
    • Depends on pyarrow>=1,<3.
    • Depends on tensorflow>=1.15.2,!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,<3.
    • Depends on tensorflow-metadata>=0.27,<0.28.
    • Depends on tfx-bsl>=0.27,<0.28.

    Known Issues

    • N/A

    Breaking Changes

    • N/A

    Deprecations

    • tfdv.DecodeCSV and tfdv.DecodeTFExample are deprecated. Use tfx_bsl.public.tfxio.CsvTFXIO and tfx_bsl.public.tfxio.TFExampleRecord instead (see the sketch below).
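
    A minimal sketch of using the tfx_bsl TFXIO source with TFDV's Beam PTransforms in place of the deprecated decoders (file pattern and output path are hypothetical):

      import apache_beam as beam
      import tensorflow_data_validation as tfdv
      from tfx_bsl.public import tfxio

      example_records = tfxio.TFExampleRecord(file_pattern='/path/to/examples*.tfrecord')
      with beam.Pipeline() as p:
          _ = (
              p
              | 'ReadAndDecode' >> example_records.BeamSource()
              | 'GenerateStatistics' >> tfdv.GenerateStatistics()
              | 'WriteStats' >> tfdv.WriteStatisticsToTFRecord(
                  output_path='/path/to/stats.tfrecord'))
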
    Source code(tar.gz)
    Source code(zip)
  • v0.26.0(Dec 17, 2020)

    Major Features and Improvements

    • Added support for per-feature example weights, which allows associating each column with its own weight column. See the per_feature_weight_override parameter in StatsOptions.__init__ and the sketch below.
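
    A minimal sketch of a per-feature weight override (feature and weight-column names are hypothetical, and keying the override map by tfdv.FeaturePath is an assumption):

      import tensorflow_data_validation as tfdv

      options = tfdv.StatsOptions(
          weight_feature='example_weight',
          # Hypothetical override: weight the 'income' feature by a dedicated column.
          per_feature_weight_override={tfdv.FeaturePath(['income']): 'income_weight'})
      stats = tfdv.generate_statistics_from_tfrecord(
          data_location='/path/to/examples.tfrecord',
          stats_options=options)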

    Bug Fixes and Other Changes

    • The newly added LifecycleStage.DISABLED is now exempt from validation (similar to LifecycleStage.DEPRECATED, etc.).
    • Fixed a bug where TFDV blindly trusted the type claimed in the provided schema. TFDV now computes stats according to the actual type of the data, and only computes type-specific stats (e.g. categorical ints) when the actual type matches the type claimed in the schema.
    • Added an option to control whether to add the default stats generators when using tfdv.GenerateStatistics().
    • Started using a new quantiles computation routine that does not depend on TF. This could potentially increase the performance of TFDV under certain workloads.
    • Extended schema_util to support semantic domains.
    • Renamed natural_language_stats_generator to natural_language_domain_inferring_stats_generator.
    • Added vocab_utils to assist in opening and loading vocabulary files.
    • A SchemaDiff is now reported upon Jensen-Shannon (J-S) skew/drift.
    • Fixed a bug in FLOAT_TYPE_SMALL_FLOAT anomaly message.
    • Depends on apache-beam[gcp]>=2.25,!=2.26.*,<3.
    • Depends on tensorflow>=1.15.2,!=2.0.*,!=2.1.*,!=2.2.*,!=2.4.*,<3.
    • Depends on tensorflow-metadata>=0.26,<0.27.
    • Depends on tensorflow-transform>=0.26,<0.27.
    • Depends on tfx-bsl>=0.26,<0.27.

    Known Issues

    • N/A

    Breaking Changes

    • N/A

    Deprecations

    • N/A
    Source code(tar.gz)
    Source code(zip)
    tensorflow_data_validation-0.26.0-cp36-cp36m-macosx_10_9_x86_64.whl(2.92 MB)
    tensorflow_data_validation-0.26.0-cp36-cp36m-manylinux2010_x86_64.whl(1.27 MB)
    tensorflow_data_validation-0.26.0-cp36-cp36m-win_amd64.whl(1.07 MB)
    tensorflow_data_validation-0.26.0-cp37-cp37m-macosx_10_9_x86_64.whl(2.92 MB)
    tensorflow_data_validation-0.26.0-cp37-cp37m-manylinux2010_x86_64.whl(1.27 MB)
    tensorflow_data_validation-0.26.0-cp37-cp37m-win_amd64.whl(1.06 MB)
    tensorflow_data_validation-0.26.0-cp38-cp38-macosx_10_9_x86_64.whl(2.92 MB)
    tensorflow_data_validation-0.26.0-cp38-cp38-manylinux2010_x86_64.whl(1.27 MB)
    tensorflow_data_validation-0.26.0-cp38-cp38-win_amd64.whl(1.06 MB)
  • v0.25.0(Nov 5, 2020)

    Major Features and Improvements

    • Added support for detecting drift and distribution skew in numeric features (see the sketch after this list).

    • tfdv.validate_statistics now also reports the raw measurements of distribution skew/drift (if any comparison is configured), regardless of whether skew/drift is detected. The report is in the drift_skew_info field of the Anomalies proto (the return value of validate_statistics).

    • Starting with this release, TFDV also hosts nightly packages on https://pypi-nightly.tensorflow.org. To install the nightly package, use the following command:

      pip install -i https://pypi-nightly.tensorflow.org/simple tensorflow-data-validation
      

      Note: These nightly packages are unstable and breakages are likely to happen. Depending on the complexity involved, a fix may take a week or more before updated wheels are available. You can always use the stable version of TFDV available on PyPI by running pip install tensorflow-data-validation.
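
    A minimal sketch of numeric drift detection between two sets of statistics (file paths, the feature name, and the threshold are hypothetical):

      import tensorflow_data_validation as tfdv

      train_stats = tfdv.generate_statistics_from_tfrecord(data_location='/path/to/train.tfrecord')
      new_stats = tfdv.generate_statistics_from_tfrecord(data_location='/path/to/new.tfrecord')

      schema = tfdv.infer_schema(train_stats)
      # Set a Jensen-Shannon divergence threshold on a (hypothetical) numeric feature.
      tfdv.get_feature(schema, 'age').drift_comparator.jensen_shannon_divergence.threshold = 0.1

      anomalies = tfdv.validate_statistics(
          statistics=new_stats, schema=schema, previous_statistics=train_stats)
      print(anomalies.drift_skew_info)  # raw drift measurements, even when no anomaly fires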

    Bug Fixes and Other Changes

    • Added tfdv.load_stats_binary to load stats that were written using tfdv.WriteStatisticsToText (now tfdv.WriteStatisticsToBinaryFile); see the sketch after this list.
    • Anomalies previously (un)classified as UNKNOWN_TYPE now trigger more specific anomaly types: DOMAIN_INVALID_FOR_TYPE, UNEXPECTED_DATA_TYPE, FEATURE_MISSING_NAME, FEATURE_MISSING_TYPE, and INVALID_SCHEMA_SPECIFICATION.
    • Fixed a bug where import tensorflow_data_validation would fail if IPython is not installed; IPython is an optional dependency of TFDV.
    • Depends on apache-beam[gcp]>=2.25,<3.
    • Depends on tensorflow-metadata>=0.25,<0.26.
    • Depends on tensorflow-transform>=0.25,<0.26.
    • Depends on tfx-bsl>=0.25,<0.26.
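
    A minimal sketch of loading statistics back from a binary file (the path is hypothetical and assumed to have been written with tfdv.WriteStatisticsToBinaryFile):

      import tensorflow_data_validation as tfdv

      stats = tfdv.load_stats_binary('/tmp/train_stats.pb')
      tfdv.visualize_statistics(stats)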

    Known Issues

    • N/A

    Breaking Changes

    • tfdv.WriteStatisticsToText is renamed as tfdv.WriteStatisticsToBinaryFile. The former is still available but will be removed in a future release.

    Deprecations

    • N/A
    Source code(tar.gz)
    Source code(zip)
  • v0.24.1(Sep 24, 2020)

    Major Features and Improvements

    • N/A

    Bug Fixes and Other Changes

    • Depends on apache-beam[gcp]>=2.24,<3.
    • Depends on tensorflow-transform>=0.24.1,<0.25.
    • Depends on tfx-bsl>=0.24.1,<0.25.

    Known Issues

    • N/A

    Breaking Changes

    • N/A

    Deprecations

    • N/A
    Source code(tar.gz)
    Source code(zip)
  • v0.23.1(Sep 24, 2020)

    Major Features and Improvements

    • N/A

    Bug Fixes and Other Changes

    • Depends on apache-beam[gcp]>=2.24,<3.

    Known Issues

    • N/A

    Breaking Changes

    • N/A

    Deprecations

    • Deprecated Python 3.5 support.
    Source code(tar.gz)
    Source code(zip)
  • v0.24.0(Sep 14, 2020)

    Major Features and Improvements

    • You can now build the TFDV wheel with python setup.py bdist_wheel. Note:
    • If you want to build a manylinux2010 wheel, you'll still need to use Docker.
    • Bazel is still required.
    • You can now build manylinux2010 TFDV wheel for Python 3.8.

    Bug Fixes and Other Changes

    • Added support for allowlist and denylist features in the tfdv.visualize_statistics method (see the sketch after this list).
    • Depends on absl-py>=0.9,<0.11.
    • Depends on pandas>=1.0,<2.
    • Depends on protobuf>=3.9.2,<4.
    • Depends on tensorflow-metadata>=0.24,<0.25.
    • Depends on tensorflow-transform>=0.24,<0.25.
    • Depends on tfx-bsl>=0.24,<0.25.
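
    A minimal sketch of visualizing only selected features (the stats path and feature names are hypothetical, and the allowlist_features parameter name and FeaturePath type are assumptions based on the entry above):

      import tensorflow_data_validation as tfdv

      stats = tfdv.load_statistics('/tmp/train_stats.tfrecord')
      tfdv.visualize_statistics(
          stats,
          allowlist_features=[tfdv.FeaturePath(['age']), tfdv.FeaturePath(['income'])])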

    Known Issues

    • N/A

    Breaking Changes

    • N/A

    Deprecations

    • Deprecated Python 3.5 support.
    • Deprecated sample_count option in tfdv.StatsOptions. Use sample_rate option instead.
    Source code(tar.gz)
    Source code(zip)
  • v0.23.0(Aug 14, 2020)

    Major Features and Improvements

    • Data validation is now able to handle arbitrarily nested arrow List/LargeList types. Schema entries for features with multiple nest levels describe the value count at each level in the value_counts field.
    • Add combiner stats generator to estimate top-K and uniques using Misra-Gries and K-Minimum Values sketches.

    Bug Fixes and Other Changes

    • Validate that enough supported images are present (if image_domain.minimum_supported_image_fraction is provided).
    • Stopped requiring avro-python3.
    • Depends on apache-beam[gcp]>=2.23,<3.
    • Depends on pyarrow>=0.17,<0.18.
    • Depends on tensorflow>=1.15.2,!=2.0.*,!=2.1.*,!=2.2.*,<3.
    • Depends on tensorflow-metadata>=0.23,<0.24.
    • Depends on tensorflow-transform>=0.23,<0.24.
    • Depends on tfx-bsl>=0.23,<0.24.

    Known Issues

    • N/A

    Breaking Changes

    • N/A

    Deprecations

    • N/A
    Source code(tar.gz)
    Source code(zip)
  • v0.22.2(Jun 29, 2020)

    Major Features and Improvements

    Bug Fixes and Other Changes

    • Fixed a bug that prevented TFX 0.22.0 from working with TFDV 0.22.1.
    • Depends on 'avro-python3>=1.8.1,<1.9.2' on Python 3.5 + macOS.

    Known Issues

    Breaking Changes

    Deprecations

    Source code(tar.gz)
    Source code(zip)
  • v0.22.1(Jun 24, 2020)

    Major Features and Improvements

    • Statistics generation is now able to handle arbitrarily nested arrow List/LargeList types. Stats about the list elements' presence and valency are computed at each nest level, and stored in a newly added field, valency_and_presence_stats in CommonStatistics.

    Bug Fixes and Other Changes

    • Trigger DATASET_HIGH_NUM_EXAMPLES when a dataset has more than the specified limit on number of examples.
    • Fix bug in display_anomalies that prevented dataset-level anomalies from being displayed.
    • Trigger anomalies when a feature has a number of unique values that does not conform to the specified minimum/maximum.
    • Depends on pandas>=0.24,<2.
    • Depends on tensorflow-metadata>=0.22.2,<0.23.0.
    • Depends on tfx-bsl>=0.22.1,<0.23.0.

    Known Issues

    Breaking Changes

    Deprecations

    Source code(tar.gz)
    Source code(zip)
  • v0.22.0(May 15, 2020)

    Major Features and Improvements

    Bug Fixes and Other Changes

    • Crop values in natural language stats generator.
    • Switch to using PyBind11 instead of SWIG for wrapping C++ libraries.
    • CSV decoder support for multivalent columns by using tfx_bsl's decoder.
    • When inferring a schema entry for a feature, do not add a shape with dim = 0 when min_num_values = 0.
    • Added utility methods tfdv.get_slice_stats to get statistics for a slice and tfdv.compare_slices to compare statistics of two slices using Facets (see the sketch after this list).
    • Make tfdv.load_stats_text and tfdv.write_stats_text public.
    • Add PTransforms tfdv.WriteStatisticsToText and tfdv.WriteStatisticsToTFRecord to write statistics proto to text and tfrecord files respectively.
    • Modify tfdv.load_statistics to handle reading statistics from TFRecord and text files.
    • Added an extra requirement group mutual-information. As a result, a bare-bones TFDV install no longer requires scikit-learn.
    • Added an extra requirement group visualization. As a result, a bare-bones TFDV install no longer requires ipython.
    • Added an extra requirement group all that specifies all the extra dependencies TFDV needs. Use pip install tensorflow-data-validation[all] to pull in those dependencies.
    • Depends on pyarrow>=0.16,<0.17.
    • Depends on apache-beam[gcp]>=2.20,<3.
    • Depends on 'ipython>=7,<8;python_version>="3"'.
    • Depends on 'scikit-learn>=0.18,<0.24'.
    • Depends on tensorflow>=1.15,!=2.0.*,<3.
    • Depends on tensorflow-metadata>=0.22.0,<0.23.
    • Depends on tensorflow-transform>=0.22,<0.23.
    • Depends on tfx-bsl>=0.22,<0.23.
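
    A minimal sketch of the new slice utilities (the stats path and slice keys are hypothetical and assume statistics were generated with a slicing configuration):

      import tensorflow_data_validation as tfdv

      sliced_stats = tfdv.load_statistics('/tmp/sliced_stats.tfrecord')
      us_stats = tfdv.get_slice_stats(sliced_stats, 'country_US')    # stats for a single slice
      tfdv.compare_slices(sliced_stats, 'country_US', 'country_CA')  # side-by-side Facets view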

    Known Issues

    • (Known issue resolution) It is no longer necessary to use Apache Beam 2.17 when running TFDV on Windows. The current release of Apache Beam will work.

    Breaking Changes

    • tfdv.GenerateStatistics now accepts a PCollection of pa.RecordBatch instead of pa.Table.
    • All the TFDV coders now output a PCollection of pa.RecordBatch instead of a PCollection of pa.Table.
    • tfdv.validate_instances and tfdv.api.validation_api.IdentifyAnomalousExamples now take pa.RecordBatch as input instead of pa.Table.
    • The StatsGenerator interface (and all its sub-classes) now takes pa.RecordBatch as the input data instead of pa.Table.
    • Custom slicing functions now accept a pa.RecordBatch instead of a pa.Table as input and should output a tuple (slice_key, record_batch).

    Deprecations

    • Deprecated Python 2 support.
    Source code(tar.gz)
    Source code(zip)
  • v0.21.5(Mar 6, 2020)

    Major Features and Improvements

    • Added label_feature to StatsOptions; the LiftStatsGenerator is enabled when both label_feature and schema are provided (see the sketch after this list).
    • Added JSON serialization support for StatsOptions.
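
    A minimal sketch of the new options (feature names and the data path are hypothetical; the to_json/from_json method names are assumptions based on the JSON support noted above):

      import tensorflow_data_validation as tfdv

      stats = tfdv.generate_statistics_from_tfrecord(data_location='/path/to/train.tfrecord')
      schema = tfdv.infer_schema(stats)

      # With both label_feature and schema set, the LiftStatsGenerator is enabled.
      options = tfdv.StatsOptions(label_feature='label', schema=schema)

      options_json = options.to_json()                        # assumed serialization API
      restored_options = tfdv.StatsOptions.from_json(options_json)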

    Bug Fixes and Other Changes

    • Only requires avro-python3>=1.8.1,!=1.9.2.*,<2.0.0 on Python 3.5 + macOS.

    Breaking Changes

    Deprecations

    Source code(tar.gz)
    Source code(zip)
  • v0.21.4(Mar 5, 2020)

    Major Features and Improvements

    • Added support for visualizing feature value lift in the Facets visualization.

    Bug Fixes and Other Changes

    • Fix issue writing out string feature values in LiftStatsGenerator.
    • Requires 'apache-beam[gcp]>=2.17,<3'.
    • Requires 'tensorflow-transform>=0.21.1,<0.22'.
    • Requires 'tfx-bsl>=0.21.3,<0.22'.

    Breaking Changes

    Deprecations

    Source code(tar.gz)
    Source code(zip)