Context
When running tfdv.generate_statistics_from_tfrecord
on Dataflow, the job gets submitted successfully to the cluster but I get a:
ImportError: No module named tensorflow_data_validation.statistics.stats_impl
during the job unpickling phase in the Dataflow worker
Error trace
---------------------------------------------------------------------------
DataflowRuntimeException Traceback (most recent call last)
<ipython-input-23-8f1147effd88> in <module>()
16 # for more options about stats, run `?tfdv.generate_statistics_from_tfrecord`
17 tfdv.generate_statistics_from_tfrecord(TFRECORDS_PATH,
---> 18 pipeline_options=pipeline_options)
/Users/romain/dev/venv/lib/python2.7/site-packages/tensorflow_data_validation/utils/stats_gen_lib.pyc in generate_statistics_from_tfrecord(data_location, output_path, stats_options, pipeline_options)
86 shard_name_template='',
87 coder=beam.coders.ProtoCoder(
---> 88 statistics_pb2.DatasetFeatureStatisticsList)))
89 return load_statistics(output_path)
90
/Users/romain/dev/venv/lib/python2.7/site-packages/apache_beam/pipeline.pyc in __exit__(self, exc_type, exc_val, exc_tb)
421 def __exit__(self, exc_type, exc_val, exc_tb):
422 if not exc_type:
--> 423 self.run().wait_until_finish()
424
425 def visit(self, visitor):
/Users/romain/dev/venv/lib/python2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.pyc in wait_until_finish(self, duration)
1164 raise DataflowRuntimeException(
1165 'Dataflow pipeline failed. State: %s, Error:\n%s' %
-> 1166 (self.state, getattr(self._runner, 'last_error_msg', None)), self)
1167 return self.state
1168
DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 642, in do_work
work_executor.execute()
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 130, in execute
test_shuffle_sink=self._test_shuffle_sink)
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 104, in create_operation
is_streaming=False)
File "apache_beam/runners/worker/operations.py", line 636, in apache_beam.runners.worker.operations.create_operation
op = create_pgbk_op(name_context, spec, counter_factory, state_sampler)
File "apache_beam/runners/worker/operations.py", line 482, in apache_beam.runners.worker.operations.create_pgbk_op
return PGBKCVOperation(step_name, spec, counter_factory, state_sampler)
File "apache_beam/runners/worker/operations.py", line 538, in apache_beam.runners.worker.operations.PGBKCVOperation.__init__
fn, args, kwargs = pickler.loads(self.spec.combine_fn)[:3]
File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 246, in loads
return dill.loads(s)
File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 316, in loads
return load(file, ignore)
File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 304, in load
obj = pik.load()
File "/usr/lib/python2.7/pickle.py", line 864, in load
dispatch[key](self)
File "/usr/lib/python2.7/pickle.py", line 1096, in load_global
klass = self.find_class(module, name)
File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 465, in find_class
return StockUnpickler.find_class(self, module, name)
File "/usr/lib/python2.7/pickle.py", line 1130, in find_class
__import__(module)
ImportError: No module named tensorflow_data_validation.statistics.stats_impl
What code did I run?
!pip install -U tensorflow \
tensorflow-data-validation \
apache-beam[gcp]
import tensorflow_data_validation as tfdv
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, SetupOptions
# Create and set your PipelineOptions.
pipeline_options = PipelineOptions()
# For Cloud execution, set the Cloud Platform project, job_name,
# staging location, temp_location and specify DataflowRunner.
google_cloud_options = pipeline_options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT_ID
google_cloud_options.job_name = JOB_NAME
google_cloud_options.staging_location = GCS_STAGING_LOCATION
google_cloud_options.temp_location = GCS_TMP_LOCATION
pipeline_options.view_as(StandardOptions).runner = 'DataflowRunner'
tfdv.generate_statistics_from_tfrecord(TFRECORDS_PATH,
pipeline_options=pipeline_options)
Pip trace
Requirement already up-to-date: tensorflow in /Users/romain/dev/venv/lib/python2.7/site-packages (1.12.0)
Requirement already up-to-date: tensorflow-data-validation in /Users/romain/dev/venv/lib/python2.7/site-packages (0.11.0)
Requirement already up-to-date: apache-beam[gcp] in /Users/romain/dev/venv/lib/python2.7/site-packages (2.8.0)
Requirement already satisfied, skipping upgrade: enum34>=1.1.6 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.1.6)
Requirement already satisfied, skipping upgrade: keras-preprocessing>=1.0.5 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.0.5)
Requirement already satisfied, skipping upgrade: wheel in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (0.31.1)
Requirement already satisfied, skipping upgrade: astor>=0.6.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (0.7.1)
Requirement already satisfied, skipping upgrade: backports.weakref>=1.0rc1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.0.post1)
Requirement already satisfied, skipping upgrade: mock>=2.0.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (2.0.0)
Requirement already satisfied, skipping upgrade: tensorboard<1.13.0,>=1.12.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.12.0)
Requirement already satisfied, skipping upgrade: termcolor>=1.1.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.1.0)
Requirement already satisfied, skipping upgrade: protobuf>=3.6.1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (3.6.1)
Requirement already satisfied, skipping upgrade: gast>=0.2.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (0.2.0)
Requirement already satisfied, skipping upgrade: absl-py>=0.1.6 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (0.3.0)
Requirement already satisfied, skipping upgrade: grpcio>=1.8.6 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.13.0)
Requirement already satisfied, skipping upgrade: six>=1.10.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.10.0)
Requirement already satisfied, skipping upgrade: keras-applications>=1.0.6 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.0.6)
Requirement already satisfied, skipping upgrade: numpy>=1.13.3 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.14.0)
Requirement already satisfied, skipping upgrade: IPython<6,>=5.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow-data-validation) (5.7.0)
Requirement already satisfied, skipping upgrade: tensorflow-metadata<0.10,>=0.9 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow-data-validation) (0.9.0)
Requirement already satisfied, skipping upgrade: tensorflow-transform<0.12,>=0.11 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow-data-validation) (0.11.0)
Requirement already satisfied, skipping upgrade: pandas<1,>=0.18 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow-data-validation) (0.22.0)
Requirement already satisfied, skipping upgrade: oauth2client<5,>=2.0.1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (4.1.3)
Requirement already satisfied, skipping upgrade: dill<=0.2.8.2,>=0.2.6 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.2.8.2)
Requirement already satisfied, skipping upgrade: pydot<1.3,>=1.2.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (1.2.4)
Requirement already satisfied, skipping upgrade: pyyaml<4.0.0,>=3.12 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (3.12)
Requirement already satisfied, skipping upgrade: pyvcf<0.7.0,>=0.6.8 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.6.8)
Requirement already satisfied, skipping upgrade: typing<3.7.0,>=3.6.0; python_version < "3.5.0" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (3.6.4)
Requirement already satisfied, skipping upgrade: avro<2.0.0,>=1.8.1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (1.8.2)
Requirement already satisfied, skipping upgrade: future<1.0.0,>=0.16.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.16.0)
Requirement already satisfied, skipping upgrade: fastavro<0.22,>=0.21.4 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.21.13)
Requirement already satisfied, skipping upgrade: crcmod<2.0,>=1.7 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (1.7)
Requirement already satisfied, skipping upgrade: httplib2<=0.11.3,>=0.8 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.11.3)
Requirement already satisfied, skipping upgrade: futures<4.0.0,>=3.1.1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (3.2.0)
Requirement already satisfied, skipping upgrade: hdfs<3.0.0,>=2.1.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (2.1.0)
Requirement already satisfied, skipping upgrade: pytz<=2018.4,>=2018.3 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (2018.4)
Requirement already satisfied, skipping upgrade: google-apitools<=0.5.20,>=0.5.18; extra == "gcp" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.5.20)
Requirement already satisfied, skipping upgrade: proto-google-cloud-pubsub-v1==0.15.4; extra == "gcp" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.15.4)
Requirement already satisfied, skipping upgrade: googledatastore==7.0.1; python_version < "3.0" and extra == "gcp" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (7.0.1)
Requirement already satisfied, skipping upgrade: google-cloud-bigquery==0.25.0; extra == "gcp" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.25.0)
Requirement already satisfied, skipping upgrade: google-cloud-pubsub==0.26.0; extra == "gcp" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.26.0)
Requirement already satisfied, skipping upgrade: proto-google-cloud-datastore-v1<=0.90.4,>=0.90.0; extra == "gcp" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.90.4)
Requirement already satisfied, skipping upgrade: funcsigs>=1; python_version < "3.3" in /Users/romain/dev/venv/lib/python2.7/site-packages (from mock>=2.0.0->tensorflow) (1.0.2)
Requirement already satisfied, skipping upgrade: pbr>=0.11 in /Users/romain/dev/venv/lib/python2.7/site-packages (from mock>=2.0.0->tensorflow) (1.10.0)
Requirement already satisfied, skipping upgrade: werkzeug>=0.11.10 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorboard<1.13.0,>=1.12.0->tensorflow) (0.14.1)
Requirement already satisfied, skipping upgrade: markdown>=2.6.8 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorboard<1.13.0,>=1.12.0->tensorflow) (2.6.11)
Requirement already satisfied, skipping upgrade: setuptools in /Users/romain/dev/venv/lib/python2.7/site-packages (from protobuf>=3.6.1->tensorflow) (39.1.0)
Requirement already satisfied, skipping upgrade: h5py in /Users/romain/dev/venv/lib/python2.7/site-packages (from keras-applications>=1.0.6->tensorflow) (2.8.0)
Requirement already satisfied, skipping upgrade: simplegeneric>0.8 in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (0.8.1)
Requirement already satisfied, skipping upgrade: pygments in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (2.2.0)
Requirement already satisfied, skipping upgrade: backports.shutil-get-terminal-size; python_version == "2.7" in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (1.0.0)
Requirement already satisfied, skipping upgrade: pexpect; sys_platform != "win32" in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (4.6.0)
Requirement already satisfied, skipping upgrade: prompt-toolkit<2.0.0,>=1.0.4 in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (1.0.15)
Requirement already satisfied, skipping upgrade: decorator in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (4.3.0)
Requirement already satisfied, skipping upgrade: pickleshare in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (0.7.4)
Requirement already satisfied, skipping upgrade: appnope; sys_platform == "darwin" in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (0.1.0)
Requirement already satisfied, skipping upgrade: traitlets>=4.2 in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (4.3.2)
Requirement already satisfied, skipping upgrade: pathlib2; python_version == "2.7" or python_version == "3.3" in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (2.3.2)
Requirement already satisfied, skipping upgrade: googleapis-common-protos in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow-metadata<0.10,>=0.9->tensorflow-data-validation) (1.5.3)
Requirement already satisfied, skipping upgrade: python-dateutil in /Users/romain/dev/venv/lib/python2.7/site-packages (from pandas<1,>=0.18->tensorflow-data-validation) (2.7.3)
Requirement already satisfied, skipping upgrade: rsa>=3.1.4 in /Users/romain/dev/venv/lib/python2.7/site-packages (from oauth2client<5,>=2.0.1->apache-beam[gcp]) (3.4.2)
Requirement already satisfied, skipping upgrade: pyasn1>=0.1.7 in /Users/romain/dev/venv/lib/python2.7/site-packages (from oauth2client<5,>=2.0.1->apache-beam[gcp]) (0.1.9)
Requirement already satisfied, skipping upgrade: pyasn1-modules>=0.0.5 in /Users/romain/dev/venv/lib/python2.7/site-packages (from oauth2client<5,>=2.0.1->apache-beam[gcp]) (0.0.8)
Requirement already satisfied, skipping upgrade: pyparsing>=2.1.4 in /Users/romain/dev/venv/lib/python2.7/site-packages (from pydot<1.3,>=1.2.0->apache-beam[gcp]) (2.1.10)
Requirement already satisfied, skipping upgrade: docopt in /Users/romain/dev/venv/lib/python2.7/site-packages (from hdfs<3.0.0,>=2.1.0->apache-beam[gcp]) (0.6.2)
Requirement already satisfied, skipping upgrade: requests>=2.7.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from hdfs<3.0.0,>=2.1.0->apache-beam[gcp]) (2.11.1)
Requirement already satisfied, skipping upgrade: fasteners>=0.14 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-apitools<=0.5.20,>=0.5.18; extra == "gcp"->apache-beam[gcp]) (0.14.1)
Requirement already satisfied, skipping upgrade: google-cloud-core<0.26dev,>=0.25.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-cloud-bigquery==0.25.0; extra == "gcp"->apache-beam[gcp]) (0.25.0)
Requirement already satisfied, skipping upgrade: gapic-google-cloud-pubsub-v1<0.16dev,>=0.15.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-cloud-pubsub==0.26.0; extra == "gcp"->apache-beam[gcp]) (0.15.4)
Requirement already satisfied, skipping upgrade: ptyprocess>=0.5 in /Users/romain/dev/venv/lib/python2.7/site-packages (from pexpect; sys_platform != "win32"->IPython<6,>=5.0->tensorflow-data-validation) (0.5.2)
Requirement already satisfied, skipping upgrade: wcwidth in /Users/romain/dev/venv/lib/python2.7/site-packages (from prompt-toolkit<2.0.0,>=1.0.4->IPython<6,>=5.0->tensorflow-data-validation) (0.1.7)
Requirement already satisfied, skipping upgrade: ipython-genutils in /Users/romain/dev/venv/lib/python2.7/site-packages (from traitlets>=4.2->IPython<6,>=5.0->tensorflow-data-validation) (0.2.0)
Requirement already satisfied, skipping upgrade: scandir; python_version < "3.5" in /Users/romain/dev/venv/lib/python2.7/site-packages (from pathlib2; python_version == "2.7" or python_version == "3.3"->IPython<6,>=5.0->tensorflow-data-validation) (1.7)
Requirement already satisfied, skipping upgrade: monotonic>=0.1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from fasteners>=0.14->google-apitools<=0.5.20,>=0.5.18; extra == "gcp"->apache-beam[gcp]) (1.5)
Requirement already satisfied, skipping upgrade: google-auth-httplib2 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-cloud-core<0.26dev,>=0.25.0->google-cloud-bigquery==0.25.0; extra == "gcp"->apache-beam[gcp]) (0.0.3)
Requirement already satisfied, skipping upgrade: google-auth<2.0.0dev,>=0.4.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-cloud-core<0.26dev,>=0.25.0->google-cloud-bigquery==0.25.0; extra == "gcp"->apache-beam[gcp]) (1.1.1)
Requirement already satisfied, skipping upgrade: grpc-google-iam-v1<0.12dev,>=0.11.1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from gapic-google-cloud-pubsub-v1<0.16dev,>=0.15.0->google-cloud-pubsub==0.26.0; extra == "gcp"->apache-beam[gcp]) (0.11.4)
Requirement already satisfied, skipping upgrade: google-gax<0.16dev,>=0.15.7 in /Users/romain/dev/venv/lib/python2.7/site-packages (from gapic-google-cloud-pubsub-v1<0.16dev,>=0.15.0->google-cloud-pubsub==0.26.0; extra == "gcp"->apache-beam[gcp]) (0.15.16)
Requirement already satisfied, skipping upgrade: cachetools>=2.0.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-auth<2.0.0dev,>=0.4.0->google-cloud-core<0.26dev,>=0.25.0->google-cloud-bigquery==0.25.0; extra == "gcp"->apache-beam[gcp]) (2.0.1)
Requirement already satisfied, skipping upgrade: ply==3.8 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-gax<0.16dev,>=0.15.7->gapic-google-cloud-pubsub-v1<0.16dev,>=0.15.0->google-cloud-pubsub==0.26.0; extra == "gcp"->apache-beam[gcp]) (3.8)
stat:awaiting response type:support