geobeam - adds GIS capabilities to your Apache Beam and Dataflow pipelines.

Overview

geobeam adds GIS capabilities to your Apache Beam pipelines.

What does geobeam do?

geobeam enables you to ingest and analyze massive amounts of geospatial data in parallel using Dataflow. It provides a set of FileBasedSource classes that make it easy to read, process, and write geospatial data, along with a set of helpful Apache Beam transforms and utilities for processing GIS data in your Dataflow pipelines.

See the Full Documentation for the complete API specification.

Requirements

  • Apache Beam 2.27+
  • Python 3.7+

Note: Make sure the Python version used to run the pipeline matches the version in the built container.

Supported input types

File format | Data type | geobeam class
tiff        | raster    | GeotiffSource
shp         | vector    | ShapefileSource
gdb         | vector    | GeodatabaseSource
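
The mapping in this table can also be expressed in code for pipelines that dispatch on file type. A minimal sketch; the SOURCE_BY_EXTENSION lookup and source_for helper are hypothetical conveniences, not part of the geobeam API:

# hypothetical mapping from file extension to geobeam source class,
# mirroring the table above (not part of the geobeam API)
from geobeam.io import GeotiffSource, ShapefileSource, GeodatabaseSource

SOURCE_BY_EXTENSION = {
    '.tif': GeotiffSource,      # raster
    '.shp': ShapefileSource,    # vector
    '.gdb': GeodatabaseSource,  # vector
}

def source_for(gcs_url):
    """Pick a source class from the file extension of a gs:// URL."""
    for ext, source in SOURCE_BY_EXTENSION.items():
        if gcs_url.endswith(ext):
            return source
    raise ValueError('unsupported file type: ' + gcs_url)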

Included libraries

geobeam includes several Python modules that allow you to perform a wide variety of operations and analyses on your geospatial data.

Module   | Version | Description
gdal     | 3.2.1   | Python bindings for GDAL
rasterio | 1.1.8   | reads and writes geospatial raster data
fiona    | 1.8.18  | reads and writes geospatial vector data
shapely  | 1.7.1   | manipulation and analysis of geometric objects in the Cartesian plane
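
Because these libraries are preinstalled in the geobeam containers, you can also call them directly from your own transforms. A minimal sketch, assuming elements are (props, geom) tuples with GeoJSON-style geometry dicts; the buffer distance and step name are illustrative:

import apache_beam as beam
from shapely.geometry import shape, mapping

def buffer_geometry(element, distance):
    """Buffer each geometry with shapely; distance is in CRS units."""
    props, geom = element
    buffered = shape(geom).buffer(distance)
    return props, mapping(buffered)

# usage inside a pipeline:
#   | 'Buffer' >> beam.Map(buffer_geometry, distance=0.001)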

How to Use

1. Install the module

pip install geobeam

2. Write your pipeline

Write a normal Apache Beam pipeline using one of geobeam's file sources. See geobeam/examples for inspiration.

3. Run

Run locally

python -m geobeam.examples.geotiff_dem \
  --gcs_url gs://geobeam/examples/dem-clipped-test.tif \
  --dataset=examples \
  --table=dem \
  --band_column=elev \
  --centroid_only=true \
  --runner=DirectRunner \
  --temp_location <temp gs://> \
  --project <project_id>

You can also run "locally" in Cloud Shell using the -py37 container variants.

Note: Some of the provided examples may take a very long time to run locally...

Run in Dataflow

Write a Dockerfile

This will run in Dataflow as a custom container based on the dataflow-geobeam/base image. See geobeam/examples/Dockerfile for an example that installs the latest geobeam from source.

FROM gcr.io/dataflow-geobeam/base
# FROM gcr.io/dataflow-geobeam/base-py37

RUN pip install geobeam

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

# build locally with docker
docker build -t gcr.io/<project_id>/example .
docker push gcr.io/<project_id>/example

# or build with Cloud Build
gcloud builds submit --tag gcr.io/<project_id>/<name> --timeout=3600s --machine-type=n1-highcpu-8

Start the Dataflow job

Note on Python versions

If you are starting a Dataflow job from a machine running Python 3.7, you must use the images suffixed with -py37. (Cloud Shell runs Python 3.7 by default, as of Feb 2021.) A separate version of the base image is built for Python 3.7 and is available at gcr.io/dataflow-geobeam/base-py37. The Python 3.7-compatible examples image is similarly named: gcr.io/dataflow-geobeam/example-py37.
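
If you launch jobs from scripts that may run under different interpreters, a small helper like the following (hypothetical, not part of geobeam) can pick the matching image:

import sys

def geobeam_image(name='base'):
    """Return the geobeam image that matches the local Python version."""
    suffix = '-py37' if sys.version_info[:2] == (3, 7) else ''
    return 'gcr.io/dataflow-geobeam/{}{}'.format(name, suffix)

# e.g. pass geobeam_image('example') as --worker_harness_container_image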

# run the geotiff_soilgrid example in dataflow
python -m geobeam.examples.geotiff_soilgrid \
  --gcs_url gs://geobeam/examples/AWCh3_M_sl1_250m_ll.tif \
  --dataset=examples \
  --table=soilgrid \
  --band_column=h3 \
  --runner=DataflowRunner \
  --worker_harness_container_image=gcr.io/dataflow-geobeam/example \
  --experiment=use_runner_v2 \
  --temp_location=<temp bucket> \
  --service_account_email <service account> \
  --region us-central1 \
  --max_num_workers 2 \
  --machine_type c2-standard-30 \
  --merge_blocks 64

Examples

Polygonize Raster

def run(gcs_url, options):
  # gcs_url: gs:// path to the input raster
  import apache_beam as beam
  from geobeam.io import GeotiffSource
  from geobeam.fn import format_record

  with beam.Pipeline(options=options) as p:
    (p  | 'ReadRaster' >> beam.io.Read(GeotiffSource(gcs_url))
        | 'FormatRecord' >> beam.Map(format_record, 'elev', 'float')
        | 'WriteToBigquery' >> beam.io.WriteToBigQuery('geo.dem'))

Validate and Simplify Shapefile

def run(gcs_url, options):
  # gcs_url: gs:// path to the input (zipped) shapefile
  import apache_beam as beam
  from geobeam.io import ShapefileSource
  from geobeam.fn import make_valid, filter_invalid, format_record

  with beam.Pipeline(options=options) as p:
    (p  | 'ReadShapefile' >> beam.io.Read(ShapefileSource(gcs_url))
        | 'Validate' >> beam.Map(make_valid)
        | 'FilterInvalid' >> beam.Filter(filter_invalid)
        | 'FormatRecord' >> beam.Map(format_record)
        | 'WriteToBigquery' >> beam.io.WriteToBigQuery('geo.parcel'))

A number of complete example pipelines are available in the geobeam/examples/ folder. To run them in your Google Cloud project, apply the included Terraform configuration to set up the BigQuery dataset and tables that the example pipelines use.

Open up BigQuery GeoViz to visualize your data.

Shapefile Example

The National Flood Hazard Layer loaded from a shapefile. Example pipeline: geobeam/examples/shapefile_nfhl.py.

Raster Example

The Digital Elevation Model contains elevation measurements at 1-meter resolution (values are converted to centimeters by the pipeline). Example pipeline: geobeam/examples/geotiff_dem.py.
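
The meters-to-centimeters step could look roughly like the sketch below, assuming raster elements arrive as (value, geometry) tuples. The actual geotiff_dem pipeline names this step 'ElevToCentimeters', but its implementation may differ:

import apache_beam as beam

def elev_to_centimeters(element):
    """Convert an elevation band value from meters to centimeters."""
    value, geom = element  # assumption: (band value, geometry) tuples
    return int(round(value * 100)), geom

# usage: | 'ElevToCentimeters' >> beam.Map(elev_to_centimeters)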

Included Transforms

The geobeam.fn module includes several Beam Transforms that you can use in your pipelines.

Transform                 | Description
geobeam.fn.make_valid     | Attempt to make all geometries valid
geobeam.fn.filter_invalid | Filter out invalid geometries that cannot be made valid
geobeam.fn.format_record  | Format the (props, geom) tuple received from a FileSource into a dict that can be inserted into the destination table
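
The sketch below shows roughly what these transforms do, assuming (props, geom) tuples with GeoJSON-style geometry dicts and shapely's buffer(0) repair heuristic; the real implementations in geobeam.fn may differ:

import json
from shapely.geometry import shape, mapping

def make_valid_sketch(element):
    """Attempt to repair an invalid geometry (illustrative)."""
    props, geom = element
    shp = shape(geom)
    if not shp.is_valid:
        shp = shp.buffer(0)  # common repair trick for self-intersections
    return props, mapping(shp)

def filter_invalid_sketch(element):
    """Keep only elements whose geometry is valid."""
    props, geom = element
    return shape(geom).is_valid

def format_record_sketch(element):
    """Flatten props and serialize the geometry for the destination table."""
    props, geom = element
    return dict(props, geom=json.dumps(geom))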

Execution parameters

Each FileSource accepts several parameters that you can use to configure how your data is loaded and processed. These can be parsed from pipeline arguments and passed into the respective FileSources, as seen in the example pipelines (a wiring sketch follows the table below).

Parameter      | Input type | Description                                                                                               | Default | Required?
skip_reproject | All        | True to skip reprojection during read                                                                     | False   | No
in_epsg        | All        | An EPSG integer to override the input source CRS to reproject from                                        | -       | No
band_number    | Raster     | The raster band to read from                                                                              | 1       | No
include_nodata | Raster     | True to include nodata values                                                                             | False   | No
centroid_only  | Raster     | True to only read pixel centroids                                                                         | False   | No
merge_blocks   | Raster     | Number of block windows to combine during read; larger values generate larger, better-connected polygons  | -       | No
layer_name     | Vector     | Name of the layer to read                                                                                 | -       | Yes
gdb_name       | Vector     | Name of the geodatabase directory in a gdb zip archive                                                    | -       | Yes, for GDB files
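
A minimal sketch of that wiring, assuming a zipped shapefile on GCS (the argument names follow the table above; the real example pipelines parse more options):

import argparse
import apache_beam as beam
from geobeam.io import ShapefileSource

parser = argparse.ArgumentParser()
parser.add_argument('--gcs_url')
parser.add_argument('--layer_name')
parser.add_argument('--in_epsg', type=int, default=None)
known_args, pipeline_args = parser.parse_known_args()

with beam.Pipeline(argv=pipeline_args) as p:
    features = p | beam.io.Read(ShapefileSource(
        known_args.gcs_url,
        layer_name=known_args.layer_name,
        in_epsg=known_args.in_epsg))
    # ...continue with geobeam.fn transforms and a BigQuery sink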

License

This is not an officially supported Google product, though support will be provided on a best-effort basis.

Copyright 2021 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Comments
  • Add get_bigquery_schema_dataflow

    Added get_bigquery_schema_dataflow, which reads files from Google Cloud Storage and generates schemas that can be used natively with Google Dataflow.

    opened by mbforr 7
  • Unable to get worker harness container image

    Unable to get the worker harness container images used in the examples.

    ~~➜ docker pull gcr.io/dataflow-geobeam/example-py37~~

    ~~Using default tag: latest Cannot connect to the Docker daemon at unix:///Users/jorgen.skontorp/.docker/run/docker.sock. Is the docker daemon running?~~

    ~~➜ docker pull gcr.io/dataflow-geobeam/base-py37~~

    ~~Using default tag: latest Cannot connect to the Docker daemon at unix:///Users/jorgen.skontorp/.docker/run/docker.sock. Is the docker daemon running?~~

    ~~➜ docker pull gcr.io/dataflow-geobeam/example~~

    ~~Using default tag: latest Cannot connect to the Docker daemon at unix:///Users/jorgen.skontorp/.docker/run/docker.sock. Is the docker daemon running?~~

    ~~➜ docker pull gcr.io/dataflow-geobeam/base~~

    ~~Using default tag: latest Cannot connect to the Docker daemon at unix:///Users/jorgen.skontorp/.docker/run/docker.sock. Is the docker daemon running?~~

    ➜ docker pull gcr.io/dataflow-geobeam/base
    Using default tag: latest
    Error response from daemon: manifest for gcr.io/dataflow-geobeam/base:latest not found: manifest unknown: Failed to fetch "latest" from request "/v2/dataflow-geobeam/base/manifests/latest"
    
    opened by jskontorp 5
  • Encountering errors while trying to run the geotiff examples

    tl;dr This may be an issue with my environment (I'm running Python 3.9.13), but I've had no success getting any of the examples involving gridded data (e.g., geobeam.examples.geotiff_dem) to run locally. Have these been tested recently?

    I was bumping up against TypeError: only size-1 arrays can be converted to Python scalars [while running 'ElevToCentimeters'], which I fixed by using x.astype(int) where appropriate. Then I hit TypeError: format_record() takes from 1 to 2 positional arguments but 3 were given [while running 'FormatRecords']. (Maybe this line needs to be something like 'FormatRecords' >> beam.Map(format_rasterpolygon_record, 'int', known_args.band_column) instead?) Then I got TypeError: Object of type 'ndarray' is not JSON serializable [while running 'WriteToBigQuery/BigQueryBatchFileLoads/ParDo(WriteRecordsToFile)/ParDo(WriteRecordsToFile)'] and stopped to jump in here. If it's just my environment, then I'm making changes needlessly. If it's the code, it seemed better for these fixes to be applied directly in the upstream repo.

    Any advice you have would be much appreciated!!

    bug documentation 
    opened by lzachmann 3
  • ImportError: cannot import name 'GeoJSONSource' from 'geobeam.io'

    I am fairly new to Python and Apache Beam. I used shapefile_nfhl.py as an example to create a reader for GeoJSON files, so I imported GeoJSONSource from geobeam.io (as per the documentation), but when I run the application I get the following error: ImportError: cannot import name 'GeoJSONSource' from 'geobeam.io'

    Am I missing something? I did follow the instructions to install geobeam (pip install geobeam).

    I have tried this with Python 3.7, 3.9, and 3.10. Versions 3.7 and 3.9 give this error, whereas 3.10 does not work at all due to issues installing rasterio.

    I am also running this on macOS Monterey (12.2.1)

    Here is my code:

    def run(pipeline_args, known_args): 
        import apache_beam as beam
        from apache_beam.io.gcp.internal.clients import bigquery as beam_bigquery
        from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions
        from geobeam.io import GeoJSONSource
        from geobeam.fn import format_record, make_valid, filter_invalid
    
        pipeline_options = PipelineOptions([
            '--experiments', 'use_beam_bq_sink',
        ] + pipeline_args)
    
        with beam.Pipeline(options=pipeline_options) as p:
            (p
             | beam.io.Read(GeoJSONSource(known_args.gcs_url,
                 layer_name=known_args.layer_name))
             | 'MakeValid' >> beam.Map(make_valid)
             | 'FilterInvalid' >> beam.Filter(filter_invalid)
             | 'FormatRecords' >> beam.Map(format_record)
             | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
                 beam_bigquery.TableReference(
                     datasetId=known_args.dataset,
                     tableId=known_args.table),
                 method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
                 write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                 create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
    
    
    if __name__ == '__main__':
        import logging
        import argparse
    
        logging.getLogger().setLevel(logging.INFO)
    
        parser = argparse.ArgumentParser()
        parser.add_argument('--gcs_url')
        parser.add_argument('--dataset')
        parser.add_argument('--table')
        parser.add_argument('--layer_name')
        parser.add_argument('--in_epsg', type=int, default=None)
        known_args, pipeline_args = parser.parse_known_args()
    
        run(pipeline_args, known_args)
    opened by migaelhartzenberg 3
  • Issue installing geobeam on GCP CloudShell

    Seeing the below issue while installing geobeam on GCP Cloud Shell.

    Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-07s73wh5/orjson/
    

    The version of Python used from the venv is 3.7.3.

    (env) --@cloudshell:~ $ python
    Python 3.7.3 (default, Jan 22 2021, 20:04:44)
    [GCC 8.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>>
    

    Detailed error message

    Collecting orjson<4.0; python_version >= "3.6" (from apache-beam[gcp]>=2.27.0->geobeam)
      Using cached https://files.pythonhosted.org/packages/75/cd/eac8908d0b4a4b08067bc68c04e52d7601b0ed86bf2a6a3264c46dd51a84/orjson-3.6.3.tar.gz
      Installing build dependencies ... done
        Complete output from command python setup.py egg_info:
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/usr/lib/python3.7/tokenize.py", line 447, in open
            buffer = _builtin_open(filename, 'rb')
        FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-install-07s73wh5/orjson/setup.py'
    
    opened by hashkanna 1
  • Adding ESRIServerSource and GeoJSONSource

    Hey @tjwebb wanted to send these over - still need to do some testing but wanted to run them by you first.

    GeoJSONSource - this one should be fairly straightforward as it is a single file and Fiona can read the file natively

    ESRIServerSource - I added a package that can handle the transformation of ESRI JSON to GeoJSON, as well as loop through a layer request since the ESRI REST API generally limits features that can be requested to 1000 or 2000. I can write some of this code natively or we can use the package, but not sure if we want to limit the dependencies. The package in question is here.

    https://github.com/openaddresses/pyesridump

    Also any tips for testing locally would be great!

    opened by mbforr 1
  • Unable to load 5GB tif file to bigquery

    It works fine for a 1GB tif file. While trying to load a 2GB ~ 5GB tif file, it fails with multiple errors during the write to BigQuery.

    If you would like to reproduce the errors, you can get these datasets from https://files.isric.org/soilgrids/former/2017-03-10/data/ :

    • BDRLOG_M_250m_ll.tif
    • OCDENS_M_sl1_250m_ll.tif
    • ORCDRC_M_sl1_250m_ll.tif

    "Error processing instruction process_bundle-1256. Original traceback is Traceback (most recent call last): File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 768, in apache_beam.runners.common.PerWindowInvoker.invoke_process File "apache_beam/runners/common.py", line 891, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window File "apache_beam/runners/common.py", line 1374, in apache_beam.runners.common._OutputProcessor.process_outputs File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/bigquery_file_loads.py", line 248, in process writer.write(row) File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 1396, in write return self._file_handle.write(self._coder.encode(row) + b'\n') File "/usr/local/lib/python3.8/site-packages/apache_beam/io/filesystemio.py", line 205, in write self._uploader.put(b) File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/gcsio.py", line 663, in put self._conn.send_bytes(data.tobytes()) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes self._send_bytes(m[offset:offset + size]) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 405, in _send_bytes self._send(buf) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 368, in _send n = write(self._handle, buf) BrokenPipeError: [Errno 32] Broken pipe

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last): File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 289, in _execute response = task() File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 362, in lambda: self.create_worker().do_instruction(request), request) File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 606, in do_instruction return getattr(self, request_type)( File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 644, in process_bundle bundle_processor.process_bundle(instruction_id)) File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 999, in process_bundle input_op_by_transform_id[element.transform_id].process_encoded( File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 228, in process_encoded self.output(decoded_value) File "apache_beam/runners/worker/operations.py", line 357, in apache_beam.runners.worker.operations.Operation.output File "apache_beam/runners/worker/operations.py", line 359, in apache_beam.runners.worker.operations.Operation.output File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 829, in apache_beam.runners.worker.operations.SdfProcessSizedElements.process File "apache_beam/runners/worker/operations.py", line 838, in apache_beam.runners.worker.operations.SdfProcessSizedElements.process File "apache_beam/runners/common.py", line 1247, in apache_beam.runners.common.DoFnRunner.process_with_sized_restriction File "apache_beam/runners/common.py", line 748, in apache_beam.runners.common.PerWindowInvoker.invoke_process File "apache_beam/runners/common.py", line 886, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1306, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 587, in apache_beam.runners.common.SimpleInvoker.invoke_process File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1306, in apache_beam.runners.common.DoFnRunner._reraise_augmented File 
"apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 587, in apache_beam.runners.common.SimpleInvoker.invoke_process File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1306, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 768, in apache_beam.runners.common.PerWindowInvoker.invoke_process File "apache_beam/runners/common.py", line 891, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1306, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 768, in apache_beam.runners.common.PerWindowInvoker.invoke_process File "apache_beam/runners/common.py", line 891, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1306, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 587, in apache_beam.runners.common.SimpleInvoker.invoke_process File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in 
apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1321, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "/usr/local/lib/python3.8/site-packages/future/utils/init.py", line 446, in raise_with_traceback raise exc.with_traceback(traceback) File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 768, in apache_beam.runners.common.PerWindowInvoker.invoke_process File "apache_beam/runners/common.py", line 891, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window File "apache_beam/runners/common.py", line 1374, in apache_beam.runners.common._OutputProcessor.process_outputs File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/bigquery_file_loads.py", line 248, in process writer.write(row) File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 1396, in write return self._file_handle.write(self._coder.encode(row) + b'\n') File "/usr/local/lib/python3.8/site-packages/apache_beam/io/filesystemio.py", line 205, in write self._uploader.put(b) File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/gcsio.py", line 663, in put self._conn.send_bytes(data.tobytes()) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes self._send_bytes(m[offset:offset + size]) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 405, in _send_bytes self._send(buf) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 368, in _send n = write(self._handle, buf) RuntimeError: BrokenPipeError: [Errno 32] Broken pipe [while running 'WriteToBigQuery/BigQueryBatchFileLoads/ParDo(WriteRecordsToFile)/ParDo(WriteRecordsToFile)-ptransform-3851']

    opened by aswinramakrishnan 1
  • Not able to docker or gcloud submit

    Hi Travis, thank you for creating the geobeam package for our requirement.

    I am raising an issue here to just keep track.

    While using docker build -

     ---> 56341244044b
    Step 9/23 : RUN wget -q https://curl.haxx.se/download/curl-${CURL_VERSION}.tar.gz     && tar -xzf curl-${CURL_VERSION}.tar.gz && cd curl-${CURL_VERSION}     && ./configure --prefix=/usr/local     && echo "building CURL ${CURL_VERSION}..."     && make --quiet -j$(nproc) && make --quiet install     && cd $WORKDIR && rm -rf curl-${CURL_VERSION}.tar.gz curl-${CURL_VERSION}
     ---> Running in 72bc532c27b8
    The command '/bin/sh -c wget -q https://curl.haxx.se/download/curl-${CURL_VERSION}.tar.gz     && tar -xzf curl-${CURL_VERSION}.tar.gz && cd curl-${CURL_VERSION}     && ./configure --prefix=/usr/local     && echo "building CURL ${CURL_VERSION}..."     && make --quiet -j$(nproc) && make --quiet install     && cd $WORKDIR && rm -rf curl-${CURL_VERSION}.tar.gz curl-${CURL_VERSION}' returned a non-zero code: 5
    (global-env) arama3@C02S4AEYG8WP dataflow-geobeam % 
    (global-env) arama3@C02S4AEYG8WP dataflow-geobeam % 
    (global-env) arama3@C02S4AEYG8WP dataflow-geobeam % docker image ls                                                                                                        
    REPOSITORY                                                      TAG        IMAGE ID       CREATED         SIZE
    <none>                                                          <none>     56341244044b   5 minutes ago   2.55GB
    

    While trying to run the gcloud builds submit command -

    libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferOp.lo -MD -MP -MF .deps/BufferOp.Tpo -c BufferOp.cpp  -fPIC -DPIC -o .libs/BufferOp.o
    libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferOp.lo -MD -MP -MF .deps/BufferOp.Tpo -c BufferOp.cpp -o BufferOp.o >/dev/null 2>&1
    libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferBuilder.lo -MD -MP -MF .deps/BufferBuilder.Tpo -c BufferBuilder.cpp -o BufferBuilder.o >/dev/null 2>&1
    libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferParameters.lo -MD -MP -MF .deps/BufferParameters.Tpo -c BufferParameters.cpp  -fPIC -DPIC -o .libs/BufferParameters.o
    libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferParameters.lo -MD -MP -MF .deps/BufferParameters.Tpo -c BufferParameters.cpp -o BufferParameters.o >/dev/null 2>&1
    libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferSubgraph.lo -MD -MP -MF .deps/BufferSubgraph.Tpo -c BufferSubgraph.cpp  -fPIC -DPIC -o .libs/BufferSubgraph.o
    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     
    Your build timed out. Use the [--timeout=DURATION] flag to change the timeout threshold.
    ERROR: (gcloud.builds.submit) build ee748b6f-5347-4061-a81b-7f46959086c5 completed with status "TIMEOUT"
    
    documentation customer-reported 
    opened by aswinramakrishnan 1
  • Docstring type of fn.format_record is type but takes in string

    Hi! Cool project that I want to test out, but I noticed an inconsistency between the docstring and the code. Not sure which should be followed.

    https://github.com/GoogleCloudPlatform/dataflow-geobeam/blob/21479252be373b795a5c7d6626021b01a042e5de/geobeam/fn.py#L67-L91

    The docstring should be

            band_type (str, optional): Default to int. The data type of the
                raster band column to store in the database.
    ...
        p | beam.Map(geobeam.fn.format_record,
            band_column='elev', band_type='float')
    

    or the code should be

    def format_record(element, band_column=None, band_type=int):
        import json
    
        props, geom = element
        cast = band_type
    

    Thanks!

    opened by jtmiclat 1
  • Create BQ table from shapefile

    Changes covering creation of the BQ table from the shp file schema, plus examples of loading a "generic shapefile". This avoids the headache of creating the table beforehand, so any SHP can be loaded.

    opened by Vadoid 0
  • [BEAM-12879] Issue affecting examples

    [BEAM-12879] Issue - https://issues.apache.org/jira/browse/BEAM-12879?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel

    Affects the examples when reading files from the gs://geobeam bucket. Workaround: download the zip files and put them in your own bucket.

    opened by Vadoid 0
  • Create BQ table from SHP schema

    Changes covering creation of the BQ table from the shp file schema, plus examples of loading a "generic shapefile". This avoids the headache of creating the table beforehand, so any SHP can be loaded.

    opened by Vadoid 0
  • Geobeam Util  get_bigquery_schema_dataflow only for GDB files

    As far as I understand, the current implementation of get_bigquery_schema_dataflow only works for GDB files and doesn't work for SHP files. Should we make the autoschema work for shapefiles as well via Fiona?

    opened by Vadoid 0
  • Updated the following: Dockerfile to get more recent gdal version

    • beam sdk 2.36.0 --> 2.40.0
    • addition of metview binary with the fix for debian
    • curl 7.73.0 --> 7.83.1 (openssl needed)
    • geos 3.9.0 --> 3.10.3
    • sqlite 3330000 --> 3380500
    • proj 7.2.1 --> 9.0.0 (using cmake)
    • openjpeg 2.3.1 --> 2.5.0
    • addition of hdf5 1.10.5
    • addition of netcdf-c 4.9.0
    • gdal 3.2.1 --> 3.5.1 (using cmake)
    • making sure numpy always gets installed with gdal
    • addition of gcloud components alpha

    added a longer timeout to cloudbuild.yaml because the initial build takes at least 1h20min

    opened by bahmandar 0
  • get_bigquery_schema_dataflow() issue and questions

    Hi, I am trying to use geobeam to ingest a shapefile into BigQuery, creating the table with a schema derived from the shapefile if the table does not exist. I came across a few issues and questions.

    I attempted this using a modified example shapefile_nfhl.py and ran it with this command.

    python -m shapefile_nfhl --runner DataflowRunner --project my-project --temp_location gs://mybucket-geobeam/data --region australia-southeast1 --worker_harness_container_image gcr.io/dataflow-geobeam/example --experiment use_runner_v2 --service_account_email [email protected] --gcs_url gs://geobeam/examples/510104_20170217.zip --dataset examples --table output_table
    

    Using get_bigquery_schema_dataflow() from geobeam.util throws an error due to an undefined variable.

    NameError: name 'gcs_url' is not defined
    

    I have opened a PR to fix this. #38

    Once the function is fixed, it seems that it does not accept a shapefile. Passing in the GCS URL of the zipped shapefile throws this error.

    Traceback (most recent call last):
      File "fiona/_shim.pyx", line 83, in fiona._shim.gdal_open_vector
      File "fiona/_err.pyx", line 291, in fiona._err.exc_wrap_pointer
    fiona._err.CPLE_OpenFailedError: '/vsigs/geobeam/examples/510104_20170217.zip' not recognized as a supported file format.
    

    Am I using the function in the wrong way, or are (zipped) shapefiles not supported? For reference, this is the modified template. Thank you!

    opened by muazamkamal 0
  • centroid_only = false error for a particular GeoTIFF dataset

    When ingesting this cropland dataset https://developers.google.com/earth-engine/datasets/catalog/USDA_NASS_CDL?hl=en#citations: if I set the centroid_only parameter to false, I get the following error: Failed 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 319; errors: 1. Please look into the errors[] collection for more details.' reason: 'invalid'> [while running 'WriteToBigQuery/BigQueryBatchFileLoads/WaitForDestinationLoadJobs-ptransform-31'

    Full steps to reproduce are in the 'Ingesting EE data to BQ' blog.

    opened by remylouisew 0