geobeam - adds GIS capabilities to your Apache Beam and Dataflow pipelines.

Google Cloud Platform

Last update: Nov 8, 2022

Overview

geobeam adds GIS capabilities to your Apache Beam pipelines.

What does geobeam do?

geobeam enables you to ingest and analyze massive amounts of geospatial data in parallel using Dataflow. geobeam provides a set of FileBasedSource classes that make it easy to read, process, and write geospatial data, and provides a set of helpful Apache Beam transforms and utilities that make it easier to process GIS data in your Dataflow pipelines.

See the Full Documentation for the complete API specification.

Requirements

Apache Beam 2.27+
Python 3.7+

Note: Make sure the Python version used to run the pipeline matches the version in the built container.

Supported input types

File format	Data type	Geobeam class
`tiff`	raster	`GeotiffSource`
`shp`	vector	`ShapefileSource`
`gdb`	vector	`GeodatabaseSource`

Included libraries

geobeam includes several Python modules that allow you to perform a wide variety of operations and analyses on your geospatial data.

Module	Version	Description
gdal	3.2.1	Python bindings for GDAL
rasterio	1.1.8	reads and writes geospatial raster data
fiona	1.8.18	reads and writes geospatial vector data
shapely	1.7.1	manipulation and analysis of geometric objects in the Cartesian plane

How to Use

1. Install the module

pip install geobeam

2. Write your pipeline

Write a normal Apache Beam pipeline using one of geobeam's file sources. See geobeam/examples for inspiration.

3. Run

Run locally

python -m geobeam.examples.geotiff_dem \
  --gcs_url gs://geobeam/examples/dem-clipped-test.tif \
  --dataset=examples \
  --table=dem \
  --band_column=elev \
  --centroid_only=true \
  --runner=DirectRunner \
  --temp_location <temp gs://> \
  --project <project_id>

You can also run "locally" in Cloud Shell using the py37 container variants.

Note: Some of the provided examples may take a very long time to run locally...

Run in Dataflow

Write a Dockerfile

This will run in Dataflow as a custom container based on the dataflow-geobeam/base image. See [geobeam/examples/Dockerfile] for an example that installs the latest geobeam from source.

FROM gcr.io/dataflow-geobeam/base
# FROM gcr.io/dataflow-geobeam/base-py37

RUN pip install geobeam

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

# build locally with docker
docker build -t gcr.io/<project_id>/example .
docker push gcr.io/<project_id>/example

# or build with Cloud Build
gcloud builds submit --tag gcr.io/<project_id>/<name> --timeout=3600s --machine-type=n1-highcpu-8

Start the Dataflow job

Note on Python versions

If you are starting a Dataflow job on a machine running Python 3.7, you must use the images suffixed with py37. (Cloud Shell runs Python 3.7 by default, as of Feb 2021.) A separate version of the base image is built for Python 3.7 and is available at gcr.io/dataflow-geobeam/base-py37. The Python 3.7-compatible examples image is similarly named: gcr.io/dataflow-geobeam/example-py37.
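The version rule above can be expressed as a small helper. This is a minimal sketch only: the function name `pick_base_image` is hypothetical and not part of geobeam, and it encodes nothing beyond the two base image names mentioned in this section.

```python
import sys

# Image names are taken from the note above; the helper itself is hypothetical.
DEFAULT_IMAGE = "gcr.io/dataflow-geobeam/base"
PY37_IMAGE = "gcr.io/dataflow-geobeam/base-py37"


def pick_base_image(version_info=None):
    """Return the base image that matches the launching interpreter's version."""
    major, minor = (version_info or sys.version_info)[:2]
    if (major, minor) == (3, 7):
        return PY37_IMAGE
    return DEFAULT_IMAGE


if __name__ == "__main__":
    # Prints the image to use for the interpreter running this script.
    print(pick_base_image())
```

Passing the version explicitly, for example `pick_base_image((3, 7, 3))`, makes the choice easy to check without switching interpreters.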

# run the geotiff_soilgrid example in dataflow
python -m geobeam.examples.geotiff_soilgrid \
  --gcs_url gs://geobeam/examples/AWCh3_M_sl1_250m_ll.tif \
  --dataset=examples \
  --table=soilgrid \
  --band_column=h3 \
  --runner=DataflowRunner \
  --worker_harness_container_image=gcr.io/dataflow-geobeam/example \
  --experiment=use_runner_v2 \
  --temp_location=<temp bucket> \
  --service_account_email <service account> \
  --region us-central1 \
  --max_num_workers 2 \
  --machine_type c2-standard-30 \
  --merge_blocks 64

Examples

Polygonize Raster

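def run(options, gcs_url):
  # gcs_url: gs:// URL of the input GeoTIFF, supplied by the caller
  import apache_beam as beam
  from geobeam.io import GeotiffSource
  from geobeam.fn import format_record

  # Pass options by keyword; the first positional argument of Pipeline is the runner.
  with beam.Pipeline(options=options) as p:
    (p  | 'ReadRaster' >> beam.io.Read(GeotiffSource(gcs_url))
        | 'FormatRecord' >> beam.Map(format_record, 'elev', 'float')
        | 'WriteToBigquery' >> beam.io.WriteToBigQuery('geo.dem'))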

Validate and Simplify Shapefile

def run(options, gcs_url):
  # gcs_url: gs:// URL of the zipped input shapefile, supplied by the caller
  import apache_beam as beam
  from geobeam.io import ShapefileSource
  from geobeam.fn import make_valid, filter_invalid, format_record

  # Pass options by keyword; the first positional argument of Pipeline is the runner.
  with beam.Pipeline(options=options) as p:
    (p  | 'ReadShapefile' >> beam.io.Read(ShapefileSource(gcs_url))
        | 'Validate' >> beam.Map(make_valid)
        | 'FilterInvalid' >> beam.Filter(filter_invalid)
        | 'FormatRecord' >> beam.Map(format_record)
        | 'WriteToBigquery' >> beam.io.WriteToBigQuery('geo.parcel'))

See geobeam/examples/ for complete examples.

A number of example pipelines are available in the geobeam/examples/ folder. To run them in your Google Cloud project, apply the included Terraform file to set up the BigQuery dataset and tables used by the example pipelines.

Open up BigQuery Geo Viz to visualize your data.

Shapefile Example

The National Flood Hazard Layer loaded from a shapefile. Example pipeline: geobeam/examples/shapefile_nfhl.py.

Raster Example

The Digital Elevation Model is a high-resolution model of elevation measurements at 1-meter resolution (values converted to centimeters). Example pipeline: geobeam/examples/geotiff_dem.py.

Included Transforms

The geobeam.fn module includes several functions that you can apply in your pipelines with beam.Map and beam.Filter.

Function	Description
`geobeam.fn.make_valid`	Attempt to make all geometries valid.
`geobeam.fn.filter_invalid`	Filter out invalid geometries that cannot be made valid.
`geobeam.fn.format_record`	Format the (props, geom) tuple received from a FileSource into a `dict` that can be inserted into the destination table
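To make the (props, geom) to `dict` step concrete, here is a minimal sketch of what such a formatting function might look like. It is an illustration under stated assumptions, not geobeam's implementation: the name `format_record_sketch`, the `geom` column name, and the sample feature are all hypothetical.

```python
import json


def format_record_sketch(element):
    """Merge feature properties with a JSON-encoded geometry (illustration only)."""
    props, geom = element
    # Copy so the caller's properties dict is not mutated.
    row = dict(props)
    # Assumption: the geometry is stored as a GeoJSON string in a 'geom' column.
    row["geom"] = json.dumps(geom)
    return row


# Hypothetical feature in the (props, geom) shape described in the table above.
element = ({"zone": "AE"}, {"type": "Point", "coordinates": [-93.1, 44.9]})
row = format_record_sketch(element)
print(row["zone"], row["geom"])
```

In a pipeline, a function of this shape would be applied with `beam.Map`, in the same way the examples apply `format_record`.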

Execution parameters

Each FileSource accepts several parameters that you can use to configure how your data is loaded and processed. These can be parsed as pipeline arguments and passed into the respective FileSources, as seen in the example pipelines.

Parameter	Input type	Description	Default	Required?
`skip_reproject`	All	True to skip reprojection during read	`False`	No
`in_epsg`	All	An EPSG integer to override the input source CRS to reproject from		No
`band_number`	Raster	The raster band to read from	`1`	No
`include_nodata`	Raster	True to include `nodata` values	`False`	No
`centroid_only`	Raster	True to only read pixel centroids	`False`	No
`merge_blocks`	Raster	Number of block windows to combine during read. Larger values will generate larger, better-connected polygons.		No
`layer_name`	Vector	Name of layer to read		Yes
`gdb_name`	Vector	Name of geodatabase directory in a gdb zip archive		Yes, for GDB files
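One way to parse these parameters from the command line is sketched below with `argparse`. The helper names `str_to_bool` and `parse_source_args` are hypothetical, and forwarding the resulting dict to a source as keyword arguments (for example `GeotiffSource(gcs_url, **source_kwargs)`) is an assumption based on how the example pipelines pass `layer_name`; check the API documentation for the exact constructor signatures.

```python
import argparse


def str_to_bool(value):
    """Parse command-line booleans such as --centroid_only=true."""
    return str(value).lower() in ("1", "true", "yes")


def parse_source_args(argv):
    """Split argv into the input URL, source options, and remaining Beam arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--gcs_url")
    parser.add_argument("--skip_reproject", type=str_to_bool, default=False)
    parser.add_argument("--in_epsg", type=int, default=None)
    parser.add_argument("--band_number", type=int, default=1)
    parser.add_argument("--include_nodata", type=str_to_bool, default=False)
    parser.add_argument("--centroid_only", type=str_to_bool, default=False)
    parser.add_argument("--merge_blocks", type=int, default=None)
    known_args, pipeline_args = parser.parse_known_args(argv)

    # Drop unset options so the source can fall back to its own defaults.
    source_kwargs = {
        name: value
        for name, value in vars(known_args).items()
        if name != "gcs_url" and value is not None
    }
    return known_args.gcs_url, source_kwargs, pipeline_args


gcs_url, source_kwargs, pipeline_args = parse_source_args([
    "--gcs_url", "gs://geobeam/examples/dem-clipped-test.tif",
    "--centroid_only=true",
    "--merge_blocks", "64",
    "--runner=DirectRunner",
])
```

Arguments the parser does not recognize, such as `--runner`, are left in `pipeline_args` so they can be handed to Beam's `PipelineOptions`.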

License

This is not an officially supported Google product, though support will be provided on a best-effort basis.

Copyright 2021 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Comments

Add get_bigquery_schema_dataflow

Added get_bigquery_schema_dataflow to create a schema that can read files from Google Cloud Storage and generate schemas that can be used natively with Google Dataflow

opened by mbforr 7
Unable to get worker harness container image
Unable to get the worker harness container images used in the examples.

~~➜ docker pull gcr.io/dataflow-geobeam/example-py37~~

~~Using default tag: latest Cannot connect to the Docker daemon at unix:///Users/jorgen.skontorp/.docker/run/docker.sock. Is the docker daemon running?~~

~~➜ docker pull gcr.io/dataflow-geobeam/base-py37~~

~~Using default tag: latest Cannot connect to the Docker daemon at unix:///Users/jorgen.skontorp/.docker/run/docker.sock. Is the docker daemon running?~~

~~➜ docker pull gcr.io/dataflow-geobeam/example~~

~~Using default tag: latest Cannot connect to the Docker daemon at unix:///Users/jorgen.skontorp/.docker/run/docker.sock. Is the docker daemon running?~~

~~➜ docker pull gcr.io/dataflow-geobeam/base~~

~~Using default tag: latest Cannot connect to the Docker daemon at unix:///Users/jorgen.skontorp/.docker/run/docker.sock. Is the docker daemon running?~~

➜ docker pull gcr.io/dataflow-geobeam/base
Using default tag: latest
Error response from daemon: manifest for gcr.io/dataflow-geobeam/base:latest not found: manifest unknown: Failed to fetch "latest" from request "/v2/dataflow-geobeam/base/manifests/latest"
opened by jskontorp 5
Encountering errors while trying to run geotiff examples

tl;dr This may be an issue with my environment (I'm running Python 3.9.13), but I've had no success getting any of the examples involving gridded data (e.g., geobeam.examples.geotiff_dem) to run locally. Have these been tested recently?

I was bumping up against TypeError: only size-1 arrays can be converted to Python scalars [while running 'ElevToCentimeters'], which were fixed by using x.astype(int) where appropriate. Then I hit TypeError: format_record() takes from 1 to 2 positional arguments but 3 were given [while running 'FormatRecords']. (Maybe this line needs to be something like 'FormatRecords' >> beam.Map(format_rasterpolygon_record, 'int', known_args.band_column) instead?) Then I got TypeError: Object of type 'ndarray' is not JSON serializable [while running 'WriteToBigQuery/BigQueryBatchFileLoads/ParDo(WriteRecordsToFile)/ParDo(WriteRecordsToFile)'] and stopped to jump in here. If it's just my environment, then I'm making changes needlessly. If it's the code, it seemed better for these fixes to be applied directly in the upstream repo.

Any advice you have would be much appreciated!!
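The last error in this report is typical of NumPy values reaching a JSON encoder. The following is a minimal sketch of one possible workaround, not a confirmed fix for geobeam: convert anything that exposes `tolist()` (NumPy arrays and scalars do) into plain Python values before the row is written. The helper name `to_plain` is hypothetical, and `FakeArray` is only a stand-in so the sketch runs without NumPy installed.

```python
import json


def to_plain(value):
    """Recursively convert array-like values into JSON-serializable Python objects."""
    if hasattr(value, "tolist"):
        return value.tolist()
    if isinstance(value, dict):
        return {key: to_plain(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(item) for item in value]
    return value


class FakeArray:
    """Stand-in for a NumPy array; real arrays provide tolist() as well."""

    def __init__(self, values):
        self._values = list(values)

    def tolist(self):
        return list(self._values)


# Hypothetical row containing an array-valued band column.
row = {"elev": FakeArray([12, 15]), "geom": "POINT(0 0)"}
encoded = json.dumps(to_plain(row))
```

A function like this could be applied with `beam.Map` immediately before the BigQuery write step.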
bug documentation

opened by lzachmann 3

ImportError: cannot import name 'GeoJSONSource' from 'geobeam.io'

I am fairly new to Python and Apache Beam. I used shapefile_nfhl.py as an example to create a reader for GeoJSON files, so I imported GeoJSONSource (as per the documentation) from geobeam.io, but when I run the application I get the following error: ImportError: cannot import name 'GeoJSONSource' from 'geobeam.io'

Am I missing something? I did follow the instructions to install geobeam: pip install geobeam

I have tried this with Python 3.7, 3.9 and 3.10. Versions 3.7 and 3.9 give this error, whereas 3.10 does not work at all because I get issues while installing rasterio.

I am also running this on macOS Monterey (12.2.1)

Here is my code:

def run(pipeline_args, known_args): 
    import apache_beam as beam
    from apache_beam.io.gcp.internal.clients import bigquery as beam_bigquery
    from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions
    from geobeam.io import GeoJSONSource
    from geobeam.fn import format_record, make_valid, filter_invalid

    pipeline_options = PipelineOptions([
        '--experiments', 'use_beam_bq_sink',
    ] + pipeline_args)

    with beam.Pipeline(options=pipeline_options) as p:
        (p
         | beam.io.Read(GeoJSONSource(known_args.gcs_url,
             layer_name=known_args.layer_name))
         | 'MakeValid' >> beam.Map(make_valid)
         | 'FilterInvalid' >> beam.Filter(filter_invalid)
         | 'FormatRecords' >> beam.Map(format_record)
         | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
             beam_bigquery.TableReference(
                 datasetId=known_args.dataset,
                 tableId=known_args.table),
             method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
             write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
             create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))


if __name__ == '__main__':
    import logging
    import argparse

    logging.getLogger().setLevel(logging.INFO)

    parser = argparse.ArgumentParser()
    parser.add_argument('--gcs_url')
    parser.add_argument('--dataset')
    parser.add_argument('--table')
    parser.add_argument('--layer_name')
    parser.add_argument('--in_epsg', type=int, default=None)
    known_args, pipeline_args = parser.parse_known_args()

    run(pipeline_args, known_args)

opened by migaelhartzenberg 3

Issue installing geobeam on GCP CloudShell

Seeing the below issue while installing geobeam on GCP Cloud Shell.

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-07s73wh5/orjson/

Version of Python used from venv is 3.7.3

(env) --@cloudshell:~ $ python
Python 3.7.3 (default, Jan 22 2021, 20:04:44)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

Detailed error message

Collecting orjson<4.0; python_version >= "3.6" (from apache-beam[gcp]>=2.27.0->geobeam)
  Using cached https://files.pythonhosted.org/packages/75/cd/eac8908d0b4a4b08067bc68c04e52d7601b0ed86bf2a6a3264c46dd51a84/orjson-3.6.3.tar.gz
  Installing build dependencies ... done
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/usr/lib/python3.7/tokenize.py", line 447, in open
        buffer = _builtin_open(filename, 'rb')
    FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-install-07s73wh5/orjson/setup.py'

opened by hashkanna 1

Adding ESRIServerSource and GeoJSONSource

Hey @tjwebb wanted to send these over - still need to do some testing but wanted to run them by you first.

GeoJSONSource - this one should be fairly straightforward as it is a single file and Fiona can read the file natively

ESRIServerSource - I added a package that can handle the transformation of ESRI JSON to GeoJSON, as well as loop through a layer request since the ESRI REST API generally limits features that can be requested to 1000 or 2000. I can write some of this code natively or we can use the package, but not sure if we want to limit the dependencies. The package in question is here.

https://github.com/openaddresses/pyesridump

Also any tips for testing locally would be great!
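The paging loop described above can be sketched independently of any client library. This is not the pyesridump implementation: `fetch_page` is a hypothetical callable standing in for one request to a layer's query endpoint, and the default page size of 1000 simply reflects the limit mentioned above.

```python
def iter_features(fetch_page, page_size=1000):
    """Yield features page by page until a short or empty page is returned."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        # A page shorter than the limit means the server has no more features.
        if len(page) < page_size:
            break
        offset += page_size


# Hypothetical in-memory "server" holding 2500 features, used instead of HTTP.
FEATURES = [{"id": i} for i in range(2500)]


def fake_fetch_page(offset, limit):
    return FEATURES[offset:offset + limit]


features = list(iter_features(fake_fetch_page))
print(len(features))
```

Keeping the fetch step injectable also gives a simple way to test such a source locally without network access.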

opened by mbforr 1
Unable to load 5GB tif file to bigquery

It works fine for 1GB tif file. While trying to load 2GB ~ 5GB tif file it is failing with multiple errors during write to bigquery.

If you would like to reproduce the errors, then you could get these datasets from here - https://files.isric.org/soilgrids/former/2017-03-10/data/ BDRLOG_M_250m_ll.tif OCDENS_M_sl1_250m_ll.tif ORCDRC_M_sl1_250m_ll.tif

"Error processing instruction process_bundle-1256. Original traceback is Traceback (most recent call last): File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 768, in apache_beam.runners.common.PerWindowInvoker.invoke_process File "apache_beam/runners/common.py", line 891, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window File "apache_beam/runners/common.py", line 1374, in apache_beam.runners.common._OutputProcessor.process_outputs File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/bigquery_file_loads.py", line 248, in process writer.write(row) File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 1396, in write return self._file_handle.write(self._coder.encode(row) + b'\n') File "/usr/local/lib/python3.8/site-packages/apache_beam/io/filesystemio.py", line 205, in write self._uploader.put(b) File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/gcsio.py", line 663, in put self._conn.send_bytes(data.tobytes()) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes self._send_bytes(m[offset:offset + size]) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 405, in _send_bytes self._send(buf) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 368, in _send n = write(self._handle, buf) BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 289, in _execute response = task() File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 362, in lambda: self.create_worker().do_instruction(request), request) File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 606, in do_instruction return getattr(self, request_type)( File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 644, in process_bundle bundle_processor.process_bundle(instruction_id)) File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 999, in process_bundle input_op_by_transform_id[element.transform_id].process_encoded( File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 228, in process_encoded self.output(decoded_value) File "apache_beam/runners/worker/operations.py", line 357, in apache_beam.runners.worker.operations.Operation.output File "apache_beam/runners/worker/operations.py", line 359, in apache_beam.runners.worker.operations.Operation.output File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 829, in apache_beam.runners.worker.operations.SdfProcessSizedElements.process File "apache_beam/runners/worker/operations.py", line 838, in apache_beam.runners.worker.operations.SdfProcessSizedElements.process File "apache_beam/runners/common.py", line 1247, in apache_beam.runners.common.DoFnRunner.process_with_sized_restriction File "apache_beam/runners/common.py", line 748, in apache_beam.runners.common.PerWindowInvoker.invoke_process File "apache_beam/runners/common.py", line 886, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window File "apache_beam/runners/common.py", 
line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1306, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 587, in apache_beam.runners.common.SimpleInvoker.invoke_process File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1306, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 587, in apache_beam.runners.common.SimpleInvoker.invoke_process File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File 
"apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1306, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 768, in apache_beam.runners.common.PerWindowInvoker.invoke_process File "apache_beam/runners/common.py", line 891, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1306, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 768, in apache_beam.runners.common.PerWindowInvoker.invoke_process File "apache_beam/runners/common.py", line 891, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in 
apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1306, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 587, in apache_beam.runners.common.SimpleInvoker.invoke_process File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1321, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "/usr/local/lib/python3.8/site-packages/future/utils/init.py", line 446, in raise_with_traceback raise exc.with_traceback(traceback) File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 768, in apache_beam.runners.common.PerWindowInvoker.invoke_process File "apache_beam/runners/common.py", line 891, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window File "apache_beam/runners/common.py", line 1374, in apache_beam.runners.common._OutputProcessor.process_outputs File 
"/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/bigquery_file_loads.py", line 248, in process writer.write(row) File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 1396, in write return self._file_handle.write(self._coder.encode(row) + b'\n') File "/usr/local/lib/python3.8/site-packages/apache_beam/io/filesystemio.py", line 205, in write self._uploader.put(b) File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/gcsio.py", line 663, in put self._conn.send_bytes(data.tobytes()) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes self._send_bytes(m[offset:offset + size]) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 405, in _send_bytes self._send(buf) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 368, in _send n = write(self._handle, buf) RuntimeError: BrokenPipeError: [Errno 32] Broken pipe [while running 'WriteToBigQuery/BigQueryBatchFileLoads/ParDo(WriteRecordsToFile)/ParDo(WriteRecordsToFile)-ptransform-3851']

opened by aswinramakrishnan 1

Not able to docker or gcloud submit

Hi Travis, Thank you for creating geobeam package for our requirement.

I am raising an issue here to just keep track.

While using docker build -

 ---> 56341244044b
Step 9/23 : RUN wget -q https://curl.haxx.se/download/curl-${CURL_VERSION}.tar.gz     && tar -xzf curl-${CURL_VERSION}.tar.gz && cd curl-${CURL_VERSION}     && ./configure --prefix=/usr/local     && echo "building CURL ${CURL_VERSION}..."     && make --quiet -j$(nproc) && make --quiet install     && cd $WORKDIR && rm -rf curl-${CURL_VERSION}.tar.gz curl-${CURL_VERSION}
 ---> Running in 72bc532c27b8
The command '/bin/sh -c wget -q https://curl.haxx.se/download/curl-${CURL_VERSION}.tar.gz     && tar -xzf curl-${CURL_VERSION}.tar.gz && cd curl-${CURL_VERSION}     && ./configure --prefix=/usr/local     && echo "building CURL ${CURL_VERSION}..."     && make --quiet -j$(nproc) && make --quiet install     && cd $WORKDIR && rm -rf curl-${CURL_VERSION}.tar.gz curl-${CURL_VERSION}' returned a non-zero code: 5
(global-env) arama3@C02S4AEYG8WP dataflow-geobeam % 
(global-env) arama3@C02S4AEYG8WP dataflow-geobeam % 
(global-env) arama3@C02S4AEYG8WP dataflow-geobeam % docker image ls                                                                                                        
REPOSITORY                                                      TAG        IMAGE ID       CREATED         SIZE
<none>                                                          <none>     56341244044b   5 minutes ago   2.55GB

while trying to do gcloud submit command -

libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferOp.lo -MD -MP -MF .deps/BufferOp.Tpo -c BufferOp.cpp  -fPIC -DPIC -o .libs/BufferOp.o
libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferOp.lo -MD -MP -MF .deps/BufferOp.Tpo -c BufferOp.cpp -o BufferOp.o >/dev/null 2>&1
libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferBuilder.lo -MD -MP -MF .deps/BufferBuilder.Tpo -c BufferBuilder.cpp -o BufferBuilder.o >/dev/null 2>&1
libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferParameters.lo -MD -MP -MF .deps/BufferParameters.Tpo -c BufferParameters.cpp  -fPIC -DPIC -o .libs/BufferParameters.o
libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferParameters.lo -MD -MP -MF .deps/BufferParameters.Tpo -c BufferParameters.cpp -o BufferParameters.o >/dev/null 2>&1
libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferSubgraph.lo -MD -MP -MF .deps/BufferSubgraph.Tpo -c BufferSubgraph.cpp  -fPIC -DPIC -o .libs/BufferSubgraph.o
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 
Your build timed out. Use the [--timeout=DURATION] flag to change the timeout threshold.
ERROR: (gcloud.builds.submit) build ee748b6f-5347-4061-a81b-7f46959086c5 completed with status "TIMEOUT"

documentation customer-reported

opened by aswinramakrishnan 1

Docstring type of fn.format_record is type but takes in string
Hi! Cool project that I want to test out but I noticed an inconsistency with the docstring and the code. Not sure which should be followed.

https://github.com/GoogleCloudPlatform/dataflow-geobeam/blob/21479252be373b795a5c7d6626021b01a042e5de/geobeam/fn.py#L67-L91

The docstring should be

band_type (str, optional): Default to int. The data type of the raster band column to store in the database.
...
p | beam.Map(geobeam.fn.format_record, band_column='elev', band_type='float')

or the code should be

def format_record(element, band_column=None, band_type=int):
    import json
    props, geom = element
    cast = band_type

Thanks!
opened by jtmiclat 1
Create BQ table from shapefile

Changes covering creation of the BQ table from the shp file schema + examples of loading "generic shapefile". By doing so we avoid the headache of table creation before, so any SHP could be loaded.

opened by Vadoid 0
[BEAM-12879] Issue affecting examples

[BEAM-12879] Issue - https://issues.apache.org/jira/browse/BEAM-12879?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel

This affects the examples when reading files from the gs://geobeam bucket. Workaround: download the zip files and put them in your own bucket.

opened by Vadoid 0
Create BQ table from SHP schema

Changes covering creation of the BQ table from the shp file schema + examples of loading "generic shapefile". By doing so we avoid the headache of table creation before, so any SHP could be loaded.

opened by Vadoid 0
Geobeam Util get_bigquery_schema_dataflow only for GDB files

As far as I understand current implementation of get_bigquery_schema_dataflow only works for GDB files and doesn't work for SHP files. Should we make the autoschema work for shapefiles as well via Fiona?
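As a rough illustration of the idea, the sketch below maps a Fiona-style schema dict to a list of BigQuery field definitions. It is not the `get_bigquery_schema_dataflow` implementation: the function name `schema_from_fiona`, the type mapping, the `geom` column name, and the sample property names are all assumptions made for the example.

```python
# Assumed mapping from Fiona property types to BigQuery column types.
FIONA_TO_BQ = {
    "int": "INT64",
    "float": "FLOAT64",
    "str": "STRING",
    "date": "DATE",
    "time": "TIME",
    "datetime": "DATETIME",
}


def schema_from_fiona(fiona_schema, geom_column="geom"):
    """Build a list of BigQuery field dicts from a Fiona-style schema dict."""
    fields = []
    for name, fiona_type in fiona_schema["properties"].items():
        base_type = fiona_type.split(":")[0]  # 'str:80' -> 'str'
        # Unknown types fall back to STRING in this sketch.
        fields.append({"name": name, "type": FIONA_TO_BQ.get(base_type, "STRING")})
    # The geometry is written to a single GEOGRAPHY column.
    fields.append({"name": geom_column, "type": "GEOGRAPHY"})
    return fields


# Hypothetical schema in the shape Fiona exposes on an opened collection.
example = {
    "geometry": "Polygon",
    "properties": {"FLD_ZONE": "str:17", "DEPTH": "float:24.15", "OBJECTID": "int:10"},
}
schema = schema_from_fiona(example)
```

With Fiona installed, the input dict would come from the `schema` attribute of an opened shapefile collection rather than a literal.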

opened by Vadoid 0
Updated the following: Dockerfile to get more recent gdal version

beam sdk 2.36.0 --> beam sdk 2.40.0
Addition of metview binary with the fix for debian
curl 7.73.0 --> curl 7.83.1 (open ssl needed)
geos 3.9.0 --> geos 3.10.3
sqlite 3330000 --> sqlite 3380500
proj 7.2.1 --> proj 9.0.0 (using cmake)
openjpeg 2.3.1 --> openjpeg 2.5.0
addition of hdf5 1.10.5
addition of netcdf-c 4.9.0
gdal 3.2.1 --> gdal 3.5.1 (using cmake)
making sure numpy always gets installed with gdal
addition gcloud components alpha

Added a longer timeout to cloudbuild.yaml because the initial build takes at least 1h20min.

opened by bahmandar 0
get_bigquery_schema_dataflow() issue and questions
Hi, I am trying to use geobeam to ingest a shapefile into BigQuery, creating the table with a schema from the shapefile if the table does not exist. I came across a few issues and questions.

I attempted this using a modified version of the shapefile_nfhl.py example, and ran it with this command:

python -m shapefile_nfhl \
  --runner DataflowRunner \
  --project my-project \
  --temp_location gs://mybucket-geobeam/data \
  --region australia-southeast1 \
  --worker_harness_container_image gcr.io/dataflow-geobeam/example \
  --experiment use_runner_v2 \
  --service_account_email <service account> \
  --gcs_url gs://geobeam/examples/510104_20170217.zip \
  --dataset examples \
  --table output_table

Using get_bigquery_schema_dataflow() from geobeam.util throws an error due to an undefined variable.

NameError: name 'gcs_url' is not defined

I have opened a PR to fix this. #38

Once the function is fixed, it seems that it does not accept a shapefile. Passing in the GCS URL of the zipped shapefile throws this error.

Traceback (most recent call last):
  File "fiona/_shim.pyx", line 83, in fiona._shim.gdal_open_vector
  File "fiona/_err.pyx", line 291, in fiona._err.exc_wrap_pointer
fiona._err.CPLE_OpenFailedError: '/vsigs/geobeam/examples/510104_20170217.zip' not recognized as a supported file format.

Am I using the function the wrong way, or are (zipped) shapefiles not supported for this? For reference, this is the modified template. Thank you!
opened by muazamkamal 0
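One possible reading of the traceback above is that the zip archive was handed to GDAL as a plain `/vsigs/` path, which GDAL then tries to open as a dataset in its own right. GDAL's virtual file systems can be chained, so a zipped file on GCS is normally addressed by putting `/vsizip/` in front of the `/vsigs/` path. The helper below is only a sketch of that path construction; `gcs_zip_to_vsi_path` and its `inner_path` parameter are hypothetical names, not part of geobeam, and whether this resolves the issue in get_bigquery_schema_dataflow() has not been verified here.

```python
def gcs_zip_to_vsi_path(gcs_url, inner_path=None):
    """Build a GDAL virtual file system path for a dataset stored on GCS.

    Hypothetical helper, not part of geobeam. Zip archives get a chained
    /vsizip/ prefix; other files get a plain /vsigs/ path.
    """
    if not gcs_url.startswith("gs://"):
        raise ValueError("expected a gs:// URL, got %r" % gcs_url)

    # gs://bucket/key -> /vsigs/bucket/key
    vsigs_path = "/vsigs/" + gcs_url[len("gs://"):]

    if not gcs_url.lower().endswith(".zip"):
        return vsigs_path
    if inner_path is None:
        # Chained form: GDAL opens the archive root like a directory.
        return "/vsizip/" + vsigs_path
    # Explicit form: braces mark where the archive path ends.
    return "/vsizip/{" + vsigs_path + "}/" + inner_path


zipped = gcs_zip_to_vsi_path("gs://geobeam/examples/510104_20170217.zip")
print(zipped)
```

With the URL from the command above, this yields a path starting with `/vsizip//vsigs/`, which could then be passed to `fiona.open()` in an environment where GDAL has GCS credentials configured. If the archive holds several layers, passing `inner_path` (for example a `.shp` file name inside the zip, which is an assumption about the archive's contents) selects one explicitly.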
centroid_only = false error for a particular GeoTIFF dataset

When ingesting this cropland dataset (https://developers.google.com/earth-engine/datasets/catalog/USDA_NASS_CDL?hl=en#citations) with the centroid_only parameter set to false, I get the following error:

Failed 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 319; errors: 1. Please look into the errors[] collection for more details.' reason: 'invalid'> [while running 'WriteToBigQuery/BigQueryBatchFileLoads/WaitForDestinationLoadJobs-ptransform-31'

Full steps to reproduce are in the 'Ingesting EE data to BQ' blog.

opened by remylouisew 0

Releases(v0.1.0)

v0.1.0(Feb 4, 2021)


Owner

Google Cloud Platform

GitHub

OSMnx: Python for street networks. Retrieve, model, analyze, and visualize street networks and other spatial data from OpenStreetMap.

OSMnx OSMnx is a Python package that lets you download geospatial data from OpenStreetMap and model, project, visualize, and analyze real-world street

4k Jan 8, 2023

Read and write rasters in parallel using Rasterio and Dask

dask-rasterio dask-rasterio provides some methods for reading and writing rasters in parallel using Rasterio and Dask arrays. Usage Read a multiband r

85 Aug 30, 2022

Location field and widget for Django. It supports Google Maps, OpenStreetMap and Mapbox

django-location-field Let users pick locations using a map widget and store its latitude and longitude. Stable version: django-location-field==2.1.0 D

481 Dec 29, 2022

Deal with Bing Maps Tiles and Pixels / WGS 84 coordinates conversions, and generate grid Shapefiles

PyBingTiles This is a small toolkit in order to deal with Bing Tiles, used i.e. by Facebook for their Data for Good datasets. Install Clone this repos

1 Dec 8, 2021

A short term landscape evolution using a path sampling method to solve water and sediment flow continuity equations and model mass flows over complex topographies.

r.sim.terrain A short-term landscape evolution model that simulates topographic change for both steady state and dynamic flow regimes across a range o

7 Oct 21, 2022

Using Global fishing watch's data to build a machine learning model that can identify illegal fishing and poaching activities through satellite and geo-location data.

3 May 6, 2022

This repository contains the scripts to derivate the ENU and ECEF coordinates from the longitude, latitude, and altitude values encoded in the NAD83 coordinates.

1 Feb 7, 2022

A Django application that provides country choices for use with forms, flag icons static files, and a country field for models.

Django Countries A Django application that provides country choices for use with forms, flag icons static files, and a country field for models. Insta

1.2k Jan 3, 2023



Python bindings and utilities for GeoJSON

Manipulation and analysis of geometric objects

Python interface to PROJ (cartographic projections and coordinate transformations library)

Rasterio reads and writes geospatial raster datasets

Fiona reads and writes geographic data files

Documentation and samples for ArcGIS API for Python

Search and download Copernicus Sentinel satellite images

python toolbox for visualizing geographical data and making maps

geemap - A Python package for interactive mapping with Google Earth Engine, ipyleaflet, and ipywidgets.
