PySpark bindings for H3, a hierarchical hexagonal geospatial indexing system

Kevin Schaich

Last update: Dec 24, 2022

Related tags

Data Analysis python uber h3 geocoding geospatial gis hexagonal-architecture

Overview

h3-pyspark: Uber's H3 Hexagonal Hierarchical Geospatial Indexing System in PySpark

PySpark bindings for the H3 core library.

For available functions, please see the vanilla Python binding documentation at:

uber.github.io/h3-py

Installation

From PyPI:

pip install h3-pyspark

From conda:

conda config --add channels conda-forge
conda install h3-pyspark

Usage

>>> from pyspark.sql import SparkSession, functions as F
>>> import h3_pyspark
>>>
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame([{"lat": 37.769377, "lng": -122.388903, 'resolution': 9}])
>>>
>>> df = df.withColumn('h3_9', h3_pyspark.geo_to_h3('lat', 'lng', 'resolution'))
>>> df.show()

+---------+-----------+----------+---------------+
|      lat|        lng|resolution|           h3_9|
+---------+-----------+----------+---------------+
|37.769377|-122.388903|         9|89283082e73ffff|
+---------+-----------+----------+---------------+

Publishing

Bump version in setup.cfg
Publish:

python3 -m build
python3 -m twine upload --repository pypi dist/*

Comments

'TypeError: must be real number, not NoneType' when using h3_pyspark

Hi, I have the following Spark dataframe. The column of H3 indices is created by passing the lat/lng pairs and the resolution to h3_pyspark.geo_to_h3(lat, lng, resolution). However, I get the following error when I check whether the index column contains any nulls. It is not only isNull() that fails: every other subsetting operation throws the same error. Could anyone provide some insight into what the issue might be and how to fix it? Thanks in advance!

dataframe:

errors:
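
One possible explanation, stated as an assumption because the dataframe and traceback are not reproduced here: Spark evaluates UDFs lazily, so a null lat, lng, or resolution in any row only surfaces once an action forces the index column to be computed. The message itself is the one CPython raises when a C-level float conversion receives None. The sketch below reproduces that message in plain Python using math.radians as a stand-in (it is not the H3 call) and shows a simple guard; to_radians_or_none and conversion_error_message are hypothetical helpers.

```python
import math


def to_radians_or_none(value):
    """Hypothetical guard: pass None through instead of converting it."""
    if value is None:
        return None
    return math.radians(value)


def conversion_error_message(value):
    """Return the TypeError message raised for `value`, or None if it converts."""
    try:
        math.radians(value)  # stand-in for any C-level float conversion
    except TypeError as exc:
        return str(exc)
    return None


message = conversion_error_message(None)
print(message)                    # the TypeError text for a None input
print(to_radians_or_none(None))   # None, no exception
print(to_radians_or_none(180.0))  # pi
```

If this is the cause, filtering out rows with null inputs before calling geo_to_h3, or upgrading to a release that handles nulls inside the UDFs, should avoid the error.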

opened by Tingmi 5
Fix indexing for polygons and lines

Catches some edge cases that h3_line and polyfill would miss. The result could be overbroad, which is why the docstrings now say "superset", but at least it should be complete.

opened by rwaldman 1
Better error handling when null values are passed in
Currently, for all UDFs, a null value in any row of your dataframe causes the entire build to fail.

Behavior like this would be more resilient:

@F.udf(T.ArrayType(T.StringType()))
def index_shape(geometry, resolution):
    if geometry is None:
        return None
    return _index_shape(geometry, resolution)
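
The same null guard could be factored into a reusable decorator so that each UDF does not repeat the check. The following is a minimal sketch in plain Python, so it can be tried without Spark. The name skip_nulls and the describe_cell placeholder are hypothetical and are not taken from the library.

```python
import functools


def skip_nulls(func):
    """Hypothetical decorator: return None if any argument is None."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if any(a is None for a in args) or any(v is None for v in kwargs.values()):
            return None
        return func(*args, **kwargs)

    return wrapper


@skip_nulls
def describe_cell(lat, lng, resolution):
    # Placeholder body standing in for a real H3 call.
    return f"{lat:.4f},{lng:.4f}@{resolution}"


print(describe_cell(37.769377, -122.388903, 9))  # 37.7694,-122.3889@9
print(describe_cell(None, -122.388903, 9))       # None
```

In a Spark job, a decorator like this would sit directly beneath @F.udf(...), so that the UDF wraps the already-guarded function.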
opened by kevinschaich 1
Fix bug in index_shape function which missed hexes for long line segments

Fixes #8

Previous behavior for problematic line:

New behavior for same line:

Previous behavior for problematic polygon:

New behavior for same polygon:

cc: @deankieserman @rwaldman

opened by kevinschaich 0
Bug in index_shape function which misses several hexes

Reported by @rwaldman – we can miss several hexes in the worst case if a line's start and endpoints are east-to-west and towards the north or south edge:

The proposed solution is, for long line segments (length ≥ s, where s is the hex side length), to interpolate several points along the line based on the selected resolution, so that we catch the hexes in between:
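
The interpolation idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's implementation: densify_segment is a hypothetical helper, max_step is a hypothetical spacing in degrees that a real implementation would derive from the hex side length at the chosen resolution, and the interpolation is linear in lat/lng space, ignoring Earth curvature and the antimeridian.

```python
import math


def densify_segment(start, end, max_step):
    """Return points from start to end, spaced at most max_step apart.

    start, end: (lat, lng) tuples in degrees.
    max_step: maximum spacing in degrees (hypothetical parameter; a real
    implementation would derive it from the hex side length at the chosen
    resolution).
    """
    if max_step <= 0:
        raise ValueError("max_step must be positive")
    d_lat = end[0] - start[0]
    d_lng = end[1] - start[1]
    length = math.hypot(d_lat, d_lng)
    # Number of equal sub-segments needed so that none exceeds max_step.
    pieces = max(1, math.ceil(length / max_step))
    return [
        (start[0] + d_lat * i / pieces, start[1] + d_lng * i / pieces)
        for i in range(pieces + 1)
    ]


points = densify_segment((0.0, 0.0), (0.0, 1.0), 0.25)
print(len(points))            # 5
print(points[0], points[-1])  # (0.0, 0.0) (0.0, 1.0)
```

Each interpolated point would then be indexed at the selected resolution and the resulting cells deduplicated.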

opened by kevinschaich 0

polyfill fails with valid multipolygon geojson

h3_pyspark.polyfill fails when a valid MultiPolygon GeoJSON is provided. This is expected behavior when using the native h3 library.

However, I thought it would be helpful if this library could accept MultiPolygons. Could I get permission to push a PR?

implementation in src/h3_pyspark/__init__.py

@F.udf(returnType=T.ArrayType(T.StringType()))
@handle_nulls
def polyfill(polygons, res, geo_json_conformant):
    # NOTE: this behavior differs from the default
    # h3-pyspark expects the `polygons` argument to be a valid GeoJSON string
    polygons = json.loads(polygons)
    type_ = polygons["type"].lower()
    if type_ == "multipolygon":
        output = []
        for i in polygons["coordinates"]:
            _polygon = {"type": "Polygon", "coordinates": i}
            output.extend(list(h3.polyfill(_polygon, res, geo_json_conformant)))
        return sanitize_types(output)
    return sanitize_types(h3.polyfill(polygons, res, geo_json_conformant))
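
The splitting step above can be tried on its own, without h3 or Spark. The sketch below isolates it; split_multipolygon is a hypothetical helper name, and the coordinates are made-up triangles used only for illustration.

```python
import json


def split_multipolygon(geojson_str):
    """Return a list of GeoJSON Polygon dicts for a Polygon or MultiPolygon string."""
    geometry = json.loads(geojson_str)
    kind = geometry["type"].lower()
    if kind == "polygon":
        return [geometry]
    if kind == "multipolygon":
        # Each entry in a MultiPolygon's coordinates is one polygon's ring list.
        return [
            {"type": "Polygon", "coordinates": rings}
            for rings in geometry["coordinates"]
        ]
    raise ValueError(f"unsupported geometry type: {geometry['type']}")


example = json.dumps({
    "type": "MultiPolygon",
    "coordinates": [
        [[[0, 0], [1, 0], [1, 1], [0, 0]]],
        [[[2, 2], [3, 2], [3, 3], [2, 2]]],
    ],
})
parts = split_multipolygon(example)
print(len(parts))        # 2
print(parts[0]["type"])  # Polygon
```

Note that if the member polygons overlap, filling each one separately and concatenating the results can yield duplicate cells, so collecting the output in a set may be preferable.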

test in tests/test_core.py

multipolygon = '{"type": "MultiPolygon","coordinates": [[[[108.98309290409088,13.240363245242063],[108.98343622684479,13.240363245242063],[108.98343622684479,13.240634779729014],[108.98309290409088,13.240634779729014],[108.98309290409088,13.240363245242063]]],[[[108.98349523544312,13.240002939397714],[108.98389220237732,13.240002939397714],[108.98389220237732,13.240269252464502],[108.98349523544312,13.240269252464502],[108.98349523544312,13.240002939397714]]]]}'

def test_polyfill_multipolygon(self):
        h3_test_args, h3_pyspark_test_args = get_test_args(h3.polyfill)
        print(h3_pyspark_test_args)
        integer = 12
        data = {
            "res": integer,
            "geo_json_conformant": True,
            "geojson": multipolygon,
        }
        df = spark.createDataFrame([data])
        actual = df.withColumn("actual", h3_pyspark.polyfill(*h3_pyspark_test_args))
        actual = actual.collect()[0]["actual"]
        print(actual)
        expected = []
        for i in json.loads(multipolygon)["coordinates"]:
            _polygon = {"type": "Polygon", "coordinates": i}
            expected.extend(list(h3.polyfill(_polygon, integer, True)))
        expected = sanitize_types(expected)
        assert sort(actual) == sort(expected)

opened by kangeugine 0

Releases(1.2.6)

1.2.6(Mar 10, 2022)
Add edge cases for lines (#11)

Full Changelog: https://github.com/kevinschaich/h3-pyspark/compare/1.2.5...1.2.6
1.2.4(Mar 4, 2022)
What's Changed

Handle null values in inputs to UDFs by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/10

Full Changelog: https://github.com/kevinschaich/h3-pyspark/compare/1.2.3...1.2.4
1.2.3(Feb 24, 2022)
What's Changed

Add error handling for bad geometries by @deankieserman in https://github.com/kevinschaich/h3-pyspark/pull/3

Fix bug in index_shape function which missed hexes for long line segments by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/9

New Contributors

@deankieserman made their first contribution in https://github.com/kevinschaich/h3-pyspark/pull/3

Full Changelog: https://github.com/kevinschaich/h3-pyspark/compare/1.2.2...1.2.3
1.2.2(Jan 5, 2022)

1.1.0(Dec 8, 2021)
What's Changed

Create LICENSE by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/1

Add extension functions (index_shape, k_ring_distinct) for spatial indexing & buffers by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/2

New Contributors

@kevinschaich made their first contribution in https://github.com/kevinschaich/h3-pyspark/pull/1

Full Changelog: https://github.com/kevinschaich/h3-pyspark/commits/1.1.0

Owner

Kevin Schaich

Solving awesome problems @palantir. Part-time open source junkie. Purveyor of hot coffee and thoughtful photographs.

GitHub https://uber.github.io/h3-py/intro.html

