Raster processing benchmarks for Python and R packages

Krzysztof Dyba

Last update: Oct 24, 2022

Related tags

Geolocation python benchmark performance r gis raster

Overview

Raster processing benchmarks

This repository contains a collection of raster processing benchmarks for Python and R packages. The tests cover the most common operations such as loading data, extracting values by points, downsampling, calculating NDVI, writing multilayer, cropping by extent and calculating zonal statistics. The comparison is made from the user's perspective (the simplest functions are used and the code is not optimized), so the results do not represent the best performance.

The detailed results are available at https://kadyb.github.io/raster-benchmark/report.html.

Software

Python:

Reproduction

Download raster data (851 MB) from Google Drive or Earth Explorer (original source, registration required) and then unzip to data/.
Run all benchmarks using batch script (run_benchmarks.sh) or single benchmarks files.

Batch script

cd raster-benchmark
./run_benchmarks.sh

Single benchmark

Rscript stars/crop.R

python3 rasterio/crop.py

Dataset

Landsat 8 satellite scene (10 bands, 30 m resolution, 7771 x 7871 pixels) was used for tests.

Scene ID: LC08_L1TP_190024_20200418_20200822_02_T1

Hardware configuration

CPU: Intel Xeon CPU E5-2620 v2 @ 2.10GHz
RAM: 64 GB
OS: Pop!_OS 20.04 LTS

Acknowledgment

Landsat-8 image courtesy of the U.S. Geological Survey, https://earthexplorer.usgs.gov/

Comments

Use vectorized selection for rioxarray

Hi,

I came across this benchmark while comparing rasterstats and rioxarray. Looking into the rioxarray/extract_points.py, I found that it is using iteration instead of much faster vectorized selection. This results in 1000x speedup.

I have opened a PR that shows this improvement. Would be good to test and merge. https://github.com/kadyb/raster-benchmark/pull/13

opened by spatialthoughts 5
Use vectorized selection with xarray

Use vectorized selection instead of iteration. This gives more than 1000x speedup.

Tested on my machine.

Existing code task;package;time extract-points;rioxarray;44,26 extract-points;rioxarray;44,79 extract-points;rioxarray;46,4 extract-points;rioxarray;45,64 extract-points;rioxarray;46,77 extract-points;rioxarray;46,94 extract-points;rioxarray;47,95 extract-points;rioxarray;45,25 extract-points;rioxarray;45,09 extract-points;rioxarray;43,81

Updated code task;package;time extract-points-update;rioxarray;0,1 extract-points-update;rioxarray;0,08 extract-points-update;rioxarray;0,08 extract-points-update;rioxarray;0,08 extract-points-update;rioxarray;0,08 extract-points-update;rioxarray;0,07 extract-points-update;rioxarray;0,07 extract-points-update;rioxarray;0,06 extract-points-update;rioxarray;0,06 extract-points-update;rioxarray;0,06

opened by spatialthoughts 1

crop timings using GDAL

### files
input = "stack.tif"
output = "crop.tif"

### create polygon
library(sf)
bbox = st_bbox(c(xmin = 598500, xmax = 727500, ymax = 5781000, ymin = 5682100),
               crs = st_crs(32633))
bbox = st_as_sfc(bbox)
write_sf(bbox, "bbox.gpkg")

### GDAL translate
coords = "598500 5781000 727500 5682100" # xmin ymax xmax ymin
main = "gdal_translate -projwin"
cmd = paste(main, coords, input, output)

if (file.exists(output)) file.remove(output)
system.time(system(cmd))
#>  user  system elapsed
#> 4.501   0.876   5.403

### GDAL warp
coords = "598500 5682100 727500 5781000" # xmin ymin xmax ymax
main = "gdalwarp -te"
cmd = paste(main, coords, input, output)

if (file.exists(output)) file.remove(output)
system.time(system(cmd))
#>  user  system elapsed
#> 5.717   0.951   6.680

### GDAL warp (polygon)
main = "gdalwarp -cutline"
polygon = "bbox.gpkg"
cmd = paste(main, polygon, input, output)

if (file.exists(output)) file.remove(output)
system.time(system(cmd))
#>   user  system elapsed
#> 14.502   2.437  17.008

### GDAL warp (polygon extent)
cmd = paste(main, polygon, "-crop_to_cutline", input, output)

if (file.exists(output)) file.remove(output)
system.time(system(cmd))
#>  user  system elapsed
#> 6.183   1.211   7.451

opened by kadyb 1

Test data.table to calculate NDVI

It looks like data.table may be the fastest tool for arithmetic operations on rasters in R.

library(data.table)

rasters = list.files("data/LC08_L1TP_190024_20200418_20200822_02_T1/",
                     pattern = "\\.TIF$", full.names = TRUE)

#########################
### stars

system.time({ras = stars::read_stars(rasters, along = 3, proxy = FALSE)})
#>   user  system elapsed
#> 41.970  13.556  55.622

names(ras) = "landsat"
ras_names = c("B1", "B10", "B11", "B2", "B3", "B4", "B5", "B6", "B7", "B9")
ras = stars::st_set_dimensions(ras, 3, values = ras_names, names = "band")

mat = ras$landsat
dim(mat) = c(dim(mat)[1] * dim(mat)[2], dim(mat)[3])
colnames(mat) = ras_names

system.time({dt = as.data.table(mat)})
#>  user  system elapsed
#> 9.201   5.484  14.711

#########################
### terra

ras = terra::rast(rasters)
ras_names = c("B1", "B10", "B11", "B2", "B3", "B4", "B5", "B6", "B7", "B9")
names(ras) = ras_names

system.time({dt = as.data.table(terra::values(ras))})
#>   user  system elapsed
#> 31.897  18.470  50.534

#########################
### NDVI data.table

t_vec = numeric(20)
for (i in seq_len(20)) {

  t = system.time((dt$B5 - dt$B4) / (dt$B5 + dt$B4))
  t_vec[i] = unname(t["elapsed"])

}

mean(t_vec)
#> 1.2864
sd(t_vec)
#> 0.132985

opened by kadyb 1

Arithmetic operations are faster for vectors in R

One interesting observation is that arithmetic operations (e.g. computing NDVI) are faster for vector/matrices than for raster objects (stars or SpatRaster) in R. At this moment, this is especially true for the terra package, where it is probably better to convert the raster to a matrix, perform calculations, and then assign the results back to the raster object.

library(stars)
library(terra)

ndvi = function(red, nir) {(nir - red) / (nir + red)}

rasters = list.files("data/LC08_L1TP_190024_20200418_20200822_02_T1/",
                     pattern = "\\.TIF$", full.names = TRUE)


### calculate NDVI using 'stars' objects
ras = read_stars(rasters, along = 3, proxy = FALSE)

red = adrop(ras[,,,6])
nir = adrop(ras[,,,7])

system.time(ndvi(red, nir))
#>  user  system elapsed
#> 0.756   0.711   1.467


### calculate NDVI using 'SpatRaster' objects
ras = rast(rasters)
ras = ras * 1

red = ras[[6]]
nir = ras[[7]]

system.time(ndvi(red, nir))
#>  user  system elapsed
#> 4.141   3.320   7.462


### calculate NDVI using vectors
vred = values(red, mat = FALSE)
vnir = values(nir, mat = FALSE)

system.time(ndvi(vred, vnir))
#>  user  system elapsed
#> 0.558   0.209   0.767


### calculate NDVI using vectors
### and assign values to new 'SpatRaster'
new = rast(red)
system.time(new <- setValues(new, ndvi(vred, vnir)))
#>  user  system elapsed
#> 0.989   0.768   1.758

opened by kadyb 0

Feedback
[x] Add a new task with reading all data into the memory

[x] Load all data into memory in terra and raster so that the results will be fair to the other packages

[x] Add link to this repository in main plot

NDVI:

[x] Remove data.table package

[x] Preselect RED and NIR bands in R packages

[x] Create appendix with timings where vector/matrix are used (not raster objects) in R: #9

[x] Load only two bands to reduce testing time overall
opened by kadyb 0
raster and terra are slower on better PC
Benchmarks were run on two different hardware configurations. Almost all packages noted better times on the newer PC (test 2) except selected functions from raster and terra.

# test 1 hardware CPU: Intel Core i5 3210M 2.50 GHz RAM: 8 GB OS: Windows 8.1 # test 2 hardware CPU: Intel Xeon CPU E5-2620 v2 @ 2.10GHz RAM: 64 GB OS: Pop!_OS 20.04 LTS

Timings are given as medians of 10 iterations. task | package | test 1 | test 2 -- | -- | -- | -- extract-points | raster | 22.3 | 24.5 crop | terra | 6.3 | 8.3 extract-points | terra | 6.7 | 7.2 write | terra | 38.9 | 42.0 ndvi | raster-fun | 8.1 | 9.0 ndvi | raster-overlay | 5.7 | 6.5 ndvi | terra-fun | 8.8 | 11.6 ndvi | terra-lapp | 7.3 | 8.9

(I rerun below tests so the results are not accidental)

For reference:

timings.csv - test 1 (15.02)

timings.csv - test 2 (16.02)
opened by kadyb 0
Compression in `rasterio` doesn't work

Currently file compression in rasterio doesn't work. The file should be about 1 GB, but is actually 2 GB.

https://github.com/kadyb/raster-benchmark/blob/main/rasterio/write.py#L28
bug

opened by kadyb 0
Ideas
Results validation.

New test scenario where the dataset doesn't fit in memory - block processing.

Several datasets (different sizes and data types, e.g. float16).

Docker.

Log memory demand.

Reorganize files in the repository.
opened by kadyb 1
Package download statistics
R:

https://r-hub.github.io/cranlogs/

Python:

https://github.com/ContinuumIO/anaconda-package-data

https://github.com/grimbough/anaconda-download-stats

https://packaging.python.org/guides/analyzing-pypi-package-downloads/#package-downloads-over-time

https://github.com/ofek/pypinfo
opened by kadyb 0

Owner

Krzysztof Dyba

Spatial Data Science | Remote Sensing | R

GitHub https://kadyb.github.io/raster-benchmark/report.html

Advanced raster and geometry manipulations

buzzard In a nutshell, the buzzard library provides powerful abstractions to manipulate together images and geometries that come from different kind o

30 Jun 20, 2022

Summary statistics of geospatial raster datasets based on vector geometries.

rasterstats rasterstats is a Python module for summarizing geospatial raster datasets based on vector geometries. It includes functions for zonal stat

437 Dec 23, 2022

User friendly Rasterio plugin to read raster datasets.

rio-tiler User friendly Rasterio plugin to read raster datasets. Documentation: https://cogeotiff.github.io/rio-tiler/ Source Code: https://github.com

372 Dec 23, 2022

🌐 Local tile server for viewing geospatial raster files with ipyleaflet

?? Local Tile Server for Geospatial Rasters Need to visualize a rather large raster (gigabytes) you have locally? This is for you. A Flask application

192 Jan 4, 2023

🌐 Local tile server for viewing geospatial raster files with ipyleaflet or folium

?? Local Tile Server for Geospatial Rasters Need to visualize a rather large (gigabytes) raster you have locally? This is for you. A Flask application

192 Jan 4, 2023

Python package for earth-observing satellite data processing

Satpy The Satpy package is a python library for reading and manipulating meteorological remote sensing data and writing it to various image and data f

882 Dec 27, 2022

Geospatial Image Processing for Python

GIPPY Gippy is a Python library for image processing of geospatial raster data. The core of the library is implemented as a C++ library, libgip, with

83 Aug 19, 2022

A toolbox for processing earth observation data with Python.

eo-box eobox is a Python package with a small collection of tools for working with Remote Sensing / Earth Observation data. Package Overview So far, t

13 Jan 6, 2022

Processing and interpolating spatial data with a twist of machine learning

Documentation | Documentation (dev version) | Contact | Part of the Fatiando a Terra project About Verde is a Python library for processing spatial da

468 Dec 20, 2022

framework for large-scale SAR satellite data processing

pyroSAR A Python Framework for Large-Scale SAR Satellite Data Processing The pyroSAR package aims at providing a complete solution for the scalable or

389 Dec 21, 2022

This is a simple python code to get IP address and its location using python

IP address & Location finder @DEV/ED : Pavan Ananth Sharma Dependencies: ip2geotools Note: use pip install ip2geotools to install this in your termin

2 Jul 5, 2022

OSMnx: Python for street networks. Retrieve, model, analyze, and visualize street networks and other spatial data from OpenStreetMap.

OSMnx OSMnx is a Python package that lets you download geospatial data from OpenStreetMap and model, project, visualize, and analyze real-world street

4k Jan 8, 2023

Raster processing benchmarks for Python and R packages

Related tags

Overview

Raster processing benchmarks

Software

Reproduction

Dataset

Hardware configuration

Acknowledgment

Comments

Owner

Krzysztof Dyba

Advanced raster and geometry manipulations

Summary statistics of geospatial raster datasets based on vector geometries.

User friendly Rasterio plugin to read raster datasets.

🌐 Local tile server for viewing geospatial raster files with ipyleaflet

🌐 Local tile server for viewing geospatial raster files with ipyleaflet or folium

Python package for earth-observing satellite data processing

Geospatial Image Processing for Python

A toolbox for processing earth observation data with Python.

Processing and interpolating spatial data with a twist of machine learning

framework for large-scale SAR satellite data processing

This is a simple python code to get IP address and its location using python

OSMnx: Python for street networks. Retrieve, model, analyze, and visualize street networks and other spatial data from OpenStreetMap.

Python bindings and utilities for GeoJSON

Python interface to PROJ (cartographic projections and coordinate transformations library)

Python bindings and utilities for GeoJSON

Documentation and samples for ArcGIS API for Python

python toolbox for visualizing geographical data and making maps

geemap - A Python package for interactive mapping with Google Earth Engine, ipyleaflet, and ipywidgets.

Python interface to PROJ (cartographic projections and coordinate transformations library)