Raster processing benchmarks for Python and R packages

Overview

Raster processing benchmarks

This repository contains a collection of raster processing benchmarks for Python and R packages. The tests cover the most common operations such as loading data, extracting values by points, downsampling, calculating NDVI, writing multilayer, cropping by extent and calculating zonal statistics. The comparison is made from the user's perspective (the simplest functions are used and the code is not optimized), so the results do not represent the best performance.

The detailed results are available at https://kadyb.github.io/raster-benchmark/report.html.

Software

Python:

R:

Reproduction

  1. Download raster data (851 MB) from Google Drive or Earth Explorer (original source, registration required) and then unzip to data/.
  2. Run all benchmarks using batch script (run_benchmarks.sh) or single benchmarks files.

Batch script

cd raster-benchmark
./run_benchmarks.sh

Single benchmark

Rscript stars/crop.R
python3 rasterio/crop.py

Dataset

Landsat 8 satellite scene (10 bands, 30 m resolution, 7771 x 7871 pixels) was used for tests.

Scene ID: LC08_L1TP_190024_20200418_20200822_02_T1

Hardware configuration

  • CPU: Intel Xeon CPU E5-2620 v2 @ 2.10GHz
  • RAM: 64 GB
  • OS: Pop!_OS 20.04 LTS

Acknowledgment

Landsat-8 image courtesy of the U.S. Geological Survey, https://earthexplorer.usgs.gov/

Comments
  • Use vectorized selection for rioxarray

    Use vectorized selection for rioxarray

    Hi,

    I came across this benchmark while comparing rasterstats and rioxarray. Looking into the rioxarray/extract_points.py, I found that it is using iteration instead of much faster vectorized selection. This results in 1000x speedup.

    I have opened a PR that shows this improvement. Would be good to test and merge. https://github.com/kadyb/raster-benchmark/pull/13

    opened by spatialthoughts 5
  • Use vectorized selection with xarray

    Use vectorized selection with xarray

    Use vectorized selection instead of iteration. This gives more than 1000x speedup.

    Tested on my machine.

    Existing code task;package;time extract-points;rioxarray;44,26 extract-points;rioxarray;44,79 extract-points;rioxarray;46,4 extract-points;rioxarray;45,64 extract-points;rioxarray;46,77 extract-points;rioxarray;46,94 extract-points;rioxarray;47,95 extract-points;rioxarray;45,25 extract-points;rioxarray;45,09 extract-points;rioxarray;43,81

    Updated code task;package;time extract-points-update;rioxarray;0,1 extract-points-update;rioxarray;0,08 extract-points-update;rioxarray;0,08 extract-points-update;rioxarray;0,08 extract-points-update;rioxarray;0,08 extract-points-update;rioxarray;0,07 extract-points-update;rioxarray;0,07 extract-points-update;rioxarray;0,06 extract-points-update;rioxarray;0,06 extract-points-update;rioxarray;0,06

    opened by spatialthoughts 1
  • crop timings using GDAL

    crop timings using GDAL

    ### files
    input = "stack.tif"
    output = "crop.tif"
    
    ### create polygon
    library(sf)
    bbox = st_bbox(c(xmin = 598500, xmax = 727500, ymax = 5781000, ymin = 5682100),
                   crs = st_crs(32633))
    bbox = st_as_sfc(bbox)
    write_sf(bbox, "bbox.gpkg")
    
    ### GDAL translate
    coords = "598500 5781000 727500 5682100" # xmin ymax xmax ymin
    main = "gdal_translate -projwin"
    cmd = paste(main, coords, input, output)
    
    if (file.exists(output)) file.remove(output)
    system.time(system(cmd))
    #>  user  system elapsed
    #> 4.501   0.876   5.403
    
    ### GDAL warp
    coords = "598500 5682100 727500 5781000" # xmin ymin xmax ymax
    main = "gdalwarp -te"
    cmd = paste(main, coords, input, output)
    
    if (file.exists(output)) file.remove(output)
    system.time(system(cmd))
    #>  user  system elapsed
    #> 5.717   0.951   6.680
    
    ### GDAL warp (polygon)
    main = "gdalwarp -cutline"
    polygon = "bbox.gpkg"
    cmd = paste(main, polygon, input, output)
    
    if (file.exists(output)) file.remove(output)
    system.time(system(cmd))
    #>   user  system elapsed
    #> 14.502   2.437  17.008
    
    ### GDAL warp (polygon extent)
    cmd = paste(main, polygon, "-crop_to_cutline", input, output)
    
    if (file.exists(output)) file.remove(output)
    system.time(system(cmd))
    #>  user  system elapsed
    #> 6.183   1.211   7.451
    
    opened by kadyb 1
  • Test data.table to calculate NDVI

    Test data.table to calculate NDVI

    It looks like data.table may be the fastest tool for arithmetic operations on rasters in R.

    library(data.table)
    
    rasters = list.files("data/LC08_L1TP_190024_20200418_20200822_02_T1/",
                         pattern = "\\.TIF$", full.names = TRUE)
    
    #########################
    ### stars
    
    system.time({ras = stars::read_stars(rasters, along = 3, proxy = FALSE)})
    #>   user  system elapsed
    #> 41.970  13.556  55.622
    
    names(ras) = "landsat"
    ras_names = c("B1", "B10", "B11", "B2", "B3", "B4", "B5", "B6", "B7", "B9")
    ras = stars::st_set_dimensions(ras, 3, values = ras_names, names = "band")
    
    mat = ras$landsat
    dim(mat) = c(dim(mat)[1] * dim(mat)[2], dim(mat)[3])
    colnames(mat) = ras_names
    
    system.time({dt = as.data.table(mat)})
    #>  user  system elapsed
    #> 9.201   5.484  14.711
    
    #########################
    ### terra
    
    ras = terra::rast(rasters)
    ras_names = c("B1", "B10", "B11", "B2", "B3", "B4", "B5", "B6", "B7", "B9")
    names(ras) = ras_names
    
    system.time({dt = as.data.table(terra::values(ras))})
    #>   user  system elapsed
    #> 31.897  18.470  50.534
    
    #########################
    ### NDVI data.table
    
    t_vec = numeric(20)
    for (i in seq_len(20)) {
    
      t = system.time((dt$B5 - dt$B4) / (dt$B5 + dt$B4))
      t_vec[i] = unname(t["elapsed"])
    
    }
    
    mean(t_vec)
    #> 1.2864
    sd(t_vec)
    #> 0.132985
    
    opened by kadyb 1
  • Arithmetic operations are faster for vectors in R

    Arithmetic operations are faster for vectors in R

    One interesting observation is that arithmetic operations (e.g. computing NDVI) are faster for vector/matrices than for raster objects (stars or SpatRaster) in R. At this moment, this is especially true for the terra package, where it is probably better to convert the raster to a matrix, perform calculations, and then assign the results back to the raster object.

    library(stars)
    library(terra)
    
    ndvi = function(red, nir) {(nir - red) / (nir + red)}
    
    rasters = list.files("data/LC08_L1TP_190024_20200418_20200822_02_T1/",
                         pattern = "\\.TIF$", full.names = TRUE)
    
    
    ### calculate NDVI using 'stars' objects
    ras = read_stars(rasters, along = 3, proxy = FALSE)
    
    red = adrop(ras[,,,6])
    nir = adrop(ras[,,,7])
    
    system.time(ndvi(red, nir))
    #>  user  system elapsed
    #> 0.756   0.711   1.467
    
    
    ### calculate NDVI using 'SpatRaster' objects
    ras = rast(rasters)
    ras = ras * 1
    
    red = ras[[6]]
    nir = ras[[7]]
    
    system.time(ndvi(red, nir))
    #>  user  system elapsed
    #> 4.141   3.320   7.462
    
    
    ### calculate NDVI using vectors
    vred = values(red, mat = FALSE)
    vnir = values(nir, mat = FALSE)
    
    system.time(ndvi(vred, vnir))
    #>  user  system elapsed
    #> 0.558   0.209   0.767
    
    
    ### calculate NDVI using vectors
    ### and assign values to new 'SpatRaster'
    new = rast(red)
    system.time(new <- setValues(new, ndvi(vred, vnir)))
    #>  user  system elapsed
    #> 0.989   0.768   1.758
    
    opened by kadyb 0
  • Feedback

    Feedback

    • [x] Add a new task with reading all data into the memory
    • [x] Load all data into memory in terra and raster so that the results will be fair to the other packages
    • [x] Add link to this repository in main plot

    NDVI:

    • [x] Remove data.table package
    • [x] Preselect RED and NIR bands in R packages
    • [x] Create appendix with timings where vector/matrix are used (not raster objects) in R: #9
    • [x] Load only two bands to reduce testing time overall
    opened by kadyb 0
  • raster and terra are slower on better PC

    raster and terra are slower on better PC

    Benchmarks were run on two different hardware configurations. Almost all packages noted better times on the newer PC (test 2) except selected functions from raster and terra.

    # test 1 hardware
    CPU: Intel Core i5 3210M 2.50 GHz
    RAM: 8 GB
    OS: Windows 8.1
    
    # test 2 hardware
    CPU: Intel Xeon CPU E5-2620 v2 @ 2.10GHz
    RAM: 64 GB
    OS: Pop!_OS 20.04 LTS
    

    Timings are given as medians of 10 iterations. task | package | test 1 | test 2 -- | -- | -- | -- extract-points | raster | 22.3 | 24.5 crop | terra | 6.3 | 8.3 extract-points | terra | 6.7 | 7.2 write | terra | 38.9 | 42.0 ndvi | raster-fun | 8.1 | 9.0 ndvi | raster-overlay | 5.7 | 6.5 ndvi | terra-fun | 8.8 | 11.6 ndvi | terra-lapp | 7.3 | 8.9

    (I rerun below tests so the results are not accidental)

    For reference:

    1. timings.csv - test 1 (15.02)
    2. timings.csv - test 2 (16.02)
    opened by kadyb 0
  • Compression in `rasterio` doesn't work

    Compression in `rasterio` doesn't work

    Currently file compression in rasterio doesn't work. The file should be about 1 GB, but is actually 2 GB.

    https://github.com/kadyb/raster-benchmark/blob/main/rasterio/write.py#L28

    bug 
    opened by kadyb 0
  • Ideas

    Ideas

    1. Results validation.
    2. New test scenario where the dataset doesn't fit in memory - block processing.
    3. Several datasets (different sizes and data types, e.g. float16).
    4. Docker.
    5. Log memory demand.
    6. Reorganize files in the repository.
    opened by kadyb 1
  • Package download statistics

    Package download statistics

    R:

    • https://r-hub.github.io/cranlogs/

    Python:

    • https://github.com/ContinuumIO/anaconda-package-data
    • https://github.com/grimbough/anaconda-download-stats
    • https://packaging.python.org/guides/analyzing-pypi-package-downloads/#package-downloads-over-time
    • https://github.com/ofek/pypinfo
    opened by kadyb 0
Owner
Krzysztof Dyba
Spatial Data Science | Remote Sensing | R
Krzysztof Dyba
Advanced raster and geometry manipulations

buzzard In a nutshell, the buzzard library provides powerful abstractions to manipulate together images and geometries that come from different kind o

Earthcube Lab 30 Jun 20, 2022
Summary statistics of geospatial raster datasets based on vector geometries.

rasterstats rasterstats is a Python module for summarizing geospatial raster datasets based on vector geometries. It includes functions for zonal stat

Matthew Perry 437 Dec 23, 2022
User friendly Rasterio plugin to read raster datasets.

rio-tiler User friendly Rasterio plugin to read raster datasets. Documentation: https://cogeotiff.github.io/rio-tiler/ Source Code: https://github.com

null 372 Dec 23, 2022
🌐 Local tile server for viewing geospatial raster files with ipyleaflet

?? Local Tile Server for Geospatial Rasters Need to visualize a rather large raster (gigabytes) you have locally? This is for you. A Flask application

Bane Sullivan 192 Jan 4, 2023
🌐 Local tile server for viewing geospatial raster files with ipyleaflet or folium

?? Local Tile Server for Geospatial Rasters Need to visualize a rather large (gigabytes) raster you have locally? This is for you. A Flask application

Bane Sullivan 192 Jan 4, 2023
Python package for earth-observing satellite data processing

Satpy The Satpy package is a python library for reading and manipulating meteorological remote sensing data and writing it to various image and data f

PyTroll 882 Dec 27, 2022
Geospatial Image Processing for Python

GIPPY Gippy is a Python library for image processing of geospatial raster data. The core of the library is implemented as a C++ library, libgip, with

GIPIT 83 Aug 19, 2022
A toolbox for processing earth observation data with Python.

eo-box eobox is a Python package with a small collection of tools for working with Remote Sensing / Earth Observation data. Package Overview So far, t

null 13 Jan 6, 2022
Processing and interpolating spatial data with a twist of machine learning

Documentation | Documentation (dev version) | Contact | Part of the Fatiando a Terra project About Verde is a Python library for processing spatial da

Fatiando a Terra 468 Dec 20, 2022
framework for large-scale SAR satellite data processing

pyroSAR A Python Framework for Large-Scale SAR Satellite Data Processing The pyroSAR package aims at providing a complete solution for the scalable or

John Truckenbrodt 389 Dec 21, 2022
This is a simple python code to get IP address and its location using python

IP address & Location finder @DEV/ED : Pavan Ananth Sharma Dependencies: ip2geotools Note: use pip install ip2geotools to install this in your termin

Pavan Ananth Sharma 2 Jul 5, 2022
OSMnx: Python for street networks. Retrieve, model, analyze, and visualize street networks and other spatial data from OpenStreetMap.

OSMnx OSMnx is a Python package that lets you download geospatial data from OpenStreetMap and model, project, visualize, and analyze real-world street

Geoff Boeing 4k Jan 8, 2023
Python bindings and utilities for GeoJSON

geojson This Python library contains: Functions for encoding and decoding GeoJSON formatted data Classes for all GeoJSON Objects An implementation of

Jazzband 765 Jan 6, 2023
Python interface to PROJ (cartographic projections and coordinate transformations library)

pyproj Python interface to PROJ (cartographic projections and coordinate transformations library). Documentation Stable: http://pyproj4.github.io/pypr

null 832 Dec 31, 2022
Python bindings and utilities for GeoJSON

geojson This Python library contains: Functions for encoding and decoding GeoJSON formatted data Classes for all GeoJSON Objects An implementation of

Jazzband 763 Dec 26, 2022
Documentation and samples for ArcGIS API for Python

ArcGIS API for Python ArcGIS API for Python is a Python library for working with maps and geospatial data, powered by web GIS. It provides simple and

Esri 1.4k Dec 30, 2022
python toolbox for visualizing geographical data and making maps

geoplotlib is a python toolbox for visualizing geographical data and making maps data = read_csv('data/bus.csv') geoplotlib.dot(data) geoplotlib.show(

Andrea Cuttone 976 Dec 11, 2022
geemap - A Python package for interactive mapping with Google Earth Engine, ipyleaflet, and ipywidgets.

A Python package for interactive mapping with Google Earth Engine, ipyleaflet, and folium

Qiusheng Wu 2.4k Dec 30, 2022
Python interface to PROJ (cartographic projections and coordinate transformations library)

pyproj Python interface to PROJ (cartographic projections and coordinate transformations library). Documentation Stable: http://pyproj4.github.io/pypr

null 832 Dec 31, 2022