Big Data & Cloud Computing for Oceanography

Overview

DS2 Class 2022, Big Data & Cloud Computing for Oceanography

Home of the 2022 ISblue Big Data & Cloud Computing for Oceanography class (IMT-A, ENSTA, IUEM) given by:

This repo is a placeholder for the class practice sessions and for projects developed by students.

Practice notebooks

See https://github.com/obidam/ds2-2022/blob/main/practice/README.md
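
The practice sessions rely on a shared data catalog; as a minimal sketch (the catalog URL and the "en4" entry name are taken from the code in the comments below), it can be opened with intake:

    from intake import open_catalog

    # Class data catalog (URL as used in the practice code further down):
    catalog_url = 'https://raw.githubusercontent.com/obidam/ds2-2022/main/ds2_data_catalog.yml'
    cat = open_catalog(catalog_url)

    # Lazily open one entry as an xarray Dataset backed by dask:
    ds = cat["en4"].to_dask()
    print("Size of the dataset:", ds.nbytes / 1e9, "GB")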

Projects

See https://github.com/obidam/ds2-2022/blob/main/project/README.md


Comments
  • ValueError: unrecognized engine zarr must be one of: ['netcdf4', 'scipy', 'store']

    Hello,

    When I run this line:

    ds = cat["en4"].to_dask()

    sometimes it works fine, but other times it raises this error:

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-24-dbd9e41d9ee1> in <module>()
    ----> 1 ds = cat["en4"].to_dask()
          2 print("Size of the dataset:", ds.nbytes/1e9,"Gb")
          3 ds
    
    7 frames
    /usr/local/lib/python3.7/dist-packages/xarray/backends/plugins.py in get_backend(engine)
        133                 f"backends {installed_engines}. Consider explicitly selecting one of the "
        134                 "installed engines via the ``engine`` parameter, or installing "
    --> 135                 "additional IO dependencies, see:\n"
        136                 "http://xarray.pydata.org/en/stable/getting-started-guide/installing.html\n"
        137                 "http://xarray.pydata.org/en/stable/user-guide/io.html"
    
    ValueError: unrecognized engine zarr must be one of: ['netcdf4', 'scipy', 'store']
    

    I tried !pip install xarray[complete], as suggested in https://github.com/pydata/xarray/issues/5395, but it still didn't work.
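
    A minimal check worth trying in a Colab-style runtime: xarray only advertises backends whose dependencies import cleanly, so "zarr" missing from the engine list usually means the zarr package itself is not installed; installing it directly (rather than the xarray extras) may restore the engine.

    # Install the zarr backend; if xarray was already imported
    # (as on Colab), restart the kernel afterwards.
    !pip install zarr

    import xarray as xr
    # "zarr" should now appear among the available backends:
    print(xr.backends.list_engines())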

    Labels: help wanted, Pangeo. Opened by Beam-coder, 4 comments.
  • Computing Steric Height

    Hello!

    We have three questions:

    1. To compute Conservative Temperature, should we use gsw.CT_from_t(ds.SA, ds.temperature, ds.P), passing the pressure at each level, or gsw.CT_from_pt(ds.SA, ds.temperature), where the reference pressure is taken as 0 dbar? (A short standalone sketch contrasting the two follows the code below.)
    2. Which pressure reference should we use for gsw.geo_strf_dyn_height? We would also like to know which dimensions are expected here: the TEOS-10 docs first define p_ref as 10 dbar, but the example later uses 1000 without units.
    3. How should we go from the steric height anomaly to the steric height?

    These questions are spelled out in the code at the end.

    We suspect we are not using the pressure reference coherently: the steric height anomaly we obtain is of the order of metres, whereas we expect values of the order of millimetres.

    # Necessary imports
    import numpy as np
    import xarray as xr
    import dask
    import matplotlib.pyplot as plt
    import gsw
    from intake import open_catalog
    
    # Open dataset:
    catalog_url = 'https://raw.githubusercontent.com/obidam/ds2-2022/main/ds2_data_catalog.yml'
    cat = open_catalog(catalog_url)
    ds = cat["en4"].to_dask()
    # Keep only the upper depth levels (label-based slicing is inclusive: indices 0-34)
    ds = ds.sel(depth=slice(ds.depth[0], ds.depth[34]))
    
    # Treat the data:
    ## Make depth negative (positive direction upwards).
    ds['depth'] = -ds['depth']
    ## Convert temperature from Kelvin to degrees Celsius.
    ds['temperature'] = ds['temperature'] - 273.15
    ## Compute pressure from depth + latitude, then broadcast it onto the
    ## Dataset dimensions so it can be passed to the other gsw functions
    ## (the .where() masks P wherever temperature is NaN):
    ds['P'] = (gsw.p_from_z(ds.depth, ds.lat)
               .transpose('lat', 'depth')
               .expand_dims({'lon': ds.lon})
               .where(ds['temperature'] == ds['temperature'])
               .transpose('time', 'depth', 'lat', 'lon'))
    
    ## Calculate Absolute Salinity and Conservative Temperature.
    ds['SA'] = gsw.SA_from_SP(ds.salinity, ds.P, ds.lon, ds.lat)
    ##!!! We are unsure which of these two to use for CT:
    ds['CT'] = gsw.CT_from_t(ds.SA, ds.temperature, ds.P)
    ## --> or should we use this one instead?
    ##ds['CT'] = gsw.CT_from_pt(ds.SA, ds.temperature)
    
    
    ## Average over the years; put the pressure (depth) dimension on the first axis for dynamic height.
    ds = ds.groupby('time.year').mean('time').transpose('depth', 'year', 'lat', 'lon', 'bnds')
    
    ## Apply the dynamic height anomaly function (parallelized) and divide by
    ## the gravitational acceleration to obtain the steric height anomaly.
    ## We are also unsure whether the reference pressure should be 1000 dbar.
    g_0 = 9.7963  # m/s^2
    ds['st_height_anom'] = xr.apply_ufunc(gsw.geo_strf_dyn_height, ds.SA, ds.CT, ds.P, 1000,
                                          dask='parallelized', output_dtypes=[ds.SA.dtype]) / g_0
    
    # Test on a subset of years; this should take less than a minute.
    ds = ds.sel(year=slice(2015, 2019))
    ds['st_height_anom'].isel(depth=0).mean(dim=('lat', 'lon')).plot()
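
    For reference, a minimal standalone sketch of the two CT routes on a synthetic profile (all values hypothetical): gsw.CT_from_t applies when the variable is in-situ temperature, gsw.CT_from_pt when it is already potential temperature, and the two agree when the inputs are consistent:

    import numpy as np
    import gsw

    # Hypothetical single profile: pressure (dbar), Practical Salinity,
    # and in-situ temperature (degC); location picked arbitrarily.
    p = np.array([0., 100., 500., 1000.])
    sp = np.array([35.0, 35.1, 34.9, 34.7])
    t = np.array([20.0, 15.0, 8.0, 4.0])
    lon, lat = -30.0, 45.0

    sa = gsw.SA_from_SP(sp, p, lon, lat)  # Absolute Salinity (g/kg)

    # Route 1: convert in-situ temperature directly, passing the local pressure.
    ct_a = gsw.CT_from_t(sa, t, p)
    # Route 2: reference to 0 dbar first (potential temperature), then convert.
    pt0 = gsw.pt0_from_t(sa, t, p)
    ct_b = gsw.CT_from_pt(sa, pt0)
    print(np.allclose(ct_a, ct_b))  # True: both routes give the same CT

    # Dynamic height anomaly relative to p_ref = 1000 dbar (units m^2/s^2);
    # dividing by g yields the steric height anomaly in metres.
    dyn = gsw.geo_strf_dyn_height(sa, ct_a, p, p_ref=1000.)
    print(dyn / 9.7963)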
    
    Opened by Marioherreroglez, 3 comments.
Owner

Ocean's Big Data Mining

Tools for data mining of large and/or diverse ocean datasets, primarily oriented toward ocean physics, observations, and (re)analysis.