MetaFrame

Overview

A data structure that extends pyspark.sql.DataFrame with metadata information.

Usage

from pyspark.sql import SparkSession, functions as F
from metaframe import MetaFrame

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2)], ["a", "b"])

mf = MetaFrame(df=df, metadata={"columns": ["a", "b"]})
mf = mf.withColumn("new_col", F.lit(1))  # DataFrame methods pass through and return a MetaFrame
mf = mf.set_metadata(columns=["a", "b", "new_col"])  # metadata is updated explicitly
mf.show()
assert mf.metadata["columns"] == ["a", "b", "new_col"]
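
The page does not show the package's internals, but a wrapper like this is commonly built by delegating unknown attributes to the underlying DataFrame and re-wrapping any DataFrame results. The sketch below is an assumption about that approach, not the project's actual implementation:

from pyspark.sql import DataFrame

class MetaFrame:
    """Illustrative sketch of a metadata-carrying DataFrame wrapper."""

    def __init__(self, df, metadata=None):
        self.df = df
        self.metadata = dict(metadata or {})

    def set_metadata(self, **kwargs):
        # Return a new wrapper with updated metadata; the data is untouched.
        return MetaFrame(self.df, {**self.metadata, **kwargs})

    def __getattr__(self, name):
        # Called only for attributes not found on MetaFrame itself.
        attr = getattr(self.df, name)
        if not callable(attr):
            return attr

        def wrapped(*args, **kwargs):
            result = attr(*args, **kwargs)
            # Re-wrap DataFrame results so the metadata travels along.
            if isinstance(result, DataFrame):
                return MetaFrame(result, self.metadata)
            return result

        return wrapped
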
You might also like...
Instant search for and access to many datasets in Pyspark.

SparkDataset Provides instant access to many datasets right from Pyspark (in Spark DataFrame structure). Drop a star if you like the project. 😃 Motiv

Important dataframe statistics with a single command

quick_eda Receiving dataframe statistics with one command Project description A python package for Data Scientists, Students, ML Engineers and anyone

PySpark bindings for H3, a hierarchical hexagonal geospatial indexing system

h3-pyspark: Uber's H3 Hexagonal Hierarchical Geospatial Indexing System in PySpark PySpark bindings for the H3 core library. For available functions,

Calculate multilateral price indices in Python (with Pandas and PySpark).

IndexNumCalc Calculate multilateral price indices using the GEKS-T (CCDI), Time Product Dummy (TPD), Time Dummy Hedonic (TDH), Geary-Khamis (GK) metho

Random dataframe and database table generator

Random database/dataframe generator Authored and maintained by Dr. Tirthajyoti Sarkar, Fremont, USA Introduction Often, beginners in SQL or data scien

Monitor the stability of a pandas or spark dataframe ⚙︎

Population Shift Monitoring popmon is a package that allows one to check the stability of a dataset. popmon works with both pandas and spark datasets.

Pandas and Spark DataFrame comparison for humans

DataComPy DataComPy is a package to compare two Pandas DataFrames. Originally started to be something of a replacement for SAS's PROC COMPARE for Pand

PySpark Structured Streaming ROS Kafka ApacheSpark Cassandra

PySpark-Structured-Streaming-ROS-Kafka-ApacheSpark-Cassandra The purpose of this project is to demonstrate a structured streaming pipeline with Apache

A collection of learning outcomes data analysis using Python and SQL, from DQLab.

Data Analyst with PYTHON: a data analyst's role is to produce data analyses and present insights to support the decision-making process

Comments
  • Lazy import from init

    This is a single-class library; users will most likely use only the MetaFrame class. So it would be better to support from metaframe import MetaFrame instead of the current from metaframe.metaframe import MetaFrame.

    One way to do this is to define MetaFrame within __init__.py, or just to import it there. But this can add unwanted overhead when importing metaframe.MetaFrame is costly.

    Also, if we put package metadata like __version__ there, we might hit build-time dependency problems: metaframe/__init__.py is imported at build time, which should raise an error since pyspark is not a build-time dependency. (I didn't test this; not 100% sure.)

    I think we can try lazy module imports to handle this. It's kind of hacky, but I have used it elsewhere and have seen similar approaches.

    It should be something like:

    # __init__.py

    def __getattr__(name):
        # Lazily import MetaFrame on first attribute access (PEP 562).
        if name == "MetaFrame":
            from metaframe.metaframe import MetaFrame
            return MetaFrame
        raise AttributeError(f"module {__name__!r} has no attribute {name!r}")

    But this works only for Python >= 3.7 (see PEP 562).
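
    With that in place, both access styles resolve to the same class. A quick illustrative check, assuming the package layout above:

    import metaframe

    # Attribute access triggers the lazy import via module-level __getattr__.
    assert metaframe.MetaFrame is metaframe.metaframe.MetaFrame

    # from-imports also fall back to module __getattr__ on Python >= 3.7.
    from metaframe import MetaFrame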

    opened by burakyilmaz321 4
  • Import MetaFrame from init lazily

    This PR applies the implementation suggested in #3 and is open to discussion.

    Now users can import MetaFrame from the top-level package. Note that this works only for Python >= 3.7.

    before:

    from metaframe.metaframe import MetaFrame
    

    after:

    from metaframe import MetaFrame
    

    Closes #3

    opened by sagmansercan 1
  • Versioning and pypi releases

    How should we maintain versions?

    I propose not using setuptools-scm here since it brings a coupling with git that we don't need. I think we can maintain versions by hand.

    How should we publish new releases?

    Let's start simple before automating releases. We can publish the package to PyPI manually.
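
    As a concrete sketch of that manual workflow (the version string and the build/upload commands below are illustrative assumptions, not a settled process):

    # metaframe/__init__.py
    # Single source of truth for the version, bumped by hand before each release.
    __version__ = "0.1.0"

    # Release steps, run from the project root:
    #   python -m build                  # build sdist and wheel into dist/
    #   python -m twine upload dist/*    # publish to PyPI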

    opened by burakyilmaz321 1
  • Allow MetaFrame to be extended by inheritance

    It would be useful to let users extend MetaFrame with their own methods. Currently we construct a MetaFrame instance directly when wrapping results; I think changing this to cls(...) should work, but I haven't tested it. Example usage:

    class MyDataFrame(MetaFrame):
        def my_method(self):
            ...
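
    A hedged sketch of the proposed change (names are illustrative, not the project's actual code): constructing results with type(self), the instance-method equivalent of cls, preserves subclasses such as MyDataFrame above:

    class MetaFrame:
        def __init__(self, df, metadata=None):
            self.df = df
            self.metadata = dict(metadata or {})

        def withColumn(self, name, col):
            # type(self) rather than a hard-coded MetaFrame: called on a
            # MyDataFrame instance, this returns a MyDataFrame, so custom
            # methods remain available after chaining DataFrame operations.
            return type(self)(self.df.withColumn(name, col), self.metadata)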
    
    opened by kori73 2
Owner
Invent Analytics
A Big Data ETL project in PySpark on the historical NYC Taxi Rides data

Processing NYC Taxi Data using PySpark ETL pipeline Description This is a project to extract, transform, and load large amounts of data from NYC Taxi

Unnikrishnan 2 Dec 12, 2021
A COVID data pipeline using PySpark and MySQL that collects a data stream from an API, does some processing, and stores it in a MySQL database.

null 2 Nov 20, 2021
Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Amundsen 3.7k Jan 3, 2023
🚚 Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark

To launch a live notebook server to test optimus using binder or Colab, click on one of the following badges: Optimus is the missing framework to prof

Iron 1.3k Dec 30, 2022
Finds, downloads, parses, and standardizes public bikeshare data into a standard pandas dataframe format.

Brady Law 2 Dec 1, 2021
PySpark project that is able to do joins on Spark data frames.

SPARK JOINS This project performs inner joins, all outer joins, and semi joins. create_df.py: load_data.py: helps to put data into Spark data frames. d

Joshua 1 Dec 14, 2021
Create HTML profiling reports from pandas DataFrame objects

Pandas Profiling Documentation | Slack | Stack Overflow Generates profile reports from a pandas DataFrame. The pandas df.describe() function is great

null 10k Jan 1, 2023
Supply a wrapper ``StockDataFrame`` based on the ``pandas.DataFrame`` with inline stock statistics/indicators support.

Stock Statistics/Indicators Calculation Helper VERSION: 0.3.2 Introduction Supply a wrapper StockDataFrame based on the pandas.DataFrame with inline s

Cedric Zhuang 1.1k Dec 28, 2022
Pyspark Spotify ETL

This is my first Data Engineering project. It extracts data from the user's recently played tracks using Spotify's API, transforms the data, and then loads it into PostgreSQL using an SQLAlchemy engine. Data is shown as a Spark DataFrame before loading, and the whole ETL job is scheduled with crontab. The token never expires since an HTTP POST to Spotify's token API is made at the beginning of the script.

null 16 Jun 9, 2022
Churn prediction with PySpark

The goal is to develop a machine learning model that can predict which customers will leave the company.

null 3 Aug 13, 2021