ETL flow framework based on Yaml configs in Python

Павел Максимов

Last update: Jul 6, 2022

Overview

ETL framework based on Yaml configs in Python

A lightweight framework for building data flows. Flows are set up through YAML configuration files. It supports schedules, task pools, and concurrency limits. It is fast and needs few resources, and it runs on Windows and Linux. Flows run in parallel using the threading library. Internally it uses an SQLite database. Data transformation is built in, and there is a web interface.
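To illustrate what a task pool with a concurrency limit means when flows run on threads, here is a minimal sketch. It is not Flowmaster's implementation: the `Pool` class, the pool name `db`, and the `run_flow` helper are hypothetical names made up for this example. It only shows the general technique of capping parallel work with a semaphore.

```python
import threading
import time


class Pool:
    """Hypothetical named pool that caps how many tasks run at once."""

    def __init__(self, name, limit):
        self.name = name
        self._sem = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self.active = 0  # tasks currently inside the pool
        self.peak = 0    # highest number of simultaneous tasks observed

    def __enter__(self):
        self._sem.acquire()  # blocks while the pool is full
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        return self

    def __exit__(self, *exc):
        with self._lock:
            self.active -= 1
        self._sem.release()
        return False


def run_flow(pool, results, name):
    """Stand-in for one flow: take a pool slot, do some work, record the name."""
    with pool:
        time.sleep(0.01)  # simulated work
        results.append(name)


def run_all(pool, names):
    """Start one thread per flow and wait for all of them."""
    results = []
    threads = [
        threading.Thread(target=run_flow, args=(pool, results, n)) for n in names
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


pool = Pool("db", limit=2)
done = run_all(pool, [f"flow_{i}" for i in range(6)])
print(sorted(done), "peak concurrency:", pool.peak)
```

All six flows finish, but no more than two hold a slot in the pool at the same moment.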

Connectors are currently available for the following sources

CSV file
SQLite
Postgres
MySQL
Yandex Metrika Management API
Yandex Metrika Stats API
Yandex Metrika Logs API
Yandex Direct API
Yandex Direct Report API
Criteo
Google Sheets

Storages

CSV file
Clickhouse

Documentation

Requirements

python >=3.9
virtual environment

Settings

Installing into a virtual environment is highly recommended.

Flowmaster needs a home directory. The default is '{HOME}/FlowMaster',
but you can optionally set a different location through the FLOWMASTER_HOME environment variable.

For Windows

setx FLOWMASTER_HOME "{YOUR_PATH}"

For Linux

export FLOWMASTER_HOME={YOUR_PATH}
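The rule described above (use FLOWMASTER_HOME if it is set, otherwise fall back to '{HOME}/FlowMaster') can be sketched in Python as follows. This is an illustration of the documented behaviour, not code taken from Flowmaster; the function name `resolve_home` is hypothetical.

```python
import os
from pathlib import Path


def resolve_home(env=None):
    """Return the directory Flowmaster would be expected to use as its home.

    Hypothetical helper: mirrors the documented rule, not the library's code.
    """
    env = os.environ if env is None else env
    custom = env.get("FLOWMASTER_HOME")
    if custom:
        return Path(custom)
    # Documented default: a FlowMaster folder in the user's home directory.
    return Path.home() / "FlowMaster"


print(resolve_home())
```

Running it before and after setting the variable is a quick way to confirm that the shell actually exported the value. Note that on Windows, `setx` only affects consoles opened after the command is run.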

Installing

pip install flowmaster==0.7.1

# To install the web UI.
pip install flowmaster[webui]==0.7.1

# Optional libraries.
pip install flowmaster[clickhouse,postgres,mysql,yandexdirect,yandexmetrika,criteo,googlesheets]==0.7.1
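After installing, it can be useful to confirm which version ended up in the active environment. The sketch below uses only the standard library's `importlib.metadata`; the helper name `installed_version` is made up for this example and is not part of Flowmaster.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("flowmaster")
print(found if found else "flowmaster is not installed in this environment")
```

If this reports that the package is missing, the virtual environment that `pip` installed into is probably not the one currently activated.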

Run

flowmaster run --help
flowmaster run

WEB UI

http://localhost:8822

CHANGELOG

Support

Telegram support chat

Author

Pavel Maksimov

My contacts Telegram, Facebook

Good luck, friend! Give the project a star ;)

Comments

No such file or directory: '/home/ubuntu/FlowMaster/pools.yaml'

Hi, this is a very good project, but I ran into the following problem when installing the library

with vanilla Python pip the package is not found at all
when installing through conda the installation goes fine, but on startup I get

(base) ubuntu@primary:~/FlowMaster$ flowmaster run
Traceback (most recent call last):
  File "/home/ubuntu/miniforge3/bin/flowmaster", line 5, in <module>
    from flowmaster.__main__ import app
  File "/home/ubuntu/miniforge3/lib/python3.9/site-packages/flowmaster/__main__.py", line 9, in <module>
    import flowmaster.cli.notebook
  File "/home/ubuntu/miniforge3/lib/python3.9/site-packages/flowmaster/cli/notebook.py", line 5, in <module>
    from flowmaster.service import (
  File "/home/ubuntu/miniforge3/lib/python3.9/site-packages/flowmaster/service.py", line 11, in <module>
    from flowmaster.operators.etl.policy import ETLNotebook
  File "/home/ubuntu/miniforge3/lib/python3.9/site-packages/flowmaster/operators/etl/__init__.py", line 3, in <module>
    from flowmaster.operators.etl.providers.abstract import ProviderAbstract, ExportAbstract
  File "/home/ubuntu/miniforge3/lib/python3.9/site-packages/flowmaster/operators/etl/providers/__init__.py", line 4, in <module>
    from flowmaster.operators.etl.providers.criteo import CriteoProvider
  File "/home/ubuntu/miniforge3/lib/python3.9/site-packages/flowmaster/operators/etl/providers/criteo/__init__.py", line 2, in <module>
    from flowmaster.operators.etl.providers.criteo.export import (
  File "/home/ubuntu/miniforge3/lib/python3.9/site-packages/flowmaster/operators/etl/providers/criteo/export.py", line 8, in <module>
    from flowmaster.executors import SleepIteration
  File "/home/ubuntu/miniforge3/lib/python3.9/site-packages/flowmaster/executors/__init__.py", line 16, in <module>
    from flowmaster.pool import pools
  File "/home/ubuntu/miniforge3/lib/python3.9/site-packages/flowmaster/pool.py", line 106, in <module>
    pools_dict = YamlHelper.parse_file(str(Settings.POOL_CONFIG_FILEPATH))
  File "/home/ubuntu/miniforge3/lib/python3.9/site-packages/flowmaster/utils/yaml_helper.py", line 14, in parse_file
    with open(path, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/FlowMaster/pools.yaml'

What am I doing wrong? :(

opened by micweeks 1

Releases(0.7.1)

0.7.1(Aug 29, 2021)
prevented scheduling of tasks from a single instance of the operator class

fixed error GeneratorExit

fixed transform array type for Clickhouse loader

0.6.1(Jun 22, 2021)
Redesigned executor

New

add policies 'time_limit_seconds_from_worktime', 'soft_time_limit_seconds'.

add provider 'flowmaster'

Fixing

fix schedule (interval seconds mode)

add logging 'loguru'

fix clear_statuses_of_lost_items

fix allow_execute_flow

change command 'db reset'

There are backward incompatible changes

new field 'expires_utc' in FlowItem

rename command 'run' to 'run_local' and rename command 'run_thread' to 'run'

add new class ExecutorIterationTask.

change, move and rename class ThreadExecutor to ThreadAsyncExecutor.

change and rename class SleepTask to SleepIteration.

change and rename class TaskPool to NextIterationInPools.

ETLOperator return ExecutorIterationTask.

rename func order_flow to ordering_flow_tasks.

rename func start_executor to sync_executor.

rename field FlowItem.config_hash to FlowItem.notebook_hash

change FLOW_CONFIGS_DIR and rename FLOW_CONFIGS_DIR to NOTEBOOKS_DIR

rename objects config to notebook

add class Settings

0.5.0(May 25, 2021)

0.3.1(May 15, 2021)
There are backward incompatible changes

Add local executor

Fix Yandex Direct provider

Refactoring

0.2.2(May 13, 2021)

Add provider Yandex Direct

Refactoring

Incompatible changes
0.1.3(May 2, 2021)

0.1.0(May 1, 2021)


Owner

Павел Максимов

Python Data Engineer, Python Developer, ETL, Recommender Systems Developer

GitHub
