Awesome DataOps

A curated list of awesome DataOps tools.

Awesome DataOps
Resources
- Books
- Other Lists
- Slack
Contributing

Data Catalog

Tools related to data cataloging.

Amundsen - Data discovery and metadata engine that improves productivity when interacting with data.
Apache Atlas - Provides open metadata management and governance capabilities to build a data catalog.
CKAN - Open-source DMS (data management system) for powering data hubs and data portals.
DataHub - LinkedIn's generalized metadata search & discovery tool.
Magda - A federated, open-source data catalog for all your big data and small data.
Metacat - Unified metadata exploration API service for Hive, RDS, Teradata, Redshift, S3 and Cassandra.
OpenMetadata - A single place to discover, collaborate on, and get your data right.

Data Exploration

Tools for performing data exploration.

Apache Zeppelin - Enables data-driven, interactive data analytics and collaborative documents.
Jupyter Notebook - Web-based notebook environment for interactive computing.
JupyterLab - The next-generation user interface for Project Jupyter.
Jupytext - Jupyter Notebooks as Markdown Documents, Julia, Python or R scripts.
Polynote - The polyglot notebook with first-class Scala support.

Data Ingestion

Tools for performing data ingestion.

Amazon Kinesis - Easily collect, process, and analyze video and data streams in real time.
Apache Gobblin - A framework that simplifies common aspects of big data such as data ingestion.
Apache Kafka - Open-source distributed event streaming platform used by thousands of companies.
Apache Pulsar - Distributed pub-sub messaging platform with a flexible messaging model and intuitive API.
Embulk - A parallel bulk data loader that helps data transfer between various storages.
Fluentd - Collects events from various data sources and writes them to files.
Google PubSub - Ingest events for streaming into BigQuery, data lakes or operational databases.
Nakadi - A distributed event bus that implements a RESTful API abstraction on top of Kafka-like queues.
Pravega - An open source distributed storage service implementing Streams.
RabbitMQ - One of the most popular open source message brokers.

Data Lake

Tools related to storing data in data lakes.

Delta Lake - An open source project that enables building a Lakehouse architecture on top of data lakes.
LakeFS - Open source tool that transforms your object storage into a Git-like repository.

Data Workflow

Tools related to data workflow/pipeline.

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows.
Apache Oozie - An extensible, scalable and reliable system to manage complex Hadoop workloads.
Azkaban - Batch workflow job scheduler created at LinkedIn to run Hadoop jobs.
Dagster - An orchestration platform for the development, production, and observation of data assets.
Luigi - Python module that helps you build complex pipelines of batch jobs.
Prefect - A workflow management system, designed for modern infrastructure.

Data Processing

Tools related to data processing (batch and stream).

Apache Beam - A unified model for defining both batch and streaming data-parallel processing pipelines.
Apache Flink - An open source stream processing framework with powerful capabilities.
Apache Hadoop MapReduce - A framework for writing applications which process vast amounts of data.
Apache Hudi - Hadoop Upserts Deletes and Incrementals.
Apache Nifi - An easy to use, powerful, and reliable system to process and distribute data.
Apache Samza - A distributed stream processing framework which uses Apache Kafka and Hadoop YARN.
Apache Spark - A unified analytics engine for large-scale data processing.
Apache Storm - An open source distributed realtime computation system.
Apache Tez - A generic data-processing pipeline engine envisioned as a low-level engine.
Faust - A stream processing library, porting the ideas from Kafka Streams to Python.

Data Quality

Tools for ensuring data quality.

Cerberus - Lightweight, extensible data validation library for Python.
Great Expectations - A Python data validation framework that lets you test your datasets against declared expectations.
JSON Schema - A vocabulary that allows you to annotate and validate JSON documents.

Data Serialization

Tools related to data serialization.

Apache Avro - A data serialization system which is compact, fast and provides rich data structures.
Apache ORC - A self-describing type-aware columnar file format designed for Hadoop workloads.
Apache Parquet - A columnar storage format which provides efficient storage and encoding of data.
Kryo - A fast and efficient binary object graph serialization framework for Java.
ProtoBuf - Language-neutral, platform-neutral, extensible mechanism for serializing structured data.

Data Compression

Pigz - A parallel implementation of gzip for modern multi-processor, multi-core machines.
Snappy - Open source compression library that is fast, stable and robust.

Data Visualization

Tools for performing data visualization (DataViz).

Apache Superset - A modern data exploration and data visualization platform.
Count - SQL/drag-and-drop querying and visualisation tool based on notebooks.
Dash - Analytical Web Apps for Python, R, Julia, and Jupyter.
Data Studio - Reporting solution for power users who want to go beyond the data and dashboards of GA.
HUE - A mature SQL Assistant for querying Databases & Data Warehouses.
Lux - Fast and easy data exploration by automating the visualization and data analysis process.
Metabase - The simplest, fastest way to get business intelligence and analytics to everyone.
Redash - Connect to any data source, easily visualize, dashboard and share your data.
Tableau - A powerful, fast-growing data visualization tool used in the business intelligence industry.

Data Warehouse

Tools related to storing data in data warehouses (DW).

Amazon Redshift - Accelerate your time to insights with fast, easy, and secure cloud data warehousing.
Apache Hive - Facilitates reading, writing, and managing large datasets residing in distributed storage.
Google BigQuery - Serverless, highly scalable, and cost-effective multicloud data warehouse.

Database

Database tools for storing data.

Columnar Database

Apache Cassandra - Open source column-based DBMS designed to handle large amounts of data.
Apache Druid - Designed to quickly ingest massive quantities of event data, and provide low-latency queries.
Apache HBase - An open-source, distributed, versioned, column-oriented store.
Scylla - Designed to be compatible with Cassandra while achieving higher throughputs and lower latencies.

Document-Oriented Database

Apache CouchDB - An open-source document-oriented NoSQL database, implemented in Erlang.
Elasticsearch - A distributed document-oriented database with a RESTful search engine.
MongoDB - A cross-platform document database that uses JSON-like documents with optional schemas.
RethinkDB - The first open-source scalable database built for realtime applications.

Graph Database

ArangoDB - A scalable open-source multi-model database natively supporting graph, document and search.
Neo4j - A high performance graph store with all the features expected of a mature and robust database.
Titan - A highly scalable graph database optimized for storing and querying large graphs.

Key-Value Database

Apache Accumulo - A sorted, distributed key-value store that provides robust and scalable data storage.
etcd - Distributed reliable key-value store for the most critical data of a distributed system.
Memcached - A high performance multithreaded event-based key/value cache store.
Redis - An in-memory key-value database that persists on disk.

Relational Database

CockroachDB - A distributed database designed to build, scale, and manage data-intensive apps.
Crate - A distributed SQL database that makes it simple to store and analyze massive amounts of data.
MariaDB - A replacement of MySQL with more features, new storage engines and better performance.
MySQL - One of the most popular open source transactional databases.
PostgreSQL - An advanced RDBMS that supports an extended subset of the SQL standard.
RQLite - A lightweight, distributed relational database, which uses SQLite as its storage engine.

Time Series Database

Akumuli - Can be used to capture, store and process time-series data in real-time.
InfluxDB - Scalable datastore for metrics, events, and real-time analytics.
QuestDB - An open source SQL database designed to process time series data, faster.
TimescaleDB - Open-source time-series SQL database optimized for fast ingest and complex queries.

Vector Database

Milvus - An open source embedding vector similarity search engine powered by Faiss, NMSLIB and Annoy.
Pinecone - Managed and distributed vector similarity search used with a lightweight SDK.

File System

Tools related to file system and data storage.

Alluxio - A virtual distributed storage system.
Amazon Simple Storage Service (S3) - Object storage built to retrieve any amount of data from anywhere.
Apache Hadoop Distributed File System (HDFS) - A distributed file system.
GlusterFS - A software defined distributed storage that can scale to several petabytes.
Google Cloud Storage (GCS) - Object storage for companies of all sizes, to store any amount of data.
LizardFS - A highly reliable, scalable and efficient distributed file system.
MinIO - High Performance, Kubernetes Native Object Storage compatible with Amazon S3 API.
SeaweedFS - A fast distributed storage system for blobs, objects, files, and data lake.
Swift - A distributed object storage system designed to scale from a single machine to thousands of servers.

Logging and Monitoring

Tools used for logging and monitoring data workflows.

Grafana - Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, InfluxDB and more.
Loki - A horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus.
Prometheus - A monitoring system and time series database.

SQL Query Engine

Tools for parallel processing SQL statements.

Apache Drill - Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage.
Apache Impala - Lightning-fast, distributed SQL queries for petabytes of data.
Dremio - Power high-performing BI dashboards and interactive analytics directly on the data lake.
Presto - A distributed SQL query engine for big data.
Trino - A fast distributed SQL query engine for big data analytics.

Resources

Where to discover new tools and discuss existing ones.

Books

Data Mesh: Delivering Data-Driven Value at Scale (O'Reilly)
Designing Data-Intensive Applications (O'Reilly)
Fundamentals of Data Engineering (O'Reilly)
Getting Started with Impala (O'Reilly)
Learning and Operating Presto (O'Reilly)
Learning Spark: Lightning-Fast Data Analytics (O'Reilly)
Spark in Action (O'Reilly)
Spark: The Definitive Guide (O'Reilly)

Other Lists

Slack

Contributing

All contributions are welcome! Please take a look at the contribution guidelines first.

Add DVC and CML tools (MLOps)

What is this tool for?

DVC is for data versioning and pipeline tracking. CML is CI/CD for machine learning.

What's the difference between this tool and similar ones?

It works in a Git-native way and supports cloud providers such as GCP, AWS, and Azure.

Related Links

DVC website: https://dvc.org/
CML website: https://cml.dev/

DVC repository: https://github.com/iterative/dvc
CML repository: https://github.com/iterative/cml

Anyone who agrees with this pull request could submit an Approve review to it.

opened by mertbozkir 1
Added whylogs

Added whylogs in the logging and monitoring section

What is this tool for?

Data logging, drift detection, and detection of data quality degradation.

What's the difference between this tool and similar ones?

There are no other data logging tools.

Anyone who agrees with this pull request could submit an Approve review to it.

opened by dleybz 0

Owner

Kelvin S. do Prado
