2637 Repositories
Python data-engineering Libraries
Scrapping the data from each page of biocides listed on the BAUA website into a csv file
Scrapping the data from each page of biocides listed on the BAUA website into a csv file
Experiments with Fourier layers on simulation data.
Factorized Fourier Neural Operators This repository contains the code to reproduce the results in our NeurIPS 2021 ML4PS workshop paper, Factorized Fo
The first GANs-based omics-to-omics translation framework
OmiTrans Please also have a look at our multi-omics multi-task DL freamwork 👀 : OmiEmbed The FIRST GANs-based omics-to-omics translation framework Xi
Auxiliary data to the CHIIR paper Searching to Learn with Instructional Scaffolding
Searching to Learn with Instructional Scaffolding This is the data and analysis code for the paper "Searching to Learn with Instructional Scaffolding"
Source Code and data for my paper titled Linguistic Knowledge in Data Augmentation for Natural Language Processing: An Example on Chinese Question Matching
Description The source code and data for my paper titled Linguistic Knowledge in Data Augmentation for Natural Language Processing: An Example on Chin
A program that analyzes data from inertia measurement units installed in aircraft and generates g-exceedance curves.
A program that analyzes data from inertia measurement units installed in aircraft and generates g-exceedance curves.
Data wrangling & common calculations for results from qMem measurement software
qMem Datawrangler This script processes output of qMem measurement software into an Origin ® compatible *.csv files and matplotlib graphs to quickly v
Django based webapp pulling in crypto news and price data via api
Deploy Django in Production FTA project implementing containerization of Django Web Framework into Docker to be placed into Azure Container Services a
Code for generating synthetic text images as described in "Synthetic Data for Text Localisation in Natural Images", Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, CVPR 2016.
SynthText Code for generating synthetic text images as described in "Synthetic Data for Text Localisation in Natural Images", Ankush Gupta, Andrea Ved
🛰️ Scripts démontrant l'utilisation de l'imagerie RADARSAT-1 à partir d'un seau AWS | 🛰️ Scripts demonstrating the use of RADARSAT-1 imagery from an AWS bucket
🛰️ Scripts démontrant l'utilisation de l'imagerie RADARSAT-1 à partir d'un seau AWS | 🛰️ Scripts demonstrating the use of RADARSAT-1 imagery from an AWS bucket
Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2021).
Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER. @inproceedings{tedes
This is a python script to grab data from Zyxel NSA310 NAS and display in Home Asisstant as sensors.
Home-Assistant Python Scripts Python Scripts for Home-Assistant (http://www.home-assistant.io) Zyxel-NSA310-Home-Assistant Monitoring This is a python
A real world application of a Recurrent Neural Network on a binary classification of time series data
What is this This is a real world application of a Recurrent Neural Network on a binary classification of time series data. This project includes data
APIlocal_dbAWS_RDS - Disclaimer! All data used is for educational purposes only.
APIlocal_dbAWS_RDS Disclaimer! All data used is for educational purposes only. ETL pipeline diagram. Aim of project By creating a fully working pipe
Flexible time series feature extraction & processing
tsflex is a toolkit for flexible time series processing & feature extraction, that is efficient and makes few assumptions about sequence data. Useful
DuBE: Duple-balanced Ensemble Learning from Skewed Data
DuBE: Duple-balanced Ensemble Learning from Skewed Data "Towards Inter-class and Intra-class Imbalance in Class-imbalanced Learning" (IEEE ICDE 2022 S
Fast mesh denoising with data driven normal filtering using deep variational autoencoders
Fast mesh denoising with data driven normal filtering using deep variational autoencoders This is an implementation for the paper entitled "Fast mesh
IMBENS: class-imbalanced ensemble learning in Python.
IMBENS: class-imbalanced ensemble learning in Python. Links: [Documentation] [Gallery] [PyPI] [Changelog] [Source] [Download] [知乎/Zhihu] [中文README] [a
Research code for the paper "Variational Gibbs inference for statistical estimation from incomplete data".
Variational Gibbs inference (VGI) This repository contains the research code for Simkus, V., Rhodes, B., Gutmann, M. U., 2021. Variational Gibbs infer
Deep Learning Emotion decoding using EEG data from Autism individuals
Deep Learning Emotion decoding using EEG data from Autism individuals This repository includes the python and matlab codes using for processing EEG 2D
This repository contains all code and data for the Inside Out Visual Place Recognition task
Inside Out Visual Place Recognition This repository contains code and instructions to reproduce the results for the Inside Out Visual Place Recognitio
Traditional Chinese Text Recognition Dataset: Synthetic Dataset and Labeled Data
Traditional Chinese Text Recognition Dataset: Synthetic Dataset and Labeled Data Authors: Yi-Chang Chen, Yu-Chuan Chang, Yen-Cheng Chang and Yi-Ren Ye
Tool for running a high throughput data ingestion/transformation workload with MongoDB
Mongo Mangler The mongo-mangler tool is a lightweight Python utility, which you can run from a low-powered machine to execute a high throughput data i
Gathering data of likes on Tinder within the past 7 days
tinder_likes_data Gathering data of Likes Sent on Tinder within the past 7 days. Versions November 25th, 2021 - Functionality to get the name and age
Empresas do Brasil (CNPJs)
Biblioteca em Python que coleta informações cadastrais de empresas do Brasil (CNPJ) obtidas de fontes oficiais (Receita Federal) e exporta para um formato legível por humanos (CSV ou JSON).
E-Commerce recommender demo with real-time data and a graph database
🔍 E-Commerce recommender demo 🔍 This is a simple stream setup that uses Memgraph to ingest real-time data from a simulated online store. Data is str
Redis OM Python makes it easy to model Redis data in your Python applications.
Object mapping, and more, for Redis and Python Redis OM Python makes it easy to model Redis data in your Python applications. Redis OM Python | Redis
A procedural Blender pipeline for photorealistic training image generation
BlenderProc2 A procedural Blender pipeline for photorealistic rendering. Documentation | Tutorials | Examples | ArXiv paper | Workshop paper Features
Open source platform for Data Science Management automation
Hydrosphere examples This repo contains demo scenarios and pre-trained models to show Hydrosphere capabilities. Data and artifacts management Some mod
The lightweight PyTorch wrapper for high-performance AI research. Scale your models, not the boilerplate.
The lightweight PyTorch wrapper for high-performance AI research. Scale your models, not the boilerplate. Website • Key Features • How To Use • Docs •
Learning from graph data using Keras
Steps to run = Download the cora dataset from this link : https://linqs.soe.ucsc.edu/data unzip the files in the folder input/cora cd code python eda
Data pipelines built with polars
valves Warning: the project is very much work in progress. Valves is a collection of functions for your data .pipe()-lines. This project aimes to host
CinnaMon is a Python library which offers a number of tools to detect, explain, and correct data drift in a machine learning system
CinnaMon is a Python library which offers a number of tools to detect, explain, and correct data drift in a machine learning system
🗂️ 🔍 Geospatial Data Management and Search API - Django Apps
Geospatial Data API in Django Resonant GeoData (RGD) is a series of Django applications well suited for cataloging and searching annotated geospatial
A package designed to scrape data from Yahoo Finance.
yahoostock A package designed to scrape data from Yahoo Finance. Installation The most simple installation method is through PIP. pip install yahoosto
An index of algorithms for learning causality with data
awesome-causality-algorithms An index of algorithms for learning causality with data. Please cite our survey paper if this index is helpful. @article{
Simple and Distributed Machine Learning
Synapse Machine Learning SynapseML (previously MMLSpark) is an open source library to simplify the creation of scalable machine learning pipelines. Sy
Qlib is an AI-oriented quantitative investment platform
Qlib is an AI-oriented quantitative investment platform, which aims to realize the potential, empower the research, and create the value of AI technologies in quantitative investment.
Automatically download the cwru data set, and then divide it into training data set and test data set
Automatically download the cwru data set, and then divide it into training data set and test data set.自动下载cwru数据集,然后分训练数据集和测试数据集
Improve current data preprocessing for FTM's WOB data to analyze Shell and Dutch Governmental contacts.
We're the hackathon leftovers, but we are Too Good To Go ;-). A repo by Lukas Schubotz and Raymon van Dinter. We aim to improve current data preprocessing for FTM's WOB data to analyze Shell and Dutch Governmental contacts.
Project under the certification "Data Analysis with Python" on FreeCodeCamp
Sea Level Predictor Assignment You will anaylize a dataset of the global average sea level change since 1880. You will use the data to predict the sea
Tools for the analysis, simulation, and presentation of Lorentz TEM data.
ltempy ltempy is a set of tools for Lorentz TEM data analysis, simulation, and presentation. Features Single Image Transport of Intensity Equation (SI
The FIRST GANs-based omics-to-omics translation framework
OmiTrans Please also have a look at our multi-omics multi-task DL freamwork 👀 : OmiEmbed The FIRST GANs-based omics-to-omics translation framework Xi
Finds, downloads, parses, and standardizes public bikeshare data into a standard pandas dataframe format
Finds, downloads, parses, and standardizes public bikeshare data into a standard pandas dataframe format.
NYCT-GTFS - Real-time NYC subway data parsing for humans
NYCT-GTFS - Real-time NYC subway data parsing for humans This python library provides a human-friendly, native python interface for dealing with the N
Data and codes for ACL 2021 paper: Towards Emotional Support Dialog Systems
Emotional-Support-Conversation Copyright © 2021 CoAI Group, Tsinghua University. All rights reserved. Data and codes are for academic research use onl
Gathers data and displays metrics related to climate change and resource depletion on a PowerBI report.
Apocalypse Status Dashboard Purpose Climate change and resource depletion are grave long-term dangers. The code in this repository will pull data from
Compilation of resources and insights that helped me on my journey to data scientist
Compilation of resources and insights that helped me on my journey to data scientist
A collection of online resources to help you on your Tech journey.
Everything Tech Resources & Projects About The Project Coming from an engineering background and looking to up skill yourself on a new field can be di
This is Assignment1 code for the Web Data Processing System.
This is a Python program to Entity Linking by processing WARC files. We recognize entities from web pages and link them to a Knowledge Base(Wikidata).
In the case of your data having only 1 channel while want to use timm models
timm_custom Description In the case of your data having only 1 channel while want to use timm models (with or without pretrained weights), run the fol
Minimum Bounding Box of Geospatial data
BBOX Problem definition: The spatial data users often are required to obtain the coordinates of the minimum bounding box of vector and raster data in
A logical, reasonably standardized, but flexible project structure for doing and sharing data science work.
Cookiecutter Data Science A logical, reasonably standardized, but flexible project structure for doing and sharing data science work. Project homepage
Advanced Pandas Vault — Utilities, Functions and Snippets (by @firmai).
PandasVault — Advanced Pandas Functions and Code Snippets The only Pandas utility package you would ever need. It has no exotic external dependencies
PandaPy has the speed of NumPy and the usability of Pandas 10x to 50x faster (by @firmai)
PandaPy "I came across PandaPy last week and have already used it in my current project. It is a fascinating Python library with a lot of potential to
Automatically visualize your pandas dataframe via a single print! 📊 💡
A Python API for Intelligent Visual Discovery Lux is a Python library that facilitate fast and easy data exploration by automating the visualization a
Intel(R) Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application
Intel(R) Extension for Scikit-learn* Installation | Documentation | Examples | Support | FAQ With Intel(R) Extension for Scikit-learn you can accelera
Finding project directories in Python (data science) projects, just like there R rprojroot and here packages
Find relative paths from a project root directory Finding project directories in Python (data science) projects, just like there R here and rprojroot
Intake is a lightweight package for finding, investigating, loading and disseminating data.
Intake: A general interface for loading data Intake is a lightweight set of tools for loading and sharing data in data science projects. Intake helps
Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and Python functions.
Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and many other libraries. Documenta
A columnar data container that can be compressed.
Unmaintained Package Notice Unfortunately, and due to lack of resources, the Blosc Development Team is unable to maintain this package anymore. During
Metrics to evaluate quality and efficacy of synthetic datasets.
An Open Source Project from the Data to AI Lab, at MIT Metrics for Synthetic Data Generation Projects Website: https://sdv.dev Documentation: https://
NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
A DSL for data-driven computational pipelines
"Dataflow variables are spectacularly expressive in concurrent programming" Henri E. Bal , Jennifer G. Steiner , Andrew S. Tanenbaum Quick overview Ne
Integrate bus data from a variety of sources (batch processing and real time processing).
Purpose: This is integrate bus data from a variety of sources such as: csv, json api, sensor data ... into Relational Database (batch processing and r
Python script for diving image data to train test and val
dataset-division-to-train-val-test-python python script for dividing image data to train test and val If you have an image dataset in the following st
Data repo for one-among.us
Our Data Data repo for one-among.us File Structure Directory /people/userid/: Data for a specific person info.json5: Profile information page.md: Pr
Pydantic based mock data generation
This library offers powerful mock data generation capabilities for pydantic based models. It can also be used with other libraries that use pydantic as a foundation, for example SQLModel, Beanie and ormar.
Maze generator and solver with python
Procedural-Maze-Generator-Algorithms Check out my youtube channel : Auctux Ressources Thanks to Jamis Buck Book : Mazes for programmers Requirements P
Helping you manage your data science projects sanely.
PyDS CLI Helping you manage your data science projects sanely. Requirements Anaconda/Miniconda/Miniforge/Mambaforge (Mambaforge recommended!) git on y
Source files for the data lake demo video using the AWS TICKIT database
Data Lake Demo Source code for video demonstration detailed in the post, Building a Simple Data Lake on AWS . Build a simple data lake on AWS using a
The official PyTorch code for NeurIPS 2021 ML4AD Paper, "Does Thermal data make the detection systems more reliable?"
MultiModal-Collaborative (MMC) Learning Framework for integrating RGB and Thermal spectral modalities This is the official code for NeurIPS 2021 Machi
This repository contains part of the code used to make the images visible in the article "How does an AI Imagine the Universe?" published on Towards Data Science.
Generative Adversarial Network - Generating Universe This repository contains part of the code used to make the images visible in the article "How doe
eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.
Command line utilities for tabular data files This is a set of command line utilities for manipulating large tabular data files. Files of numeric and
Desafio proposto pela IGTI em seu bootcamp de Cloud Data Engineer
Desafio Modulo 4 - Cloud Data Engineer Bootcamp - IGTI Objetivos Criar infraestrutura como código Utuilizando um cluster Kubernetes na Azure Ingestão
Web scraped S&P 500 Data from Wikipedia using Pandas and performed Exploratory Data Analysis on the data.
Web scraped S&P 500 Data from Wikipedia using Pandas and performed Exploratory Data Analysis on the data. Then used Yahoo Finance to get the related stock data and displayed them in the form of charts.
The fastai book, published as Jupyter Notebooks
English / Spanish / Korean / Chinese / Bengali / Indonesian The fastai book These notebooks cover an introduction to deep learning, fastai, and PyTorc
Extension to fastai for volumetric medical data
FAIMED 3D use fastai to quickly train fully three-dimensional models on radiological data Classification from faimed3d.all import * Load data in vari
Compare MLOps Platforms. Breakdowns of SageMaker, VertexAI, AzureML, Dataiku, Databricks, h2o, kubeflow, mlflow...
Compare MLOps Platforms. Breakdowns of SageMaker, VertexAI, AzureML, Dataiku, Databricks, h2o, kubeflow, mlflow...
Graph Convolutional Neural Networks with Data-driven Graph Filter (GCNN-DDGF)
Graph Convolutional Gated Recurrent Neural Network (GCGRNN) Improved from Graph Convolutional Neural Networks with Data-driven Graph Filter (GCNN-DDGF
Project5 Data processing system
Project5-Data-processing-system User just needed to copy both these file to a folder and open Project5.py using cmd or using any python ide. It is to
Epidemiology analysis package
zEpid zEpid is an epidemiology analysis package, providing easy to use tools for epidemiologists coding in Python 3.5+. The purpose of this library is
Explorative Data Analysis Guidelines
Explorative Data Analysis Get data into a usable format! Find out if the following predictive modeling phase will be successful! Combine everything in
cleanlab is the data-centric ML ops package for machine learning with noisy labels.
cleanlab is the data-centric ML ops package for machine learning with noisy labels. cleanlab cleans labels and supports finding, quantifying, and lear
Data imputations library to preprocess datasets with missing data
Impyute is a library of missing data imputation algorithms. This library was designed to be super lightweight, here's a sneak peak at what impyute can do.
Kaggler is a Python package for lightweight online machine learning algorithms and utility functions for ETL and data analysis.
Kaggler is a Python package for lightweight online machine learning algorithms and utility functions for ETL and data analysis. It is distributed under the MIT License.
Skoot is a lightweight python library of machine learning transformer classes that interact with scikit-learn and pandas.
Skoot is a lightweight python library of machine learning transformer classes that interact with scikit-learn and pandas. Its objective is to ex
dirty_cat is a Python module for machine-learning on dirty categorical variables.
dirty_cat dirty_cat is a Python module for machine-learning on dirty categorical variables.
Pypeln is a simple yet powerful Python library for creating concurrent data pipelines.
Pypeln Pypeln (pronounced as "pypeline") is a simple yet powerful Python library for creating concurrent data pipelines. Main Features Simple: Pypeln
A Guide for Feature Engineering and Feature Selection, with implementations and examples in Python.
Feature Engineering & Feature Selection A comprehensive guide [pdf] [markdown] for Feature Engineering and Feature Selection, with implementations and
apricot implements submodular optimization for the purpose of selecting subsets of massive data sets to train machine learning models quickly.
Please consider citing the manuscript if you use apricot in your academic work! You can find more thorough documentation here. apricot implements subm
MCML is a toolkit for semi-supervised dimensionality reduction and quantitative analysis of Multi-Class, Multi-Label data
MCML is a toolkit for semi-supervised dimensionality reduction and quantitative analysis of Multi-Class, Multi-Label data. We demonstrate its use
Dump Data from FTDI Serial Port to Binary File on MacOS
Dump Data from FTDI Serial Port to Binary File on MacOS
Crypto Stats and Tweets Data Pipeline using Airflow
Crypto Stats and Tweets Data Pipeline using Airflow Introduction Project Overview This project was brought upon through Udacity's nanodegree program.
A package to predict protein inter-residue geometries from sequence data
trRosetta This package is a part of trRosetta protein structure prediction protocol developed in: Improved protein structure prediction using predicte
An optimization and data collection toolbox for convenient and fast prototyping of computationally expensive models.
An optimization and data collection toolbox for convenient and fast prototyping of computationally expensive models. Hyperactive: is very easy to lear
68 keypoint annotations for COFW test data
68 keypoint annotations for COFW test data This repository contains manually annotated 68 keypoints for COFW test data (original annotation of CFOW da
Centralized whale instance using github actions, sourcing metadata from bigquery-public-data.
Whale Demo Instance: Bigquery Public Data This is a fully-functioning demo instance of the whale data catalog, actively scraping data from Bigquery's
Fast Fourier Transform-accelerated Interpolation-based t-SNE (FIt-SNE)
FFT-accelerated Interpolation-based t-SNE (FIt-SNE) Introduction t-Stochastic Neighborhood Embedding (t-SNE) is a highly successful method for dimensi
An interactive UMAP visualization of the MNIST data set.
Code for an interactive UMAP visualization of the MNIST data set. Demo at https://grantcuster.github.io/umap-explorer/. You can read more about the de