2836 Python Data-science-at-scale Libraries

Simple API for UCI Machine Learning Dataset Repository (search, download, analyze)

A simple API for working with University of California, Irvine (UCI) Machine Learning (ML) repository Table of Contents Introduction About Page of the

223 Dec 5, 2022

An Integrated Experimental Platform for time series data anomaly detection.

Curve Sorry to tell contributors and users. We decided to archive the project temporarily due to the employee work plan of collaborators. There are no

486 Dec 21, 2022

Recurrent Scale Approximation (RSA) for Object Detection

Recurrent Scale Approximation (RSA) for Object Detection Codebase for Recurrent Scale Approximation for Object Detection in CNN published at ICCV 2017

239 Dec 28, 2022

CLIP (Contrastive Language–Image Pre-training) trained on Indonesian data

CLIP-Indonesian CLIP (Radford et al., 2021) is a multimodal model that can connect images and text by training a vision encoder and a text encoder joi

17 Mar 10, 2022

Python script to tabulate data formats like json, csv, html, etc

pyT PyT is a a command line tool and as well a library for visualising various data formats like: JSON HTML Table CSV XML, etc. Features Print table o

1 Dec 30, 2021

MS in Data Science capstone project. Studying attacks on autonomous vehicles.

Surveying Attack Models for CAVs Guide to Installing CARLA and Collecting Data Our project focuses on surveying attack models for Connveced Autonomous

1 Dec 9, 2021

Pipeline for training LSA models using Scikit-Learn.

Latent Semantic Analysis Pipeline for training LSA models using Scikit-Learn. Usage Instead of writing custom code for latent semantic analysis, you j

23 Sep 5, 2022

GDSC UIET KUK 📍 , welcomes you all to this amazing event where you will be introduced to the world of coding 💻 .

9 Mar 24, 2022

Optimus: the first large-scale pre-trained VAE language model

Optimus: the first pre-trained Big VAE language model This repository contains source code necessary to reproduce the results presented in the EMNLP 2

314 Dec 19, 2022

SentAugment is a data augmentation technique for semi-supervised learning in NLP.

SentAugment SentAugment is a data augmentation technique for semi-supervised learning in NLP. It uses state-of-the-art sentence embeddings to structur

363 Dec 30, 2022

Research Code for NeurIPS 2020 Spotlight paper "Large-Scale Adversarial Training for Vision-and-Language Representation Learning": UNITER adversarial training part

VILLA: Vision-and-Language Adversarial Training This is the official repository of VILLA (NeurIPS 2020 Spotlight). This repository currently supports

109 Dec 31, 2022

Creating predictive checklists from data using integer programming.

Learning Optimal Predictive Checklists A Python package to learn simple predictive checklists from data subject to customizable constraints. For more

5 Apr 19, 2022

Code and Data for the paper: Molecular Contrastive Learning with Chemical Element Knowledge Graph [AAAI 2022]

Knowledge-enhanced Contrastive Learning (KCL) Molecular Contrastive Learning with Chemical Element Knowledge Graph [ AAAI 2022 ]. We construct a Chemi

58 Dec 26, 2022

A simple algorithm for extracting tree height in sparse scene from point cloud data.

TREE HEIGHT EXTRACTION IN SPARSE SCENES BASED ON UAV REMOTE SENSING This is the offical python implementation of the paper "Tree Height Extraction in

6 Oct 28, 2022

My notes on Data structure and Algos in golang implementation and python

My notes on DS and Algo Table of Contents Arrays LinkedList Trees Types of trees: Tree/Graph Traversal Algorithms Heap Priorty Queue Trie Graphs Graph

0 Feb 13, 2022

Minimal pure Python library for working with little-endian list representation of bit strings.

bitlist Minimal Python library for working with bit vectors natively. Purpose This library allows programmers to work with a native representation of

0 Jul 25, 2022

Google, Facebook, Amazon, Microsoft, Netflix tech interview questions

Algorithm and Data Structures Interview Questions HackerRank | Practice, Tutorials & Interview Preparation Solutions This repository consists of solut

8 Oct 4, 2022

Learn to code in any language. If

Learn to Code It is an intiiative undertaken by Student Ambassadors Club, Jamshoro for students who are absolute begineers in programming and want to

15 Oct 19, 2022

Multiple Imputation with Random Forests in Python

miceforest: Fast, Memory Efficient Imputation with lightgbm Fast, memory efficient Multiple Imputation by Chained Equations (MICE) with lightgbm. The

202 Dec 31, 2022

Python ts2vg package provides high-performance algorithm implementations to build visibility graphs from time series data.

ts2vg: Time series to visibility graphs The Python ts2vg package provides high-performance algorithm implementations to build visibility graphs from t

26 Dec 17, 2022

Very useful and necessary functions that simplify working with data

Additional-function-for-pandas Very useful and necessary functions that simplify working with data random_fill_nan(module_name, nan) - Replaces all sp

2 Dec 2, 2021

A tool to nowcast quarterly data with monthly indicators: US consumption example

MIDAS_Nowcaster A tool to nowcast quarterly data with monthly indicators: US consumption example Pulls data directly from FRED from a list of codes -

3 Oct 6, 2022

Data Exfiltration without ever making a connection. Using TCP header space.

TCPwned PoC toy code to exfiltrate data without ever making a TCP connection. This will never show up in firewall logs, much less, actually be monitor

2 Nov 21, 2022

Upgini : data search library for your machine learning pipelines

Automated data search library for your machine learning pipelines → find & deliver relevant external data & features to boost ML accuracy :chart_with_upwards_trend:

175 Jan 8, 2023

A light weight data augmentation tool for training CNNs and Viola Jones detectors

hey-daug A light weight data augmentation tool for training CNNs and Viola Jones detectors (Haar Cascades). This tool inflates your data by up to six

2 Nov 23, 2019

Open source annotation tool for machine learning practitioners.

doccano doccano is an open source text annotation tool for humans. It provides annotation features for text classification, sequence labeling and sequ

7.1k Jan 1, 2023

Tools for curating biomedical training data for large-scale language modeling

242 Dec 25, 2022

The repository for the paper "When Do You Need Billions of Words of Pretraining Data?"

pretraining-learning-curves This is the repository for the paper When Do You Need Billions of Words of Pretraining Data? Edge Probing We use jiant1 fo

19 Nov 25, 2022

DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.

DeepSpeed+Megatron trained the world's most powerful language model: MT-530B DeepSpeed is hiring, come join us! DeepSpeed is a deep learning optimizat

8.4k Dec 28, 2022

Information Gain Filtration (IGF) is a method for filtering domain-specific data during language model finetuning. IGF shows significant improvements over baseline fine-tuning without data filtration.

Information Gain Filtration Information Gain Filtration (IGF) is a method for filtering domain-specific data during language model finetuning. IGF sho

4 Jul 28, 2022

The code is the training example of AAAI2022 Security AI Challenger Program Phase 8: Data Centric Robot Learning on ML models.

Example code of [Tianchi AAAI2022 Security AI Challenger Program Phase 8]

22 Oct 14, 2022

DANeS is an open-source E-newspaper dataset by collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)

DANeS - Open-source E-newspaper dataset Source: Technology vector created by macrovector - www.freepik.com. DANeS is an open-source E-newspaper datase

64 Aug 17, 2022

Label Studio is a multi-type data labeling and annotation tool with standardized output format

Website • Docs • Twitter • Join Slack Community What is Label Studio? Label Studio is an open source data labeling tool. It lets you label data types

11.7k Jan 9, 2023

Data preprocessing rosetta parser for python

datapreprocessing_rosetta_parser I've never done any NLP or text data processing before, so I wanted to use this hackathon as a learning opportunity,

2 Nov 28, 2021

Async boto3 with Autogenerated Data Classes

awspydk Async boto3 with Autogenerated JIT Data Classes Motivation This library is forked from an internal project that works with a lot of backend AW

1 Dec 5, 2021

A command line tool that creates a super timeline from SentinelOne's Deep Visibility data

S1SuperTimeline A command line tool that creates a super timeline from SentinelOne's Deep Visibility data What does it do? The script accepts a S1QL q

2 Feb 8, 2022

Visualize your pandas data with one-line code

PandasEcharts 简介基于pandas和pyecharts的可视化工具安装 pip 安装 $ pip install pandasecharts 源码安装 $ git clone https://github.com/gamersover/pandasecharts $ cd pand

2 Apr 13, 2022

A Pythonic framework for threat modeling

pytm: A Pythonic framework for threat modeling Introduction Traditional threat modeling too often comes late to the party, or sometimes not at all. In

644 Dec 20, 2022

ClearML - Auto-Magical Suite of tools to streamline your ML workflow. Experiment Manager, MLOps and Data-Management

ClearML - Auto-Magical Suite of tools to streamline your ML workflow Experiment Manager, MLOps and Data-Management ClearML Formerly known as Allegro T

3.9k Jan 1, 2023

Official Repository for "Robust On-Policy Data Collection for Data Efficient Policy Evaluation" (NeurIPS 2021 Workshop on OfflineRL).

Robust On-Policy Data Collection for Data-Efficient Policy Evaluation Source code of Robust On-Policy Data Collection for Data-Efficient Policy Evalua

Autonomous Agents Research Group (University of Edinburgh)

2 Oct 9, 2022

Self-Supervised Image Denoising via Iterative Data Refinement

Self-Supervised Image Denoising via Iterative Data Refinement Yi Zhang1, Dasong Li1, Ka Lung Law2, Xiaogang Wang1, Hongwei Qin2, Hongsheng Li1 1CUHK-S

72 Jan 1, 2023

Data Consistency for Magnetic Resonance Imaging

Data Consistency for Magnetic Resonance Imaging Data Consistency (DC) is crucial for generalization in multi-modal MRI data and robustness in detectin

19 Dec 12, 2022

Offical implementation of Shunted Self-Attention via Multi-Scale Token Aggregation

Shunted Transformer This is the offical implementation of Shunted Self-Attention via Multi-Scale Token Aggregation by Sucheng Ren, Daquan Zhou, Shengf

156 Dec 27, 2022

MapReader: A computer vision pipeline for the semantic exploration of maps at scale

MapReader A computer vision pipeline for the semantic exploration of maps at scale MapReader is an end-to-end computer vision (CV) pipeline designed b

25 Dec 26, 2022

Automatic Data-Regularized Actor-Critic (Auto-DrAC)

Auto-DrAC: Automatic Data-Regularized Actor-Critic This is a PyTorch implementation of the methods proposed in Automatic Data Augmentation for General

89 Dec 13, 2022

This repository is for the preprint "A generative nonparametric Bayesian model for whole genomes"

BEAR Overview This repository contains code associated with the preprint A generative nonparametric Bayesian model for whole genomes (2021), which pro

10 Sep 18, 2022

Code for "On Memorization in Probabilistic Deep Generative Models"

On Memorization in Probabilistic Deep Generative Models This repository contains the code necessary to reproduce the experiments in On Memorization in

3 Jun 9, 2022

Removing Inter-Experimental Variability from Functional Data in Systems Neuroscience

Removing Inter-Experimental Variability from Functional Data in Systems Neuroscience This repository is the official implementation of [https://www.bi

6 Oct 9, 2022

Auditing Black-Box Prediction Models for Data Minimization Compliance

Data-Minimization-Auditor An auditing tool for model-instability based data minimization that is introduced in "Auditing Black-Box Prediction Models f

2 Mar 24, 2022

Evolutionary Scale Modeling (esm): Pretrained language models for proteins

Evolutionary Scale Modeling This repository contains code and pre-trained weights for Transformer protein language models from Facebook AI Research, i

1.6k Jan 9, 2023

[NeurIPS 2021] "G-PATE: Scalable Differentially Private Data Generator via Private Aggregation of Teacher Discriminators"

G-PATE This is the official code base for our NeurIPS 2021 paper: "G-PATE: Scalable Differentially Private Data Generator via Private Aggregation of T

14 Oct 12, 2022

TAUFE: Task-Agnostic Undesirable Feature DeactivationUsing Out-of-Distribution Data

A deep neural network (DNN) has achieved great success in many machine learning tasks by virtue of its high expressive power. However, its prediction can be easily biased to undesirable features, which are not essential for solving the target task and are even imperceptible to a human, thereby resulting in poor generalization

8 Dec 7, 2022

Equivariant layers for RC-complement symmetry in DNA sequence data

Equi-RC Equivariant layers for RC-complement symmetry in DNA sequence data This is a repository that implements the layers as described in "Reverse-Co

7 May 19, 2022

Hide sensitive information in images

Data-Preserved Script allowing to blur the most sensitive information on images. Prerequisites Before you begin, ensure you have met the following req

2 Dec 1, 2021

Official PyTorch implementation of Data-free Knowledge Distillation for Object Detection, WACV 2021.

Introduction This repository is the official PyTorch implementation of Data-free Knowledge Distillation for Object Detection, WACV 2021. Data-free Kno

50 Jan 5, 2023

PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models

PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models This repository is the official implementation of the fol

41 Dec 6, 2022

Sign data using symmetric-key algorithm encryption.

Sign data using symmetric-key algorithm encryption. Validate signed data and identify possible validation errors. Uses sha-(1, 224, 256, 385 and 512)/hmac for signature encryption. Custom hash algorithms are allowed. Useful shortcut functions for signing (and validating) dictionaries and URLs.

39 Jun 10, 2022

A Python 3 library making time series data mining tasks, utilizing matrix profile algorithms

MatrixProfile MatrixProfile is a Python 3 library, brought to you by the Matrix Profile Foundation, for mining time series data. The Matrix Profile is

302 Dec 29, 2022

CCPD: a diverse and well-annotated dataset for license plate detection and recognition

CCPD (Chinese City Parking Dataset, ECCV) UPdate on 10/03/2019. CCPD Dataset is now updated. We are confident that images in subsets of CCPD is much m

1.8k Dec 30, 2022

Primitives for machine learning and data science.

An Open Source Project from the Data to AI Lab, at MIT MLPrimitives Pipelines and primitives for machine learning and data science. Documentation: htt

65 Dec 29, 2022

MassiveSumm: a very large-scale, very multilingual, news summarisation dataset

MassiveSumm: a very large-scale, very multilingual, news summarisation dataset This repository contains links to data and code to fetch and reproduce

19 Dec 16, 2022

An helper library to scrape data from Instagram effortlessly, using the Influencer Hunters APIs.

Instagram Scraper An utility library to scrape data from Instagram hassle-free Go to the website » View Demo · Report Bug · Request Feature About The

2 Jul 6, 2022

tradingview socket api for fetching real time prices.

tradingView-API tradingview socket api for fetching real time prices. How to run git clone https://github.com/mohamadkhalaj/tradingView-API.git cd tra

35 Dec 31, 2022

This repository is used to provide data to zzhack,

This repository is used to provide data to zzhack, but you don't have to care about anything, just write your thinking down, and you can see your thinking is rendered in zzhack perfectly

5 Apr 29, 2022

[NeurIPS 2021] Deceive D: Adaptive Pseudo Augmentation for GAN Training with Limited Data

Deceive D: Adaptive Pseudo Augmentation for GAN Training with Limited Data (NeurIPS 2021) This repository provides the official PyTorch implementation

155 Nov 30, 2021

Scrapping the data from each page of biocides listed on the BAUA website into a csv file

1 Nov 30, 2021

Experiments with Fourier layers on simulation data.

Factorized Fourier Neural Operators This repository contains the code to reproduce the results in our NeurIPS 2021 ML4PS workshop paper, Factorized Fo

57 Dec 25, 2022

The first GANs-based omics-to-omics translation framework

OmiTrans Please also have a look at our multi-omics multi-task DL freamwork 👀 : OmiEmbed The FIRST GANs-based omics-to-omics translation framework Xi

1 Nov 30, 2021

Auxiliary data to the CHIIR paper Searching to Learn with Instructional Scaffolding

Searching to Learn with Instructional Scaffolding This is the data and analysis code for the paper "Searching to Learn with Instructional Scaffolding"

2 Mar 2, 2022

A benchmark dataset for emulating atmospheric radiative transfer in weather and climate models with machine learning (NeurIPS 2021 Datasets and Benchmarks Track)

ClimART - A Benchmark Dataset for Emulating Atmospheric Radiative Transfer in Weather and Climate Models Official PyTorch Implementation Using deep le

21 Dec 31, 2022

Source Code and data for my paper titled Linguistic Knowledge in Data Augmentation for Natural Language Processing: An Example on Chinese Question Matching

Description The source code and data for my paper titled Linguistic Knowledge in Data Augmentation for Natural Language Processing: An Example on Chin

3 Jun 28, 2022

A program that analyzes data from inertia measurement units installed in aircraft and generates g-exceedance curves.

1 Dec 2, 2021

A multi-scale unsupervised learning for deformable image registration

A multi-scale unsupervised learning for deformable image registration Shuwei Shao, Zhongcai Pei, Weihai Chen, Wentao Zhu, Xingming Wu and Baochang Zha

2 Apr 13, 2022

Data wrangling & common calculations for results from qMem measurement software

qMem Datawrangler This script processes output of qMem measurement software into an Origin ® compatible *.csv files and matplotlib graphs to quickly v

1 Nov 30, 2021

Django based webapp pulling in crypto news and price data via api

Deploy Django in Production FTA project implementing containerization of Django Web Framework into Docker to be placed into Azure Container Services a

0 Sep 21, 2022

Code for generating synthetic text images as described in "Synthetic Data for Text Localisation in Natural Images", Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, CVPR 2016.

SynthText Code for generating synthetic text images as described in "Synthetic Data for Text Localisation in Natural Images", Ankush Gupta, Andrea Ved

1.8k Dec 28, 2022

🛰️ Scripts démontrant l'utilisation de l'imagerie RADARSAT-1 à partir d'un seau AWS | 🛰️ Scripts demonstrating the use of RADARSAT-1 imagery from an AWS bucket

Agence spatiale canadienne - Canadian Space Agency

4 May 18, 2022

Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2021).

Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER. @inproceedings{tedes

40 Dec 11, 2022

Python Data-science-at-scale Resources

Python data-science-at-scale Libraries

Simple API for UCI Machine Learning Dataset Repository (search, download, analyze)

An Integrated Experimental Platform for time series data anomaly detection.

Recurrent Scale Approximation (RSA) for Object Detection

CLIP (Contrastive Language–Image Pre-training) trained on Indonesian data

Python script to tabulate data formats like json, csv, html, etc

MS in Data Science capstone project. Studying attacks on autonomous vehicles.

Pipeline for training LSA models using Scikit-Learn.

GDSC UIET KUK 📍 , welcomes you all to this amazing event where you will be introduced to the world of coding 💻 .

Optimus: the first large-scale pre-trained VAE language model

SentAugment is a data augmentation technique for semi-supervised learning in NLP.

Research Code for NeurIPS 2020 Spotlight paper "Large-Scale Adversarial Training for Vision-and-Language Representation Learning": UNITER adversarial training part

Creating predictive checklists from data using integer programming.

Code and Data for the paper: Molecular Contrastive Learning with Chemical Element Knowledge Graph [AAAI 2022]

A simple algorithm for extracting tree height in sparse scene from point cloud data.

My notes on Data structure and Algos in golang implementation and python

Minimal pure Python library for working with little-endian list representation of bit strings.

Google, Facebook, Amazon, Microsoft, Netflix tech interview questions

Learn to code in any language. If

Multiple Imputation with Random Forests in Python

Python ts2vg package provides high-performance algorithm implementations to build visibility graphs from time series data.

Very useful and necessary functions that simplify working with data

A tool to nowcast quarterly data with monthly indicators: US consumption example

Data Exfiltration without ever making a connection. Using TCP header space.

Upgini : data search library for your machine learning pipelines

A light weight data augmentation tool for training CNNs and Viola Jones detectors

Open source annotation tool for machine learning practitioners.

Tools for curating biomedical training data for large-scale language modeling

The repository for the paper "When Do You Need Billions of Words of Pretraining Data?"

DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.

Information Gain Filtration (IGF) is a method for filtering domain-specific data during language model finetuning. IGF shows significant improvements over baseline fine-tuning without data filtration.

The code is the training example of AAAI2022 Security AI Challenger Program Phase 8: Data Centric Robot Learning on ML models.

DANeS is an open-source E-newspaper dataset by collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)

Label Studio is a multi-type data labeling and annotation tool with standardized output format

Data preprocessing rosetta parser for python

Async boto3 with Autogenerated Data Classes

A command line tool that creates a super timeline from SentinelOne's Deep Visibility data

Visualize your pandas data with one-line code

A Pythonic framework for threat modeling

ClearML - Auto-Magical Suite of tools to streamline your ML workflow. Experiment Manager, MLOps and Data-Management

Official Repository for "Robust On-Policy Data Collection for Data Efficient Policy Evaluation" (NeurIPS 2021 Workshop on OfflineRL).

Self-Supervised Image Denoising via Iterative Data Refinement

Data Consistency for Magnetic Resonance Imaging

Offical implementation of Shunted Self-Attention via Multi-Scale Token Aggregation

MapReader: A computer vision pipeline for the semantic exploration of maps at scale

Automatic Data-Regularized Actor-Critic (Auto-DrAC)

This repository is for the preprint "A generative nonparametric Bayesian model for whole genomes"

Code for "On Memorization in Probabilistic Deep Generative Models"

Removing Inter-Experimental Variability from Functional Data in Systems Neuroscience

Auditing Black-Box Prediction Models for Data Minimization Compliance

Evolutionary Scale Modeling (esm): Pretrained language models for proteins

[NeurIPS 2021] "G-PATE: Scalable Differentially Private Data Generator via Private Aggregation of Teacher Discriminators"

TAUFE: Task-Agnostic Undesirable Feature DeactivationUsing Out-of-Distribution Data

Equivariant layers for RC-complement symmetry in DNA sequence data

Hide sensitive information in images

Official PyTorch implementation of Data-free Knowledge Distillation for Object Detection, WACV 2021.

PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models

Sign data using symmetric-key algorithm encryption.

A Python 3 library making time series data mining tasks, utilizing matrix profile algorithms

CCPD: a diverse and well-annotated dataset for license plate detection and recognition

Primitives for machine learning and data science.

MassiveSumm: a very large-scale, very multilingual, news summarisation dataset

An helper library to scrape data from Instagram effortlessly, using the Influencer Hunters APIs.

tradingview socket api for fetching real time prices.

This repository is used to provide data to zzhack,

[NeurIPS 2021] Deceive D: Adaptive Pseudo Augmentation for GAN Training with Limited Data

Scrapping the data from each page of biocides listed on the BAUA website into a csv file

Experiments with Fourier layers on simulation data.

The first GANs-based omics-to-omics translation framework

Auxiliary data to the CHIIR paper Searching to Learn with Instructional Scaffolding

A benchmark dataset for emulating atmospheric radiative transfer in weather and climate models with machine learning (NeurIPS 2021 Datasets and Benchmarks Track)

Source Code and data for my paper titled Linguistic Knowledge in Data Augmentation for Natural Language Processing: An Example on Chinese Question Matching

A program that analyzes data from inertia measurement units installed in aircraft and generates g-exceedance curves.

A multi-scale unsupervised learning for deformable image registration

Data wrangling & common calculations for results from qMem measurement software

Django based webapp pulling in crypto news and price data via api

Code for generating synthetic text images as described in "Synthetic Data for Text Localisation in Natural Images", Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, CVPR 2016.

🛰️ Scripts démontrant l'utilisation de l'imagerie RADARSAT-1 à partir d'un seau AWS | 🛰️ Scripts demonstrating the use of RADARSAT-1 imagery from an AWS bucket

Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2021).