Collie is for uncovering RDMA NIC performance anomalies

Bytedance Inc.

Last update: Dec 11, 2022

Overview

Prerequisite
Quick Start
Content
Publication
Copyright

Prerequisite

Two hosts with RDMA NICs.
- Connecting both hosts to the same switch is recommended, since Collie currently does not take network (fabric) effects into consideration. Collie should still work as long as the two hosts are connected and RDMA communication is enabled.
Passwordless SSH login (e.g., SSH public/private key login).
- Collie currently uses passwordless SSH login to run traffic_engine on different hosts.
Google gflags and glog libraries installed.
- Collie uses glog for logging and gflags for command-line flag processing.
Collie should support all types of RDMA NICs and drivers that follow the IB verbs specification, but so far we have only tested it with Mellanox and Broadcom RNICs.

Quick Start

Environment Setup

Install prerequisites.

apt-get install -y libgflags-dev libgoogle-glog-dev

Set up passwordless SSH login.
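Before running Collie, it can help to confirm that key-based login really works without a prompt. The following is a minimal sketch, not part of Collie: the function names, the example username, and the host address are hypothetical. It relies on the standard OpenSSH `BatchMode=yes` option, which makes `ssh` fail instead of asking for a password.

```python
import subprocess


def build_ssh_check_command(username, host, timeout_s=5):
    """Return an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                 # never fall back to interactive prompts
        "-o", f"ConnectTimeout={timeout_s}",   # give up quickly on unreachable hosts
        f"{username}@{host}",
        "true",                                # no-op command run on the remote host
    ]


def passwordless_ssh_works(username, host, timeout_s=5):
    """Return True if key-based login to the host succeeds without any prompt."""
    cmd = build_ssh_check_command(username, host, timeout_s)
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s + 5)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

Calling `passwordless_ssh_works("alice", "192.0.2.10")` for each host in the test setup (with your own username and addresses) should return True on both before Collie is started.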

Build Traffic Engine

Build the traffic engine without GPU and CUDA:

cd traffic_engine && make -j8

Or build the traffic engine with GPU Direct RDMA support:

cd traffic_engine && GDR=1 make -j8

NOTICE: According to the GPUDirect RDMA documentation, GDR is supported only on Tesla or Quadro GPUs.

Please refer to traffic_engine/README for more details.

How to Run: Arguments and Examples

Collie uses a JSON configuration file to set parameters for a given RDMA subsystem.

Configuration Example: see ./example.json
- username -- Collie uses SSH to run engines on different hosts, so it needs the username for login.
- iplist -- the client IP and the server IP, given in a list.
- logpath -- the logging path for Collie. Detailed anomaly results and the scripts to reproduce them are written here.
- engine -- the path to the traffic engine.
- iters -- the maximum number of tests that Collie will run.
- bars -- the user's expected performance.
  - tx_pfc_bar -- TX (sent) PFC pause duration in us per second.
  - rx_pfc_bar -- RX (received) PFC pause duration in us per second.
  - bps_bar -- bits per second of the entire NIC.
  - pps_bar -- packets per second of the entire NIC.
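As a rough illustration of the parameters listed above, the sketch below builds a configuration in Python and checks that every listed key is present. The key names come from the list; everything else is an assumption: the nesting of the four bars under `bars`, the two-entry `iplist`, the engine path, and all values are placeholders, and `validate_config` is a hypothetical helper rather than part of Collie. Treat ./example.json as the authoritative layout.

```python
import json

REQUIRED_KEYS = ("username", "iplist", "logpath", "engine", "iters", "bars")
BAR_KEYS = ("tx_pfc_bar", "rx_pfc_bar", "bps_bar", "pps_bar")


def validate_config(config):
    """Return a list of problems found in a Collie-style config dict."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    bars = config.get("bars", {})
    problems += [f"missing bar: {key}" for key in BAR_KEYS if key not in bars]
    # Assumption: the list holds exactly one client IP and one server IP.
    if len(config.get("iplist", [])) != 2:
        problems.append("iplist should hold two addresses (client and server)")
    return problems


# All values below are placeholders, not recommended settings.
example_config = {
    "username": "alice",
    "iplist": ["192.0.2.1", "192.0.2.2"],
    "logpath": "./collie_logs",
    "engine": "./traffic_engine/collie_engine",  # assumed binary location
    "iters": 100,
    "bars": {
        "tx_pfc_bar": 1000,  # us of TX PFC pause per second
        "rx_pfc_bar": 1000,  # us of RX PFC pause per second
        "bps_bar": 90e9,     # bits per second of the entire NIC
        "pps_bar": 10e6,     # packets per second of the entire NIC
    },
}

print(json.dumps(example_config, indent=2))
```

Running `validate_config` on a loaded file before starting a long search is a cheap way to catch a misspelled key early.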

Quick Run Example

python3 search/collie.py --config ./example.json

Content

Collie consists of two components: the traffic engine and the search algorithms (the monitor is included as part of the search algorithms).

Traffic Engine (./traffic_engine)

The traffic engine is an independent component implemented in C/C++. Users can use the engine to generate flexible traffic of different patterns. See ./traffic_engine/README for more details and examples of complex traffic patterns. We recommend using this tool to reproduce the anomalies (see the Appendix of our NSDI paper).

Search Algorithms (./search)

Our simulated-annealing (SA) based algorithm and minimal feature set (MFS) algorithm are implemented as Python scripts.
- space.py -- the search space. Space defines the search space (upper/lower bounds and granularity for each parameter). Each Point has several Traffics (e.g., one A->B and one B->A). Each Traffic has two Endhosts, one server and one client, as well as many other attributes that describe the traffic (e.g., QP type).
- engine.py -- given a Point, runs collie_engine to set up the corresponding traffic described in the Point. If users need to set up traffic in a different way (rather than over SSH), please modify the Engine class.
- anneal.py -- implements the simulated-annealing based algorithm and the minimal feature set algorithm. To change the temperature and mutation logic, modify this file.
- logger.py -- helper functions for logging results and reproduction scripts.
- bone.py -- monitors performance counters and collects statistics using the vendor's tools.
- hardware.py -- monitors diagnostic counters and collects statistics using the vendor's tools. (Unfortunately, diagnostic counter tools such as NeoHost are currently not publicly available or open source, so for NDA reasons we only provide the performance-counter-based code.)
- collie.py -- reads user parameters and calls SA to run the search.
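To make the idea of simulated annealing over a bounded, granular search space concrete, here is a generic textbook-style sketch. It is not the code in anneal.py or space.py: the parameter names, bounds, mutation range, cooling schedule, and the `measure` callable are all hypothetical. Here `measure` stands in for a real traffic run plus monitoring, and a lower score is assumed to mean worse (more anomaly-like) performance.

```python
import math
import random

# Hypothetical search space: parameter name -> (lower bound, upper bound, step).
SPACE = {
    "num_qps": (1, 64, 1),
    "message_bytes": (64, 65536, 64),
}


def clamp_to_grid(name, value):
    """Clamp a value to the parameter's bounds and snap it to its granularity."""
    low, high, step = SPACE[name]
    value = min(max(value, low), high)
    return low + round((value - low) / step) * step


def mutate(point, rng):
    """Move one randomly chosen parameter by a few grid steps."""
    name = rng.choice(sorted(point))
    _, _, step = SPACE[name]
    neighbor = dict(point)
    neighbor[name] = clamp_to_grid(name, point[name] + rng.randint(-4, 4) * step)
    return neighbor


def anneal(measure, start, iters=200, temp=1.0, cooling=0.98, seed=0):
    """Minimise measure(point); return the best point seen and its score."""
    rng = random.Random(seed)
    current, current_score = start, measure(start)
    best, best_score = current, current_score
    for _ in range(iters):
        candidate = mutate(current, rng)
        score = measure(candidate)
        delta = score - current_score
        # Always accept improvements; accept regressions with
        # probability exp(-delta / temp), which shrinks as temp cools.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_score = candidate, score
        if current_score < best_score:
            best, best_score = current, current_score
        temp *= cooling
    return best, best_score
```

In this sketch the temperature controls how readily the search leaves a local minimum early on, and `mutate` decides how far a neighbor can be from the current point; these are the two knobs the anneal.py entry above refers to as temperature and mutation logic.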

Copyright

Collie is provided under the MIT license. See LICENSE for more details.
