Collie is for uncovering RDMA NIC performance anomalies

Related tags

Miscellaneous Collie
Overview

Collie

Collie is for uncovering RDMA NIC performance anomalies.

Overview

Prerequisite

  • Two hosts with RDMA NICs.

    • Connected to the same switch is recommended since Collie currently does not take network(fabric) effect into consideration. But Collie should work once two hosts are connected and RDMA communication enabled.
  • Set up passwordless SSH login (e.g., ssh public/private keys login).

    • Collie currently uses passwordless SSH login to run traffic_engine on different hosts.
  • Google gflags and glog library installed.

    • Collie uses glog for logging and gflags for commandline flags processing.
  • Collie should supports all types of RDMA NICs and drivers that follow IB verbs specification, but currently we've only tested with Mellanox and Broadcom RNICs.

Quick Start

Environment Setup

  • Install prerequisites.
apt-get install -y libgflags-dev libgoogle-glog-dev
  • Setup passwordless SSH login.

Build Traffic Engine

  • Build the traffic engine without GPU and CUDA:
cd traffic_engine && make -j8
  • OR buidl the traffic engine that supports GPU Direct RDMA:
cd traffic_engine && GDR=1 make -j8

NOTICE: GDR is supported only for Tesla or Quadro GPUs according to GPUDirect RDMA.

Please refer to traffic_engine/README for more details.

How to Run: Arguments and Examples

Collie uses JSON configuration file to set parameters for a given RDMA subsystem.

  • Configuration Example: see ./example.json

    • username -- Collie uses SSH to run engines on different hosts, so it needs the username for login.
    • iplist -- the client IP and the server IP, given in a list.
    • logpath -- the logging path for Collie. Users can get detailed results of anomalies and the reproduce scripts for Collie here.
    • engine -- the path for traffic engine.
    • iters -- at most iters tests that Collie would run.
    • bars -- user's expected performance.
      • tx_pfc_bar -- TX (sent) PFC pause duration in us per second.
      • rx_pfc_bar -- RX (received) PFC pause duration in us per second.
      • bps_bar -- bits per second of the entire NIC.
      • pps_bar -- packets per second of the entire NIC.
  • Quick Run Example

python3 search/collie.py --config  ./example.json

Content

Collie consists of two components, the traffic engine and the search algorithms (the monitor is included as a part of search algorithm).

  • Traffic Engine (./traffic_engine)

    Traffic engine is an independent part that implemented in C/C++. Users can use the engine to generate flexible traffic of different patterns. See ./traffic_engine/README for more details and examples of complex traffic patterns.It is recommended to reproduce the anomalies (see Appendix of our NSDI paper) with the tool.

  • Search Algorithms (./search)

    Our simulated-annealing (SA) based algorithm and minimal feature set (MFS) are implemented in python scripts.

    • space.py -- the search space. Space defines the search space (upper/lower bounds, granularity for each parameter). Each Point has several Traffics (e.g., one A->B and one B->A). Each Traffic has two Endhost, one server and one client, as well as many other attributes that describe this traffic (e.g., QP type).
    • engine.py -- given a point, running collie_engine to set up the corresponding traffic described in the Point. If users need to set up traffics in different ways (rather than SSH), please modify the Engine class.
    • anneal.py -- the simulated-annealing based algorithm and minimal feature set algorithm are implemented here. If users need to modify the temperature and mutation logics, please modify here.
    • logger.py -- logging assistant functions for logging results and reproduce scripts.
    • bone.py -- monitor performance counters and collect statistic results based on vendor's tools.
    • hardware.py -- monitor diagnostic counters and collect statistic results based on vendor's tools. (Unfortunately currently diagnostic counters tools like NeoHost is not publicly available and open-sourced, so we only provide performance counter based code for NDA reasons.)
    • collie.py -- read user parameters and call SA to search.

Copyright

Collie is provided under the MIT license. See LICENSE for more details.

You might also like...
Metrics-advisor - Analyze reshaped metrics from TiDB cluster Prometheus and give some advice about anomalies and correlation.

metrics-advisor Analyze reshaped metrics from TiDB cluster Prometheus and give some advice about anomalies and correlation. Team freedeaths mashenjun

DeepLearning Anomalies Detection with Bluetooth Sensor Data

Final Year Project. Constructing models to create offline anomalies detection using Travel Time Data collected from Bluetooth sensors along the route.

Docker image with Uvicorn managed by Gunicorn for high-performance FastAPI web applications in Python 3.6 and above with performance auto-tuning. Optionally with Alpine Linux.
Docker image with Uvicorn managed by Gunicorn for high-performance FastAPI web applications in Python 3.6 and above with performance auto-tuning. Optionally with Alpine Linux.

Supported tags and respective Dockerfile links python3.8, latest (Dockerfile) python3.7, (Dockerfile) python3.6 (Dockerfile) python3.8-slim (Dockerfil

peace-performance (Rust) binding for python. To calculate star ratings and performance points for all osu! gamemodes

peace-performance-python Fast, To calculate star ratings and performance points for all osu! gamemodes peace-performance (Rust) binding for python bas

FastAPI framework, high performance, easy to learn, fast to code, ready for production
FastAPI framework, high performance, easy to learn, fast to code, ready for production

FastAPI framework, high performance, easy to learn, fast to code, ready for production Documentation: https://fastapi.tiangolo.com Source Code: https:

Cython implementation of Toolz: High performance functional utilities

CyToolz Cython implementation of the toolz package, which provides high performance utility functions for iterables, functions, and dictionaries. tool

Intel® Nervana™ reference deep learning framework committed to best performance on all hardware

DISCONTINUATION OF PROJECT. This project will no longer be maintained by Intel. Intel will not provide or guarantee development of or support for this

The lightweight PyTorch wrapper for high-performance AI research. Scale your models, not the boilerplate.
The lightweight PyTorch wrapper for high-performance AI research. Scale your models, not the boilerplate.

The lightweight PyTorch wrapper for high-performance AI research. Scale your models, not the boilerplate. Website • Key Features • How To Use • Docs •

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

Website | Documentation | Tutorials | Installation | Release Notes CatBoost is a machine learning method based on gradient boosting over decision tree

High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.
High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

What is xLearn? xLearn is a high performance, easy-to-use, and scalable machine learning package that contains linear model (LR), factorization machin

ML-Ensemble – high performance ensemble learning
ML-Ensemble – high performance ensemble learning

A Python library for high performance ensemble learning ML-Ensemble combines a Scikit-learn high-level API with a low-level computational graph framew

Simple, realtime visualization of neural network training performance.
Simple, realtime visualization of neural network training performance.

pastalog Simple, realtime visualization server for training neural networks. Use with Lasagne, Keras, Tensorflow, Torch, Theano, and basically everyth

A high performance implementation of HDBSCAN clustering. http://hdbscan.readthedocs.io/en/latest/

HDBSCAN Now a part of scikit-learn-contrib HDBSCAN - Hierarchical Density-Based Spatial Clustering of Applications with Noise. Performs DBSCAN over va

The no-nonsense, minimalist REST and app backend framework for Python developers, with a focus on reliability, correctness, and performance at scale.
The no-nonsense, minimalist REST and app backend framework for Python developers, with a focus on reliability, correctness, and performance at scale.

The Falcon Web Framework Falcon is a reliable, high-performance Python web framework for building large-scale app backends and microservices. It encou

FastAPI framework, high performance, easy to learn, fast to code, ready for production
FastAPI framework, high performance, easy to learn, fast to code, ready for production

FastAPI framework, high performance, easy to learn, fast to code, ready for production Documentation: https://fastapi.tiangolo.com Source Code: https:

Mimesis is a high-performance fake data generator for Python, which provides data for a variety of purposes in a variety of languages.
Mimesis is a high-performance fake data generator for Python, which provides data for a variety of purposes in a variety of languages.

Mimesis - Fake Data Generator Description Mimesis is a high-performance fake data generator for Python, which provides data for a variety of purposes

Mimesis is a high-performance fake data generator for Python, which provides data for a variety of purposes in a variety of languages.
Mimesis is a high-performance fake data generator for Python, which provides data for a variety of purposes in a variety of languages.

Mimesis - Fake Data Generator Description Mimesis is a high-performance fake data generator for Python, which provides data for a variety of purposes

High-performance cross-platform Video Processing Python framework powerpacked with unique trailblazing features :fire:
High-performance cross-platform Video Processing Python framework powerpacked with unique trailblazing features :fire:

Releases | Gears | Documentation | Installation | License VidGear is a High-Performance Video Processing Python Library that provides an easy-to-use,

Backtest 1000s of minute-by-minute trading algorithms for training AI with automated pricing data from: IEX, Tradier and FinViz. Datasets and trading performance automatically published to S3 for building AI training datasets for teaching DNNs how to trade. Runs on Kubernetes and docker-compose. >150 million trading history rows generated from +5000 algorithms. Heads up: Yahoo's Finance API was disabled on 2019-01-03 https://developer.yahoo.com/yql/
Owner
Bytedance Inc.
Bytedance Inc.
DownTime-Score is a Small project aimed to Monitor the performance and the availabillity of a variety of the Vital and Critical Moroccan Web Portals

DownTime-Score DownTime-Score is a Small project aimed to Monitor the performance and the availabillity of a variety of the Vital and Critical Morocca

adnane-tebbaa 5 Apr 30, 2022
Performance data for WASM SIMD instructions.

WASM SIMD Data This repository contains code and data which can be used to generate a JSON file containing information about the WASM SIMD proposal. F

Evan Nemerson 5 Jul 24, 2022
This is an online course where you can learn and master the skill of low-level performance analysis and tuning.

Performance Ninja Class This is an online course where you can learn to find and fix low-level performance issues, for example CPU cache misses and br

Denis Bakhvalov 1.2k Dec 30, 2022
EasyBuild is a software build and installation framework that allows you to manage (scientific) software on High Performance Computing (HPC) systems in an efficient way.

EasyBuild is a software build and installation framework that allows you to manage (scientific) software on High Performance Computing (HPC) systems in an efficient way.

EasyBuild community 87 Dec 27, 2022
Developed a website to analyze and generate report of students based on the curriculum that represents student’s academic performance.

Developed a website to analyze and generate report of students based on the curriculum that represents student’s academic performance. We have developed the system such that, it will automatically parse data onto the database from excel file, which will in return reduce time consumption of analysis of data.

VIJETA CHAVHAN 3 Nov 8, 2022
An interactive tool with which to explore the possible imaging performance of candidate ngEHT architectures.

ngEHTexplorer An interactive tool with which to explore the possible imaging performance of candidate ngEHT architectures. Welcome! ngEHTexplorer is a

Avery Broderick 7 Jan 28, 2022
Terrible sudoku solver with spaghetti code and performance issues

SudokuSolver Terrible sudoku solver with spaghetti code and performance issues - if it's unable to figure out next step it will stop working, it never

Kamil Bizoń 1 Dec 5, 2021
Performance monitoring and testing of OpenStack

Browbeat Browbeat is a performance tuning and analysis tool for OpenStack. Browbeat is free, Open Source software. Analyze and tune your Cloud for opt

cloud-bulldozer 83 Dec 14, 2022
Defichain maxi - Scripts to optimize performance on defichain rewards

defichain_maxi This script is made to optimize your defichain vault rewards by m

kuegi 75 Dec 31, 2022
A Python library for detecting patterns and anomalies in massive datasets using the Matrix Profile

matrixprofile-ts matrixprofile-ts is a Python 2 and 3 library for evaluating time series data using the Matrix Profile algorithms developed by the Keo

Target 696 Dec 26, 2022