FedScale: Benchmarking Model and System Performance of Federated Learning

Overview

FedScale: Benchmarking Model and System Performance of Federated Learning (Paper)

This repository contains scripts and instructions for building FedScale, a diverse set of challenging and realistic benchmark datasets to facilitate scalable, comprehensive, and reproducible federated learning (FL) research. FedScale datasets are large-scale, encompassing a diverse range of important FL tasks, such as image classification, object detection, language modeling, speech recognition, and reinforcement learning. For each dataset, we provide a unified evaluation protocol using realistic data splits and evaluation metrics. To meet the pressing need for reproducing realistic FL at scale, we have also built an efficient evaluation platform, FedScale Automated Runtime (FAR), to simplify and standardize the process of FL experimental setup and model evaluation. Our evaluation platform provides flexible APIs to implement new FL algorithms and include new execution backends with minimal developer effort.

FedScale is open-source with permissive licenses and actively maintained, and we welcome feedback and contributions from the community!


Getting Started

Our install.sh will install the following automatically:

  • Anaconda Package Manager
  • CUDA 10.2

Note: if you prefer different versions of conda and CUDA, please check comments in install.sh for details.

Run the following commands to install FedScale.

git clone https://github.com/SymbioticLab/FedScale
cd FedScale
source install.sh 
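
After installation, a quick sanity check may be helpful. The snippet below is a minimal sketch (not part of install.sh) and assumes the fedscale conda environment created by install.sh is active and includes PyTorch:

# Optional sanity check after running install.sh (illustrative only).
# Assumes the "fedscale" conda environment created by install.sh is activated.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # expect True if CUDA was installed
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))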

Realistic FL Datasets

We are adding more datasets! Please feel free to contribute!

We provide real-world datasets for the federated learning community, and plan to release many more soon! Each dataset comes with its own training, validation, and testing splits. A summary of statistics for the training sets can be found in the tables below, and you can refer to each dataset folder for more details. Because these datasets are very large, we are still uploading them and carefully validating their integration with FAR, and we are actively making each dataset available for FAR experiments.

CV tasks:

| Dataset         | Data Type | # of Clients | # of Samples | Example Task                            |
|-----------------|-----------|--------------|--------------|-----------------------------------------|
| iNature         | Image     | 2,295        | 193K         | Classification                          |
| FMNIST          | Image     | 3,400        | 640K         | Classification                          |
| OpenImage       | Image     | 13,771       | 1.3M         | Classification, Object detection        |
| Google Landmark | Image     | 43,484       | 3.6M         | Classification                          |
| Charades        | Video     | 266          | 10K          | Action recognition                      |
| VLOG            | Video     | 4,900        | 9.6K         | Video classification, Object detection  |

NLP tasks:

| Dataset       | Data Type | # of Clients | # of Samples | Example Task                    |
|---------------|-----------|--------------|--------------|---------------------------------|
| Europarl      | Text      | 27,835       | 1.2M         | Text translation                |
| Blog Corpus   | Text      | 19,320       | 137M         | Word prediction                 |
| Stackoverflow | Text      | 342,477      | 135M         | Word prediction, classification |
| Reddit        | Text      | 1,660,820    | 351M         | Word prediction                 |
| Amazon Review | Text      | 1,822,925    | 166M         | Classification, Word prediction |
| CoQA          | Text      | 7,189        | 114K         | Question Answering              |
| LibriTTS      | Text      | 2,456        | 37K          | Text to speech                  |
| Google Speech | Audio     | 2,618        | 105K         | Speech recognition              |
| Common Voice  | Audio     | 12,976       | 1.1M         | Speech recognition              |

Misc Applications:

| Dataset    | Data Type | # of Clients | # of Samples | Example Task           |
|------------|-----------|--------------|--------------|------------------------|
| Taobao     | Text      | 182,806      | 0.9M         | Recommendation         |
| Go dataset | Text      | 150,333      | 4.9M         | Reinforcement learning |

Note that no details about any participant's age, gender, or location were kept, and random IDs were assigned to each individual. In using these datasets, we strictly obey their licenses, and the datasets provided in this repo should be used for research purposes only.

Please go to ./dataset directory and follow the dataset README for more details.
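
For illustration only, client-level splits in these datasets are typically described by a client_data_mapping CSV (see, for example, the client_data_mapping/train.csv path used in the FAR configs later on this page). The sketch below groups samples by client from such a file; the column names (client_id, sample_path) are assumptions and may differ per dataset, so check each dataset's README for the actual schema.

# Sketch: group samples by client from a client_data_mapping CSV.
# The column names below are assumed for illustration; they may differ per dataset.
import csv
from collections import defaultdict

def load_client_mapping(csv_path):
    clients = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            clients[row["client_id"]].append(row["sample_path"])
    return clients

mapping = load_client_mapping("client_data_mapping/train.csv")
print(f"{len(mapping)} clients, {sum(len(v) for v in mapping.values())} samples in total")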

Run Experiments with FAR

FedScale Automated Runtime (FAR) is an automated and easily deployable evaluation platform that simplifies and standardizes FL experimental setup and model evaluation under practical settings. FAR is based on our Oort project, which has been shown to scale well and can emulate FL training of thousands of clients in each round.

FAR enables developers to benchmark various FL efforts with practical FL data and metrics.

Please go to ./core directory and follow the FAR README to set up FL training scripts.

Repo Structure

Repo Root
|---- dataset     # Realistic datasets in FedScale
|---- core        # Experiment platform of FedScale
    |---- examples  # Examples of new plugins
    |---- evals     # Backend of job submission
    

Notes

Please consider citing our papers if you use the code or data in your research project.

@inproceedings{fedscale-arxiv,
  title={FedScale: Benchmarking Model and System Performance of Federated Learning},
  author={Fan Lai and Yinwei Dai and Xiangfeng Zhu and Mosharaf Chowdhury},
  booktitle={arXiv:2105.11367},
  year={2021}
}

and

@inproceedings{oort-osdi21,
  title={Oort: Efficient Federated Learning via Guided Participant Selection},
  author={Fan Lai and Xiangfeng Zhu and Harsha V. Madhyastha and Mosharaf Chowdhury},
  booktitle={USENIX Symposium on Operating Systems Design and Implementation (OSDI)},
  year={2021}
}

Contact

Fan Lai ([email protected]), Yinwei Dai ([email protected]), Xiangfeng Zhu ([email protected]) and Mosharaf Chowdhury from the University of Michigan.

Comments
  • Android Aggregation and Execution Support

    Android Aggregation and Execution Support

    Why are these changes needed?

    To support android on-device training and testing with MNN backend.

    Related issue number

    N/A

    Checks

    • [x] I've included any doc changes needed for https://fedscale.readthedocs.io/en/latest/
    • [x] I've made sure the following tests are passing.
    • Testing Configurations
      • [x] Dry Run (20 training rounds & 1 evaluation round)
      • [ ] Cifar 10 (20 training rounds & 1 evaluation round)
      • [ ] Femnist (20 training rounds & 1 evaluation round)
    opened by continue-revolution 31
  • [<FedScale component: Core|Dataloader|etc...>]

    []

    What happened + What you expected to happen

    The training process sometimes crashes unexpectedly after the model evaluation (testing on the testing set).

    Versions / Dependencies

    OS: Linux (CloudLab FedScale 240-g5; 1 node)
    FedScale, Python, CUDA, etc.: installed by the "install.sh --cuda" provided by FedScale.

    Reproduction script

    The conf.yml file I used. conf.yml.zip

    Issue Severity

    Low: It annoys or frustrates me.

    bug 
    opened by Yunzhen-Liu 19
  • Reorg repo

    Reorg repo

    1. Reorganize the repo into better structures. We expect no big changes in the near future;
    2. Fix the SLOW installation, which was due to specifying too many random conda channels (not due to installing too many packages). It is much faster now;
    3. Fix some legacy paths in the docs;

    Test method: Passed the dryrun, femnist and cifar quick run over 10+ rounds.

    opened by fanlai0990 16
  • Install dataset-specific dependencies when downloading that dataset

    Install dataset-specific dependencies when downloading that dataset

    This will make initial conda setup faster (basically, delete the dependency from the environment.yml), and people can avoid installing unnecessary packages.

    enhancement 
    opened by mosharaf 14
  • ProcessGroupGloo error when running on more than one worker machine

    ProcessGroupGloo error when running on more than one worker machine

    Hi, I am trying to perform training based on the following config file for the femnist dataset. I can run the experiment using two virtual machines, one as the parameter server and the other as a worker. However, if I increase the number of workers to, say, two, I run into the following error (please see the next comment).

    Any thought on this?

    opened by etesami 11
  • Fix Async

    Fix Async

    Why are these changes needed?

    In the async FedScale example, (i) training stalls after a while; (ii) there is an API mismatch in testing;

    Related issue number

    Closes #148

    Checks

    • [x] I've included any doc changes needed for https://fedscale.readthedocs.io/en/latest/
    • [x] I've made sure the following tests are passing.
    • Testing Configurations
      • [x] Dry Run (20 training rounds & 1 evaluation round)
      • [x] Cifar 10 (20 training rounds & 1 evaluation round)
      • [x] Femnist (20 training rounds & 1 evaluation round)
    opened by fanlai0990 8
  • Inconsistency in the dataset directory

    Inconsistency in the dataset directory

    1. The README says 20 datasets, the download script has 16 or so, and the data directory has 15 or 16.
    2. Naming of the datasets is inconsistent too; e.g., iNature vs iNaturalist.
    3. Using a single letter in the download script is also confusing and short-sighted. There may be more datasets than letters in the alphabet. A convention would be --dataset-name.
    documentation enhancement 
    opened by mosharaf 8
  • Fix async

    Fix async

    Why are these changes needed?

    1. Model testing is somehow missing;
    2. Weird model accuracy over training;

    Related issue number

    Checks

    • [ ] I've included any doc changes needed for https://fedscale.readthedocs.io/en/latest/
    • [ ] I've made sure the following tests are passing.
    • Testing Configurations
      • [ ] Dry Run (20 training rounds & 1 evaluation round)
      • [ ] Cifar 10 (20 training rounds & 1 evaluation round)
      • [ ] Femnist (20 training rounds & 1 evaluation round)
    opened by fanlai0990 7
  • [Core] Async aggregator freezes during evaluation

    [Core] Async aggregator freezes during evaluation

    What happened + What you expected to happen

    Hi fedscale team, I tried to run the async aggregator locally, but no test metrics are generated. The training seems to work fine, but the system freezes without any error at round 50.

    Here are the last events from the aggregator:

    (07-26) 11:38:52 INFO [async_aggregator.py:216] Wall clock: 2519 s, round: 49, Remaining participants: 5, Succeed participants: 10, Training loss: 4.433294297636379
    (07-26) 11:38:55 INFO [async_aggregator.py:279] Client 2602 train on model 46 during 2274-2535.0060934242283
    (07-26) 11:38:55 INFO [aggregator.py:812] Issue EVENT (client_train) to EXECUTOR (1)
    (07-26) 11:38:55 INFO [aggregator.py:812] Issue EVENT (update_model) to EXECUTOR (1)
    (07-26) 11:38:56 INFO [async_aggregator.py:279] Client 2667 train on model 46 during 2319-2539.592434184604
    (07-26) 11:38:56 INFO [aggregator.py:812] Issue EVENT (client_train) to EXECUTOR (1)
    (07-26) 11:38:56 INFO [async_aggregator.py:279] Client 2683 train on model 46 during 2328-2542.9932767611217
    (07-26) 11:38:56 INFO [aggregator.py:812] Issue EVENT (client_train) to EXECUTOR (1)
    (07-26) 11:38:59 INFO [async_aggregator.py:279] Client 2569 train on model 45 during 2253-2605.669321587796
    (07-26) 11:38:59 INFO [aggregator.py:812] Issue EVENT (client_train) to EXECUTOR (2)
    (07-26) 11:38:59 INFO [aggregator.py:812] Issue EVENT (update_model) to EXECUTOR (2)
    (07-26) 11:39:01 INFO [async_aggregator.py:279] Client 2769 train on model 47 during 2385-2680.206093424228
    (07-26) 11:39:01 INFO [aggregator.py:812] Issue EVENT (client_train) to EXECUTOR (2)

    Here's the tail of the executor log:

    ...'moving_loss': 4.510447650271058, 'trained_size': 100, 'success': True, 'utility': 752.3330107862802}
    (07-26) 11:39:00 INFO [client.py:32] Start to train (CLIENT: 2569) ...
    (07-26) 11:39:01 INFO [client.py:68] Training of (CLIENT: 2569) completes, {'clientId': 2569, 'moving_loss': 4.526119316819311, 'trained_size': 100, 'success': True, 'utility': 729.5144631284894}
    (07-26) 11:39:01 INFO [client.py:32] Start to train (CLIENT: 2769) ...
    (07-26) 11:39:02 INFO [client.py:68] Training of (CLIENT: 2769) completes, {'clientId': 2769, 'moving_loss': 4.5834765700435645, 'trained_size': 100, 'success': True, 'utility': 692.4210048353054}
    (07-26) 11:39:04 INFO [client.py:68] Training of (CLIENT: 2667) completes, {'clientId': 2667, 'moving_loss': 4.169509475803674, 'trained_size': 100, 'success': True, 'utility': 556.3458848955673}

    Versions / Dependencies

    Latest

    Reproduction script

    Here's my config for the async_aggregator.py example:

    
    # ip address of the parameter server (need 1 GPU process)
    ps_ip: localhost
    
    # ip address of each worker:# of available gpus process on each gpu in this node
    # Note that if we collocate ps and worker on same GPU, then we need to decrease this number of available processes on that GPU by 1
    # E.g., master node has 4 available processes, then 1 for the ps, and worker should be set to: worker:3
    worker_ips:
        - localhost:[2]
    
    exp_path: $FEDSCALE_HOME/fedscale/core
    
    # Entry function of executor and aggregator under $exp_path
    executor_entry: ../../examples/async_fl/async_executor.py
    
    aggregator_entry: ../../examples/async_fl/async_aggregator.py
    
    auth:
        ssh_user: ""
        ssh_private_key: ~/.ssh/id_rsa
    
    # cmd to run before we can indeed run FAR (in order)
    setup_commands:
        - source $HOME/anaconda3/bin/activate fedscale
    
    # ========== Additional job configuration ==========
    # Default parameters are specified in config_parser.py, wherein more description of the parameter can be found
    
    job_conf:
        - job_name: asyncfl                   # Generate logs under this folder: log_path/job_name/time_stamp
        - log_path: $FEDSCALE_HOME/benchmark # Path of log files
        - num_participants: 800                      # Number of participants per round, we use K=100 in our paper, large K will be much slower
        - data_set: femnist                     # Dataset: openImg, google_speech, stackoverflow
        - data_dir: $FEDSCALE_HOME/benchmark/dataset/data/femnist    # Path of the dataset
        - data_map_file: $FEDSCALE_HOME/benchmark/dataset/data/femnist/client_data_mapping/train.csv              # Allocation of data to each client, turn to iid setting if not provided
        - device_conf_file: $FEDSCALE_HOME/benchmark/dataset/data/device_info/client_device_capacity     # Path of the client trace
        - device_avail_file: $FEDSCALE_HOME/benchmark/dataset/data/device_info/client_behave_trace
        - model: shufflenet_v2_x2_0                            # Models: e.g., shufflenet_v2_x2_0, mobilenet_v2, resnet34, albert-base-v2
        - gradient_policy: yogi                 # {"fed-yogi", "fed-prox", "fed-avg"}, "fed-avg" by default
        - eval_interval: 5                     # How many rounds to run a testing on the testing set
        - rounds: 500                          # Number of rounds to run this training. We use 1000 in our paper, while it may converge w/ ~400 rounds
        - filter_less: 21                       # Remove clients w/ less than 21 samples
        - num_loaders: 2
        - yogi_eta: 3e-3
        - yogi_tau: 1e-8
        - local_steps: 5
        - learning_rate: 0.05
        - batch_size: 20
        - test_bsz: 20
        - malicious_factor: 4
        - use_cuda: False
        - decay_round: 50
        - overcommitment: 1.0
        - async_buffer: 10
        - checkin_period: 50
        - arrival_interval: 3
    

    Issue Severity

    No response

    bug 
    opened by ewenw 7
  • Support k8s for job submission and management

    Support k8s for job submission and management

    Why are these changes needed?

    Support using k8s to manage job lifecycles, including job submission, initialization, termination and clean-up.

    TODO:

    • [x] add README for k8s job management tutorial
    1. A change in docker/driver.py uses the k8s client APIs for job management; the driver now supports "default", "docker", and "k8s" modes (see the sketch after this list).
    2. Add a YAML generator for automating the generation of k8s container configs.
    3. Add new example k8s configs in benchmark.
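
    As a hypothetical sketch only (the actual docker/driver.py changes in this PR may look different), managing a job pod with the official Kubernetes Python client could look like this:

    # Hypothetical sketch of job submission/termination with the Kubernetes Python client.
    from kubernetes import client, config

    config.load_kube_config()  # use the local kubeconfig
    core = client.CoreV1Api()

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="fedscale-aggregator", labels={"job": "demo"}),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[client.V1Container(
                name="aggregator",
                image="fedscale/fedscale:latest",     # placeholder image name
                command=["python", "aggregator.py"],  # placeholder entrypoint
            )],
        ),
    )

    core.create_namespaced_pod(namespace="default", body=pod)  # submission
    # ... later: termination and clean-up
    core.delete_namespaced_pod(name="fedscale-aggregator", namespace="default")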

    Related issue number

    Checks

    • [x] I've included any doc changes needed for https://fedscale.readthedocs.io/en/latest/
    • [x] I've made sure the following tests are passing.
    • Testing Configurations
      • k8s
        • [x] Dry Run (20 training rounds & 1 evaluation round)
        • [x] Cifar 10 (20 training rounds & 1 evaluation round)
        • [x] Femnist (20 training rounds & 1 evaluation round)
      • Regression 1: docker
        • [x] Cifar 10 (20 training rounds & 1 evaluation round)
        • [x] Femnist (20 training rounds & 1 evaluation round)
      • Regression 2: default
        • [x] Cifar 10 (20 training rounds & 1 evaluation round)
        • [x] Femnist (20 training rounds & 1 evaluation round)
    opened by IKACE 6
  • Running FEMNIST tutorial on local machine gives a few warnings.

    Running FEMNIST tutorial on local machine gives a few warnings.

    1. [W ParallelNative.cpp:229] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
    2. /Users/mosharaf/opt/anaconda3/envs/fedscale/lib/python3.7/site-packages/torchvision/transforms/functional_pil.py:42: DeprecationWarning: FLIP_LEFT_RIGHT is deprecated and will be removed in Pillow 10 (2023-07-01). Use Transpose.FLIP_LEFT_RIGHT instead. return img.transpose(Image.FLIP_LEFT_RIGHT)

    Training seems to continue.
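
    For the first warning, one possible mitigation (an untested assumption, not an official fix) is to pin the intraop thread count before any DataLoader workers or other parallel work start:

    # Possible mitigation for warning 1 (assumption, not verified on this tutorial):
    # set the thread count before any parallel work begins.
    import torch

    torch.set_num_threads(1)  # must run before DataLoaders/parallel work start
    # ... then build datasets/DataLoaders and start training as usual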

    bug wontfix 
    opened by mosharaf 6
  • Dataloader support for TF

    Dataloader support for TF

    Description

    Hi Team, there is admittedly some overlap between this issue and a previous one; however, I thought I would make a new one since the other is fairly old. I am looking for some data loader support in FedScale for TensorFlow. It seems the data classes as they stand are written with PyTorch in mind, and I was wondering if anyone has any experience using a TensorFlow dataset, particularly using .tfrecord files with a known schema.
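
    For reference, a minimal tf.data sketch for reading .tfrecord files with a known schema might look like the following; the feature names, types, and file name are placeholders for illustration, not an existing FedScale API.

    # Minimal tf.data sketch for .tfrecord files with a known schema (illustrative only).
    import tensorflow as tf

    feature_description = {
        "image": tf.io.FixedLenFeature([], tf.string),  # placeholder feature names/types
        "label": tf.io.FixedLenFeature([], tf.int64),
    }

    def parse_example(serialized):
        example = tf.io.parse_single_example(serialized, feature_description)
        image = tf.io.decode_jpeg(example["image"], channels=3)
        return image, example["label"]

    dataset = (tf.data.TFRecordDataset(["client_0.tfrecord"])  # placeholder file name
               .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
               .batch(32)
               .prefetch(tf.data.AUTOTUNE))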

    Use case

    I work in industry and am looking to use FedScale to run a simulation of a specific federated learning model on a specific set of hardware. All of our models are written using tf and keras.

    enhancement 
    opened by kashprime 3
  • [Async simulation] Implementation idea for task scheduling

    [Async simulation] Implementation idea for task scheduling

    Description

    Hi FedScale team, here's my suggestion on how to implement the async simulation mode using device traces without needing a constant arrival parameter (related to #162):

    sort device traces by start time
    queue = initialize min priority queue
    while tasks_issued < buffer_size:
        event_time, event_type, client_id = queue.get()
        if event_type == 'start':
            current_concurrency += 1
            if current_concurrency < MAX_CONCURRENCY:
                issue_task(event_time)
        else:
            current_concurrency -= 1
            if current_concurrency == MAX_CONCURRENCY - 1:
                issue_task(event_time)

    issue_task(event_time):
        client, trace_start, trace_end = sample next available client at event_time
        add client task to individual executor's queue
        queue.put((trace_start, 'start', client))
        queue.put((trace_end, 'end', client))
    

    This works well in my implementation, but might be harder to integrate into fedscale, hence I'm creating an issue to document it. Let me know if you have any questions / concerns.

    Below is the python code for this scheduling algorithm, feel free to run it and validate the output:

    import random
    from queue import PriorityQueue
    
    id = 0
    
    
    def generate_start_end(time):
        # next available client
        global id
        start_time = time + random.randint(0, 1)
        duration = random.randint(1, 3)
        id += 1
        return start_time, start_time + duration, id
    
    
    min_pq = PriorityQueue()
    total_tasks = 1
    
    TOTAL_TASKS = 10
    MAX_CONCURRENCY = 1
    current_concurrency = 0
    start_times = {}
    
    
    def new_task(event_time):
        new_start, new_end, client_id = generate_start_end(event_time)
        min_pq.put((new_start, 'start', client_id))
        min_pq.put((new_end, 'end', client_id))
        start_times[client_id] = new_start
    
    
    new_task(0)
    while not min_pq.empty():
        event_time, event_type, client_id = min_pq.get()
        if event_type == 'start':
            current_concurrency += 1
            if total_tasks < TOTAL_TASKS and current_concurrency < MAX_CONCURRENCY:
                new_task(event_time)
                total_tasks += 1
        else:
            current_concurrency -= 1
            if total_tasks < TOTAL_TASKS and current_concurrency == MAX_CONCURRENCY - 1:
                new_task(event_time)
                total_tasks += 1
            print(f"processing event starting at {start_times[client_id]} and ending at {event_time}")
    

    Use case

    No response

    enhancement 
    opened by ewenw 1
  • Redis Support for FedScale

    Redis Support for FedScale

    Why are these changes needed?

    To integrate Redis into the aggregator for saving aggregation data.
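
    As a rough illustration only (not the actual code in this PR), saving and restoring aggregation state in Redis with the redis-py client could look like this; the key name and pickle-based serialization are assumptions:

    # Rough illustration of persisting aggregation state with redis-py (not this PR's code).
    import pickle
    import redis
    import numpy as np

    r = redis.Redis(host="localhost", port=6379)

    # Save the aggregated model weights after a round.
    weights = {"layer1": np.zeros((10, 10)), "layer2": np.zeros(10)}  # dummy weights
    r.set("fedscale:agg_weights:round_1", pickle.dumps(weights))

    # Restore them later (e.g., on aggregator restart).
    restored = pickle.loads(r.get("fedscale:agg_weights:round_1"))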

    Related issue number

    N/A

    Checks

    • [x] I've included any doc changes needed for https://fedscale.readthedocs.io/en/latest/
    • [x] I've made sure the following tests are passing.
    • Testing Configurations
      • [x] Dry Run (20 training rounds & 1 evaluation round)
      • [x] Cifar 10 (20 training rounds & 1 evaluation round)
      • [x] Femnist (20 training rounds & 1 evaluation round)

    Note:

    1. All tests are run on cpu.
    2. Encountered the following KeyError bug on the same line in Femnist once with Redis, and also once without Redis (i.e., the original code). Sample error output:
    File "/users/xuyehe/FedScale-rd/fedscale/core/aggregation/aggregator.py", line 386, in client_completion_handler
        duration=self.virtual_client_clock[results['clientId']]['computation'] +
    KeyError: 665
    
    opened by xuyehe 2
  • Improve documentation in various components

    Improve documentation in various components

    Description

    Some code comments and docstrings should be added, especially for resource_manager.py, data_loader.py, client_manager.py, etc. Also, a diagram of what each component does would be helpful for people new to the codebase.

    documentation 
    opened by ewenw 1
  • [Dataloader] Fix Missing Model Configuration

    [Dataloader] Fix Missing Model Configuration

    What happened + What you expected to happen

    The Albert config is missing, leading to model failures. We should avoid providing such configurations ourselves; this should be done automatically.
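
    One way this could be automated (an assumption about the fix, not necessarily what was implemented) is to fetch the configuration from the Hugging Face hub by model name instead of bundling it in the repo:

    # Sketch: pull the Albert config/model by name instead of shipping the config file.
    # Assumes network access to the Hugging Face hub; not necessarily the actual fix.
    from transformers import AutoConfig, AutoModelForMaskedLM

    config = AutoConfig.from_pretrained("albert-base-v2")
    model = AutoModelForMaskedLM.from_config(config)  # randomly initialized from config
    print(config.hidden_size, config.num_hidden_layers)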

    Versions / Dependencies

    FedScale python folder.

    Reproduction script

    Try to submit nlp configs.

    Issue Severity

    Medium: It is a significant difficulty but I can work around it.

    bug 
    opened by fanlai0990 0
  • Issues on FedScale and Oort: (1) it widely promotes so-called advantages that are not based on the new version of FedML (10 months outdated as of today); (2) the evaluation of the old FedML version (Oct 2021) is not based on facts and overlaps with a published paper; (3) unrealistic overlap between system efficiency and data distribution; (4) issues with numerical optimization; (5) dual submission?

    Issues on FedScale and Oort: (1) it widely promotes so-called advantages that are not based on the new version of FedML (10 months outdated as of today); (2) the evaluation of the old FedML version (Oct 2021) is not based on facts and overlaps with a published paper; (3) unrealistic overlap between system efficiency and data distribution; (4) issues with numerical optimization; (5) dual submission?

    Dear Authors of FedScale,

    I didn't want to comment too much on FedScale because I thought all the experts in the field knew the truth. But you have promoted your outdated paper for a long time without basing it on facts, and your co-author (e.g., Jiachen) keeps publicly claiming inaccurate advantages of FedScale over FedML on the Internet, which deeply harms and disrespects FedML's past academic efforts and current industrialization efforts. Therefore, it is necessary for me to state some facts here and let people know the truth.

    Summary

    The FedScale paper evaluated an old version of FedML at the time of the ICML submission (3 months outdated at submission), and still did not mention this at camera-ready (6 months outdated) or during the conference presentation (about 10 months outdated). Comparing against an old version of FedML, and in that situation widely publicizing so-called advantages that are based neither on facts nor on the new version, has brought a lot of harm and loss to FedML. Aside from the harm caused by publishing and publicizing the old-version comparison, even the comparison with the old version is academically inaccurate and wrong: of the 4 core arguments in the paper, 3 are not in line with the facts, and the fourth highly overlaps with existing papers. In addition, the ICML paper substantially overlaps with a paper published in the proceedings of another workshop. Based on these issues, we think this paper violates the dual-submission policy and does not meet the criteria for publication. We also hope the FedScale team can update the paper (https://arxiv.org/pdf/2105.11367v5.pdf) and media articles (Zhihu, etc.) in a timely manner, clarifying the above issues, avoiding misunderstandings among users and peers in the Chinese and English communities, and ending unnecessary reputation damage.

    Issue 1: FedScale widely promotes so-called advantages that are not based on the new version (10 months outdated as of today)

    1. Your ICML 2022 paper uses a version of FedML from 3.5 months before the submission deadline and 6 months before the review/rebuttal deadline (review open date). Reviewers should notice this issue. I believe the rebuttal date is well after our advanced-feature release, not to mention that you only compare with part of our code from an old version.
    2. You promote your ICML 2022 paper on social media (e.g., Zhihu) without mentioning the version date and ID. The earliest date of this promotion is already 6 months after the FedML version you compare against. At that time, FedML had already released a new version with many advanced features. Your improper claims in the promotion raise a great deal of misunderstanding and concern about the FedML company, which seriously harms our reputation (friends and investors came to ask about the issue).
    3. The date you presented your ICML 2022 paper at the main conference was already 10 months after the version you compared against. You promoted it on social media during that week. Unfortunately, you still did not address the version-ID issue. This further harms FedML's reputation (we got concerned messages from friends and users that week). The fact is that we had already released a lot of features. Even so, the FedML team kept silent and still believed people could tell the truth.
    4. Your paper didn't mention the version in the main text. Until today, the version ID you mention in the appendix is a version that is already 10 months outdated.

    https://arxiv.org/pdf/2105.11367v5.pdf - Table 1's comment on FedML is completely wrong and outdated.

    My comments: It is surprising to many engineers and researchers at USC and FedML that you overclaim that your platform is a stronger "Scalable Platform". Please check our platform at https://fedml.ai.

    FedML AI platform releases the world’s federated learning open platform on the public cloud with an in-depth introduction of products and technologies! https://medium.com/@FedML/fedml-ai-platform-releases-the-worlds-federated-learning-open-platform-on-public-cloud-with-an-8024e68a70b6

    Issue 2: The evaluation of the old FedML version (Oct 2021) is not based on facts. In addition, the core contribution of the FedScale ICML paper (system and data heterogeneity) overlaps with a published paper, Oort. ICML reviewers should be aware of these issues.

    Quote from https://arxiv.org/pdf/2105.11367v5.pdf: "First, they are limited in the versatility of data for various real-world FL applications. Indeed, even though they may have quite a few datasets and FL training tasks (e.g., LEAF (Caldas et al., 2019)), their datasets often contain synthetically generated partitions derived from conventional datasets (e.g., CIFAR) and do not represent realistic characteristics. This is because these benchmarks are mostly borrowed from traditional ML benchmarks (e.g., MLPerf (Mattson et al., 2020)) or designed for simulated FL environments like TensorFlow Federated (TFF) (tff) or PySyft (pys). Second, existing benchmarks often overlook system speed, connectivity, and availability of the clients (e.g., FedML (He et al., 2020) and Flower (Beutel et al., 2021)). This discourages FL efforts from considering system efficiency and leads to overly optimistic statistical performance (§2). Third, their datasets are primarily small-scale, because their experimental environments are unable to emulate large-scale FL deployments. While real FL often involves thousands of participants in each training round (Kairouz et al., 2021b; Yang et al., 2018), most existing benchmarking platforms can merely support the training of tens of participants per round. Finally, most of them lack user-friendly APIs for automated integration, resulting in great engineering efforts for benchmarking at scale"

    These four core arguments are not based on facts:

    1. The 1st argument (about datasets) is wrong and not in line with the facts and existing works. As early as 2020, we already supported a large number of datasets that conform to the habits of the ICML/NeurIPS/ICLR community: https://doc.fedml.ai/simulation/user_guide/datasets-and-models.html, and we also support real datasets (FedNLP, FedGraphNN, FedCV, FedIoT) covering many applications: https://github.com/FedML-AI/FedML/tree/master/python/app. The timelines for these works all predate October 2021. These works have been published in the workshops and main tracks of major conferences. It is important to note that these works were published 6 months earlier than the old version of FedML mentioned in the ICML paper, and generally more than half a year earlier than the ICML 2022 submission deadline.

    FedGraphNN: A Federated Learning System and Benchmark for Graph Neural Networks. https://arxiv.org/abs/2104.07145 (arXiv time: 4 Apr 2021)
    FedNLP: Benchmarking Federated Learning Methods for Natural Language Processing Tasks. https://arxiv.org/abs/2104.08815 (arXiv time: 8 Apr 2021)
    FedCV: https://arxiv.org/abs/2111.11066 (arXiv time: 22 Nov 2021)
    FedIoT: https://arxiv.org/abs/2106.07976v1 (arXiv time: 15 Jun 2021)

    2. The 3rd argument (unable to emulate large-scale FL deployments) is also not based on facts:

    (1) https://arxiv.org/pdf/2105.11367v5.pdf - "FedML can only support 30 participants because of its suboptimal scalability, which under-reports the FL performance that the algorithm can indeed achieve"

    My comments: This doesn't match the facts. From its oldest version, FedML has always supported training an arbitrary number of clients by using single-process (standalone, in the old version) sequential training. In addition, our users can run parallel experiments (one GPU per job/run) with multiple GPUs to accelerate hyperparameter tuning, which avoids communication cost at the emulator level. Our latest version supports sequential training across multiple nodes via an efficient scheduler. Therefore, such a comment does not match the facts.

    (3) https://arxiv.org/pdf/2105.11367v5.pdf - "Third, their datasets are primarily small-scale, because their experimental environments are unable to emulate large-scale FL deployments."

    My comments: this is also misleading to paper readers. In our old version, we already supported many large-scale datasets for researchers in the ML community: https://doc.fedml.ai/simulation/user_guide/datasets-and-models.html. They are widely used by many ICML/NeurIPS/ICLR papers. Recently, our latest version even supports many realistic and large-scale datasets in CV, NLP, healthcare, graph neural networks, and IoT. See some links at: https://github.com/FedML-AI/FedML/tree/master/python/app https://github.com/FedML-AI/FedML/tree/master/iot Each one is supported by top-tier conference papers. For example, the NLP one (https://arxiv.org/abs/2104.08815) is connected to Huggingface and accepted to NAACL 2022.

    3. The 4th argument (API not friendly) also does not respect FedML's work. We released FedNLP, FedGraphNN, FedCV, FedIoT, and other application frameworks as early as 1.5 years ago (https://open.fedml.ai/platform/appStore), all based on the FedML core framework; verification across so many applications has long proven its convenience. Regardless of these differences, the best way to prove "convenience" is user data: you can look at our GitHub stars, paper citations, platform user numbers, etc.

    We also put together a brief introduction to the APIs so readers can judge which is more convenient: https://medium.com/@FedML/fedml-releases-simple-and-flexible-apis-boosting-innovation-in-algorithm-and-system-optimization-b21c2f4b88c8

    4. Regarding the 2nd key argument, we think it has already been mentioned in another paper, Oort (highly overlapping, please compare the two papers; Oort is here: https://arxiv.org/abs/2010.06081), which does not fit the spirit of ICML, which requires independent contribution and novelty beyond already-published papers. Specifically, system heterogeneity (system speed, connectivity, and availability) is described in Section 2.2 of the original Oort paper and is also clearly mentioned in Section 7.1 of its experimental section. System speed, connectivity, and availability are the same things as in Section 3.2 of the original FedScale article. Oort says: "We simulate real-world heterogeneous client system performance and data in both training and testing evaluations using an open-source FL benchmark [48]: (1) Heterogeneous device runtimes (speed) of different models, network throughput/connectivity (connectivity), device model, and availability are emulated using data from AI Benchmark [1] and Network Measurements on mobiles [6]."

    Issue 3: Issues in FedScale and Oort: unrealistic overlap between system speed, data distribution, and client device availability

    Quote from https://arxiv.org/pdf/2105.11367v5.pdf: "Second, existing benchmarks often overlook system speed, connectivity, and availability of the clients (e.g., FedML (He et al., 2020) and Flower (Beutel et al., 2021)). This discourages FL efforts from considering system efficiency and leads to overly optimistic statistical performance (§2)."

    My comments: this is misleading. My question is: how can you achieve a realistic overlap of system speed, data distribution statistics, and client device availability? You get them from three independent databases, which does not match practice. Then you build Oort based on this unrealistic assumption. The FedScale team never clearly answers this question. This benchmark definitely brings issues in numerical-optimization theory. We ML and systems researchers do not want this misleading benchmark to misguide research in the ML area.

    Moreover, such a comment ("existing benchmarks often overlook system speed, connectivity, and availability of the clients") is extremely disrespectful to the work of an industrialized team whose expertise goes beyond this. Distributed systems are the hardcore area that the FedML engineering team focuses on. Maybe your team only read part of the materials (the white paper? or part of our source code?). Please refer to a comprehensive list of materials here:

    FedML Homepage: https://fedml.ai/
    FedML Open Source: https://github.com/FedML-AI
    FedML Platform: https://open.fedml.ai
    FedML Use Cases: https://open.fedml.ai/platform/appStore
    FedML Documentation: https://doc.fedml.ai
    FedML Research: https://fedml.ai/research-papers/ (50+ papers covering many aspects including security and privacy)

    Issue 4: FedScale only supports running the same number of iterations locally; however, many ICML/NeurIPS/ICLR papers (almost all) work with the same number of local epochs. This differs significantly from the rest of the ML community.

    https://github.com/SymbioticLab/FedScale/blob/51cc4a1e0ab553cd79ecb59af211008788f1af39/fedscale/core/execution/client.py#L50
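
    To illustrate the difference with a generic sketch (not FedScale's actual client code): fixed local steps cap the number of minibatches per round regardless of a client's data size, while epoch-based training iterates over each client's full local dataset a fixed number of times.

    # Generic illustration of the two local-training conventions; not FedScale code.
    def train_fixed_steps(model_update, batches, local_steps=5):
        # Same number of minibatches for every client, regardless of local data size.
        for step, batch in enumerate(batches):
            if step >= local_steps:
                break
            model_update(batch)

    def train_fixed_epochs(model_update, batches, local_epochs=1):
        # Each client's full local dataset is used local_epochs times,
        # so clients with more data perform more updates per round.
        for _ in range(local_epochs):
            for batch in batches:
                model_update(batch)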

    Issue 5: We suspect that the FedScale ICML paper violates the dual-submission policy of the ML community

    The FedScale ICML version (ICML proceedings: https://proceedings.mlr.press/v162/lai22a/lai22a.pdf) overlaps substantially with a workshop paper that has proceedings (https://dl.acm.org/doi/10.1145/3477114.3488760). The workshop date is October 2021, at least 3 months earlier than the ICML 2022 submission deadline. Normally, ICML/NeurIPS/ICLR do not allow submissions that are already published elsewhere in proceedings with the same title, authors, and core contribution.

    (1) These two papers have the same title, "FedScale: Benchmarking Model and System Performance of Federated Learning at Scale". (2) These two papers have 5 overlapping authors. Workshop authors: Fan Lai, Yinwei Dai, Xiangfeng Zhu, Harsha V. Madhyastha, Mosharaf Chowdhury. ICML authors: Fan Lai, Yinwei Dai, Sanjay S. Singapuram, Jiachen Liu, Xiangfeng Zhu, Harsha V. Madhyastha, Mosharaf Chowdhury (two authors are added in the ICML version).

    (3) Substantial overlap in contributions and core arguments. See the two key paragraphs in these two papers.

    Note: these two papers make the same arguments with the same wording.

    ICML policy: https://icml.cc/Conferences/2022/StyleAuthorInstructions

    As mentioned in Issue 2, the FedScale ICML 2022 paper also overlaps in a key contribution with another published paper at OSDI 2021:

    The 2nd key argument has already been mentioned in another paper, Oort (highly overlapping, please compare the two papers; Oort is here: https://arxiv.org/abs/2010.06081), which does not fit the spirit of ICML, which requires independent contribution and novelty beyond already-published papers.

    Specifically, system heterogeneity (system speed, connectivity, and availability) is described in Section 2.2 of the original Oort paper and is also clearly mentioned in Section 7.1 of its experimental section. System speed, connectivity, and availability are the same things as in Section 3.2 of the original FedScale article. Oort says: "We simulate real-world heterogeneous client system performance and data in both training and testing evaluations using an open-source FL benchmark [48]: (1) Heterogeneous device runtimes (speed) of different models, network throughput/connectivity (connectivity), device model, and availability are emulated using data from AI Benchmark [1] and Network Measurements on mobiles [6]."

    Versions / Dependencies

    Code: https://github.com/SymbioticLab/FedScale (51cc4a1)

    Paper: https://arxiv.org/pdf/2105.11367v5.pdf (v5)

    help wanted 
    opened by chaoyanghe 5
Releases (v0.5)
  • v0.5 (Jul 18, 2022)

    FedScale 0.5 is the first major release of FedScale after years of development.

    Major Features

    • Distributed/standalone fast-forward FL evaluations
    • 21 realistic FL datasets
    • 70+ lightweight FL models
    • PyTorch and TensorFlow support
    • GPU, x86, and ARM hardware backend support
    • Real-world client system traces
    • Synchronous and asynchronous training with straggler mitigation support
    • Homepage, API documentation

    Credits

    FedScale 0.5 was the work of a large set of new contributors from Michigan and outside. Thanks also to all the FedScale users who have suggested new features or reported bugs.
