FedNLP: A Benchmarking Framework for Federated Learning in Natural Language Processing

Overview

FedNLP is a research-oriented benchmarking framework for advancing federated learning (FL) in natural language processing (NLP). It uses the FedML repository as a git submodule. In other words, FedNLP focuses only on advanced models and datasets, while FedML supports various federated optimizers (e.g., FedAvg) and platforms (Distributed Computing, IoT/Mobile, Standalone).

The figure below shows the overall structure of FedNLP.

Installation

After cloning this repository with git clone, please run the following commands to install our dependencies.

conda create -n fednlp python=3.7
conda activate fednlp
# conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.1 -c pytorch -n fednlp
pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt 
cd FedML; git submodule init; git submodule update; cd ../;
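
As a quick sanity check after installation (a minimal sketch, not an official FedNLP script), you can verify that the pinned PyTorch build can see your GPU:

# check_env.py -- hypothetical helper, not shipped with FedNLP.
import torch

print("torch version:", torch.__version__)          # expect 1.6.0+cu101 with the pip command above
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))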

Code Structure of FedNLP

  • FedML: a soft repository link generated using git submodule add https://github.com/FedML-AI/FedML.

  • data: provides data-downloading scripts and raw data loaders that process the original data and generate h5py files. In addition, data/advanced_partition offers practical partition functions to split data across clients (a conceptual sketch of such a partition is given right after this list).

Note that FedML/data also contains datasets for research, but those datasets are used for evaluating federated optimizers (e.g., FedAvg) and platforms; FedNLP supports more advanced datasets and models.

  • data_preprocessing: preprocessors, examples and utility functions for each task formulation.

  • data_manager: the data manager is responsible for loading the dataset and partition data from the h5py files and for driving the preprocessor to transform raw data into features.

  • model: advanced NLP models. You can define your own models in this folder.

  • trainer: please define your own trainer.py by inheriting the base class in FedML/fedml-core/trainer/fedavg_trainer.py. Some tasks can share the same trainer.

  • experiments/distributed:

    1. experiments is the entry point for training. It contains experiments for different platforms; we start with the distributed setting.
    2. Every experiment integrates FIVE building blocks: FedML (federated optimizers), data_manager, data_preprocessing, model, and trainer.
    3. To develop new experiments, please refer to the code at experiments/distributed/transformer_exps/fedavg_main_tc.py.
  • experiments/centralized:

    1. This is used to get the reference model accuracy for FL.
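
As referenced in the data bullet above, the snippet below is a minimal conceptual sketch of a label-skew (Dirichlet) partition, the kind of split behind partition names such as niid_label_clients=100.0_alpha=5.0 used later in this README. The function name and signature are illustrative only; the actual utilities live in data/advanced_partition and their API may differ.

# dirichlet_partition.py -- illustrative sketch, not the data/advanced_partition API.
import numpy as np

def dirichlet_label_partition(labels, num_clients, alpha, seed=0):
    """Assign sample indices to clients with per-class proportions drawn from
    Dirichlet(alpha); a smaller alpha yields a more skewed (more non-IID) split."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indexes = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx_c = rng.permutation(np.where(labels == c)[0])
        # Proportion of class c assigned to each client.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cut_points = (np.cumsum(proportions) * len(idx_c)).astype(int)[:-1]
        for client_id, shard in enumerate(np.split(idx_c, cut_points)):
            client_indexes[client_id].extend(shard.tolist())
    return {cid: sorted(idx) for cid, idx in enumerate(client_indexes)}

# e.g. dirichlet_label_partition(train_labels, num_clients=100, alpha=5.0)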

Data Preparation

To set up the data needed for federated learning, we provide processed data files and partition files that users can conveniently download for training.

If you want to set up your own dataset, refer to the scripts under data/raw_data_loader. We already offer a number of examples; just follow one of them to prepare your own data! (A rough sketch of the resulting data-file layout follows.)
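
As a rough illustration of the file layout only (the authoritative schema is whatever the raw_data_loader script you follow writes), the sketch below stores each text sample under an integer-keyed "X" group, mirroring the read pattern data_file["X"][str(idx)][()].decode("utf-8") visible in a traceback in the Comments section below; the "Y" label group used here is an assumption.

# make_my_data_h5.py -- hypothetical example; follow data/raw_data_loader for the real schema.
import h5py

texts = ["first document ...", "second document ..."]
labels = ["label_a", "label_b"]

with h5py.File("data/data_files/mydataset_data.h5", "w") as f:
    x_group = f.create_group("X")
    y_group = f.create_group("Y")   # label group name is an assumption
    for idx, (text, label) in enumerate(zip(texts, labels)):
        x_group[str(idx)] = text.encode("utf-8")
        y_group[str(idx)] = label.encode("utf-8")

A matching *_partition.h5 file (which appears to map each partition-method name to per-client index lists) is also required, so following one of the existing loader scripts end-to-end is the safest route.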

Download our processed files from Amazon S3.

Download the files for each dataset using these two scripts: data/download_data.sh and data/download_partition.sh.

We provide two files for each dataset: data files are saved in data_files, and partition files are in the directory partition_files. You need to put the downloaded data_files and partition_files in the data folder here. Simply put, we will have data/data_files/*_data.h5 and data/partition_files/*_partition.h5 in the end.
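
If a run later fails with a KeyError such as "Unable to open object (object '...' doesn't exist)" (see the Comments section below), the --partition_method string probably does not match any group in the partition file. A quick, unofficial way to check which partition methods a downloaded file contains (assuming the partition methods are the file's top-level keys):

# list_partitions.py -- minimal sketch, not part of the repo.
import h5py

with h5py.File("data/partition_files/20news_partition.h5", "r") as f:
    print(list(f.keys()))   # partition_method names available in this file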

Experiments for Centralized Learning (Sanity Check)

Transformer-based models

First, please use this command to test the dependencies.

# Test the environment for the fed_transformers
python -m model.fed_transformers.test

Run a text classification model with DistilBERT:

DATA_NAME=20news
CUDA_VISIBLE_DEVICES=1 python -m experiments.centralized.transformer_exps.main_tc \
    --dataset ${DATA_NAME} \
    --data_file ~/fednlp_data/data_files/${DATA_NAME}_data.h5 \
    --partition_file ~/fednlp_data/partition_files/${DATA_NAME}_partition.h5 \
    --partition_method niid_label_clients=100.0_alpha=5.0 \
    --model_type distilbert \
    --model_name distilbert-base-uncased  \
    --do_lower_case True \
    --train_batch_size 32 \
    --eval_batch_size 8 \
    --max_seq_length 256 \
    --learning_rate 5e-5 \
    --epochs 20 \
    --evaluate_during_training_steps 500 \
    --output_dir /tmp/${DATA_NAME}_fed/ \
    --n_gpu 1

Experiments for Federated Learning

We provide scripts for running federated learning experiments. Once you have finished setting up the environment, you can run the scripts run_text_classification.sh, run_seq_tagging.sh, and run_span_extraction.sh under experiments/distributed/transformer_exps.

Citation

Please cite our FedNLP and FedML papers if they help your research. You can describe us in your paper like this: "We develop our experiments based on FedNLP [1] and FedML [2]".

Comments
  • (Distributed Data Loader) The data loading time for sentiment140 is too long. I waited for more than half an hour.

    Running this for loop is extremely long. https://github.com/FedML-AI/FedNLP/blob/bd6dbb98e334637d69ad61e65f8d5ae75bf8d1cb/experiments/distributed/transformer_exps/text_classification_fedavg.py#L160

    opened by chaoyanghe 27
  • [IMPORTANT]Client Sampling Frozen

    Hello authors,

    I'm currently running your text classification code on the 20news dataset. I'm using a single Nvidia A6000 for this task with the FedOPT algorithm, 50 clients in total and 2 clients per round.

    After the data are loaded, once the training process reaches the client sampling part, it freezes like this:

    1133831 2022-03-08,21:49:04.477 - {gpu_mapping.py (22)} - mapping_processes_to_gpu_device_from_yaml_file(): gpu_util = {'ChaoyangHe-GPU-RTX2080Tix4': [1, 0]}
    1133831 2022-03-08,21:49:04.477 - {gpu_mapping.py (31)} - mapping_processes_to_gpu_device_from_yaml_file(): Process 0 running on host: ChaoyangHe-GPU-RTX2080Tix4, gethostname: lacrymosa.ics.uci.edu, local_gpu_id: 0 ...
    1133831 2022-03-08,21:49:04.477 - {gpu_mapping.py (32)} - mapping_processes_to_gpu_device_from_yaml_file(): i = 1, worker_number = 1
    1133831 2022-03-08,21:49:04.511 - {gpu_mapping.py (37)} - mapping_processes_to_gpu_device_from_yaml_file(): process_id = 0, GPU device = cuda:0
    1133831 2022-03-08,21:49:04.511 - {fedavg_main_tc.py (84)} - (): process_id = 0, size = 1, device=cuda:0
    1133831 2022-03-08,21:49:04.512 - {fedavg_main_tc.py (85)} - (): torch.cuda.current_device()=0
    1133831 2022-03-08,21:49:04.512 - {fedavg_main_tc.py (86)} - (): torch.cuda.device_count()=2
    Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight']
    • This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
    • This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    Loading index from h5 file.: 100%|██████████| 100/100 [00:00<00:00, 2576.67it/s]
    1133831 2022-03-08,21:49:07.753 - {base_data_manager.py (183)} - _load_federated_data_server(): caching test index size 7532 test cut off None
    Loading data from h5 file.: 100%|██████████| 7532/7532 [00:02<00:00, 3057.08it/s]
    100%|██████████| 7532/7532 [00:03<00:00, 2410.62it/s]
    1133831 2022-03-08,21:49:13.718 - {text_classification_preprocessor.py (145)} - transform_features(): 7532 features created from 7532 samples.
    1133831 2022-03-08,21:49:13.764 - {base_data_manager.py (196)} - _load_federated_data_server(): caching test data size 7532
    1133831 2022-03-08,21:49:13.858 - {base_data_manager.py (219)} - _load_federated_data_server(): test_dl_global number = 942
    1133831 2022-03-08,21:49:13.861 - {FedOptAggregator.py (132)} - client_sampling(): client_indexes = [26 86]

    I don't know why this is happening. Could you help me with this issue?

    opened by zjc664656505 20
  • No dataloader in data_preprocessing

    Hi, I notice that in the experiments/centralized/bilstm_exps/main_text_classification.py it imports different dataloaders for different tasks. For example:

    import data_preprocessing.AGNews.data_loader
    import data_preprocessing.SST_2.data_loader
    import data_preprocessing.SemEval2010Task8.data_loader
    import data_preprocessing.Sentiment140.data_loader
    import data_preprocessing.news_20.data_loader
    

    But I didn't find these data loaders in the data_preprocessing directory. Was this code removed or lost?

    Thanks,

    opened by ziqi-zhang 12
  • object doesn't exist for text classification script

    If I run a text classification model with distilbert using:

    DATA_NAME=20news
    CUDA_VISIBLE_DEVICES=1 python -m experiments.centralized.transformer_exps.main_tc \
    --dataset ${DATA_NAME} \
    --data_file ~/fednlp_data/data_files/${DATA_NAME}_data.h5 \
    --partition_file ~/fednlp_data/partition_files/${DATA_NAME}_partition.h5 \
    --partition_method niid_label_clients=100.0_alpha=5.0 \
    --model_type distilbert \
    --model_name distilbert-base-uncased \
    --do_lower_case True \
    --train_batch_size 32 \
    --eval_batch_size 8 \
    --max_seq_length 256 \
    --learning_rate 5e-5 \
    --epochs 20 \
    --evaluate_during_training_steps 500 \
    --output_dir /tmp/${DATA_NAME}_fed/ \
    --n_gpu 1

    I got the error 'KeyError: "Unable to open object (object 'niid_label_clients=100.0_alpha=5.0' doesn't exist)"', but the object should exist?

    opened by ayanflow 5
  • Too many unused configurations/functionalities in the model trainer class (e.g., model/fed_transformers/classification/classification_model.py)

    For example: ONNX, quantization.

    Another issue is that a single function has 400 lines of code. It's better to simplify this class for FL; the current code is mainly for centralized training. Can we do it in 100 lines, or split it into more functions or classes?

    opened by chaoyanghe 4
  • QA does not calculate the F1 score result; may I know how to fix it?

    root:epoch = 1, batch_idx = 2164/5520, loss = 0.6054576635360718
    INFO:root:epoch = 1, batch_idx = 2165/5520, loss = 0.4317961633205414
    INFO:root:epoch = 1, batch_idx = 2166/5520, loss = 1.390831470489502
    INFO:root:epoch = 1, batch_idx = 2167/5520, loss = 1.07370924949646
    INFO:root:epoch = 1, batch_idx = 2168/5520, loss = 1.1164920330047607
    INFO:root:epoch = 1, batch_idx = 2169/5520, loss = 0.6192945241928101
    INFO:root:epoch = 1, batch_idx = 2170/5520, loss = 0.7042073607444763
    INFO:root:epoch = 1, batch_idx = 2171/5520, loss = 0.702218770980835
    INFO:root:epoch = 1, batch_idx = 2172/5520, loss = 0.47233307361602783
    INFO:root:epoch = 1, batch_idx = 2173/5520, loss = 0.547944188117981
    INFO:root:epoch = 1, batch_idx = 2174/5520, loss = 0.703558087348938
    INFO:root:epoch = 1, batch_idx = 2175/5520, loss = 0.793656587600708
    INFO:root:epoch = 1, batch_idx = 2176/5520, loss = 0.790333092212677
    INFO:root:epoch = 1, batch_idx = 2177/5520, loss = 0.5816390514373779
    INFO:root:epoch = 1, batch_idx = 2178/5520, loss = 0.9623005986213684
    INFO:root:epoch = 1, batch_idx = 2179/5520, loss = 0.6054102182388306
    INFO:root:cached_features_file = cache_dir/cached_dev_bert_256_34726
    INFO:examples.question_answering.question_answering_model: Features loaded from cache at cache_dir/cached_dev_bert_256_34726
    Running Evaluation: 100%|██████████| 2203/2203 [03:30<00:00, 10.46it/s]
    INFO:root:{}
    INFO:examples.question_answering.question_answering_model:{'correct': 15510, 'similar': 14006, 'incorrect': 5210, 'eval_loss': -7.354878266291677}
    INFO:root:epoch = 1, batch_idx = 2180/5520, loss = 0.5322230458259583
    INFO:root:epoch = 1, batch_idx = 2181/5520, loss = 0.5274325609207153
    INFO:root:epoch = 1, batch_idx = 2182/5520, loss = 0.5773954391479492
    INFO:root:epoch = 1, batch_idx = 2183/5520, loss = 0.8108208775520325

    opened by chaoyanghe 2
  • Some suggestions/issues in "XXXModel" class

    I am trying to simplify the code.

    Some suggestions:

    1. I suggest keeping each function within a screen length (less than 100 lines). In some companies this is a hard rule. Even in the NLP domain, libraries like huggingface follow this rule well, and their code is readable to me.

    2. I don't see any special training tricks, so the training loop can fit within a screen length.

    3. When we want to repeat a functionality, it's better to extract it into a function. For example, 1) the early-stopping-related code appears twice in the training loop, but the content is nearly the same; 2) defining the trainable parameters can be shrunk into a function (the beginning part of train()).

    4. Code for many Transformer variants is merged via different branches. I suggest only considering the models used in FL. As a benchmark, two models will be enough.

    Some issues:

    1. This class also performs data loading inside the training and evaluation loops, which should be handled outside the trainer class in FL; otherwise performance may degrade and hidden issues may arise. Under the FedML framework, the design pattern is to finish data loading for each client before starting the training loop.

    (May update once I find more)

    opened by chaoyanghe 2
  • Hyperparameters for reproducing the results of the paper?

    Hi, thank you for the work.

    I am confused about the learning rates used in the experiments. The README.md (experiments/distributed/transformer_exps/README.md) has different server learning rates (0.1, 5e-5) from those in the paper (Section 4.3). I am trying to reproduce some experiments as a baseline, but I am reaching either higher or lower performance than the reported accuracy.

    For the seq2seq task (Gigaword), could you report the server and client learning rates? Or, even better, refer me to the wandb project?

    Thanks

    opened by bangawayoo 1
  • Error running uniform partition for text classification

    Hi. I am encountering an EOFError when trying to run the uniform partition for text classification.

    run_text_classification.sh FedOPT "uniform" 5e-5 0.1 51 4

    27440 2022-01-09,13:28:37.575 - {base_data_manager.py (306)} - _load_data_loader_from_cache(): Loading features from cached file cache_dir/distilbert_distilbert-base-uncased_cached_256_ClassificationModel_20news_uniform_75
    Traceback (most recent call last):
      File "/home/ky/miniconda3/envs/fednlp/lib/python3.7/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/home/ky/miniconda3/envs/fednlp/lib/python3.7/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/home/ky/Research/NLP/FL/FedNLP/experiments/distributed/transformer_exps/run_tc_exps/fedavg_main_tc.py", line 140, in <module>
        train_data_local_dict, test_data_local_dict, num_clients = dm.load_federated_data(process_id=process_id)
      File "/home/ky/Research/NLP/FL/FedNLP/data_manager/base_data_manager.py", line 142, in load_federated_data
        return self._load_federated_data_local()
      File "/home/ky/Research/NLP/FL/FedNLP/data_manager/base_data_manager.py", line 240, in _load_federated_data_local
        state, res = self._load_data_loader_from_cache(client_idx)
      File "/home/ky/Research/NLP/FL/FedNLP/data_manager/base_data_manager.py", line 309, in _load_data_loader_from_cache
        train_examples, train_features, train_dataset, test_examples, test_features, test_dataset = pickle.load(handle)
    EOFError: Ran out of input

    The non-IID partition method works fine. Any suggestions on how to fix this?

    Thanks.

    opened by bangawayoo 1
  • Hanging after last round of training

    Hi, thanks for the great work.

    When running sh run_text_classification.sh FedOPT "niid_label_clients=100_alpha=100.0" 1e-3 0.1 1 4, the process does not terminate automatically after the last round of training regardless of the number of communication rounds.

    The log stops after displaying the last eval metric:

    18521 2021-12-29,21:14:53.265 - {tc_transformer_trainer.py (180)} - eval_model(): best_accuracy = 0.000000
    18521 2021-12-29,21:14:53.266 - {tc_transformer_trainer.py (188)} - eval_model(): {'mcc': 0.0, 'tp': 0, 'tn': 0, 'fp': 0, 'fn': 0, 'acc': 0.0, 'eval_loss': 3.01809245740279}

    Commenting out post_complete_message_to_sweep_process(self.args) in ClientManger and ServerManger does make the program terminate, so it seems something with the FIFO is the problem. Will commenting out the function cause any problems?

    Possibly related to an issue from FedML.

    opened by bangawayoo 1
  • Refactor Model class to Trainer class

    @yuchenlin To distinguish it from Model (torch.Module), may I refactor the model class into xxxTrainer? In essence, the code you wrote is a trainer that handles train, eval, load, save, args, config, etc. In huggingface, they also call this kind of class a trainer:

    https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py

    opened by chaoyanghe 1
  • Problems running the text classification model: "No module named transformers"

    Hi. Our environment: Ubuntu 16.04. During installation we strictly followed the Python version and configuration steps in the README.md file. All the required dependencies are confirmed to be installed, but the following error occurs as soon as we run 'bash run_simulation.sh'. When we run the dependency-test command bash run_simulation.sh, the error shown in the attached screenshot appears ("No module named transformers").

    Please give us some advice. Thank you!

    opened by QIJIAHAO-6 0
  • KeyError: 'Unable to open object (bad heap free list)'

    When I use 20news for classification, I get this error; can anyone help me? I got the dataset from here: https://fednlp.s3-us-west-1.amazonaws.com/partition_files/20news_partition.h5 https://fednlp.s3-us-west-1.amazonaws.com/data_files/20news_data.h5

    Loading data from h5 file.: 0%| | 0/11314 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "/root/miniconda3/envs/fednlp/lib/python3.7/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/root/miniconda3/envs/fednlp/lib/python3.7/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/home/FedNLP-master/experiments/centralized/transformer_exps/main_tc.py", line 91, in <module>
        train_dl, test_dl = dm.load_centralized_data()
      File "/home/FedNLP-master/data_manager/base_data_manager.py", line 112, in load_centralized_data
        train_data = self.read_instance_from_h5(data_file, train_index_list)
      File "/home/FedNLP-master/data_manager/text_classification_data_manager.py", line 23, in read_instance_from_h5
        X.append(data_file["X"][str(idx)][()].decode("utf-8"))
      File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
      File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
      File "/root/miniconda3/envs/fednlp/lib/python3.7/site-packages/h5py/_hl/group.py", line 305, in __getitem__
        oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
      File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
      File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
      File "h5py/h5o.pyx", line 190, in h5py.h5o.open
    KeyError: 'Unable to open object (bad heap free list)'

    opened by ysgncss 3