Parallelformers: An Efficient Model Parallelization Toolkit for Deployment

Overview

  • Parallelformers, which is based on Megatron LM, is designed to make model parallelization easier.
  • You can parallelize various models in HuggingFace Transformers on multiple GPUs with a single line of code.
  • Currently, Parallelformers only supports inference. Training features are NOT included.

Why Parallelformers?

You can load a model that is too large for a single GPU. For example, using Parallelformers, you can load a 12GB model onto two 8GB GPUs. You can also save money, since multiple smaller GPUs are usually cheaper than a single large GPU.

Installation

Parallelformers can be installed with the pip package manager. All dependencies, such as torch, transformers, and dacite, are installed automatically by the following command. Note that the package name is plural.

pip install parallelformers

Getting Started

1. Create a HuggingFace transformers model.

You don't need to call .half() or .cuda(); those functions are invoked automatically. It is more memory-efficient to start parallelization from the CPU.

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")

2. Put the model in the parallelize() function.

from parallelformers import parallelize

parallelize(model, num_gpus=2, fp16=True, verbose='detail')

Since nvidia-smi shows the reserved cache area, it is difficult to check the exact amount of allocated memory. To inspect the allocated memory, set the verbose option to 'detail' or 'simple' (the default is None).

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    2721 MB |    2967 MB |    2967 MB |  251905 KB |
|       from large pool |    2720 MB |    2966 MB |    2966 MB |  251904 KB |
|       from small pool |       1 MB |       1 MB |       1 MB |       1 KB |
|---------------------------------------------------------------------------|

GPU:0 => 2.72GB
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 1                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    2721 MB |    2967 MB |    2967 MB |  251905 KB |
|       from large pool |    2720 MB |    2966 MB |    2966 MB |  251904 KB |
|       from small pool |       1 MB |       1 MB |       1 MB |       1 KB |
|---------------------------------------------------------------------------|

GPU:1 => 2.72GB

3. Do Inference as usual.

You don't have to call .cuda() when creating the input tokens. Note that you should pass both the input tokens and the attention mask to the model (**inputs is the recommended way to do this).

inputs = tokenizer("Parallelformers is", return_tensors="pt")

outputs = model.generate(
    **inputs,
    num_beams=5,
    no_repeat_ngram_size=4,
    max_length=15,
)

print(f"Output: {tokenizer.batch_decode(outputs)[0]}")
Output: Parallelformers is an open-source library for parallel programming ...

4. Deploy the model to the server as usual.

The parallelization process does not affect the web server, because the parallel processes are synchronized automatically.

") def generate_text(text): inputs = tokenizer(text, return_tensors="pt") outputs = model.generate( **inputs, num_beams=5, no_repeat_ngram_size=4, max_length=15, ) outputs = tokenizer.batch_decode( outputs, skip_special_tokens=True, ) return { "inputs": text, "outputs": outputs[0], } app.run(host="0.0.0.0", port=5000) ">
from flask import Flask

app = Flask(__name__)


@app.route("/generate_text/")
def generate_text(text):
    inputs = tokenizer(text, return_tensors="pt")
    
    outputs = model.generate(
        **inputs,
        num_beams=5,
        no_repeat_ngram_size=4,
        max_length=15,
    )
    
    outputs = tokenizer.batch_decode(
        outputs,
        skip_special_tokens=True,
    )
    
    return {
        "inputs": text,
        "outputs": outputs[0],
    }


app.run(host="0.0.0.0", port=5000)

You can send a request to the web server as follows:

$ curl -X GET "YOUR_IP:5000/generate_text/Messi"

And the following result should be returned.

{"inputs": "Messi", "outputs": "Messi is the best player in the world right now. He is the"}

5. Check the current GPU states.

You can check GPU states using .memory_allocated(), .memory_reserved() and .memory_chached() to make sure the parallelization is successful.

model.memory_allocated()
model.memory_reserved()
model.memory_chached()
{'cuda:0':XXXXXX, 'cuda:1':XXXXXX}
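
For example, the returned dictionaries can be summarized per device like this (a sketch; it assumes the values are raw byte counts, consistent with the GB figures printed by the verbose mode above):

# Hypothetical helper: pretty-print the per-device memory states shown above.
def print_memory_states(model):
    for name, getter in [
        ("allocated", model.memory_allocated),
        ("reserved", model.memory_reserved),
    ]:
        for device, num_bytes in getter().items():
            print(f"{device} {name}: {num_bytes / 1024 ** 3:.2f} GB")

print_memory_states(model)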

6. Manage the model parallelization states.

You can manage the model parallelization state using .cuda(), .cpu() and .to(). Calling any of these functions ends the parallelization process.

model.cuda()

print(torch.cuda.memory_summary(0))
print(torch.cuda.memory_summary(1))

Check the allocated memory status using torch.cuda.memory_summary().

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    5121 MB |    5121 MB |    5121 MB |    1024 B  |
|       from large pool |    5120 MB |    5120 MB |    5120 MB |       0 B  |
|       from small pool |       1 MB |       1 MB |       1 MB |    1024 B  |
|---------------------------------------------------------------------------|

GPU0 => 5.12GB
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 1                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |    1024 B  |    1024 B  |    1024 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |    1024 B  |    1024 B  |    1024 B  |
|---------------------------------------------------------------------------|

GPU1 => 0.00GB

If you switch to CPU mode, it works like this.

model.cpu()

print(torch.cuda.memory_summary(0))
print(torch.cuda.memory_summary(1))
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |    5121 MB |    5121 MB |    5121 MB |
|       from large pool |       0 B  |    5120 MB |    5120 MB |    5120 MB |
|       from small pool |       0 B  |       1 MB |       1 MB |       1 MB |
|---------------------------------------------------------------------------|

GPU0 => 0.00GB
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 1                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |    1024 B  |    1024 B  |    1024 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |    1024 B  |    1024 B  |    1024 B  |
|---------------------------------------------------------------------------|

GPU1 => 0.00GB

Supported Models

Currently, most models in HuggingFace Transformers are supported. All layers of the models listed below can be parallelized. The list includes vision models such as ViT and CLIP and speech models such as Wav2Vec2, as well as language models.
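
For example, a vision model is parallelized in exactly the same way as the language-model example above (a sketch; google/vit-base-patch16-224 is just an assumed example checkpoint):

from transformers import AutoModelForImageClassification
from parallelformers import parallelize

# Assumed example checkpoint; any supported ViT checkpoint should work the same way.
model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")
parallelize(model, num_gpus=2, fp16=True, verbose='simple')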

Fully Supported Models
  • ALBERT
  • BART
  • BARThez (=BERT)
  • BERT
  • BERTweet (=BERT)
  • BertJapanese (=BERT)
  • BertGeneration
  • Blenderbot
  • Blenderbot Small
  • BORT (=BERT)
  • CamemBERT (=RoBERTa)
  • CLIP
  • CPM
  • CTRL
  • DeBERTa
  • DeBERTa-v2
  • DeiT
  • DETR
  • DialoGPT (=GPT2)
  • DistilBERT
  • DPR (=BERT)
  • ELECTRA
  • FlauBERT (=XLM)
  • FSMT
  • Funnel Transformer
  • herBERT (=RoBERTa)
  • I-BERT
  • LayoutLM
  • LED
  • Longformer
  • LUKE
  • LXMERT
  • MarianMT
  • M2M100
  • MBart
  • Mobile BERT
  • MPNet
  • MT5 (=T5)
  • Megatron BERT (=BERT)
  • Megatron GPT2 (=GPT2)
  • OpenAI GPT
  • OpenAI GPT2
  • GPTNeo
  • Hubert
  • Pegasus
  • PhoBERT (=RoBERTa)
  • Reformer
  • RetriBERT
  • RoBERTa
  • RoFormer
  • Speech2Text
  • T5
  • ByT5 (=T5)
  • TAPAS
  • TransformerXL
  • ViT
  • VisualBERT
  • Wav2Vec2
  • XLM
  • XLM-RoBERTa (=RoBERTa)
  • XLNet
  • XLSR-Wav2Vec2

At present, the following models are only partly supported or not supported.

Partly Supported Models
  • BigBird
  • BigBirdPegasus
  • ConvBERT
  • ProphetNet
  • XLM-ProphetNet
Unsupported Models
  • SqueezeBERT
  • RAG

Advanced Usage

Refer to POLICY.md.
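
For orientation, a custom policy for an unsupported model roughly follows the skeleton below (a sketch adapted from the MegatronPolicy example in the comments further down this page; the attribute paths and config field names are placeholders, not a real model's):

from parallelformers.policies.base import Policy, Layer
from parallelformers.utils.dist_utils import AllReduceLinear


class MyModelPolicy(Policy):
    @staticmethod
    def replace_arguments(config, world_size):
        # Shrink the per-GPU attention sizes by the tensor-parallel world size.
        # "d_model" and "encoder_attention_heads" are placeholder config fields.
        return {
            "self_attn.embed_dim": config.d_model // world_size,
            "self_attn.num_heads": config.encoder_attention_heads // world_size,
        }

    @staticmethod
    def attn_qkv():
        # Attention projections that are sliced across GPUs.
        return [
            Layer(weight="self_attn.q_proj.weight", bias="self_attn.q_proj.bias"),
            Layer(weight="self_attn.k_proj.weight", bias="self_attn.k_proj.bias"),
            Layer(weight="self_attn.v_proj.weight", bias="self_attn.v_proj.bias"),
        ]

    @staticmethod
    def attn_out():
        # Attention output projection whose results are all-reduced across GPUs.
        return [
            Layer(
                weight="self_attn.out_proj.weight",
                bias="self_attn.out_proj.bias",
                replace=AllReduceLinear,
            ),
        ]

    @staticmethod
    def mlp_in():
        return [Layer(weight="fc1.weight", bias="fc1.bias")]

    @staticmethod
    def mlp_out():
        return [Layer(weight="fc2.weight", bias="fc2.bias", replace=AllReduceLinear)]

    @staticmethod
    def original_layer_class():
        # Return the transformer block class this policy applies to,
        # e.g. MyModelDecoderLayer from the model's modeling file.
        raise NotImplementedError

The policy is then passed to parallelize() via the custom_policies argument, e.g. parallelize(model, num_gpus=2, fp16=True, custom_policies=[MyModelPolicy]).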

FAQ

Refer to FAQ.md.

Contributing

Refer to CONTRIBUTING.md.

Documentation

For more detailed information, see the full documentation.

Citation

If you find this library useful, please consider citing:

@misc{parallelformers,
  author       = {Ko, Hyunwoong},
  title        = {Parallelformers: An Efficient Model Parallelization Toolkit for Deployment},
  howpublished = {\url{https://github.com/tunib-ai/parallelformers}},
  year         = {2021},
}

LICENSE

Parallelformers is licensed under the terms of the Apache License 2.0.

Copyright 2021 TUNiB inc. https://www.tunib.ai. All Rights Reserved.

Comments
  • AssertionError: Model should be on CPU before parallelization. It is more memory-efficient.

    Hello, first of all congratulations for this amazing project. It's simple, efficient and versatile. Very useful.

    In some cases, it happens that one has several GPUs, but not enough RAM to parallelize the model. When loading the model on GPU, and then parallelizing, I'm getting the below error: AssertionError: Model should be on CPU before parallelization. It is more memory-efficient.

    It doesn't stop the script, but it seems that the parallelization fails.

    My question is: is it possible to load the initial model on GPU instead of CPU (even if it's not memory-efficient) or not at all?

    Thanks!

    opened by juliensalinas 29
  • Quality issue when integrating with KoGPT3

    Hello, I was testing inference on KoGPT3 with the parallelformers library to check whether the GPTJForCausalLM model is supported.

    The code I ran is as follows.

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM 
    from parallelformers import parallelize
    
    tokenizer = AutoTokenizer.from_pretrained(
      'kakaobrain/kogpt', revision='KoGPT6B-ryan1.5b-float16',  # or float32 version: revision=KoGPT6B-ryan1.5b
      bos_token='[BOS]', eos_token='[EOS]', unk_token='[UNK]', pad_token='[PAD]', mask_token='[MASK]'
    )
    model = AutoModelForCausalLM.from_pretrained(
      'kakaobrain/kogpt', revision='KoGPT6B-ryan1.5b-float16',  # or float32 version: revision=KoGPT6B-ryan1.5b
      pad_token_id=tokenizer.eos_token_id,
      torch_dtype='auto'
    )
    
    parallelize(model, num_gpus=2, fp16=True, verbose='detail')
    
    prompt = '''[공부, 학생, 힘들] => 힘들더라도 학생의 본분은 공부입니다
    [시작, 떨림, 긴장] => 새로운 시작은 항상 떨리고 긴장되죠 파이팅!!
    [방어, 제철, 겨울] => 겨울에는 방어가 제철이죠 방어회 어떠세요?
    [겸손, 인생, 변화] => 인생은 어떻게 변할지 몰라요 항상 겸손한 태도를 갖춰야해요
    [학교, 선생님, 은혜] => 학창시절 선생님의 은혜를 잊지 못해요 감사합니다.
    [입사, 회사, 신입] =>'''
    
    temperature = 0.8
    max_length = 140
    batch_size = 5
    
    inputs = tokenizer([prompt]*batch_size, return_tensors="pt")
    ## when passing **inputs
    gen_tokens = model.generate(**inputs, do_sample=True, temperature=temperature, max_length=max_length)
    ## when passing input_ids and attention_mask
    ## gen_tokens = model.generate(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, do_sample=True, temperature=temperature, max_length=max_length)
    generated = tokenizer.batch_decode(gen_tokens)
    

    The output is as follows.

    • Without parallelformers: (image)

    • With parallelformers (passing **inputs): (image)

    • With parallelformers (passing only input_ids and attention_mask): (image)

    As shown above, wrapping the model with parallelformers sometimes degrades the output quality (the results can even be grammatically broken). I'm opening this issue to ask whether I'm using the library incorrectly or whether GPT-3-style models are simply not supported :)

    opened by BangDaeng 13
  • RuntimeError: Cannot re-initialize CUDA in forked subprocess

    How to reproduce

    I'm getting the following error while trying to run the example in the getting started document

    Process ParallelProcess-1:
    Traceback (most recent call last):
      File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
        self.run()
      File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
        return func(*args, **kwargs)
      File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/process.py", line 251, in run
        engine = ParallelEngine(
      File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 53, in __init__
        self.mp_group = self.create_process_group(backend)
      File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 106, in create_process_group
        torch.cuda.set_device(int(os.getenv("LOCAL_RANK", "0")))
      File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/cuda/__init__.py", line 314, in set_device
        torch._C._cuda_setDevice(device)
      File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/cuda/__init__.py", line 207, in _lazy_init
        raise RuntimeError(
    RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
    Process ParallelProcess-2:
    Traceback (most recent call last):
      File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
        self.run()
      File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
        return func(*args, **kwargs)
      File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/process.py", line 251, in run
        engine = ParallelEngine(
      File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 53, in __init__
        self.mp_group = self.create_process_group(backend)
      File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 104, in create_process_group
        dist.init_process_group(backend=backend)
      File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 627, in init_process_group
        _store_based_barrier(rank, store, timeout)
      File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 242, in _store_based_barrier
        worker_count = store.add(store_key, 0)
    RuntimeError: Connection reset by peer
    

    This is my code. I'm running it on an AWS g5.12xlarge instance with 4 GPUs.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
    
    from parallelformers import parallelize
    
    parallelize(model, num_gpus=2, fp16=True, verbose='detail')
    
    inputs = tokenizer("Parallelformers is", return_tensors="pt")
    
    outputs = model.generate(
        **inputs,
        num_beams=5,
        no_repeat_ngram_size=4,
        max_length=15,
    )
    
    print(f"Output: {tokenizer.batch_decode(outputs)[0]}")
    
    
    
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A10G         On   | 00000000:00:1B.0 Off |                    0 |
    |  0%   29C    P8    19W / 300W |      2MiB / 23028MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  NVIDIA A10G         On   | 00000000:00:1C.0 Off |                    0 |
    |  0%   29C    P8    16W / 300W |      2MiB / 23028MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   2  NVIDIA A10G         On   | 00000000:00:1D.0 Off |                    0 |
    |  0%   29C    P8    16W / 300W |      2MiB / 23028MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   3  NVIDIA A10G         On   | 00000000:00:1E.0 Off |                    0 |
    |  0%   30C    P8    15W / 300W |      2MiB / 23028MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
                                                                                   
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    

    I pip installed multiprocess (https://pypi.org/project/multiprocess/) because I initially kept getting an error on import multiprocess as mp (multiprocess not found). Then I noticed there was a PR by @Oaklight that removed torch.multiprocessing. Maybe I'm not using the right multiprocessing library? Reverting to torch.multiprocessing caused the same error noticed by @Oaklight.

    Environment

    • OS : Ubuntu
    • Python version : 3.9
    • Transformers version : 4.21.0
    • Whether to use Docker: Nope
    • Misc.: Cuda 11.6
    bug 
    opened by cabal-daniel 12
  • Support for GPT-J

    Thanks for the great repo! I have tried it out; it's really amazing to load such a large model across multiple GPUs.

    Describe a requested feature

    Currently, GPT-J is supported only with HF 4.7.0, by installing

    pip install git+https://github.com/finetuneanon/transformers@gpt-j
    

    Your requirements specify HF 4.8.0, which is needed to load several new models. Soon GPT-J will be fully integrated into HF: https://github.com/huggingface/transformers/pull/12243

    I am wondering if there is an easy way to have backward compatibility, or to include GPT-J soon.

    Thanks again for your great repo 👍🏻

    -- Andrea

    enhancement 
    opened by andreamad8 11
  • AttributeError: Can't get attribute 'MegatronPolicy' on <module '__main__' (built-in)>

    Trying to use parallelformers with the megatron-11b pip package. The MegatronPolicy class is as provided on the megatron-11b PyPI page.

    How to reproduce

    from megatron_11b import MegatronForCausalLM, MegatronTokenizer
    
    tokenizer = MegatronTokenizer.from_pretrained("./megatron-11B")
    model = MegatronForCausalLM.from_pretrained("./megatron-11B")
    
    # https://tunib-ai.github.io/parallelformers/intro/POLICY.html
    
    from parallelformers.policies.base import Policy, Layer
    from parallelformers.utils.dist_utils import AllReduceLinear
    from megatron_11b.modeling_megatron import MegatronDecoderLayer
    
    
    class MegatronPolicy(Policy):
    
        @staticmethod
        def replace_arguments(config, world_size):
            return {
                # 1. reduce hidden size
                "self_attn.embed_dim": config.d_model // world_size,
    
                # 2. reduce number of heads
                "self_attn.num_heads": config.encoder_attention_heads // world_size,
            }
    
        @staticmethod
        def attn_qkv():
            return [
                Layer(
                    weight="self_attn.q_proj.weight",
                    bias="self_attn.q_proj.bias",
                ),
                Layer(
                    weight="self_attn.k_proj.weight",
                    bias="self_attn.k_proj.bias",
                ),
                Layer(
                    weight="self_attn.v_proj.weight",
                    bias="self_attn.v_proj.bias",
                ),
            ]
    
        @staticmethod
        def attn_out():
            return [
                Layer(
                    weight="self_attn.out_proj.weight",
                    bias="self_attn.out_proj.bias",
                    replace=AllReduceLinear,
                ),
            ]
    
        @staticmethod
        def mlp_in():
            return [
                Layer(
                    weight="fc1.weight",
                    bias="fc1.bias",
                ),
            ]
    
        @staticmethod
        def mlp_out():
            return [
                Layer(
                    weight="fc2.weight",
                    bias="fc2.bias",
                    replace=AllReduceLinear,
                ),
            ]
    
        @staticmethod
        def original_layer_class():
            return MegatronDecoderLayer
    
    from parallelformers import parallelize
    
    parallelize(model, num_gpus=8, fp16=True, verbose='detail', custom_policies=[MegatronPolicy])
    

    Environment

    • OS : Ubuntu LTS 20.04
    • Python version : 3.8
    • Transformers version : 4.4.2
    • Whether to use Docker: no
    • Misc.: it's executed in a jupyter notebook, which might be the source of the problem: https://stackoverflow.com/a/65001152
    bug 
    opened by Oaklight 6
  • How to load multiple models

    How to reproduce

    • First of all, thank you for creating such a good project.
    • I'm currently testing serving several Korean models with Flask on a server that has eight 1080 GPUs.
    • Loading a single model onto multiple GPUs works fine, but when I load several models at the same time I get the error below.
    • Is there any additional work required when loading multiple models at once?
    • The target GPUs are selected by adjusting the CUDA_VISIBLE_DEVICES environment variable before loading each model.
    • e.g. os.environ["CUDA_VISIBLE_DEVICES"]="0", parallelize(model_1, ... )
    •      os.environ["CUDA_VISIBLE_DEVICES"]="1", parallelize(model_2, ... )
      
    .... (the error occurs when loading the second model)
    ===========================================================       
    model name :  ./model/ko-gpt-trinity-1.2B-v0.5
    CUDA_VISIBLE_DEVICES :  1
    request_gpu :  1                            
    used_gpu    :  2
    ===========================================================          
    Process ParallelProcess-2:                                         
    Traceback (most recent call last):                                        
      File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
        self.run()                  
      File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
        return func(*args, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/parallelformers/parallel/process.py", line 254, in run
        custom_policies=self.custom_policies,
      File "/opt/conda/lib/python3.7/site-packages/parallelformers/parallel/engine.py", line 53, in __init__
        self.mp_group = self.create_process_group(backend)
      File "/opt/conda/lib/python3.7/site-packages/parallelformers/parallel/engine.py", line 104, in create_process_group
        dist.init_process_group(backend=backend)
      File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
        store, rank, world_size = next(rendezvous_iterator)
      File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 229, in _env_rendezvous_handler
        store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
      File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 158, in _create_c10d_store
        hostname, port, world_size, start_daemon, timeout, multi_tenant=True
    RuntimeError: Address already in use
    
    
    • The error seems to occur when dist.init_process_group is called in parallelformers/parallel/engine.py.
    • How should the parallelize call be changed so that multiple models can be loaded at the same time?
        def create_process_group(self, backend: str):
            """
            Create Pytorch distributed process group
            Args:
                backend (str): distributed backend
            Returns:
                ProcessGroupNCCL: process group for parallization
            """
            if not dist.is_initialized():
                dist.init_process_group(backend=backend)
    
            torch.cuda.set_device(int(os.getenv("LOCAL_RANK", "0")))
            new_group = dist.new_group([i for i in range(self.num_gpus)])
    
            return new_group
    
    

    Environment

    • OS : Ubuntu 18.04
    • Python version :3.7.11
    • Transformers version : 4.15.0
    • Whether to use Docker: FROM pytorch/pytorch:1.9.1-cuda11.1-cudnn8-devel
    • Misc.: " flask 내에서 parallelformers를 활용한 다중 모델 로드"
    question 
    opened by Don9wanKim 6
  • Error using google/UL2 model

    The model: google/ul2

    The Hardware: 2x RTX Titan, AMD Ryzen 9 5900X 12-Core Processor, 64GB RAM

    The Environment: Python 3.9.13, PyTorch 1.12.0+cu102, NVIDIA-SMI 495.29.05, Driver Version 495.29.05, CUDA Version 11.5

    Code used:

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    from parallelformers import parallelize
    import torch
    
    tokenizer = AutoTokenizer.from_pretrained("google/ul2")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/ul2")
    
    parallelize(model, num_gpus=2, fp16=True, verbose='detail')
    
    input_string = "[S2S] Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, solid man with a bald head. Mrs. Dursley was thin and blonde and more than the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbours. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere <extra_id_0>"
    
    inputs = tokenizer(input_string, return_tensors="pt")
    
    outputs = model.generate(**inputs, max_length=200)
    
    print(tokenizer.decode(outputs[0]))
    

    Error Message:

    $ python test.py 
    /home/******/miniconda3/envs/ul2/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 16 leaked semaphore objects to clean up at shutdown
      warnings.warn('resource_tracker: There appear to be %d '
    Bus error (core dumped)
    

    Is this something I can fix? I would love to use this large model, as it's near SOTA on everything :)

    opened by dnhkng 4
  • Support for GPT2-XL

    Thank you for the great project!

    How to reproduce

    https://github.com/snoop2head/Language_Model_Memorization/blob/2c5db6f9bdd0206cba87d13b158d8c27ce0e55a7/parallel_inference.py#L39-L82

    • Tested and works for gpt2, gpt2-medium, gpt2-large
    • If the AutoModelForCausalLM checkpoint is changed to gpt2-xl, it yields the following error message
    File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/parallelformers/parallel/process.py", line 193, in inference
        outputs = function_(
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
        return func(*args, **kwargs)
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1294, in generate
        return self.greedy_search(
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1689, in greedy_search
        outputs = self(
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1058, in forward
        transformer_outputs = self.transformer(
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 901, in forward
        outputs = block(
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 401, in forward
        attn_outputs = self.attn(
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 325, in forward
        query = self._split_heads(query, self.num_heads, self.head_dim)
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 290, in _split_heads
        tensor = tensor.view(new_shape)
    RuntimeError: shape '[116, 5, 12, 64]' is invalid for input of size 464000
    Traceback (most recent call last):
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/parallelformers/parallel/process.py", line 193, in inference
        outputs = function_(
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
        return func(*args, **kwargs)
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1294, in generate
        return self.greedy_search(
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1689, in greedy_search
        outputs = self(
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1058, in forward
        transformer_outputs = self.transformer(
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 901, in forward
        outputs = block(
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 401, in forward
        attn_outputs = self.attn(
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 325, in forward
        query = self._split_heads(query, self.num_heads, self.head_dim)
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 290, in _split_heads
        tensor = tensor.view(new_shape)
    RuntimeError: shape '[116, 5, 12, 64]' is invalid for input of size 464000
    
    bug 
    opened by snoop2head 3
  • docker support

    We will continue to log problems with Docker containers in this thread, and we aim to solve them. Ultimately, the goal is to deploy the model in a Kubernetes environment. If anyone has any problems in a Docker environment, please feel free to open issues; we will actively review and resolve them.

    enhancement 
    opened by hyunwoongko 2
  • Support for OPT

    Hi,

    Would it be possible to support new OPT models (a suite of GPT-like models)?

    Here's the official doc: https://huggingface.co/docs/transformers/model_doc/opt

    Thanks for your great work!

    enhancement 
    opened by mrzjy 1
  • GPT2 parallelism does not work on the Tesla K80

    How to reproduce

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from parallelformers import parallelize
    
    model = AutoModelForCausalLM.from_pretrained("distilgpt2")
    tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
    
    parallelize(model, num_gpus=2, fp16=True, verbose='detail')
    
    inputs = tokenizer("Parallelformers is", return_tensors="pt")
    
    outputs = model.generate(
        **inputs,
        max_length=15,
    )
    
    print(f"Output: {tokenizer.batch_decode(outputs)[0]}")
    

    Problem

    The system distributes the model across the GPUs, but during generation the second GPU goes to 100% load and never leaves that state. Generation fails. (image)

    Environment

    PyTorch version: 1.10.1+cu113
    Is debug build: False
    CUDA used to build PyTorch: 11.3
    ROCM used to build PyTorch: N/A
    
    OS: Ubuntu 18.04.6 LTS (x86_64)
    GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
    Clang version: Could not collect
    CMake version: Could not collect
    Libc version: glibc-2.17
    
    Python version: 3.7.13 (default, Mar 29 2022, 02:18:16)  [GCC 7.5.0] (64-bit runtime)
    Python platform: Linux-4.15.0-187-generic-x86_64-with-debian-buster-sid
    Is CUDA available: True
    CUDA runtime version: Could not collect
    GPU models and configuration: 
    GPU 0: NVIDIA Tesla K80
    GPU 1: NVIDIA Tesla K80
    
    Nvidia driver version: Could not collect
    cuDNN version: Could not collect
    HIP runtime version: N/A
    MIOpen runtime version: N/A
    Is XNNPACK available: True
    
    Versions of relevant libraries:
    [pip3] numpy==1.21.6
    [pip3] torch==1.10.1+cu113
    [conda] numpy                     1.21.6                   pypi_0    pypi
    [conda] torch                     1.10.1+cu113             pypi_0    pypi
    
    bug 
    opened by 0x7o 1
  • Do not check if an object is pickable

    Title

    • Speed up results serialization

    Description

    • This implements the easiest solution to #46 by simply removing the check. It wasn't tested with python 3.6 (which uses standalone dataclasses lib). It may break on dynamically created dataclasses.

    Linked Issues

    • resolved #46
    opened by mkardas 1
  • Speed up results serialization

    Describe a requested feature

    I was running some performance tests and noticed that checking whether an object is pickable: https://github.com/tunib-ai/parallelformers/blob/ccaea515ee2e4d7540f2a275f6cdb0c33a7780f0/parallelformers/parallel/process.py#L209 takes a lot of time when the output is big (e.g., when a model returns a large logits tensor), because the whole object is serialized into memory and then deserialized. I wonder in which cases check_pickable helps, as dataclasses and ModelOutput should be as pickable as their dictionary representation.

    If the check is still needed, I guess the code could still be sped up by modifying an object only on pickle failure. That would require some workarounds (perhaps overriding https://github.com/python/cpython/blob/9dc787ea96916552695e79397588fdfa68f22024/Lib/multiprocessing/queues.py#L275), so I want to make sure the check is still necessary before giving it a shot. Another option is to always check for https://github.com/tunib-ai/parallelformers/blob/ccaea515ee2e4d7540f2a275f6cdb0c33a7780f0/parallelformers/parallel/process.py#L236-L239 and modify the object even if it's pickable, but that would remove custom fields added outside the definition of a given class.
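
    For context, the round-trip check being discussed is roughly of this form (an illustration only, with a hypothetical helper name; not the library's actual implementation):

    import pickle

    def is_pickable_roundtrip(obj):
        # Serializing the whole object into memory (and deserializing it again)
        # is what makes this expensive when obj contains large tensors.
        try:
            pickle.loads(pickle.dumps(obj))
            return True
        except Exception:
            return False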

    enhancement 
    opened by mkardas 0
  • RuntimeError: CUDA error: peer access is not supported between these two devices

    I tried running the example from the readme but received the above error. Does that mean that my hardware is not supported?

    Environment

    • OS : Ubuntu
    • Python version : 3.7.11
    • Transformers version : 4.23.1
    • Whether to use Docker: no
    • GPUs: NVIDIA GeForce RTX 2080 Ti
    bug 
    opened by Dorcoh4 0
  • Bug with T511b inference

    How to reproduce

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer,AutoModelForCausalLM
    from parallelformers import parallelize
    model = AutoModelForCausalLM.from_pretrained('EleutherAI/gpt-neo-2.7B')
    parallelize(model, num_gpus=4, fp16 = False)
    

    Environment

    • OS : 18.04.4 LTS (Bionic Beaver) Ubuntu
    • Python version : 3.7.3
    • Transformers version : 4.22.1
    • Whether to use Docker: No
    • Misc.: N/A
    bug 
    opened by ZeyiLiao 0
  • OSError: [Errno 9] Bad file descriptor

    How to reproduce

    Using a p4d.24xlarge:

    from parallelformers import parallelize
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM
    
    model_name = "facebook/opt-66b"
    batch_size = [1]
    batch = [["out story begins on"] * bs for bs in batch_size]
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
    inputs = [tokenizer(seq, return_tensors="pt").input_ids for seq in batch]
    parallelize(model, num_gpus=8, fp16=True)
    for _ in range(100):
        model.generate(
            torch.cat(inputs, dim=0),
            do_sample=True,
            max_length=2048,
            num_return_sequences=1,
        )
    

    It loads okay and begins performing inference. I can see all 8 GPUs at 90+% utilization in nvidia-smi for a while. Then eventually one GPU drops to 0% and the others jump to 100%. The terminal shows:

    Traceback (most recent call last):                                                                         
      File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
        obj = _ForkingPickler.dumps(obj)                                                                       
      File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps                                                                                                         
        cls(buf, protocol).dump(obj)                                                                           
      File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 367, in reduce_storage                                                                          
        df = multiprocessing.reduction.DupFd(fd)                                                               
      File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/reduction.py", line 198, in DupFd                                                                                                        
        return resource_sharer.DupFd(fd)                                                                       
      File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/resource_sharer.py", line 48, in __init__                                                                                                
        new_fd = os.dup(fd)                                                                                    
    OSError: [Errno 9] Bad file descriptor 
    

    It then seems to hang forever from there.

    I do realize this stack trace doesn't give enough information to trace the problem back to parallelformers, which is frustrating. Maybe it's actually a bug in PyTorch or multiprocessing?

    Environment

    • OS : Ubuntu 20.04.4 LTS
    • Python version : 3.8.13
    • Transformers version : 4.24.0
    • Whether to use Docker : No
    • Misc. : N/A
    bug 
    opened by aws-stdun 0
Releases: v1.2.7