Natural Language Processing with transformers

Overview

An introduction to natural language processing (NLP) with transformers

Natural Language Processing with transformers. This project is intended for:

  • Beginners in NLP and in transformer models
  • Readers with basic Python and PyTorch programming skills
  • Those interested in state-of-the-art transformer models
  • Those familiar with simple deep learning models

The vision of this project:

We hope that, by combining vivid explanations of the underlying principles with several hands-on projects, we can help beginners quickly get started with NLP in the deep learning era.

The main references for this project are:

  • The Hugging Face Transformers codebase
  • Several excellent Transformer tutorials and write-ups

Project members:

  • erenup (多多笔记), Peking University, project lead
  • 张帆, Datawhale, Tianjin University, Chapter 4
  • 张贤, Harbin Institute of Technology, Chapter 2
  • 李泺秋, Zhejiang University, Chapter 3
  • 蔡杰, Peking University, Chapter 4
  • hlzhang, McGill University, Chapter 4
  • 台运鹏, Chapter 2
  • 张红旭, Chapter 2

This project summarizes and learns from many excellent documents and write-ups, and the sources are credited in each chapter. If anything infringes your rights, please contact the project members promptly, thank you. Going to GitHub and starring the repo before you start studying makes the effort twice as effective 😄, thank you.

Contents

Chapter 1 - Preface

Chapter 2 - Transformer fundamentals

Chapter 3 - Writing a Transformer model: BERT

Chapter 4 - Using Transformers to solve NLP tasks

Comments
  • Possible version issue with 4.5-生成任务-语言模型 (generation task: language modeling)

    This section contains the following snippet:

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=lm_datasets["train"][:1000],
        eval_dataset=lm_datasets["validation"][:100],
        data_collator=data_collator,
    )
    

    This may be a version issue on my side: after slicing, the datasets object becomes a dict, which apparently cannot be passed directly as a dataset. Doing so raises the following error:

    ***** Running training *****
      Num examples = 3
      Num Epochs = 1
      Instantaneous batch size per device = 8
      Total train batch size (w. parallel, distributed & accumulation) = 8
      Gradient Accumulation steps = 1
      Total optimization steps = 1
    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    <ipython-input-113-3435b262f1ae> in <module>()
    ----> 1 trainer.train()
    
    4 frames
    /usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
         42     def fetch(self, possibly_batched_index):
         43         if self.auto_collation:
    ---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
         45         else:
         46             data = self.dataset[possibly_batched_index]
    
    KeyError: 2
    

    This is because after slicing, len() no longer returns the number of samples but the number of keys in the dict (3).

    Original data: Dataset({ features: ['attention_mask', 'input_ids', 'labels'], num_rows: 19240 }). After slicing it is a plain dict: Type: dict, String form: {'attention_mask': [[1, 1, ... (too long to paste in full). What is actually needed is Dataset({ features: ['attention_mask', 'input_ids', 'labels'], num_rows: 1000 }). My reference code:

    from datasets import Dataset
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=Dataset.from_dict(lm_datasets["train"][:1000]),
        # ... (remaining arguments as in the original snippet)
    
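    As an alternative, here is a minimal sketch (my own suggestion, not from the tutorial) that keeps the subset as a Dataset by selecting indices instead of slicing, assuming Dataset.select is available in your datasets version:

    # Select indices instead of slicing, so the object stays a Dataset
    # and the Trainer can index it sample by sample.
    small_train = lm_datasets["train"].select(range(1000))
    small_eval = lm_datasets["validation"].select(range(100))

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=small_train,
        eval_dataset=small_eval,
        data_collator=data_collator,
    )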

    Since this may be version-dependent, here are my versions: datasets 1.11.0, fsspec 2021.7.0, huggingface-hub 0.0.12, pyyaml 5.4.1, sacremoses 0.0.45, tokenizers 0.10.3, transformers 4.9.2, xxhash 2.0.2; PyTorch 1.9.0+cu102.

    opened by yuanyihan 10
  • Bugs in 4.6-生成任务-机器翻译 (generation task: machine translation)

    First of all, thank you very much for putting this together; this tutorial has been very helpful. I have a few points I would appreciate guidance on. 1. The environment setup for 4.6 seems slightly off: ! pip install datasets transformers sacrebleu sentencepiece should at least be changed to ! pip install datasets transformers "sacrebleu>=1.4.12,<2.0.0" sentencepiece, where 1.4.12 is the version noted in the code and 2.0.0 is the incompatible version found here (1.5.1 is compatible). Otherwise, in this environment (which installs sacrebleu 2.0.0 by default), the error is:

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-1-4d4d0ea64237> in <module>()
          2 
          3 #raw_datasets = load_dataset("wmt16", "ro-en")
    ----> 4 metric = load_metric("sacrebleu")
    
    10 frames
    /root/.cache/huggingface/modules/datasets_modules/metrics/sacrebleu/eae4b006c615cfaacce2df4eec95acba813e69f28db6fff9e0e1398c50a0ed47/sacrebleu.py in Sacrebleu()
        114         force=False,
        115         lowercase=False,
    --> 116         tokenize=scb.DEFAULT_TOKENIZER,
        117         use_effective_order=False,
        118     ):
    
    AttributeError: module 'sacrebleu' has no attribute 'DEFAULT_TOKENIZER'
    

    Looking at the information on PyPI: the package was updated to 2.0.0 on August 10, 2021, and that release appears to be incompatible with this tutorial, so you may want to note it in the tutorial.
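
    For reference, a minimal pinned setup along these lines (a sketch based on the versions reported above, not part of the tutorial):

    ! pip install datasets transformers "sacrebleu>=1.4.12,<2.0.0" sentencepiece

    import sacrebleu
    from datasets import load_metric

    print(sacrebleu.__version__)       # expect a 1.x release, e.g. 1.5.1
    metric = load_metric("sacrebleu")  # should load without the DEFAULT_TOKENIZER error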

    opened by yuanyihan 3
  • Bugs in 3.1-如何实现一个BERT.md

    First of all, thank you very much for putting this together; this tutorial has been very helpful. I have a few questions I would appreciate guidance on.

    1. The BERT structure appears to be partially wrong:

      • BertEmbeddings
      • BertEncoder
        • BertLayer
          • BertAttention
          • BertIntermediate
          • BertOutput
        • BertPooler  # this should be at the same level as BertEmbeddings, not at the same level as BertLayer
    2. The original text says: in total, dense (fully connected) layers appear (1+1+1)x12+1=37 times, and not every dense layer is paired with an activation function. Please double-check this, because self-attention alone uses 3 FCs to produce Q, K and V plus 1 output FC, and the FFN adds two more, so each layer has at least 6 FCs; the pooler adds one more, so in theory there are 6x12+1 = 73 (see the sketch after the breakdown below).

    Below is my own breakdown:

    1. BERT tokenization model (BertTokenizer)
    2. BERT model proper (BertModel)
      • BertEmbeddings (Embedding x3, LN, DO)
      • BertEncoder (BertLayer x12)
        • BertLayer
          • BertAttention
            • BertSelfAttention (FC x3 producing Q, K, V; one DO after QK)
            • BertSelfOutput (FC, DO, residual add, LN)
          • BertIntermediate (FC, GELU)
          • BertOutput (FC, DO, residual add, LN)
      • BertPooler (FC, tanh)
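
    A quick way to verify the FC count, as a minimal sketch of my own (assuming the pretrained bert-base-uncased checkpoint can be downloaded):

    import torch.nn as nn
    from transformers import BertModel

    model = BertModel.from_pretrained("bert-base-uncased")

    # Count every fully connected (nn.Linear) module inside BertModel:
    # 12 layers x (Q, K, V, attention output, intermediate, output) + 1 pooler = 73
    num_fc = sum(1 for m in model.modules() if isinstance(m, nn.Linear))
    print(num_fc)  # 73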
    opened by yuanyihan 3
  • 4.1 Text classification

    I just came across this tutorial, read through it, and learned a lot.

    1. The WNLI (Winograd Natural Language Inference) description, "Determine if a sentence with an anonymous pronoun and a sentence with this pronoun replaced are entailed or not", is left untranslated in both [[4.1-文本分类.ipynb]] and [[4.1-文本分类.md]]. A possible translation: given a pair of sentences, judge whether the pronoun coreference in the second sentence is resolved correctly.
    2. In the markdown you could add a link pointing directly to Colab: https://colab.research.google.com/github/datawhalechina/learn-nlp-with-transformers/blob/main/docs/篇章4-使用Transformers解决NLP任务/4.1-文本分类.ipynb

    Thanks again for this tutorial.

    opened by yuanyihan 2
  • Update 2.3-图解BERT.md

    Honestly, I think this is the most poorly written chapter in the tutorial and it may even need a rewrite. It opens on the Transformer with a dump of material, without explaining how attention leads to the Transformer; it then detours into reviewing several word-embedding models, and finishes with yet another round of explanation. After all that, the most important highlight, masking, is only mentioned in passing, and the overall structure feels fragmented. The comparison of word-embedding models could be moved to the end as a supplement: first describe the Transformer's input and how its word embeddings work, and note that this approach is a big improvement over earlier models (with details in an appendix at the end), so the main thread is not broken up. When presenting a model, follow the natural order: introduction, architecture, mechanism, pros and cons, and so on. That is the main line; any supplementary points worth raising should go at the end rather than dilute it.

    opened by zhxnlp 2
  • Update 2.2.1-Pytorch编写完整的Transformer.md

    Looking at the code, one place seems to be missing a 2. I mainly looked at the encoder and decoder structure; I did not look closely at the tokenizer or the attention code, so a few points are unclear and I would appreciate some guidance.

    1. In the code, memory is the output of the encoder's last layer. Is that the Z in the tutorial, with the same shape as the encoder input?
    2. In the decoder, the second sub-layer (encoder-decoder attention) takes its K and V from memory. Are K and V the same in every layer, or does each layer multiply memory by its own weight matrices?
    3. tgt is the previous layer's output, and for the first layer it is the target-sequence input, right? For example, the first predicted token, with subsequent predictions appended to this input sequence.
    4. Looking at the code, each encoder and decoder layer already applies a norm internally (for self-attention and the FFN). Is another norm still applied to the final output by default?
    5. Why do the encoder and decoder outputs need that final norm?

    That is all for now; thanks a lot, everyone.
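
    On question 2: in PyTorch's nn.Transformer, every decoder layer is fed the same memory tensor, but each layer applies its own cross-attention projection weights, because nn.TransformerDecoder deep-copies the layer it is given. A minimal sketch using plain torch.nn modules (my own illustration, not the tutorial's code):

    import torch
    import torch.nn as nn

    layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
    decoder = nn.TransformerDecoder(layer, num_layers=6, norm=nn.LayerNorm(512))

    memory = torch.rand(10, 32, 512)  # encoder output Z: (src_len, batch, d_model)
    tgt = torch.rand(20, 32, 512)     # shifted target sequence: (tgt_len, batch, d_model)
    out = decoder(tgt, memory)        # the same memory is passed to every decoder layer

    # Each layer owns separate cross-attention projection parameters,
    # so the projected K and V differ per layer even though memory is shared.
    print(len({id(l.multihead_attn.in_proj_weight) for l in decoder.layers}))  # 6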

    opened by zhxnlp 0
Owner: Datawhale (for the learner, growing together with learners)