NLP-Overview

Introduction

The field of NPL encompasses a variety of topics which involve the computational processing and understanding of human languages. Since the 1980s, the field has increasingly relied on data-driven computation involving statistics, probability, and machine learning.
Recent increases in computational power and parallelization, harnessed by Graphical Processing Units (GPUs), now allow for “deep learning”, which utilizes artificial neural networks (ANNs), sometimes with billions of trainable parameters. Additionally, the contemporary availability of large datasets, facilitated by sophisticated data collection processes, enables the training of such deep architectures.

Natural Language Processing

The field of natural language processing, also known as computational linguistics, involves the engineering of computational models and processes to solve practical problems in understanding human languages. Although it is sometimes difficult to distinguish clearly to which areas issues belong.
Work in NLP can be divided into two broad sub-areas: core areas and applications.
Often one needs to handle one or more of the core issues successfully and apply those ideas and procedures to solve practical problems.
Currently, NLP is primarily a data-driven field using statistical and probabilistic computations along with machine learning.

2.1. Core area of NLP

The core issues are those that are inherently present in any computational linguistic system. To perform translation, text summarization, image captioning, or any other linguistic task, there must be some understanding of the underlying language. This understanding can be broken down into at least four main areas: language modeling, morphology, parsing, and semantics. The number of scholarly works in each area over the last decade is shown in Figure 3.

Fig. 3: Publication Volume for Core Areas of NLP. The number of publications, indexed by Google Scholar, relating to each topic over the last decade is shown. While all areas have experienced growth, language modeling has grown the most.

2.1.1 Language Modeling and Word Embeddings

Language modeling can be viewed in two ways. First, it determines which words follow which. By extension, however, this can be viewed as determining what words mean, as individual words are only weakly meaningful, deriving their full value only from their interactions with other words.
The core areas address fundamental problems such as language modeling, which underscores quantifying associations among naturally occurring words.
Arguably, the most important task in NLP is that of language modeling. Language modeling (LM) is an essential piece of almost any application of NLP. Language modeling is the process of creating a model to predict words or simple linguistic components given previous words or components. This is useful for applications in which a user types input, to provide predictive ability for fast text entry. However, its power and versatility emanate from the fact that it can implicitly capture syntactic and semantic relationships among words or components in a linear neighborhood, making it useful for tasks such as machine translation or text summarization. Using prediction, such programs are able to generate more relevant, human-sounding sentences

2.1.2. Morphology

Morphology is the study of how words themselves are formed. It considers the roots of words and the use of prefixes and suffixes, compounds, and other intraword devices, to display tense, gender, plurality, and a other linguistic constructs. (Understanding later)
Morphological processing, dealing with segmentation of meaningful components of words and identifying the true parts of speech of words as used.
Morphology is concerned with finding segments within single words, including roots and stems, prefixes, suffixes,and—in some languages—infixes. Affixes (prefixes, suffixes, or infixes) are used to overtly modify stems for gender, number, person, et cetera.

2.1.3. Parsing

Parsing considers which words modify others, forming constituents, leading to a sentential structure. The area of semantics is the study of what words mean. It takes into account the meanings of the individual words and how they relate to and modify others, as well as the context these words appear in and some degree of world knowledge, i.e., “common sense”. There is a significant amount of overlap between each of these areas.
Syntactic processing, or parsing, which builds sentence diagrams as possible precursors to semantic processing.
Parsing examines how different words and phrases relate to each other within a sentence. There are at least two distinct forms of parsing: constituency parsing and dependency parsing. In constituency parsing, phrasal constituents are extracted from a sentence in a hierarchical fashion. Dependency parsing looks at the relationships between pairs of individual words.
Remaining Challenges: Outside of universal parsing, a parsing challenge that needs to be further investigated is the building of syntactic structures without the use of treebanks for training. Attempts have been made using attention scores and Tree-LSTMs, as well as outside-inside auto-encoders. If such approaches are successful, they have potential use in many environments, including in the context of low-resource languages and out-of-domain scenarios. While a number of other challenges remain, these are the largest and are expected to receive the most focus.

D. Semantics

Semantic processing, which attempts to distill meaning of words, phrases, and higher level components in text.
Semantic processing involves understanding the meaning of words, phrases, sentences, or documents at some level. Word embeddings, such as Word2Vec and GloVe, claim to capture meanings of words, following the Distributional Hypothesis of Meaning. As a corollary, when vectors corresponding to phrases, sentences, or other components of text are processed using a neural network, a representation that can be loosely thought to be semantically representative is computed compositionally.
In this section, neural semantic processing research is separated into two distinct areas: Work on comparing the semantic similarity of two portions of text, and work on capturing and transferring meaning in high level constituents, particularly sentences.
Semantic Challenges: In addition to the challenges already mentioned, researchers believe that being able to solve tasks well does not indicate actual understanding. Integrating deep networks with general word-graphs (e.g. WordNet) or knowledge-graphs (e.g. DBPedia) may be able to endow a sense of understanding. Graph-embedding is an active area of research, and work on integrating language-based models and graph models has only recently begun to take off, giving hope for better machine understanding.

Summary of Core Issues

Deep learning has generally performed very well, surpassing existing states of the art in many individual core NLP tasks, and has thus created the foundation on which useful natural language applications can and are being built. However, it is clear from examining the research reviewed here that natural language is an enigmatically complex topic, with myriad core or basic tasks, of which deep learning has only grazed the surface. It is also not clear how architectures for ably executing individual core tasks can be synthesized to build a common edifice, possibly a much more complex distributed neural architecture, to show competence in multiple or “all” core tasks. More fundamentally, it is also not clear, how mastering of basic tasks, may lead to superior performance in applied tasks, which are the ultimate engineering goals, especially in the context of building effective and efficient deep learning models. Many, if not most, successful deep learning architectures for applied tasks, discussed in the next section, seem to forgo explicit architectural components for core tasks, and learn such tasks implicitly. Thus, some researchers argue that the relevance of the large amount of work on core issues is not fully justified, while others argue that further extensive research in such areas is necessary to better understand and develop systems which more perfectly perform these tasks, whether explicitly or implicitly.

Application of NLP Using Deep Learning

While the study of core areas of NLP is important to understanding how neural models work, it is meaningless in and of itself from an engineering perspective, which values applications that benefit humanity, not pure philosophical and scientific inquiry. Current approaches to solving several immediately useful NLP tasks are summarized here. Note that the issues included here are only those involving the processing of text, not the processing of verbal speech. Because speech processing [162], [163] requires expertise on several other topics including acoustic processing, it is generally considered another field of its own, sharing many commonalities with the field of NLP. The number of studies in each discussed area over the last decade is shown in Figure 4.

Fig. 4: Publication Volume for Applied Areas of NLP. All areas of applied natural language processing discussed have witnessed growth in recent years, with the largest growth occurring in the last two to three years.

The application areas involve topics
- Extraction of useful information (e.g. named entities and relations), translation of text between and among languages, summarization of written works, automatic answering of questions by inferring answers, and classification and clustering of documents.

4.1. Information Retrieval

The purpose of Information Retrieval (IR) systems is to help people find the right (most useful) information in the right (most convenient) format at the right time (when they need it) [164]. Among many issues in IR, a primary problem that needs addressing pertains to ranking documents with respect to a query string in terms of relevance scores for ad-hoc retrieval tasks, similar to what happens in a search engine.

4.2. Information Extraction

Information extraction extracts explicit or implicit information from text. The outputs of systems vary, but often the extracted data and the relationships within it are saved in relational databases [172]. Commonly extracted information includes named entities and relations, events and their participants, temporal information, and tuples of facts.

4.3. Text Generation

Many NLP tasks require the generation of human-like language. Summarization and machine translation convert one text to another in a sequence-to-sequence (seq2seq) fashion. Other tasks, such as image and video captioning and automatic weather and sports reporting, convert non-textual data to text. Some tasks, however, produce text without any input data to convert (or with only small amounts used as a topic or guide). These tasks include poetry generation, joke generation, and story generation.

4.4. Summarization

Summarization finds elements of interest in documents in order to produce an encapsulation of the most important content. There are two primary types of summarization: extractive and abstractive. The first focuses on sentence extraction, simplification, reordering, and concatenation to relay the important information in documents using text taken directly from the documents. Abstractive summaries rely on expressing documents’ contents through generation-style abstraction, possibly using words never seen in the documents [48].

4.5. Question Answering

Similar to summarization and information extraction, question answering (QA) gathers relevant words, phrases, or sentences from a document. QA returns this information in a coherent fashion in response to a request. Current methods resemble those of summarization.

4.6. Machine Translation

Machine translation (MT) is the quintessential application of NLP. It involves the use of mathematical and algorithmic techniques to translate documents in one language to another. Performing effective translation is intrinsically onerous even for humans, requiring proficiency in areas such as morphology, syntax, and semantics, as well as an adept understanding and discernment of cultural sensitivities, for both of the languages (and associated societies) under consideration [48].

Summary of Deep Learning NLP Applications

Numerous other applications of natural language processing exist including grammar correction, as seen in word processors, and author mimicking, which, given sufficient data, generates text replicating the style of a particular writer. Many of these applications are infrequently used, understudied, or not yet exposed to deep learning. However, the area of sentiment analysis should be noted, as it is becoming increasingly popular and utilizing deep learning. In large part a semantic task, it is the extraction of a writer’s sentiment—their positive, negative, or neutral inclination towards some subject or idea [268]. Applications are varied, including product research, futures prediction, social media analysis, and classification of spam [269], [270]. The current state of the art uses an ensemble including both LSTMs and CNNs [271].
This section has provided a number of select examples of the applied usages of deep learning in natural language processing. Countless studies have been conducted in these and similar areas, chronicling the ways in which deep learning has facilitated the successful use of natural language in a wide variety of applications. Only a minuscule fraction of such work has been referred to in this survey.
While more specific recommendations for practitioners have been discussed in some individual subsections, the current trend in state-of-the-art models in all application areas is to use pre-trained stacks of Transformer units in some configuration, whether in encoder-decoder configurations or just as encoders. Thus, self-attention which is the mainstay of Transformer has become the norm, along with cross-attention between encoder and decoder units, if decoders are present. In fact, in many recent papers, if not most, Transformers have begun to replace LSTM units that were preponderant just a few months ago. Pre-training of these large Transformer models has also become the accepted way to endow a model with generalized knowledge of language. Models such as BERT, which have been trained on corpora of billions of words, are available for download, thus providing a practitioner with a model that possesses a great amount of general knowledge of language already. A practitioner can further train it with one’s own general corpora, if desired, but such training is not always necessary, considering the enormous sizes of the pre-training that downloaded models have received. To train a model to perform a certain task well, the last step a practitioner must go through is to use available downloadable task-specific corpora, or build one’s own task-specific corpus. This last training step is usually supervised. It is also recommended that if several tasks are to be performed, multi-task training be used wherever possible.

Reference:

[1] https://arxiv.org/pdf/1807.10854.pdf

[2] https://arxiv.org/pdf/1708.02709.pdf

[3] https://arxiv.org/pdf/2108.05542.pdf

[4] Recent Advances in Natural Language Processing via Large Pre-TrainedLanguage Models: A Survey https://arxiv.org/pdf/2111.01243.pdf

[5] AMMUS : A Survey of Transformer-based Pretrained Models in Natural Language Processing https://arxiv.org/pdf/2108.05542.pdf Appendix:

Encompasses (v) to include different types of things

Underscores (v) to emphasize something, or to show that it is important

Data-driven science is an interdisciplinary field of scientific methods to extract knowledge from data

Data-driven learning a learning approach driven by research-like access to data

Inherently (v) is a basic or essential feature that gives something its character

Arguably (adv) used for stating your opinion or belief, especially when you think other people may disagree (Được cho là)

Versatility (adj) able to change easily or to be used for different purposes (Tính linh hoạt)

Emanate (v) to come from a particular place (bắt nguồn).

Implicitly (adj) not stated directly, but expressed in the way that someone behaves, or understood from what they are saying (ngầm hiểu).

Corollary (n) something that will also be true if a particular idea or statement is true, or something that will also exist if a particular situation exists (kết quả tất yếu)1

A very simple framework for state-of-the-art Natural Language Processing (NLP)

A very simple framework for state-of-the-art NLP. Developed by Humboldt University of Berlin and friends. IMPORTANT: (30.08.2020) We moved our models

12.3k Dec 31, 2022

Data loaders and abstractions for text and NLP

torchtext This repository consists of: torchtext.data: Generic data loaders, abstractions, and iterators for text (including vocabulary and word vecto

3.2k Dec 30, 2022

An open-source NLP research library, built on PyTorch.

An Apache 2.0 NLP research library, built on PyTorch, for developing state-of-the-art deep learning models on a wide variety of linguistic tasks. Quic

11.4k Jan 1, 2023

NLP made easy

GluonNLP: Your Choice of Deep Learning for NLP GluonNLP is a toolkit that helps you solve NLP problems. It provides easy-to-use tools that helps you l

Distributed (Deep) Machine Learning Community

2.5k Jan 4, 2023

NLP, before and after spaCy

textacy: NLP, before and after spaCy textacy is a Python library for performing a variety of natural language processing (NLP) tasks, built on the hig

1.6k Feb 10, 2021

:mag: Transformers at scale for question answering & neural search. Using NLP via a modular Retriever-Reader-Pipeline. Supporting DPR, Elasticsearch, HuggingFace's Modelhub...

Haystack is an end-to-end framework for Question Answering & Neural search that enables you to ... ... ask questions in natural language and find gran