A Practitioner's Guide to Natural Language Processing

Dipanjan (DJ) Sarkar

Last update: Jan 3, 2023

Related tags

Text Data & NLP python semantic natural-language-processing sentiment-analysis text-classification clustering pattern natural-language scikit-learn sentiment spacy nltk text-summarization gensim stanford-nlp text-analytics

Overview

Text Analytics with Python - 2nd Edition

A Practitioner's Guide to Natural Language Processing

Text analytics can be a bit overwhelming and frustrating at times with the unstructured and noisy nature of textual data and the vast amount of information available. "Text Analytics with Python" is a book packed with 674 pages of useful information based on techniques, algorithms, experiences and various lessons learnt over time in analyzing text data. This repository contains datasets and code used in this book. I will also be adding various notebooks and bonus content here from time to time. Keep watching this space!

Get the book

About the book

Leverage Natural Language Processing (NLP) in Python and learn how to set up your own robust environment for performing text analytics. This second edition has gone through a major revamp and introduces several significant changes and new topics based on the recent trends in NLP.

You’ll see how to use the latest state-of-the-art frameworks in NLP, coupled with machine learning and deep learning models for supervised sentiment analysis powered by Python to solve actual case studies. Start by reviewing Python for NLP fundamentals on strings and text data and move on to engineering representation methods for text data, including both traditional statistical models and newer deep learning-based embedding models. Improved techniques and new methods around parsing and processing text are discussed as well.
Text summarization and topic models have been overhauled so the book showcases how to build, tune, and interpret topic models in the context of an interesting dataset on NIPS conference papers. Additionally, the book covers text similarity techniques with a real-world example of movie recommenders, along with sentiment analysis using supervised and unsupervised techniques. There is also a chapter dedicated to semantic analysis where you’ll see how to build your own named entity recognition (NER) system from scratch. While the overall structure of the book remains the same, the entire code base, modules, and chapters have been updated to the latest Python 3.x release.

Edition: 2nd
Pages: 674
Language: English
Book Title: Text Analytics with Python
Book Subtitle: A Practitioner's Guide to Natural Language Processing
Publisher: Apress (a part of Springer)
Print ISBN: 978-1-4842-4353-4
Online ISBN: 978-1-4842-4354-1
DOI: 10.1007/978-1-4842-4354-1
Copyright: Dipanjan Sarkar

With this book you will:

Understand NLP and text syntax, semantics and structure
Discover text cleaning and feature engineering strategies
Learn and implement text classification and text clustering
Understand and build text summarization and topic models
Learn about the promise of deep learning and transfer learning for NLP
Implement hands-on examples based on Python and several popular open source libraries in NLP and text analytics, such as the natural language toolkit (nltk), gensim, scikit-learn, spaCy, keras and tensorflow

Comments

Convert code base for Python 3.x

Python 3 is the future, even though a lot of legacy code and systems still run on Python 2 (including our applications, which is why I wrote this book in Python 2 in the first place). We need to slowly start migrating and building our code, apps and systems based on Python 3.

I am looking for experts in Python 3.x as well as NLP and text analytics who could help migrate each chapter's codebase to Python 3.x, since I am occupied for a major part of this year on other projects. I do have some parts of it ready for Python 3.x and can offer help and support whenever needed.

If your codebase migration is successful, you will be mentioned as a contributor in the acknowledgements and contributor list of this repository and project. You will also get a mention in future versions of the book whenever that is in the pipeline.
enhancement help wanted

opened by dipanjanS 25
Computing BM25 Similarity for 30 Queries and 85000 Documents

Hello,

The code from the book for BM25 does not work for large datasets.

File "C:\Users\xxx\Anaconda2\lib\site-packages\scipy\sparse\base.py", line 1039, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)

MemoryError

It would be great if someone could change the code so that it works in my case. I am currently trying this myself, but with no success so far :(.

Thanks

opened by codingnoobneedshelp 7
ModuleNotFoundError: No module named 'normalization'

Can anyone advise how to fix this urgent problem while using Python 3? It happens when I try the code in Chapter 4:

----> from normalization import normalize_corpus
      import nltk
      from operator import itemgetter

ModuleNotFoundError: No module named 'normalization'
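One possible cause, offered as an assumption rather than a confirmed diagnosis: normalization is not an installable package but a helper file (normalization.py) kept next to the chapter's scripts, so Python only finds it when that folder is on the import path. A minimal sketch of that workaround follows. The helper name add_code_dir and the example path in the comment are hypothetical.

```python
import sys
from pathlib import Path


def add_code_dir(code_dir):
    """Put a folder of local helper modules at the front of the import path.

    `code_dir` is assumed to be the folder that contains normalization.py.
    """
    resolved = str(Path(code_dir).resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


# Hypothetical usage from a notebook or script living elsewhere:
# add_code_dir("path/to/chapter4/code")
# from normalization import normalize_corpus
```

Starting Python (or Jupyter) from inside the chapter's code folder should have the same effect, because the working directory is searched for modules first.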

opened by samuelxmli 4
I get an error

My nltk download is not complete because one of the modules is out of date. So when I try your code I get an error: one of the modules does not work.

In [47]: from contractions import CONTRACTION_MAP

ImportError Traceback (most recent call last) in ()
----> 1 from contractions import CONTRACTION_MAP

ImportError: No module named contractions

opened by fatihinstf 3

Bug in feature_extractors() (Chapter 4)

Going through feature_extraction_demo.py, the line:

avg_word_vec_features = averaged_word_vectorizer(corpus=TOKENIZED_CORPUS,
                                                 model=model,
                                                 num_features=10)

raises an AttributeError:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-17-cdd908e72f5c> in <module>()
      2 TOKENIZED_CORPUS
      3 avg_word_vec_features = averaged_word_vectorizer(corpus=TOKENIZED_CORPUS,
----> 4                                                  model=model, num_features=10)

/Users/athair/researchdone/text_analytics_with_python/codes/feature_extractors.pyc in averaged_word_vectorizer(corpus, model, num_features)
     58 
     59 def averaged_word_vectorizer(corpus, model, num_features):
---> 60     vocabulary = set(model.index2word)
     61     features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
     62                     for tokenized_sentence in corpus]

AttributeError: 'Word2Vec' object has no attribute 'index2word'

This happens with the latest gensim (v3.0.1). I tried both pip install gensim and pulling directly from the gensim GitHub repository: https://github.com/RaRe-Technologies/gensim/

The accepted answer here suggests a fix: https://stackoverflow.com/questions/43146077/index2word-in-gensims-doc2vec-raises-an-attribute-error
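A sketch of the kind of fix that answer points to: from gensim 1.0 onward the word list lives on the model's wv attribute (model.wv.index2word), and gensim 4 renamed it to index_to_key. The helper below, get_vocabulary, is a hypothetical name and not part of the book's code. It relies on attribute lookup only, so the stand-in objects let it run without gensim installed.

```python
from types import SimpleNamespace


def get_vocabulary(model):
    """Return the model's vocabulary as a set, across gensim API generations.

    Assumption: `model` is a Word2Vec-like object. Newer versions keep the
    word list on `model.wv`; very old ones keep it on the model itself.
    """
    keyed_vectors = getattr(model, "wv", model)
    for attribute in ("index_to_key", "index2word"):
        words = getattr(keyed_vectors, attribute, None)
        if words is not None:
            return set(words)
    raise AttributeError("model exposes neither index_to_key nor index2word")


# Stand-ins that mimic the three attribute layouts (not real gensim models).
legacy_model = SimpleNamespace(index2word=["sky", "blue"])
gensim_1_to_3 = SimpleNamespace(wv=SimpleNamespace(index2word=["sky", "blue"]))
gensim_4 = SimpleNamespace(wv=SimpleNamespace(index_to_key=["sky", "blue"]))

for candidate in (legacy_model, gensim_1_to_3, gensim_4):
    print(sorted(get_vocabulary(candidate)))
```

Inside averaged_word_vectorizer, the line vocabulary = set(model.index2word) could then become vocabulary = get_vocabulary(model). Word lookups such as model[word] would likely need the same treatment (model.wv[word]) on newer gensim versions.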

bug enhancement

opened by athoag 2

Graphviz code for Fig 3-4 generation

It would be great for readers to see how to generate the annotated dependency tree in Fig 3-4. This is also related to https://stackoverflow.com/a/44867616/2380455, which describes how to install the dependencies on OS X.

opened by ambientlight 2
Do the dot-product on the sparse matrix

When you have a large corpus, making the matrix dense first takes a lot of memory. Doing the dot product on the sparse matrix first and then expanding the result is more memory-efficient and still gives the same result.
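The idea can be sketched as follows, assuming the document features and the query features are both SciPy sparse matrices over the same vocabulary. The function name sparse_dot_scores and the toy numbers are illustrative and not taken from the book's code.

```python
import numpy as np
from scipy import sparse


def sparse_dot_scores(doc_features, query_features):
    """Score every document against one query without densifying the corpus.

    doc_features:   sparse matrix of shape (n_docs, n_terms)
    query_features: sparse matrix of shape (1, n_terms)
    """
    # The product stays sparse; only the small (n_docs, 1) result is expanded.
    scores = doc_features.dot(query_features.T)
    return scores.toarray().ravel()


# Toy data: 3 documents, 3 terms.
docs = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0],
                                   [0.0, 3.0, 0.0],
                                   [4.0, 0.0, 0.0]]))
query = sparse.csr_matrix(np.array([[1.0, 1.0, 0.0]]))

sparse_scores = sparse_dot_scores(docs, query)

# Dense-first version for comparison; on a large corpus the
# docs.toarray() call is the step that can run out of memory.
dense_scores = docs.toarray().dot(query.toarray().T).ravel()

print(sparse_scores, np.allclose(sparse_scores, dense_scores))
```

With 85000 documents, the dense-first version allocates the full documents-by-terms array, while the sparse-first version only allocates one score per document.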
enhancement

opened by martijnvanbeers 2
Error in: text-analytics-with-python/New-Second-Edition/Ch05 - Text Classification/Ch05b - Text Classification - I.ipynb

Running this line:

# normalize our corpus

norm_corpus = tn.normalize_corpus(corpus=data_df['Article'], html_stripping=True, contraction_expansion=True,
                                  accented_char_removal=True, text_lower_case=True, text_lemmatization=True,
                                  text_stemming=False, special_char_removal=True, remove_digits=True,
                                  stopword_removal=True, stopwords=stopword_list)

Returns error:

AttributeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_17860/2616894830.py in
      6
      7 # normalize our corpus
----> 8 norm_corpus = tn.normalize_corpus(corpus=data_df['Article'], html_stripping=True, contraction_expansion=True,
      9                                   accented_char_removal=True, text_lower_case=True, text_lemmatization=True,
     10                                   text_stemming=False, special_char_removal=True, remove_digits=True,

AttributeError: module 'text_normalizer' has no attribute 'normalize_corpus'

I can't find a reference to normalize_corpus in the text_normalizer documentation. Thanks
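One thing worth checking, stated here as a guess rather than a confirmed cause, is whether Python imported a different module named text_normalizer (for example an installed package) instead of the helper file the notebook expects. The sketch below reports where a module was loaded from and whether it exposes a given attribute. The function name describe_module is hypothetical, and the demonstration uses the standard library so it runs anywhere.

```python
import importlib


def describe_module(module_name, attribute):
    """Return (file the module was loaded from, whether it has `attribute`)."""
    module = importlib.import_module(module_name)
    return getattr(module, "__file__", None), hasattr(module, attribute)


# Demonstration on a standard-library module.
json_path, json_has_dumps = describe_module("json", "dumps")
print(json_path, json_has_dumps)

# In the notebook, the equivalent check would be:
# describe_module("text_normalizer", "normalize_corpus")
```

If the reported path points into site-packages rather than the notebook's folder, the import is picking up something other than the repository's own file.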

opened by christophjones 1
Jupyter Notebooks for 2nd Edition?

Hello Dipanjan,

I was wondering if you had the notebooks in question mentioned in the Safari/OReilly book available? The link led me here and I don't see them in the repo.

Thanks!

opened by pauldevos 1
Non functioning code in chapter 7: sentiwordnet example

This is also on page 356.

from nltk.corpus import sentiwordnet as swn

good = swn.senti_synsets('good', 'n')[0]
Traceback (most recent call last):
  File "", line 1, in
TypeError: 'filter' object is not subscriptable
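The traceback shows that senti_synsets returned a lazy filter object, which cannot be indexed on Python 3. A minimal sketch of the usual workaround is to materialize it with list() first. The stand-in filter below replaces the real NLTK call so the snippet runs without the sentiwordnet corpus; the synset names in it are made up for illustration.

```python
# Stand-in for swn.senti_synsets('good', 'n'): a lazy filter object.
synsets = filter(lambda name: name.startswith("good"),
                 ["good.n.01", "good.n.02", "bad.n.01"])

try:
    synsets[0]  # fails the same way as in the report above
except TypeError as error:
    error_message = str(error)

# Turning the filter into a list makes indexing possible.
first_synset = list(synsets)[0]
print(error_message)
print(first_synset)

# With NLTK and the sentiwordnet corpus available, the same pattern would be:
# good = list(swn.senti_synsets('good', 'n'))[0]
```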

opened by ruddjm 1
from pattern.en import tag raises BadZipFile in Chapter 6

When I run the code in Chapter 6, I get the following error:

File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py", line 1267, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

I tried to use pattern3 but it doesn't work. Google turns up very little about this, and I can't solve it. It would be great if someone who has run into this problem could tell me how to solve it. Thanks a lot!

opened by LittleTemple 0

Owner

Dipanjan (DJ) Sarkar

Data Science Lead, Google Dev Expert - ML, Author, Social: www.linkedin.com/in/dipanzan

