A Practitioner's Guide to Natural Language Processing

Overview

Text Analytics with Python - 2nd Edition

Text analytics can be a bit overwhelming and frustrating at times with the unstructured and noisy nature of textual data and the vast amount of information available. "Text Analytics with Python" is a book packed with 674 pages of useful information based on techniques, algorithms, experiences and various lessons learnt over time in analyzing text data. This repository contains datasets and code used in this book. I will also be adding various notebooks and bonus content here from time to time. Keep watching this space!

About the book

Leverage Natural Language Processing (NLP) in Python and learn how to set up your own robust environment for performing text analytics. This second edition has gone through a major revamp and introduces several significant changes and new topics based on the recent trends in NLP.

You’ll see how to use the latest state-of-the-art frameworks in NLP, coupled with machine learning and deep learning models for supervised sentiment analysis powered by Python to solve actual case studies. Start by reviewing Python for NLP fundamentals on strings and text data and move on to engineering representation methods for text data, including both traditional statistical models and newer deep learning-based embedding models. Improved techniques and new methods around parsing and processing text are discussed as well.
Text summarization and topic models have been overhauled so the book showcases how to build, tune, and interpret topic models in the context of an interesting dataset on NIPS conference papers. Additionally, the book covers text similarity techniques with a real-world example of movie recommenders, along with sentiment analysis using supervised and unsupervised techniques. There is also a chapter dedicated to semantic analysis where you’ll see how to build your own named entity recognition (NER) system from scratch. While the overall structure of the book remains the same, the entire code base, modules, and chapters have been updated to the latest Python 3.x release.

Edition: 2nd
Pages: 674
Language: English
Book Title: Text Analytics with Python
Book Subtitle: A Practitioner's Guide to Natural Language Processing
Publisher: Apress (a part of Springer)
Print ISBN: 978-1-4842-4353-4
Online ISBN: 978-1-4842-4354-1
DOI: 10.1007/978-1-4842-4354-1
Copyright: Dipanjan Sarkar

With this book you will:

  • Understand NLP and text syntax, semantics and structure
  • Discover text cleaning and feature engineering strategies
  • Learn and implement text classification and text clustering
  • Understand and build text summarization and topic models
  • Learn about the promise of deep learning and transfer learning for NLP
  • Implement hands-on examples based on Python and several popular open source libraries in NLP and text analytics, such as the natural language toolkit (nltk), gensim, scikit-learn, spaCy, keras and tensorflow
Comments
  • Convert code base for Python 3.x

    Python 3 is the future. Even though a lot of legacy code and systems run on Python 2 (including our applications, which is why I wrote this book in Python 2 in the first place), we need to slowly start migrating and building our code, apps and systems based on Python 3.

    Looking for experts in Python 3.x as well as NLP and text analytics who could help out in migrating each chapter's codebase to Python 3.x, since I am occupied for a major part of this year on other projects. I do have some parts of it ready for Python 3.x and can offer help and support whenever needed.

    Contributors of successful codebase migrations will be mentioned in the acknowledgements and contributor list of this repository and project. You will also get a mention in future versions of the book, whenever one is in the pipeline.

    enhancement help wanted 
    opened by dipanjanS 25
  • Computing BM25 Similarity for 30 Queries and 85000 Documents

    Hello,

    The code from the book for BM25 does not work for large datasets:

    File "C:\Users\xxx\Anaconda2\lib\site-packages\scipy\sparse\base.py", line 1039, in _process_toarray_args
        return np.zeros(self.shape, dtype=self.dtype, order=order)

    MemoryError

    It would be great if someone could change the code so that it works in my case. I'm currently trying this myself, but with no success so far :(.

    Thanks
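
The traceback above shows the failure happening when a sparse matrix is converted to a dense array. One way to sidestep that conversion is to score each query against an inverted index, so that only documents containing a query term are ever touched. The following is a minimal sketch of standard Okapi BM25 in pure Python, not the book's implementation; the function names `build_index` and `bm25_scores` and the defaults `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def build_index(tokenized_docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    postings = defaultdict(dict)
    doc_lengths = []
    for doc_id, tokens in enumerate(tokenized_docs):
        doc_lengths.append(len(tokens))
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf
    return postings, doc_lengths


def bm25_scores(query_tokens, postings, doc_lengths, k1=1.5, b=0.75):
    """Return {doc_id: BM25 score} for documents sharing a term with the query."""
    n_docs = len(doc_lengths)
    if n_docs == 0:
        return {}
    avg_len = sum(doc_lengths) / n_docs
    scores = defaultdict(float)
    for term in set(query_tokens):
        docs_with_term = postings.get(term, {})
        if not docs_with_term:
            continue  # term never occurs in the corpus
        df = len(docs_with_term)
        # Smoothed IDF variant that stays positive for very common terms.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in docs_with_term.items():
            # Length normalization: longer documents are penalized.
            norm = k1 * (1 - b + b * doc_lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + norm)
    return dict(scores)


docs = [["cat", "sat"], ["dog", "barked"], ["cat", "cat", "dog"]]
postings, doc_lengths = build_index(docs)
cat_scores = bm25_scores(["cat"], postings, doc_lengths)
```

With this layout, memory grows with the number of (term, document) pairs actually present rather than with the number of documents times the vocabulary size, and the 30 queries can be scored one at a time.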

    opened by codingnoobneedshelp 7
  • ModuleNotFoundError: No module named 'normalization'

    Can anyone advise how to fix this urgent problem while using Python 3? I hit it while trying the code in Chapter 4:

    ----> from normalization import normalize_corpus
          import nltk
          from operator import itemgetter

    ModuleNotFoundError: No module named 'normalization'
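
A common cause of this error is that the module is a local helper file rather than an installed package, and its folder is not on Python's import path. The sketch below assumes that `normalization.py` is such a local file sitting in a folder you know; the helper name `import_local_module` is hypothetical and only illustrates the idea.

```python
import importlib
import sys
from pathlib import Path


def import_local_module(module_name, source_dir):
    """Import module_name from source_dir, adding the folder to sys.path if needed."""
    source_dir = str(Path(source_dir).resolve())
    if not (Path(source_dir) / f"{module_name}.py").is_file():
        raise FileNotFoundError(f"{module_name}.py not found in {source_dir}")
    if source_dir not in sys.path:
        # Put the folder first so the local file wins over same-named packages.
        sys.path.insert(0, source_dir)
    importlib.invalidate_caches()
    return importlib.import_module(module_name)
```

Usage would look like `normalization = import_local_module("normalization", "path/to/chapter/code")`, where the path is a placeholder for wherever the chapter's files live on your machine. Running the script or notebook from inside that folder has the same effect.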

    opened by samuelxmli 4
  • I get an error

    My nltk download is incomplete because one of the modules is out of date. So when I try your code, I get an error because one of the modules does not work.

    In [47]: from contractions import CONTRACTION_MAP

    ImportError                               Traceback (most recent call last)
    in ()
    ----> 1 from contractions import CONTRACTION_MAP

    ImportError: No module named contractions

    opened by fatihinstf 3
  • Bug in feature_extractors() (Chapter 4)

    Going through feature_extraction_demo.py, the line:

    avg_word_vec_features = averaged_word_vectorizer(corpus=TOKENIZED_CORPUS,
                                                     model=model,
                                                     num_features=10)
    

    raises an AttributeError:

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-17-cdd908e72f5c> in <module>()
          2 TOKENIZED_CORPUS
          3 avg_word_vec_features = averaged_word_vectorizer(corpus=TOKENIZED_CORPUS,
    ----> 4                                                  model=model, num_features=10)
    
    /Users/athair/researchdone/text_analytics_with_python/codes/feature_extractors.pyc in averaged_word_vectorizer(corpus, model, num_features)
         58 
         59 def averaged_word_vectorizer(corpus, model, num_features):
    ---> 60     vocabulary = set(model.index2word)
         61     features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
         62                     for tokenized_sentence in corpus]
    
    AttributeError: 'Word2Vec' object has no attribute 'index2word'
    

    This happens with the latest gensim (v3.0.1). I tried both pip install gensim and pulling directly from the gensim GitHub repository: https://github.com/RaRe-Technologies/gensim/

    The accepted answer here suggests a fix: https://stackoverflow.com/questions/43146077/index2word-in-gensims-doc2vec-raises-an-attribute-error
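
As a hedged sketch of that kind of fix: in gensim releases like the one in this report, the vocabulary list lives on the model's `wv` attribute as `index2word`, and gensim 4.x renamed it to `index_to_key`. The helper below, whose name `get_vocabulary` is hypothetical, looks in both places so the lookup in `averaged_word_vectorizer` does not depend on one specific version. It is exercised here with stand-in objects rather than a trained model.

```python
from types import SimpleNamespace


def get_vocabulary(model):
    """Return the model's vocabulary as a set, across gensim versions."""
    # Newer releases keep word vectors on model.wv; very old ones exposed
    # the vocabulary list directly on the model object.
    keyed_vectors = getattr(model, "wv", model)
    for attr in ("index_to_key", "index2word"):  # 4.x name, then older name
        words = getattr(keyed_vectors, attr, None)
        if words is not None:
            return set(words)
    raise AttributeError("could not find a vocabulary list on the model")


# Stand-in objects that mimic the attribute layouts described above.
legacy_model = SimpleNamespace(index2word=["sky", "blue"])
gensim3_model = SimpleNamespace(wv=SimpleNamespace(index2word=["sky", "blue"]))
gensim4_model = SimpleNamespace(wv=SimpleNamespace(index_to_key=["sky", "blue"]))
```

In the failing function, the line `vocabulary = set(model.index2word)` would then become `vocabulary = get_vocabulary(model)`.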

    bug enhancement 
    opened by athoag 2
  • Graphviz code for Fig 3-4 generation

    It would be great for readers to see how to generate the annotated dependency tree in Fig 3-4. This is also related to https://stackoverflow.com/a/44867616/2380455, which describes how to install the dependencies on OS X.

    opened by ambientlight 2
  • Do the dot-product on the sparse matrix

    When you have a large corpus, making the matrix dense first takes a lot of memory. Doing the dot product first and then expanding the result is more memory-efficient and still gives the same result.
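
The difference between the two orders of operations can be sketched on a toy example. This assumes the features are stored as SciPy sparse matrices; the variable names `doc_features` and `query_features` are illustrative and not taken from the book's code.

```python
import numpy as np
from scipy import sparse

# Toy document-term matrix (3 documents x 4 terms), stored sparsely.
doc_features = sparse.csr_matrix(np.array([
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]))
# Query vector over the same 4 terms, also sparse (1 x 4).
query_features = sparse.csr_matrix(np.array([[1.0, 0.0, 1.0, 0.0]]))

# Memory-hungry order: densify the full matrix first, then multiply.
dense_first = np.dot(doc_features.toarray(), query_features.toarray().T).ravel()

# Memory-friendly order: multiply while sparse, densify only the small result.
sparse_first = doc_features.dot(query_features.T).toarray().ravel()
```

Both orders produce one similarity value per document, but the second only ever materializes a dense array with as many entries as there are documents, instead of documents times vocabulary size.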

    enhancement 
    opened by martijnvanbeers 2
  • Error in: text-analytics-with-python/New-Second-Edition/Ch05 - Text Classification/Ch05b - Text Classification - I.ipynb

    Running this line:

    # normalize our corpus
    norm_corpus = tn.normalize_corpus(corpus=data_df['Article'], html_stripping=True, contraction_expansion=True,
                                      accented_char_removal=True, text_lower_case=True, text_lemmatization=True,
                                      text_stemming=False, special_char_removal=True, remove_digits=True,
                                      stopword_removal=True, stopwords=stopword_list)

    Returns error:

    AttributeError                            Traceback (most recent call last)
    ~\AppData\Local\Temp/ipykernel_17860/2616894830.py in
          6
          7 # normalize our corpus
    ----> 8 norm_corpus = tn.normalize_corpus(corpus=data_df['Article'], html_stripping=True, contraction_expansion=True,
          9                                   accented_char_removal=True, text_lower_case=True, text_lemmatization=True,
         10                                   text_stemming=False, special_char_removal=True, remove_digits=True,

    AttributeError: module 'text_normalizer' has no attribute 'normalize_corpus'

    I can't find a reference to normalize_corpus in the text_normalizer documentation. Thanks

    opened by christophjones 1
  • Jupyter Notebooks for 2nd Edition?

    Hello Dipanjan,

    I was wondering if you had the notebooks in question mentioned in the Safari/OReilly book available? The link led me here and I don't see them in the repo.

    Thanks!

    opened by pauldevos 1
  • Non functioning code in chapter 7: sentiwordnet example

    This is also on page 356.

    from nltk.corpus import sentiwordnet as swn

    good = swn.senti_synsets('good', 'n')[0]
    Traceback (most recent call last):
      File "", line 1, in
    TypeError: 'filter' object is not subscriptable
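
The TypeError indicates that the call returns a lazy `filter` object under Python 3, which cannot be indexed. A minimal sketch of two workarounds follows; it uses a hypothetical stand-in function, `senti_synsets_stub`, so that it runs without the nltk corpora installed.

```python
def senti_synsets_stub(word):
    """Hypothetical stand-in that returns a lazy filter object, as in the traceback."""
    candidates = ["good.n.01", "good.n.02", "bad.n.01"]
    return filter(lambda name: name.startswith(word + "."), candidates)


# Indexing the filter directly fails with:
#     TypeError: 'filter' object is not subscriptable
# first = senti_synsets_stub("good")[0]

# Option 1: materialize the results as a list, then index.
first_via_list = list(senti_synsets_stub("good"))[0]

# Option 2: take only the first result without building the whole list.
first_via_next = next(senti_synsets_stub("good"), None)

# Applied to the line from the book, option 1 would read:
#     good = list(swn.senti_synsets('good', 'n'))[0]
```

Option 2 returns `None` instead of raising when there are no matches, which can be convenient for words missing from the lexicon.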

    opened by ruddjm 1
  • from pattern.en import tag raise BadZipFile in Chapter 6

    When I run the code in Chapter 6, I get the following error:

    File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py", line 1267, in _RealGetContents
        raise BadZipFile("File is not a zip file")
    zipfile.BadZipFile: File is not a zip file

    I tried to use pattern3, but it doesn't work either, and Google turns up little about this. I can't solve it. It would be great if someone who has gone through this problem could tell me how to solve it. Thanks a lot!

    opened by LittleTemple 0
Owner
Dipanjan (DJ) Sarkar
Data Science Lead, Google Dev Expert - ML, Author, Social: www.linkedin.com/in/dipanzan
A design of MIDI language for music generation task, specifically for Natural Language Processing (NLP) models.

MIDI Language Introduction Reference Paper: Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions: code This

Robert Bogan Kang 3 May 25, 2022
Implementation of Natural Language Code Search in the project CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

CodeBERT-Implementation In this repo we have replicated the paper CodeBERT: A Pre-Trained Model for Programming and Natural Languages. We are interest

Tanuj Sur 4 Jul 1, 2022
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group 8.4k Dec 30, 2022
💫 Industrial-strength Natural Language Processing (NLP) in Python

spaCy: Industrial-strength NLP spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest researc

Explosion 24.9k Jan 2, 2023
🗣️ NALP is a library that covers Natural Adversarial Language Processing.

NALP: Natural Adversarial Language Processing Welcome to NALP. Have you ever wanted to create natural text from raw sources? If yes, NALP is for you!

Gustavo Rosa 21 Aug 12, 2022
Basic Utilities for PyTorch Natural Language Processing (NLP)

Basic Utilities for PyTorch Natural Language Processing (NLP) PyTorch-NLP, or torchnlp for short, is a library of basic utilities for PyTorch NLP. tor

Michael Petrochuk 2.1k Jan 1, 2023
Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing

Trankit: A Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing Trankit is a light-weight Transformer-based Pyth

null 652 Jan 6, 2023
PORORO: Platform Of neuRal mOdels for natuRal language prOcessing

PORORO: Platform Of neuRal mOdels for natuRal language prOcessing pororo performs Natural Language Processing and Speech-related tasks. It is easy to

Kakao Brain 1.2k Dec 21, 2022
🤗Transformers: State-of-the-art Natural Language Processing for Pytorch and TensorFlow 2.0.

State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0 🤗 Transformers provides thousands of pretrained models to perform tasks o

Hugging Face 77.3k Jan 3, 2023
A very simple framework for state-of-the-art Natural Language Processing (NLP)

A very simple framework for state-of-the-art NLP. Developed by Humboldt University of Berlin and friends. IMPORTANT: (30.08.2020) We moved our models

flair 12.3k Dec 31, 2022
State of the Art Natural Language Processing

Spark NLP: State of the Art Natural Language Processing Spark NLP is a Natural Language Processing library built on top of Apache Spark ML. It provide

John Snow Labs 3k Jan 5, 2023
A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks

A Deep Learning NLP/NLU library by Intel® AI Lab Overview | Models | Installation | Examples | Documentation | Tutorials | Contributing NLP Architect

Intel Labs 2.9k Jan 2, 2023
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/

Texar is a toolkit aiming to support a broad set of machine learning, especially natural language processing and text generation tasks. Texar provides

ASYML 2.3k Jan 7, 2023
DELTA is a deep learning based natural language and speech processing platform.

DELTA - A DEep learning Language Technology plAtform What is DELTA? DELTA is a deep learning based end-to-end natural language and speech processing p

DELTA 1.5k Dec 26, 2022