NLTK Source

Overview

Natural Language Toolkit (NLTK)


NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets, and tutorials supporting research and development in Natural Language Processing. NLTK requires Python version 3.5, 3.6, 3.7, or 3.8.

For documentation, please visit nltk.org.

Contributing

Do you want to contribute to NLTK development? Great! Please read CONTRIBUTING.md for more details.

See also how to contribute to NLTK.

Donate

Have you found the toolkit helpful? Please support NLTK development by donating to the project via PayPal, using the link on the NLTK homepage.

Citing

If you publish work that uses NLTK, please cite the NLTK book, as follows:

Bird, Steven, Edward Loper and Ewan Klein (2009).
Natural Language Processing with Python.  O'Reilly Media Inc.

Copyright

Copyright (C) 2001-2020 NLTK Project

For license information, see LICENSE.txt.

AUTHORS.md contains a list of everyone who has contributed to NLTK.

Redistributing

  • NLTK source code is distributed under the Apache 2.0 License.
  • NLTK documentation is distributed under the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States license.
  • NLTK corpora are provided under the terms given in the README file for each corpus; all are redistributable and available for non-commercial use.
  • NLTK may be freely redistributed, subject to the provisions of these licenses.

Issues

  • Use new and shiny Stanford CoreNLP web API

    This is work in progress. The general idea is to separate the code that starts a server from the code that interacts with it.

    I see future usage as follows.

    With an external server, the connection wraps a requests session, which in turn keeps a pool of connections, so there is no need to open a new connection for every API call.

    parser = StanfordParser(url='http://localhost:9000')
    parser.parse('John loves Mary.')
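
    For illustration, a minimal sketch of what talking to the CoreNLP HTTP endpoint directly through a requests session could look like (the endpoint, annotator list, and response shape are assumptions about the CoreNLP server, not part of this proposal):

    import json
    import requests

    session = requests.Session()  # keeps a pool of connections across calls
    props = {'annotators': 'tokenize,ssplit,pos,parse', 'outputFormat': 'json'}
    response = session.post(
        'http://localhost:9000/',
        params={'properties': json.dumps(props)},
        data='John loves Mary.'.encode('utf-8'),
    )
    print(response.json()['sentences'][0]['parse'])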
    

    With a server started by NLTK. In this case the server has to be started and the jars located. Maybe it's a good idea to use the circus arbiter for that (https://circus.readthedocs.org/en/latest/for-devs/), or just Popen. Extra care should be taken with the port number here, as well as handling the case where the server dies.

    with CoreNLPServer() as server:
        parser = StanfordParser(url=server.url)
        parser.parse('John loves Mary.')
    

    Comments and ideas are welcome.

    opened by dimazest 90
  • Failed to download NLTK data: HTTP ERROR 405 / 403

    >>> nltk.download("all")
    [nltk_data] Error loading all: HTTP Error 405: Not allowed.
    
    >>> nltk.version_info
    sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)
    
    

    Also, I tried to visit https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/cmudict.zip. Got the same HTTP 405 ERROR.

    Found the same problem on Stack Overflow: https://stackoverflow.com/questions/45318066/getting-405-while-trying-to-download-nltk-dta

    Any comments would be appreciated.

    admin corpus bug inactive 
    opened by matthew-z 47
  • Added the `translate` module for MT

    • Removed the unused import in align/gdfa.py
    • Added the translate module with model.py to load a pre-calculated phrase table and language model
    • Added stack_decoder.py to perform stack decoding for Statistical Machine Translation (SMT)
    opened by alvations 44
  • Stabilized MaltParser API

    From #943,

    MaltParser previously required all sorts of awkward os.environ settings to locate the binary, and then called the jar file using the Java classpath from the environment.

    • The new API only requires the directory where the user saved his/her installed version of MaltParser; it finds the jar files using os.walk and calls org.maltparser.Malt with the full classpath instead of -jar.
    • Also, generate_malt_command makes it easier to update the API to track MaltParser changes.

    I've tested with MaltParser 1.7.2 and MaltParser 1.8.
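
    A rough usage sketch under the refactored API (the installation directory and model name are placeholders, and the exact keyword arguments may differ between releases):

    from nltk import pos_tag
    from nltk.parse.malt import MaltParser

    # Point the parser at the MaltParser installation directory and a trained model.
    mp = MaltParser(parser_dirname='/path/to/maltparser-1.8',
                    model_filename='engmalt.linear-1.7.mco',
                    tagger=pos_tag)
    graph = mp.parse_one('John loves Mary .'.split())
    print(graph.tree())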

    opened by alvations 41
  • Malt parser not parsing sentences

    Whenever I parse a sentence using MaltParser, mp.parse_one(token) raises the following exception:

    Exception: MaltParser parsing (java -cp D:/Python Files/maltparser-1.8.1\maltparser-1.8.1.jar:D:/Python Files/maltparser-1.8.1\lib\liblinear-1.8.jar:D:/Python Files/maltparser-1.8.1\lib\libsvm.jar:D:/Python Files/maltparser-1.8.1\lib\log4j.jar org.maltparser.Malt -c engmalt.poly-1.7.mco -i C:\Users\MUSTUF~1\AppData\Local\Temp\malt_input.conll.9ck59rmy -o C:\Users\MUSTUF~1\AppData\Local\Temp\malt_output.conll.1j7w_xvw -m parse) failed with exit code 1

    opened by Mustufain 38
  • Reimplement FreqDist

    Reimplement FreqDist to use collections.Counter, in NLTK 3.0.

    http://docs.python.org/3.1/library/collections.html?highlight=counter#collections.Counter

    Migrated from http://code.google.com/p/nltk/issues/detail?id=456
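
    For reference, a minimal sketch of what a Counter-backed FreqDist could look like (deliberately simplified; not NLTK's actual implementation):

    from collections import Counter

    class FreqDist(Counter):
        """Simplified sketch of a frequency distribution backed by collections.Counter."""

        def N(self):
            # total number of sample outcomes that have been counted
            return sum(self.values())

        def freq(self, sample):
            # relative frequency of a sample; 0 if nothing has been counted
            n = self.N()
            return self[sample] / n if n else 0.0

    fd = FreqDist('abracadabra')
    print(fd.most_common(1))     # [('a', 5)]
    print(fd.N(), fd.freq('a'))  # 11 0.4545...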


    earlier comments

    dchichkov said, at 2010-07-01T22:56:11.000Z:

    A few comments:

    1. Currently FreqDist is derived from dict(). While this may be convenient, it creates extra ambiguities in the interface.

    2. Recalculating _N upon every inc() call is wasteful, especially in the way it is done now, via a __setitem__ call.

    3. For large data sets it is valuable to have a trimming method that deletes samples with frequencies/counts lower than a specified limit;

    4. For large data sets it is valuable to have a trimming method that can be called during the data collection process and trims the FreqDist to a specified size (e.g. 100,000,000 samples). This trimming method should delete 'oldest' samples (FIFO basis) with the lowest frequencies. That can be implemented if an OrderedDict is used to store the samples (as an option during the construction).

    5. Consider storing samples that occur only once (hapax legomena) in a separate bin (set). Potentially that could reduce memory usage and give an overall performance boost. Something along these lines (the code is untested):

    class FreqDist:
        def __init__(self, samples=None):
            self._d = dict()
            self._s = set()
            self._N = 0
            self._Nr_cache = None
            self._max_cache = None
            self._item_cache = None
            if samples:
                self.update(samples)

        def inc(self, sample, count=1):
            """
            Increment this C{FreqDist}'s count for the given
            sample.

            @param sample: The sample whose count should be incremented.
            @type sample: any
            @param count: The amount to increment the sample's count by.
            @type count: C{int}
            @rtype: None
            @raise NotImplementedError: If C{sample} is not a
                   supported sample type.
            """
            if count == 0: return
            self._N += count
            if sample in self._d: self._d[sample] += count; return
            if sample in self._s: self._d[sample] = count + 1; self._s.remove(sample); return
            if count == 1: self._s.add(sample); return
            self._d[sample] = count

        def __getitem__(self, sample):
            if sample in self._d: return self._d[sample]
            if sample in self._s: return 1
            return 0

        def __setitem__(self, sample, value):
            """
            Set this C{FreqDist}'s count for the given sample.

            @param sample: The sample whose count should be set.
            @type sample: any hashable object
            @param value: The new value for the sample's count
            @type value: C{int}
            @rtype: None
            @raise TypeError: If C{sample} is not a supported sample type.
            """
            self._N += (value - self.__getitem__(sample))
            if sample in self._d and value <= 1: del self._d[sample]
            if sample in self._s and value != 1: self._s.remove(sample)
            if value == 1: self._s.add(sample); return
            if value > 1: self._d[sample] = value
    
    opened by alexrudnick 38
  • Improving Lancaster stemmer with strip prefix function and customizable rules from Whoosh

    Issue: #1648

    • Clean up documentation
    • Added doctests (copied from lancaster.py)
    • Added AUTHORS.md
    • Fixed the code so that if no rule table is provided at init, the stemmer uses the default rules.

    Code is from Whoosh

    If I could get more detail on NLTK testing, that would be great, because I'm a bit confused about doctest and how to run the tests.
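
    For illustration, a usage sketch of the customizable stemmer described above (the keyword names are taken from the proposed API and may differ in released versions):

    from nltk.stem import LancasterStemmer

    # Default rule set
    stemmer = LancasterStemmer()
    print(stemmer.stem('maximum'))  # -> 'maxim'

    # With prefix stripping enabled, recognised prefixes are removed before the rules apply
    prefix_stemmer = LancasterStemmer(strip_prefix_flag=True)
    print(prefix_stemmer.stem('kilometer'))  # e.g. 'met' once the 'kilo' prefix is stripped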

    opened by jaehoonhwang 33
  • panlex_lite installation via nltk.download() appears to fail

    Platform: Python 3.5 on Mac OS X 10.11.2

    Steps to reproduce:

    1. $ python3
    2. import nltk; nltk.download('all', halt_on_error=False)

    Symptoms:

    Partial console write:

    [nltk_data] | Downloading package panlex_lite to
    [nltk_data] |     /Users/beng/nltk_data...
    [nltk_data] |   Unzipping corpora/panlex_lite.zip.
    Traceback (most recent call last):
      File "", line 1, in
      File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 664, in download
        for msg in self.incr_download(info_or_id, download_dir, force):
      File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 543, in incr_download
        for msg in self.incr_download(info.children, download_dir, force):
      File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 529, in incr_download
        for msg in self._download_list(info_or_id, download_dir, force):
      File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 572, in _download_list
        for msg in self.incr_download(item, download_dir, force):
      File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 549, in incr_download
        for msg in self._download_package(info, download_dir, force):
      File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 638, in _download_package
        for msg in _unzip_iter(filepath, zipdir, verbose=False):
      File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 2039, in _unzip_iter
        outfile.write(contents)
    OSError: [Errno 22] Invalid argument

    opened by grayben 32
  • Refactored text.concordance to return list

    Refactored text.concordance so that it returns a list (while still retaining the print functionality). Written so that the list is optional and the old behavior is default, but I think it might make more sense to do this the other way around (list of concordances by default, print on request).

    The unit tests and doctest are based on the behavior described in http://www.nltk.org/book/ch01.html. The doctest relies on ellipses because of invisible-character issues.

    Would love more test cases.
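
    For illustration, a sketch of how the two behaviours could look side by side (the list-returning variant is assumed here under the name concordance_list; requires the gutenberg corpus to be downloaded):

    from nltk.corpus import gutenberg
    from nltk.text import Text

    moby = Text(gutenberg.words('melville-moby_dick.txt'))
    moby.concordance('monstrous', width=60, lines=3)      # prints, the old behaviour
    hits = moby.concordance_list('monstrous', width=60)   # returns a list of matches
    print(len(hits), hits[0].line)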

    corpus enhancement 
    opened by story645 31
  • git branching model

    There is a proposal to adopt a more systematic approach to branches, e.g.

    http://nvie.com/posts/a-successful-git-branching-model/
    https://docs.google.com/file/d/0B5_-_q4j54cgYjVhYWZjN2UtOTQ4My00N2JjLTk2YWEtODljZWQyYTBkNDEx/edit?hl=en_US#

    Please post feedback here.

    admin 
    opened by stevenbird 31
  • Sentiment Analysis support

    This Pull Request aims to add new corpora and lexicons specifically built for Sentiment Analysis purposes. Their integration has been discussed with the authors and annotators, and we agreed on licensing terms that respect the NLTK Apache License.

    New corpora and Lexicons:

    Additionally, sentiment_analyzer.py contains methods that can be used as a sort of wrapper around common NLTK functionality to ease Sentiment Analysis tasks, especially for teaching purposes.

    Several demos and helper functions have been implemented in util.py to provide examples using different datasets.
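
    For illustration, a minimal sketch using the VADER analyzer from the sentiment module (requires the vader_lexicon resource):

    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download('vader_lexicon')
    sia = SentimentIntensityAnalyzer()
    print(sia.polarity_scores('NLTK makes sentiment analysis easy and fun!'))
    # a dict with 'neg', 'neu', 'pos' and 'compound' scores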

    opened by fievelk 30
  • already downloaded 'wordnet' but can't find it

    Hello, I ran nltk.download('wordnet') in a Jupyter notebook on Linux.

    [nltk_data] Downloading package wordnet to
    [nltk_data]     /home/xxxx_linux/nltk_data...
    [nltk_data]   Package wordnet is already up-to-date!
    

    Then I ran

    from nltk.stem.wordnet import WordNetLemmatizer
    WordNetLemmatizer().lemmatize("better kip")
    

    and got the following error message:

    LookupError                               Traceback (most recent call last)
    File ~/.cache/pypoetry/virtualenvs/disaster-response-pipeline-project-Ber-KOyS-py3.8/lib/python3.8/site-packages/nltk/corpus/util.py:80, in LazyCorpusLoader.__load(self)
         79 except LookupError as e:
    ---> 80     try: root = nltk.data.find('{}/{}'.format(self.subdir, zip_name))
         81     except LookupError: raise e
    
    File ~/.cache/pypoetry/virtualenvs/disaster-response-pipeline-project-Ber-KOyS-py3.8/lib/python3.8/site-packages/nltk/data.py:673, in find(resource_name, paths)
        672 resource_not_found = '\n%s\n%s\n%s\n' % (sep, msg, sep)
    --> 673 raise LookupError(resource_not_found)
    
    LookupError: 
    **********************************************************************
      Resource wordnet not found.
      Please use the NLTK Downloader to obtain the resource:
    
      >>> import nltk
      >>> nltk.download('wordnet')
      
      Searched in:
        - '/home/xxxx_linux/nltk_data'
        - '/usr/share/nltk_data'
        - '/usr/local/share/nltk_data'
        - '/usr/lib/nltk_data'
        - '/usr/local/lib/nltk_data'
    ...
        - '/usr/local/lib/nltk_data'
        - '/home/xxxx_linux/.cache/pypoetry/virtualenvs/disaster-response-pipeline-project-Ber-KOyS-py3.8/nltk_data'
        - '/home/xxxx_linux/.cache/pypoetry/virtualenvs/disaster-response-pipeline-project-Ber-KOyS-py3.8/lib/nltk_data'
    **********************************************************************
    

    In the terminal, I also tried cd /home/xxxx_linux/nltk_data; I only found two folders there: corpora and tokenizers.

    Does anyone know what causes this? I assumed the download was successful, but the resource is not there.
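
    A quick check that often narrows this down (a sketch, using the resource names from the error above): confirm whether the resource was extracted, or whether only the zip is present.

    import nltk

    nltk.download('wordnet')
    try:
        print(nltk.data.find('corpora/wordnet'))      # extracted directory found
    except LookupError:
        print(nltk.data.find('corpora/wordnet.zip'))  # zip downloaded but not extracted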

    opened by canfang-feng 3
  • Resolution of issue #3001

    There was a reported issue with the longid() and shortid() methods in verbnet.py. Basically, there are two types of ID: a short id (e.g. 32.5) and a long id (e.g. example-32.5). The methods convert a short id to a long id and vice versa. The problem was that verbnet.longid("114-1") is supposed to return the long id "act-114-1", but it instead returned "114-1". This was solved by changing the regex pattern that matches the id types, while still passing the tests for all other cases.

    The change in the long id Regex pattern was:

    # before
    _LONGID_RE = re.compile(r"([^-.]*)-([\d+.-]+)$")
    # after
    _LONGID_RE = re.compile(r"([A-Za-z_]+)-([\d.-]+)$")

    The new pattern first matches a non-empty alphabetic sequence, then a hyphen "-", and finally the associated short id.
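
    A small self-contained check of the two patterns (a standalone regex demo, independent of the verbnet module):

    import re

    OLD_LONGID_RE = re.compile(r"([^-.]*)-([\d+.-]+)$")
    NEW_LONGID_RE = re.compile(r"([A-Za-z_]+)-([\d.-]+)$")

    for vnid in ('114-1', 'act-114-1'):
        print(vnid, bool(OLD_LONGID_RE.match(vnid)), bool(NEW_LONGID_RE.match(vnid)))
    # 114-1 True False      <- the old pattern wrongly accepts the short id as a long id
    # act-114-1 True True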

    opened by JuanIMartinezB 1
  • Generated sentences are not syntactically correct

    dogs disappears
    dogs walks
    dogs disappear
    dogs walk
    dogs disappeared
    dogs walked
    girls disappears
    girls walks
    girls disappear
    girls walk
    girls disappeared
    girls walked

    from nltk import load_parser
    from nltk.parse.generate import generate

    cp = load_parser('grammars/book_grammars/feat0.fcfg', trace=2)
    gra = cp.grammar()
    for prd in generate(gra, depth=4):
        dot = ' '.join(prd)
        print(dot)

    opened by Sujingqiao 1
  • Verbnet longid() method returns wrong results on some shortids

    For the verbnet3.3 corpus,

    verbnet.longid("114-1") is supposed to return longid "act-114-1", but it instead returns "114-1".

    It seems that the bug is rooted in the regex in verbnet.py, which wrongly matches "114-1" as a longid:

    _LONGID_RE = re.compile(r"([^\-\.]*)-([\d+.\-]+)$")

    I currently use _LONGID_RE = re.compile(r"([A-Za-z_]*)-([\d.\-]+)$") as a workaround.

    opened by TMPxyz 1
  • Update skolemize.py

    Add a function, richardize (just a name, trying to avoid confusion with simplify or __reduce__), modified from skolemize.

    The function name does not refer to any logician named Richard, and maybe the NLTK maintainers can choose a better name. The new function changes very little relative to skolemize, but instead of converting to CNF (which is equisatisfiable with, but not equivalent to, the input FOL expression), it guarantees that richardize(your_expr).equiv(your_expr) will be True.

    The motivation for this function is that simplify did not work the way I wished; for a pretty messy expression, e.g. one with a lot of consecutive negations, the new function can make it neater.
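
    For context, a hedged sketch of the existing skolemize entry point that this PR builds on (the import path is assumed from the file being modified, and the printed form is not shown):

    from nltk.sem import Expression
    from nltk.sem.skolemize import skolemize

    read_expr = Expression.fromstring
    expr = read_expr(r'all x.(dog(x) -> -(-bark(x)))')
    print(skolemize(expr))  # equisatisfiable with expr, but not necessarily equivalent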

    opened by cestwc 1
Owner

Natural Language Toolkit