NLTK Source

Overview

Natural Language Toolkit (NLTK)


NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets, and tutorials supporting research and development in Natural Language Processing. NLTK requires Python version 3.5, 3.6, 3.7, or 3.8.
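
A minimal usage sketch, assuming NLTK has been installed with pip and the punkt and averaged_perceptron_tagger data packages have been downloaded:

    import nltk

    # One-time data downloads; these names are the standard NLTK data package identifiers.
    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')

    tokens = nltk.word_tokenize("NLTK is a suite of Python modules for NLP.")
    print(nltk.pos_tag(tokens))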

For documentation, please visit nltk.org.

Contributing

Do you want to contribute to NLTK development? Great! Please read CONTRIBUTING.md for more details.

See also how to contribute to NLTK.

Donate

Have you found the toolkit helpful? Please support NLTK development by donating to the project via PayPal, using the link on the NLTK homepage.

Citing

If you publish work that uses NLTK, please cite the NLTK book, as follows:

Bird, Steven, Edward Loper and Ewan Klein (2009).
Natural Language Processing with Python.  O'Reilly Media Inc.

Copyright

Copyright (C) 2001-2020 NLTK Project

For license information, see LICENSE.txt.

AUTHORS.md contains a list of everyone who has contributed to NLTK.

Redistributing

  • NLTK source code is distributed under the Apache 2.0 License.
  • NLTK documentation is distributed under the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States license.
  • NLTK corpora are provided under the terms given in the README file for each corpus; all are redistributable and available for non-commercial use.
  • NLTK may be freely redistributed, subject to the provisions of these licenses.
Comments
  • Use new and shiny Stanford CoreNLP web API

    This is work in progress. The general idea is to separate the code that starts a server from the code that interacts with it.

    I see future usage as follows.

    With an external server: here the connection wraps a requests session, which in turn keeps a pool of connections, so there is no need to open a new connection for every API call.

    parser = StanfordParser(url='http://localhost:9000')
    parser.parse('John loves Mary.')
    

    With a server started by NLTK: in this case a server has to be started and the jars located. Maybe it's a good idea to use the circus arbiter for that (https://circus.readthedocs.org/en/latest/for-devs/), or just Popen. Extra care should be taken with the port number, as well as with handling the cases when the server dies.

    with CoreNLPServer() as server:
        parser = StanfordParser(url=server.url)
        parser.parse('John loves Mary.')
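
    For the server-managed case, a rough sketch of what such a context manager could look like, assuming a plain subprocess.Popen approach and a locally installed CoreNLP jar (the classpath default below is a placeholder, not a real NLTK setting):

    import socket
    import subprocess
    import time

    class CoreNLPServer:
        """Hypothetical sketch: start a CoreNLP server on a free port, stop it on exit."""

        def __init__(self, classpath='stanford-corenlp.jar'):  # placeholder path
            self.classpath = classpath
            self.process = None
            self.url = None

        def __enter__(self):
            # Ask the OS for a free port to avoid clashing with other servers.
            with socket.socket() as s:
                s.bind(('localhost', 0))
                port = s.getsockname()[1]
            self.process = subprocess.Popen(
                ['java', '-cp', self.classpath,
                 'edu.stanford.nlp.pipeline.StanfordCoreNLPServer', '-port', str(port)])
            self.url = 'http://localhost:%d' % port
            time.sleep(5)  # crude wait; a real implementation should poll the port
            return self

        def __exit__(self, *exc_info):
            # Covers the normal case; if the server already died, terminate() is a no-op.
            self.process.terminate()
            self.process.wait()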
    

    Comments and ideas are welcome.

    opened by dimazest 90
  • Failed to download NLTK data: HTTP ERROR 405 / 403

    >>> nltk.download("all")
    [nltk_data] Error loading all: HTTP Error 405: Not allowed.
    
    >>> nltk.version_info
    sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)
    
    

    I also tried to visit https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/cmudict.zip and got the same HTTP 405 error.

    I found the same problem on Stack Overflow: https://stackoverflow.com/questions/45318066/getting-405-while-trying-to-download-nltk-dta

    Any comments would be appreciated.

    admin corpus bug inactive 
    opened by matthew-z 47
  • Added the `translate` module for MT

    • Removed the unused import in align/gdfa.py
    • Added the translate module, with model.py to load pre-calculated phrase tables and language models, and
    • added stack_decoder.py to perform stack decoding for Statistical Machine Translation (SMT).
    opened by alvations 44
  • Stabilized MaltParser API

    From #943,

    MaltParser was requiring all sorts of weird os.environ settings to make it find the binary, and then called the jar file using the Java classpath from the environment.

    • The new API requires only the directory where the user saved his/her installed version of MaltParser; it finds the jar files using os.walk and calls MaltParser using the full classpath and org.maltparser.Malt instead of -jar (see the sketch below).
    • Also, generate_malt_command makes it easier to update the API to suit MaltParser.

    I've tried this with MaltParser-1.7.2 and MaltParser-1.8.
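
    As a rough sketch of the idea (find_malt_jars and generate_malt_command here are illustrative helpers, not the exact NLTK implementation):

    import os

    def find_malt_jars(malt_dir):
        # Collect every jar under the user's MaltParser installation directory.
        jars = []
        for root, _, files in os.walk(malt_dir):
            jars.extend(os.path.join(root, f) for f in files if f.endswith('.jar'))
        return jars

    def generate_malt_command(malt_dir, extra_args=()):
        # Call MaltParser through the full classpath and its main class instead of -jar.
        classpath = os.pathsep.join(find_malt_jars(malt_dir))
        return ['java', '-cp', classpath, 'org.maltparser.Malt'] + list(extra_args)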

    opened by alvations 41
  • Malt parser not parsing sentences

    Whenever I parse a sentence using MaltParser via mp.parse_one(token), it gives me the following exception:

    Exception: MaltParser parsing (java -cp D:/Python Files/maltparser-1.8.1\maltparser-1.8.1.jar:D:/Python Files/maltparser-1.8.1\lib\liblinear-1.8.jar:D:/Python Files/maltparser-1.8.1\lib\libsvm.jar:D:/Python Files/maltparser-1.8.1\lib\log4j.jar org.maltparser.Malt -c engmalt.poly-1.7.mco -i C:\Users\MUSTUF~1\AppData\Local\Temp\malt_input.conll.9ck59rmy -o C:\Users\MUSTUF~1\AppData\Local\Temp\malt_output.conll.1j7w_xvw -m parse) failed with exit code 1

    opened by Mustufain 38
  • Reimplement FreqDist

    Reimplement FreqDist to use collections.Counter, in NLTK 3.0.

    http://docs.python.org/3.1/library/collections.html?highlight=counter#collections.Counter

    Migrated from http://code.google.com/p/nltk/issues/detail?id=456
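
    A minimal sketch of the Counter-based approach (only N and freq are shown; this is not the full FreqDist interface):

    from collections import Counter

    class FreqDist(Counter):
        """Sketch: a frequency distribution backed by collections.Counter."""

        def N(self):
            # Total number of sample outcomes that have been counted.
            return sum(self.values())

        def freq(self, sample):
            # Relative frequency of a sample; 0 if nothing has been counted yet.
            n = self.N()
            return self[sample] / n if n else 0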


    earlier comments

    dchichkov said, at 2010-07-01T22:56:11.000Z:

    A few comments:

    1. Currently FreqDist is derived from dict(). While this may be convenient, it creates extra ambiguities in the interface.

    2. Recalculating _N upon every inc() call is wasteful, especially in the way it is done now, via a __setitem__ call.

    3. For large data sets it is valuable to have a trimming method that deletes samples with frequencies/counts lower than a specified limit;

    4. For large data sets it is valuable to have a trimming method that can be called during the data collection process and trims the FreqDist to a specified size (e.g. 100,000,000 samples). This trimming method should delete 'oldest' samples (FIFO basis) with the lowest frequencies. That can be implemented if an OrderedDict is used to store the samples (as an option during the construction).

    5. Consider storing samples that occur only once (hapax legomena) in a separate bin (set). Potentially that could reduce memory usage and give an overall performance boost. Something along these lines (the code is untested):

        def __init__(self, samples=None):
            self._d = dict()
            self._s = set()
            self._N = 0
            self._Nr_cache = None
            self._max_cache = None
            self._item_cache = None
            if samples:
                self.update(samples)
    
        def inc(self, sample, count=1):
            """
            Increment this C{FreqDist}'s count for the given
            sample.
    
            @param sample: The sample whose count should be incremented.
            @type sample: any
            @param count: The amount to increment the sample's count by.
            @type count: C{int}
            @rtype: None
            @raise NotImplementedError: If C{sample} is not a
                   supported sample type.
            """
            if count == 0: return
            self._N += count
            if sample in self._d: self._d[sample] += count; return
            if sample in self._s: self._d[sample] = count + 1; self._s.remove(sample); return
            if count == 1: self._s.add(sample); return
            self._d[sample] = count
    
        def __getitem__(self, sample):
            if sample in self._d: return self._d[sample]
            if sample in self._s: return 1
            return 0
    
        def __setitem__(self, sample, value):
            """
            Set this C{FreqDist}'s count for the given sample.
    
            @param sample: The sample whose count should be set.
            @type sample: any hashable object
            @param value: The new value for the sample's count
            @type value: C{int}
            @rtype: None
            @raise TypeError: If C{sample} is not a supported sample type.
            """
    
            self._N += value - self.__getitem__(sample)
            if sample in self._d and value <= 1: del self._d[sample]
            if sample in self._s and value != 1: self._s.remove(sample)
            if value == 1: self._s.add(sample); return
            if value > 1: self._d[sample] = value
    
    opened by alexrudnick 38
  • Improving Lancaster stemmer with strip prefix function and customizable rules from Whoosh

    Issue: #1648

    • Cleaned up documentation
    • Added doctests (copied from lancaster.py)
    • Added AUTHORS.md
    • Fixed the code so that if no rule table is supplied in __init__, the stemmer uses the default rules.

    Code is from Whoosh
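
    A small usage sketch, assuming the rule_tuple and strip_prefix_flag parameters this pull request introduces (the custom rules below are illustrative only):

    from nltk.stem import LancasterStemmer

    # Default behaviour.
    stemmer = LancasterStemmer()
    print(stemmer.stem('maximum'))

    # Prefix stripping enabled and a custom rule tuple supplied.
    custom = LancasterStemmer(rule_tuple=("ssen4>", "s1t."), strip_prefix_flag=True)
    print(custom.stem('kilometer'))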

    It would be great to get more detail on NLTK testing, because I'm a bit confused about doctest and how to run the tests.

    opened by jaehoonhwang 33
  • panlex_lite installation via nltk.download() appears to fail

    Platform: Python 3.5 on Mac OS X 10.11.2

    Steps to reproduce:

    1. $ python3
    2. import nltk; nltk.download('all', halt_on_error=False)

    Symptoms:

    Partial console output:

    [nltk_data]  | Downloading package panlex_lite to
    [nltk_data]  |     /Users/beng/nltk_data...
    [nltk_data]  |   Unzipping corpora/panlex_lite.zip.
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 664, in download
        for msg in self.incr_download(info_or_id, download_dir, force):
      File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 543, in incr_download
        for msg in self.incr_download(info.children, download_dir, force):
      File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 529, in incr_download
        for msg in self._download_list(info_or_id, download_dir, force):
      File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 572, in _download_list
        for msg in self.incr_download(item, download_dir, force):
      File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 549, in incr_download
        for msg in self._download_package(info, download_dir, force):
      File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 638, in _download_package
        for msg in _unzip_iter(filepath, zipdir, verbose=False):
      File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 2039, in _unzip_iter
        outfile.write(contents)
    OSError: [Errno 22] Invalid argument

    opened by grayben 32
  • Refactored text.concordance to return list

    Refactored text.concordance so that it returns a list (while still retaining the print functionality). Written so that the list is optional and the old behavior is default, but I think it might make more sense to do this the other way around (list of concordances by default, print on request).
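
    In current NLTK releases a list-returning variant is exposed as Text.concordance_list; a small usage sketch, assuming the gutenberg corpus data is installed:

    from nltk.corpus import gutenberg
    from nltk.text import Text

    moby = Text(gutenberg.words('melville-moby_dick.txt'))
    moby.concordance('monstrous')                # original behaviour: prints the lines
    lines = moby.concordance_list('monstrous')   # list-returning variant
    print(len(lines))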

    The unit tests and doctest are based on the behavior described in http://www.nltk.org/book/ch01.html. The doctest relies on ellipses because of invisible-character issues.

    Would love more test cases.

    corpus enhancement 
    opened by story645 31
  • git branching model

    There is a proposal to adopt a more systematic approach to branches, e.g.

    http://nvie.com/posts/a-successful-git-branching-model/
    https://docs.google.com/file/d/0B5_-_q4j54cgYjVhYWZjN2UtOTQ4My00N2JjLTk2YWEtODljZWQyYTBkNDEx/edit?hl=en_US#

    Please post feedback here.

    admin 
    opened by stevenbird 31
  • Sentiment Analysis support

    This pull request aims to add new corpora and lexicons specifically built for Sentiment Analysis purposes. Their integration has been discussed with the authors and annotators, and we agreed on licensing terms that respect the NLTK Apache License.

    New corpora and Lexicons:

    Additionally, sentiment_analyzer.py contains methods that can be used as a sort of wrapper around common NLTK functionality to ease Sentiment Analysis tasks, especially for teaching purposes.

    Several demos and helper functions have been implemented in util.py to provide examples using different datasets.
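
    As a usage sketch of the kind of workflow this enables (assuming the VADER components included in this work and the vader_lexicon data package have been installed):

    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    sia = SentimentIntensityAnalyzer()
    print(sia.polarity_scores("NLTK makes sentiment analysis straightforward!"))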

    opened by fievelk 30
  • Fix a bug that left out the last section/heading.

    Also refactored the logic a bit.

    Here's an alternative using itertools.groupby, in case you prefer it:

    from itertools import groupby

    sections, block = list(), list()
    for k, g in groupby(self.parser.parse(stream.read()), key=lambda t: t.level == 0 and t.type == 'heading_open'):
        if k or block:
            block.extend(g)
        if not k and block:
            sections.append(block)
            block = list()
    
    corpus bug 
    opened by elespike 8
  • For FreqDist plotting, no longer show plot by default

    Resolves #2788

    Hello!

    Pull Request overview

    • For FreqDist and ConditionalFreqDist, only return the ax rather than also showing the plot by default.

    Details

    See #2788 for details. Note that the third case mentioned in that issue has already been solved in 8c759729d3a1d03fd724c96c6d82c38382ef2e82.
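
    A hedged sketch of what calling code might look like after this change (assuming plot now returns the Matplotlib Axes and no longer calls plt.show() itself; the punkt tokenizer data is needed for word_tokenize):

    import matplotlib.pyplot as plt
    from nltk import FreqDist, word_tokenize

    fd = FreqDist(word_tokenize("the cat sat on the mat and the dog sat too"))
    ax = fd.plot(5)               # returns the Axes; nothing is displayed yet
    ax.set_title("Top 5 tokens")  # the caller can customise the figure...
    plt.show()                    # ...and decide when to show it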

    • Tom Aarsen
    plot 
    opened by tomaarsen 0
  • Texttiling and paragraphs

    Hello.

    I have a block of text that I want to segment. However, I found out that the TextTiling implementation requires the input text to be split into paragraphs. My question is: why is this required? Doesn't the algorithm create pseudosentences and blocks on its own? After all, Hearst didn't rely on sentences and paragraphs to segment text.
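
    For reference, a minimal usage sketch illustrating the constraint (document.txt is a hypothetical input file; the tokenizer expects blank-line paragraph breaks in the raw text):

    from nltk.tokenize import TextTilingTokenizer

    tt = TextTilingTokenizer()
    with open('document.txt') as f:   # hypothetical file containing blank-line paragraph breaks
        raw_text = f.read()
    segments = tt.tokenize(raw_text)  # fails if no paragraph breaks are found
    print(len(segments))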

    tokenizer 
    opened by markdimi 4
  • Improve performance of nltk.ngrams

    With the current implementation of nltk.ngrams, the performance decreases slightly when the size of n-grams increases:

    >>> timeit.timeit('''nltk.ngrams(tokens, 1)''', setup='import nltk; tokens = list(range(1000))')
    0.6647043000000394
    >>> timeit.timeit('''nltk.ngrams(tokens, 2)''', setup='import nltk; tokens = list(range(1000))')
    0.9033590999999888
    >>> timeit.timeit('''nltk.ngrams(tokens, 3)''', setup='import nltk; tokens = list(range(1000))')
    1.1749377000001004
    >>> timeit.timeit('''nltk.ngrams(tokens, 4)''', setup='import nltk; tokens = list(range(1000))')
    1.4639482000000044
    >>> timeit.timeit('''nltk.ngrams(tokens, 5)''', setup='import nltk; tokens = list(range(1000))')
    1.787133600000061
    

    Though this is the expected behavior, perhaps it could be improved further. There is a function sliding_window from more-itertools which seemingly implements the same functionality as nltk.ngrams:

    >>> timeit.timeit('more_itertools.sliding_window(tokens, 1)', setup='import more_itertools; tokens = list(range(1000))')
    0.1839231999997537
    >>> timeit.timeit('more_itertools.sliding_window(tokens, 2)', setup='import more_itertools; tokens = list(range(1000))')
    0.188646499999777
    >>> timeit.timeit('more_itertools.sliding_window(tokens, 3)', setup='import more_itertools; tokens = list(range(1000))')
    0.17990640000016356
    >>> timeit.timeit('more_itertools.sliding_window(tokens, 4)', setup='import more_itertools; tokens = list(range(1000))')
    0.18609590000005483
    >>> timeit.timeit('more_itertools.sliding_window(tokens, 5)', setup='import more_itertools; tokens = list(range(1000))')
    0.19976090000000113
    

    This implementation is faster, and it seems that the performance does not decrease as the size of the n-grams increases. I suppose this would be a better alternative to the current implementation, but I'm not sure about compatibility with the other optional parameters (e.g. pad_left, pad_right) or about the license issue (Apache vs. MIT).

    If the maintainers do not have time, I could work on this.

    Note: more-itertools has another function, windowed, which might likewise be a better replacement for nltk.skipgrams.
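
    For comparison, a rough sketch of a deque-based sliding window with the same core behaviour as nltk.ngrams (padding via pad_left/pad_right would still have to be layered on top, as the current implementation does with pad_sequence):

    from collections import deque
    from itertools import islice

    def ngrams_sketch(sequence, n):
        # Yield successive n-tuples using a fixed-size deque as the sliding window.
        it = iter(sequence)
        window = deque(islice(it, n), maxlen=n)
        if len(window) == n:
            yield tuple(window)
        for item in it:
            window.append(item)
            yield tuple(window)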

    performance 
    opened by BLKSerene 2
  • `generate` gives infinite recursion for recursive grammars

    The following code for generating from a simple recursive grammar does not work:

    from nltk.grammar import Nonterminal, Production, CFG
    from nltk.parse.generate import generate
    
    
    S = Nonterminal('S')
    P = [Production(S, ['a', S]), Production(S, [])]
    G = CFG(S, P)
    
    gen = generate(G)
    print(next(gen))
    

    gives RuntimeError: The grammar has rule(s) that yield infinite recursion!

    I think this would not be an issue if the algorithm used a different search technique; iterators should be capable of generating infinite sequences. Without the ability to generate from recursive grammars, the function loses much of its practical value, because it can only handle finite grammars.
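
    A rough sketch of a breadth-first alternative (generate_bfs is hypothetical, not NLTK's API); because it yields shorter sentences first, a recursive grammar simply produces an unbounded stream instead of recursing infinitely:

    from collections import deque
    from nltk.grammar import Nonterminal

    def generate_bfs(grammar):
        # Each queue entry is a partially expanded sequence of terminals/nonterminals.
        queue = deque([(grammar.start(),)])
        while queue:
            symbols = queue.popleft()
            # Expand the first remaining nonterminal, if any.
            for i, sym in enumerate(symbols):
                if isinstance(sym, Nonterminal):
                    for prod in grammar.productions(lhs=sym):
                        queue.append(symbols[:i] + tuple(prod.rhs()) + symbols[i + 1:])
                    break
            else:
                # No nonterminals left: this is a complete sentence.
                yield list(symbols)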

    opened by sylee957 0