NLTK Source

Overview

Natural Language Toolkit (NLTK)


NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets, and tutorials supporting research and development in Natural Language Processing. NLTK requires Python version 3.5, 3.6, 3.7, or 3.8.

For documentation, please visit nltk.org.

Contributing

Do you want to contribute to NLTK development? Great! Please read CONTRIBUTING.md for more details.

See also how to contribute to NLTK.

Donate

Have you found the toolkit helpful? Please support NLTK development by donating to the project via PayPal, using the link on the NLTK homepage.

Citing

If you publish work that uses NLTK, please cite the NLTK book, as follows:

Bird, Steven, Edward Loper and Ewan Klein (2009).
Natural Language Processing with Python.  O'Reilly Media Inc.

Copyright

Copyright (C) 2001-2020 NLTK Project

For license information, see LICENSE.txt.

AUTHORS.md contains a list of everyone who has contributed to NLTK.

Redistributing

  • NLTK source code is distributed under the Apache 2.0 License.
  • NLTK documentation is distributed under the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States license.
  • NLTK corpora are provided under the terms given in the README file for each corpus; all are redistributable and available for non-commercial use.
  • NLTK may be freely redistributed, subject to the provisions of these licenses.

Issues

  • Use new and shiny Stanford CoreNLP web API

    This is work in progress. The general idea is to separate the code that starts a server from the code that interacts with it.

    I see future usage as follows.

    With an external server, the connection wraps a requests session, which in turn keeps a pool of connections, so there is no need to open a new connection for every API call.

    parser = StanfordParser(url='http://localhost:9000')
    parser.parse('John loves Mary.')
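
    For illustration, a minimal sketch of what talking to the CoreNLP HTTP endpoint directly through a requests session could look like (the endpoint, annotator list, and response shape are assumptions about the CoreNLP server, not part of this proposal):

    import json
    import requests

    session = requests.Session()  # keeps a pool of connections across calls
    props = {'annotators': 'tokenize,ssplit,pos,parse', 'outputFormat': 'json'}
    response = session.post(
        'http://localhost:9000/',
        params={'properties': json.dumps(props)},
        data='John loves Mary.'.encode('utf-8'),
    )
    print(response.json()['sentences'][0]['parse'])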
    

    With a server started by NLTK. In this case the server has to be started and the jars located. Maybe it's a good idea to use the circus arbiter for that (https://circus.readthedocs.org/en/latest/for-devs/), or just Popen. Extra care should be taken with the port number here, as well as handling the case where the server dies.

    with CoreNLPServer() as server:
        parser = StanfordParser(url=server.url)
        parser.parse('John loves Mary.')
    

    Comments and ideas are welcome.

    opened by dimazest 90
  • Failed to download NLTK data: HTTP ERROR 405 / 403

    >>> nltk.download("all")
    [nltk_data] Error loading all: HTTP Error 405: Not allowed.
    
    >>> nltk.version_info
    sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)
    
    

    Also, I tried to visit https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/cmudict.zip. Got the same HTTP 405 ERROR.

    Found the same problem on Stack Overflow: https://stackoverflow.com/questions/45318066/getting-405-while-trying-to-download-nltk-dta

    Any comments would be appreciated.

    admin corpus bug inactive 
    opened by matthew-z 47
  • Added the `translate` module for MT

    • Removed the unused import in align/gdfa.py
    • Added the translate module with model.py to load a pre-calculated phrase table and language model
    • Added stack_decoder.py to perform stack decoding for Statistical Machine Translation (SMT)
    opened by alvations 44
  • Stabilized MaltParser API

    From #943,

    MaltParser previously required all sorts of awkward os.environ settings to locate the binary, and then called the jar file using the Java classpath from the environment.

    • The new API only requires the directory where the user saved his/her installed version of MaltParser; it finds the jar files using os.walk and calls org.maltparser.Malt with the full classpath instead of -jar.
    • Also, generate_malt_command makes it easier to update the API to track MaltParser changes.

    I've tested with MaltParser 1.7.2 and MaltParser 1.8.
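
    A rough usage sketch under the refactored API (the installation directory and model name are placeholders, and the exact keyword arguments may differ between releases):

    from nltk import pos_tag
    from nltk.parse.malt import MaltParser

    # Point the parser at the MaltParser installation directory and a trained model.
    mp = MaltParser(parser_dirname='/path/to/maltparser-1.8',
                    model_filename='engmalt.linear-1.7.mco',
                    tagger=pos_tag)
    graph = mp.parse_one('John loves Mary .'.split())
    print(graph.tree())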

    opened by alvations 41
  • Malt parser not parsing sentences

    Whenever I parse a sentence using MaltParser, mp.parse_one(token) raises the following exception:

    Exception: MaltParser parsing (java -cp D:/Python Files/maltparser-1.8.1\maltparser-1.8.1.jar:D:/Python Files/maltparser-1.8.1\lib\liblinear-1.8.jar:D:/Python Files/maltparser-1.8.1\lib\libsvm.jar:D:/Python Files/maltparser-1.8.1\lib\log4j.jar org.maltparser.Malt -c engmalt.poly-1.7.mco -i C:\Users\MUSTUF~1\AppData\Local\Temp\malt_input.conll.9ck59rmy -o C:\Users\MUSTUF~1\AppData\Local\Temp\malt_output.conll.1j7w_xvw -m parse) failed with exit code 1

    opened by Mustufain 38
  • Reimplement FreqDist

    Reimplement FreqDist to use collections.Counter, in NLTK 3.0.

    http://docs.python.org/3.1/library/collections.html?highlight=counter#collections.Counter

    Migrated from http://code.google.com/p/nltk/issues/detail?id=456
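
    For reference, a minimal sketch of what a Counter-backed FreqDist could look like (deliberately simplified; not NLTK's actual implementation):

    from collections import Counter

    class FreqDist(Counter):
        """Simplified sketch of a frequency distribution backed by collections.Counter."""

        def N(self):
            # total number of sample outcomes that have been counted
            return sum(self.values())

        def freq(self, sample):
            # relative frequency of a sample; 0 if nothing has been counted
            n = self.N()
            return self[sample] / n if n else 0.0

    fd = FreqDist('abracadabra')
    print(fd.most_common(1))     # [('a', 5)]
    print(fd.N(), fd.freq('a'))  # 11 0.4545...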


    earlier comments

    dchichkov said, at 2010-07-01T22:56:11.000Z:

    A few comments:

    1. Currently FreqDist is derived from dict(). While this may be convenient, it creates extra ambiguities in the interface.

    2. Recalculating _N upon every inc() call is wasteful, especially in the way it is done now, via a __setitem__ call.

    3. For large data sets it is valuable to have a trimming method that deletes samples with frequencies/counts lower than a specified limit;

    4. For large data sets it is valuable to have a trimming method that can be called during the data collection process and trims the FreqDist to a specified size (e.g. 100,000,000 samples). This trimming method should delete 'oldest' samples (FIFO basis) with the lowest frequencies. That can be implemented if an OrderedDict is used to store the samples (as an option during the construction).

    5. Consider storing samples that occur only once (hapax legomena) in a separate bin (set). Potentially that could reduce memory usage and give an overall performance boost. Something along these lines (the code is untested):

    class FreqDist:
        def __init__(self, samples=None):
            self._d = dict()
            self._s = set()
            self._N = 0
            self._Nr_cache = None
            self._max_cache = None
            self._item_cache = None
            if samples:
                self.update(samples)

        def inc(self, sample, count=1):
            """
            Increment this C{FreqDist}'s count for the given
            sample.

            @param sample: The sample whose count should be incremented.
            @type sample: any
            @param count: The amount to increment the sample's count by.
            @type count: C{int}
            @rtype: None
            @raise NotImplementedError: If C{sample} is not a
                   supported sample type.
            """
            if count == 0: return
            self._N += count
            if sample in self._d: self._d[sample] += count; return
            if sample in self._s: self._d[sample] = count + 1; self._s.remove(sample); return
            if count == 1: self._s.add(sample); return
            self._d[sample] = count

        def __getitem__(self, sample):
            if sample in self._d: return self._d[sample]
            if sample in self._s: return 1
            return 0

        def __setitem__(self, sample, value):
            """
            Set this C{FreqDist}'s count for the given sample.

            @param sample: The sample whose count should be set.
            @type sample: any hashable object
            @param value: The new value for the sample's count
            @type value: C{int}
            @rtype: None
            @raise TypeError: If C{sample} is not a supported sample type.
            """
            self._N += (value - self.__getitem__(sample))
            if sample in self._d and value <= 1: del self._d[sample]
            if sample in self._s and value != 1: self._s.remove(sample)
            if value == 1: self._s.add(sample); return
            if value > 1: self._d[sample] = value
    
    opened by alexrudnick 38
  • Improving Lancaster stemmer with strip prefix function and customizable rules from Whoosh

    Issue: #1648

    • Clean up documentation
    • Added doctests (copied from lancaster.py)
    • Added AUTHORS.md
    • Fixed the code so that if no rule table is provided at init, the stemmer uses the default rules.

    Code is from Whoosh

    If I could get more detail on NLTK testing, that would be great, because I'm a bit confused about doctest and how to run the tests.
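
    For illustration, a usage sketch of the customizable stemmer described above (the keyword names are taken from the proposed API and may differ in released versions):

    from nltk.stem import LancasterStemmer

    # Default rule set
    stemmer = LancasterStemmer()
    print(stemmer.stem('maximum'))  # -> 'maxim'

    # With prefix stripping enabled, recognised prefixes are removed before the rules apply
    prefix_stemmer = LancasterStemmer(strip_prefix_flag=True)
    print(prefix_stemmer.stem('kilometer'))  # e.g. 'met' once the 'kilo' prefix is stripped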

    opened by jaehoonhwang 33
  • panlex_lite installation via nltk.download() appears to fail

    Platform: Python 3.5 on Mac OS X 10.11.2

    Steps to reproduce:

    1. $ python3
    2. import nltk; nltk.download('all', halt_on_error=False)

    Symptoms:

    Partial console write:

    [nltk_data] | Downloading package panlex_lite to
    [nltk_data] |     /Users/beng/nltk_data...
    [nltk_data] |   Unzipping corpora/panlex_lite.zip.
    Traceback (most recent call last):
      File "", line 1, in
      File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 664, in download
        for msg in self.incr_download(info_or_id, download_dir, force):
      File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 543, in incr_download
        for msg in self.incr_download(info.children, download_dir, force):
      File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 529, in incr_download
        for msg in self._download_list(info_or_id, download_dir, force):
      File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 572, in _download_list
        for msg in self.incr_download(item, download_dir, force):
      File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 549, in incr_download
        for msg in self._download_package(info, download_dir, force):
      File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 638, in _download_package
        for msg in _unzip_iter(filepath, zipdir, verbose=False):
      File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 2039, in _unzip_iter
        outfile.write(contents)
    OSError: [Errno 22] Invalid argument

    opened by grayben 32
  • Refactored text.concordance to return list

    Refactored text.concordance so that it returns a list (while still retaining the print functionality). Written so that the list is optional and the old behavior is default, but I think it might make more sense to do this the other way around (list of concordances by default, print on request).

    The unit tests and doctest are based on the behavior described in http://www.nltk.org/book/ch01.html. The doctest relies on ellipses because of invisible-character issues.

    Would love more test cases.
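
    For illustration, a sketch of how the two behaviours could look side by side (the list-returning variant is assumed here under the name concordance_list; requires the gutenberg corpus to be downloaded):

    from nltk.corpus import gutenberg
    from nltk.text import Text

    moby = Text(gutenberg.words('melville-moby_dick.txt'))
    moby.concordance('monstrous', width=60, lines=3)      # prints, the old behaviour
    hits = moby.concordance_list('monstrous', width=60)   # returns a list of matches
    print(len(hits), hits[0].line)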

    corpus enhancement 
    opened by story645 31
  • git branching model

    There is a proposal to adopt a more systematic approach to branches, e.g.

    http://nvie.com/posts/a-successful-git-branching-model/
    https://docs.google.com/file/d/0B5_-_q4j54cgYjVhYWZjN2UtOTQ4My00N2JjLTk2YWEtODljZWQyYTBkNDEx/edit?hl=en_US#

    Please post feedback here.

    admin 
    opened by stevenbird 31
  • Sentiment Analysis support

    This Pull Request aims to add new corpora and lexicons specifically built for Sentiment Analysis purposes. Their integration has been discussed with the authors and annotators, and we agreed on licensing terms that respect the NLTK Apache License.

    New corpora and Lexicons:

    Additionally, sentiment_analyzer.py contains methods that can be used as a sort of wrapper around common NLTK functionality to ease Sentiment Analysis tasks, especially for teaching purposes.

    Several demos and helper functions have been implemented in util.py to provide examples using different datasets.
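
    For illustration, a minimal sketch using the VADER analyzer from the sentiment module (requires the vader_lexicon resource):

    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download('vader_lexicon')
    sia = SentimentIntensityAnalyzer()
    print(sia.polarity_scores('NLTK makes sentiment analysis easy and fun!'))
    # a dict with 'neg', 'neu', 'pos' and 'compound' scores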

    opened by fievelk 30
  • already downloaded 'wordnet' but can't find it

    Hello, I ran nltk.download('wordnet') in a Jupyter notebook on Linux.

    [nltk_data] Downloading package wordnet to
    [nltk_data]     /home/xxxx_linux/nltk_data...
    [nltk_data]   Package wordnet is already up-to-date!
    

    Then I ran

    from nltk.stem.wordnet import WordNetLemmatizer
    WordNetLemmatizer().lemmatize("better kip")
    

    and got the following error message:

    LookupError                               Traceback (most recent call last)
    File ~/.cache/pypoetry/virtualenvs/disaster-response-pipeline-project-Ber-KOyS-py3.8/lib/python3.8/site-packages/nltk/corpus/util.py:80, in LazyCorpusLoader.__load(self)
         79 except LookupError as e:
    ---> 80     try: root = nltk.data.find('{}/{}'.format(self.subdir, zip_name))
         81     except LookupError: raise e
    
    File ~/.cache/pypoetry/virtualenvs/disaster-response-pipeline-project-Ber-KOyS-py3.8/lib/python3.8/site-packages/nltk/data.py:673, in find(resource_name, paths)
        672 resource_not_found = '\n%s\n%s\n%s\n' % (sep, msg, sep)
    --> 673 raise LookupError(resource_not_found)
    
    LookupError: 
    **********************************************************************
      Resource wordnet not found.
      Please use the NLTK Downloader to obtain the resource:
    
      >>> import nltk
      >>> nltk.download('wordnet')
      
      Searched in:
        - '/home/xxxx_linux/nltk_data'
        - '/usr/share/nltk_data'
        - '/usr/local/share/nltk_data'
        - '/usr/lib/nltk_data'
        - '/usr/local/lib/nltk_data'
    ...
        - '/usr/local/lib/nltk_data'
        - '/home/xxxx_linux/.cache/pypoetry/virtualenvs/disaster-response-pipeline-project-Ber-KOyS-py3.8/nltk_data'
        - '/home/xxxx_linux/.cache/pypoetry/virtualenvs/disaster-response-pipeline-project-Ber-KOyS-py3.8/lib/nltk_data'
    **********************************************************************
    

    In the terminal, I also tried cd /home/xxxx_linux/nltk_data; I only found two folders there: corpora and tokenizers.

    Does anyone know what causes this? I assumed the download was successful, but the resource is not there.
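
    A quick check that often narrows this down (a sketch, using the resource names from the error above): confirm whether the resource was extracted, or whether only the zip is present.

    import nltk

    nltk.download('wordnet')
    try:
        print(nltk.data.find('corpora/wordnet'))      # extracted directory found
    except LookupError:
        print(nltk.data.find('corpora/wordnet.zip'))  # zip downloaded but not extracted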

    opened by canfang-feng 3
  • Resolution of issue #3001

    There was a reported issue with the longid() and shortid() methods in verbnet.py. Basically, there are two types of ID: a short id (e.g. 32.5) and a long id (e.g. example-32.5). The methods convert a short id to a long id and vice versa. The problem was that verbnet.longid("114-1") is supposed to return the long id "act-114-1", but it instead returned "114-1". This was solved by changing the regex pattern that matches the id types, while still passing the tests for all other cases.

    The change in the long id Regex pattern was:

    # before
    _LONGID_RE = re.compile(r"([^-.]*)-([\d+.-]+)$")
    # after
    _LONGID_RE = re.compile(r"([A-Za-z_]+)-([\d.-]+)$")

    The new pattern first matches a non-empty alphabetic sequence, then a hyphen "-", and finally the associated short id.
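
    A small self-contained check of the two patterns (a standalone regex demo, independent of the verbnet module):

    import re

    OLD_LONGID_RE = re.compile(r"([^-.]*)-([\d+.-]+)$")
    NEW_LONGID_RE = re.compile(r"([A-Za-z_]+)-([\d.-]+)$")

    for vnid in ('114-1', 'act-114-1'):
        print(vnid, bool(OLD_LONGID_RE.match(vnid)), bool(NEW_LONGID_RE.match(vnid)))
    # 114-1 True False      <- the old pattern wrongly accepts the short id as a long id
    # act-114-1 True True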

    opened by JuanIMartinezB 1
  • Generated sentences are not syntactically correct

    dogs disappears
    dogs walks
    dogs disappear
    dogs walk
    dogs disappeared
    dogs walked
    girls disappears
    girls walks
    girls disappear
    girls walk
    girls disappeared
    girls walked

    from nltk import load_parser
    from nltk.parse.generate import generate

    cp = load_parser('grammars/book_grammars/feat0.fcfg', trace=2)
    gra = cp.grammar()
    for prd in generate(gra, depth=4):
        dot = ' '.join(prd)
        print(dot)

    opened by Sujingqiao 1
  • Verbnet longid() method returns wrong results on some shortids

    For the verbnet3.3 corpus,

    verbnet.longid("114-1") is supposed to return longid "act-114-1", but it instead returns "114-1".

    It seems that the bug is rooted in the regex in verbnet.py, which wrongly matches "114-1" as a longid:

    _LONGID_RE = re.compile(r"([^\-\.]*)-([\d+.\-]+)$")

    I currently use _LONGID_RE = re.compile(r"([A-Za-z_]*)-([\d.\-]+)$") as a workaround.

    opened by TMPxyz 1
  • Update skolemize.py

    Add a function, richardize (just a name, trying to avoid confusion with simplify or __reduce__), modified from skolemize.

    The function name does not refer to any logician named Richard, and maybe the NLTK maintainers can choose a better name. The new function changes very little relative to skolemize, but instead of converting to CNF (which is equisatisfiable with, but not equivalent to, the input FOL expression), it guarantees that richardize(your_expr).equiv(your_expr) will be True.

    The motivation for this function is that simplify did not work the way I wished; for a pretty messy expression, e.g. one with a lot of consecutive negations, the new function can make it neater.
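
    For context, a hedged sketch of the existing skolemize entry point that this PR builds on (the import path is assumed from the file being modified, and the printed form is not shown):

    from nltk.sem import Expression
    from nltk.sem.skolemize import skolemize

    read_expr = Expression.fromstring
    expr = read_expr(r'all x.(dog(x) -> -(-bark(x)))')
    print(skolemize(expr))  # equisatisfiable with expr, but not necessarily equivalent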

    opened by cestwc 1
Owner

Natural Language Toolkit