Python module (C extension and plain python) implementing Aho-Corasick algorithm

Overview

pyahocorasick

Linux Master branch tests status Windows Master branch tests status

pyahocorasick is a fast and memory efficient library for exact or approximate multi-pattern string search meaning that you can find multiple key strings occurrences at once in some input text. The library provides an ahocorasick Python module that you can use as a plain dict-like Trie or convert a Trie to an automaton for efficient Aho-Corasick search.

It is implemented in C and tested on Python 2.7 and 3.4+. It works on Linux, Mac and Windows.

The license is BSD-3-clause. Some utilities, such as tests and the pure Python automaton are dedicated to the Public Domain.

Download and source code

You can fetch pyahocorasick from:

Quick start

This module is written in C. You need a C compiler installed to compile native CPython extensions. To install:

pip install pyahocorasick

Then create an Automaton:

>>> import ahocorasick
>>> A = ahocorasick.Automaton()

You can use the Automaton class as a trie. Add some string keys and their associated value to this trie. Here we associate a tuple of (insertion index, original string) as a value to each key string we add to the trie:

>>> for idx, key in enumerate('he her hers she'.split()):
...   A.add_word(key, (idx, key))

Then check if some string exists in the trie:

>>> 'he' in A
True
>>> 'HER' in A
False

And play with the get() dict-like method:

>>> A.get('he')
(0, 'he')
>>> A.get('she')
(3, 'she')
>>> A.get('cat', 'not exists')
'not exists'
>>> A.get('dog')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError

Now convert the trie to an Aho-Corasick automaton to enable Aho-Corasick search:

>>> A.make_automaton()

Then search all occurrences of the keys (the needles) in an input string (our haystack).

Here we print the results and just check that they are correct. The Automaton.iter() method return the results as two-tuples of the end index where a trie key was found in the input string and the associated value for this key. Here we had stored as values a tuple with the original string and its trie insertion order:

>>> for end_index, (insert_order, original_value) in A.iter(haystack):
...     start_index = end_index - len(original_value) + 1
...     print((start_index, end_index, (insert_order, original_value)))
...     assert haystack[start_index:start_index + len(original_value)] == original_value
...
(1, 2, (0, 'he'))
(1, 3, (1, 'her'))
(1, 4, (2, 'hers'))
(4, 6, (3, 'she'))
(5, 6, (0, 'he'))

You can also create an eventually large automaton ahead of time and pickle it to re-load later. Here we just pickle to a string. You would typically pickle to a file instead:

>>> import cPickle
>>> pickled = cPickle.dumps(A)
>>> B = cPickle.loads(pickled)
>>> B.get('he')
(0, 'he')
See also:

Documentation

The full documentation including the API overview and reference is published on readthedocs.

Overview

With an Aho-Corasick automaton you can efficiently search all occurrences of multiple strings (the needles) in an input string (the haystack) making a single pass over the input string. With pyahocorasick you can eventually build large automatons and pickle them to reuse them over and over as an indexed structure for fast multi pattern string matching.

One of the advantages of an Aho-Corasick automaton is that the typical worst-case and best-case runtimes are about the same and depends primarily on the size of the input string and secondarily on the number of matches returned. While this may not be the fastest string search algorithm in all cases, it can search for multiple strings at once and its runtime guarantees make it rather unique. Because pyahocorasick is based on a Trie, it stores redundant keys prefixes only once using memory efficiently.

A drawback is that it needs to be constructed and "finalized" ahead of time before you can search strings. In several applications where you search for several pre-defined "needles" in a variable "haystacks" this is actually an advantage.

Aho-Corasick automatons are commonly used for fast multi-pattern matching in intrusion detection systems (such as snort), anti-viruses and many other applications that need fast matching against a pre-defined set of string keys.

Internally an Aho-Corasick automaton is typically based on a Trie with extra data for failure links and an implementation of the Aho-Corasick search procedure.

Behind the scenes the pyahocorasick Python library implements these two data structures: a Trie and an Aho-Corasick string matching automaton. Both are exposed through the Automaton class.

In addition to Trie-like and Aho-Corasick methods and data structures, pyahocorasick also implements dict-like methods: The pyahocorasick Automaton is a Trie a dict-like structure indexed by string keys each associated with a value object. You can use this to retrieve an associated value in a time proportional to a string key length.

pyahocorasick is available in two flavors:

  • a CPython C-based extension, compatible with Python 2 and 3.
  • a simpler pure Python module, compatible with Python 2 and 3. This is only available in the source repository (not on Pypi) under the py/ directory and has a slightly different API.

Unicode and bytes

The type of strings accepted and returned by Automaton methods are either unicode or bytes, depending on a compile time settings (preprocessor definition of AHOCORASICK_UNICODE as set in setup.py).

The Automaton.unicode attributes can tell you how the library was built. On Python 3, unicode is the default. On Python 2, bytes is the default and only value.

Warning

When the library is built with unicode support on Python 3, an Automaton will store 2 or 4 bytes per letter, depending on your Python installation. When built for bytes, only one byte per letter is needed.

Unicode is NOT supported on Python 2 for now.

Build and install from PyPi

To install for common operating systems, use pip. Pre-built wheels should be available on Pypi at some point in the future:

pip install pyahocorasick

To build from sources you need to have a C compiler installed and configured which should be standard on Linux and easy to get on MacOSX.

On Windows and Python 2.7 you need the Microsoft Visual C++ Compiler for Python 2.7 (aka. Visual Studio 2008). There have been reports that pyahocorasick does not build yet with MinGW. It may build with cygwin but this has not been tested. If you get this working with these platforms, please report in a ticket!

To build from sources, clone the git repository or download and extract the source archive.

Install pip (and its setuptools companion) and then run (in a virtualenv of course!):

pip install .

If compilation succeeds, the module is ready to use.

Support

Support is available through the GitHub issue tracker to report bugs or ask questions.

Contributing

You can submit contributions through GitHub pull requests.

Authors

The initial author and maintainer is Wojciech Muła. Philippe Ombredanne, the current co-owner, rewrote documentation, setup CI servers and did a whole lot of work to make this module better accessible to end users.

Alphabetic list of authors:

  • Andrew Grigorev
  • Bogdan
  • David Woakes
  • Edward Betts
  • Frankie Robertson
  • Frederik Petersen
  • gladtosee
  • INADA Naoki
  • Jan Fan
  • Pastafarianist
  • Philippe Ombredanne
  • Renat Nasyrov
  • Sylvain Zimmer
  • Xiaopeng Xu

This library would not be possible without help of many people, who contributed in various ways. They created pull requests, reported bugs as GitHub issues or via direct messages, proposed fixes, or spent their valuable time on testing.

Thank you.

License

This library is licensed under very liberal BSD-3-Clause license. Some portions of the code are dedicated to the public domain such as the pure Python automaton and test code.

Full text of license is available in LICENSE file.

Other Aho-Corasick implementations for Python you can consider

While pyahocorasick tries to be the finest and fastest Aho Corasick library for Python you may consider these other libraries:

  • Written in pure Python.
  • Poor performance.
  • Written in pure Python.
  • Better performance than py-aho-corasick.
  • Using pypy, ahocorapy's search performance is only slightly worse than pyahocorasick's.
  • Performs additional suffix shortcutting (more setup overhead, less search overhead for suffix lookups).
  • Includes visualization tool for resulting automaton (using pygraphviz).
  • MIT-licensed, 100% test coverage, tested on all major python versions (+ pypy)
  • Written in C. Does not return overlapping matches.
  • Does not compile on Windows (July 2016).
  • No support for the pickle protocol.
  • Written in Cython.
  • Large automaton may take a long time to build (July 2016)
  • No support for a dict-like protocol to associate a value to a string key.
  • Written in C.
  • seems unmaintained (last update in 2005).
  • GPL-licensed.
Comments
  • Invalid pickle file generated:

    Invalid pickle file generated: "ValueError: binary data truncated (1)"

    I've managed to create an automation, and then pickle that automation to a 286 Mb pickle file. Problem is, when I try to unpickle, I get this error:

    $ python -m pickle wikidata-automation.pickle 
    Traceback (most recent call last):
      File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/runpy.py", line 184, in _run_module_as_main
        "__main__", mod_spec)
      File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/pickle.py", line 1605, in <module>
        obj = load(f)
    ValueError: binary data truncated (1)
    

    The source of that error is here: https://github.com/WojciechMula/pyahocorasick/blob/master/Automaton_pickle.c#L309

    Would you mind helping me troubleshoot this? Any ideas? I don't think I can send files this big to you?

    Update: This is how I build the pickle file:

    automaton = ahocorasick.Automaton()
    for i, (label, id_) in enumerate(generator):
        automaton.add_word(label, id_)
    
    automaton.make_automaton()
    
    with open(filename_out, "wb") as f:
        pickle.dump(automaton, f)
    

    Where generator just runs yield ("Belgium", "Q31").

    bug 
    opened by EmilStenstrom 60
  • Failed install on Windows 10 with Python 3.5.2

    Failed install on Windows 10 with Python 3.5.2

    I'm not able to install using the command line command

    py -m pip install ahocorasick

    The compile is failing, I've attached the error messages I'm getting.

    I have Visual Studio Professional 2015 Professional installed and Cython 0.25.2. Any pointers on what I may be doing wrong gratefully received!

    ahocorasick-compile-error.txt

    opened by woakesd 33
  • Need an iter_long() method [was: Question about functionality. ]

    Need an iter_long() method [was: Question about functionality. ]

    Hello, and thanks for your work.

    1. It's possible to save a trie. Something like pickle format?
    2. Module ahocorasick by Danny Yoo has this method: search_long(query, [startpos]). Same as search(), except that this searches for the longest leftmost keyword that matches. pyahocorasick has smth like this method?
    enhancement question 
    opened by Ulitochka 30
  • Store sequences of integers

    Store sequences of integers

    It would be great if we could store sequences of integers over a well defined range in an automaton instead of just plain bytes or unicode. Here is a simple python proof of concept (derived from the pure Python trie)

    # -*- coding: utf-8 -*-
    
    # Copyright (c) 2011-2014 Wojciech Mula
    # All rights reserved.
    #
    # Redistribution and use in source and binary forms, with or
    # without modification, are permitted provided that the following
    # conditions are met:
    #
    # * Redistributions of source code must retain the above copyright
    #   notice, this list of conditions and the following disclaimer.
    # * Redistributions in binary form must reproduce the above
    #   copyright notice, this list of conditions and the following
    #   disclaimer in the documentation and/or other materials
    #   provided with the distribution.
    # * Neither the name of the Wojciech Mula nor the names of its
    #   contributors may be used to endorse or promote products
    #   derived from this software without specific prior written
    #   permission.
    #
    # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
    # CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES,
    # INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
    # MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
    # DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
    # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
    # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
    # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
    # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
    # AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
    # LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
    # IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
    # THE POSSIBILITY OF SUCH DAMAGE.
    
    """
    Aho-Corasick string search algorithm.
    
    Author    : Wojciech Mula, [email protected]
    WWW       : http://0x80.pl
    License   : public domain
    """
    
    from __future__ import print_function, absolute_import
    
    from collections import deque
    
    
    # used to distinguish from None
    NIL = -1
    
    # attributes index in the node
    _key = 0
    _val = 1
    _fail = 2
    _kids = 3
    
    def TrieNode(key):
        """
        Build a new node as a simple list: [key, value, fail, kids]
        - key is an integer
        - value is an arbitrary object associated with this node
        - failure link used by Aho-Corasick automaton
        - kids is a mapping of children
        """
        # [keyitem:0, value:1, fail:2, _kids:3]
        return [key, -1, -1, {}]
    
    
    class Trie(object):
        """
        Trie/Aho-Corasick automaton specialized to store integer sequences.
        """
    
        def __init__(self, items_range):
            """
            Initialize a Trie for a storing integers in the range(`items_range`)
            contiguous integer items.
            """
            self.root = TrieNode([])
            self.items_range = items_range
    
        def __get_node(self, seq):
            """
            Return a final node or None if the trie does not contain the sequence of
            integers.
            """
            node = self.root
            for key in seq:
                try:
                    # note: kids may be None
                    node = node[_kids][key]
                except KeyError:
                    return None
            return node
    
        def get(self, seq, default=-1):
            """
            Return the value associated with the sequence of integers. If the
            sequence of integers is not present in trie return the `default` if
            provided or raise a KeyError if not provided.
            """
            node = self.__get_node(seq)
            value = -1
            if node:
                value = node[_val]
    
            if value == -1:
                if default == -1:
                    raise KeyError()
                else:
                    return default
            else:
                return value
    
        def iterkeys(self):
            return (k for k, _v in self.iteritems())
    
        def itervalues(self):
            return (v for _k, v in self.iteritems())
    
        def iteritems(self):
            L = []
    
            def walk(node, s):
                s = s + [node[_key]]
                if node[_val] != -1:
                    L.append((s, node[_val]))
    
                # FIXME: this is using recursion rather than a stack
                for child in node[_kids].values():
                    if child is not node:
                        walk(child, s)
    
            walk(self.root, [])
            return iter(L)
    
        def __len__(self):
            stack = deque()
            stack.append(self.root)
            n = 0
            while stack:
                node = stack.pop()
                if node[_val] != -1:
                    n += 1
                for child in node[_kids].itervalues():
                    stack.append(child)
            return n
    
        def add(self, seq, value):
            """
            Add a sequence of integers and its associated value to the trie. If `seq`
            already exists in the trie, its value is replaced by `value`.
            """
            if not seq:
                return
    
            node = self.root
            for key in seq:
                try:
                    # note: kids may be None
                    node = node[_kids][key]
                except KeyError:
                    n = TrieNode(key)
                    node[_kids][key] = n
                    node = n
    
            # only assign the value to the last item of the sequence
            node[_val] = value
    
        def clear(self):
            """
            Clears trie.
            """
            self.root = TrieNode([])
    
    
        def exists(self, seq):
            """
            Return True if the sequence of integers is present in the trie.
            """
            node = self.__get_node(seq)
            if node:
                return bool(node[_val] != -1)
            else:
                return False
    
        def match(self, seq):
            """
            Return True if the sequence of items is a prefix of any existing
            sequence of items in the trie.
            """
            return self.__get_node(seq) is not None
    
        def make_automaton(self):
            """
            Convert the trie to an Aho-Corasick automaton adding the failure links.
            """
            queue = deque()
    
            #1. create top root kids over the items range, failing to root
            for item in range(self.items_range):
                # self.content is either int or chr
                # item = self.content(i)
                if item in self.root[_kids]:
                    node = self.root[_kids][item]
                    # f(s) = 0
                    node[_fail] = self.root
                    queue.append(node)
                else:
                    self.root[_kids][item] = self.root
    
            #2. using the queue of all possible items, walk the trie and add failure links
            while queue:
                current = queue.popleft()
                for node in current[_kids].values():
                    queue.append(node)
                    state = current[_fail]
                    while node[_key] not in state[_kids]:
                        state = state[_fail]
                    node[_fail] = state[_kids].get(node[_key], self.root)
    
        def search(self, seq):
            """
            Yield all matches of `seq` sequence of integers in the automaton
            performing an Aho-Corasick search. This includes overlapping matches.
    
            The returned tuples are: (matched end index in seq, [associated values, ...])
            such that the actual matched sub-sequence is: seq[end_index - n + 1:end_index + 1]
            """
            state = self.root
            for index, key in enumerate(seq):
                # find the first failure link and next state
                while key not in state[_kids]:
                    state = state[_fail]
    
                # follow kids or get back to root
                state = state[_kids].get(key, self.root)
                tmp = state
                value = []
                while tmp != -1:
                    if tmp == -1:
                        break
                    if tmp[_val] != -1:
                        value.append(tmp[_val])
                    tmp = tmp[_fail]
                if value:
                    yield index, value
    
        def search_long(self, seq):
            """
            Yield all loguest non-overlapping matches of the `seq` sequence of
            integers in the automaton performing an Aho-Corasick search such that
            when matches overlap, only the longuest is returned.
    
            Note that because of the original index construction, two matches cannot
            be the same as not two rules are identical.
    
            The returned tuples are: (matched end index in seq, [associated values, ...])
            such that the actual matched sub-sequence is: seq[end_index - n + 1:end_index + 1]
            """
            state = self.root
            last = None
    
            index = 0
            while index < len(seq):
                item = seq[index]
    
                if item in state[_kids]:
                    state = state[_kids][item]
                    if state[_val] != -1:
                        # save the last node on the path
                        last = index, [state[_val]]
                    index += 1
                else:
                    if last:
                        # return the saved match
                        yield last
                        # and start over since we do not want overlapping results
                        # Note: this leads to quadratic complexity in the worst case
                        index = last[1] + 1
                        state = self.root
                        last = None
                    else:
                        # if no output, perform classic Aho-Corasick algorithm
                        while item not in state[_kids]:
                            state = state[_fail]
            # last match if any
            if last:
                yield last
    
    
    enhancement question 
    opened by pombredanne 26
  • Windows build in MSVC 2015

    Windows build in MSVC 2015

    Nishant Sharma has reported that following change is required in windows.h

    typedef __int32 int32_t typedef unsinged __int32 uint32_t

    in order to build with the latest MSVC. I'll include it in the next release.

    opened by WojciechMula 16
  • _pickle.load fails for more than 64 words

    _pickle.load fails for more than 64 words

    Hello I don't know is this connected to #50, so created a new issue:

    This works ok:

    import ahocorasick
    import _pickle
    
    A = ahocorasick.Automaton()
    for i in range(0, 64):
        A.add_word(str(i), (i, i))
    _pickle.dump(A, open('aho', 'wb'))
    _pickle.load(open('aho', 'rb'))
    #<ahocorasick.Automaton object at 0x7ff51acc4a58>
    

    And this fails constantly:

    import ahocorasick
    import _pickle
    
    A = ahocorasick.Automaton()
    for i in range(0, 65):
        A.add_word(str(i), (i, i))
    _pickle.dump(A, open('aho', 'wb'))
    _pickle.load(open('aho', 'rb'))
    #---------------------------------------------------------------------------
    #ValueError                                Traceback (most recent call last)
    #<ipython-input-129-f886db783629> in <module>()
    #      3     A.add_word(str(i), (i, i))
    #      4 _pickle.dump(A, open('aho', 'wb'))
    #----> 5 _pickle.load(open('aho', 'rb'))
    
    #ValueError: binary data truncated (2)
    
    python --version
    # Python 3.6.2
    
    pip list | grep aho
    # pyahocorasick (1.1.4)
    
    bug 
    opened by slavaGanzin 14
  • How to import module from .so file inside egg file without using an absolute file path?

    How to import module from .so file inside egg file without using an absolute file path?

    I posted the details on SO. Thanks all.

    http://stackoverflow.com/questions/43274153/python-how-to-import-module-from-so-file-from-egg-file-without-using-an-absol

    question 
    opened by Guangyi-Z 14
  • Display of README on Pypi

    Display of README on Pypi

    See https://pypi.python.org/pypi/pyahocorasick/1.1.1 This is not looking good. There are some limits to what ReST features Pypi supports. The issue is likely around links:

    • http://stackoverflow.com/questions/16367770/my-rst-readme-is-not-formatted-on-pypi-python-org
    • http://inre.dundeemt.com/2014-05-04/pypi-vs-readme-rst-a-tale-of-frustration-and-unnecessary-binding/
    doc 
    opened by pombredanne 14
  • installation under python 2: C compilation is attempted and fails

    installation under python 2: C compilation is attempted and fails

    Hello!

    I've tried to install pyahocorasick both using pip and by cloning current master and doing a python setup.py install. Using python 2.7.9 I see setup trying to conpile the C module and failing with

    pyahocorasick.c:42:1: error: unknown type name ‘PyModuleDef’
     PyModuleDef ahocorasick_module = {
     ^
    pyahocorasick.c:43:2: error: ‘PyModuleDef_HEAD_INIT’ undeclared here (not in a function)
    

    Isn't the module supposed to be compiled only for Python 3?

    bug 
    opened by RobertoMaurizzi 14
  • iter_long function not exists in 1.4.0 version

    iter_long function not exists in 1.4.0 version

    I found iter_long function in the documentation below

    https://pyahocorasick.readthedocs.io/en/latest/#iter-long-string-start-end

    But when i'm coding with it, it turns out not exits

    macos 10.13.4 and 10.14.2

    opened by zhouxinhit 12
  • Why printf(

    Why printf("Parsed count %zu\n", count); [Automaton.c;121]

    Why is there this printf:

    printf("Parsed count %zu\n", count);

    in Automaton.c at row 121?

    It is very annoying, every time I reload an automaton stored with pickle it displays this "Parsed count" in stdout.

    opened by gygabyte017 11
  • PUBLISH BINARY WHEELS PLEASE

    PUBLISH BINARY WHEELS PLEASE

    NO EXPLANATION NEEDED. C modules that are compiled to native should not require compilation (and, by extension, gcc, and python.h) every time they are installed. This is especially true now that wheel has support for both glibc and musl. hy whold thousands (maybe more) of users be required to install an entire compiler toolkit and python-devel headers just because y'all don't want to publish 2 binary wheels.

    opened by HarryCaveMan 1
  • unpickle test failures on i586

    unpickle test failures on i586

    [   49s] =================================== FAILURES ===================================
    [   49s] _________________ TestUnpickleRaw.test__construct_simple_trie __________________
    [   49s] 
    [   49s] self = <test_unpickle.TestUnpickleRaw testMethod=test__construct_simple_trie>
    [   49s] 
    [   49s]     @skipIf(not ahocorasick.unicode, "Run only with unicode build")
    [   49s]     def test__construct_simple_trie(self):
    [   49s]     
    [   49s]         r"""
    [   49s]         trie for set {he, her, his, him, it}
    [   49s]     
    [   49s]         #0 -> [h #1 ] -> [e #2*] -> [r #3*]
    [   49s]          |           \-> [i #4 ] -> [s #5*]
    [   49s]          |                      \-> [m #6*]
    [   49s]          |
    [   49s]          +--> [i #7 ] -> [t #8 ]
    [   49s]         """
    [   49s]         values = ["HE", "HER", "HIS", "HIM", "IT"]
    [   49s]     
    [   49s]         node0 = self.create_raw_node(0, [('h', 1), ('i', 7)])
    [   49s]         node1 = self.create_raw_node(0, [('e', 2), ('i', 4)])
    [   49s]         node2 = self.create_raw_node(1, [('r', 3)])  # HE
    [   49s]         node3 = self.create_raw_node(1, [])  # HER
    [   49s]         node4 = self.create_raw_node(0, [('s', 5), ('m', 6)])
    [   49s]         node5 = self.create_raw_node(1, [])  # HIS
    [   49s]         node6 = self.create_raw_node(1, [])  # HIM
    [   49s]         node7 = self.create_raw_node(0, [('t', 8)])
    [   49s]         node8 = self.create_raw_node(1, [])  # IT
    [   49s]     
    [   49s]         self.count = 9
    [   49s]         self.raw = node0 + node1 + node2 + node3 + node4 + node5 + node6 + node7 + node8
    [   49s]         self.kind = ahocorasick.TRIE
    [   49s]         self.values = values
    [   49s]         self.word_count = 5
    [   49s]     
    [   49s] >       A = self.create_automaton()
    [   49s] 
    [   49s] tests/test_unpickle.py:166: 
    [   49s] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    [   49s] 
    [   49s] self = <test_unpickle.TestUnpickleRaw testMethod=test__construct_simple_trie>
    [   49s] use_exact_raw = False
    [   49s] 
    [   49s]     def create_automaton(self, use_exact_raw=False):
    [   49s]         # alter values that were set in setUp
    [   49s]         if use_exact_raw:
    [   49s]             raw = self.raw
    [   49s]         else:
    [   49s]             raw = [self.create_raw_count(self.count) + self.raw]
    [   49s]     
    [   49s]         args = (raw, self.kind, self.store, self.key_type,
    [   49s]                 self.word_count, self.longest, self.values);
    [   49s]     
    [   49s] >       return ahocorasick.Automaton(*args)
    [   49s] E       ValueError: Data truncated [parsing children of node #2]: chunk #0 @ offset 54, expected at least 840 bytes
    [   49s] 
    [   49s] tests/test_unpickle.py:111: ValueError
    [   49s] _ TestUnpickleRaw.test__construct_simple_trie__split_across_a_few_chunks_unicode _
    [   49s] 
    [   49s] self = <test_unpickle.TestUnpickleRaw testMethod=test__construct_simple_trie__split_across_a_few_chunks_unicode>
    [   49s] 
    [   49s]     @skipIf(not ahocorasick.unicode, "Run only with unicode build")
    [   49s]     def test__construct_simple_trie__split_across_a_few_chunks_unicode(self):
    [   49s]     
    [   49s]         r"""
    [   49s]         trie for set {he, her, his, him, it}
    [   49s]     
    [   49s]         #0 -> [h #1 ] -> [e #2*] -> [r #3*]
    [   49s]          |           \-> [i #4 ] -> [s #5*]
    [   49s]          |                      \-> [m #6*]
    [   49s]          |
    [   49s]          +--> [i #7 ] -> [t #8 ]
    [   49s]         """
    [   49s]         values = ["HE", "HER", "HIS", "HIM", "IT"]
    [   49s]     
    [   49s]         node0 = self.create_raw_node(0, [('h', 1), ('i', 7)])
    [   49s]         node1 = self.create_raw_node(0, [('e', 2), ('i', 4)])
    [   49s]         node2 = self.create_raw_node(1, [('r', 3)])  # HE
    [   49s]         node3 = self.create_raw_node(1, [])  # HER
    [   49s]         node4 = self.create_raw_node(0, [('s', 5), ('m', 6)])
    [   49s]         node5 = self.create_raw_node(1, [])  # HIS
    [   49s]         node6 = self.create_raw_node(1, [])  # HIM
    [   49s]         node7 = self.create_raw_node(0, [('t', 8)])
    [   49s]         node8 = self.create_raw_node(1, [])  # IT
    [   49s]     
    [   49s]         self.count = 9
    [   49s]         self.raw = [
    [   49s]             self.create_raw_count(2) + node0 + node1,
    [   49s]             self.create_raw_count(3) + node2 + node3 + node4,
    [   49s]             self.create_raw_count(1) + node5,
    [   49s]             self.create_raw_count(3) + node6 + node7 + node8
    [   49s]         ]
    [   49s]         self.kind = ahocorasick.TRIE
    [   49s]         self.values = values
    [   49s]         self.word_count = 5
    [   49s]     
    [   49s]         A = self.create_automaton(USE_EXACT_RAW)
    [   49s]         self.assertEqual(len(A), 5)
    [   49s] >       self.assertEqual(A.get("he"), "HE")
    [   49s] E       KeyError
    [   49s] 
    [   49s] tests/test_unpickle.py:211: KeyError
    [   49s] _______ TestUnpickleRaw.test__construct_simple_trie__wrong_index_unicode _______
    [   49s] 
    [   49s] self = <test_unpickle.TestUnpickleRaw testMethod=test__construct_simple_trie__wrong_index_unicode>
    [   49s] 
    [   49s]     @skipIf(not ahocorasick.unicode, "Run only with unicode build")
    [   49s]     def test__construct_simple_trie__wrong_index_unicode(self):
    [   49s]         """
    [   49s]         trie for set {he}
    [   49s]     
    [   49s]         #0 -> [h #1*] -> [e #2*]
    [   49s]         """
    [   49s]     
    [   49s]         node0 = self.create_raw_node(0, [('h', 1)])
    [   49s]         node1 = self.create_raw_node(1, [('e', 2)])  # expect python value
    [   49s]         node2 = self.create_raw_node(1, [])  # also python value
    [   49s]     
    [   49s]         self.count = 3
    [   49s]         self.raw = node0 + node1 + node2
    [   49s]         self.kind = ahocorasick.TRIE
    [   49s]         self.values = ["HE"]  # but we provide a too short collection
    [   49s]         self.word_count = 2
    [   49s]     
    [   49s]         with self.assertRaises(IndexError):
    [   49s] >           self.create_automaton()
    [   49s] E           AssertionError: IndexError not raised
    [   49s] 
    [   49s] tests/test_unpickle.py:257: AssertionError
    [   49s] _________________ TestUnpickleRaw.test__malicious_fail_pointer _________________
    [   49s] 
    [   49s] self = <test_unpickle.TestUnpickleRaw testMethod=test__malicious_fail_pointer>
    [   49s] 
    [   49s]     def test__malicious_fail_pointer(self):
    [   49s]         """
    [   49s]         trie with just one node
    [   49s]         """
    [   49s]     
    [   49s]         builder = self.create_node_builder(0, [])
    [   49s]         builder.fail = 42
    [   49s]     
    [   49s]         self.count = 1
    [   49s]         self.raw = builder.dump()
    [   49s]         self.kind = ahocorasick.TRIE
    [   49s]     
    [   49s]         with self.assertRaisesRegex(ValueError, "Node #0 malformed: the fail link points to.*"):
    [   49s] >           self.create_automaton()
    [   49s] 
    [   49s] tests/test_unpickle.py:354: 
    [   49s] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    [   49s] 
    [   49s]     def create_automaton(self, use_exact_raw=False):
    [   49s]         # alter values that were set in setUp
    [   49s]         if use_exact_raw:
    [   49s]             raw = self.raw
    [   49s]         else:
    [   49s]             raw = [self.create_raw_count(self.count) + self.raw]
    [   49s]     
    [   49s]         args = (raw, self.kind, self.store, self.key_type,
    [   49s]                 self.word_count, self.longest, self.values);
    [   49s]     
    [   49s] >       return ahocorasick.Automaton(*args)
    [   49s] E       IndexError: list index out of range
    [   49s] 
    [   49s] tests/test_unpickle.py:111: IndexError
    [   49s] _____________ TestUnpickleRaw.test__malicious_next_pointer_unicode _____________
    [   49s] 
    [   49s] self = <test_unpickle.TestUnpickleRaw testMethod=test__malicious_next_pointer_unicode>
    [   49s] 
    [   49s]     @skipIf(not ahocorasick.unicode, "Run only with unicode build")
    [   49s]     def test__malicious_next_pointer_unicode(self):
    [   49s]         """
    [   49s]         #0 -> [? #1 ]
    [   49s]         """
    [   49s]     
    [   49s]         node0 = self.create_raw_node(0, [('?', 1)])
    [   49s]         node1 = self.create_raw_node(0, [('x', 16)])  # the second node point to non-existent node
    [   49s]     
    [   49s]         self.count = 2
    [   49s]         self.raw = node0 + node1
    [   49s]         self.kind = ahocorasick.TRIE
    [   49s]     
    [   49s]         with self.assertRaisesRegex(ValueError, "Node #1 malformed: next link #0 points to.*"):
    [   49s] >           self.create_automaton()
    [   49s] 
    [   49s] tests/test_unpickle.py:323: 
    [   49s] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    [   49s] 
    [   49s]     def create_automaton(self, use_exact_raw=False):
    [   49s]         # alter values that were set in setUp
    [   49s]         if use_exact_raw:
    [   49s]             raw = self.raw
    [   49s]         else:
    [   49s]             raw = [self.create_raw_count(self.count) + self.raw]
    [   49s]     
    [   49s]         args = (raw, self.kind, self.store, self.key_type,
    [   49s]                 self.word_count, self.longest, self.values);
    [   49s]     
    [   49s] >       return ahocorasick.Automaton(*args)
    [   49s] E       IndexError: list index out of range
    [   49s] 
    [   49s] tests/test_unpickle.py:111: IndexError
    [   49s] _________________ TestUnpickleRaw.test__truncated_raw__case_2 __________________
    [   49s] 
    [   49s] self = <test_unpickle.TestUnpickleRaw testMethod=test__truncated_raw__case_2>
    [   49s] 
    [   49s]     def test__truncated_raw__case_2(self):
    [   49s]         """
    [   49s]         trie for set {he}
    [   49s]     
    [   49s]         #0 -> [h #1 ] -> [e #2*]
    [   49s]         """
    [   49s]     
    [   49s]         node0 = self.create_raw_node(0, [('h', 1)])
    [   49s]         node1 = self.create_raw_node(0, [('e', 2)])
    [   49s]         node2 = self.create_raw_node(1, [])
    [   49s]         raw = node0 + node1 + node2
    [   49s]     
    [   49s]         self.count = 3
    [   49s]         self.kind = ahocorasick.TRIE
    [   49s]     
    [   49s]         for length in range(len(raw)):
    [   49s]             self.raw = raw[:length]  # truncate data and expect fail
    [   49s]             with self.assertRaisesRegex(ValueError, "Data truncated.*"):
    [   49s] >               self.create_automaton()
    [   49s] 
    [   49s] tests/test_unpickle.py:307: 
    [   49s] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    [   49s] 
    [   49s]     def create_automaton(self, use_exact_raw=False):
    [   49s]         # alter values that were set in setUp
    [   49s]         if use_exact_raw:
    [   49s]             raw = self.raw
    [   49s]         else:
    [   49s]             raw = [self.create_raw_count(self.count) + self.raw]
    [   49s]     
    [   49s]         args = (raw, self.kind, self.store, self.key_type,
    [   49s]                 self.word_count, self.longest, self.values);
    [   49s]     
    [   49s] >       return ahocorasick.Automaton(*args)
    [   49s] E       IndexError: list index out of range
    [   49s] 
    [   49s] tests/test_unpickle.py:111: IndexError
    [   49s] ______________________ TestUnpickleRaw.test__values_leaks ______________________
    [   49s] 
    [   49s] self = <test_unpickle.TestUnpickleRaw testMethod=test__values_leaks>
    [   49s] 
    [   49s]     def test__values_leaks(self):
    [   49s]     
    [   49s]         # create not connected nodes, but each hold a value
    [   49s]         good_nodes = 1000
    [   49s]         raw = b''
    [   49s]         values = []
    [   49s]         for i in range(good_nodes):
    [   49s]             raw += self.create_raw_node(1, [])
    [   49s]             values.append(tuple("node %d" % i))
    [   49s]     
    [   49s]         # create the last node that will cause error -- malformed next pointer
    [   49s]         raw += self.create_raw_node(1, [('_', 10000)])
    [   49s]         values.append(tuple("never reached"))
    [   49s]     
    [   49s]         self.count = good_nodes + 1
    [   49s]         self.raw = raw
    [   49s]         self.kind = ahocorasick.TRIE
    [   49s]         self.values = values
    [   49s]     
    [   49s]         with self.assertRaises(ValueError):
    [   49s] >           self.create_automaton()
    [   49s] E           AssertionError: ValueError not raised
    [   49s] 
    [   49s] tests/test_unpickle.py:376: AssertionError
    
    opened by jayvdb 0
  • Fix Integer Cast Order

    Fix Integer Cast Order

    This PR fixes a compiler warning created by negating an unsigned integer before casting it to a signed integer. This is fixed by explicitly casting the unsized integer to a long int, the type it is implicitly casted to, before negating it.

    opened by nathaniel-daniel 0
  • Add mingw declarations

    Add mingw declarations

    I added a mingw declarations file to support that platform. It was copied from the cygwin definitions. I appear to be able to compile, load, and use this library from msys2 with this change. __MINGW32__ is defined for 64 bit targets as well as 32 bit. I only tested this on mingw64 through msys2, but I think this should work for the original mingw as well, though I'm not sure if people still use that.

    opened by nathaniel-daniel 0
Owner
Wojciech Muła
Wojciech Muła
Extract city and country mentions from Text like GeoText without regex, but FlashText, a Aho-Corasick implementation.

flashgeotext ⚡ ?? Extract and count countries and cities (+their synonyms) from text, like GeoText on steroids using FlashText, a Aho-Corasick impleme

Ben 57 Dec 16, 2022
Implementing SimCSE(paper, official repository) using TensorFlow 2 and KR-BERT.

KR-BERT-SimCSE Implementing SimCSE(paper, official repository) using TensorFlow 2 and KR-BERT. Training Unsupervised python train_unsupervised.py --mi

Jeong Ukjae 27 Dec 12, 2022
A Python package implementing a new model for text classification with visualization tools for Explainable AI :octocat:

A Python package implementing a new model for text classification with visualization tools for Explainable AI ?? Online live demos: http://tworld.io/s

Sergio Burdisso 285 Jan 2, 2023
A framework for implementing federated learning

This is partly the reproduction of the paper of [Privacy-Preserving Federated Learning in Fog Computing](DOI: 10.1109/JIOT.2020.2987958. 2020)

DavidChen 46 Sep 23, 2022
Bpe algorithm can finetune tokenizer - Bpe algorithm can finetune tokenizer

"# bpe_algorithm_can_finetune_tokenizer" this is an implyment for https://github

张博 1 Feb 2, 2022
An extension for asreview implements a version of the tf-idf feature extractor that saves the matrix and the vocabulary.

Extension - matrix and vocabulary extractor for TF-IDF and Doc2Vec An extension for ASReview that adds a tf-idf extractor that saves the matrix and th

ASReview 4 Jun 17, 2022
ExKaldi-RT: An Online Speech Recognition Extension Toolkit of Kaldi

ExKaldi-RT is an online ASR toolkit for Python language. It reads realtime streaming audio and do online feature extraction, probability computation, and online decoding.

Wang Yu 31 Aug 16, 2021
Top2Vec is an algorithm for topic modeling and semantic search.

Top2Vec is an algorithm for topic modeling and semantic search. It automatically detects topics present in text and generates jointly embedded topic, document and word vectors.

Dimo Angelov 2.4k Jan 6, 2023
This repository details the steps in creating a Part of Speech tagger using Trigram Hidden Markov Models and the Viterbi Algorithm without using external libraries.

POS-Tagger This repository details the creation of a Part-of-Speech tagger using Trigram Hidden Markov Models to predict word tags in a word sequence.

Raihan Ahmed 1 Dec 9, 2021
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group 8.4k Dec 30, 2022
A Python module made to simplify the usage of Text To Speech and Speech Recognition.

Nav Module The solution for voice related stuff in Python Nav is a Python module which simplifies voice related stuff in Python. Just import the Modul

Snm Logic 1 Dec 20, 2021
hashily is a Python module that provides a variety of text decoding and encoding operations.

hashily is a python module that performs a variety of text decoding and encoding functions. It also various functions for encrypting and decrypting text using various ciphers.

DevMysT 5 Jul 17, 2022
Module for automatic summarization of text documents and HTML pages.

Automatic text summarizer Simple library and command line utility for extracting summary from HTML pages or plain texts. The package also contains sim

Mišo Belica 3k Jan 8, 2023
Module for automatic summarization of text documents and HTML pages.

Automatic text summarizer Simple library and command line utility for extracting summary from HTML pages or plain texts. The package also contains sim

Mišo Belica 2.5k Feb 17, 2021
The PyTorch based implementation of continuous integrate-and-fire (CIF) module.

CIF-PyTorch This is a PyTorch based implementation of continuous integrate-and-fire (CIF) module for end-to-end (E2E) automatic speech recognition (AS

Minglun Han 24 Dec 29, 2022
This is the library for the Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, corresponding to the paper Fully Supervised Speaker Diarization.

UIS-RNN Overview This is the library for the Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm. UIS-RNN solves the problem of s

Google 1.4k Dec 28, 2022
Implementation of TF-IDF algorithm to find documents similarity with cosine similarity

NLP learning Trying to learn NLP to use in my projects! Table of Contents About The Project Built With Getting Started Requirements Run Usage License

Faraz Farangizadeh 3 Aug 25, 2022