Snowball compiler and stemming algorithms

Overview

Snowball is a small string processing language for creating stemming algorithms for use in Information Retrieval, plus a collection of stemming algorithms implemented using it.

Snowball was originally designed and built by Martin Porter. Martin retired from development in 2014 and Snowball is now maintained as a community project. Martin originally chose the name Snowball as a tribute to SNOBOL, the excellent string handling language from the 1960s. It now also serves as a metaphor for how the project grows by gathering contributions over time.

The Snowball compiler translates a Snowball program into source code in another language - currently ISO C, C#, Go, Java, JavaScript, Object Pascal, Python and Rust are supported.

This repository contains the source code for the snowball compiler and the stemming algorithms. The snowball compiler is written in ISO C - you'll need a C compiler which supports C99 to build it (but the C code it generates should work with any ISO C compiler).

See https://snowballstem.org/ for more information about Snowball.

What is Stemming?

Stemming maps different forms of the same word to a common "stem" - for example, the English stemmer maps connection, connections, connective, connected, and connecting to connect. So a search for connected would also find documents which only have the other forms.

This stem form is often a word itself, but this is not always the case as this is not a requirement for text search systems, which are the intended field of use. We also aim to conflate words with the same meaning, rather than all words with a common linguistic root (so awe and awful don't have the same stem), and over-stemming is more problematic than under-stemming so we tend not to stem in cases that are hard to resolve. If you want to always reduce words to a root form and/or get a root form which is itself a word then Snowball's stemming algorithms likely aren't the right answer.
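
The conflation described above can be illustrated with a minimal Python sketch. It does not call Snowball: the `STEMS` table is a hand-written stand-in that hard-codes only the five English forms listed earlier, and `stem`, `build_index` and `search` are hypothetical helper names chosen for this example.

```python
from collections import defaultdict

# Hand-written stand-in for a stemmer: only the example forms from the text.
STEMS = {
    "connection": "connect",
    "connections": "connect",
    "connective": "connect",
    "connected": "connect",
    "connecting": "connect",
}


def stem(word):
    """Return the stem from the table, or the lowercased word if it is not listed."""
    word = word.lower()
    return STEMS.get(word, word)


def build_index(documents):
    """Map each stem to the set of document ids containing a word with that stem."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[stem(word)].add(doc_id)
    return index


def search(index, query):
    """Look up a single query word by its stem."""
    return sorted(index.get(stem(query), set()))


documents = {
    1: "the connection was lost",
    2: "connecting two networks",
    3: "an unrelated sentence",
}
index = build_index(documents)
print(search(index, "connected"))  # documents 1 and 2 share the stem "connect"
```

Neither document 1 nor document 2 contains the literal word "connected", yet both are returned because indexing and querying go through the same stem.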

Comments
  • Rust Backend

    Rust Backend

    Hey,

    for my own needs I hacked together a Rust backend for the snowball compiler. It's mostly a literal translation of the Python and Java backends.

    The current state of it is sufficient for my needs but not enough to merge it upstream. As far as I can see, the following things would need to be done:

    • [x] Add a library similar to the ones residing in java/ and python/
    • [x] Test the implementation against all stemmers (currently only tested on eight)
    • [x] Integrate these tests into travis
    • [x] Yield warning-free rust code
    • [x] Implement missing features

    Before tackling these tasks I just wanted to know if pull requests to that repository are still accepted/merged.

    opened by JDemler 17
  • Optimized snowball code for tamil stemmer

    Optimized snowball code for tamil stemmer

    I have replaced simple or conditions with the among() command as suggested by Martin. I tried to use among() with some test() commands inside, but they didn't work as expected, so I have left those as they are. I tested the modification against the data in https://github.com/rdamodharan/snowball-data/tree/master/tamil and the results were the same.

    opened by rdamodharan 17
  • Add Python generator

    Add Python generator

    I am not the original author of these generators; @shibukawa is. However I needed to use the Python generator, so I have fixed some compilation errors in the code, rebased it against the current snowball code, and submitted this pull request.

    The original code by Yoshiki can be found at shibukawa/snowball (in snowball directory).

    Python needs no introduction. JSX is a statically typed JavaScript-like language that compiles to JavaScript; the compiler can be found at jsx/JSX (requires node.js).

    Update: I excluded the JSX generator from this pull request

    opened by mitya57 15
  • add Go generator

    add Go generator

    initial implementation ported from rust generator primarily focused on getting functionality working not on the best or most performant Go code

    seems to work, based on: make check_go and other ad-hoc testing

    opened by mschoch 14
  • Problems with Russian letter Ё

    Problems with Russian letter Ё

    As http://snowballstem.org/algorithms/russian/stemmer.html properly mentions, the Russian alphabet contains the letter Ё [jo], which is quite often replaced with Е, especially in regular, non-academic texts.

    So indeed the best approach is to replace Ё -> Е when stemming.

    Now if you check the existing demo http://snowballstem.org/demo.html you can see that it doesn't actually happen.

    Let's take Russian word for "honey" — «мёд», and its form with different ending — «мёдом». If you paste it along with its "normalized" form (with Е) you can see that the form with Ё is not properly stemmed: demo

    Here's sample input so that you can run the tests yourself: "мёд мёдом мед медом".

    This is a serious problem when searching through a corpus of natural texts. Even if you're a purist (like me in this case) and type all your search terms with properly placed Ё, you won't be able to match the original texts that use Е.

    opened by emirotin 13
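
The Ё -> Е replacement discussed in this report can be sketched as a pre-processing step applied to text before it reaches a stemmer. This is only an illustration in Python: `normalize_yo` is a hypothetical helper name, and the sketch does not change the Russian stemmer itself.

```python
def normalize_yo(text):
    """Replace Ё/ё with Е/е so both spellings reach the stemmer in one form."""
    return text.replace("ё", "е").replace("Ё", "Е")


# Sample input from the report above.
sample = "мёд мёдом мед медом"
print(normalize_yo(sample).split())
```

After normalization the first two words are identical to the last two, so whatever stemmer is applied next sees the same input for both spellings.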
  • Java builds and tests

    Java builds and tests

    Hi @rboulton ,

    I would like the Java stemmers to also be included in travis-ci builds, ideally with stemming tests like its C counterparts (see #5).

    If it's alright with you, I'd like to work on this and submit a pull-request shortly.

    oerd

    opened by oerd 11
  • [Ada] Add support for Ada generator

    [Ada] Add support for Ada generator

    This pull request adds support for an Ada code generator.

    The Stemmer library is available in https://github.com/stcarrez/ada-stemmer

    The Ada code generator has been checked with English, Danish, Dutch, French, German, Greek, Italian, Serbian, Spanish, Swedish, Russian.

    opened by stcarrez 10
  • Snowball version of Porter stemmer for Lithuanian language

    Snowball version of Porter stemmer for Lithuanian language

    Hello,

    I've been working on a Snowball stemmer for the Lithuanian language and I'd like to contribute it to the wider community. By contributing my work I hope that the community can benefit from it.

    Please let me know if you would like to know more about anything in this pull request. If there are any problems with my code or there are more things to do in order to merge the code, I'm more than willing to put in the effort to fix them.

    Best wishes, Dainius

    opened by dainiusjocas 10
  • Clarification on whether `snowball` (specifically the Python implementation) is thread-safe

    Clarification on whether `snowball` (specifically the Python implementation) is thread-safe

    Hi, we've been experiencing intermittent inconsistent outputs (i.e. bug) when using snowball with dask multiprocessing. We can stop these bugs occurring by any of a) using a single process/thread, b) removing the stemming, or c) moving the instantiation of the stemmer inside the function which is being applied within threads.

    Could someone with expertise weigh in on whether snowball is thread-safe or not?

    Might be related to #146 which seems to imply that the C# implementation is not thread safe.

    opened by DBCerigo 8
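
Workaround (c) from this report, creating the stemmer inside the code that runs in each worker, can be sketched with `threading.local` so that every thread gets its own instance. `PlaceholderStemmer` is a hypothetical stand-in with mutable state, not the real Snowball class, and this sketch makes no claim about whether the real Python stemmers are thread-safe.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class PlaceholderStemmer:
    """Hypothetical stand-in for a stemmer object that keeps mutable state."""

    def __init__(self):
        self._buffer = ""

    def stem_word(self, word):
        # Mutable per-instance state is what makes sharing one object risky.
        self._buffer = word
        if self._buffer.endswith("s"):
            self._buffer = self._buffer[:-1]
        return self._buffer


_local = threading.local()


def get_stemmer():
    """Create one stemmer per thread on first use and reuse it afterwards."""
    if not hasattr(_local, "stemmer"):
        _local.stemmer = PlaceholderStemmer()
    return _local.stemmer


def stem(word):
    return get_stemmer().stem_word(word)


words = ["cats", "dogs", "bird", "trees"]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(stem, words))
print(results)
```

Because no stemmer object is shared between threads, the result for one word cannot be affected by another thread writing to the same buffer.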
  • python AttributeError snowballstemmer.algorithms()

    python AttributeError snowballstemmer.algorithms()

    Hello,

    I installed the library from PyPI

    pip install snowballstemmer
    

    There is a bug in https://github.com/snowballstem/snowball/blob/master/python/create_init.py#L42

    ----> 1 snowballstemmer.algorithms()
    
         67         return Stemmer.language()
         68     else:
    ---> 69         return list(_languages.key())
         70
         71 def stemmer(lang):
    
    AttributeError: 'dict' object has no attribute 'key'
    

    It should be _languages.keys()

    opened by kkaiser 8
  • UTF-8 ?

    UTF-8 ?

    From: algorithms/french/stem_ISO_8859_1.sbl

    stringdef a^   hex 'E2'  // a-circumflex
    stringdef a`   hex 'E0'  // a-grave
    stringdef c,   hex 'E7'  // c-cedilla
    
    stringdef e"   hex 'EB'  // e-diaeresis (rare)
    stringdef e'   hex 'E9'  // e-acute
    stringdef e^   hex 'EA'  // e-circumflex
    stringdef e`   hex 'E8'  // e-grave
    stringdef i"   hex 'EF'  // i-diaeresis
    stringdef i^   hex 'EE'  // i-circumflex
    stringdef o^   hex 'F4'  // o-circumflex
    stringdef u^   hex 'FB'  // u-circumflex
    stringdef u`   hex 'F9'  // u-grave
    

    So far there is no UTF-8 version. Why?

    opened by drzraf 8
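
The hex values in these stringdefs are specific to ISO-8859-1, which a short Python sketch can make concrete: the same character that is one byte in ISO-8859-1 takes two bytes in UTF-8. The byte values come from the excerpt above; `latin1_byte_to_utf8` is a hypothetical helper name. This only illustrates the encoding difference and does not explain why a separate UTF-8 source file was or was not provided.

```python
# Byte values taken from the ISO-8859-1 stringdefs quoted above.
LATIN1_STRINGDEFS = {
    "a^": 0xE2,  # a-circumflex
    "c,": 0xE7,  # c-cedilla
    "e'": 0xE9,  # e-acute
}


def latin1_byte_to_utf8(byte):
    """Decode one ISO-8859-1 byte to a character and return it with its UTF-8 bytes."""
    char = bytes([byte]).decode("iso-8859-1")
    return char, char.encode("utf-8")


for name, byte in LATIN1_STRINGDEFS.items():
    char, utf8 = latin1_byte_to_utf8(byte)
    print(name, char, utf8.hex())
```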
  • Add a script that replaces Latin chars with Unicode letters

    Add a script that replaces Latin chars with Unicode letters

    Adds a script that replaces the Latin-character placeholders with Unicode letters, which makes the Snowball file easier to read. The script produces a readable sbl file that can be printed out for reading and exploring the algorithm, for example:

    $ bin/readable_sbl ./algorithms/greek.sbl
    // A stemmer for Modern Greek language, based on:
    //...
    

    So, instead of:

      //...
      define step6 as (
        do (
          [substring] among (
            '{m}{a}{t}{a}' '{m}{a}{t}{oo}{n}' '{m}{a}{t}{o}{s}' (<- '{m}{a}')
          )
        )
        test1
        [substring] among (
          '{a}' '{a}{g}{a}{t}{e}' '{a}{g}{a}{n}' '{a}{e}{y}' '{a}{m}{a}{y}' '{a}{n}' '{a}{s}' '{a}{s}{a}{y}' '{a}{t}{a}{y}' '{a}{oo}' '{e}' '{e}{y}'
          '{e}{y}{s}' '{e}{y}{t}{e}' '{e}{s}{a}{y}' '{e}{s}' '{e}{t}{a}{y}' '{y}' '{y}{e}{m}{a}{y}' '{y}{e}{m}{a}{s}{t}{e}' '{y}{e}{t}{a}{y}' '{y}{e}{s}{a}{y}'
          '{y}{e}{s}{a}{s}{t}{e}' '{y}{o}{m}{a}{s}{t}{a}{n}' '{y}{o}{m}{o}{u}{n}' '{y}{o}{m}{o}{u}{n}{a}' '{y}{o}{n}{t}{a}{n}' '{y}{o}{n}{t}{o}{u}{s}{a}{n}' '{y}{o}{s}{a}{s}{t}{a}{n}'
          '{y}{o}{s}{a}{s}{t}{e}' '{y}{o}{s}{o}{u}{n}' '{y}{o}{s}{o}{u}{n}{a}' '{y}{o}{t}{a}{n}' '{y}{o}{u}{m}{a}' '{y}{o}{u}{m}{a}{s}{t}{e}' '{y}{o}{u}{n}{t}{a}{y}'
          '{y}{o}{u}{n}{t}{a}{n}' '{i}' '{i}{d}{e}{s}' '{i}{d}{oo}{n}' '{i}{th}{e}{y}' '{i}{th}{e}{y}{s}' '{i}{th}{e}{y}{t}{e}' '{i}{th}{i}{k}{a}{t}{e}' '{i}{th}{i}{k}{a}{n}'
          '{i}{th}{o}{u}{n}' '{i}{th}{oo}' '{i}{k}{a}{t}{e}' '{i}{k}{a}{n}' '{i}{s}' '{i}{s}{a}{n}' '{i}{s}{a}{t}{e}' '{i}{s}{e}{y}' '{i}{s}{e}{s}' '{i}{s}{o}{u}{n}'
          '{i}{s}{oo}' '{o}' '{o}{y}' '{o}{m}{a}{y}' '{o}{m}{a}{s}{t}{a}{n}' '{o}{m}{o}{u}{n}' '{o}{m}{o}{u}{n}{a}' '{o}{n}{t}{a}{y}' '{o}{n}{t}{a}{n}'
          '{o}{n}{t}{o}{u}{s}{a}{n}' '{o}{s}' '{o}{s}{a}{s}{t}{a}{n}' '{o}{s}{a}{s}{t}{e}' '{o}{s}{o}{u}{n}' '{o}{s}{o}{u}{n}{a}' '{o}{t}{a}{n}' '{o}{u}' '{o}{u}{m}{a}{y}'
          '{o}{u}{m}{a}{s}{t}{e}' '{o}{u}{n}' '{o}{u}{n}{t}{a}{y}' '{o}{u}{n}{t}{a}{n}' '{o}{u}{s}' '{o}{u}{s}{a}{n}' '{o}{u}{s}{a}{t}{e}' '{u}' '{u}{s}' '{oo}'
          '{oo}{n}' (delete)
        )
      )
    
      define step7 as (
        [substring] among (
          '{e}{s}{t}{e}{r}' '{e}{s}{t}{a}{t}' '{o}{t}{e}{r}' '{o}{t}{a}{t}' '{u}{t}{e}{r}' '{u}{t}{a}{t}' '{oo}{t}{e}{r}' '{oo}{t}{a}{t}' (delete)
        )
      )
      //...
    

    =>

      define step6 as (
        do (
          [substring] among (
            'ματα' 'ματων' 'ματοσ' (<- 'μα')
          )
        )
        test1
        [substring] among (
          'α' 'αγατε' 'αγαν' 'αει' 'αμαι' 'αν' 'ασ' 'ασαι' 'αται' 'αω' 'ε' 'ει'
          'εισ' 'ειτε' 'εσαι' 'εσ' 'εται' 'ι' 'ιεμαι' 'ιεμαστε' 'ιεται' 'ιεσαι'
          'ιεσαστε' 'ιομασταν' 'ιομουν' 'ιομουνα' 'ιονταν' 'ιοντουσαν' 'ιοσασταν'
          'ιοσαστε' 'ιοσουν' 'ιοσουνα' 'ιοταν' 'ιουμα' 'ιουμαστε' 'ιουνται'
          'ιουνταν' 'η' 'ηδεσ' 'ηδων' 'ηθει' 'ηθεισ' 'ηθειτε' 'ηθηκατε' 'ηθηκαν'
          'ηθουν' 'ηθω' 'ηκατε' 'ηκαν' 'ησ' 'ησαν' 'ησατε' 'ησει' 'ησεσ' 'ησουν'
          'ησω' 'ο' 'οι' 'ομαι' 'ομασταν' 'ομουν' 'ομουνα' 'ονται' 'ονταν'
          'οντουσαν' 'οσ' 'οσασταν' 'οσαστε' 'οσουν' 'οσουνα' 'οταν' 'ου' 'ουμαι'
          'ουμαστε' 'ουν' 'ουνται' 'ουνταν' 'ουσ' 'ουσαν' 'ουσατε' 'υ' 'υσ' 'ω'
          'ων' (delete)
        )
      )
    
      define step7 as (
        [substring] among (
          'εστερ' 'εστατ' 'οτερ' 'οτατ' 'υτερ' 'υτατ' 'ωτερ' 'ωτατ' (delete)
        )
      )
    
    
    opened by abratashov 1
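
The substitution shown in this pull request can be approximated in a few lines of Python. This is not the `bin/readable_sbl` script from the pull request: `make_readable` is a hypothetical name, and the `GREEK` table is a partial mapping inferred only from the first among list in the example ('{m}{a}{t}{a}' becoming 'ματα', and so on).

```python
import re

# Partial mapping inferred from the example above, not a complete alphabet.
GREEK = {"m": "μ", "a": "α", "t": "τ", "oo": "ω", "n": "ν", "o": "ο", "s": "σ"}


def make_readable(line, table=GREEK):
    """Replace each {name} placeholder with its letter; leave unknown names as-is."""
    return re.sub(r"\{(\w+)\}", lambda m: table.get(m.group(1), m.group(0)), line)


line = "'{m}{a}{t}{a}' '{m}{a}{t}{oo}{n}' '{m}{a}{t}{o}{s}' (<- '{m}{a}')"
print(make_readable(line))
```

Placeholders missing from the table are left untouched, so a partially converted file still shows which names have no mapping yet.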
  • Turkish stemmer has a problem with word "aile"

    Turkish stemmer has a problem with word "aile"

    Hello,

    I have an issue related to the Turkish stemmer. But the problem is more related to the Turkish stemming algorithm, as on page https://snowballstem.org/algorithms/turkish/stemmer.html.

    When I use Snowball to stem the Turkish word "aile", it always cuts off the "le" and leaves only "ai", which doesn't have any meaning in Turkish. I think this happens because "le" means "with" in Turkish, so the stemmer splits "aile" into "ai" and "le".

    How do I exclude the word "aile" in stemming using snowball? Thank you.

    opened by dwicak 2
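
One common way to exclude specific words, as asked in this report, is to check an exception list before calling the stemmer. The sketch below assumes nothing about the Snowball API: `placeholder_stem` is a hypothetical stand-in that only mimics the reported behaviour for "aile", and `stem_with_exceptions` and `PROTECTED_WORDS` are names invented for this example.

```python
PROTECTED_WORDS = {"aile"}  # words to pass through unstemmed


def placeholder_stem(word):
    """Hypothetical stand-in for a stemmer; strips a trailing "le" as in the report."""
    return word[:-2] if word.endswith("le") else word


def stem_with_exceptions(word, stem=placeholder_stem, protected=PROTECTED_WORDS):
    """Return protected words unchanged and stem everything else."""
    if word.lower() in protected:
        return word
    return stem(word)


print(placeholder_stem("aile"))      # the behaviour described in the report
print(stem_with_exceptions("aile"))  # kept intact by the exception list
```

A real stemmer's stemming function can be passed in through the `stem` parameter, which keeps the exception list independent of the stemmer implementation.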
  • Is it normal that comparatives and superlatives are not stemmed?

    Is it normal that comparatives and superlatives are not stemmed?

    >>> import Stemmer
    >>> stemmer = Stemmer.Stemmer('english')
    >>> print(stemmer.stemWord('poorer'))
    poorer
    >>> print(stemmer.stemWord('cleaner'))
    cleaner
    >>> print(stemmer.stemWord('cleanest'))
    cleanest
    
    opened by raffaem 1
  • Spelling

    Spelling

    This PR corrects misspellings identified by the check-spelling action.

    The misspellings have been reported at https://github.com/jsoref/snowball/commit/25df83387d7b449b530cdc1a38306cba71d9e714#commitcomment-69675396

    The action reports that the changes in this PR would make it happy: https://github.com/jsoref/snowball/commit/3ed0647bd9596b724133e8273188e38624fef328

    Note: this PR does not include the action. If you're interested in running a spell check on every PR and push, that can be offered separately.

    opened by jsoref 0
  • German stemmer possible improvements

    German stemmer possible improvements

    Hello, Snowball developers team!

    I work on developing translation software. We use the Snowball algorithms in our product to find inflected forms of terms in texts. We have gathered feedback from our customers on the German stemming algorithm and developed some changes.

    1. Remove ending -ers

    Example (word - stem by Snowball demo - stem by customized algorithm):
    Förderer - ford - ford
    Förderers - forder - ford
    Förderern - ford - ford

    2. Feminine nouns

    -erinnen is replaced with -erin

    There are already some discussions on feminine endings in German (#153, #85). We have opted to let our customers decide for themselves how a gendered word in German should be translated into a different language. Our addition to the algorithm simply provides a way to stem plural feminine nouns and singular feminine nouns in the same manner.

    Example (word - stem by Snowball demo - stem by customized algorithm):
    Politikerin - politikerin - politikerin
    Politikerinnen - politikerinn - politikerin

    3. Remove -stern

    Example (word - stem by Snowball demo - stem by customized algorithm):
    morgenstern - morgen - morgen
    morgensterne - morgenstern - morgen

    4. Remove ending -em

    That change does lead to occasional overstemming. However, the word "systems" is often used in CS and engineering terminology, so it is crucial for our customers to find words like "...system" when searching for "...systems".

    Example (word - stem by Snowball demo - stem by customized algorithm):
    system - syst - syst
    systems - system - syst

    5. -ln replaced with -l

    Example (word - stem by Snowball demo - stem by customized algorithm):
    artikel - artikel - artikel
    artikeln - artikeln - artikel

    We have implemented those changes (including updating word lists), so if after discussion you find the changes (or some of them) useful, I can create a PR.

    Standard suffix algorithm with the changes described above
     define standard_suffix as (
    	do (
    	[substring] R1 among(
    		'ers'
    		(
    			delete
    		)
                )
    	)	
            do (
                [substring] R1 among(
    		'erinnen'
    		(
    			 <- 'erin'
    		)
                    'em' 'ern' 'er' 
                    (   delete
                    )						
                    'e' 'en' 'es' 
                    (   delete
                        try (['s'] 'nis' delete)
                    )
                    's'
                    (   s_ending delete
                    )
                )
            )
            do (
                [substring] R1 among(
    		'stern'
    		(
    		delete 
    		)
                    'en' 'er' 'est' 'em'
                    (   delete
                    )
                    'st'
                    (   st_ending hop 3 delete
                    )
                )
            )
            do (
                [substring] R2 among(
                    'end' 'ung'
                    (   delete
                        try (['ig'] not 'e' R2 delete)
                    )
                    'ig' 'ik' 'isch'
                    (   not 'e' delete
                    )
                    'lich' 'heit'
                    (   delete
                        try (
                            ['er' or 'en'] R1 delete
                        )
                    )
                    'keit'
                    (   delete
                        try (
                            [substring] R2 among(
                                'lich' 'ig'
                                (   delete
                                )
                            )
                        )
                    )
                )
            )
    	do (
                [substring] R1 among(
                    'ln'
                    (   <- 'l'
                    )
    	)
        )
    )
    

    Thank you for your time!

    opened by OlgaGuselnikova 0
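
Two of the proposed rewrites, -erinnen to -erin and -ln to -l, can be tried out in isolation with a small Python sketch. This is a simplification for illustration, not a port of the Snowball code above: it ignores the R1 region check and every other step of the German stemmer, and `apply_proposed_rewrites` is a hypothetical name.

```python
# Suffix rewrites from the proposal, checked in order (longest first).
REWRITES = [("erinnen", "erin"), ("ln", "l")]


def apply_proposed_rewrites(word):
    """Apply the first matching suffix rewrite; ignores the R1 region check."""
    word = word.lower()
    for suffix, replacement in REWRITES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


for word in ["Politikerin", "Politikerinnen", "artikel", "artikeln"]:
    print(word, "->", apply_proposed_rewrites(word))
```

For these four words the sketch gives the same form to the singular and plural of each pair, which is the behaviour the proposal is after.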