Snowball compiler and stemming algorithms

Snowball Stemming language and algorithms

Last update: Jan 7, 2023

Related tags

Text Data & NLP snowball

Overview

Snowball is a small string processing language for creating stemming algorithms for use in Information Retrieval, plus a collection of stemming algorithms implemented using it.

Snowball was originally designed and built by Martin Porter. Martin retired from development in 2014 and Snowball is now maintained as a community project. Martin originally chose the name Snowball as a tribute to SNOBOL, the excellent string handling language from the 1960s. It now also serves as a metaphor for how the project grows by gathering contributions over time.

The Snowball compiler translates a Snowball program into source code in another language - currently ISO C, C#, Go, Java, Javascript, Object Pascal, Python and Rust are supported.

This repository contains the source code for the snowball compiler and the stemming algorithms. The snowball compiler is written in ISO C - you'll need a C compiler which supports C99 to build it (but the C code it generates should work with any ISO C compiler).

See https://snowballstem.org/ for more information about Snowball.

What is Stemming?

Stemming maps different forms of the same word to a common "stem" - for example, the English stemmer maps connection, connections, connective, connected, and connecting to connect. So a search for connected would also find documents which only have the other forms.
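To make the search use case concrete, here is a minimal Python sketch of a stemmed inverted index. The helper names `build_index`, `search` and `snowball_stem` are hypothetical and chosen for illustration; `snowball_stem` assumes the `snowballstemmer` package from PyPI is installed and imports it only when called, so the other two functions work with any word-to-stem callable.

```python
from collections import defaultdict


def snowball_stem(language="english"):
    """Return a word -> stem function backed by the snowballstemmer package.

    Assumption: `pip install snowballstemmer` has been run. The import is
    done lazily so the rest of this sketch does not depend on it.
    """
    import snowballstemmer
    return snowballstemmer.stemmer(language).stemWord


def build_index(docs, stem):
    """Map each stem to the set of document ids containing a word with that stem."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[stem(word)].add(doc_id)
    return index


def search(index, query, stem):
    """Return the ids of documents that share a stem with a single-word query."""
    return index.get(stem(query.lower()), set())
```

Because both the documents and the query pass through the same `stem` function, a query for connected can match a document that only contains connections or connecting, provided the stemmer maps them to the same stem.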

This stem form is often a word itself, but this is not always the case as this is not a requirement for text search systems, which are the intended field of use. We also aim to conflate words with the same meaning, rather than all words with a common linguistic root (so awe and awful don't have the same stem), and over-stemming is more problematic than under-stemming so we tend not to stem in cases that are hard to resolve. If you want to always reduce words to a root form and/or get a root form which is itself a word then Snowball's stemming algorithms likely aren't the right answer.

Comments

Rust Backend
Hey,

for my own needs I hacked together a rust backend for the snowball compiler. It's mostly a literal translation of the python and java backends.

The current state of it is sufficient for my needs but not enough to merge it upstream. As far as I can see, the following things would need to be done:

[x] Add a library similar to the ones residing in java/ and python/

[x] Test the implementation against all stemmers (currently only tested on eight)

[x] Integrate these tests into travis

[x] Yield warning-free rust code

[x] Implement missing features

Before tackling these tasks I just wanted to know if pull requests to this repository are still accepted/merged.
opened by JDemler 17
Optimized snowball code for tamil stemmer

I have replaced simple or conditions with the among() command as suggested by Martin. I tried to use among() with some test() commands inside, but they didn't work as expected, so I have left those as they are. I tested the modification against the data in https://github.com/rdamodharan/snowball-data/tree/master/tamil and the results were the same.

opened by rdamodharan 17
Add Python generator

I am not the original author of these generators; @shibukawa is. However I needed to use the Python generator, so I have fixed some compilation errors in the code, rebased it against the current snowball code, and submitted this pull request.

The original code by Yoshiki can be found at shibukawa/snowball (in snowball directory).

Python needs no introduction. ~~JSX is a statically typed JavaScript-like language that compiles to JavaScript, the compiler can be found at jsx/JSX (requires node.js).~~

Update: I excluded the JSX generator from this pull request

opened by mitya57 15
add Go generator

Initial implementation, ported from the Rust generator. It is primarily focused on getting functionality working, not on the best or most performant Go code.

seems to work, based on: make check_go and other ad-hoc testing

opened by mschoch 14
Problems with Russian letter Ё

As http://snowballstem.org/algorithms/russian/stemmer.html properly mentions, the Russian alphabet contains the letter Ё [jo], which is quite often replaced with Е, especially in regular, non-academic texts.

So indeed the best approach is to replace Ё -> Е when stemming.

Now if you check the existing demo http://snowballstem.org/demo.html you can see that it doesn't actually happen.

Let's take Russian word for "honey" — «мёд», and its form with different ending — «мёдом». If you paste it along with its "normalized" form (with Е) you can see that the form with Ё is not properly stemmed:

Here's sample input so that you can run the tests yourself: "мёд мёдом мед медом".

This is a serious problem when searching through a corpus of natural texts. Even if you're a purist (like me in this case) and type all your search terms with properly placed Ё, you won't be able to match the original texts that use Е.
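A minimal sketch of the replacement described above, done as an application-side pre-processing step before words reach the stemmer. The function name `normalize_yo` is hypothetical, and this says nothing about what the Russian stemmer does internally; the letters are written as Unicode escapes to avoid any confusion with look-alike Latin characters.

```python
# Translation table: Ё (U+0401) -> Е (U+0415), ё (U+0451) -> е (U+0435).
YO_TABLE = str.maketrans({"\u0401": "\u0415", "\u0451": "\u0435"})


def normalize_yo(text):
    """Replace the Cyrillic letter yo with ye so both spellings look the same."""
    return text.translate(YO_TABLE)


def normalize_words(words):
    """Apply the replacement to every word, e.g. before passing them to a stemmer."""
    return [normalize_yo(word) for word in words]
```

Applying the same function to both indexed text and search terms means the two spellings of a word are identical by the time they are stemmed.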

opened by emirotin 13
Java builds and tests

Hi @rboulton ,

I would like the Java stemmers to also be included in travis-ci builds, ideally with stemming tests like their C counterparts (see #5).

If it's alright with you, I'd like to work on this and submit a pull-request shortly.

oerd

opened by oerd 11
[Ada] Add support for Ada generator

This pull request adds support for an Ada code generator.

The Stemmer library is available in https://github.com/stcarrez/ada-stemmer

The Ada code generator has been checked with English, Danish, Dutch, French, German, Greek, Italian, Serbian, Spanish, Swedish, Russian.

opened by stcarrez 10
Snowball version of Porter stemmer for Lithuanian language

Hello,

I've been working on a Snowball stemmer for the Lithuanian language and I'd like to contribute it to the wider community. By contributing my work I hope that the community can get some benefit from it.

Please let me know if you would like to know more about anything in this pull request. If there are any problems with my code or there are more things to do in order to merge the code, I'm more than willing to put in the effort to fix them.

Best wishes, Dainius

opened by dainiusjocas 10
Clarification on whether `snowball` (specifically the Python implementation) is thread-safe

Hi, we've been experiencing intermittent inconsistent outputs (i.e. bug) when using snowball with dask multiprocessing. We can stop these bugs occurring by any of a) using a single process/thread, b) removing the stemming, or c) moving the instantiation of the stemmer inside the function which is being applied within threads.

Could someone with expertise weigh in on whether snowball is thread-safe or not?

Might be related to #146 which seems to imply that the C# implementation is not thread safe.
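Workaround (c) above can be sketched as a small wrapper that gives each thread its own stemmer object instead of sharing one. This is only an illustration of that workaround, not a statement about whether Snowball is thread-safe; the class name `PerThreadStemmer` is hypothetical, and the factory is assumed to return an object with a `stemWord` method, as the stemmers in the Python session elsewhere on this page do.

```python
import threading


class PerThreadStemmer:
    """Create one stemmer per thread on first use, rather than sharing one."""

    def __init__(self, factory):
        # `factory` is any zero-argument callable returning a new stemmer,
        # for example: lambda: snowballstemmer.stemmer("english")
        # (assumes the snowballstemmer package is installed).
        self._factory = factory
        self._local = threading.local()

    def stem(self, word):
        stemmer = getattr(self._local, "stemmer", None)
        if stemmer is None:
            # First call in this thread: build and remember a private stemmer.
            stemmer = self._local.stemmer = self._factory()
        return stemmer.stemWord(word)
```

With separate processes rather than threads, the equivalent would be to construct the stemmer inside each worker, which is what moving the instantiation into the applied function achieves.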

opened by DBCerigo 8

python AttributeError snowballstemmer.algorithms()

Hello,

I installed the library from PyPI

pip install snowballstemmer

There is a bug in https://github.com/snowballstem/snowball/blob/master/python/create_init.py#L42

----> 1 snowballstemmer.algorithms()

     67         return Stemmer.language()
     68     else:
---> 69         return list(_languages.key())
     70
     71 def stemmer(lang):

AttributeError: 'dict' object has no attribute 'key'

It should be _languages.keys()

opened by kkaiser 8

UTF-8 ?

From: algorithms/french/stem_ISO_8859_1.sbl

stringdef a^   hex 'E2'  // a-circumflex
stringdef a`   hex 'E0'  // a-grave
stringdef c,   hex 'E7'  // c-cedilla

stringdef e"   hex 'EB'  // e-diaeresis (rare)
stringdef e'   hex 'E9'  // e-acute
stringdef e^   hex 'EA'  // e-circumflex
stringdef e`   hex 'E8'  // e-grave
stringdef i"   hex 'EF'  // i-diaeresis
stringdef i^   hex 'EE'  // i-circumflex
stringdef o^   hex 'F4'  // o-circumflex
stringdef u^   hex 'FB'  // u-circumflex
stringdef u`   hex 'F9'  // u-grave

So far there is no UTF-8 version. Why?
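To illustrate why these definitions are tied to one encoding, the sketch below decodes each byte value from the stringdef lines above as ISO-8859-1 and shows the bytes the same letter needs in UTF-8. The names `LATIN1_STRINGDEFS` and `utf8_view` are hypothetical, and this is not a description of how the Snowball compiler handles encodings.

```python
# Byte values copied from the stringdef lines quoted above.
LATIN1_STRINGDEFS = {
    "a^": 0xE2, "a`": 0xE0, "c,": 0xE7,
    'e"': 0xEB, "e'": 0xE9, "e^": 0xEA, "e`": 0xE8,
    'i"': 0xEF, "i^": 0xEE, "o^": 0xF4, "u^": 0xFB, "u`": 0xF9,
}


def utf8_view(defs):
    """Map each name to (character, UTF-8 bytes as upper-case hex)."""
    view = {}
    for name, byte in defs.items():
        char = bytes([byte]).decode("iso-8859-1")
        view[name] = (char, char.encode("utf-8").hex().upper())
    return view


for name, (char, utf8_hex) in utf8_view(LATIN1_STRINGDEFS).items():
    print(name, char, utf8_hex)
```

Each of these letters is a single byte in ISO-8859-1 but two bytes in UTF-8, so a `hex 'E9'` definition cannot simply be reused unchanged for UTF-8 input.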

opened by drzraf 8

Add a script that replaces Latin chars with Unicode letters

Adds a script that replaces the Latin placeholder names with the Unicode letters they stand for, which makes the Snowball file easier to read. The script produces a readable sbl file that can be printed out for reading and exploring the algorithm, for example:

$ bin/readable_sbl ./algorithms/greek.sbl
// A stemmer for Modern Greek language, based on:
//...

So, instead of:

  //...
  define step6 as (
    do (
      [substring] among (
        '{m}{a}{t}{a}' '{m}{a}{t}{oo}{n}' '{m}{a}{t}{o}{s}' (<- '{m}{a}')
      )
    )
    test1
    [substring] among (
      '{a}' '{a}{g}{a}{t}{e}' '{a}{g}{a}{n}' '{a}{e}{y}' '{a}{m}{a}{y}' '{a}{n}' '{a}{s}' '{a}{s}{a}{y}' '{a}{t}{a}{y}' '{a}{oo}' '{e}' '{e}{y}'
      '{e}{y}{s}' '{e}{y}{t}{e}' '{e}{s}{a}{y}' '{e}{s}' '{e}{t}{a}{y}' '{y}' '{y}{e}{m}{a}{y}' '{y}{e}{m}{a}{s}{t}{e}' '{y}{e}{t}{a}{y}' '{y}{e}{s}{a}{y}'
      '{y}{e}{s}{a}{s}{t}{e}' '{y}{o}{m}{a}{s}{t}{a}{n}' '{y}{o}{m}{o}{u}{n}' '{y}{o}{m}{o}{u}{n}{a}' '{y}{o}{n}{t}{a}{n}' '{y}{o}{n}{t}{o}{u}{s}{a}{n}' '{y}{o}{s}{a}{s}{t}{a}{n}'
      '{y}{o}{s}{a}{s}{t}{e}' '{y}{o}{s}{o}{u}{n}' '{y}{o}{s}{o}{u}{n}{a}' '{y}{o}{t}{a}{n}' '{y}{o}{u}{m}{a}' '{y}{o}{u}{m}{a}{s}{t}{e}' '{y}{o}{u}{n}{t}{a}{y}'
      '{y}{o}{u}{n}{t}{a}{n}' '{i}' '{i}{d}{e}{s}' '{i}{d}{oo}{n}' '{i}{th}{e}{y}' '{i}{th}{e}{y}{s}' '{i}{th}{e}{y}{t}{e}' '{i}{th}{i}{k}{a}{t}{e}' '{i}{th}{i}{k}{a}{n}'
      '{i}{th}{o}{u}{n}' '{i}{th}{oo}' '{i}{k}{a}{t}{e}' '{i}{k}{a}{n}' '{i}{s}' '{i}{s}{a}{n}' '{i}{s}{a}{t}{e}' '{i}{s}{e}{y}' '{i}{s}{e}{s}' '{i}{s}{o}{u}{n}'
      '{i}{s}{oo}' '{o}' '{o}{y}' '{o}{m}{a}{y}' '{o}{m}{a}{s}{t}{a}{n}' '{o}{m}{o}{u}{n}' '{o}{m}{o}{u}{n}{a}' '{o}{n}{t}{a}{y}' '{o}{n}{t}{a}{n}'
      '{o}{n}{t}{o}{u}{s}{a}{n}' '{o}{s}' '{o}{s}{a}{s}{t}{a}{n}' '{o}{s}{a}{s}{t}{e}' '{o}{s}{o}{u}{n}' '{o}{s}{o}{u}{n}{a}' '{o}{t}{a}{n}' '{o}{u}' '{o}{u}{m}{a}{y}'
      '{o}{u}{m}{a}{s}{t}{e}' '{o}{u}{n}' '{o}{u}{n}{t}{a}{y}' '{o}{u}{n}{t}{a}{n}' '{o}{u}{s}' '{o}{u}{s}{a}{n}' '{o}{u}{s}{a}{t}{e}' '{u}' '{u}{s}' '{oo}'
      '{oo}{n}' (delete)
    )
  )

  define step7 as (
    [substring] among (
      '{e}{s}{t}{e}{r}' '{e}{s}{t}{a}{t}' '{o}{t}{e}{r}' '{o}{t}{a}{t}' '{u}{t}{e}{r}' '{u}{t}{a}{t}' '{oo}{t}{e}{r}' '{oo}{t}{a}{t}' (delete)
    )
  )
  //...

you get:

  define step6 as (
    do (
      [substring] among (
        'ματα' 'ματων' 'ματοσ' (<- 'μα')
      )
    )
    test1
    [substring] among (
      'α' 'αγατε' 'αγαν' 'αει' 'αμαι' 'αν' 'ασ' 'ασαι' 'αται' 'αω' 'ε' 'ει'
      'εισ' 'ειτε' 'εσαι' 'εσ' 'εται' 'ι' 'ιεμαι' 'ιεμαστε' 'ιεται' 'ιεσαι'
      'ιεσαστε' 'ιομασταν' 'ιομουν' 'ιομουνα' 'ιονταν' 'ιοντουσαν' 'ιοσασταν'
      'ιοσαστε' 'ιοσουν' 'ιοσουνα' 'ιοταν' 'ιουμα' 'ιουμαστε' 'ιουνται'
      'ιουνταν' 'η' 'ηδεσ' 'ηδων' 'ηθει' 'ηθεισ' 'ηθειτε' 'ηθηκατε' 'ηθηκαν'
      'ηθουν' 'ηθω' 'ηκατε' 'ηκαν' 'ησ' 'ησαν' 'ησατε' 'ησει' 'ησεσ' 'ησουν'
      'ησω' 'ο' 'οι' 'ομαι' 'ομασταν' 'ομουν' 'ομουνα' 'ονται' 'ονταν'
      'οντουσαν' 'οσ' 'οσασταν' 'οσαστε' 'οσουν' 'οσουνα' 'οταν' 'ου' 'ουμαι'
      'ουμαστε' 'ουν' 'ουνται' 'ουνταν' 'ουσ' 'ουσαν' 'ουσατε' 'υ' 'υσ' 'ω'
      'ων' (delete)
    )
  )

  define step7 as (
    [substring] among (
      'εστερ' 'εστατ' 'οτερ' 'οτατ' 'υτερ' 'υτατ' 'ωτερ' 'ωτατ' (delete)
    )
  )
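The substitution shown above can be sketched in a few lines of Python. This is not the script from the pull request; the function name `expand` is hypothetical, and the mapping is a small subset inferred from the step6 example (for instance '{m}{a}{t}{oo}{n}' corresponds to 'ματων').

```python
import re

# Hypothetical subset of the name -> letter mapping, inferred from the
# first among() in the step6 example above.
GREEK = {"m": "μ", "a": "α", "t": "τ", "oo": "ω", "n": "ν", "o": "ο", "s": "σ"}

# Matches a braced name such as {a} or {oo}.
_PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def expand(line, mapping=GREEK):
    """Replace each {name} with its letter; unknown names are left untouched."""
    return _PLACEHOLDER.sub(lambda m: mapping.get(m.group(1), m.group(0)), line)


print(expand("'{m}{a}{t}{a}' '{m}{a}{t}{oo}{n}' '{m}{a}{t}{o}{s}' (<- '{m}{a}')"))
```

A complete tool would presumably build the mapping from the stringdef lines of the file being processed rather than hard-coding it.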

opened by abratashov 1

Turkish stemmer has a problem with word "aile"

Hello,

I have an issue related to the Turkish stemmer. The problem seems to lie in the Turkish stemming algorithm itself, as described on the page https://snowballstem.org/algorithms/turkish/stemmer.html.

When I use Snowball to stem the Turkish word "aile", it always cuts off the "le" and leaves only "ai", which doesn't have any meaning in Turkish. I think this is because "le" in Turkish means "with", so the stemmer splits "aile" into "ai" and "le".

How do I exclude the word "aile" from stemming when using Snowball? Thank you.
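One application-side way to do this is to keep a list of protected words and bypass the stemmer for them. The sketch below is a generic wrapper and not a Snowball feature; `make_protected_stem` is a hypothetical name, and `stem` is assumed to be any word-to-stem callable such as a stemmer's `stemWord` method.

```python
def make_protected_stem(stem, protected):
    """Wrap `stem` so that words listed in `protected` are returned unchanged."""
    protected = frozenset(protected)

    def protected_stem(word):
        # Exact-match lookup: callers should normalise case the same way
        # for the protected list and for the words being stemmed.
        if word in protected:
            return word
        return stem(word)

    return protected_stem
```

For example, `make_protected_stem(stemmer.stemWord, {"aile"})` would return "aile" as is and pass every other word on to the stemmer.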

opened by dwicak 2

Is it normal that comparatives and superlatives are not stemmed?

>>> import Stemmer
>>> stemmer = Stemmer.Stemmer('english')
>>> print(stemmer.stemWord('poorer'))
poorer
>>> print(stemmer.stemWord('cleaner'))
cleaner
>>> print(stemmer.stemWord('cleanest'))
cleanest

opened by raffaem 1

Spelling

This PR corrects misspellings identified by the check-spelling action.

The misspellings have been reported at https://github.com/jsoref/snowball/commit/25df83387d7b449b530cdc1a38306cba71d9e714#commitcomment-69675396

The action reports that the changes in this PR would make it happy: https://github.com/jsoref/snowball/commit/3ed0647bd9596b724133e8273188e38624fef328

Note: this PR does not include the action. If you're interested in running a spell check on every PR and push, that can be offered separately.

opened by jsoref 0
German stemmer possible improvements
Hello, Snowball developers team!

I work on developing translation software. We use the Snowball algorithms in our product to find inflected forms of terms in texts. We have gathered feedback from our customers on the German stemming algorithm and developed some changes.

Remove ending -ers

Example (word - stem by Snowball demo - stem by customized algorithm):
Förderer - ford - ford
Förderers - forder - ford
Förderern - ford - ford

Feminine nouns

-erinnen is replaced with -erin

There are already some discussions on feminine endings in German (#153, #85). We have opted to let our customers decide for themselves how a gendered word in German should be translated to a different language. Our addition to the algorithm simply provides a way to stem plural feminine nouns and singular feminine nouns in the same manner.

Example (word - stem by Snowball demo - stem by customized algorithm):
Politikerin - politikerin - politikerin
Politikerinnen - politikerinn - politikerin

Remove -stern

Example (word - stem by Snowball demo - stem by customized algorithm):
morgenstern - morgen - morgen
morgensterne - morgenstern - morgen

Remove ending -em

That change does lead to occasional overstemming. However, the word "systems" is often used in CS and engineering terminology, so it is crucial for our customers to find words like "...system" when searching for "...systems".

Example (word - stem by Snowball demo - stem by customized algorithm):
system - syst - syst
systems - system - syst

-ln replaced with -l

Example (word - stem by Snowball demo - stem by customized algorithm):
artikel - artikel - artikel
artikeln - artikeln - artikel

We have implemented those changes (including updating word lists), so if after discussion you find the changes (or some of them) useful, I can create a PR.
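The examples above all ask the same question: do the inflected forms of one word end up with a single stem? A minimal sketch of that check follows; `conflation_report` is a hypothetical helper, and `stem` is assumed to be any word-to-stem callable.

```python
def conflation_report(groups, stem):
    """For each labelled group of word forms, list the distinct stems produced.

    A group conflates correctly when its list contains exactly one stem.
    """
    return {
        label: sorted({stem(form) for form in forms})
        for label, forms in groups.items()
    }


# Word groups taken from the examples in this proposal.
GROUPS = {
    "Förderer": ["Förderer", "Förderers", "Förderern"],
    "Politikerin": ["Politikerin", "Politikerinnen"],
    "system": ["system", "systems"],
    "artikel": ["artikel", "artikeln"],
}
```

Running `conflation_report(GROUPS, stem)` with the current and the modified stemmer would show which groups each version splits into more than one stem.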

Standard suffix algorithm with the changes described above

  define standard_suffix as (
    do (
      [substring] R1 among(
        'ers' ( delete )
      )
    )
    do (
      [substring] R1 among(
        'erinnen' ( <- 'erin' )
        'em' 'ern' 'er' ( delete )
        'e' 'en' 'es' ( delete try (['s'] 'nis' delete) )
        's' ( s_ending delete )
      )
    )
    do (
      [substring] R1 among(
        'stern' ( delete )
        'en' 'er' 'est' 'em' ( delete )
        'st' ( st_ending hop 3 delete )
      )
    )
    do (
      [substring] R2 among(
        'end' 'ung' ( delete try (['ig'] not 'e' R2 delete) )
        'ig' 'ik' 'isch' ( not 'e' delete )
        'lich' 'heit' ( delete try ( ['er' or 'en'] R1 delete ) )
        'keit' (
          delete
          try (
            [substring] R2 among(
              'lich' 'ig' ( delete )
            )
          )
        )
      )
    )
    do (
      [substring] R1 among(
        'ln' ( <- 'l' )
      )
    )
  )

Thank you for your time!
opened by OlgaGuselnikova 0

A Multilingual Latent Dirichlet Allocation (LDA) Pipeline with Stop Words Removal, n-gram features, and Inverse Stemming, in Python.

Yet Another Compiler Visualizer

Nateve compiler developed with python.

Connectionist Temporal Classification (CTC) decoding algorithms: best path, beam search, lexicon search, prefix search, and token passing. Implemented in Python.

Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.

Various Algorithms for Short Text Mining

Meta learning algorithms to train cross-lingual NLI (multi-task) models

Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

VADER Sentiment Analysis. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation, available for both PyTorch and Tensorflow.

:id: A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.

Jarvis is a simple Chatbot with a GUI capable of chatting and retrieving information and daily news from the internet for its user.

Client library to download and publish models and other files on the huggingface.co hub