greenery

Tools for parsing and manipulating regular expressions (greenery.lego), for producing finite-state machines (greenery.fsm), and for freely converting between the two. Python 3 only.

This project was undertaken because I wanted to be able to compute the intersection between two regular expressions. The "intersection" is the set of strings which both regexes will accept, represented as a third regular expression.

Example


>>> from greenery.lego import parse
>>> print(parse("abc...") & parse("...def"))
abcdef
>>> print(parse("\d{4}-\d{2}-\d{2}") & parse("19.*"))
19\d{2}-\d{2}-\d{2}
>>> print(parse("\W*") & parse("[a-g0-8$%\^]+") & parse("[^d]{2,8}"))
[$%\^]{2,8}
>>> print(parse("[bc]*[ab]*") & parse("[ab]*[bc]*"))
([ab]*a|[bc]*c)?b*
>>> print(parse("a*") & parse("b*"))

>>> print(parse("a") & parse("b"))
[]

In the penultimate example, the empty string is returned, because only the empty string is in both of the regular languages a* and b*. In the final example, an empty character class has been returned. An empty character class can never match anything, which means that this is the smallest representation of a regular expression which matches no strings at all. (Note that this is different from only matching the empty string.)

greenery works by converting both regexes to finite state machines, computing the intersection of the two FSMs as a third FSM, and converting the third FSM back to a regex.
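The middle step, intersecting two FSMs, can be sketched with the classic product construction. This is a minimal illustration using plain dictionaries, not greenery's own implementation; the dictionary layout and the names `intersect` and `accepts` are assumptions made for this example.

```python
from collections import deque

def intersect(dfa1, dfa2, alphabet):
    """Product construction: run both machines in lockstep.

    Each machine is a hypothetical dict with keys "initial", "finals", "map".
    A missing transition means the input is rejected.
    """
    start = (dfa1["initial"], dfa2["initial"])
    seen, finals, table = {start}, set(), {}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        s1, s2 = state
        # A pair is final only when both component states are final.
        if s1 in dfa1["finals"] and s2 in dfa2["finals"]:
            finals.add(state)
        table[state] = {}
        for symbol in alphabet:
            t1 = dfa1["map"].get(s1, {}).get(symbol)
            t2 = dfa2["map"].get(s2, {}).get(symbol)
            if t1 is None or t2 is None:
                continue  # one machine rejects, so the product rejects
            table[state][symbol] = (t1, t2)
            if (t1, t2) not in seen:
                seen.add((t1, t2))
                queue.append((t1, t2))
    return {"initial": start, "finals": finals, "map": table}

def accepts(dfa, symbols):
    state = dfa["initial"]
    for symbol in symbols:
        state = dfa["map"].get(state, {}).get(symbol)
        if state is None:
            return False
    return state in dfa["finals"]

# Machines for a* and b* over the alphabet {"a", "b"}.
a_star = {"initial": 0, "finals": {0}, "map": {0: {"a": 0}}}
b_star = {"initial": 0, "finals": {0}, "map": {0: {"b": 0}}}
both = intersect(a_star, b_star, {"a", "b"})
```

Here `both` accepts only the empty string, which agrees with the `a*` and `b*` example above.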

As such, greenery is divided into two libraries:

greenery.fsm

This module provides for the creation and manipulation of deterministic finite state machines.

Example

To do: a slightly more impressive example.


>>> from greenery import fsm
>>> a = fsm.fsm(
...     alphabet = {"a", "b"},
...     states   = {0, 1},
...     initial  = 0,
...     finals   = {1},
...     map      = {
...             0 : {"a" : 1},
...     },
... )
>>> print(a)
  name final? a b
------------------
* 0    False  1
  1    True
>>> a.accepts([])
False
>>> a.accepts(["a"])
True
>>> a.accepts(["b"])
False
>>> print(a.accepts(["c"]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "fsm.py", line 68, in accepts
    state = self.map[state][symbol]
KeyError: 'c'

Functions in this module

`fsm(alphabet, states, initial, finals, map)`

Constructor for an fsm object, as demonstrated above. fsm objects are intended to be immutable.

map may be sparse. If a transition is missing from map, then it is assumed that this transition leads to an undocumented "oblivion state" which is not final. This oblivion state does not appear when the FSM is printed out.

Ordinarily, you may only feed known alphabet symbols into the FSM. Any other symbol will result in an exception, as seen above. However, if you add the special symbol fsm.anything_else to your alphabet, then any unrecognised symbol will be automatically converted into fsm.anything_else before following whatever transition you have specified for this symbol.
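As a rough sketch of the two rules above (sparse maps and the catch-all symbol), here is a standalone `accepts` function. It is not greenery's code: `ANYTHING_ELSE` is a hypothetical stand-in for `fsm.anything_else`, and the order of the checks is an assumption.

```python
ANYTHING_ELSE = object()  # hypothetical stand-in for fsm.anything_else

def accepts(alphabet, initial, finals, transitions, symbols):
    state = initial
    for symbol in symbols:
        if symbol not in alphabet:
            if ANYTHING_ELSE not in alphabet:
                raise KeyError(symbol)   # unknown symbol, no catch-all
            symbol = ANYTHING_ELSE       # fold unknown symbols together
        if symbol not in transitions.get(state, {}):
            return False                 # missing transition: oblivion state
        state = transitions[state][symbol]
    return state in finals

# The machine from the example above: only "a" is accepted.
strict = dict(alphabet={"a", "b"}, initial=0, finals={1},
              transitions={0: {"a": 1}})

# A variant whose alphabet includes the catch-all symbol.
lenient = dict(alphabet={"a", ANYTHING_ELSE}, initial=0, finals={1},
               transitions={0: {"a": 1, ANYTHING_ELSE: 0}})
```

With `strict`, the input `["b"]` is rejected quietly while `["c"]` raises `KeyError`; with `lenient`, `["z", "a"]` is accepted because `"z"` is treated as `ANYTHING_ELSE`.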

`crawl(alphabet, initial, final, follow)`

Crawl what is assumed to be an FSM and return a new fsm object representing it. Starts at state initial. At any given state, crawl calls final(state) to determine whether it is final. Then, for each symbol in alphabet, it calls follow(state, symbol) to try to discover new states. Obviously this procedure could go on for ever if your implementation of follow is faulty. follow may also throw an OblivionError to indicate that you have reached an inescapable, non-final "oblivion state"; in this case, the transition will be omitted from the resulting FSM.
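The procedure can be sketched roughly as follows. This is an illustrative reimplementation under stated assumptions, not the library's source: it returns a plain dictionary rather than an `fsm` object, and it defines its own `OblivionError`.

```python
class OblivionError(Exception):
    """Raised by follow() to signal an inescapable, non-final state."""

def crawl(alphabet, initial, final, follow):
    states = [initial]   # discovered states; list index is the state number
    finals = set()
    transitions = {}
    i = 0
    while i < len(states):          # loops for ever if follow() never settles
        state = states[i]
        if final(state):
            finals.add(i)
        transitions[i] = {}
        for symbol in sorted(alphabet):
            try:
                target = follow(state, symbol)
            except OblivionError:
                continue            # omit the transition entirely
            if target not in states:
                states.append(target)
            transitions[i][symbol] = states.index(target)
        i += 1
    return {
        "alphabet": set(alphabet),
        "states": set(range(len(states))),
        "initial": 0,
        "finals": finals,
        "map": transitions,
    }

def follow(state, symbol):
    # Count occurrences of "a" modulo 3; any "b" is fatal.
    if symbol == "a":
        return (state + 1) % 3
    raise OblivionError

machine = crawl({"a", "b"}, 0, lambda state: state == 0, follow)
```

The resulting machine has three states and accepts strings of `"a"` whose length is a multiple of three; no transition on `"b"` appears in its map.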

`null(alphabet)`

Returns an FSM over the supplied alphabet which accepts no strings at all.

`epsilon(alphabet)`

Returns an FSM over the supplied alphabet which accepts only the empty string, "".

Methods on class `fsm`

An FSM accepts a possibly-infinite set of strings. With this in mind, fsm implements numerous methods like those on frozenset, as well as many FSM-specific methods. FSMs are immutable.

Method | Behaviour
---|---
`fsm1.accepts("a")` <br/> `"a" in fsm1` | Returns `True` or `False`, or throws an exception if the string contains a symbol which is not in the FSM's alphabet. The string should be an iterable of symbols.
`fsm1.strings()` <br/> `for string in fsm1` | Returns a generator of all the strings that this FSM accepts.
`fsm1.empty()` | Returns `True` if this FSM accepts no strings, otherwise `False`.
`fsm1.cardinality()` <br/> `len(fsm1)` | Returns the number of strings which the FSM accepts. Throws an `OverflowError` if this number is infinite.
`fsm1.equivalent(fsm2)` <br/> `fsm1 == fsm2` | Returns `True` if the two FSMs accept exactly the same strings, otherwise `False`.
`fsm1.different(fsm2)` <br/> `fsm1 != fsm2` | Returns `True` if the FSMs accept different strings, otherwise `False`.
`fsm1.issubset(fsm2)` <br/> `fsm1 <= fsm2` | Returns `True` if the set of strings accepted by `fsm1` is a subset of those accepted by `fsm2`, otherwise `False`.
`fsm1.ispropersubset(fsm2)` <br/> `fsm1 < fsm2` | Returns `True` if the set of strings accepted by `fsm1` is a proper subset of those accepted by `fsm2`, otherwise `False`.
`fsm1.issuperset(fsm2)` <br/> `fsm1 >= fsm2` | Returns `True` if the set of strings accepted by `fsm1` is a superset of those accepted by `fsm2`, otherwise `False`.
`fsm1.ispropersuperset(fsm2)` <br/> `fsm1 > fsm2` | Returns `True` if the set of strings accepted by `fsm1` is a proper superset of those accepted by `fsm2`, otherwise `False`.
`fsm1.isdisjoint(fsm2)` | Returns `True` if the set of strings accepted by `fsm1` is disjoint from those accepted by `fsm2`, otherwise `False`.
`fsm1.copy()` | Returns a copy of `fsm1`.
`fsm1.reduce()` | Returns an FSM which accepts exactly the same strings as `fsm1` but has a minimal number of states.
`fsm1.star()` | Returns a new FSM which is the Kleene star closure of the original. For example, if `fsm1` accepts only `"asdf"`, `fsm1.star()` accepts `""`, `"asdf"`, `"asdfasdf"`, `"asdfasdfasdf"`, and so on.
`fsm1.everythingbut()` | Returns an FSM which accepts every string not accepted by the original. `x.everythingbut().everythingbut()` accepts the same strings as `x` for all `fsm` objects `x`, but is not necessarily mechanically identical.
`fsm1.reversed()` <br/> `reversed(fsm1)` | Returns a reversed FSM. For each string that `fsm1` accepted, `reversed(fsm1)` will accept the reversed string. `reversed(reversed(x))` accepts the same strings as `x` for all `fsm` objects `x`, but is not necessarily mechanically identical.
`fsm1.times(7)` <br/> `fsm1 * 7` | Essentially, this is repeated self-concatenation. If `fsm1` only accepts `"z"`, `fsm1 * 7` only accepts `"zzzzzzz"`.
`fsm1.concatenate(fsm2, ...)` <br/> `fsm1 + fsm2 + ...` | Returns the concatenation of the FSMs. If `fsm1` accepts all strings in A and `fsm2` accepts all strings in B, then `fsm1 + fsm2` accepts all strings of the form a·b where a is in A and b is in B.
`fsm1.union(fsm2, ...)` <br/> `fsm1 \| fsm2 \| ...` | Returns an FSM accepting any string accepted by any input FSM. This is also called alternation.
`fsm1.intersection(fsm2, ...)` <br/> `fsm1 & fsm2 & ...` | Returns an FSM accepting any string accepted by all input FSMs.
`fsm1.difference(fsm2, ...)` <br/> `fsm1 - fsm2 - ...` | Subtracts the set of strings accepted by `fsm2` onwards from those accepted by `fsm1` and returns the resulting new FSM.
`fsm1.symmetric_difference(fsm2, ...)` <br/> `fsm1 ^ fsm2 ^ ...` | Returns an FSM accepting any string accepted by `fsm1` or `fsm2` but not both.
`fsm1.derive("a")` | Returns the Brzozowski derivative of the original FSM with respect to the input string. E.g. if `fsm1` only accepts `"ab"` or `"ac+"`, returns an FSM only accepting `"b"` or `"c+"`.
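To make the derivative concrete: on a deterministic machine, deriving with respect to a string amounts to walking that string from the initial state and starting from wherever you land. The sketch below uses plain dictionaries and hypothetical helper names; what greenery does when the prefix leads nowhere is not specified here, so returning an empty machine is an assumption.

```python
def accepts(dfa, symbols):
    state = dfa["initial"]
    for symbol in symbols:
        state = dfa["map"].get(state, {}).get(symbol)
        if state is None:
            return False
    return state in dfa["finals"]

def derive(dfa, prefix):
    """Return a machine accepting every s such that prefix + s is accepted."""
    state = dfa["initial"]
    for symbol in prefix:
        state = dfa["map"].get(state, {}).get(symbol)
        if state is None:
            # Nothing accepted starts with this prefix: accept no strings.
            return {"initial": 0, "finals": set(), "map": {}}
    return {"initial": state, "finals": dfa["finals"], "map": dfa["map"]}

# Accepts "ab" or "ac+" (that is: "ab", "ac", "acc", ...).
ab_or_ac_plus = {
    "initial": 0,
    "finals": {2, 3},
    "map": {0: {"a": 1}, 1: {"b": 2, "c": 3}, 3: {"c": 3}},
}
after_a = derive(ab_or_ac_plus, "a")
```

`after_a` accepts `"b"`, `"c"`, `"cc"` and so on, but neither `"ab"` nor the empty string.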

greenery.lego

This module provides methods for parsing a regular expression (i.e. a string) into a manipulable nested data structure, and for manipulating that data structure.

Note that this is an entirely different concept from that of simply creating and using those regexes, functionality which is present in basically every programming language in the world, Python included.

This module requires greenery.fsm in order to carry out many of its most important functions. (greenery.fsm, in comparison, is completely standalone.)

Classes in this module

`lego.bound`

A non-negative integer, or inf, plus a bunch of arithmetic methods which make it possible to compare, add and multiply them.

Special bounds

inf

`lego.multiplier`

A combination of a finite lower bound and a possibly-infinite upper bound, plus a bunch of methods which make it possible to compare, add and multiply them.

Special multipliers

zero (multiplier(bound(0), bound(0))) (has some occasional uses internally)
qm (multiplier(bound(0), bound(1)))
star (multiplier(bound(0), inf))
one (multiplier(bound(1), bound(1)))
plus (multiplier(bound(1), inf))
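A small sketch of how such multipliers might add together. The `Multiplier` class below is hypothetical and far simpler than `lego.multiplier`; it uses `math.inf` in place of `lego.inf` and covers only addition and rendering.

```python
import math
from dataclasses import dataclass

INF = math.inf  # hypothetical stand-in for lego.inf

@dataclass(frozen=True)
class Multiplier:
    lo: int      # finite lower bound
    hi: float    # upper bound, possibly INF

    def __add__(self, other):
        # x{a,b} followed by x{c,d} behaves like x{a+c,b+d}.
        return Multiplier(self.lo + other.lo, self.hi + other.hi)

    def __str__(self):
        shorthand = {(1, 1): "", (0, 1): "?", (0, INF): "*", (1, INF): "+"}
        if (self.lo, self.hi) in shorthand:
            return shorthand[(self.lo, self.hi)]
        if self.lo == self.hi:
            return "{%d}" % self.lo
        if self.hi == INF:
            return "{%d,}" % self.lo
        return "{%d,%d}" % (self.lo, self.hi)

QM = Multiplier(0, 1)
STAR = Multiplier(0, INF)
ONE = Multiplier(1, 1)
PLUS = Multiplier(1, INF)
```

For instance, `QM + QM` renders as `{0,2}`, and `STAR + ONE` equals `PLUS`.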

`lego.lego`

Parent class for charclass, mult, conc and pattern. In general, this represents a regular expression object.

`lego.charclass`

Represents a character class, e.g. a, [abc], [^xyz], \d.

Special character classes

w (charclass("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"))
d (charclass("0123456789"))
s (charclass("\t\n\v\f\r "))
W (any character except those matched by w)
D (any character except those matched by d)
S (any character except those matched by s)
dot (any character)
nothing (empty character class, no matches possible)

`lego.mult`

Represents a charclass or pattern combined with a multiplier, e.g. [abc]* or (a|bc)*.

`lego.conc`

Represents a sequence of zero or more mults, e.g. ab, [abc]*d.

Special concatenations

emptystring, the regular expression which only matches the empty string (conc())

`lego.pattern`

Represents an alternation between one or more concs, e.g. [abc]*d|e.
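The way these four classes nest can be pictured with a toy model. The classes below are hypothetical and heavily simplified (string multipliers, no negation, no escaping); they only show the recursion: a pattern holds concs, a conc holds mults, and a mult wraps either a charclass or another pattern.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class CharClass:
    chars: str

    def __str__(self):
        return self.chars if len(self.chars) == 1 else "[" + self.chars + "]"

@dataclass(frozen=True)
class Mult:
    multiplicand: object   # a CharClass or a Pattern
    multiplier: str        # "", "?", "*", "+", "{2,3}", ...

    def __str__(self):
        inner = str(self.multiplicand)
        if isinstance(self.multiplicand, Pattern):
            inner = "(" + inner + ")"   # nested patterns need parentheses
        return inner + self.multiplier

@dataclass(frozen=True)
class Conc:
    mults: Tuple[Mult, ...]

    def __str__(self):
        return "".join(str(m) for m in self.mults)

@dataclass(frozen=True)
class Pattern:
    concs: Tuple[Conc, ...]

    def __str__(self):
        return "|".join(str(c) for c in self.concs)

# [abc]*d|e, built by hand.
left = Conc((Mult(CharClass("abc"), "*"), Mult(CharClass("d"), "")))
right = Conc((Mult(CharClass("e"), ""),))
regex = Pattern((left, right))
nested = Mult(regex, "*")
```

Printing `regex` gives `[abc]*d|e`, and `nested` gives `([abc]*d|e)*`.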

Methods in this module

`lego.from_fsm()`

Uses the Brzozowski algebraic method to convert a greenery.fsm object into a lego object, which is a regular expression.

`lego.parse(string)`

Returns a lego object representing the regular expression in the string.

The following metacharacters and formations have their usual meanings: ., *, +, ?, {m}, {m,}, {m,n}, (), |, [], ^ within [] character ranges only, - within [] character ranges only, and \ to escape any of the preceding characters or itself.

These character escapes are possible: \t, \r, \n, \f, \v.

These predefined character sets also have their usual meanings: \w, \d, \s and their negations \W, \D, \S. The dot (.) matches any character, including new line characters and carriage returns.

An empty charclass [] is legal and matches no characters: when used in a regex, the regex may match no strings.

Unsupported constructs

This method is intentionally rigorously simple, and tolerates no ambiguity. For example, a hyphen must be escaped in a character class even if it appears first or last. [-abc] is a syntax error, write [\-abc]. Escaping something which doesn't need it is a syntax error too: [\ab] resolves to neither [\\ab] nor [ab].
The ^ and $ metacharacters are not supported. By default, greenery assumes that all regexes are anchored at the start and end of any input string. Carets and dollar signs will be parsed as themselves. If you want to not anchor at the start or end of the string, put .* at the start or end of your regex respectively.

This is because computing the intersection between .*a.* and .*b.* (1) is largely pointless and (2) usually results in gibberish coming out of the program.

The non-greedy operators *?, +?, ?? and {m,n}? are not supported, since they do not alter the regular language.
Parentheses are used to alternate between multiple possibilities e.g. (a|bc) only, not for capture grouping. Here's why:
```
  >>> print(parse("(ab)c") & parse("a(bc)"))
  abc
```

The (?:...) syntax for non-capturing groups is permitted, but does nothing.

Other (?...) constructs are not supported (and most are not regular in the computer science sense).
Back-references, such as ([aeiou])\1, are not regular.
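Several of the points above can be checked with Python's standard `re` module, without greenery. The helper `matches` below is a hypothetical name; it uses `re.fullmatch` to imitate implicit anchoring and `re.DOTALL` so that `.` also matches new lines.

```python
import re

def matches(regex, string):
    """Whole-string matching, imitating greenery's implicit anchoring."""
    return re.fullmatch(regex, string, flags=re.DOTALL) is not None

# Implicit anchoring: add .* on either side to match inside a longer string.
anchored = matches("abc", "xabcx")            # no match
unanchored = matches(".*abc.*", "xabcx")      # match

# Non-greedy operators change which match is preferred,
# not which whole strings are matched.
samples = ["", "a", "aa", "ab", "aab", "b", "ba"]
lazy_same_as_greedy = all(
    matches("a+?b?", s) == matches("a+b?", s) for s in samples
)

# Grouping does not change the language either.
groups_agree = (
    matches("(ab)c", "abc")
    and matches("a(bc)", "abc")
    and matches("(?:ab)c", "abc")
)
```

Under whole-string matching, the lazy and greedy forms agree on every sample, which is the sense in which they describe the same regular language.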

Methods on the `lego` class

All objects of class lego (charclass, mult, conc and pattern) share these methods.

Method | Behaviour
---|---
`lego1.to_fsm()` | Returns an `fsm` object, a finite state machine which recognises exactly the strings that the original regular expression can match. The majority of the other methods employ this one.
`lego1.matches("a")` <br/> `"a" in lego1` | Returns `True` if the regular expression matches the string or `False` if not.
`lego1.strings()` <br/> `for string in lego1` | Returns a generator of all the strings that this regular expression matches.
`lego1.empty()` | Returns `True` if this regular expression matches no strings, otherwise `False`.
`lego1.cardinality()` <br/> `len(lego1)` | Returns the number of strings which the regular expression matches. Throws an `OverflowError` if this number is infinite.
`lego1.equivalent(lego2)` | Returns `True` if the two regular expressions match exactly the same strings, otherwise `False`.
`lego1.copy()` | Returns a copy of `lego1`.
`lego1.everythingbut()` | Returns a regular expression which matches every string not matched by the original. `x.everythingbut().everythingbut()` matches the same strings as `x` for all `lego` objects `x`, but is not necessarily identical.
`lego1.reversed()` <br/> `reversed(lego1)` | Returns a reversed regular expression. For each string that `lego1` matched, `reversed(lego1)` will match the reversed string. `reversed(reversed(x))` matches the same strings as `x` for all `lego` objects `x`, but is not necessarily identical.
`lego1.times(star)` <br/> `lego1 * star` | Returns the input regular expression multiplied by any `multiplier`.
`lego1.concatenate(lego2, ...)` <br/> `lego1 + lego2 + ...` | Returns the concatenation of the regular expressions.
`lego1.union(lego2, ...)` <br/> `lego1 \| lego2 \| ...` | Returns the alternation of the two regular expressions.
`lego1.intersection(lego2, ...)` <br/> `lego1 & lego2 & ...` | Returns a regular expression matching any string matched by all input regular expressions. The successful implementation of this method was the ultimate goal of this entire project.
`lego1.difference(lego2, ...)` <br/> `lego1 - lego2 - ...` | Subtract the set of strings matched by `lego2` onwards from those matched by `lego1` and return the resulting regular expression.
`lego1.symmetric_difference(lego2, ...)` <br/> `lego1 ^ lego2 ^ ...` | Returns a regular expression matching any string accepted by `lego1` or `lego2` but not both.
`lego1.reduce()` | Returns a regular expression which matches exactly the same strings as `lego1` but is simplified as far as possible. See dedicated section below.
`lego1.derive("a")` | Return the Brzozowski derivative of the input regular expression with respect to "a".

`reduce()`

Call this method to try to simplify the regular expression object, according to the following patterns:

(ab|cd|ef|)g to (ab|cd|ef)?g
([ab])* to [ab]*
ab?b?c to ab{0,2}c
a(d(ab|a*c)) to ad(ab|a*c)
0|[2-9] to [02-9]
abc|ade to a(bc|de)
xyz|stz to (xy|st)z
abc()def to abcdef
a{1,2}|a{3,4} to a{1,4}
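Each rewrite above should leave the matched language unchanged. The sketch below spot-checks that by brute force using the standard `re` module rather than greenery; `same_language` is a hypothetical helper, and agreement on all strings up to a fixed length is evidence, not a proof.

```python
import re
from itertools import product

def same_language(regex1, regex2, max_len=6):
    """Compare two regexes on every string up to max_len characters."""
    # Test alphabet: every letter or digit appearing in either regex.
    alphabet = sorted({ch for ch in regex1 + regex2 if ch.isalnum()})
    r1, r2 = re.compile(regex1), re.compile(regex2)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(r1.fullmatch(s)) != bool(r2.fullmatch(s)):
                return False
    return True

REWRITES = [
    ("(ab|cd|ef|)g", "(ab|cd|ef)?g"),
    ("([ab])*", "[ab]*"),
    ("ab?b?c", "ab{0,2}c"),
    ("a(d(ab|a*c))", "ad(ab|a*c)"),
    ("0|[2-9]", "[02-9]"),
    ("abc|ade", "a(bc|de)"),
    ("xyz|stz", "(xy|st)z"),
    ("abc()def", "abcdef"),
    ("a{1,2}|a{3,4}", "a{1,4}"),
]
all_equivalent = all(same_language(before, after) for before, after in REWRITES)
```

As a sanity check on the helper itself, `same_language("a*", "a+")` is false, because only the first matches the empty string.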

The various reduce() methods are extensible.

Note that in a few cases this did not result in a shorter regular expression.

Name

I spent a long time trying to find an appropriate metaphor for what I was trying to do: "I need an X such that lots of Xs go together to make a Y, but lots of Ys go together to make an X". Unfortunately the real world doesn't seem to be recursive in this way so I plumped for "lego" as a basic catchall term for the various components that go together to make up a data structure.

This was a dumb idea in retrospect and it will be changed to greenery.re or greenery.rx in the near future. Vote now if you have an opinion.

Hi, I have found that many of the functions of the fsm class will fail on an unpickled fsm object. Below is an example of checking if one regex is a subset of another both before and after pickling and unpickling two fsm objects.

import pickle
from greenery import fsm, lego

r1 = "[A-Za-z0-9]{0,4}"
r2 = "[A-Za-z0-9]{0,3}"

l1: lego.lego = lego.parse(r1)
l2: lego.lego = lego.parse(r2)

f1: fsm.fsm = l1.to_fsm()
f2: fsm.fsm = l2.to_fsm()

if f2 < f1:
    print("r2 is a proper subset of r1")
else:
    print("r2 is NOT a proper subset of r1")

with open("/tmp/f1.bin", "wb") as f:
    pickle.dump(f1, f)

with open("/tmp/f2.bin", "wb") as f:
    pickle.dump(f2, f)

with open("/tmp/f1.bin", "rb") as f:
    f1_unpickled: fsm.fsm = pickle.load(f)

with open("/tmp/f2.bin", "rb") as f:
    f2_unpickled: fsm.fsm = pickle.load(f)

if f2_unpickled < f1_unpickled:
    print("r2 is a proper subset of r1")
else:
    print("r2 is NOT a proper subset of r1")

In the first if-statement it will correctly print out r2 is a proper subset of r1, but in the second one it will fail with the following traceback:

Traceback (most recent call last):
  File "/home/test/test/test.py", line 33, in <module>
    if f2_unpickled < f1_unpickled:
  File "/home/test/test/.venv/lib/python3.8/site-packages/greenery/fsm.py", line 615, in __lt__
    return self.ispropersubset(other)
  File "/home/test/test/.venv/lib/python3.8/site-packages/greenery/fsm.py", line 608, in ispropersubset
    return self <= other and self != other
  File "/home/test/test/.venv/lib/python3.8/site-packages/greenery/fsm.py", line 601, in __le__
    return self.issubset(other)
  File "/home/test/test/.venv/lib/python3.8/site-packages/greenery/fsm.py", line 594, in issubset
    return (self - other).empty()
  File "/home/test/test/.venv/lib/python3.8/site-packages/greenery/fsm.py", line 542, in __sub__
    return self.difference(other)
  File "/home/test/test/.venv/lib/python3.8/site-packages/greenery/fsm.py", line 539, in difference
    return parallel(fsms, lambda accepts: accepts[0] and not any(accepts[1:]))
  File "/home/test/test/.venv/lib/python3.8/site-packages/greenery/fsm.py", line 757, in parallel
    return crawl(alphabet, initial, final, follow).reduce()
  File "/home/test/test/.venv/lib/python3.8/site-packages/greenery/fsm.py", line 782, in crawl
    for symbol in sorted(alphabet, key=key):
TypeError: '<' not supported between instances of 'anything_else_cls' and 'str'

As can be seen it seems to be caused by not being able to sort the alphabet set because it cannot compare instances of anything_else_cls and str. I have found that casting the symbol variable to a string in the key function like below will fix the issue, but I don't know if it is the correct way to do it?

def key(symbol):
    '''Ensure `fsm.anything_else` always sorts last'''
    return (symbol is anything_else, str(symbol))
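For illustration, here is a standalone reproduction of the suspected mechanism, with no greenery import. It assumes the original key function was `(symbol is anything_else, symbol)` and that unpickling yields a new instance of the sentinel class instead of the module-level singleton; both are guesses based on the traceback. `AnythingElseCls` is a hypothetical stand-in, and a second instance plays the role of the unpickled object.

```python
class AnythingElseCls:
    """Hypothetical stand-in for greenery's anything_else_cls."""
    def __repr__(self):
        return "anything_else"

anything_else = AnythingElseCls()
# Stands in for what unpickling would produce: a distinct instance.
unpickled_copy = AnythingElseCls()

def key_by_identity(symbol):
    # Assumed original: relies on the sentinel being a singleton.
    return (symbol is anything_else, symbol)

def key_by_type(symbol):
    # Alternative: recognise the sentinel by type, compare the rest as text.
    return (isinstance(symbol, AnythingElseCls), str(symbol))

alphabet = {"a", "b", unpickled_copy}

try:
    sorted(alphabet, key=key_by_identity)
    identity_key_failed = False
except TypeError:
    # The copy is not the singleton, so it gets compared with a str.
    identity_key_failed = True

ordered = sorted(alphabet, key=key_by_type)
```

With the type-based key, the sentinel still sorts last even when it is not the original object. Another option, untested here, would be to make the sentinel class unpickle to the singleton.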

Strange behaviour
>>> print(parse("(\d{2})+") & parse("(\d{3})+") == parse("(\d{6})+"))
False

I don't see how \d(\d{6})*\d{5} differs from \d{6} (nor do I see why it's computed to be a minimal regex).
opened by polkovnikov-ph 8
Transition to setuptools from distutils

Distutils has poor support for packages arranged with the package source in the root project directory. Transition to 'setuptools', which supports this without warnings.

opened by pjkundert 8
Pylance sees greenery.lego.parse as returning NoReturn
This code fails to pass Pylance with a 'Cannot access member "everythingbut" for type "NoReturn"'

from greenery.lego import parse
test_regex = parse(".")
test_regex.everythingbut()

The heart of the issue is that the lego class has a number of methods that have no return type hint and are not marked with @abstractmethod, so Pylance infers that they never return and labels their result NoReturn.

Suggestions would be to add at least return type hints to the lego and pattern classes.
opened by Delwin9999 6
Are you interested in contributions?
I finally picked up greenery this afternoon, and have had a lot of fun with it. Before I spend much more time on it though I thought I'd check if you're interested in contributions:

I wrote an alternative string-to-lego parser which leans heavily on the CPython sre_parse module - https://github.com/qntm/greenery/compare/master...Zac-HD:pyregex. It supports every construct that can be re.compiled, but has somewhat worse errors at the moment (eg no context given when bailing on a groupref). Also needs more testing for eg repeats :smile:

I also wrote some property-based tests with Hypothesis, https://github.com/qntm/greenery/compare/master...Zac-HD:property-tests. This has already turned up a few bugs of the form x != parse(str(x)) for some lego object x, but there's little point looking for more if you don't consider this a bug worth fixing. (I originally started this on the parser branch, but similar problems seem to exist on master too)

Either way, thanks for a great little library and a fun evening poking at it!
opened by Zac-HD 5

Sorting on alphabet set fails on unpickled fsm object

opened by oliverhaagh 4

Unicode support

I've encountered some problems using this library with the Russian language. As far as I can see, the problem is in the usage of str's and str literals in code, as well as hardcoding character lists for English. I've managed to make some things work in Py2.7 by adding "from __future__ import unicode_literals" to code and tests, adding Russian to character lists and replacing str with unicode in some places. Maybe I'll come up with a pull request, but in the meantime I'd like to know whether you think it's really needed or you have any ideas on how to implement it in the most straightforward way.

opened by machinehead 4
Version 3.3.4 breaks support for Python versions earlier than 3.9
The reason is in fsm.py, line 59:

finals: set[state_type]

In Python versions earlier than 3.9 this causes an error:

TypeError: 'type' object is not subscriptable

This is because Type Hinting Generics in Standard Collections was only introduced in Python 3.9.
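One possible way to keep such annotations working on older interpreters is to use the generic aliases from `typing` instead of subscripting the built-in `set`. The sketch below is illustrative only: `state_type` is assumed to be `int` here, and the `Machine` class is hypothetical, not the real `fsm` class.

```python
from dataclasses import dataclass
from typing import Set  # subscriptable on Python versions before 3.9

state_type = int  # assumption; the real alias in fsm.py may differ

@dataclass(frozen=True)
class Machine:
    # `set[state_type]` here would raise TypeError at class creation
    # on Python 3.8 and earlier; `Set[state_type]` does not.
    states: Set[state_type]
    finals: Set[state_type]

machine = Machine(states={0, 1}, finals={1})
```

Alternatively, `from __future__ import annotations` (Python 3.7+) postpones evaluation of annotations, which also avoids the error for annotations like this one.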
opened by zivnevo 3
Use frozen dataclasses and type annotations

The main benefit of using the more established convention @dataclass(frozen=True) and object.__setattr__(self, "attribute", set(self.attribute)) instead of the current self.__dict__["attribute"] = attribute is that it allows for better static inspection by IDEs like PyCharm and VSCode.
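A minimal sketch of the convention described above, using a hypothetical `CharClass` with illustrative field names. It uses `frozenset` rather than `set` so that instances stay hashable; that choice is an assumption of this example.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Iterable

@dataclass(frozen=True)
class CharClass:
    chars: Iterable[str]
    negated: bool = False

    def __post_init__(self):
        # Frozen dataclasses reject ordinary assignment, so normalising a
        # field after construction has to go through object.__setattr__.
        object.__setattr__(self, "chars", frozenset(self.chars))

vowels = CharClass("aeiou")

try:
    vowels.negated = True
    mutation_blocked = False
except FrozenInstanceError:
    mutation_blocked = True
```

Because the fields are declared in the class body, static checkers can see them, and later attempts to assign to them raise `FrozenInstanceError`.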

This only took me an hour of work, so I'm fine if you reject this MR because this type of code feels unnatural to you and/or you don't use an IDE.

I'm also happy to discuss this MR further if you'd like more explanation behind this MR.

opened by Seanny123 3
Why are instance attributes initialized with `self.__dict__`?
Less of an issue, more of a curiosity.

In all the class __init__(self) methods, there is something like:

self.__dict__["chars"] = chars
self.__dict__["negated"] = negateMe

Instead of:

self.chars = chars
self.negated = negateMe

This breaks static checks, but since this is mostly your repository I don't think that's a problem. However, I am curious why you initialize attributes this way. Does it give some special functionality that I missed?
enhancement question
opened by Seanny123 3
fsm.union() with no args fails, on Python 2; causes failing test case in fsm_test.py
Calling fsm.union() with no arguments fails on Python 2.7:

>>> from greenery.fsm import fsm
>>> fsm.union()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unbound method union() must be called with fsm instance as first argument (got nothing instead)

The method documentation makes it sound like it should succeed. This succeeds on Python 3.5.

This causes one of the test cases in fsm_test.py to fail, on Python 2.7:

    def test_new_set_methods(a, b):
>       assert len(fsm.union()) == 0
E       TypeError: unbound method union() must be called with fsm instance as first argument (got nothing instead)

In contrast, all tests pass for fsm_test.py on Python 3.5. I'm using version 3.0 of the greenery library, as installed via Pip.
opened by davidwagner 3
Allow strings of length 2 or more for FSMs being converted to regexes

At present the fsm.lego() method only works if your FSM's alphabet contains only single-character strings (or lego.otherchars). It would be neat to also accept longer strings here. (Maybe even zero-character strings, and more complex lego objects?)
enhancement

opened by qntm 3