Overview

Lark - a parsing toolkit for Python

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

Lark can parse all context-free languages. Put simply, that means it can parse almost any programming language out there, and to some degree most natural languages too.

Who is it for?

  • Beginners: Lark is very friendly for experimentation. It can parse any grammar you throw at it, no matter how complicated or ambiguous, and do so efficiently. It also constructs an annotated parse-tree for you, using only the grammar and an input, and it gives you convenient and flexible tools to process that parse-tree.

  • Experts: Lark implements both Earley (SPPF) and LALR(1), and several different lexers, so you can trade off power and speed according to your requirements. It also provides a variety of sophisticated features and utilities.

What can it do?

  • Parse all context-free grammars, and handle any ambiguity gracefully
  • Build an annotated parse-tree automagically, no construction code required.
  • Provide first-rate performance in terms of both Big-O complexity and measured run-time (considering that this is Python ;)
  • Run on every Python interpreter (it's pure-python)
  • Generate a stand-alone parser (for LALR(1) grammars)

And many more features. Read ahead and find out!

Most importantly, Lark will save you time and prevent you from getting parsing headaches.

Install Lark

$ pip install lark --upgrade

Lark has no dependencies.

Syntax Highlighting

Lark provides syntax highlighting for its grammar files (*.lark).

Clones

These are implementations of Lark in other languages. They accept Lark grammars, and provide similar utilities.

Hello World

Here is a little program to parse "Hello, World!" (Or any other similar phrase):

from lark import Lark

l = Lark('''start: WORD "," WORD "!"

            %import common.WORD   // imports from terminal library
            %ignore " "           // Disregard spaces in text
         ''')

print( l.parse("Hello, World!") )

And the output is:

Tree(start, [Token(WORD, 'Hello'), Token(WORD, 'World')])

Notice punctuation doesn't appear in the resulting tree. It's automatically filtered away by Lark.

Fruit flies like bananas

Lark is great at handling ambiguity. Here is the result of parsing the phrase "fruit flies like bananas":

(figure: the ambiguous parse of "fruit flies like bananas")

Read the code here, and see more examples here.

List of main features

  • Builds a parse-tree (AST) automagically, based on the structure of the grammar
  • Earley parser
    • Can parse all context-free grammars
    • Full support for ambiguous grammars
  • LALR(1) parser
    • Fast and light, competitive with PLY
    • Can generate a stand-alone parser (read more)
  • CYK parser, for highly ambiguous grammars
  • EBNF grammar
  • Unicode fully supported
  • Python 2 & 3 compatible
  • Automatic line & column tracking
  • Standard library of terminals (strings, numbers, names, etc.)
  • Import grammars from Nearley.js (read more)
  • Extensive test suite
  • MyPy support using type stubs
  • And much more!

See the full list of features here

Comparison to other libraries

Performance comparison

Lark is the fastest and lightest (lower is better)

Run-time Comparison

Memory Usage Comparison

Check out the JSON tutorial for more details on how the comparison was made.

Note: I really wanted to add PLY to the benchmark, but I couldn't find a working JSON parser anywhere written in PLY. If anyone can point me to one that actually works, I would be happy to add it!

Note 2: The parsimonious code has been optimized for this specific test, unlike the other benchmarks (Lark included). Its "real-world" performance may not be as good.

Feature comparison

Library       Algorithm        Grammar      Builds tree?  Supports ambiguity?  Can handle every CFG?  Line/Column tracking  Generates Stand-alone
Lark          Earley/LALR(1)   EBNF         Yes!          Yes!                 Yes!                   Yes!                  Yes! (LALR only)
PLY           LALR(1)          BNF          No            No                   No                     No                    No
PyParsing     PEG              Combinators  No            No                   No*                    No                    No
Parsley       PEG              EBNF         No            No                   No*                    No                    No
Parsimonious  PEG              EBNF         Yes           No                   No*                    No                    No
ANTLR         LL(*)            EBNF         Yes           No                   Yes?                   Yes                   No

(* PEGs cannot handle non-deterministic grammars. Also, according to Wikipedia, it remains unanswered whether PEGs can really parse all deterministic CFGs)

Projects using Lark

  • Poetry - A utility for dependency management and packaging
  • tartiflette - a GraphQL server by Dailymotion
  • PyQuil - Python library for quantum programming using Quil
  • Preql - An interpreted relational query language that compiles to SQL
  • Hypothesis - Library for property-based testing
  • mappyfile - a MapFile parser for working with MapServer configuration
  • synapse - an intelligence analysis platform
  • Datacube-core - Open Data Cube analyses continental scale Earth Observation data through time
  • SPFlow - Library for Sum-Product Networks
  • Torchani - Accurate Neural Network Potential on PyTorch
  • Command-Block-Assembly - An assembly language, and C compiler, for Minecraft commands
  • EQL - Event Query Language
  • Fabric-SDK-Py - Hyperledger fabric SDK with Python 3.x
  • required - multi-field validation using docstrings
  • miniwdl - A static analysis toolkit for the Workflow Description Language
  • pytreeview - a lightweight tree-based grammar explorer
  • harmalysis - A language for harmonic analysis and music theory
  • gersemi - A CMake code formatter

Using Lark? Send me a message and I'll add your project!

License

Lark uses the MIT license.

(The standalone tool is under MPL2)

Contribute

Lark is currently accepting pull requests. See How to develop Lark.

Sponsor

If you like Lark, and want to see it grow, please consider sponsoring us!

Contact the author

Questions about code are best asked on gitter or in the issues.

For anything else, I can be reached by email at erezshin at gmail com.

-- Erez

Comments
  • Bug in handling ambiguity?

    When running this code:

    grammar = """
    expression: "c" | "d" | "c" "d"
    unit: expression "a"
        | "a" expression
        | "b" unit
        | "b" expression
    start: unit*
    
    %import common.WS
    %ignore WS
    """
    
    l = Lark(grammar, parser='earley', ambiguity='explicit')
    print(l.parse('b c d a a c').pretty())
    

    It is expected to have an ambiguous parse, but there is no '_ambig' node.

    At least these options are valid:

    unit(
        b
        unit(
            expression(
                c
                d
            )
            a
        )
    )
    unit(
        a
        expression(
            c
        )
    )
    

    and this parse:

    unit(
        b
        expression(
            c
        )
    )
    unit(
        expression(
            d
        )
        a
    )
    unit(
        a
        expression(
            c
        )
    )
    

    The only parse that comes back is the second one. When one removes the "b" expression option, you get the first one.

    bug 
    opened by uriva 67
  • Lark runs on Pyodide! (Online IDE)

    Lark runs out-of-the-box inside the browser using Pyodide:

    (screenshot: Lark running in Pyodide in the browser)

    Pyodide is a CPython 3.7 interpreter compiled to web-assembly (wasm). Here's the Python console from above: https://pyodide.cdn.iodide.io/console.html

    Maybe this can be helpful as a quick start for anyone who wants to get into Lark quickly?

    discussion 
    opened by phorward 36
  • 0.11.2: pytest is failing

    I'm trying to package your module as an rpm package, so I'm using the typical build, install, and test cycle for building a package from a non-root account:

    • "setup.py build"
    • "setup.py install --root </install/prefix>"
    • "pytest" with PYTHONPATH pointing to sitearch and sitelib inside </install/prefix>

    May I ask for help, because a few test units are failing:

    + PYTHONPATH=/home/tkloczko/rpmbuild/BUILDROOT/python-lark-parser-0.11.3-2.fc35.x86_64/usr/lib64/python3.8/site-packages:/home/tkloczko/rpmbuild/BUILDROOT/python-lark-parser-0.11.3-2.fc35.x86_64/usr/lib/python3.8/site-packages
    + /usr/bin/pytest -ra
    =========================================================================== test session starts ============================================================================
    platform linux -- Python 3.8.11, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
    benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
    Using --randomly-seed=2126451817
    rootdir: /home/tkloczko/rpmbuild/BUILD/lark-0.11.3
    plugins: forked-1.3.0, shutil-1.7.0, virtualenv-1.7.0, expect-1.1.0, flake8-1.0.7, timeout-1.4.2, betamax-0.8.1, freezegun-0.4.2, aspectlib-1.5.2, toolbox-0.5, rerunfailures-9.1.1, requests-mock-1.9.3, cov-2.12.1, pyfakefs-4.5.0, flaky-3.7.0, benchmark-3.4.1, xdist-2.3.0, pylama-7.7.1, datadir-1.3.1, regressions-2.2.0, cases-3.6.3, xprocess-0.18.1, black-0.3.12, checkdocs-2.7.1, anyio-3.3.0, Faker-8.11.0, asyncio-0.15.1, trio-0.7.0, httpbin-1.0.0, subtests-0.5.0, isort-2.0.0, hypothesis-6.14.6, mock-3.6.1, profiling-1.7.0, randomly-3.8.0
    collected 998 items
    
    tests/test_tools.py ....                                                                                                                                             [  0%]
    tests/test_logger.py ...                                                                                                                                             [  0%]
    tests/test_reconstructor.py .......                                                                                                                                  [  1%]
    tests/test_trees.py ..............                                                                                                                                   [  2%]
    tests/test_parser.py ...............s.....ss.s.s......ss.....ss..s..s.......s......s.s.s...s.....s...s.s......s...s.....s......s................s...s...s........... [ 17%]
    ..s...................s.........s.....s...................s...s..s...s........s...................s.........s....................................................... [ 33%]
    ...........s....s.......s........................s.................s.............s..s...................s....s.s...ss......................s...............s..s.s... [ 50%]
    .........s..s...s....................s..................s..........s...s................s.........s..s..s.....s........s.....s.s.......s......s......s....s......... [ 66%]
    ...........s............s.....s....................s.s............................s.......s....ss..ss..s...........s.ss......s...............s.s........s.s.s...s.s. [ 82%]
    ....ss...............s.......s.........................s....s............s..........s..........................................                                      [ 95%]
    tests/test_lexer.py .                                                                                                                                                [ 95%]
    tests/test_nearley/test_nearley.py ..FF...F                                                                                                                          [ 96%]
    tests/test_cache.py ....                                                                                                                                             [ 96%]
    . .                                                                                                                                                                  [ 97%]
    tests/test_cache.py F.                                                                                                                                               [ 97%]
    tests/test_grammar.py .......F.......                                                                                                                                [ 98%]
    tests/test_tree_forest_transformer.py ............                                                                                                                   [100%]
    
    ================================================================================= FAILURES =================================================================================
    _________________________________________________________________________ TestNearley.test_include _________________________________________________________________________
    
    self = <tests.test_nearley.test_nearley.TestNearley testMethod=test_include>
    
        def test_include(self):
            fn = os.path.join(NEARLEY_PATH, 'test/grammars/folder-test.ne')
    >       with open(fn) as f:
    E       FileNotFoundError: [Errno 2] No such file or directory: '/home/tkloczko/rpmbuild/BUILD/lark-0.11.3/tests/test_nearley/nearley/test/grammars/folder-test.ne'
    
    tests/test_nearley/test_nearley.py:48: FileNotFoundError
    ______________________________________________________________________ TestNearley.test_multi_include ______________________________________________________________________
    
    self = <tests.test_nearley.test_nearley.TestNearley testMethod=test_multi_include>
    
        def test_multi_include(self):
            fn = os.path.join(NEARLEY_PATH, 'test/grammars/multi-include-test.ne')
    >       with open(fn) as f:
    E       FileNotFoundError: [Errno 2] No such file or directory: '/home/tkloczko/rpmbuild/BUILD/lark-0.11.3/tests/test_nearley/nearley/test/grammars/multi-include-test.ne'
    
    tests/test_nearley/test_nearley.py:61: FileNotFoundError
    ___________________________________________________________________________ TestNearley.test_css ___________________________________________________________________________
    
    self = <tests.test_nearley.test_nearley.TestNearley testMethod=test_css>
    
        def test_css(self):
            fn = os.path.join(NEARLEY_PATH, 'examples/csscolor.ne')
    >       with open(fn) as f:
    E       FileNotFoundError: [Errno 2] No such file or directory: '/home/tkloczko/rpmbuild/BUILD/lark-0.11.3/tests/test_nearley/nearley/examples/csscolor.ne'
    
    tests/test_nearley/test_nearley.py:28: FileNotFoundError
    __________________________________________________________________________ TestCache.test_imports __________________________________________________________________________
    
    self = <tests.test_cache.TestCache testMethod=test_imports>
    
        def test_imports(self):
            g = """
            %import .grammars.ab (startab, expr)
            """
    >       parser = Lark(g, parser='lalr', start='startab', cache=True)
    
    tests/test_cache.py:131:
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    lark/lark.py:299: in __init__
        self.grammar, used_files = load_grammar(grammar, self.source_path, self.options.import_paths, self.options.keep_all_tokens)
    lark/load_grammar.py:1229: in load_grammar
        builder.load_grammar(grammar, source)
    lark/load_grammar.py:1082: in load_grammar
        self.do_import(dotted_path, base_path, aliases, mangle)
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    
    self = <lark.load_grammar.GrammarBuilder object at 0x7f9944b73640>, dotted_path = (Token('RULE', 'grammars'), Token('RULE', 'ab')), base_path = '/usr/bin'
    aliases = {Token('RULE', 'expr'): Token('RULE', 'expr'), Token('RULE', 'startab'): Token('RULE', 'startab')}, base_mangle = None
    
        def do_import(self, dotted_path, base_path, aliases, base_mangle=None):
            assert dotted_path
            mangle = _get_mangle('__'.join(dotted_path), aliases, base_mangle)
            grammar_path = os.path.join(*dotted_path) + EXT
            to_try = self.import_paths + ([base_path] if base_path is not None else []) + [stdlib_loader]
            for source in to_try:
                try:
                    if callable(source):
                        joined_path, text = source(base_path, grammar_path)
                    else:
                        joined_path = os.path.join(source, grammar_path)
                        with open(joined_path, encoding='utf8') as f:
                            text = f.read()
                except IOError:
                    continue
                else:
                    h = hashlib.md5(text.encode('utf8')).hexdigest()
                    if self.used_files.get(joined_path, h) != h:
                        raise RuntimeError("Grammar file was changed during importing")
                    self.used_files[joined_path] = h
    
                    gb = GrammarBuilder(self.global_keep_all_tokens, self.import_paths, self.used_files)
                    gb.load_grammar(text, joined_path, mangle)
                    gb._remove_unused(map(mangle, aliases))
                    for name in gb._definitions:
                        if name in self._definitions:
                            raise GrammarError("Cannot import '%s' from '%s': Symbol already defined." % (name, grammar_path))
    
                    self._definitions.update(**gb._definitions)
                    break
            else:
                # Search failed. Make Python throw a nice error.
    >           open(grammar_path, encoding='utf8')
    E           FileNotFoundError: [Errno 2] No such file or directory: 'grammars/ab.lark'
    
    lark/load_grammar.py:1162: FileNotFoundError
    ______________________________________________________________________ TestGrammar.test_override_rule ______________________________________________________________________
    
    self = <tests.test_grammar.TestGrammar testMethod=test_override_rule>
    
        def test_override_rule(self):
            # Overrides the 'sep' template in existing grammar to add an optional terminating delimiter
            # Thus extending it beyond its original capacity
            p = Lark("""
                %import .test_templates_import (start, sep)
    
                %override sep{item, delim}: item (delim item)* delim?
                %ignore " "
            """, source_path=__file__)
    
            a = p.parse('[1, 2, 3]')
            b = p.parse('[1, 2, 3, ]')
            assert a == b
    
    >       self.assertRaises(GrammarError, Lark, """
                %import .test_templates_import (start, sep)
    
                %override sep{item}: item (delim item)* delim?
            """)
    
    tests/test_grammar.py:39:
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    lark/lark.py:299: in __init__
        self.grammar, used_files = load_grammar(grammar, self.source_path, self.options.import_paths, self.options.keep_all_tokens)
    lark/load_grammar.py:1229: in load_grammar
        builder.load_grammar(grammar, source)
    lark/load_grammar.py:1082: in load_grammar
        self.do_import(dotted_path, base_path, aliases, mangle)
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    
        def do_import(self, dotted_path, base_path, aliases, base_mangle=None):
            assert dotted_path
            mangle = _get_mangle('__'.join(dotted_path), aliases, base_mangle)
            grammar_path = os.path.join(*dotted_path) + EXT
            to_try = self.import_paths + ([base_path] if base_path is not None else []) + [stdlib_loader]
            for source in to_try:
                try:
                    if callable(source):
                        joined_path, text = source(base_path, grammar_path)
                    else:
                        joined_path = os.path.join(source, grammar_path)
                        with open(joined_path, encoding='utf8') as f:
                            text = f.read()
                except IOError:
                    continue
                else:
                    h = hashlib.md5(text.encode('utf8')).hexdigest()
                    if self.used_files.get(joined_path, h) != h:
                        raise RuntimeError("Grammar file was changed during importing")
                    self.used_files[joined_path] = h
    
                    gb = GrammarBuilder(self.global_keep_all_tokens, self.import_paths, self.used_files)
                    gb.load_grammar(text, joined_path, mangle)
                    gb._remove_unused(map(mangle, aliases))
                    for name in gb._definitions:
                        if name in self._definitions:
                            raise GrammarError("Cannot import '%s' from '%s': Symbol already defined." % (name, grammar_path))
    
                    self._definitions.update(**gb._definitions)
                    break
            else:
                # Search failed. Make Python throw a nice error.
    >           open(grammar_path, encoding='utf8')
    E           FileNotFoundError: [Errno 2] No such file or directory: 'test_templates_import.lark'
    
    lark/load_grammar.py:1162: FileNotFoundError
    ============================================================================= warnings summary =============================================================================
    tests/test_cache.py:110
      /home/tkloczko/rpmbuild/BUILD/lark-0.11.3/tests/test_cache.py:110: DeprecationWarning: invalid escape sequence \d
        g = """
    
    tests/test_cache.py:48
      /home/tkloczko/rpmbuild/BUILD/lark-0.11.3/tests/test_cache.py:48: PytestCollectionWarning: cannot collect test class 'TestT' because it has a __init__ constructor (from: tests/test_cache.py)
        class TestT(Transformer):
    
    tests/test_parser.py:166
      /home/tkloczko/rpmbuild/BUILD/lark-0.11.3/tests/test_parser.py:166: DeprecationWarning: invalid escape sequence \d
        g = """
    
    tests/test_reconstructor.py:75
      /home/tkloczko/rpmbuild/BUILD/lark-0.11.3/tests/test_reconstructor.py:75: DeprecationWarning: invalid escape sequence \s
        g = """
    
    tests/test_reconstructor.py:90
      /home/tkloczko/rpmbuild/BUILD/lark-0.11.3/tests/test_reconstructor.py:90: DeprecationWarning: invalid escape sequence \s
        g = """
    
    tests/test_reconstructor.py:154
      /home/tkloczko/rpmbuild/BUILD/lark-0.11.3/tests/test_reconstructor.py:154: DeprecationWarning: invalid escape sequence \s
        g1 = """
    
    tests/test_reconstructor.py:162
      /home/tkloczko/rpmbuild/BUILD/lark-0.11.3/tests/test_reconstructor.py:162: DeprecationWarning: invalid escape sequence \s
        g2 = """
    
    -- Docs: https://docs.pytest.org/en/stable/warnings.html
    ========================================================================= short test summary info ==========================================================================
    SKIPPED [7] tests/test_parser.py:2005: Currently only Earley supports priority sum in rules
    SKIPPED [2] tests/test_parser.py:2077: No empty rules
    SKIPPED [7] tests/test_parser.py:2309: Serialize currently only works for LALR parsers without custom lexers (though it should be easy to extend)
    SKIPPED [9] tests/test_parser.py:1045: cStringIO not available
    SKIPPED [3] tests/test_parser.py:2355: match_examples() not supported for CYK/old custom lexer
    SKIPPED [9] tests/test_parser.py:1249: Flattening list isn't implemented (and may never be)
    SKIPPED [2] tests/test_parser.py:1961: Doesn't work for CYK
    SKIPPED [2] tests/test_parser.py:2231: Empty rules
    SKIPPED [2] tests/test_parser.py:2220: Empty rules
    SKIPPED [2] tests/test_parser.py:1120: Takes forever
    SKIPPED [9] tests/test_parser.py:1265: Flattening list isn't implemented (and may never be)
    SKIPPED [6] tests/test_parser.py:1705: Only standard lexers care about token priority
    SKIPPED [2] tests/test_parser.py:1512: No empty rules
    SKIPPED [2] tests/test_parser.py:1194: No empty rules
    SKIPPED [2] tests/test_parser.py:1650: No empty rules
    SKIPPED [6] tests/test_parser.py:2435: interactive_parser error handling only works with LALR for now
    SKIPPED [6] tests/test_parser.py:2398: interactive_parser is only implemented for LALR at the moment
    SKIPPED [2] tests/test_parser.py:1451: No empty rules
    SKIPPED [9] tests/test_parser.py:1281: Flattening list isn't implemented (and may never be)
    SKIPPED [2] tests/test_parser.py:1213: No empty rules
    SKIPPED [2] tests/test_parser.py:1233: No empty rules
    SKIPPED [4] tests/test_parser.py:2194: Priority not handled correctly right now
    SKIPPED [2] tests/test_parser.py:1915: %declare/postlex doesn't work with dynamic
    SKIPPED [2] tests/test_parser.py:1938: %declare/postlex doesn't work with dynamic
    SKIPPED [1] tests/test_parser.py:754: Only relevant for the dynamic_complete parser
    SKIPPED [1] tests/test_parser.py:402: Only relevant for the dynamic_complete parser
    FAILED tests/test_nearley/test_nearley.py::TestNearley::test_include - FileNotFoundError: [Errno 2] No such file or directory: '/home/tkloczko/rpmbuild/BUILD/lark-0.11.3...
    FAILED tests/test_nearley/test_nearley.py::TestNearley::test_multi_include - FileNotFoundError: [Errno 2] No such file or directory: '/home/tkloczko/rpmbuild/BUILD/lark-...
    FAILED tests/test_nearley/test_nearley.py::TestNearley::test_css - FileNotFoundError: [Errno 2] No such file or directory: '/home/tkloczko/rpmbuild/BUILD/lark-0.11.3/tes...
    FAILED tests/test_cache.py::TestCache::test_imports - FileNotFoundError: [Errno 2] No such file or directory: 'grammars/ab.lark'
    FAILED tests/test_grammar.py::TestGrammar::test_override_rule - FileNotFoundError: [Errno 2] No such file or directory: 'test_templates_import.lark'
    ========================================================= 5 failed, 889 passed, 103 skipped, 7 warnings in 44.97s ==========================================================
    pytest-xprocess reminder::Be sure to terminate the started process by running 'pytest --xkill' if you have not explicitly done so in your fixture with 'xprocess.getinfo(<process_name>).terminate()'.
    
    opened by kloczek 35
  • Fix #696 now providing the correct amount of placeholders

    p = Lark("""!start: ["a" "b" "c"] """, maybe_placeholders=True)
    p.parse("").children
    

    now returns [None, None, None] instead of [None]

    same for !start: ["a" ["b" "c"]].

    opened by ornariece 32
  • Changing file-extension for standalone grammar definitions from .g?

    Currently standalone files like common and the example json use the file extension .g. However, it looks like .g is already associated with the ANTLR parser generator. While I suppose it's possible to make Lark compatible with ANTLR, in the meantime it's probably best to use a different file extension. I would propose the extension .lrk, as it doesn't seem to be used by anything currently.

    I'm about to submit a pull request to change the file extensions to .lrk in the relevant file names and in the code referencing the .g extension. It's on a separate branch so if you want to use a different extension that should be easy enough to change.

    discussion 
    opened by RobRoseKnows 31
  • Fix `python.number` pattern

    Python doesn't accept numbers with a _ at the beginning or end, or with more than one _ between digits:

    >>> 69420
    69420
    >>> 69_420
    69420
    >>> 69__420
      File "<stdin>", line 1
        69__420
          ^
    SyntaxError: invalid decimal literal
    >>> 69_420_
      File "<stdin>", line 1
        69_420_
              ^
    SyntaxError: invalid decimal literal
    >>> _69_420
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    NameError: name '_69_420' is not defined
    
    >>> 03.1415
    3.1415
    >>> 0_3.14_15
    3.1415
    >>> 0__3.14_15
      File "<stdin>", line 1
        0__3.14_15
         ^
    SyntaxError: invalid decimal literal
    >>> 0_3.14__15
      File "<stdin>", line 1
        0_3.14__15
              ^
    SyntaxError: invalid decimal literal
    >>> 0_3.14_15_
      File "<stdin>", line 1
        0_3.14_15_
                 ^
    SyntaxError: invalid decimal literal
    >>> 0_3._14_15
      File "<stdin>", line 1
        0_3._14_15
           ^
    SyntaxError: invalid decimal literal
    >>> 0_3_.14_15
      File "<stdin>", line 1
        0_3_.14_15
           ^
    SyntaxError: invalid decimal literal
    >>> _0_3.14_15
      File "<stdin>", line 1
        _0_3.14_15
        ^^^^^^^^^^
    SyntaxError: invalid syntax. Perhaps you forgot a comma?
    

    The same goes for complex numbers. And yes, Python recognizes _xxx as a name even when x is a digit, but it's still not a number, so this doesn't affect us.

    The current implementation only filters numbers with _ in the beginning, so here's the fix for the other cases.


    Hopefully we can still make backward-incompatible changes, so it's fine to change IMAG_NUMBER to COMPLEX_NUMBER.

    I also tested \d(?:_?\d+)* for DEC_NUMBER but haven't seen any significant performance changes (everything is within the normal range, considering that I was not using a stable environment).
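
The candidate pattern mentioned above can be sanity-checked directly with Python's re module; a quick sketch of the rule it encodes:

```python
import re

# \d(?:_?\d+)* : digit groups separated by at most one underscore,
# with no leading or trailing underscore.
DEC = re.compile(r'\d(?:_?\d+)*')

for ok in ("69420", "69_420", "0_3"):
    assert DEC.fullmatch(ok)
for bad in ("69__420", "69_420_", "_69_420"):
    assert not DEC.fullmatch(bad)
print("pattern behaves as described")
```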

    opened by 0dminnimda 30
  • Newbie questions

    Consider the below snippet:

    from lark import Lark, inline_args, Transformer
    
    grammars = [
        """
            ?start: sum | NAME "=" sum
            ?sum: product | sum "+" product | sum "-" product
            ?product: atom | product "*" atom | product "/" atom
            ?atom: NUMBER | "-" atom | NAME | "(" sum ")"
            %import common.CNAME -> NAME
            %import common.NUMBER
            %import common.WS_INLINE
            %ignore WS_INLINE
        """,
        """
            ?start: sum | NAME "=" sum
            ?sum: product | sum "+" product | sum "-" product
            ?product: atom | product "*" atom | product "/" atom
            ?atom: NUMBER | "-" atom | NAME | "(" sum ")"
            EQUAL: "="
            LPAR: "("
            RPAR: ")"
            SLASH: "/"
            STAR: "*"
            MINUS: "-"
            PLUS: "+"
            %import common.CNAME -> NAME
            %import common.NUMBER
            %import common.WS_INLINE
            %ignore WS_INLINE
        """,
        """
            ?start: sum | NAME "=" sum
            ?sum: product | sum "+" product | sum "-" product
            ?product: atom | product "*" atom | product "/" atom
            ?atom: NUMBER | "-" atom | NAME | "(" sum ")"
            OPERATOR : "=" | "(" | ")" | "/" | "*" | "-" | "+"
            %import common.CNAME -> NAME
            %import common.NUMBER
            %import common.WS_INLINE
            %ignore WS_INLINE
        """
    ]
    
    
    def test(grammar, text):
        parser = Lark(grammar, start='start')
        # print(parser.parse(text).pretty())
        print(sorted(list(set([t.type for t in parser.lex(text)]))))
        # print([t.name for t in parser.lexer.tokens])
    
    
    text = "x = 1+2 - 3-4 - 5*6 - 7/8 - (9+10-11*12/13)"
    for i, grammar in enumerate(grammars):
        print('grammar {}'.format(i).center(80, '*'))
        test(grammar, text)
    

    whose output is:

    ***********************************grammar 0************************************
    ['NAME', 'NUMBER', '__EQUAL', '__LPAR', '__MINUS', '__PLUS', '__RPAR', '__SLASH', '__STAR']
    ***********************************grammar 1************************************
    ['EQUAL', 'LPAR', 'MINUS', 'NAME', 'NUMBER', 'PLUS', 'RPAR', 'SLASH', 'STAR']
    ***********************************grammar 2************************************
    ['NAME', 'NUMBER', '__EQUAL', '__LPAR', '__MINUS', '__PLUS', '__RPAR', '__SLASH', '__STAR']
    

    got some questions:

    1. About grammar 0: the token types '__EQUAL', '__LPAR', '__MINUS', '__PLUS', '__RPAR', '__SLASH', '__STAR' are generated automagically. How does this work internally?

    2. About grammar 1: following this method I'll be able to easily identify the token types, so I can use them for syntax highlighting with QScintilla. Is there any problem with this approach?

    3. About grammar 2: if I want to syntax-highlight a group of similar tokens, how can I do that? In this case the token types are still generated automatically, instead of becoming OPERATOR. I'd like to be able to apply one QScintilla style to a bunch of related tokens (i.e. OPERATOR: "=" | "(" | ")" | "/" | "*" | "-" | "+")

    opened by brupelo 30
  • Fails to create parser using a big grammar (memory increases indefinitely)

    Fails to create parser using a big grammar (memory increases indefinitely)

    Hello,

    I'm trying to parse a text in Python 3.5, using the 0.5.6 release of Lark. I have a very long grammar in this format:

    start: title  field+
    
    field: rule1 -> alias1
    	| rule2 -> alias2
    	[…]
    	| rule386 -> alias386
    
    //AUXILIARY TERMS
    title: ...
    term1: ...
    [...]
    term90:
    
    //RULES GROUP 1
    rule1: ...
    [...]
    rule276: ...
    
    //RULES GROUP2
    rule277: ...
    [...]
    rule386: ...
    
    //TERMINALS
    [...]
    

    Here are some examples of the syntax of rules and terms:

    //AUXILIARY TERMS
    adexpmsg: CHARACTER*
    aidequipment: (("N"|"S") [equipmentcode])|equipmentcode
    aircraftid: ALPHANUM~2..7
    
    //PRIMARY FIELDS
    aatot: _HYPHEN _sep "AATOT" _sep timehhmm
    ad: _HYPHEN _sep "AD" _sep adid [_sep (fl|flblock)] [_sep eto] [_sep to] [_sep cto] [_sep sto] [_sep ptstay] [_sep ptrfl] [_sep ptrulchg] [_sep (ptspeed|ptmach)]
    ada: _HYPHEN _sep "ADA" _sep date
    
    //SUBFIELDS
    addrinfo: _HYPHEN _sep "ADDRINFO" _sep networktype _sep fac
    adid: _HYPHEN _sep "ADID" _sep (icaoaerodrome | "ZZZZ")
    adname: _HYPHEN _sep "ADNAME" _sep (LIM_CHAR)~1..50
    
    //TERMINALS
    _sep: SEP*
    ALPHA: /[A-Z]{1}/
    DIGIT: /[0-9]{1}/
    ALPHANUM: ALPHA|DIGIT
    SPACE: " "
    _HYPHEN: "-"
    FEF: "\n"|"\r"
    SEP: (SPACE|FEF)
    SPECIAL: SPACE
    	|"("
    	|")"
    	|"?"
    	|":"
    	|"."
    	|","
    	|"'"
    	|"="
    	|"+"
    	|"/"
    CHARACTER: ALPHA|DIGIT|SPECIAL|FEF|_HYPHEN
    LIM_CHAR: ALPHA|DIGIT|SPECIAL|FEF
    START_OF_FIELD: _HYPHEN
    %import common.WS
    

    The text format I'm trying to parse is the ADEXP format, which is a succession of fields, all of them beginning with a "-", followed by a field name and one or more values, which can themselves be a new field. The first field is "-TITLE". Here is an example of an ADEXP message:

    -TITLE BFD -REFDATA -SENDER -FAC BORD -RECVR -FAC A -SEQNUM 001 -ARCID RYR743D -SSRCODE A1122 -NBARC 1 
    -ARCTYP B738 -ADEP EDDH -ROUTE N0441F370 OLRAK DEGOL LAPRO PPG ALBER  -BEGIN RTEPTS -PT -PTID OLRAK -PT 
    -PTID DEGOL -PT -PTID LAPRO -PT -PTID PPG -PT -PTID ALBER -END RTEPTS -ADES LEBL -BEGIN EQCST -EQPT Y/EQ 
    -EQPT W/EQ -EQPT R/EQ -END EQCST -RFL F370 -SPEED N0441 -EOBT 2359 -WKTRC M
    

    This format allows separators between almost all fields, but some fields have to come directly one after the other, without any separator, so I had to make all separators explicit in all rules. You can find the complete syntax of all fields in detail here.

    I tested some rules individually and I'm able to parse a text with these rules.

    But when I use the entire grammar above (~900 lines), Lark can't create the parser:

    When I execute the code :

    [...]
    print("Creating parser")
    parser = Lark(grammar, parser='earley')
    print("Parser created")
    tree_res = parser.parse(text)
    print("Text parsed")
    

    The program displays "Creating parser" and proceeds to create it. RAM usage increases progressively, until it reaches 12-13 GB and the process gets killed by the system (after 15-20 minutes).

    I also tried different parsers with different lexers, but it doesn't change anything. I obtained the same result when I used a different Python interpreter (PyPy).

    I would like to know if you have any idea why it takes this long before finally getting killed. Is my grammar too big? Or maybe too ambiguous? Tell me if you need any more details about the grammar or anything else.

    Thank you in advance for your answer.

    opened by dryslope 29
  • Improvement: Use Cython for Speed

    Improvement: Use Cython for Speed

    Having written a LALR parser for my language (https://github.com/eddieschoute/quippy), it still takes many seconds to parse a file of 100k LOC. In one benchmark, it takes roughly 2m20s to parse a 600 kLOC input file that I have, which is slow in my opinion. One straightforward improvement that I can think of is to use Cython to generate a C implementation of the LALR parser. Most of the time seems to be spent in the main LALR parser loop, which could be significantly sped up by Cython. I would also be open to other suggestions to improve the parsing speed.

    Since specifically the LALR parser is meant to compete in speed, I think it would be worth exploring the possibility of pushing this parser to its limit. Hopefully, converting the code to Cython code will be fairly painless and from there it just remains to optimize the functionality.

    I do not know how the standalone parser will be affected by this, but I can imagine that instead of generating .py files it should instead generate a .pyx file that can be cythonized.

    enhancement discussion 
    opened by eddieschoute 29
  • Make the Earley parser closer to the spec and add a complete SPPF forest implementation.

    Make the Earley parser closer to the spec and add a complete SPPF forest implementation.

    Key changes: Add Items to the current Column and ensure uniqueness before adding derivations

    • Ensures all derivations get added to the same unique items.

    Add a rudimentary SPPF type implementation to derivations, indexed on start and end, as per:

    • https://www.sciencedirect.com/science/article/pii/S1571066108001497
    • This was required together with the fixed _ambig detection.

    Remove earley__predict_all property.

    • No longer needed after the above two changes.
    opened by night199uk 28
  • Bytes support

    Bytes support

    This is a start of implementing support for byte strings, as suggested in #626. This is still WIP.

    My idea is that the grammar is still a string, but passing the use_bytes=True flag makes the patterns compile as bytes. If you need to match bytes that are not compatible with whatever encoding is used, you can just escape them. They will be unescaped later.

    TODO:

    • [X] Add use_bytes to make regex compile as bytes
    • [x] Add tests (essentially, everything needs to be tested again, but with bytes)
    • [x] Find and check edge cases

    @ctrlcctrlv, does this fix your use case?

    opened by MegaIng 27
  • Can I display progress status of Lark().parse()?

    Can I display progress status of Lark().parse()?

    I have implemented a JSON converter for a unique format of text with a CLI using Lark. When I run Lark().parse() on a large file, I have to wait for several tens of seconds.

    Is there a way to get a progress status -- for example, by returning a generator that can be passed to tqdm?

    I am not having any issues with the speed of Lark. I just want to be able to inform the user that the program is running👍.

    enhancement 
    opened by quag-cactus 2
  • How to keep track of tree while transforming it?

    How to keep track of tree while transforming it?

    I have a lark.visitors.Transformer which converts an AST into some other AST. While doing so, I exit if there is an error, and I use the tokens to show where the error occurred. But now it is not possible to get the token, because it has already been transformed.

    question 
    opened by aspizu 0
  • Generate Type-annotated Visitor definition from lark grammar

    Generate Type-annotated Visitor definition from lark grammar

    This feature will generate a Python file containing a Visitor class definition, with a method for every rule defined in the Lark grammar file, each carrying the correct type annotations.

    Example: grammar.lark

    start: "FOO" bar biz
    bar: (/[a-z]/)*
    biz: [/no/]
    

    Result: lark --generate-visitor grammar.lark

    from typing import Optional, Literal
    from lark import Visitor, Token
    
    class MyVisitor(Visitor[Token]):
        def start(self, args: tuple[tuple[Token, ...], Optional[Literal["no"]]]):
            ...
        
        def bar(self, args: tuple[Token, ...]):
            ...
        
        def biz(self, args: tuple[Optional[Literal["no"]]]):
            ...
    
    enhancement 
    opened by aspizu 4
  • Macro support. Dynamic grammar

    Macro support. Dynamic grammar

    I want to implement macro support for my MASM parser. Is it possible to have a kind of dynamic grammar, so that I could add tokens at runtime? Or to check whether a token matches with my handler (if the macro was defined several lines before)? This might also be called a custom matcher.

    opened by xor2003 5
  • Support for Python-style comments in Lark grammar

    Support for Python-style comments in Lark grammar

    Given that

    • most (all?) editors are unaware of Lark's syntax
    • most lark grammars live in Python strings
    • most editors will use # when asked to comment lines or blocks within Lark strings (eg Pycharm's CTRL+D)
    • commenting lines and blocks is frequently done while developing a grammar (debugging...)
    • adding this style of comments should not break existing grammars

    I propose in this small PR to enable Python-style comments in Lark grammars. If accepted, I'll do another PR to reflect that in documentation.

    opened by vincent-hugot 12
Releases(1.1.5)
  • 1.1.5(Dec 6, 2022)

    What's Changed

    • setup.cfg: Replace deprecated license_file with license_files by @mgorny in https://github.com/lark-parser/lark/pull/1209
    • Fix Github shenanigans by @erezsh in https://github.com/lark-parser/lark/pull/1220
    • Fix AmbiguousExpander (Issue #1214) by @chanicpanic in https://github.com/lark-parser/lark/pull/1216
    • Fix EOF line information in InteractiveParser.resume_parse() by @erezsh in https://github.com/lark-parser/lark/pull/1224
    • Use generator instead of list expand or add method by @jmishra01 in https://github.com/lark-parser/lark/pull/1225

    New Contributors

    • @mgorny made their first contribution in https://github.com/lark-parser/lark/pull/1209
    • @jmishra01 made their first contribution in https://github.com/lark-parser/lark/pull/1225

    Full Changelog: https://github.com/lark-parser/lark/compare/1.1.4...1.1.5

    Source code(tar.gz)
    Source code(zip)
  • 1.1.4(Nov 2, 2022)

    What's Changed

    • ci: Python 3.11 final by @henryiii in https://github.com/lark-parser/lark/pull/1204
    • Add __all__ to __init__ by @aspizu in https://github.com/lark-parser/lark/pull/1200
    • PropagatePositions: Allow any object to carry the metadata, by returning it in __lark_meta__() by @erezsh in https://github.com/lark-parser/lark/pull/1203
    • fix: Token now pattern matches correctly by @marcinplatek in https://github.com/lark-parser/lark/pull/1181
    • Updates to merge PR #1151 by @erezsh in https://github.com/lark-parser/lark/pull/1205
    • style: pre-commit basic config by @henryiii in https://github.com/lark-parser/lark/pull/1151
    • PR for v1.1.4 by @erezsh in https://github.com/lark-parser/lark/pull/1208

    New Contributors

    • @aspizu made their first contribution in https://github.com/lark-parser/lark/pull/1200
    • @marcinplatek made their first contribution in https://github.com/lark-parser/lark/pull/1181

    Full Changelog: https://github.com/lark-parser/lark/compare/1.1.3...1.1.4

    Source code(tar.gz)
    Source code(zip)
  • 1.1.3(Oct 11, 2022)

    What's Changed

    • Add user to cache filename; better handle cache load/save failures by @klauer in https://github.com/lark-parser/lark/pull/1179

    • refactor: add 'usedforsecurity=False' arg to hashlib.md5 usage by @cquick01 in https://github.com/lark-parser/lark/pull/1190

    • Create lark/grammars/init.py by @chanicpanic in https://github.com/lark-parser/lark/pull/1171

    • Adjust imports for Python 3.11 by @The-Compiler in https://github.com/lark-parser/lark/pull/1140

    • Fix for issue #1173 by @erezsh in https://github.com/lark-parser/lark/pull/1198

    • Add match stmt support to python.lark by @joseph-e-k in https://github.com/lark-parser/lark/pull/1123

    • Added match stmt support to python.lark by @MegaIng in https://github.com/lark-parser/lark/pull/1016

    • Linting to fix minor issues by @Erotemic in https://github.com/lark-parser/lark/pull/1128

    • Simplify lexer: Use Match.lastgroup instead of lastindex by @erezsh in https://github.com/lark-parser/lark/pull/1129

    • Fix confusing import in examples by @JonasLoos in https://github.com/lark-parser/lark/pull/1138

    • Move iter_subtrees_topdown into standalone by @camgunz in https://github.com/lark-parser/lark/pull/1137

    • Fix 1146: use the class's get instead of the instance's get by @MegaIng in https://github.com/lark-parser/lark/pull/1147

    • fix: remove Python 2 legacy packaging code by @henryiii in https://github.com/lark-parser/lark/pull/1148

    • Fix for PR #1149 by @erezsh in https://github.com/lark-parser/lark/pull/1150

    • Old link for sppf is no longer valid. Point to web archive instead. by @patrickhuber in https://github.com/lark-parser/lark/pull/1159

    • Fix ForestToPyDotVisitor by @chanicpanic in https://github.com/lark-parser/lark/pull/1167

    • Close file-like objects to address ResourceWarning. by @shawnbrown in https://github.com/lark-parser/lark/pull/1183

    • Minor adjustments to PR #1179 by @erezsh in https://github.com/lark-parser/lark/pull/1189

    • Adjustments for PR #1152 by @erezsh in https://github.com/lark-parser/lark/pull/1191

    • Remove trailing whitespace by @bcr in https://github.com/lark-parser/lark/pull/1196

    New Contributors

    • @joseph-e-k made their first contribution in https://github.com/lark-parser/lark/pull/1123
    • @Erotemic made their first contribution in https://github.com/lark-parser/lark/pull/1128
    • @JonasLoos made their first contribution in https://github.com/lark-parser/lark/pull/1138
    • @camgunz made their first contribution in https://github.com/lark-parser/lark/pull/1137
    • @The-Compiler made their first contribution in https://github.com/lark-parser/lark/pull/1140
    • @henryiii made their first contribution in https://github.com/lark-parser/lark/pull/1148
    • @patrickhuber made their first contribution in https://github.com/lark-parser/lark/pull/1159
    • @shawnbrown made their first contribution in https://github.com/lark-parser/lark/pull/1183
    • @klauer made their first contribution in https://github.com/lark-parser/lark/pull/1179
    • @cquick01 made their first contribution in https://github.com/lark-parser/lark/pull/1190
    • @bcr made their first contribution in https://github.com/lark-parser/lark/pull/1196

    Full Changelog: https://github.com/lark-parser/lark/compare/1.1.2...1.1.3

    Source code(tar.gz)
    Source code(zip)
  • 1.1.2(Mar 1, 2022)

    Highlights

    • Tree instances are now pretty-printed with the "rich" library when doing rich.print(tree)
    • Bugfix for recursive regexes (with the "regex" library)
    • Refactors, cleanups, and better mypy support

    What's Changed

    • Clean up tree templates implementation to reduce mypy errors by @plannigan in https://github.com/lark-parser/lark/pull/1091
    • Remove redefinitions related to standalone parser by @plannigan in https://github.com/lark-parser/lark/pull/1115
    • Added Tree.rich() method to make Tree a Rich renderable by @erezsh in https://github.com/lark-parser/lark/pull/1117
    • Rename lexer_state->lexer_thread, and make a few adjustments for the benefit of Lark-Cython by @erezsh in https://github.com/lark-parser/lark/pull/1118
    • Use isinstance() checks in expcetions match_examples() by @plannigan in https://github.com/lark-parser/lark/pull/1065
    • change MAXREPEAT to int by @gruebel in https://github.com/lark-parser/lark/pull/1120
    • Tests: Small fixes by @erezsh in https://github.com/lark-parser/lark/pull/1122

    New Contributors

    • @gruebel made their first contribution in https://github.com/lark-parser/lark/pull/1120

    Full Changelog: https://github.com/lark-parser/lark/compare/1.1.1...1.1.2

    Source code(tar.gz)
    Source code(zip)
  • 1.1.1(Feb 8, 2022)

    What's Changed

    • Add test cases for tree templates by @plannigan in https://github.com/lark-parser/lark/pull/1096
    • 🖊 Fix Typo: plural "options" instead of singular "option" by @hf-kklein in https://github.com/lark-parser/lark/pull/1101
    • PEP 8: Minor Code Style Improvements by @hf-kklein in https://github.com/lark-parser/lark/pull/1102
    • Add Code Style Section to Contribution Guide by @hf-kklein in https://github.com/lark-parser/lark/pull/1107
    • Fix MyPy Warnings in lark/tools/init.py by @hf-kklein in https://github.com/lark-parser/lark/pull/1100
    • rename n to child when iterating over children by @hf-kklein in https://github.com/lark-parser/lark/pull/1110
    • specify ignored mypy error by using type: ignore[error] in lark/tree.py and lark/utils.py by @hf-kklein in https://github.com/lark-parser/lark/pull/1099
    • Add py.typed to package_data of lark package by @hf-kklein in https://github.com/lark-parser/lark/pull/1109
    • InteractiveParser: Added iter_parse() method, for easier instrumentation by @erezsh in https://github.com/lark-parser/lark/pull/1111

    New Contributors

    • @hf-kklein made their first contribution in https://github.com/lark-parser/lark/pull/1101

    Full Changelog: https://github.com/lark-parser/lark/compare/1.1.0...1.1.1

    Source code(tar.gz)
    Source code(zip)
  • 1.1.0(Jan 31, 2022)

    • Better support for typing and mypy. Includes generic tree typing (Thanks @plannigan!)

    • Improvements to python.lark (walrus operator, slashes in function params, and more). Now parses the entire Python 3.10 lib successfully

    • Bugfixes:

      • Transformer.__default__ not called in tree-less LALR mode (Issue #1029)
      • v_args failed to apply to class under standalone parser (Issue #1059)
      • maybe_placeholders incorrectly accumulated params when it encountered the | operator (Issue #1078)
    Source code(tar.gz)
    Source code(zip)
  • 1.0.0(Nov 15, 2021)

    Over the last few years, Lark has grown to become a comprehensive toolkit for parsing structured text.

    Today, I'm happy to announce the long anticipated version 1.0 of Lark, marking the API as stable.

    We've made quite a few breaking changes, in order to achieve a consistent API with as few "gotchas" as possible. Upgrading to version 1.0 might require a few changes to your project.

    Breaking changes

    • Dropped Python 2 support! Lark now only supports Python 3.6 and up.

    • Install lark using pip install lark (instead of lark-parser ).

    • maybe_placeholders is now True by default.

    • Renamed TraditionalLexer to BasicLexer, and 'standard' lexer option to 'basic'.

    • Default priority is now 0, for both terminals and rules (used to be 1 for terminals).

    • Discard mechanism is now done by returning Discard, instead of raising it as an exception.

    • use_accepts in UnexpectedInput.match_examples() is now True by default.

    • v_args(meta=True) now gives meta as the first argument. i.e. (meta, children).

    Improvements

    • Better type annotations
    • Support for terminal priorities for dynamic Earley
    • Python3 grammar is now officially supported, and can be used via %import python (...)
    • New experimental feature: Tree Templates
    • Various bugfixes

    Acknowledgements

    Many thanks to all our contributors and donors, who made this release possible. Special thanks goes to -

    • @MegaIng, for innumerous features, bugfixes, and code-reviews.
    • @chanicpanic, for his immense and continual contributions to the Earley parser, and for helping with the v1.0 effort.
    • @erezsh, for being myself.
    Source code(tar.gz)
    Source code(zip)
  • 0.12.0(Aug 30, 2021)

    Announcements

    • This is likely to be the last major release that supports Python 2 !

    We are now working on a Python3.6+ only v1.0 branch, which will soon become the default. See the work in progress: https://github.com/lark-parser/lark/pull/925

    • We also have a new online IDE! Check it out here: https://lark-parser.github.io/ide

    • Lark can now generate standalone Javascript parsers! Check it out here: https://github.com/lark-parser/Lark.js (still in beta)

    Changes

    • Using rule repeat (~ syntax) is now much much faster for large numbers, thanks to @MegaIng

    • Bugfix for the propagate_positions option. Added option value propagate_positions='ignore_ws'.

    • Fixed reconstructor for when keep_all_tokens=True

    • Added merge_transformers (Thanks Robin!)

    • Many minor bugfixes, and improvements to code and docs

    Source code(tar.gz)
    Source code(zip)
  • 0.11.3(May 3, 2021)

    Cache

    • Lark now tracks changes in imported grammars (%import), and updates the cache if necessary
    • Added support for atomicwrites, for multiprocess caching and crash recovery

    InteractiveParser

    • Now an official interface (renamed from Puppet)
    • Added Lark.parse_interactive() for starting the parser in interactive mode

    Other

    • Added ast_utils, to assist in transforming lark.Tree into a customized AST.

    • Better docs

    • Bugfixes

    Notification: Support for Python 2 is ending

    In the near future, Lark will drop support for Python 2. We will continue to develop for Python 3.6+ only, which will simplify the code and ease development.

    Old releases (including this one) will still work, of course, and should be stable enough to accompany the remaining Python 2 users into the sunset.

    If you have any objections, feel free to voice them here: https://github.com/lark-parser/lark/discussions/874

    Thanks for everyone who helped make Lark better!

    Source code(tar.gz)
    Source code(zip)
  • 0.11.2(Feb 16, 2021)

    New Features:

    • Better grammar re-use with the %override and %extend statements, which allow you to rewrite and extend imported rules and tokens, similarly to class inheritance. (See this example: https://github.com/lark-parser/lark/blob/master/examples/advanced/extend_python.py)

    Improvements

    • Indenter now throws DedentError instead of AssertionError

    • Improved the Python3 grammar, now works with reconstructor. (See this example: https://github.com/lark-parser/lark/blob/master/examples/advanced/reconstruct_python.py)

    • Lots of refactoring for a better tomorrow.

    • Rule/terminal names can now be in Unicode. (Thanks @julienmalard)

    • Better errors.

    • Better type hints.

    • lark.lark is now part of the standard library.

    • Earley:

      • Now works with match_examples()
      • Now supports a custom lexer
      • Better handling of ignored terminals
      • Faster forest visiting, and a few edge-case bugfixes (thanks @chanicpanic)

    Other

    • Lark now accepts funding as a member of Github Sponsors! See here: https://github.com/sponsors/lark-parser
    Source code(tar.gz)
    Source code(zip)
  • 0.11.0(Nov 16, 2020)

    • LALR parser

      • The LALR parser now supports priority in rules, as a way to resolve collision errors

      • Improvements to the standalone tool, including more command-line options, like optional compression for the json data.

      • Improvements to the puppet error handling interface

      • Better error reporting on LALR collisions

    • Bugfixes in Earley

    Misc

    • Added support for syntax highlighting in Atom

    • Fixes and improvements for the cache option. cache=True now uses a temporary directory instead of the working directory.

    • Lark can now be imported directly from a zip (See: ed5c8ec51c4c6e8bd0ac80caff6afcb90a97d218)

    • Added more terminals to the grammar library (available for %import).

    • The Nearley conversion tool now supports case-insensitive strings

    • Deprecated some interfaces

    • Improvements to docs, stubs, and various bugfixes

    Thanks to @MegaIng for helping with Lark's maintenance, and to @ldbo, @chanicpanic, @michael-k, @ThatXliner and everyone else for their help and contributions.

    Source code(tar.gz)
    Source code(zip)
  • 0.10.0(Sep 21, 2020)

    • Complete overhaul of documentation. Now using sphinx to generate API docs from docstrings. (commit 0664cbd3d3c19e321cae8df044839e7baf7135af. Thank you @chsasank !)

      • Many improvements and additions to documentation
    • New and friendlier Earley SPPF interface! (commit 555b268eb26bcbfce64991ea7517338dee85a840. Thank you @chanicpanic !)

      • Added the ambiguity='forest' option. Added ForestTransformer and TreeForestTransformer.

      • Various Bugfixes to improve the handling of ambiguous results.

      • Read the docs here: https://lark-parser.readthedocs.io/en/latest/forest.html

    • New Vim syntax highlighting for Lark (https://github.com/lark-parser/vim-lark-syntax Thank you @omega16 !)

    • Lark now loads faster from cache (commit 7dc00179e63efa6e98d688bfba3265d382db79c4)

    • Terminals can now be composed of regexps and strings with different flags, if using Python 3.6+ (commit e6fc3c9b00306e3a8661210fcc93bf50479ee229)

    • Added support for parsing byte-strings, with the use_bytes flag (commit 9ee8428f3f6ad285ad93e2b62ec47d33fff54768).

    • UnexpectedToken exception now has the accepts attribute, which contains a list of terminals that would be accepted by the parser instead (in addition to the expects attribute, which is guided by the lexer and may include terminals that won't be accepted by the parser) (commit a7bcd0bc2d3cb96030d9e77523c0007e8034ce49)

    • Allow multiline regexes with the x flag (commit 9923987e94547ded8a17d7a03840c4cebce39188)

    • Lark no longer uses the default logger. Instead uses lark.LOGGER. (commit 7010f96825b5fbac79522d1b30689065df53dc8c)

    • Lark now notifies on unused terminals/rules through logging.debug.

    • Standalone generator now creates smaller files (without comments and docstrings). Also undergone various fixes. (commit bf2d9bf7b16cddb39f2e0ea3cefecc8de5269e2c)

    • Wheel distribution due to (somewhat) popular demand.

    • Lots of small bugfixes and improvements!

    Many thanks to @MegaIng for his continued work on many of these new features and fixes, and to everyone else who contributed to Lark and helped make it even better.

    Source code(tar.gz)
    Source code(zip)
  • 0.9.0(Jul 1, 2020)

    • Added error handling to LALR!

      • on_error option to Lark.parse(). Read here: https://lark-parser.readthedocs.io/en/latest/classes/#larkparse
      • Parser now comes with a puppet for advanced error handling. Read here: https://lark-parser.readthedocs.io/en/latest/classes/#parserpuppet
    • Support for better regexps with the regex module, when using Lark(..., regex=True). Read here: https://lark-parser.readthedocs.io/en/latest/classes/#using-unicode-character-classes-with-regex

    Source code(tar.gz)
    Source code(zip)
  • 0.8.9(Jun 16, 2020)

    The last two releases were wrong. I apologize.

    Hopefully that's the last of it, and we'll be back on track with periodic and accurate releases.

    Source code(tar.gz)
    Source code(zip)
  • 0.8.6(Jun 10, 2020)

    The main features for this release:

    • Grammar caching: It's now possible to cache the results of the LALR grammar analysis, for 2x to 3x faster loading. Use Lark(..., cache=True) or specify a file name. See here: https://lark-parser.readthedocs.io/en/latest/classes/

    • Grammar templates: Added support for grammar "functions" that expand in preprocessing. No docs yet, but see here for examples: https://github.com/lark-parser/lark/blob/master/tests/test_parser.py#L845

    • Lark online IDE: Technically not a feature, but it's possible to run Lark in the browser. Now we also have a simple IDE on github pages: https://lark-parser.github.io/lark/ide/app.html

    • Other changes:

      • Improved performance for large grammars

      • More debug prints when in debug mode

      • Better support for PyInstaller

      • Lots of bugfixes: mypy stubs, v_args, docs, and more.

    Source code(tar.gz)
    Source code(zip)
  • 0.8.3(Mar 28, 2020)

    • Added the g_regex_flags option, to allow applying flags to all terminals.
    • Fixed end_pos for Earley, when using propagate_positions
    • Fixes for mypy
    • Better docs
    Source code(tar.gz)
    Source code(zip)
  • 0.8.2(Mar 7, 2020)

    Changes in this version are:

    • Added type stubs for all public APIs, in order to support type checking and completion using MyPy (or others)

    • Added two new methods to the Lark class: Lark.save() and Lark.load(). Both methods pickle and unpickle (respectively) the class instance into/from file objects. These can be used to allow faster loading times. (future versions will implement an automatic caching feature)

    • The standalone parser is now MPL2, instead of GPL. The Mozilla Public License is much less restrictive, so this shouldn't affect anyone who's already using the standalone parser. But it should make it easier for other users to adopt it.

    Source code(tar.gz)
    Source code(zip)
  • 0.8.1(Jan 22, 2020)

  • 0.8.0(Jan 22, 2020)

    - Better LALR

    The biggest change in this release is a new LALR engine, capable of dealing with a few edge cases that the previous parser couldn't handle.

    This parser is supposed to be fully backwards-compatible with the previous one, but that is hard to verify!

    Thank you, @Raekye, for this great contribution to Lark!

    For more details, see issue #418

    - Transformers now visit tokens, as well as rules (an alternative to lexer_callbacks)

    Transformers now visit tokens, in addition to rules.

    Simply define a method with the correct name (uppercase, of course), and the transformer will visit your tokens before the rules that contain them.

    It's possible to disable this, for backwards compatibility, or for the slight performance gain.

    - Other Changes

    • Added visit_topdown methods to Visitor classes

    • Lark now allows line comments in its rule definitions

    • Better error messages

    • Improvements to documentation

    • Bugfixes

    • maybe_placeholders is now the default (backwards-incompatible) (REVERTED in 0.8.1)

    Source code(tar.gz)
    Source code(zip)
  • 0.7.8(Nov 1, 2019)

    • Improved error messages for EOF in Earley, recursive terminals, UnexpectedToken

    • Bugfixes for declared terminals, UnexpectedToken, and Unicode support in Python 2

    Source code(tar.gz)
    Source code(zip)
  • 0.7.7(Oct 3, 2019)

    • Fixed a bug in Earley where running it from different threads produced bad results

    • Improved error reporting when using LALR

    • Added the 'edit_terminals' option, to allow programmatic manipulation of terminals, for example to support keywords in different languages.

    Note: This release skips 0.7.6, due to a simple oversight on my part. Hopefully that won't be a problem.

  • 0.7.5(Sep 6, 2019)

    Lark transformers can now visit tokens as well. Use like this:

    from lark import Transformer

    class MyTransformer(Transformer):
        def TOKEN1(self, tok):
            return tok.upper()

        def rule_as_usual(self, children):
            return children

    MyTransformer(visit_tokens=True).transform(tree)

    Fixed a few regressions that I accidentally introduced in 0.7.4

  • 0.7.4(Aug 29, 2019)

    • Fixed long-standing non-determinism and prioritization bugs in Earley.

    • Serialize tool now supports multiple start symbols

    • iter_subtrees, find_data and find_pred methods are now included in standalone parser

    • Bugfixes for the transformer interface, for the custom lexer, for grammar imports, and many more

  • 0.7.3(Aug 14, 2019)

    • Added a new tool called Serialize, which stores Lark's internal state as JSON. This will allow for integration with other languages. I have already started such a project for Julia: https://github.com/erezsh/Lark_Julia (it's working, but still in early stages)

    • Minor bugfix regarding line-counting and the \s regex

  • 0.7.2(Jul 30, 2019)

    New features:

    • Lark now allows you to specify the start symbol when calling Lark.parse() (requires pre-declaration of all possible start states, see the start option)

    • Negative priority is now allowed in rules and terminals (the default value is still 1; this may change in 0.8)

    Also includes many minor bugfixes, optimizations, and improvements to documentation
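    An illustrative sketch of selecting the start symbol per call (the rule names here are made up): all possible entry points are pre-declared via the start option, then one is chosen when parsing.

    ```python
    from lark import Lark

    # Declare every possible entry point up front
    parser = Lark('''
        greeting: "hello" NAME
        farewell: "bye" NAME
        NAME: /[a-z]+/
        %ignore " "
    ''', start=['greeting', 'farewell'])

    # Choose the start symbol at parse time
    tree = parser.parse("hello world", start='greeting')
    ```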

  • 0.7.1(May 4, 2019)

    • Lark can now serialize its parsers, resulting in simplified stand-alone code.

    • Bugfix for v_args (Issue #350)

    • Improvements and bugfixes for importing rules from grammar files

    • Performance improvement for the reconstructor feature
