A fast, extensible and spec-compliant Markdown parser in pure Python.

Mi Yu

Last update: Jan 1, 2023

Overview

mistletoe

mistletoe is a Markdown parser in pure Python, designed to be fast, spec-compliant and fully customizable.

Apart from being the fastest CommonMark-compliant Markdown parser implementation in pure Python, mistletoe also supports easy definitions of custom tokens. Parsing Markdown into an abstract syntax tree also allows us to swap out renderers for different output formats, without touching any of the core components.

Remember to spell mistletoe in lowercase!

Features

Fast: mistletoe is the fastest implementation of CommonMark in Python, that is, 2 to 3 times as fast as Commonmark-py, and still roughly 30% faster than Python-Markdown. Running with PyPy yields comparable performance with mistune.

See the performance section for details.
Spec-compliant: CommonMark is a useful, high-quality project. mistletoe follows the CommonMark specification to resolve ambiguities during parsing. Outputs are predictable and well-defined.
Extensible: Strikethrough and tables are supported natively, and custom block-level and span-level tokens can easily be added. Writing a new renderer for mistletoe is a relatively trivial task.

You can even write a Lisp in it.

Some alternative output formats:

HTML
LaTeX
Jira Markdown (contrib)
Mathjax (contrib)
Scheme (contrib)
HTML + code highlighting (contrib)

Installation

mistletoe is tested for Python 3.5 and above. Install mistletoe with pip:

pip3 install mistletoe

Alternatively, clone the repo:

git clone https://github.com/miyuchina/mistletoe.git
cd mistletoe
pip3 install -e .

See the contributing doc for how to contribute to mistletoe.

Usage

Basic usage

Here's how you can use mistletoe in a Python script:

import mistletoe

with open('foo.md', 'r') as fin:
    rendered = mistletoe.markdown(fin)

mistletoe.markdown() uses mistletoe's default settings: allowing HTML mixins and rendering to HTML. The function also accepts an additional argument renderer. To produce LaTeX output:

import mistletoe
from mistletoe.latex_renderer import LaTeXRenderer

with open('foo.md', 'r') as fin:
    rendered = mistletoe.markdown(fin, LaTeXRenderer)

Finally, here's how you would manually specify extra tokens and a renderer for mistletoe. In the following example, we use HTMLRenderer to render the AST, which adds HTMLBlock and HTMLSpan to the normal parsing process.

from mistletoe import Document, HTMLRenderer

with open('foo.md', 'r') as fin:
    with HTMLRenderer() as renderer:
        rendered = renderer.render(Document(fin))

From the command-line

pip installation enables mistletoe's command-line utility. Type the following directly into your shell:

mistletoe foo.md

This will transpile foo.md into HTML, and dump the output to stdout. To save the HTML, direct the output into a file:

mistletoe foo.md > out.html

You can pass in custom renderers by including the full path to your renderer class after a -r or --renderer flag:

mistletoe foo.md --renderer custom_renderer.CustomRenderer
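
As a rough illustration of what "the full path to your renderer class" means, the sketch below resolves a dotted path into a class object using only the standard library. The helper name load_renderer is hypothetical, and this is not necessarily how mistletoe's command-line utility does it; the demo uses a standard-library class so it runs without any extra packages.

```python
import importlib


def load_renderer(dotted_path):
    """Resolve 'package.module.ClassName' to the class object (hypothetical helper)."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError("expected a path like 'module.ClassName'")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Demonstrated with a standard-library class so the sketch runs anywhere.
decoder_cls = load_renderer("json.JSONDecoder")
```

With a path such as custom_renderer.CustomRenderer, the same split yields the module custom_renderer and the class CustomRenderer, so the module has to be importable from where the command is run.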

Running mistletoe without specifying a file will land you in interactive mode. Like Python's REPL, interactive mode allows you to test how your Markdown will be interpreted by mistletoe:

mistletoe [version 0.7.2] (interactive)
Type Ctrl-D to complete input, or Ctrl-C to exit.
>>> some **bold** text
... and some *italics*
...
<p>some <strong>bold</strong> text
and some <em>italics</em></p>
>>>

The interactive mode also accepts the --renderer flag:

mistletoe [version 0.7.2] (interactive)
Type Ctrl-D to complete input, or Ctrl-C to exit.
Using renderer: LaTeXRenderer
>>> some **bold** text
... and some *italics*
...
\documentclass{article}
\begin{document}

some \textbf{bold} text
and some \textit{italics}
\end{document}
>>>

Performance

mistletoe is the fastest CommonMark-compliant implementation in Python. Try the benchmarks yourself by running:

$ python3 test/benchmark.py  # all results in seconds
Test document: test/samples/syntax.md
Test iterations: 1000
Running tests with markdown, mistune, commonmark, mistletoe...
==============================================================
markdown: 33.28557115700096
mistune: 8.533771439999327
commonmark: 84.54588776299897
mistletoe: 23.5405140980001

We notice that Mistune is the fastest Markdown parser, and by a good margin, which demands some explanation. mistletoe's biggest performance penalty comes from stringently following the CommonMark spec, which outlines a highly context-sensitive grammar for Markdown. Mistune takes a simpler approach to the lexing and parsing process, but this means that it cannot handle more complex cases, e.g., precedence of different types of tokens, escaping rules, etc.

To see why this might be important to you, consider the following Markdown input (example 392 from the CommonMark spec):

***foo** bar*

The natural interpretation is:

<p><em><strong>foo</strong> bar</em></p>

... and it is indeed the output of Python-Markdown, Commonmark-py and mistletoe. Mistune (version 0.8.3) greedily parses the first two asterisks in the first delimiter run as a strong-emphasis opener, the second delimiter run as its closer, but does not know what to do with the remaining asterisk in between:

<p><strong>*foo</strong> bar*</p>

The implication of this runs deeper, and it is not simply a matter of dogmatically following an external spec. By adopting a more flexible parsing algorithm, mistletoe allows us to specify a precedence level for each token class, including custom ones that you might write in the future. Code spans, for example, have a higher precedence level than emphasis, so

*foo `bar* baz`

... is parsed as:

<p>*foo <code>bar* baz</code></p>

... whereas Mistune parses this as:

<p><em>foo `bar</em> baz`</p>

Of course, it is not impossible for Mistune to modify its behavior, and parse these two examples correctly, through more sophisticated regexes or some other means. It is nevertheless highly likely that, when Mistune implements all the necessary context checks, it will suffer from the same performance penalties.

Contextual analysis is why Python-Markdown is slow, and why CommonMark-py is slower. The lack thereof is the reason mistune enjoys stellar performance among similar parser implementations, as well as the limitations that come with these performance benefits.

If you want an implementation that focuses on raw speed, mistune remains a solid choice. If you need a spec-compliant and readily extensible implementation, however, mistletoe is still marginally faster than Python-Markdown, while supporting more functionality (lists in block quotes, for example), and significantly faster than CommonMark-py.

One last note: another bottleneck of mistletoe compared to mistune is function-call overhead. Because, unlike mistune, mistletoe splits its functionality into modules, function lookups can take significantly longer than in mistune. To boost performance further, consider running mistletoe on PyPy. Benchmark results show that on PyPy, mistletoe's performance is on par with mistune:

$ pypy3 test/benchmark.py mistune mistletoe
Test document: test/samples/syntax.md
Test iterations: 1000
Running tests with mistune, mistletoe...
========================================
mistune: 13.645681533998868
mistletoe: 15.088351159000013

Developer's Guide

Here's an example to add GitHub-style wiki links to the parsing process, and provide a renderer for this new token.

A new token

GitHub wiki links are span-level tokens, meaning that they reside inline, and don't really look like chunky paragraphs. To write a new span-level token, all we need to do is make a subclass of SpanToken:

from mistletoe.span_token import SpanToken

class GithubWiki(SpanToken):
    pass

mistletoe uses regular expressions to search for span-level tokens in the parsing process. As a refresher, a GitHub wiki link looks something like this: [[alternative text | target]]. We define a class variable, pattern, that stores the compiled regex:

import re
from mistletoe.span_token import SpanToken

class GithubWiki(SpanToken):
    pattern = re.compile(r"\[\[ *(.+?) *\| *(.+?) *\]\]")
    def __init__(self, match):
        pass

The regex will be picked up by SpanToken.find, which is used by the tokenizer to find all tokens of its kind in the document. If regexes are too limited for your use case, consider overriding the find method; it should return a list of all token occurrences.
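
To see what this pattern captures before wiring it into a token class, it can be exercised on its own with the standard re module. The snippet below is only a sanity check of the regex from the text; the sample strings are made up for illustration.

```python
import re

# The same pattern as in the GithubWiki class above.
pattern = re.compile(r"\[\[ *(.+?) *\| *(.+?) *\]\]")

match = pattern.search("See [[alternative text | target]] for details.")
alt_text = match.group(1)  # the text shown to the reader
target = match.group(2)    # the link destination

# findall returns one (group 1, group 2) tuple per wiki link found.
all_links = pattern.findall("[[a|b]] and [[*alt*|link]]")
```

Group 1 is the part that may contain further span-level tokens, and group 2 is the link target.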

Three other class variables are available for our custom token class, and their default values are shown below:

class SpanToken:
    parse_group = 1
    parse_inner = True
    precedence = 5

Note that alternative text can also contain other span-level tokens. For example, [[*alt*|link]] is a GitHub link with an Emphasis token as its child. To parse child tokens, parse_inner should be set to True (the default value in this case), and parse_group should correspond to the match group in which child tokens might occur (also the default value, 1, in this case).

Once these two class variables are set correctly, the GithubWiki.children attribute will automatically be set to the list of child tokens. Note that there is no need to manually set this attribute, unlike in previous versions of mistletoe.

Lastly, the SpanToken constructor takes a regex match object as its argument. We can simply store the target, taken from match_obj.group(2), as an attribute.

import re
from mistletoe.span_token import SpanToken

class GithubWiki(SpanToken):
    pattern = re.compile(r"\[\[ *(.+?) *\| *(.+?) *\]\]")
    def __init__(self, match_obj):
        self.target = match_obj.group(2)

There you go: a new token in just a handful of lines of code.

Side note about precedence

Normally there is no need to override the precedence value of a custom token. The default value is the same as InlineCode, AutoLink and HTMLSpan, which means that whichever token comes first will be parsed. In our case:

`code with [[ text` | link ]]

... will be parsed as:

<code>code with [[ text</code> | link ]]

If we set GithubWiki.precedence = 6, we have:

`code with <a href="link">text`</a>
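
The effect of the precedence value can be mimicked with a small toy model. The sketch below is not mistletoe's actual tokenizer: the resolve function and the (name, pattern, precedence) tuples are hypothetical, and the only rules it implements are the two described above, namely that a higher precedence wins and that ties go to whichever match starts first.

```python
import re

CODE = re.compile(r"`(.+?)`")
WIKI = re.compile(r"\[\[ *(.+?) *\| *(.+?) *\]\]")


def resolve(text, specs):
    """Pick non-overlapping matches; specs is a list of (name, pattern, precedence)."""
    candidates = [
        (name, m.start(), m.end(), precedence)
        for name, pattern, precedence in specs
        for m in pattern.finditer(text)
    ]
    # Higher precedence first; equal precedence falls back to the earlier start.
    candidates.sort(key=lambda c: (-c[3], c[1]))
    chosen = []
    for name, start, end, _ in candidates:
        # Keep a candidate only if it does not overlap anything already chosen.
        if all(end <= s or start >= e for _, s, e in chosen):
            chosen.append((name, start, end))
    chosen.sort(key=lambda c: c[1])
    return [(name, text[start:end]) for name, start, end in chosen]


text = "`code with [[ text` | link ]]"
same_level = resolve(text, [("code", CODE, 5), ("wiki", WIKI, 5)])
wiki_first = resolve(text, [("code", CODE, 5), ("wiki", WIKI, 6)])
```

With equal precedence the code span wins because it starts first; raising the wiki link's precedence to 6 makes the link win instead, which matches the two outputs shown above.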

A new renderer

Adding a custom token to the parsing process usually involves a lot of nasty implementation details. Fortunately, mistletoe takes care of most of them for you. Simply passing your custom token class to super().__init__() does the trick:

from mistletoe.html_renderer import HTMLRenderer

class GithubWikiRenderer(HTMLRenderer):
    def __init__(self):
        super().__init__(GithubWiki)

We then only need to tell mistletoe how to render our new token:

def render_github_wiki(self, token):
    template = '<a href="{target}">{inner}</a>'
    target = token.target
    inner = self.render_inner(token)
    return template.format(target=target, inner=inner)

Cleaning up, we have our new renderer class:

from mistletoe.html_renderer import HTMLRenderer, escape_url

class GithubWikiRenderer(HTMLRenderer):
    def __init__(self):
        super().__init__(GithubWiki)

    def render_github_wiki(self, token):
        template = '<a href="{target}">{inner}</a>'
        target = escape_url(token.target)
        inner = self.render_inner(token)
        return template.format(target=target, inner=inner)

Take it for a spin?

It is preferred that all mistletoe's renderers be used as context managers. This is to ensure that your custom tokens are cleaned up properly, so that you can parse other Markdown documents with different token types in the same program.

from mistletoe import Document
from contrib.github_wiki import GithubWikiRenderer

with open('foo.md', 'r') as fin:
    with GithubWikiRenderer() as renderer:
        rendered = renderer.render(Document(fin))
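
Why the context manager matters can be illustrated with a toy model. The sketch below assumes that custom tokens live in some shared registry that the renderer modifies on construction and restores on exit; the names _token_types and ToyRenderer are hypothetical stand-ins, not mistletoe's real internals.

```python
# Stand-in for a shared, module-level list of active token types (an assumption).
_token_types = ["Emphasis", "InlineCode"]


class ToyRenderer:
    def __init__(self, *extras):
        # Remember the registry as it was, then register the extra tokens.
        self._saved = list(_token_types)
        _token_types[:0] = extras

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Restore the registry so later documents are parsed without the extras.
        _token_types[:] = self._saved
        return False


with ToyRenderer("GithubWiki"):
    inside = list(_token_types)
after = list(_token_types)
```

In this model, anything parsed inside the with block sees the extra token, and anything parsed before or after it does not, which is why the Document in the example above is created inside the block.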

For more info, take a look at the base_renderer module in mistletoe. The docstrings might give you a more granular idea of customizing mistletoe to your needs.

Why mistletoe?

"For fun," says David Beazley.

Copyright & License

mistletoe's logo uses artwork by Freepik, under CC BY 3.0.
mistletoe is released under MIT.

Comments

Add repr methods to all token classes

This pull request adds a __repr__ method to all token classes. To simplify implementation it introduces a common base class token.Token for block_token.BlockToken and span_token.SpanToken. For the following example program:

from mistletoe import block_token, span_token, utils
d = block_token.Document(open("test/samples/quotes.md", "r").read())
print([tr.node for tr in utils.traverse(d)])

The output looks like this without __repr__ methods:

[<mistletoe.block_token.Heading at 0x10fa5aec0>,
 <mistletoe.block_token.Quote at 0x10f99e2c0>,
 <mistletoe.block_token.Paragraph at 0x10fb95a50>,
 <mistletoe.block_token.Quote at 0x10fb94130>,
 <mistletoe.block_token.Paragraph at 0x10fb97970>,
 <mistletoe.block_token.Quote at 0x10fb94c10>,
 <mistletoe.block_token.Paragraph at 0x10fbc00d0>,
 <mistletoe.block_token.Quote at 0x10fbc0d30>,
 <mistletoe.block_token.Paragraph at 0x10fbc1270>,
 <mistletoe.span_token.RawText at 0x10fa5b4f0>,
 <mistletoe.block_token.Paragraph at 0x10f99d8a0>,
 <mistletoe.span_token.RawText at 0x10fb95a20>,
 <mistletoe.block_token.Paragraph at 0x10fb97910>,
 <mistletoe.block_token.Paragraph at 0x10fb97c70>,
 <mistletoe.span_token.RawText at 0x10fb97ca0>,
 <mistletoe.block_token.List at 0x10fbc0c40>,
 <mistletoe.span_token.RawText at 0x10fbc0580>,
 <mistletoe.block_token.Quote at 0x10fbc0850>,
 <mistletoe.block_token.Paragraph at 0x10fbc0fd0>,
 <mistletoe.block_token.Quote at 0x10fbc1090>,
 <mistletoe.span_token.RawText at 0x10fbc12a0>,
 <mistletoe.span_token.RawText at 0x10f99cd60>,
 <mistletoe.span_token.RawText at 0x10fb94160>,
 <mistletoe.span_token.RawText at 0x10fb948b0>,
 <mistletoe.block_token.ListItem at 0x10fbc0880>,
 <mistletoe.block_token.ListItem at 0x10fbc03d0>,
 <mistletoe.block_token.Paragraph at 0x10fbc0f10>,
 <mistletoe.span_token.RawText at 0x10fbc1000>,
 <mistletoe.block_token.Paragraph at 0x10fbc10f0>,
 <mistletoe.block_token.Paragraph at 0x10fbc11b0>,
 <mistletoe.block_token.Paragraph at 0x10fbc04f0>,
 <mistletoe.block_token.Paragraph at 0x10fbc0460>,
 <mistletoe.span_token.RawText at 0x10fbc0f70>,
 <mistletoe.span_token.RawText at 0x10fbc1120>,
 <mistletoe.span_token.RawText at 0x10fbc11e0>,
 <mistletoe.span_token.RawText at 0x10fbc0640>,
 <mistletoe.span_token.RawText at 0x10fbc0040>]

and like this with __repr__ methods:

[<mistletoe.block_token.Heading with 1 child content='Quotes' level=2 at 0x10f748c40>,
 <mistletoe.block_token.Quote with 1 child at 0x10f748250>,
 <mistletoe.block_token.Paragraph with 1 child at 0x10f776830>,
 <mistletoe.block_token.Quote with 2 children at 0x10f7748b0>,
 <mistletoe.block_token.Paragraph with 1 child at 0x10f777310>,
 <mistletoe.block_token.Quote with 1 child at 0x10f774220>,
 <mistletoe.block_token.Paragraph with 1 child at 0x10f776440>,
 <mistletoe.block_token.Quote with 3 children at 0x10f7b9c90>,
 <mistletoe.block_token.Paragraph with 1 child at 0x10f7b82e0>,
 <mistletoe.span_token.RawText content='Quotes' at 0x10f748340>,
 <mistletoe.block_token.Paragraph with 1 child at 0x10f664a90>,
 <mistletoe.span_token.RawText content='A response to single quote.' at 0x10f776ec0>,
 <mistletoe.block_token.Paragraph with 1 child at 0x10f774370>,
 <mistletoe.block_token.Paragraph with 1 child at 0x10f777490>,
 <mistletoe.span_token.RawText content='Quote with a list inside:' at 0x10f7776d0>,
 <mistletoe.block_token.List with 2 children loose=False start=1 at 0x10f7762f0>,
 <mistletoe.span_token.RawText content='Nested quotes:' at 0x10f7b9150>,
 <mistletoe.block_token.Quote with 1 child at 0x10f7ba410>,
 <mistletoe.block_token.Paragraph with 1 child at 0x10f7b8580>,
 <mistletoe.block_token.Quote with 2 children at 0x10f7b8b50>,
 <mistletoe.span_token.RawText content='Another paragraph.' at 0x10f7b8be0>,
 <mistletoe.span_token.RawText content='A single quote' at 0x10f665060>,
 <mistletoe.span_token.RawText content='A quote spreading...' at 0x10f777be0>,
 <mistletoe.span_token.RawText content='... multiple paragraphs' at 0x10f776bc0>,
 <mistletoe.block_token.ListItem with 1 child leader='1.' prepend=3 loose=False at 0x10f775d50>,
 <mistletoe.block_token.ListItem with 1 child leader='2.' prepend=3 loose=False at 0x10f776d40>,
 <mistletoe.block_token.Paragraph with 1 child at 0x10f7ba3e0>,
 <mistletoe.span_token.RawText content='Quoted paragraph.' at 0x10f7b8460>,
 <mistletoe.block_token.Paragraph with 1 child at 0x10f7b9570>,
 <mistletoe.block_token.Paragraph with 1 child at 0x10f7b9ae0>,
 <mistletoe.block_token.Paragraph with 1 child at 0x10f7749a0>,
 <mistletoe.block_token.Paragraph with 1 child at 0x10f776b30>,
 <mistletoe.span_token.RawText content='Nested line quote' at 0x10f7b8640>,
 <mistletoe.span_token.RawText content='Nested block quote.' at 0x10f7b86a0>,
 <mistletoe.span_token.RawText content='Jira does not seem to support '...+51 at 0x10f7b93f0>,
 <mistletoe.span_token.RawText content='first' at 0x10f776260>,
 <mistletoe.span_token.RawText content='second' at 0x10f7771c0>]
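
A minimal sketch of how such a __repr__ could be built on a shared base class is given below. The classes ToyToken and ToyHeading and the repr_attributes tuple are stand-ins chosen for illustration; this is not the code from the pull request, only one way to produce output of the same shape.

```python
class ToyToken:
    """Stand-in base class; not mistletoe's actual implementation."""

    # Names of the attributes each subclass wants to show in its repr.
    repr_attributes = ()

    def __init__(self, children=None, **attrs):
        self.children = children
        for name, value in attrs.items():
            setattr(self, name, value)

    def __repr__(self):
        cls = type(self)
        parts = ["<{}.{}".format(cls.__module__, cls.__name__)]
        if self.children is not None:
            count = len(self.children)
            noun = "child" if count == 1 else "children"
            parts.append("with {} {}".format(count, noun))
        for name in self.repr_attributes:
            parts.append("{}={!r}".format(name, getattr(self, name)))
        parts.append("at {:#x}>".format(id(self)))
        return " ".join(parts)


class ToyHeading(ToyToken):
    repr_attributes = ("level",)


heading = ToyHeading(children=[ToyToken(content="Quotes")], level=2)
text = repr(heading)
```

Listing the attributes per class keeps the output short, since only the names a token class opts into are printed.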

enhancement

opened by doerwalter 14

Extracting content by intercepting render_raw_text

Thanks for this nice project.

I may be a noob in this, but I was able to parse a readme and get the returned content wrapped in its HTML elements. But the returned content is one giant text. So, my question is: is there any built-in function to extract only the contents and not the code? As in, get only the data from <p></p> or <h5></h5>?
question

opened by samayo 14

Ensure LaTeX renderer uses valid \verb delimiter

The LaTeX renderer uses \verb for inline code, but the delimiter is always a vertical bar, which produces incorrect output when the inline code also contains a vertical bar (e.g., example | pipe).

Rather than using a single static character (i.e., a vertical bar), this change modifies render_inline_code to search for a non-letter delimiter that does not appear in the inline code. If no such delimiter can be found, a RuntimeError is raised to avoid incorrect output.

Note that the list of possible delimiters is not exhaustive. For example, numbers (0, 1, 2, etc.) are all valid delimiters for \verb but are omitted from the search.
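
The delimiter search described above might look roughly like the following. The function name render_verb and the candidate string are assumptions made for this sketch; the real render_inline_code may use a different candidate list.

```python
def render_verb(code, candidates="|!\"'=+"):
    """Wrap code in \\verb, using the first candidate delimiter not found in it."""
    for delimiter in candidates:
        if delimiter not in code:
            return "\\verb{0}{1}{0}".format(delimiter, code)
    # Every candidate occurs in the code, so no safe delimiter is available.
    raise RuntimeError("no valid \\verb delimiter available for " + repr(code))


plain = render_verb("x = 1")
piped = render_verb("example | pipe")
```

Code without a vertical bar keeps the familiar delimiter, while code containing one falls through to the next candidate.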

Fixes #149

opened by joel-coffman 13

Problem rendering pipe characters in code blocks within tables

Hello, thank you heartily for this great library. It took me quite a few months to encounter a bug, which I am reporting here.

Mistletoe seems unable to render pipe characters in code blocks within tables. Here is an example of observed behavior:

mistletoe [version 0.7.2] (interactive)
Type Ctrl-D to complete input, or Ctrl-C to exit.
>>> | Table | Header |
... |---    |---     |
... | `<|>` | `<|>`  |
...
... ^Z

<table>
<thead>
<tr>
<th align="left">Table</th>
<th align="left">Header</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">`&lt;</td>
<td align="left">&gt;`</td>
<td align="left">`&lt;</td>
<td align="left">&gt;`</td>
</tr>
</tbody>
</table>
>>>

The following was my expected rendering:

| Table | Header |
|---    |---     |
| <\|>  | <\|>   |

Interestingly, this behavior can not be expected in GFM, which requires escapes for pipes: https://github.com/github/markup/issues/1078

However, escaping pipes is not working in mistletoe:

>>> | Table | Header |
... |---    |---     |
... |`<\|>` | `<\|>` |
... ^Z

<table>
<thead>
<tr>
<th align="left">Table</th>
<th align="left">Header</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">`&lt;\</td>
<td align="left">&gt;`</td>
<td align="left">`&lt;\</td>
<td align="left">&gt;`</td>
</tr>
</tbody>
</table>
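
For reference, splitting a table row only on unescaped pipes can be sketched with a negative lookbehind. The helper split_row below is hypothetical and follows the GFM convention of escaping pipes with a backslash; it is not mistletoe's table parser.

```python
import re


def split_row(row):
    """Split a table row into cells, treating \\| as a literal pipe."""
    inner = row.strip()
    # Drop the optional leading and trailing pipes that frame the row.
    if inner.startswith("|"):
        inner = inner[1:]
    if inner.endswith("|") and not inner.endswith("\\|"):
        inner = inner[:-1]
    # Split on pipes that are not preceded by a backslash.
    cells = re.split(r"(?<!\\)\|", inner)
    return [cell.strip().replace("\\|", "|") for cell in cells]


cells = split_row(r"|`<\|>` | `<\|>` |")
```

The row from the report then yields two cells, each containing the full code span.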

enhancement

opened by huettenhain 9

Mistletoe plant and logo

Hi,

first of all, thanks for the tool: it's a piece of cake for markdown parsing (and also for custom rendering)!

Just for the sake of precision, I'd like to point out a common misconception about mistletoe, which is not the plant depicted in the logo. The plant of the logo is either the Ruscus aculeatus, also called butcher's-broom or christmas berry or the Ilex aquifolium, called christmas holly. Instead, the mistletoe is the Viscum album, which has white berries and plain thick leaves (not spiny as the other two plants).

I'd like also to make clear that letting you change the logo is not my purpose, but I want only to friendly make you aware of which is the “right” mistletoe. :-)

All the best,

Luca
question

opened by liuq 9
Contrib folder update

This PR is a proposed solution for #101, Make renderers from the "contrib" folder easier to use. The approach taken is as described in the discussion thread on that issue: basically to move the contrib directory inside the mistletoe source directory.

In addition to moving files around, the PR also includes some documentation updates to explain the use of the contrib folder and how to use the renderers in the contrib directory from the CLI.

NOTE the mistletoe.egg-info folder has not been updated, as the egg format seems to be quite obsolete by now and superseded by wheel. I'd recommend removing the egg-info folder.
breaking-change

opened by anderskaplan 7
Support JIRA renderer

Feature request - add support for rendering Markdown (in particular GFM) to JIRA markup.

JIRA markup documentation at this link: https://jira.atlassian.com/secure/WikiRendererHelpAction.jspa?section=all
help wanted feature

opened by kickingvegas 7
Fix for part of #108, Update to CommonMark v0.30

This PR fixes nine of the failing examples in the CommonMark 0.30 specification. They all had in common that content inside code spans was not handled according to the spec.

This was solved by preserving space and escape sequences during parsing, and by removing leading and trailing space according to the spec during HTML rendering.

The PR also includes a fix for a warning, a simplified way to download the spec tests examples, and an improvement to the spec test runner.
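
The code span rule from the CommonMark 0.30 specification that this refers to can be sketched as follows: line endings become spaces, and one space is stripped from each end only when the content both begins and ends with a space and is not made up entirely of spaces. The function name normalize_code_span is hypothetical and the code is not taken from the PR.

```python
def normalize_code_span(content):
    """Apply the CommonMark 0.30 code span normalization to raw content."""
    # Line endings inside a code span are converted to spaces.
    text = content.replace("\r\n", " ").replace("\n", " ").replace("\r", " ")
    starts_and_ends_with_space = (
        len(text) >= 2 and text[0] == " " and text[-1] == " "
    )
    # Strip a single space from both ends, unless the content is only spaces.
    if starts_and_ends_with_space and text.strip(" "):
        return text[1:-1]
    return text


examples = {
    raw: normalize_code_span(raw)
    for raw in [" foo ", "  foo  ", "  ", "a\nb", " a"]
}
```

Interior spaces are left untouched, which is the change in behavior that custom renderers may need to account for.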

opened by anderskaplan 6
Mistletoe hangs when parsing some specifically formatted Footnotes
>>> import mistletoe
>>> input = "foo bar [1]:\r\nfoo bar\r\n\r\n[1]: https://example.org/\r\nhttps://example.org"
>>> mistletoe.markdown(input)

This never returns, or at least does not return within the limits of my patience.
bug has-workaround
opened by ddevault 6
Document side-effects of renderers' initialisation
Hello, this is possibly an issue concerning the doc and not the code.

Parsing outside of the renderer's context manager:

d = Document('a <b> c')
with HTMLRenderer() as r:
    print(r.render(d))  # <p>a &lt;b&gt; c</p>

Parsing inside of the renderer's context manager:

with HTMLRenderer() as r:
    d = Document('a <b> c')
    print(r.render(d))  # <p>a <b> c</p>

Not sure where the difference in output comes from. CommonMark asks for the second output though, which seems to be what is performed in mistletoe.markdown and by the mistletoe command line.

$ python -V
Python 3.7.0
$ pip freeze
mistletoe==0.7.1
documentation
opened by Rogdham 6
FootnoteLink removing trailing spaces
First of all, this is a great project and incidentally it offers the only way to have a decent Markdown to Jira converter. I encountered one tiny bug in the converter:

Assume the following markdown document:

Test [link] will remove space.

[link]: http://www.nullteilerfrei.de/

Then the output of md2jira will be the following:

Test [link|http://www.nullteilerfrei.de/]will remove space.

As you can see, there should be a space right after the link.
bug
opened by huettenhain 6
Finalize version 1.0.0
These should be the final changes before releasing the next version of mistletoe - this time we should feel already confident enough to name it 1.0.0.

And this is a preview of the release notes:

WARNING - Backwards compatibility changes:

#167: For practical reasons and for following common packaging practices, contrib folder got moved under the mistletoe folder / package. So if you reference a renderer from that folder, you need to reference it as mistletoe.contrib.<renderer> now.

See "change handling of (white)space characters in code spans" below. We keep processing extra whitespace characters at the parsing level, so that all renderers can benefit from it out-of-the-box. Provided that a custom renderer, for whatever reason, relied on all the spaces being collapsed, it needs to do that collapsing itself now (e.g. ' '.join(re.split('[ \n]+', content.strip())), or re.sub('[ \n]+', ' ', content.strip())).
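
The two collapsing expressions mentioned above can be tried on a sample string. The snippet only demonstrates those expressions with the standard re module; the sample content is made up.

```python
import re

content = "  foo   bar\n baz "

# Variant 1: split on runs of spaces/newlines, then join with single spaces.
collapsed_join = " ".join(re.split("[ \n]+", content.strip()))

# Variant 2: substitute each run of spaces/newlines with a single space.
collapsed_sub = re.sub("[ \n]+", " ", content.strip())
```

Both variants give the same result, so a custom renderer can use whichever reads better in its code.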

Added:

JIRARenderer: Support link title notation, i.e. [label](url "title") gets transformed to [label|url|title] (#161)

Fixed:

Make the traverse() function actually work with various input parameters:

filtering by class (#157, #158)

filtering by depth (#159)

Compatibility with the latest CommonMark specification v0.30 (#108):

Change handling of (white)space characters in code spans (#156; in line with commonmark/commonmark-spec#532 and commonmark/commonmark-spec#569)

Fix handling of tabs and parsing continuation lines within list items (#89 via #164)

Fix the other examples, mostly edge cases, from the spec (see #165 and #168 for details; #173 seems to be fixed as well)

Make parsing of link reference definitions (a.k.a. Footnotes) more strict - spec compliant (#132)

Updated:

Smaller inner working refactorings, like #171.
opened by pbodnar 3
Improvements to metadata handling
This is an attempt to make the two places in the code base where metadata is handled -- the ASTRenderer and Token.__repr__ -- work in a similar way, and most importantly: to display only selected attributes on the tokens (as specified in repr_attributes), instead of displaying everything it can find. The PR also includes a small bug fix for Token.__repr__.

This PR was split out from #169.

In more detail:

Updated the AST renderer to display the "official" attributes on the tokens (as specified in repr_attributes), instead of grabbing all attributes it can find.

Added footnotes to the repr_attributes on the Document token. This is the only attribute expected by the ASTRenderer test suite which wasn't already listed in repr_attributes.

Bug fix for Token.__repr__: only instance attributes should be included in repr, not class attributes. The class attributes are temporary and used during parsing. Affects headings.
opened by anderskaplan 0
Include custom HTML attributes
This is similar to this PR#134

I would like to propose an alternative solution that will not alter existing render methods.

Html attribute block Proposed Spec

A line containing the string ${html_attr_name:value, another_html_attr:value} will describe the HTML attributes for the element following it.

Contents within the ${...} string will be a comma separated list of key/value pairs. The > character will separate parent attributes from child attributes.

Example Html attribute block INPUT

${id:my-value, class:some-class}
# Mistletoe is Awesome

${id:my-list, class:foo >class:bar-items}
- Item One
- Item Two
- Item Three

OUTPUT

<h1 id="my-value" class="some-class">Mistletoe is Awesome</h1>
<ul id="my-list" class="foo">
<li class="bar-items">Item One</li>
<li class="bar-items">Item Two</li>
<li class="bar-items">Item Three</li>
</ul>
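
Parsing such an attribute line could be sketched as below. The proposal does not include an implementation, so the names parse_attr_block and parse_pairs, and the choice to return a pair of dictionaries, are assumptions made for illustration.

```python
import re

# A whole line of the form ${ ... } (hypothetical syntax from the proposal).
ATTR_BLOCK = re.compile(r"^\$\{(.*)\}$")


def parse_pairs(text):
    """Turn 'key:value, key:value' into a dictionary."""
    pairs = {}
    for item in text.split(","):
        if item.strip():
            key, _, value = item.partition(":")
            pairs[key.strip()] = value.strip()
    return pairs


def parse_attr_block(line):
    """Return (parent_attrs, child_attrs), or None if the line is not a block."""
    match = ATTR_BLOCK.match(line.strip())
    if match is None:
        return None
    # The > character separates parent attributes from child attributes.
    parent_part, _, child_part = match.group(1).partition(">")
    return parse_pairs(parent_part), parse_pairs(child_part)


parent, child = parse_attr_block("${id:my-list, class:foo >class:bar-items}")
```

For the list example, parent holds the attributes of the ul element and child holds those applied to each li.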
feature
opened by hyperking 8
Modified the HTMLBlock token as described in #163, Inconsistencies in the block tokens

Modified the HTMLBlock to store its content in a RawText child node, like all the other block tokens, instead of a content attribute.

Added a convenience wrapper to preserve backward compatibility. Also fixed a bug in Token.__repr__ which caused class attributes on the tokens to be included and not only instance attributes.

opened by anderskaplan 5

Enable tables which interrupt a paragraph (like GFM does)

Input markdown:

A:
|a|b|
|-|-|
|a|b|

ast_renderer outputs:

    {
      "type": "Paragraph",
      "children": [
        {
          "type": "RawText",
          "content": "A:"
        },
        {
          "type": "LineBreak",
          "soft": true,
          "content": ""
        },
        {
          "type": "RawText",
          "content": "|a|b|"
        },
        {
          "type": "LineBreak",
          "soft": true,
          "content": ""
        },
        {
          "type": "RawText",
          "content": "|-|-|"
        },
        {
          "type": "LineBreak",
          "soft": true,
          "content": ""
        },
        {
          "type": "RawText",
          "content": "|a|b|"
        }
      ]
    },

I believe Table should be output, not RawText.

Actual output in GitHub:

A:

| a | b |
|---|---|
| a | b |

enhancement breaking-change

opened by minaminao 2

Question: Inconsistencies in the block tokens
There are some inconsistencies among the block tokens that maybe should be fixed before stepping up to version 1.0:

Trailing newlines are sometimes preserved and sometimes not. CodeFence and BlockCode preserve them; Paragraph and HTMLBlock do not.

CodeFence and BlockCode keep their content in a single RawText child node, whereas the HTMLBlock keeps it in the content property. In fact, the HTMLBlock is the only block token to have a content property. It is typically used with span tokens.

So what to do about it?

My suggestion would be to remove the trailing newlines from all block tokens. The other consistent option, to keep them for all block tokens, would add a trailing LineBreak to all Paragraph's, and that would just be a pain. Of course there's also the option to leave it as it is.

I would also suggest to place the HTMLBlock content in a single RawText node, so it would be consistent with the other block tokens. Maybe keep its content property, too, in order to not break the API. The content property could be turned into a property getter and marked as deprecated.
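
The deprecated getter suggested here could look roughly like this. The classes below are simplified stand-ins rather than mistletoe's real HTMLBlock and RawText, and the warning text is made up.

```python
import warnings


class RawText:
    def __init__(self, content):
        self.content = content


class HTMLBlock:
    """Simplified stand-in: the content lives in a single RawText child."""

    def __init__(self, text):
        self.children = [RawText(text)]

    @property
    def content(self):
        # Keep the old attribute readable, but steer callers to the child node.
        warnings.warn(
            "HTMLBlock.content is deprecated; read children[0].content instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.children[0].content


block = HTMLBlock("<div>hi</div>")
```

Existing code that reads block.content keeps working and gets a DeprecationWarning, while new code can read the RawText child directly.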

Thoughts?
question
opened by anderskaplan 2

Releases(v0.9.0)

v0.9.0(Aug 18, 2022)
WARNING - Backwards compatibility changes:

Python versions below 3.5 are no longer supported (Python 3.6 end-of-life: December 2021)

html module (available since Python 3.4) is no longer included

As unescaping of HTML character references (entities) is now correctly done in parsing phase already, prospective custom renderers should be altered accordingly provided they do the unescaping themselves now.

HTMLRenderer: single quote is no longer rendered as ', but as ' (see #115; let us know if you would need the old behavior)

BaseRenderer.__getattr__() is removed and replaced by explicit render_*() methods definitions for clearer API (#133)

Added:

Add __repr__() methods to all token classes (#140)

Add type hints for HTMLRenderer methods (#133; supported since Python 3.5)

Fixed:

Correctly unescape HTML character references (entities) for LaTeXRenderer - refactored globally (#135)

Ensure LaTeX renderer uses valid \verb delimiter - not always just | (#149)

GithubWiki unit test failing when run via pytest (#142)

Others:

Simplify implementation of escaping special HTML characters (#135)

Remove unused imports and variables (#146)

Document (for maintainers) how to create a new mistletoe release

Fix and extend docstring documentation of various token types (#154)

This version is about 0.5% faster according to the benchmark test. :)

Source code(tar.gz)
Source code(zip)
mistletoe-0.9.0-py3-none-any.whl(29.27 KB)
v0.8.2(Feb 9, 2022)
Fixed:

Support emphasized inline code (#70)

Failure to parse paragraph containing just "[" (#130) (a side-effect of the fix of #124 in v0.8.1)

IndexError when parsing unfinished-link-like text (#98)

Others:

Small documentation improvements.

Source code(tar.gz)
Source code(zip)
mistletoe-0.8.2-py3-none-any.whl(29.48 KB)
v0.8.1(Dec 18, 2021)
Added:

Documentation (#122 - covering #56, #99 and some other basic topics)

Fixed:

Avoid infinite loop when parsing specific Footnotes (#124)

Read and write to files / console in UTF-8 (in all remaining locations)

Testing:

Benchmark tests made up-to-date (#119)

Source code(tar.gz)
Source code(zip)
mistletoe-0.8.1-py3-none-any.whl(29.27 KB)
v0.8.0(Oct 9, 2021)
Added:

Support escaped pipes in table cells (#85)

traverse() function, to recursively yield children of a token (breadth-first traverse) (#94)

XWiki20Renderer - supports XWiki syntax 2.0 (#113)
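
For readers unfamiliar with breadth-first traversal, here is a minimal self-contained sketch of the idea behind traverse(). The Node class and traverse_breadth_first function are hypothetical stand-ins; the real function's signature and return values may differ.

```python
from collections import deque

class Node:
    # Hypothetical stand-in for a token: a name plus optional children.
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

def traverse_breadth_first(root):
    # Yield every descendant of root, level by level (root itself excluded).
    queue = deque(root.children)
    while queue:
        node = queue.popleft()
        yield node
        queue.extend(node.children)

doc = Node("Document", [
    Node("Heading", [Node("RawText")]),
    Node("Paragraph", [Node("Emphasis", [Node("RawText")]), Node("RawText")]),
])
names = [node.name for node in traverse_breadth_first(doc)]
print(names)
# ['Heading', 'Paragraph', 'RawText', 'Emphasis', 'RawText', 'RawText']
```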

Fixed:

JIRARenderer is basically ready for real life scenarios now:

    Fixed output of empty lines in lists and others (#100)

    Don't HTML-escape special chars (#100)

    Fixed output of table headers (#105)

    Escape special Jira chars (#111)

    Fixed output of empty cells (#109; see JRASERVER-70048)

Read and write to files / console in UTF-8, so that UnicodeDecodeError-s and UnicodeEncodeError-s are avoided (#100)

Various Markdown parsing problems (#86, #91)

Removed over-escaping of URLs in HTML and Jira renderers (#102)

TOCRenderer: The resulting toc property is properly generated (#88)

LaTeXRenderer: Escape underscores and percentages + don't escape in inline code (#93 / #112)

Testing:

Don't limit diffs from assertEquals, so that all differences are visible (#100)

Introduced filesBasedTest decorator for simple tests via conventionally named test files (#100)

v0.7.2(Jun 8, 2019)
Fixed:

Fixed incorrect handling of loose list (#54, #65, thanks @Rogdham and @Vallentin)

Fixed FileWrapper backstep after StopIteration (#58, thanks @Rogdham)

Allow more than one level of token subclass (#62, thanks @Rogdham)

Tables can handle rows with missing columns (#67, thanks @Grollicus)

Fixed unresolved reference (#73, thanks @Vallentin)

Fixed EOL markers in LaTeX tables (#79, thanks @liuq)
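
One way to handle table rows with missing columns is to pad short rows with empty cells up to the header width. The sketch below assumes that approach; normalize_rows is a hypothetical helper, not a mistletoe function.

```python
def normalize_rows(header, rows):
    # Pad short rows with empty cells so every row matches the header width.
    width = len(header)
    return [row + [""] * (width - len(row)) for row in rows]

header = ["name", "type", "default"]
rows = [["renderer", "class", "HTMLRenderer"], ["depth", "int"]]
padded = normalize_rows(header, rows)
print(padded)
# [['renderer', 'class', 'HTMLRenderer'], ['depth', 'int', '']]
```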

Testing:

Add Python 3.7 to integration testing (#63, thanks @nikolas)

v0.7.1(Jun 25, 2018)
Fixed:

only matching the first instance of InlineCode (#50, thanks @huettenhain);

normalize newlines after every line (#51, thanks @elebow and @rsrdesarrollo);

trailing characters after reference definition.

Performance:

small speed boost to ParseToken.append_child.

v0.7(Jun 11, 2018)
Warning: this is a release that breaks backwards compatibility in non-trivial ways (hopefully for the last time!) Read the full release notes if you are updating from a previous version.

Features:

all tests passing in CommonMark test suite (finally! :tada:)

allow specifying span token precedence levels;

new and shiny span_tokenizer.tokenize.

Fixed:

well, all the CommonMark test cases..

ASTRenderer crashes on tables with headers (#48, thanks @timfox456!)

Where I break backwards compatibility:

Previously, span-level tokens needed to have their children attribute specified manually. This is no longer the case: the children attribute is now set automatically based on the class variable parse_group, which corresponds to the regex match group in which child tokens might occur.

As an example, GithubWiki was previously implemented like this:

from mistletoe.span_token import SpanToken, tokenize_inner
import re

class GithubWiki(SpanToken):
    pattern = re.compile(r'...')

    def __init__(self, match_obj):
        super().__init__(match_obj)
        # alternatively, self.children = tokenize_inner(match_obj.group(1))
        self.target = match_obj.group(2)

Now we can write:

from mistletoe.span_token import SpanToken
import re

class GithubWiki(SpanToken):
    pattern = re.compile(r'...')
    parse_inner = True  # default value, can be omitted
    parse_group = 1     # default value, can be omitted
    precedence = 5      # default value, can be omitted

    def __init__(self, match_obj):
        self.target = match_obj.group(2)

If we have a span token that does not need further parsing, we can write:

class Foo(SpanToken):
    pattern = re.compile(r'(foo)')
    parse_inner = False

    def __init__(self, match_obj):
        self.content = match_obj.group(1)
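
To make the role of parse_group more concrete, here is a self-contained sketch that does not import mistletoe. MiniSpanToken, WikiLink, make_token and the regex are all hypothetical. Where the real tokenizer would presumably parse the selected group into child tokens, this sketch just stores the raw text.

```python
import re

class MiniSpanToken:
    # Hypothetical, simplified base class; not mistletoe's actual SpanToken.
    parse_inner = True
    parse_group = 1

class WikiLink(MiniSpanToken):
    pattern = re.compile(r"\[\[ *(.+?) *\| *(.+?) *\]\]")

    def __init__(self, match_obj):
        # Only token-specific data is set here; children are filled in later.
        self.target = match_obj.group(2)

def make_token(token_cls, match_obj):
    # Stand-in for the tokenizer: build the token, then derive its children
    # from the match group named by parse_group.
    token = token_cls(match_obj)
    if token_cls.parse_inner:
        token.children = [match_obj.group(token_cls.parse_group)]
    return token

match = WikiLink.pattern.search("see [[the docs | Home]] first")
token = make_token(WikiLink, match)
print(token.children, token.target)  # ['the docs'] Home
```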

See the readme for more details.
v0.6.2(May 27, 2018)
Features:

CommonMark compliant CodeFence;

CommonMark compliant BlockCode;

CommonMark compliant HTMLBlock;

CommonMark compliant HTMLSpan;

CommonMark compliant AutoLink;

CommonMark compliant InlineCode;

CommonMark compliant Heading;

CommonMark compliant SetextHeading;

added span-level token LineBreak;

better handling of lazy-continuation in Quote;

Footnotes can be defined in any block-level containers.

Fixes:

loose lists conform to CommonMark spec (#44, thanks @huettenhain);

not parsing sub-lists deeper than two levels (#46, thanks @daerhu);

FileWrapper._index should not go below -1.

Development:

refactored handling of SetextHeading;

removed block_tokenizer.MismatchException;

removed _children attribute, using children directly; (potentially breaking change?)

renamed Separator to ThematicBreak;

renamed FootnoteBlock to Footnote;

tokenize and tokenize_inner return lists of tokens;

refactored CommonMark testing script.

v0.6.1(May 13, 2018)
Features:

CommonMark compliant CodeFence (#41);

allow multiple backticks for InlineCode;

strips whitespace around InlineCode;
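
The two InlineCode items above can be sketched with a single regular expression: a run of one or more backticks opens the span, and only a run of the same length closes it. The pattern and helper name below are assumptions for illustration, not the pattern mistletoe uses.

```python
import re

# A run of backticks opens the span; the same run (\1) must close it.
INLINE_CODE = re.compile(r"(?<!`)(`+)(?!`)(.+?)(?<!`)\1(?!`)", re.DOTALL)

def inline_code_contents(text):
    # Return the content of each inline code span, stripped of outer whitespace.
    return [m.group(2).strip() for m in INLINE_CODE.finditer(text)]

found = inline_code_contents("use `` a`b `` or `c`")
print(found)  # ['a`b', 'c']
```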

Fixed:

Separator needs at least three characters;

indented code blocks should not interrupt paragraphs (#40, thanks @joncass);

crashes when sublists have different marker type (#42, thanks @JBartlett86);

typo in Paragraph.read (#43, thanks @NatTupper);

preliminary fixes for handling loose lists (#44, thanks @huettenhain);

removed corrupted block_token.until function;

HTML code language tags start with "language-".

v0.6(May 2, 2018)
Features:

added Pygments renderer to contrib (#35, thanks to @Bridouz);

HTMLSpan now supports comments (#37);

(more or less) CommonMark compliant List implementation (#40).

Fixes:

changed logo to an actual mistletoe (#21, thanks to @liuq);

allow lists after block tokens without newlines (#34, thanks to @huettenhain);

recognize headings within paragraphs (#36);

disallow opening space in html tag (#37).

Performance:

removed FileWrapper.normalize;

utilized universal newline mode.

Breaking changes:

BlockToken.start does not advance file iterator.

Special shout-out to @joncass for raising the unattributed issues above, and giving me the motivation to finally fix the list implementation!

Note that this is a release with major changes. If you notice any rough edges (as there will certainly be), please do not hesitate to open an issue.
v0.5.5(Apr 15, 2018)
Features:

added default render methods for all tokens;

added reset_tokens function to block_token and span_token;

allowed BlockToken.read to return any iterable;

BaseRenderer is now available at mistletoe.BaseRenderer;

added Scheme.

Fixes:

throw better AttributeError when accessing RawText.children (#31, thanks @jabdoa2);

disallow whitespace in span_token.Link (#32, thanks @DMRobertson);

allowed empty alt text in Image and FootnoteImage (#33, thanks @joncass).

v0.5.4(Mar 27, 2018)
Features:

md2jira: read from stdin if no input file is given (#27, thanks @alexkolson!);

better command line options and help messages;

auto-splitlines when mistletoe.markdown is given a string;

inline tokens can span multiple lines (#30, thanks @duckwork!).
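
The auto-splitlines behaviour can be pictured as a small input-normalization step. The helper below is a hypothetical sketch of that idea, not the actual code behind mistletoe.markdown.

```python
def to_lines(source):
    # Accept either a string or any iterable of lines (such as an open file).
    if isinstance(source, str):
        # keepends=True preserves the newline at the end of each line.
        return source.splitlines(keepends=True)
    return list(source)

from_string = to_lines("# Title\nsome text\n")
from_list = to_lines(["# Title\n", "some text\n"])
print(from_string)  # ['# Title\n', 'some text\n']
print(from_string == from_list)  # True
```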

Fixes:

TableRow now supports table shorthand (#29, thanks @huettenhain!);

normalize line breaks.

... plus various refactors and documentation improvements.
v0.5.3(Feb 5, 2018)
Features:

shortened mistletoe.markdown keyword argument name (renderer_cls to renderer);

removed List reference lookup;

list items can contain paragraphs (CM5.2);

shorthand syntax added for tables (#26).

Fixed:

ignored invisible characters at line end for CodeFence (#24);

fixed extra newlines for headings in JIRARenderer (#25, thanks @huettenhain!);

Development:

moved documentation to docs directory;

solved the biggest mystery in the codebase.

v0.5.2(Jan 30, 2018)
Fixed:

contrib/md2jira.py was importing from the wrong directory (#20, thanks to @cctile);

characters in LaTeX lstlisting environment should not be escaped (#23, thanks to @liuq).

v0.5.1(Jan 24, 2018)
Features:

added JIRA Markdown support (thanks to @cctile);

Strong / Emphasis elements must open with non-whitespace characters;

no more than 6 levels of Heading;

Fixed:

render_table crashing when iterating token.children (#12);

FootnoteLink engulfing trailing spaces (#14);

Paragraph.read not stopping before CodeFence (#15);

Development:

added testing for CommonMark compliance;

merged plugins directory into contrib (thanks to @huettenhain);

Lastly, I miss cheeseburgers. 🍔
v0.5(Jan 9, 2018)
Features:

BlockToken is a hell of a lot more flexible now;

add_token accepts an additional position argument;

Newlines are now preserved in Paragraph tokens.

Fixed:

ASTRenderer fails to serialize FootnoteAnchor.

Where I broke backwards compatibility:

BlockToken now has start and read methods, instead of a match method. This allows for much more granular control of parsing when defining custom block-level tokens.

Heading and SetextHeading are now different token classes, though their renderer functions are still the same.

CodeFence and BlockCode are now different token classes, though their renderer functions are still the same.
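
The start / read split can be sketched as follows: start looks at a single line and decides whether the token begins there, and read consumes the lines that belong to it. Everything below (the Admonition token, its "!!!" syntax and the tokenize driver) is hypothetical and much simpler than the real block tokenizer.

```python
class Admonition:
    # Hypothetical block token: a line starting with "!!!" up to a blank line.
    def __init__(self, lines):
        self.lines = lines

    @staticmethod
    def start(line):
        # Decide, from a single line, whether this token begins here.
        return line.startswith("!!!")

    @staticmethod
    def read(lines):
        # Consume lines from the iterator until a blank line is reached.
        collected = []
        for line in lines:
            if not line.strip():
                break
            collected.append(line.rstrip("\n"))
        return collected

def tokenize(lines, token_cls):
    # Simplified driver: test each line, let the token read its own block.
    lines = list(lines)
    index, tokens = 0, []
    while index < len(lines):
        if token_cls.start(lines[index]):
            block = token_cls.read(iter(lines[index:]))
            tokens.append(token_cls(block))
            index += len(block) + 1  # skip the block and its closing blank line
        else:
            index += 1
    return tokens

source = ["intro\n", "!!! note\n", "be careful\n", "\n", "outro\n"]
tokens = tokenize(source, Admonition)
print(tokens[0].lines)  # ['!!! note', 'be careful']
```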

What has been in my life for the past few weeks:

❄️
v0.4.1(Dec 25, 2017)
Features:

added support for empty or self-closing HTMLSpan;

added --renderer flag for command line usage;

token.children now has idempotent behavior!

Development:

refactored command line functionalities;

testing and error handling.

Merry Christmas! 🎄
v0.4(Nov 18, 2017)
Features:

removed argument footnotes from render functions;

make custom tokens usable without invoking renderer as context manager (#5).

Development:

documentation updates and bug fixes (more to come!);

refactoring and slight performance gains.

Now for a beer emoji: 🍺
v0.3.1(Sep 1, 2017)
Features:

auto-closes unclosed code fences;

adds support for "[footnote]"-style links;

interactive mode adds keyboard control support;

accepts multiple filenames from the command line.

Fixed:

render_image function missing argument;

mistletoe crashes with empty list items;

removes redundant whitespace for empty lines in code fences;

fixed performance issues on PyPy: very, very fast now.

v0.3(Aug 27, 2017)
Features:

span-level token constructors now accept match objects;

simplified adding custom tokens to the parsing process;

simplified creating new renderer classes.

Development:

refactored (most of the) tests;

cleaned up benchmarking script;

cleaned up some spaghetti in tokenizer modules.

v0.2.1(Aug 14, 2017)
Features:

added table-of-contents plugin;

added rudimentary MathJax support in mathjax plugin;

Fixed:

mistletoe crashes with text between underscores;

incorrect handling of hashes in code blocks;

Relicensed under MIT.
v0.2(Aug 7, 2017)
Features:

added support for footnote-style images and links;

added support for LaTeX renderer;

renderer classes are now context managers (see README).
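
One plausible reason for making renderers context managers is that any extra tokens they register can be undone when the with-block ends. The sketch below illustrates that pattern only; MiniRenderer and the _active_tokens registry are hypothetical and do not mirror mistletoe's internals.

```python
_active_tokens = ["Strong", "Emphasis"]  # hypothetical global token registry

class MiniRenderer:
    def __init__(self, *extras):
        self.extras = list(extras)

    def __enter__(self):
        # Register the extra tokens for the duration of the with-block.
        _active_tokens.extend(self.extras)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Restore the registry so later renders are unaffected.
        for name in self.extras:
            _active_tokens.remove(name)
        return False  # do not swallow exceptions

with MiniRenderer("GithubWiki") as renderer:
    inside = list(_active_tokens)
after = list(_active_tokens)
print(inside)  # ['Strong', 'Emphasis', 'GithubWiki']
print(after)   # ['Strong', 'Emphasis']
```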

Development:

added test suite for LaTeX renderer;

added benchmarking script for performance comparison;

added scripts to compare render output across commits;

added CI testing for versions up to Python 3.3.

Fixed:

a bunch of regex craziness;

outdated documentation.

v0.1.1(Jul 26, 2017)
This release is mainly to celebrate that I shaved. Other than that:

Block-level token support:

heading (ATX and setext);

quote;

paragraph;

block code (code fence and indented code);

lists and nested lists;

tables;

horizontal rule;

Span-level token support:

strong (with asterisks or underscore);

emphasis (with asterisks or underscore);

inline code;

strikethrough;

images (inline link only);

links (inline link only) and autolinks;

Output format support:

render to HTML;

render to mdast-like AST;

render to really janky LaTeX;

Lastly, hello world!