A fast, extensible and spec-compliant Markdown parser in pure Python.

Overview

mistletoe

mistletoe is a Markdown parser in pure Python, designed to be fast, spec-compliant and fully customizable.

Apart from being the fastest CommonMark-compliant Markdown parser implementation in pure Python, mistletoe also supports easy definitions of custom tokens. Parsing Markdown into an abstract syntax tree also allows us to swap out renderers for different output formats, without touching any of the core components.

Remember to spell mistletoe in lowercase!

Features

  • Fast: mistletoe is the fastest implementation of CommonMark in Python: 2 to 3 times as fast as CommonMark-py, and still roughly 30% faster than Python-Markdown. Running with PyPy yields performance comparable to mistune.

    See the performance section for details.

  • Spec-compliant: CommonMark is a useful, high-quality project. mistletoe follows the CommonMark specification to resolve ambiguities during parsing. Outputs are predictable and well-defined.

  • Extensible: Strikethrough and tables are supported natively, and custom block-level and span-level tokens can easily be added. Writing a new renderer for mistletoe is a relatively trivial task.

    You can even write a Lisp in it.

Some alternative output formats: LaTeX (LaTeXRenderer), Jira markup (JIRARenderer) and XWiki Syntax 2.0 (XWiki20Renderer).

Installation

mistletoe is tested for Python 3.5 and above. Install mistletoe with pip:

pip3 install mistletoe

Alternatively, clone the repo:

git clone https://github.com/miyuchina/mistletoe.git
cd mistletoe
pip3 install -e .

See the contributing doc for how to contribute to mistletoe.

Usage

Basic usage

Here's how you can use mistletoe in a Python script:

import mistletoe

with open('foo.md', 'r') as fin:
    rendered = mistletoe.markdown(fin)

mistletoe.markdown() uses mistletoe's default settings: allowing HTML mixins and rendering to HTML. The function also accepts an additional argument renderer. To produce LaTeX output:

import mistletoe
from mistletoe.latex_renderer import LaTeXRenderer

with open('foo.md', 'r') as fin:
    rendered = mistletoe.markdown(fin, LaTeXRenderer)

Finally, here's how you would manually specify extra tokens and a renderer for mistletoe. In the following example, we use HTMLRenderer to render the AST, which adds HTMLBlock and HTMLSpan to the normal parsing process.

from mistletoe import Document, HTMLRenderer

with open('foo.md', 'r') as fin:
    with HTMLRenderer() as renderer:
        rendered = renderer.render(Document(fin))

From the command-line

pip installation enables mistletoe's command-line utility. Type the following directly into your shell:

mistletoe foo.md

This will transpile foo.md into HTML, and dump the output to stdout. To save the HTML, direct the output into a file:

mistletoe foo.md > out.html

You can pass in custom renderers by including the full path to your renderer class after a -r or --renderer flag:

mistletoe foo.md --renderer custom_renderer.CustomRenderer

Running mistletoe without specifying a file will land you in interactive mode. Like Python's REPL, interactive mode allows you to test how your Markdown will be interpreted by mistletoe:

mistletoe [version 0.7.2] (interactive)
Type Ctrl-D to complete input, or Ctrl-C to exit.
>>> some **bold** text
... and some *italics*
...
<p>some <strong>bold</strong> text
and some <em>italics</em></p>
>>>

The interactive mode also accepts the --renderer flag:

mistletoe [version 0.7.2] (interactive)
Type Ctrl-D to complete input, or Ctrl-C to exit.
Using renderer: LaTeXRenderer
>>> some **bold** text
... and some *italics*
...
\documentclass{article}
\begin{document}

some \textbf{bold} text
and some \textit{italics}
\end{document}
>>>

Performance

mistletoe is the fastest CommonMark-compliant implementation in Python. Try the benchmarks yourself by running:

$ python3 test/benchmark.py  # all results in seconds
Test document: test/samples/syntax.md
Test iterations: 1000
Running tests with markdown, mistune, commonmark, mistletoe...
==============================================================
markdown: 33.28557115700096
mistune: 8.533771439999327
commonmark: 84.54588776299897
mistletoe: 23.5405140980001

We notice that Mistune is the fastest Markdown parser, and by a good margin, which demands some explanation. mistletoe's biggest performance penalty comes from stringently following the CommonMark spec, which outlines a highly context-sensitive grammar for Markdown. Mistune takes a simpler approach to the lexing and parsing process, but this means that it cannot handle more complex cases, e.g., precedence of different types of tokens, escaping rules, etc.

To see why this might be important to you, consider the following Markdown input (example 392 from the CommonMark spec):

***foo** bar*

The natural interpretation is:

<p><em><strong>foo</strong> bar</em></p>

... and it is indeed the output of Python-Markdown, Commonmark-py and mistletoe. Mistune (version 0.8.3) greedily parses the first two asterisks in the first delimiter run as a strong-emphasis opener, the second delimiter run as its closer, but does not know what to do with the remaining asterisk in between:

<p><strong>*foo</strong> bar*</p>

The implication of this runs deeper, and it is not simply a matter of dogmatically following an external spec. By adopting a more flexible parsing algorithm, mistletoe allows us to assign a precedence level to each token class, including custom ones that you might write in the future. Code spans, for example, have a higher precedence level than emphasis, so

*foo `bar* baz`

... is parsed as:

<p>*foo <code>bar* baz</code></p>

... whereas Mistune parses this as:

<p><em>foo `bar</em> baz`</p>

Of course, it is not impossible for Mistune to modify its behavior, and parse these two examples correctly, through more sophisticated regexes or some other means. It is nevertheless highly likely that, when Mistune implements all the necessary context checks, it will suffer from the same performance penalties.

Contextual analysis is why Python-Markdown is slow, and why CommonMark-py is slower. The lack thereof is the reason mistune enjoys stellar performance among similar parser implementations, as well as the limitations that come with these performance benefits.

If you want an implementation that focuses on raw speed, mistune remains a solid choice. If you need a spec-compliant and readily extensible implementation, however, mistletoe is still marginally faster than Python-Markdown, while supporting more functionality (lists in block quotes, for example), and significantly faster than CommonMark-py.
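
As a rough sanity check, the relative speeds implied by the CPython benchmark run shown earlier can be recomputed directly. This is only arithmetic on the timings printed in that run; the helper name time_ratio is made up for this sketch, and a different machine, test document or library version will give different ratios.

```python
# Timings (seconds for 1000 iterations) copied from the CPython benchmark run above.
timings = {
    "markdown": 33.28557115700096,
    "mistune": 8.533771439999327,
    "commonmark": 84.54588776299897,
    "mistletoe": 23.5405140980001,
}

def time_ratio(slower, faster):
    """Return how many times longer `slower` took than `faster` in this run."""
    return timings[slower] / timings[faster]

# mistletoe compared with the two slower parsers.
for name in ("markdown", "commonmark"):
    print(f"{name} took {time_ratio(name, 'mistletoe'):.2f}x as long as mistletoe")

# mistune is quicker, so the ratio is reported the other way round.
print(f"mistletoe took {time_ratio('mistletoe', 'mistune'):.2f}x as long as mistune")
```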

One last note: another bottleneck of mistletoe compared to mistune is function overhead. Because mistletoe, unlike mistune, splits its functionality into modules, function lookups can take significantly longer than in mistune. To boost performance further, it is suggested to run mistletoe on PyPy. Benchmark results show that on PyPy, mistletoe's performance is on par with mistune:

$ pypy3 test/benchmark.py mistune mistletoe
Test document: test/samples/syntax.md
Test iterations: 1000
Running tests with mistune, mistletoe...
========================================
mistune: 13.645681533998868
mistletoe: 15.088351159000013

Developer's Guide

Here's an example of adding GitHub-style wiki links to the parsing process, and providing a renderer for this new token.

A new token

GitHub wiki links are span-level tokens, meaning that they reside inline, and don't really look like chunky paragraphs. To write a new span-level token, all we need to do is make a subclass of SpanToken:

from mistletoe.span_token import SpanToken

class GithubWiki(SpanToken):
    pass

mistletoe uses regular expressions to search for span-level tokens in the parsing process. As a refresher, a GitHub wiki link looks something like this: [[alternative text | target]]. We define a class variable, pattern, that stores the compiled regex:

import re
from mistletoe.span_token import SpanToken

class GithubWiki(SpanToken):
    # Group 1 captures the alternative text, group 2 the link target.
    pattern = re.compile(r"\[\[ *(.+?) *\| *(.+?) *\]\]")
    def __init__(self, match):
        pass

The regex will be picked up by SpanToken.find, which is used by the tokenizer to find all tokens of its kind in the document. If regexes are too limited for your use case, consider overriding the find method; it should return a list of all token occurrences.
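
To see what this pattern captures, it can be tried on its own with Python's re module. The sketch below does not go through SpanToken.find or any other part of mistletoe; find_wiki_links is a hypothetical helper written only for this illustration.

```python
import re

# Same pattern as in the GithubWiki token class.
pattern = re.compile(r"\[\[ *(.+?) *\| *(.+?) *\]\]")

def find_wiki_links(text):
    """Return (alternative text, target) pairs for every wiki link in text."""
    return [(m.group(1), m.group(2)) for m in pattern.finditer(text)]

links = find_wiki_links("See [[alternative text | target]] and [[*alt*|link]].")
for alt_text, target in links:
    print(repr(alt_text), "->", repr(target))
```

Group 1 holds the alternative text with the surrounding spaces trimmed, and group 2 holds the target.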

Three other class variables are available for our custom token class, and their default values are shown below:

class SpanToken:
    parse_group = 1
    parse_inner = True
    precedence = 5

Note that alternative text can also contain other span-level tokens. For example, [[*alt*|link]] is a GitHub link with an Emphasis token as its child. To parse child tokens, parse_inner should be set to True (the default value in this case), and parse_group should correspond to the match group in which child tokens might occur (also the default value, 1, in this case).

Once these two class variables are set correctly, the GithubWiki.children attribute will automatically be set to the list of child tokens. Note that there is no need to manually set this attribute, unlike in previous versions of mistletoe.

Lastly, the SpanToken constructor takes a regex match object as its argument. We can simply store the target, taken from match_obj.group(2), as an attribute.

import re
from mistletoe.span_token import SpanToken

class GithubWiki(SpanToken):
    pattern = re.compile(r"\[\[ *(.+?) *\| *(.+?) *\]\]")
    def __init__(self, match_obj):
        # Group 2 of the pattern holds the link target.
        self.target = match_obj.group(2)

There you go: a new token in a handful of lines of code.

Side note about precedence

Normally there is no need to override the precedence value of a custom token. The default value is the same as InlineCode, AutoLink and HTMLSpan, which means that whichever token comes first will be parsed. In our case:

`code with [[ text` | link ]]

... will be parsed as:

<code>code with [[ text</code> | link ]]

If we set GithubWiki.precedence = 6, we have:

`code with <a href="link">text`</a>
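
The overlap behind this example can be inspected with plain regular expressions. The sketch below does not use mistletoe at all: the inline-code pattern is a simplified stand-in (an assumption, not the library's own pattern), and the code only shows where each candidate token would start and end.

```python
import re

text = "`code with [[ text` | link ]]"

# Simplified stand-in for an inline-code pattern (assumption, not mistletoe's own).
code_pattern = re.compile(r"`(.+?)`")
# Pattern from the GithubWiki token class.
wiki_pattern = re.compile(r"\[\[ *(.+?) *\| *(.+?) *\]\]")

code_match = code_pattern.search(text)
wiki_match = wiki_pattern.search(text)

print("code span:", code_match.span(), repr(code_match.group(1)))
print("wiki link:", wiki_match.span(), repr(wiki_match.group(1)), "->", wiki_match.group(2))

# The two candidates share characters, so only one of them can become a token.
overlap = code_match.start() < wiki_match.end() and wiki_match.start() < code_match.end()
print("overlap:", overlap)
```

With equal precedence the candidate that starts first (the code span) is kept. With a higher precedence the wiki link is kept instead, and its alternative text swallows the closing backtick.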

A new renderer

Adding a custom token to the parsing process usually involves a lot of nasty implementation details. Fortunately, mistletoe takes care of most of them for you. Simply passing your custom token class to super().__init__() does the trick:

from mistletoe.html_renderer import HTMLRenderer

class GithubWikiRenderer(HTMLRenderer):
    def __init__(self):
        super().__init__(GithubWiki)

We then only need to tell mistletoe how to render our new token:

def render_github_wiki(self, token):
    template = '<a href="{target}">{inner}</a>'
    target = token.target
    inner = self.render_inner(token)
    return template.format(target=target, inner=inner)

Cleaning up, we have our new renderer class:

from mistletoe.html_renderer import HTMLRenderer, escape_url

class GithubWikiRenderer(HTMLRenderer):
    def __init__(self):
        super().__init__(GithubWiki)

    def render_github_wiki(self, token):
        template = '<a href="{target}">{inner}</a>'
        target = escape_url(token.target)
        inner = self.render_inner(token)
        return template.format(target=target, inner=inner)

Take it for a spin?

It is preferred that all mistletoe's renderers be used as context managers. This is to ensure that your custom tokens are cleaned up properly, so that you can parse other Markdown documents with different token types in the same program.

from mistletoe import Document
from contrib.github_wiki import GithubWikiRenderer

with open('foo.md', 'r') as fin:
    with GithubWikiRenderer() as renderer:
        rendered = renderer.render(Document(fin))

For more info, take a look at the base_renderer module in mistletoe. The docstrings might give you a more granular idea of customizing mistletoe to your needs.

Why mistletoe?

"For fun," says David Beazley.

Copyright & License

Comments
  • Add __repr__ methods to all token classes

    This pull request adds a __repr__ method to all token classes. To simplify the implementation, it introduces a common base class token.Token for block_token.BlockToken and span_token.SpanToken. For the following example program:

    from mistletoe import block_token, span_token, utils
    d = block_token.Document(open("test/samples/quotes.md", "r").read())
    print([tr.node for tr in utils.traverse(d)])
    

    The output looks like this without __repr__ methods:

    [<mistletoe.block_token.Heading at 0x10fa5aec0>,
     <mistletoe.block_token.Quote at 0x10f99e2c0>,
     <mistletoe.block_token.Paragraph at 0x10fb95a50>,
     <mistletoe.block_token.Quote at 0x10fb94130>,
     <mistletoe.block_token.Paragraph at 0x10fb97970>,
     <mistletoe.block_token.Quote at 0x10fb94c10>,
     <mistletoe.block_token.Paragraph at 0x10fbc00d0>,
     <mistletoe.block_token.Quote at 0x10fbc0d30>,
     <mistletoe.block_token.Paragraph at 0x10fbc1270>,
     <mistletoe.span_token.RawText at 0x10fa5b4f0>,
     <mistletoe.block_token.Paragraph at 0x10f99d8a0>,
     <mistletoe.span_token.RawText at 0x10fb95a20>,
     <mistletoe.block_token.Paragraph at 0x10fb97910>,
     <mistletoe.block_token.Paragraph at 0x10fb97c70>,
     <mistletoe.span_token.RawText at 0x10fb97ca0>,
     <mistletoe.block_token.List at 0x10fbc0c40>,
     <mistletoe.span_token.RawText at 0x10fbc0580>,
     <mistletoe.block_token.Quote at 0x10fbc0850>,
     <mistletoe.block_token.Paragraph at 0x10fbc0fd0>,
     <mistletoe.block_token.Quote at 0x10fbc1090>,
     <mistletoe.span_token.RawText at 0x10fbc12a0>,
     <mistletoe.span_token.RawText at 0x10f99cd60>,
     <mistletoe.span_token.RawText at 0x10fb94160>,
     <mistletoe.span_token.RawText at 0x10fb948b0>,
     <mistletoe.block_token.ListItem at 0x10fbc0880>,
     <mistletoe.block_token.ListItem at 0x10fbc03d0>,
     <mistletoe.block_token.Paragraph at 0x10fbc0f10>,
     <mistletoe.span_token.RawText at 0x10fbc1000>,
     <mistletoe.block_token.Paragraph at 0x10fbc10f0>,
     <mistletoe.block_token.Paragraph at 0x10fbc11b0>,
     <mistletoe.block_token.Paragraph at 0x10fbc04f0>,
     <mistletoe.block_token.Paragraph at 0x10fbc0460>,
     <mistletoe.span_token.RawText at 0x10fbc0f70>,
     <mistletoe.span_token.RawText at 0x10fbc1120>,
     <mistletoe.span_token.RawText at 0x10fbc11e0>,
     <mistletoe.span_token.RawText at 0x10fbc0640>,
     <mistletoe.span_token.RawText at 0x10fbc0040>]
    

    and like this with __repr__ methods:

    [<mistletoe.block_token.Heading with 1 child content='Quotes' level=2 at 0x10f748c40>,
     <mistletoe.block_token.Quote with 1 child at 0x10f748250>,
     <mistletoe.block_token.Paragraph with 1 child at 0x10f776830>,
     <mistletoe.block_token.Quote with 2 children at 0x10f7748b0>,
     <mistletoe.block_token.Paragraph with 1 child at 0x10f777310>,
     <mistletoe.block_token.Quote with 1 child at 0x10f774220>,
     <mistletoe.block_token.Paragraph with 1 child at 0x10f776440>,
     <mistletoe.block_token.Quote with 3 children at 0x10f7b9c90>,
     <mistletoe.block_token.Paragraph with 1 child at 0x10f7b82e0>,
     <mistletoe.span_token.RawText content='Quotes' at 0x10f748340>,
     <mistletoe.block_token.Paragraph with 1 child at 0x10f664a90>,
     <mistletoe.span_token.RawText content='A response to single quote.' at 0x10f776ec0>,
     <mistletoe.block_token.Paragraph with 1 child at 0x10f774370>,
     <mistletoe.block_token.Paragraph with 1 child at 0x10f777490>,
     <mistletoe.span_token.RawText content='Quote with a list inside:' at 0x10f7776d0>,
     <mistletoe.block_token.List with 2 children loose=False start=1 at 0x10f7762f0>,
     <mistletoe.span_token.RawText content='Nested quotes:' at 0x10f7b9150>,
     <mistletoe.block_token.Quote with 1 child at 0x10f7ba410>,
     <mistletoe.block_token.Paragraph with 1 child at 0x10f7b8580>,
     <mistletoe.block_token.Quote with 2 children at 0x10f7b8b50>,
     <mistletoe.span_token.RawText content='Another paragraph.' at 0x10f7b8be0>,
     <mistletoe.span_token.RawText content='A single quote' at 0x10f665060>,
     <mistletoe.span_token.RawText content='A quote spreading...' at 0x10f777be0>,
     <mistletoe.span_token.RawText content='... multiple paragraphs' at 0x10f776bc0>,
     <mistletoe.block_token.ListItem with 1 child leader='1.' prepend=3 loose=False at 0x10f775d50>,
     <mistletoe.block_token.ListItem with 1 child leader='2.' prepend=3 loose=False at 0x10f776d40>,
     <mistletoe.block_token.Paragraph with 1 child at 0x10f7ba3e0>,
     <mistletoe.span_token.RawText content='Quoted paragraph.' at 0x10f7b8460>,
     <mistletoe.block_token.Paragraph with 1 child at 0x10f7b9570>,
     <mistletoe.block_token.Paragraph with 1 child at 0x10f7b9ae0>,
     <mistletoe.block_token.Paragraph with 1 child at 0x10f7749a0>,
     <mistletoe.block_token.Paragraph with 1 child at 0x10f776b30>,
     <mistletoe.span_token.RawText content='Nested line quote' at 0x10f7b8640>,
     <mistletoe.span_token.RawText content='Nested block quote.' at 0x10f7b86a0>,
     <mistletoe.span_token.RawText content='Jira does not seem to support '...+51 at 0x10f7b93f0>,
     <mistletoe.span_token.RawText content='first' at 0x10f776260>,
     <mistletoe.span_token.RawText content='second' at 0x10f7771c0>]
    
    enhancement 
    opened by doerwalter 14
  • Extracting content by intercepting render_raw_text

    Thanks for this nice project.

    I may be a noob at this, but I was able to parse a readme and get the returned content wrapped in its HTML elements. However, the returned content is one giant text. So my question is: is there any built-in function to extract only the content and not the code, as in getting only the data from ..<p></p> or <h5></h5> ...?

    question 
    opened by samayo 14
  • Ensure LaTeX renderer uses valid \verb delimiter

    The LaTeX renderer uses \verb for inline code, but the delimiter is always a vertical bar, which produces incorrect output when the inline code also contains a vertical bar (e.g., example | pipe).

    Rather than using a single static character (i.e., a vertical bar), this change modifies render_inline_code to search for a non-letter delimiter that does not appear in the inline code. If no such delimiter can be found, a RuntimeError is raised to avoid incorrect output.

    Note that the list of possible delimiters is not exhaustive. For example, numbers (0, 1, 2, etc.) are all valid delimiters for \verb but are omitted from the search.

    Fixes #149

    opened by joel-coffman 13
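
The delimiter search described in this pull request can be sketched as follows. This is an illustration, not the code from the PR: the function names pick_verb_delimiter and latex_inline_code, and the candidate string, are assumptions made for this example.

```python
def pick_verb_delimiter(code, candidates="|!\"'=+"):
    """Return the first candidate character that does not occur in code."""
    for delimiter in candidates:
        if delimiter not in code:
            return delimiter
    # As described above: fail loudly rather than emit broken LaTeX.
    raise RuntimeError("no usable \\verb delimiter for: " + repr(code))

def latex_inline_code(code):
    """Wrap code in \\verb using a delimiter that is absent from the code."""
    delimiter = pick_verb_delimiter(code)
    return "\\verb" + delimiter + code + delimiter

print(latex_inline_code("example"))
print(latex_inline_code("example | pipe"))
```

The first call keeps the vertical bar as the delimiter; the second falls back to the next candidate because the code itself contains a vertical bar.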
  • Problem rendering pipe characters in code blocks within tables

    Hello, thank you heartily for this great library. It took me quite a few months to encounter a bug, which I am reporting here.

    Mistletoe seems unable to render pipe characters in code blocks within tables. Here is an example of observed behavior:

    mistletoe [version 0.7.2] (interactive)
    Type Ctrl-D to complete input, or Ctrl-C to exit.
    >>> | Table | Header |
    ... |---    |---     |
    ... | `<|>` | `<|>`  |
    ...
    ... ^Z
    
    <table>
    <thead>
    <tr>
    <th align="left">Table</th>
    <th align="left">Header</th>
    </tr>
    </thead>
    <tbody>
    <tr>
    <td align="left">`&lt;</td>
    <td align="left">&gt;`</td>
    <td align="left">`&lt;</td>
    <td align="left">&gt;`</td>
    </tr>
    </tbody>
    </table>
    >>>
    

    The following was my expected rendering:

    | Table | Header |
    |---    |---     |
    | <\|>  | <\|>   |

    Interestingly, this behavior can not be expected in GFM, which requires escapes for pipes: https://github.com/github/markup/issues/1078

    However, escaping pipes is not working in mistletoe:

    >>> | Table | Header |
    ... |---    |---     |
    ... |`<\|>` | `<\|>` |
    ... ^Z
    
    <table>
    <thead>
    <tr>
    <th align="left">Table</th>
    <th align="left">Header</th>
    </tr>
    </thead>
    <tbody>
    <tr>
    <td align="left">`&lt;\</td>
    <td align="left">&gt;`</td>
    <td align="left">`&lt;\</td>
    <td align="left">&gt;`</td>
    </tr>
    </tbody>
    </table>
    
    enhancement 
    opened by huettenhain 9
  • Mistletoe plant and logo

    Hi,

    first of all, thanks for the tool: it's a piece of cake for markdown parsing (and also for custom rendering)!

    Just for the sake of precision, I'd like to point out a common misconception about mistletoe, which is not the plant depicted in the logo. The plant of the logo is either the Ruscus aculeatus, also called butcher's-broom or christmas berry or the Ilex aquifolium, called christmas holly. Instead, the mistletoe is the Viscum album, which has white berries and plain thick leaves (not spiny as the other two plants).

    I'd like also to make clear that letting you change the logo is not my purpose, but I want only to friendly make you aware of which is the “right“ mistletoe. :-)

    All the best,

    Luca

    question 
    opened by liuq 9
  • Support JIRA renderer

    Feature request - add support for rendering Markdown (in particular GFM) to JIRA markup.

    JIRA markup documentation at this link: https://jira.atlassian.com/secure/WikiRendererHelpAction.jspa?section=all

    help wanted feature 
    opened by kickingvegas 7
  • Fix for part of #108, Update to CommonMark v0.30

    This PR fixes nine of the failing examples in the CommonMark 0.30 specification. They all had in common that content inside code spans was not handled according to the spec.

    This was solved by preserving space and escape sequences during parsing, and by removing leading and trailing space according to the spec during HTML rendering.

    The PR also includes a fix for a warning, a simplified way to download the spec tests examples, and an improvement to the spec test runner.

    opened by anderskaplan 6
  • Mistletoe hangs when parsing some specifically formatted Footnotes

    >>> import mistletoe
    >>> input = "foo bar [1]:\r\nfoo bar\r\n\r\n[1]: https://example.org/\r\nhttps://example.org"
    >>> mistletoe.markdown(input)
    

    This never returns, or at least does not return within the limits of my patience.

    bug has-workaround 
    opened by ddevault 6
  • Document side-effects of renderers' initialisation

    Hello, this is possibly an issue concerning the doc and not the code.

    • Parsing outside of the renderer's context manager:
    d = Document('a <b> c')
    with HTMLRenderer() as r:
        print(r.render(d))  # <p>a &lt;b&gt; c</p>
    
    • Parsing inside of the renderer's context manager:
    with HTMLRenderer() as r:
        d = Document('a <b> c')
        print(r.render(d))  # <p>a <b> c</p>
    

    Not sure where the difference in output comes from. CommonMark asks for the second output though, which seems to be what is performed in mistletoe.markdown and by the mistletoe command line.


    $ python -V
    Python 3.7.0
    $ pip freeze
    mistletoe==0.7.1
    
    documentation 
    opened by Rogdham 6
  • FootnoteLink removing trailing spaces

    First of all, this is a great project and incidentally it offers the only way to have a decent Markdown to Jira converter. I encountered one tiny bug in the converter:

    Assume the following markdown document:

    Test [link] will remove space.
    
    [link]: http://www.nullteilerfrei.de/
    

    Then the output of md2jira will be the following:

    Test [link|http://www.nullteilerfrei.de/]will remove space.
    

    As you can see, there should be a space right after the link.

    bug 
    opened by huettenhain 6
  • fix: Make filtering children by class in traverse() actually work

    The traverse() utility function was previously written to filter children by class using issubclass(child, klass). But the first argument to issubclass() must be a class, so this will always raise an exception when trying to use the functionality. This patch corrects the call to isinstance(child, klass) and adds a test.

    bug 
    opened by asb 5
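
The difference between the two built-ins is easy to reproduce with the standard library alone. The classes below are placeholders for this sketch, not mistletoe's token classes.

```python
class Token:
    pass

class Paragraph(Token):
    pass

child = Paragraph()  # a tree walk sees token instances, not classes

# issubclass() rejects an instance as its first argument.
try:
    issubclass(child, Token)
    issubclass_failed = False
except TypeError as error:
    issubclass_failed = True
    print("issubclass failed:", error)

# isinstance() is the appropriate check when filtering instances by class.
print(isinstance(child, Token))
print(isinstance(child, Paragraph))
```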
  • Fix for #89, List Items separated by tab character not parsed correctly.

    Also fixes failing examples 312 and 313 in the CommonMark 0.30 spec. The problem was that tabs were expanded to four spaces, not to tab stops as specified in the spec. The solution was to expand tabs at the beginning of lines. However, this meant that plain string indexing could no longer be used. The extraction of line content was therefore moved into the parse_marker and parse_continuation methods.

    opened by anderskaplan 0
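
The difference between the two tab-handling strategies shows up on a list item whose marker is followed by a tab. The sketch uses Python's built-in str.expandtabs to model tab stops of width 4; it illustrates the rule only and is not the parser code from this fix.

```python
line = "-\tfoo"

# Naive approach: every tab becomes exactly four spaces.
naive = line.replace("\t", "    ")

# Tab stops: a tab advances to the next column that is a multiple of four.
tab_stops = line.expandtabs(4)

print(repr(naive))      # content starts at column 5
print(repr(tab_stops))  # content starts at column 4
```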
  • Inconsistencies in the block tokens

    There are some inconsistencies among the block tokens that maybe should be fixed before stepping up to version 1.0:

    1. Trailing newlines are sometimes preserved and sometimes not. CodeFence and BlockCode preserve them; Paragraph and HTMLBlock do not.
    2. CodeFence and BlockCode keep their content in a single RawText child node, whereas the HTMLBlock keeps it in the content property. In fact, the HTMLBlock is the only block token to have a content property. This is typically used with span tokens.

    So what to do about it?

    My suggestion would be to remove the trailing newlines from all block tokens. The other consistent option, to keep them for all block tokens, would add a trailing LineBreak to all Paragraph's, and that would just be a pain. Of course there's also the option to leave it as it is.

    I would also suggest to place the HTMLBlock content in a single RawText node, so it would be consistent with the other block tokens. Maybe keep its content property, too, in order to not break the API. The content property could be turned into a property getter and marked as deprecated.

    Thoughts?

    opened by anderskaplan 0
  • DRAFT markdown renderer

    This is a DRAFT pull request for a markdown-to-markdown renderer (#4). It's not intended to be merged like this; the reason for creating the pull request is to get some feedback on the design and to let people try it out.

    NOTE: the PR builds on PR #160, which is still pending review.

    A few notes on the implementation:

    1. It's all in a single commit for now. It will be split up later.

    2. Since the aim is to preserve as much of the formatting as possible, and also steer clear of issues like how to make sure the sequence _**Hello**_ isn't rendered as ***Hello*** (which changes its interpretation), I had to save some formatting information from the parsing.

      This extra information may or may not be interesting for users of the AST. So, for now I have used attribute names starting with tag for them and excluded them in the AST renderer.

    3. HTML blocks were stripped of their trailing endline, unlike all other block tokens. I changed that behavior for symmetry. This change might be better off as a separate PR. (UPDATE: Added an issue to discuss this topic.)

    4. The test cases in TestMarkdownRenderer give a pretty good idea of what the renderer is capable of. In particular that it can handle nested blocks. You can put a code block inside a list inside a block quote if you like. (This is explicitly allowed by the CommonMark spec.)

    5. Three new block tokens! Well actually, only one which is completely new and that's the BlankLine. The other two represent link reference definitions (a.k.a. footnotes) and inherit from the Footnote token. These tokens are used to represent content that is otherwise not included in the AST.

      An open question is whether they belong in the same module as the markdown renderer class, or if they should be moved somewhere else.

    6. The markdown renderer handles span tokens and block tokens differently. Block tokens are rendered into chunks of lines. Span tokens are first rendered into sequences of strings and line breaks, and then into chunks of lines. This may look complicated at first, but it makes handling of nested blocks sooo much easier.

    7. Tables are not supported yet. That's work in progress. (UPDATE: Now they are!)

    8. The markdown renderer has a main method. It can be convenient, since the renderers in contrib aren't easily run from the regular mistletoe cli. Not sure if it should be kept. (Or whether the markdown renderer should stay in the contrib directory or be part of core?)

    Hope this can be useful and looking forward to some review comments.

    opened by anderskaplan 1
  • 💡 implement link titles in the JIRARenderer (contrib)

    Sample Markdown:

    Sample: [inner](https://target "title")
    

    In Linux, test with:

    mistletoe -r contrib.jira_renderer.JIRARenderer <(echo 'Sample: [inner](https://target "title")')
    

    Jira result, before this PR changes:

    Sample: [inner|https://target]
    

    Jira result, after this PR changes:

    Sample: [inner|https://target|title]
    
    opened by franferrax 3
  • pikchr integration?

    Hi, I'm interested in contributing a Pikchr integration. Pikchr is written in C, so it looks like I'll need to call C from Python (pikchr integration docs). I don't work much with either python or C, but based on 10 minutes of research it looks like the easiest way to call C from Python is to use something like SWIG. This looks fairly easy, but I think it will involve changing the build process and I'm not sure about the implications in other areas (especially publishing to PyPi).

    Please let me know:

    1. if the pikchr integration seems feasible and desirable
    2. is this the easiest way to accomplish it, or is there a better way?
    feature 
    opened by ethanmdavidson 1
  • Record line numbers on tokens

    I'm writing a tool to parse markdown files and verify that links are valid. To be able to provide the most valuable, accurate feedback, I need to be able to display line numbers to the user, so that they know exactly where in a file the broken link is.

    Right now this isn't possible, because line numbers aren't recorded on token objects.

    feature 
    opened by djmattyg007 1
Releases(v0.9.0)
  • v0.9.0(Aug 18, 2022)

    WARNING - Backwards compatibility changes:

    • Python versions below 3.5 are no longer supported (Python 3.6 end-of-life: December 2021)
      • html module (available since Python 3.4) is no longer included
    • As unescaping of HTML character references (entities) is now correctly done in parsing phase already, prospective custom renderers should be altered accordingly provided they do the unescaping themselves now.
    • HTMLRenderer: single quote is no longer rendered as ', but as &#x27; (see #115; let us know if you would need the old behavior)
    • BaseRenderer.__getattr__() is removed and replaced by explicit render_*() methods definitions for clearer API (#133)

    Added:

    • Add __repr__() methods to all token classes (#140)
    • Add type hints for HTMLRenderer methods (#133; supported since Python 3.0)

    Fixed:

    • Correctly unescape HTML character references (entities) for LaTeXRenderer - refactored globally (#135)
    • Ensure LaTeX renderer uses valid \verb delimiter - not always just | (#149)
    • GithubWiki unit test failing when run via pytest (#142)

    Others:

    • Simplify implementation of escaping special HTML characters (#135)
    • Remove unused imports and variables (#146)
    • Document (for maintainers) how to create a new mistletoe release
    • Fix and extend docstring documentation of various token types (#154)
    • This version is about 0.5% faster according to the benchmark test. :)
    mistletoe-0.9.0-py3-none-any.whl(29.27 KB)
  • v0.8.2(Feb 9, 2022)

  • v0.8.1(Dec 18, 2021)

  • v0.8.0(Oct 9, 2021)

    Added:

    • Support escaped pipes in table cells (#85)
    • traverse() function, to recursively yield children of a token (breadth-first traverse) (#94)
    • XWiki20Renderer - supports XWiki syntax 2.0 (#113)
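
    To show what a breadth-first traverse means here, the sketch below walks a small tree level by level. Node and traverse_breadth_first are hypothetical stand-ins written for this example; they are not mistletoe's token classes or its traverse() function, whose signature may differ.

```python
from collections import deque


class Node:
    """Hypothetical stand-in for a token: a name plus a list of children."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def traverse_breadth_first(root):
    """Yield (node, depth) for every descendant of root, level by level."""
    queue = deque((child, 1) for child in root.children)
    while queue:
        node, depth = queue.popleft()
        yield node, depth
        # Children are queued behind everything already waiting at this level.
        queue.extend((child, depth + 1) for child in node.children)


doc = Node("Document", [
    Node("Paragraph", [Node("Link"), Node("RawText")]),
    Node("List", [Node("ListItem")]),
])
visited = [(n.name, d) for n, d in traverse_breadth_first(doc)]
print(visited)
# [('Paragraph', 1), ('List', 1), ('Link', 2), ('RawText', 2), ('ListItem', 2)]
```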

    Fixed:

    • JIRARenderer is basically ready for real life scenarios now
      • Fixed output of empty lines in lists and others (#100)
      • Don't HTML-escape special chars (#100)
      • Fixed output of table headers (#105)
      • Escape special Jira chars (#111)
      • Fixed output of empty cells (#109; see JRASERVER-70048)
    • Read and write to files / console in UTF-8, so that UnicodeDecodeError-s and UnicodeEncodeError-s are avoided (#100)
    • Various Markdown parsing problems (#86, #91)
    • Removed over-escaping of URLs in HTML and Jira renderers (#102)
    • TOCRenderer: The resulting toc property is properly generated (#88)
    • LaTeXRenderer: Escape underscores and percentages + don't escape in inline code (#93 / #112)
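
    The UTF-8 item above comes down to passing an explicit encoding instead of relying on the platform default. A minimal sketch using only the standard library (the file name foo.md is just an example):

```python
import os
import tempfile

text = "# Überschrift\n\nnaïve café ❄\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foo.md")
    # An explicit encoding avoids depending on the platform's default codec.
    with open(path, "w", encoding="utf-8") as fout:
        fout.write(text)
    with open(path, "r", encoding="utf-8") as fin:
        lines = fin.readlines()

print(len(lines))  # 3
```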

    Testing:

    • Don't limit diffs from assertEquals, so that all differences are visible (#100)
    • Introduced filesBasedTest decorator for simple tests via conventionally named test files (#100)
  • v0.7.2(Jun 8, 2019)

    Fixed:

    • Fixed incorrect handling of loose list (#54, #65, thanks @Rogdham and @Vallentin)
    • Fixed FileWrapper backstep after StopIteration (#58, thanks @Rogdham)
    • Allow more than one level of token subclass (#62, thanks @Rogdham)
    • Tables can handle rows with missing columns (#67, thanks @Grollicus)
    • Fixed unresolved reference (#73, thanks @Vallentin)
    • Fixed EOL markers in LaTeX tables (#79, thanks @liuq)

    Testing:

    • Add Python 3.7 to integration testing (#63, thanks @nikolas)
    mistletoe-0.7.2-py3-none-any.whl(27.92 KB)
  • v0.7.1(Jun 25, 2018)

  • v0.7(Jun 11, 2018)

    Warning: this is a release that breaks backwards compatibility in non-trivial ways (hopefully for the last time!) Read the full release notes if you are updating from a previous version.

    Features:

    • all tests passing in CommonMark test suite (finally! :tada:)
    • allow specifying span token precedence levels;
    • new and shiny span_tokenizer.tokenize.

    Fixed:

    • well, all the CommonMark test cases..
    • ASTRenderer crashes on tables with headers (#48, thanks @timfox456!)

    Where I break backwards compatibility:

    Previously, span-level tokens needed to have their children attribute specified manually. This is no longer the case: the children attribute is now set automatically based on the class variable parse_group, which corresponds to the regex match group in which child tokens might occur.

    As an example, GithubWiki was previously implemented like this:

    from mistletoe.span_token import SpanToken, tokenize_inner
    import re
    
    class GithubWiki(SpanToken):
        pattern = re.compile(r'...')
        def __init__(self, match_obj):
            super().__init__(match_obj) 
            # alternatively, self.children = tokenize_inner(match_obj.group(1))
            self.target = match_obj.group(2)
    

    Now we can write:

    from mistletoe.span_token import SpanToken
    import re
    
    class GithubWiki(SpanToken):
        pattern = re.compile(r'...')
        parse_inner = True  # default value, can be omitted
        parse_group = 1  # default value, can be omitted
        precedence = 5  # default value, can be omitted
        def __init__(self, match_obj):
            self.target = match_obj.group(2)
    

    If we have a span token that does not need further parsing, we can write:

    class Foo(SpanToken):
        pattern = re.compile(r'(foo)')
        parse_inner = False
        def __init__(self, match_obj):
            self.content = match_obj.group(1)
    

    See the readme for more details.
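
    To make the role of the match groups concrete, here is a sketch using only the re module. The pattern is a hypothetical wiki-link pattern chosen for illustration (the notes above elide the real one), and the comments describe which group plays which role under the defaults shown earlier.

```python
import re

# Hypothetical pattern for illustration only: [[ link text | target ]]
pattern = re.compile(r"\[\[ *(.+?) *\| *(.+?) *\]\]")

match_obj = pattern.search("See [[the *wiki* page|Home]] for details.")

# Group 1 is what parse_group = 1 would hand on for further span parsing.
inner = match_obj.group(1)
# Group 2 is stored as-is, like self.target in the class above.
target = match_obj.group(2)

print(inner)   # the *wiki* page
print(target)  # Home
```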

    mistletoe-0.7-py3-none-any.whl(26.95 KB)
  • v0.6.2(May 27, 2018)

    Features:

    • CommonMark compliant CodeFence;
    • CommonMark compliant BlockCode;
    • CommonMark compliant HTMLBlock;
    • CommonMark compliant HTMLSpan;
    • CommonMark compliant AutoLink;
    • CommonMark compliant InlineCode;
    • CommonMark compliant Heading;
    • CommonMark compliant SetextHeading;
    • added span-level token LineBreak;
    • better handling of lazy-continuation in Quote;
    • Footnotes can be defined in any block-level containers.

    Fixes:

    • loose lists conform to CommonMark spec (#44, thanks @huettenhain);
    • not parsing sub-lists deeper than two levels (#46, thanks @daerhu);
    • FileWrapper._index should not go below -1.

    Development:

    • refactored handling of SetextHeading;
    • removed block_tokenizer.MismatchException;
    • removed _children attribute, using children directly; (potentially breaking change?)
    • renamed Separator to ThematicBreak;
    • renamed FootnoteBlock to Footnote;
    • tokenize and tokenize_inner return lists of tokens;
    • refactored CommonMark testing script.
    mistletoe-0.6.2-py3-none-any.whl(20.31 KB)
  • v0.6.1(May 13, 2018)

    Features:

    • CommonMark compliant CodeFence (#41);
    • allow multiple backticks for InlineCode;
    • strips whitespace around InlineCode;

    Fixed:

    • Separator needs at least three characters;
    • indented code blocks should not interrupt paragraphs (#40, thanks @joncass);
    • crashes when sublists have different marker type (#42, thanks @JBartlett86);
    • typo in Paragraph.read (#43, thanks @NatTupper);
    • preliminary fixes for handling loose lists (#44, thanks @huettenhain);
    • removed corrupted block_token.until function;
    • html code language tags start with "language-".
    mistletoe-0.6.1-py3-none-any.whl(19.43 KB)
  • v0.6(May 2, 2018)

    Features:

    • added Pygments renderer to contrib (#35, thanks to @Bridouz);
    • HTMLSpan now supports comments (#37);
    • (more or less) Commonmark compliant List implementation (#40).

    Fixes:

    • changed logo to an actual mistletoe (#21, thanks to @liuq);
    • allow lists after block tokens without newlines (#34, thanks to @huettenhain);
    • recognize headings within paragraphs (#36);
    • disallow opening space in html tag (#37).

    Performance:

    • removed FileWrapper.normalize;
    • utilized universal newline mode.

    Breaking changes:

    • BlockToken.start does not advance file iterator.

    Special shout-out to @joncass for raising the unattributed issues above, and giving me the motivation to finally fix the list implementation!

    Note that this is a release with major changes. If you notice any rough edges (as there will certainly be), please do not hesitate to open an issue.

    mistletoe-0.6-py3-none-any.whl(19.25 KB)
  • v0.5.5(Apr 15, 2018)

    Features:

    • added default render methods for all tokens;
    • added reset_tokens function to block_token and span_token;
    • allowed BlockToken.read to return any iterable;
    • BaseRenderer is now available at mistletoe.BaseRenderer;
    • added Scheme.

    Fixes:

    • throw better AttributeError when accessing RawText.children (#31, thanks @jabdoa2);
    • disallow whitespace in span_token.Link (#32, thanks @DMRobertson);
    • allowed empty alt text in Image and FootnoteImage (#33, thanks @joncass).
    mistletoe-0.5.5-py3-none-any.whl(19.34 KB)
  • v0.5.4(Mar 27, 2018)

    Features:

    • md2jira: read from stdin if no input file is given (#27, thanks @alexkolson!);
    • better command line options and help messages;
    • auto-splitlines when mistletoe.markdown is given a string;
    • inline tokens can span multiple lines (#30, thanks @duckwork!).
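
    The auto-splitlines item can be sketched as a small normalisation step: accept either a string or an iterable of lines and always work with lines afterwards. The helper name to_lines is hypothetical and is not part of mistletoe's API.

```python
def to_lines(source):
    """Accept either a string or an iterable of lines; return a list of lines."""
    if isinstance(source, str):
        # keepends=True preserves the newline at the end of each line.
        return source.splitlines(keepends=True)
    return list(source)


print(to_lines("# Title\n\nSome text.\n"))
# ['# Title\n', '\n', 'Some text.\n']
print(to_lines(["# Title\n", "\n"]))
# ['# Title\n', '\n']
```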

    Fixes:

    • TableRow now supports table shorthand (#29, thanks @huettenhain!);
    • normalize line breaks.

    ... plus various refactors and documentation improvements.

    mistletoe-0.5.4-py3-none-any.whl(19.80 KB)
  • v0.5.3(Feb 5, 2018)

    Features:

    • shortened mistletoe.markdown keyword argument name (renderer_cls to renderer);
    • removed List reference lookup;
    • list items can contain paragraphs (CM5.2);
    • shorthand syntax added for tables (#26).

    Fixed:

    • ignored invisible characters at line end for CodeFence (#24);
    • fixed extra newlines for headings in JIRARenderer (#25, thanks @huettenhain!);

    Development:

    • moved documentation to docs directory;
    • solved the biggest mystery in the codebase.
    mistletoe-0.5.3-py3-none-any.whl(19.78 KB)
  • v0.5.2(Jan 30, 2018)

  • v0.5.1(Jan 24, 2018)

    Features:

    • added JIRA Markdown support (thanks to @cctile);
    • Strong / Emphasis elements must open with non-whitespace characters;
    • no more than 6 levels of Heading;

    Fixed:

    • render_table crashing when iterating token.children (#12);
    • FootnoteLink engulfing trailing spaces (#14);
    • Paragraph.read not stopping before CodeFence (#15);

    Development:

    • added testing for CommonMark compliance;
    • merged plugins directory into contrib (thanks to @huettenhain);

    Lastly, I miss cheeseburgers. 🍔

    mistletoe-0.5.1-py3-none-any.whl(19.80 KB)
  • v0.5(Jan 9, 2018)

    Features:

    • BlockToken is a hell lot more flexible now;
    • add_token accepts an additional position argument;
    • Newlines are now preserved in Paragraph tokens.

    Fixed:

    • ASTRenderer fails to serialize FootnoteAnchor.

    Where I broke backwards compatibility:

    • BlockToken now has start and read methods, instead of match method. This allows for much more granular control of parsing when defining custom block-level tokens.
    • Heading and SetextHeading are now different token classes, though their renderer functions are still the same.
    • CodeFence and BlockCode are now different token classes, though their renderer functions are still the same.
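
    The start/read split can be illustrated with a toy tokenizer: start decides from a single line whether a token begins, and read consumes the lines that belong to it. FencedNote and tokenize below are hypothetical and simplified; mistletoe's actual method signatures may differ.

```python
class FencedNote:
    """Hypothetical block token: the lines between two ':::' markers."""

    @staticmethod
    def start(line):
        # Decide, from the first line alone, whether this token begins here.
        return line.strip() == ":::"

    @staticmethod
    def read(lines):
        # Consume lines from the shared iterator up to the closing marker.
        body = []
        for line in lines:
            if line.strip() == ":::":
                break
            body.append(line)
        return body


def tokenize(lines, token_types):
    """Tiny driver loop: ask each type whether it starts, then let it read."""
    lines = iter(lines)
    tokens = []
    for line in lines:
        for token_type in token_types:
            if token_type.start(line):
                tokens.append((token_type.__name__, token_type.read(lines)))
                break
        else:
            # No token type claimed this line, so keep it as plain text.
            tokens.append(("Text", [line]))
    return tokens


source = ["intro\n", ":::\n", "note body\n", ":::\n", "outro\n"]
tokens = tokenize(source, [FencedNote])
print(tokens)
# [('Text', ['intro\n']), ('FencedNote', ['note body\n']), ('Text', ['outro\n'])]
```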

    What has been in my life for the past few weeks:

    ❄️

    mistletoe-0.5-py3-none-any.whl(19.56 KB)
  • v0.4.1(Dec 25, 2017)

  • v0.4(Nov 18, 2017)

  • v0.3.1(Sep 1, 2017)

    Features:

    • auto-closes unclosed code fences;
    • adds support for "[footnote]"-style links;
    • interactive mode adds keyboard control support;
    • accepts multiple filenames from the command line.

    Fixed:

    • render_image function missing argument;
    • mistletoe crashes with empty list items;
    • removes redundant whitespace for empty lines in code fences;
    • fixed performance issues on PyPy: very, very fast now.
    mistletoe-0.3.1-py3-none-any.whl(18.36 KB)
  • v0.3(Aug 27, 2017)

  • v0.2.1(Aug 14, 2017)

  • v0.2(Aug 7, 2017)

    Features:

    • added support for footnote-style images and links;
    • added support for LaTeX renderer;
    • renderer classes are now context managers (see README).
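
    A renderer that is also a context manager sets up its state on entry and undoes it on exit. The class below is a hypothetical sketch of that pattern only; it does not reproduce any of mistletoe's renderer classes.

```python
class UpperCaseRenderer:
    """Hypothetical renderer used as a context manager."""

    active = []  # stands in for whatever shared state a renderer sets up

    def __enter__(self):
        UpperCaseRenderer.active.append(self)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Always undo the setup, even if rendering raised an exception.
        UpperCaseRenderer.active.pop()
        return False

    def render(self, text):
        return text.upper()


with UpperCaseRenderer() as renderer:
    rendered = renderer.render("hello")
    inside = len(UpperCaseRenderer.active)

print(rendered)                       # HELLO
print(inside)                         # 1
print(len(UpperCaseRenderer.active))  # 0
```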

    Development:

    • added test suite for LaTeX renderer;
    • added benchmarking script for performance comparison;
    • added scripts to compare render output across commits;
    • added CI testing for versions down to Python 3.3.

    Fixed:

    • a bunch of regex craziness;
    • outdated documentation.
    mistletoe-0.2-py3-none-any.whl(17.12 KB)
  • v0.1.1(Jul 26, 2017)

    This release is mainly to celebrate that I shaved. Other than that:

    Block-level token support:

    • heading (ATX and setext);
    • quote;
    • paragraph;
    • block code (code fence and indented code);
    • lists and nested lists;
    • tables;
    • horizontal rule;

    Span-level token support:

    • strong (with asterisks or underscore);
    • emphasis (with asterisks or underscore);
    • inline code;
    • strikethrough;
    • images (inline link only);
    • links (inline link only) and autolinks;

    Output format support:

    • render to HTML;
    • render to mdast-like AST;
    • render to really janky LaTeX;

    Lastly, hello world!

    mistletoe-0.1.1-py3-none-any.whl(24.67 KB)
Owner
Mi Yu
A fast yet powerful Python Markdown parser with renderers and plugins.

Mistune v2 A fast yet powerful Python Markdown parser with renderers and plugins. NOTE: This is the re-designed v2 of mistune. Check v1 branch for ear

Hsiaoming Yang 2.1k Sep 30, 2022
Provides syntax for Python-Markdown which allows for the inclusion of the contents of other Markdown documents.

Markdown-Include This is an extension to Python-Markdown which provides an "include" function, similar to that found in LaTeX (and also the C pre-proc

Chris MacMackin 83 Sep 28, 2022
Mdformat is an opinionated Markdown formatter that can be used to enforce a consistent style in Markdown files

Mdformat is an opinionated Markdown formatter that can be used to enforce a consistent style in Markdown files. Mdformat is a Unix-style command-line tool as well as a Python library.

Executable Books 151 Sep 22, 2022
Markdown parser, done right. 100% CommonMark support, extensions, syntax plugins & high speed. Now in Python!

markdown-it-py Markdown parser done right. Follows the CommonMark spec for baseline parsing Configurable syntax: you can add new rules and even replac

Executable Books 366 Sep 24, 2022
Pure-python-server - A blogging platform written in pure python for developer to share their coding knowledge

Pure Python web server - PyProject A blogging platform written in pure python (n

Srikar Koushik Satya Viswanadha 10 Jul 23, 2022
markdown2: A fast and complete implementation of Markdown in Python

Markdown is a light text markup format and a processor to convert that to HTML. The originator describes it as follows: Markdown is a text-to-HTML con

Trent Mick 2.4k Sep 20, 2022
A lightweight and fast-to-use Markdown document generator based on Python

A lightweight and fast-to-use Markdown document generator based on Python

快乐的老鼠宝宝 1 Jan 10, 2022
CiteURL is an extensible tool that parses legal citations and makes links to websites where you can read the cited language for free.

CiteURL is an extensible tool that parses legal citations and makes links to websites where you can read the cited language for free. It can be used t

null 12 Sep 5, 2022
Static site generator that supports Markdown and reST syntax. Powered by Python.

Pelican Pelican is a static site generator, written in Python. Write content in reStructuredText or Markdown using your editor of choice Includes a si

Pelican dev team 11.2k Sep 30, 2022
A Python implementation of John Gruber’s Markdown with Extension support.

Python-Markdown This is a Python implementation of John Gruber's Markdown. It is almost completely compliant with the reference implementation, though

Python-Markdown 3k Sep 27, 2022
Extensions for Python Markdown

PyMdown Extensions Extensions for Python Markdown. Documentation Extension documentation is found here: https://facelessuser.github.io/pymdown-extensi

Isaac Muse 641 Sep 27, 2022
Lightweight Markdown dialect for Python desktop apps

Litemark is a lightweight Markdown dialect originally created to be the markup language for the Codegame Platform project. When you run litemark from the command line interface without any arguments, the Litemark Viewer opens and displays the rendered demo.

null 10 Apr 23, 2022
A markdown template manager for writing API docs in python.

DocsGen-py A markdown template manager for writing API docs in python. Contents Usage API Reference Usage You can install the latest commit of this re

Ethan Evans 1 May 10, 2022
Livemark is a static page generator that extends Markdown with interactive charts, tables, and more.

Livermark This software is in the early stages and is not well-tested Livemark is a static site generator that extends Markdown with interactive chart

Frictionless Data 81 Aug 31, 2022
Read a list in markdown and do something with it!

Markdown List Reader A simple tool for reading lists in markdown. Usage Begin by running the mdr.py file and input either a markdown string with the -

Esteban Garcia 3 Sep 13, 2021
Yuque2md - Offline download the markdown file and image from yuque

yuque2md 按照语雀知识库里的目录,导出语雀知识库中所有的markdown文档,并离线图片到本地 使用 安装 Python3.x clone 项目 下载依

JiaJianHuang 3 Apr 17, 2022
Convert HTML to Markdown-formatted text.

html2text html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to

Alireza Savand 1.3k Sep 22, 2022
Comprehensive Markdown plugin built for Django

Django MarkdownX Django MarkdownX is a comprehensive Markdown plugin built for Django, the renowned high-level Python web framework, with flexibility,

neutronX 735 Sep 23, 2022