The short version:
As part of version 3.0 (see #391) should Python-Markdown perhaps abandon ElementTree for a different document object like Docutils' node tree or use a modified ElementTree for internally representing the Parsed HTML document?
Any and all feedback is welcome.
The long version:
Starting in Python-Markdown version 2.0, internally parsed documents have been represented as ElementTree objects. While this mostly works, there are a few irritations. ElementTree (hereinafter ET) was designed for XML, not HTML and therefore a few of its design choices are less than ideal when working with HTML.
For example, by design, XML does not generally have text and child nodes interspersed like HTML does. While ET provides text
and tail
attributes on each element, it is not as easy to work with as it would be if the text was contained in child "TextNodes" (much like JavaScript's DOM). Additionally, ET nodes have no knowledge of their parent(s), which can be a problem in certain HTML specific situations (some elements cannot contain other elements as children or grandchildren or great-grandchildren...).
I see two possible workarounds to this: Modify ET or use a different type of object.
Modifying ElementTree
We already have a modified serializer which gives us better HTML output (it is actually a modified HTML serializer from ET) and we already import ET and document that all extensions should import ET from Markdown. Therefore, if we were to change anything (via subclasses, etc) those changes would propagate throughout all extensions without too much change.
In fact, some time ago, I played around with the idea of making ET nodes aware of their parents. While it worked, I quickly abandoned it as I realized that it would not work for cElementTree. However, on further consideration, we don't really need cElementTree (most of the benefits are in a faster XML parser which we don't use).
Interestingly, in Python 3.3 cElementTree is deprecated. What actually happens is that ET defines the Python implementation and then at the bottom of the module, it tries to import the C implementation, which upon success, overrides the Python objects of the same name. What is interesting about this is that the Python implementation of the Element
class (ET's node object) is preserved as _Element_Py
for external code which needs access to it (as explained in the comments).
I envision a modified ET lib to basically subclass the Python Element
object to enforce knowledge of parents for all nodes. Then a TextNode
would be created which works essentially like Comments work now:
def TextElement(text=None):
element = Element(TextElement)
element.text = text
return element
The serializer would then be updated to properly output TextElement
s. In fact, at some point, the serializer might even be able to loose knowledge of the text
and tail
attributes on regular nodes. However, that last bit could wait for all extensions to adopt the new stuff.
In addition to TextElement
we could also have RawTextElement
and AtomicTextElement
. Both would be ignored by the parser (no additional parsing would take place). However, a RawTextElement
would be given special treatment by the serializer in that no escaping would take place (raw HTML could be stored inline in the document rather than in a separate store with placeholders in the document), whereas an AtomicTextElement
would be serialized like a regular TextElement
.
The advantage of an AtomicTextElement
(over the existing AtomicString) is that a single node could have multiple child text nodes. Today, each node only gets one text
attribute. Therefore, when a AtomicString is concatenated with an existing text
string, we lose the 'atomic' quality of the sub-string. However, with this change each sub-string can reside in its own separate text node and maintain the 'atomic' quality when necessary.
Using Docutils
Rather that creating our own one-off hacked version of ET, we could instead use an already existing library which gives us all of the same features (and more). Today, the only widely supported and stable library I'm aware of is Docutils' Document Tree. While the Document Tree is described as an XML representation of a document, Docutils provides a Python API to work with the Document Tree which is very similar to the modified ET API I described above (known parents, TextElement, FixedTextElement...). Unfortunately that API is not documented. Although, the the source code is easy enough to follow.
Until recently, I was of the assumption that to implement something that used Docutils, one would need to define a bunch of directives (etc) which more-or-less modify the ReST parser. However, take a look at the Overview of the Docutils Architecture. A parser simply needs to create a node tree. In fact, the base Parser class is only a few simple methods. The entire directives thing is in a separate directory under the ReST Parser only. Theoretically, one could subclass the base Parser class, and build a node tree using whatever parsing method desired and Docutils wouldn't care.
For that matter, Python-Markdown would not have to replicate Docutils "Parser" API. We could just use the node tree internally. As a plus, this would give us access to all of the built-in and third party Docutils writers (serializers). In other words, we would get all of Docutils output formats for free.
Additionally, Docutils' node tree also provides for various meta-data to be stored within the node tree. For example, each node can contain the line and column at which its contents were found in the original source document. This provides an easy way for a spellchecker to run against the parser and report misspelled words in the document without first converting it to HTML, among other uses which do not require serialized output.
No, this would not make Python-Markdown suddenly able to be supported by Sphinx. Sphinx is mostly a collection of custom directives built on top of the ReST parser. ReST directives do not make sense in Markdown. However, we could convert Markdown to ReST as many other third party parsers convert various formats to ReST via a ReST writer. There is also at least one third party writer which outputs Markdown from a node tree. By adopting Docutils node tree, Python-Markdown could become part of an ecosystem for converting between all sorts of various document formats (an expandable competitor to Pandoc?).
The downsides to using Docutils are that we are then relying on a third party library (up till now, Python-Markdown has not) and all extensions would absolutely be forced to change to support the new version. It is also possible that we wouldn't be able to use the available HTML writer as the default because of some inherent differences with Markdown and ReST (ReST is much more verbose and we might need to hack the node tree or the writer to get the writer to output correct HTML from a Markdown perspective -- I have not investigated this).
As it stands now, there are various small changes required of extensions between version 2 and 3, but I expect that most extensions would be able to support both without much effort. If we went with Docutils, that would no longer be the case.
Or, maybe this whole thing is a bad idea and we should just continue to use ET as-is.
Any and all feedback is welcome.
feature core needs-decision