Show HN: JustHTML – A pure Python HTML5 parser that just works
github.com
I got frustrated with HTML parsing in Python.
I wanted a Python HTML parser that was both correct and easy to install. The C-based ones (lxml, selectolax) are fast but not HTML5 compliant. The pure Python ones (html.parser, BeautifulSoup's default) are easy to install but choke on real-world HTML. html5lib is spec-compliant but painfully slow.
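To make the html.parser complaint concrete, here is a minimal sketch using only the standard library. The `EventLogger` class and `log_events` helper are hypothetical names made up for this illustration. The stdlib parser reports tags as it sees them and does no HTML5 tree construction, so for an unclosed `<p>` it never emits the implied end tag that a browser would infer:

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records the raw events html.parser emits."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def log_events(markup):
    """Feed markup to the stdlib parser and return the recorded events."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return parser.events


# Two <p> tags, neither closed. In HTML5 the second <p> implicitly
# closes the first; html.parser just reports two start tags.
events = log_events("<p>one<p>two")
for event in events:
    print(event)
```

Any code built on these events has to implement the implied-end-tag rules itself, which is the part of the HTML5 spec a compliant parser handles for you.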
So I wrote JustHTML. It's:
• 100% HTML5 compliant – passes all 8,500+ html5lib tests. If a browser can parse it, JustHTML can.
• Pure Python, zero dependencies – pip install and go. Works on PyPy, Pyodide, anywhere.
• Fast enough – ~0.1s to parse Wikipedia's homepage. Not C-fast, but 50% faster than html5lib.
• Simple API – doc.query("div.foo > p") with CSS selectors. One method to learn.
Example:
from justhtml import JustHTML
doc = JustHTML("<div><p class='intro'>Hello!</p></div>")
print(doc.query(".intro")[0].to_html())
I've fuzz-tested it with 3 million malformed documents.
Would love feedback, especially on the API design.