Skip to main content

HTML Parser Internals

The wgpu-html parser converts raw HTML strings into typed Tree<Node<Element>> structures. It operates in two phases: tokenization and tree building.

Architecture

"<div class='card'>Hello</div>"

▼ tokenizer::tokenize()
[OpenTag("div", [("class","card")], false), Text("Hello"), CloseTag("div")]

▼ tree_builder::build()
Tree { root: Some(Node::new(Div { class: "card", .. }).with_children([
Node::new(Element::Text("Hello"))
])) }

Tokenizer

The tokenizer (wgpu-html-parser/src/tokenizer.rs) scans the input character-by-character and emits a flat list of Token values:

pub enum Token {
Doctype(String),
OpenTag {
name: String,
attrs: Vec<(String, String)>,
self_closing: bool,
},
CloseTag(String),
Text(String),
Comment(String),
}

Tag Recognition

  • <tag> — Open tag with optional attributes
  • </tag> — Close tag
  • <tag/> — Self-closing tag (sets self_closing: true)
  • <!-- ... --> — Comment (tokenized, discarded by tree builder)
  • <!doctype ...> — DOCTYPE (tokenized, discarded by tree builder)

Attribute Parsing

Attributes are parsed inside open and self-closing tags. Three forms are supported:

<!-- Quoted attributes -->
<input type="text" value="hello">

<!-- Unquoted attributes -->
<input type=text>

<!-- Boolean attributes (value = empty string) -->
<input disabled required>

Attribute values undergo entity decoding (see below).

Raw-Text Elements

For <style>, <script>, <textarea>, and <title>, the tokenizer captures everything between the open and close tags as a single Text token — no further tokenization occurs inside:

<style>
/* This entire block is one Text token */
.card { color: red; }
</style>

<textarea>
Line 1
Line 2
This is <strong>not a tag</strong> — it's all text
</textarea>

Entity Decoding

The tokenizer decodes HTML entities in text content and attribute values:

EntityDecoded
&amp;amp;&
&amp;lt;<
&amp;gt;>
&amp;quot;"
&amp;apos;'
&amp;nbsp;\u{00A0} (non-breaking space)
&amp;#NN;Unicode codepoint NN (decimal)
&amp;#xNN;Unicode codepoint NN (hex)
<p>Hello &amp;amp; welcome &amp;mdash; click &amp;lt;here&amp;gt;</p>

Result: Hello & welcome — click <here>

Other named entities beyond &amp;, &lt;, &gt;, &quot;, &apos;, &nbsp; are not decoded — the parser recognizes only these five named entities plus numeric character references.

Tree Builder

The tree builder (wgpu-html-parser/src/tree_builder.rs) consumes the token stream and constructs the DOM tree.

Void Elements

14 elements are void (cannot have children, never need a closing tag):

area, base, br, col, embed, hr, img, input,
link, meta, param, source, track, wbr

When a void element is opened (or self-closed), it is immediately pushed and popped — no children are collected.

<br> <!-- immediately popped, no children -->
<img src="x.png"> <!-- immediately popped, no children -->
<hr/> <!-- self-closing, also immediately popped -->

Self-Closing Recognition

Any tag ending with /> sets self_closing: true. For void elements this is redundant; for non-void elements it functions as an immediate close:

<div/> <!-- treated as <div></div> -->
<span/> <!-- treated as <span></span> -->

Auto-Close Rules

The tree builder implements auto-close for several element groups to handle HTML where closing tags are omitted:

Opened TagAuto-closes on
<p>Next <p>, <div>, heading, <ul>, <ol>, <dl>, <table>, <form>, <header>, <footer>, <nav>, <section>, <article>, <aside>, <main>, <details>, <fieldset>, <figure>, <hr>, <pre>, <blockquote>, <address>, or end of parent
<li>Next <li>
<dt>, <dd>Next <dt> or <dd>
<thead>Next <tbody> or <tfoot>
<tbody>Next <thead> or <tfoot>
<tfoot>Next <tbody>
<tr>Next <tr>
<th>, <td>Next <th> or <td>
<option>Next <option> or <optgroup>
<optgroup>Next <optgroup>
<rt>, <rp>Next <rt> or <rp>

At end-of-file, all remaining open elements are auto-closed.

Unknown Tags

Tags not matching any of the ~98 recognized element types are dropped silently along with their entire subtree. The tree builder pushes a None slot on the stack, collects children normally (for nested recognized elements), and discards everything on close:

<custom-element>
<p>This paragraph survives</p> <!-- Actually, it won't →
</custom-element>
<!-- Neither <custom-element> nor <p> appear in the tree -->

Synthetic <body> Wrapping

If the token stream produces zero or one top-level node, it becomes the tree root directly. If multiple top-level nodes are produced, they are wrapped in a synthetic <body>:

<!-- Input: -->
<h1>Title</h1>
<p>Paragraph</p>
<!-- Tree root: <body><h1>Title</h1><p>Paragraph</p></body> -->

If one of the top-level nodes is already a <body>, siblings are merged into it instead of creating a second wrapper.

Inline CSS Extraction

style Attributes

The value of every style="..." attribute is parsed by parse_inline_style_decls():

use wgpu_html_parser::parse_inline_style;
use wgpu_html_models::Style;

let style: Style = parse_inline_style("color: red; font-size: 16px; display: flex;");

!important declarations are recognized and respected in cascade ordering. Custom property references (var(--x)) are resolved during cascade.

<style> Blocks

The text content of <style> elements is parsed by parse_stylesheet():

use wgpu_html_parser::Stylesheet;

let css = r#"
.card { background: #fff; border-radius: 8px; }
.card.active { border-color: blue; }
"#;
let sheet: Stylesheet = wgpu_html_parser::parse_stylesheet(css);

The parser supports:

  • Tag, #id, .class, universal *, and comma-separated selector lists
  • Descendant combinator (space) — e.g., .card p
  • /* CSS comments */
  • Specificity calculation: (id << 16) | (class << 8) | tag
  • !important flag

Child (>), sibling (+, ~), and attribute selectors ([attr]) are not yet supported in the stylesheet parser (they work in the query_selector API).

Serialization

The tree can be serialized back to HTML for debugging:

use wgpu_html_parser::parse;

let tree = parse("<div id='main'><p>Hello</p></div>");

// Full document with <!DOCTYPE> prefix
let html: String = tree.to_html();
// => <!DOCTYPE html>\n<div id="main"><p>Hello</p></div>

// Single node as HTML fragment
let node_html: Option<String> = tree.node_to_html(&[0, 0]);
// => Some("<p>Hello</p>")

// Any Node can serialize itself
let p_node = &tree.root.as_ref().unwrap().children[0];
let p_html: String = p_node.to_html();
// => <p>Hello</p>

Serialization escapes &, <, > in text content, and escapes ", &, <, > in attribute values. Void elements omit closing tags. data-* and aria-* attributes are included. Raw-text elements (<style>, <script>) serialize their content unescaped.

See Also