Build a Simple HTML Parser in JavaScript: Step-by-Step Guide
This article explains how to create a basic HTML parser in JavaScript by using regular expressions and a stack to convert an HTML string into a tree of nodes, complete with code examples, initialization steps, and handling of opening, closing, and self‑closing tags.
Principle Overview
We need to implement a parse function that receives an HTML string and returns a tree structure representing the DOM.
const root = parse(`<div id="test" class="container" c="b"><div class="text-block"><span id="xxx">Hello World</span></div><img src="xx.jpg" /></div>`);
console.log(root);
Core Principles
Use regular expressions to match tags such as <tag class="tag" aa=""> and </tag>.
Match opening and closing tag pairs with a stack (LIFO) approach.
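The two principles can be tried together in a minimal sketch before building the real parser. The isBalanced helper and its simplified tag pattern are assumptions made for illustration and are not part of the parser developed below. The pattern does not handle a ">" inside an attribute value.

```javascript
// Hypothetical helper: reports whether every opening tag is closed in LIFO order.
// The simplified pattern is an illustrative assumption, not the article's regex.
function isBalanced(html) {
  const tagPattern = /<(\/?)([a-zA-Z][-.:0-9_a-zA-Z]*)[^>]*?(\/?)>/g;
  const stack = [];
  let match;
  while ((match = tagPattern.exec(html)) !== null) {
    const [, leadingSlash, tagName, closingSlash] = match;
    if (closingSlash) continue;      // <img /> opens and closes at once
    if (!leadingSlash) {
      stack.push(tagName);           // opening tag: remember it
    } else if (stack.pop() !== tagName) {
      return false;                  // closing tag does not match the latest opener
    }
  }
  return stack.length === 0;         // leftover openers were never closed
}

console.log(isBalanced('<div><span>Hi</span><img src="x.jpg" /></div>')); // true
console.log(isBalanced('<div><span></div></span>'));                      // false
```

The stack holds only tag names here. The parser below pushes whole nodes instead, so that children can be attached to the node on top of the stack.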
Initialization
Define simple node types and helper functions:
// Initialize two node types
const nodeType = {
  TEXT: 'text',
  ELEMENT: 'element'
};
const frameflag = 'rootnode'; // simulated root tag
// Convert positions in the wrapped string back to positions in the original input
const createRange = (startPos, endPos) => {
  const frameFlagOffset = frameflag.length + 2; // length of "<rootnode>"
  return [startPos - frameFlagOffset, endPos - frameFlagOffset];
};
// Return the last element of an array (the top of the stack)
function arrBack(arr) { return arr[arr.length - 1]; }
function parse(data) {
  const root = { tagName: '', children: [] };
  let currentParent = root;
  const stack = [root];
  let lastTextPos = -1;
  data = `<${frameflag}>${data}</${frameflag}>`;
  // ... parsing logic ...
  return root; // return the tree itself, as the usage above expects
}
Traversing and Extracting Tag Strings
Example HTML fragment to be parsed:
<div id="test" class="container" c="b">
  <div class="text-block">
    <span id="xxx">Hello World</span>
  </div>
  <img src="xx.jpg" />
</div>
The parser extracts each tag string in order using RegExp.prototype.exec. With a global regex, each call resumes from lastIndex, as this standalone example shows:
const regex = /foo/g;
const str = 'table football, foosball';
let matchArray;
while ((matchArray = regex.exec(str)) !== null) {
  console.log(`Found ${matchArray[0]}. Next starts at ${regex.lastIndex}.`);
}
Markup pattern for HTML tags:
const kMarkupPattern = /<(\/?)([a-zA-Z][-.:0-9_a-zA-Z]*)((?:\s+[^>]*?(?:(?:'[^']*')|(?:"[^"]*"))?)*)\s*(\/?)>/g;
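let match;
while ((match = kMarkupPattern.exec(data))) {
  const {0: matchText, 1: leadingSlash, 2: tagName, 3: attributes, 4: closingSlash} = match;
  // Tag position in the wrapped string; lastIndex points just past the match
  const tagEndPos = kMarkupPattern.lastIndex;
  const tagStartPos = tagEndPos - matchText.length;
  // process matchText, tagName, attributes, etc.
}
Handling Opening Tags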
if (!leadingSlash) {
  const attrs = {};
  const kAttributePattern = /(?:^|\s)(id|class)\s*=\s*((?:'[^']*')|(?:"[^"]*")|\S+)/gi;
  for (let attMatch; (attMatch = kAttributePattern.exec(attributes));) {
    const {1: key, 2: val} = attMatch;
    const isQuoted = val[0] === '\'' || val[0] === '"';
    attrs[key.toLowerCase()] = isQuoted ? val.slice(1, -1) : val;
  }
  const currentNode = {
    tagName,
    attrs,
    rawAttrs: attributes.slice(1),
    type: nodeType.ELEMENT,
    range: createRange(tagStartPos, tagEndPos),
    children: []
  };
  currentParent.children.push(currentNode);
  currentParent = currentNode;
  stack.push(currentParent);
}
The stack is crucial: it uses a LIFO principle to match each opening tag with its corresponding closing tag.
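The attribute step of the opening-tag handler can also be exercised on its own. The sketch below wraps the same kAttributePattern loop in a hypothetical parseAttrs helper (the name is an assumption, not from the original code). Because the pattern names only id and class, other attributes such as c="b" or src are left out of attrs and survive only in rawAttrs.

```javascript
// Hypothetical helper wrapping the attribute loop shown above.
function parseAttrs(attributes) {
  const attrs = {};
  const kAttributePattern = /(?:^|\s)(id|class)\s*=\s*((?:'[^']*')|(?:"[^"]*")|\S+)/gi;
  for (let attMatch; (attMatch = kAttributePattern.exec(attributes));) {
    const {1: key, 2: val} = attMatch;
    const isQuoted = val[0] === '\'' || val[0] === '"';
    attrs[key.toLowerCase()] = isQuoted ? val.slice(1, -1) : val;
  }
  return attrs;
}

// Attribute string as captured by group 3 of kMarkupPattern (leading space included)
const attributes = ' id="test" class="container" c="b"';
console.log(parseAttrs(attributes)); // { id: 'test', class: 'container' }
console.log(attributes.slice(1));    // id="test" class="container" c="b"
```

Widening the (id|class) alternative to a general attribute-name pattern would capture every attribute, at the cost of more work per tag.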
Handling Closing and Self‑Closing Tags
const kSelfClosingElements = { area: true, img: true /* ... */ };
if (leadingSlash || closingSlash || kSelfClosingElements[tagName]) {
  if (currentParent.tagName === tagName) {
    // Matching tag: extend the node's range to the end of its closing tag, then leave it
    currentParent.range[1] = createRange(-1, Math.max(lastTextPos, tagEndPos))[1];
    stack.pop();
    currentParent = arrBack(stack);
  } else {
    // Mismatched tag: simplified recovery that just unwinds one level
    stack.pop();
    currentParent = arrBack(stack);
  }
}
Conclusion
The article demonstrates how to build a basic HTML parser with JavaScript, covering regex matching, stack management, and node construction. It notes that further work is needed to handle script/style tags and to implement proper Node subclasses such as Element, HTMLElement, Text, and Comment according to the W3C specification.
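The fragments above can be assembled into one runnable sketch. Several details are assumptions added to make it self-contained and are not confirmed by the original code: the shape of text nodes ({ type, text, range }), skipping the simulated root tag so it never enters the tree, and the guard that keeps the root on the stack when a stray closing tag appears.

```javascript
// Assembled sketch; text-node handling, the frame-tag skip, and the root guard are assumptions.
const nodeType = { TEXT: 'text', ELEMENT: 'element' };
const frameflag = 'rootnode'; // simulated root tag
const kMarkupPattern = /<(\/?)([a-zA-Z][-.:0-9_a-zA-Z]*)((?:\s+[^>]*?(?:(?:'[^']*')|(?:"[^"]*"))?)*)\s*(\/?)>/g;
const kAttributePattern = /(?:^|\s)(id|class)\s*=\s*((?:'[^']*')|(?:"[^"]*")|\S+)/gi;
const kSelfClosingElements = { area: true, img: true /* ... */ };

const arrBack = (arr) => arr[arr.length - 1];
const createRange = (startPos, endPos) => {
  const offset = frameflag.length + 2; // length of "<rootnode>"
  return [startPos - offset, endPos - offset];
};

function parse(html) {
  const root = { tagName: '', children: [] };
  const stack = [root];
  let currentParent = root;
  let lastTextPos = -1;
  const data = `<${frameflag}>${html}</${frameflag}>`;
  kMarkupPattern.lastIndex = 0; // defensive reset of the shared global regex

  for (let match; (match = kMarkupPattern.exec(data));) {
    const { 0: matchText, 1: leadingSlash, 2: tagName, 3: attributes, 4: closingSlash } = match;
    const tagEndPos = kMarkupPattern.lastIndex;
    const tagStartPos = tagEndPos - matchText.length;

    // Assumption: text between the previous tag and this one becomes a text node
    if (lastTextPos > -1 && lastTextPos < tagStartPos) {
      currentParent.children.push({
        type: nodeType.TEXT,
        text: data.substring(lastTextPos, tagStartPos),
        range: createRange(lastTextPos, tagStartPos)
      });
    }
    lastTextPos = tagEndPos;
    if (tagName === frameflag) continue; // the wrapper is not part of the tree

    if (!leadingSlash) {
      const attrs = {};
      for (let attMatch; (attMatch = kAttributePattern.exec(attributes));) {
        const { 1: key, 2: val } = attMatch;
        const isQuoted = val[0] === '\'' || val[0] === '"';
        attrs[key.toLowerCase()] = isQuoted ? val.slice(1, -1) : val;
      }
      const currentNode = {
        tagName,
        attrs,
        rawAttrs: attributes.slice(1),
        type: nodeType.ELEMENT,
        range: createRange(tagStartPos, tagEndPos),
        children: []
      };
      currentParent.children.push(currentNode);
      currentParent = currentNode;
      stack.push(currentParent);
    }
    if (leadingSlash || closingSlash || kSelfClosingElements[tagName]) {
      if (currentParent.tagName === tagName) {
        currentParent.range[1] = createRange(-1, Math.max(lastTextPos, tagEndPos))[1];
      }
      if (stack.length > 1) stack.pop(); // assumption: never pop the root itself
      currentParent = arrBack(stack);
    }
  }
  return root;
}
```

Under these assumptions, calling parse on the single-line example from the overview should yield a root whose only child is the outer div, with the span holding one text node for "Hello World". Whitespace between tags in multi-line input would also become text nodes in this sketch.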
Taobao Frontend Technology