Implementing a Simple HTML Parser in JavaScript
This article walks through building a simple HTML parser in JavaScript. It covers browser parsing basics, regular expressions for detecting tags, a stack for matching opening and closing elements, and the creation of element and text node objects. It also notes omitted features such as script and style handling.
This article explains how to build a basic HTML parser using JavaScript. It starts by describing the role of the browser's HTML parser and then walks through the core principles of parsing HTML strings into a tree of nodes.
Core Principles
1. Use regular expressions to match opening tags, closing tags, and self-closing tags.
2. Manage a stack (LIFO) to keep track of open elements and match them with their corresponding closing tags.
3. Record text nodes that appear between tags.
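The three principles above can be sketched in a few lines. This is a minimal illustration rather than the article's implementation: `TAG`, `tokenize`, and `isBalanced` are hypothetical names, and the tag pattern is deliberately simplified (it does not protect a `>` inside a quoted attribute value, and it treats a void tag written without a slash, such as `<br>`, as an ordinary opening tag).

```javascript
// Simplified tag pattern (an assumption for this sketch): optional leading
// slash, tag name, anything up to '>', optional trailing slash.
const TAG = /<(\/?)([a-zA-Z][-.:0-9_a-zA-Z]*)[^>]*?(\/?)>/g;

// Principles 1 and 3: split an HTML string into tag tokens and text tokens.
function tokenize(html) {
  const tokens = [];
  let lastEnd = 0; // end position of the previous tag
  let m;
  TAG.lastIndex = 0; // the pattern is shared, so always start from 0
  while ((m = TAG.exec(html)) !== null) {
    const start = TAG.lastIndex - m[0].length;
    // Anything between two tags is recorded as a text token.
    if (start > lastEnd) {
      tokens.push({ type: 'text', value: html.slice(lastEnd, start) });
    }
    lastEnd = TAG.lastIndex;
    const kind = m[1] ? 'close' : m[3] ? 'selfclose' : 'open';
    tokens.push({ type: kind, tagName: m[2] });
  }
  // Trailing text after the last tag.
  if (lastEnd < html.length) {
    tokens.push({ type: 'text', value: html.slice(lastEnd) });
  }
  return tokens;
}

// Principle 2: a LIFO stack pairs each closing tag with the most recent
// opening tag that is still unclosed.
function isBalanced(html) {
  const stack = [];
  for (const token of tokenize(html)) {
    if (token.type === 'open') {
      stack.push(token.tagName);
    } else if (token.type === 'close') {
      if (stack.pop() !== token.tagName) return false;
    }
    // 'selfclose' and 'text' tokens never touch the stack.
  }
  return stack.length === 0;
}

console.log(isBalanced('<div><img src="a.jpg" /></div>')); // true
console.log(isBalanced('<div><span></div>'));              // false
```

A full parser does the same bookkeeping but pushes node objects instead of tag names, so that each new node can be attached to the element currently on top of the stack.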
Key Code Snippets
const root = parse(`<div id="test" class="container" c="b"><div class="text-block"><span id="xxx">Hello World</span></div><img src="xx.jpg" /></div>`);
console.log(root);
// [{"tagName":"","children":[{"tagName":"div","attrs":{"id":"test","class":"container"},"rawAttrs":"id=\"test\" class=\"container\" c=\"b\"","type":"element","range":[0,128],"children":[{"tagName":"div","attrs":{"class":"text-block"},"rawAttrs":"class=\"text-block\"","type":"element","range":[39,102],"children":[{"tagName":"span","attrs":{"id":"xxx"},"rawAttrs":"id=\"xxx\"","type":"element","range":[63,96],"children":[{"type":"text","range":[78,89],"value":"Hello World"}]}]},{"tagName":"img","attrs":{},"rawAttrs":"src=\"xx.jpg\" ","type":"element","range":[102,122],"children":[]}]}]}]

// Initialize the two node types
const nodeType = {
  TEXT: 'text',
  ELEMENT: 'element',
};
const frameflag = 'rootnode';
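// Positions are measured in the wrapped string, so subtract the length of
// the `<rootnode>` prefix to map them back to the original input
const createRange = (startPos, endPos) => {
  const frameFlagOffset = frameflag.length + 2;
  return [startPos - frameFlagOffset, endPos - frameFlagOffset];
};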
function arrBack(arr) { return arr[arr.length - 1]; }
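function parse(data) {
  const root = { tagName: '', children: [] };
  let currentParent = root;
  const stack = [root];
  let lastTextPos = -1;
  // Wrap the input in a sentinel root tag
  data = `<${frameflag}>${data}</${frameflag}>`;
  // ... traverse and parse here
  // After processing, the stack holds the final result
  return stack;
}

// Example: with the g flag, exec() resumes from regex.lastIndex on each call
const regex = /foo/g;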
const str = 'table football, foosball';
let matchArray;
while ((matchArray = regex.exec(str)) !== null) {
  console.log(`Found ${matchArray[0]}. Next starts at ${regex.lastIndex}.`);
}

// Matches opening, closing, and self-closing tags
const kMarkupPattern = /<(\/?)([a-zA-Z][-.:0-9_a-zA-Z]*)((?:\s+[^>]*?(?:(?:'[^']*')|(?:"[^"]*"))?)*)\s*(\/?)>/g;
let match;
while ((match = kMarkupPattern.exec(data))) {
  let { 0: matchText, 1: leadingSlash, 2: tagName, 3: attributes, 4: closingSlash } = match;
  const matchLength = matchText.length;
  // lastIndex points just past the match, so the tag spans [tagStartPos, tagEndPos)
  const tagStartPos = kMarkupPattern.lastIndex - matchLength;
  const tagEndPos = kMarkupPattern.lastIndex;
  // handle text nodes, attributes, etc.
}

// Self-closing elements
const kSelfClosingElements = {
  area: true,
  img: true,
  // ... some tags omitted
};
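
// Inside the matching loop: pop the stack on a closing tag, a trailing slash,
// or a self-closing element. This check is meant to run after the opening-tag
// branch, so a tag such as <img /> is pushed and popped in the same iteration.
if (leadingSlash || closingSlash || kSelfClosingElements[tagName]) {
  if (currentParent.tagName === tagName) {
    // Matching tag: extend the element's range to the end of the closing tag
    currentParent.range[1] = createRange(-1, Math.max(lastTextPos, tagEndPos))[1];
    stack.pop();
    currentParent = arrBack(stack);
  } else {
    // Mismatched tag: this simplified version pops anyway
    stack.pop();
    currentParent = arrBack(stack);
  }
}

// Inside the matching loop: create a node for every tag that is not a closing tag
if (!leadingSlash) {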
  const attrs = {};
  // Only id and class are extracted into attrs; every other attribute
  // remains available through rawAttrs
  const kAttributePattern = /(?:^|\s)(id|class)\s*=\s*((?:'[^']*')|(?:"[^"]*")|\S+)/gi;
  for (let attMatch; (attMatch = kAttributePattern.exec(attributes));) {
    const { 1: key, 2: val } = attMatch;
    const isQuoted = val[0] === '\'' || val[0] === '"';
    // Strip the surrounding quotes from quoted values
    attrs[key.toLowerCase()] = isQuoted ? val.slice(1, val.length - 1) : val;
  }
  const currentNode = {
    tagName,
    attrs,
    rawAttrs: attributes.slice(1),
    type: nodeType.ELEMENT,
    range: createRange(tagStartPos, tagEndPos),
    children: [],
  };
  // Attach the node to its parent and make it the current parent
  currentParent.children.push(currentNode);
  currentParent = currentNode;
  stack.push(currentParent);
}

The tutorial also notes that the implementation omits handling of script, style, and other node types, and suggests extending the parser to comply fully with the W3C HTML specification.
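One possible way to add the omitted script and style handling is to treat them as raw-text elements: once such an opening tag is seen, jump straight to its closing tag instead of running the tag pattern over the content. The sketch below is an assumption about how this could be done, not part of the article's code. `kRawTextElements` and `readRawText` are hypothetical names, and the search is case-sensitive, so it assumes the closing tag is written in the same case as `tagName`.

```javascript
// Elements whose content should not be parsed as markup (assumed list).
const kRawTextElements = { script: true, style: true };

// Given the full input, a raw-text tag name, and the position just after its
// opening tag, return the raw text and the position where parsing can resume.
function readRawText(data, tagName, contentStart) {
  const closeMarkup = `</${tagName}>`;
  const closeIndex = data.indexOf(closeMarkup, contentStart);
  if (closeIndex === -1) {
    // No closing tag: treat the rest of the input as raw text.
    return { text: data.slice(contentStart), resumeAt: data.length, closed: false };
  }
  return {
    text: data.slice(contentStart, closeIndex),
    resumeAt: closeIndex + closeMarkup.length, // just past the closing tag
    closed: true,
  };
}

// Usage: the '<' inside the script body is not mistaken for a tag.
const html = '<script>if (a < b) { run(); }</script><p>ok</p>';
const result = readRawText(html, 'script', '<script>'.length);
console.log(result.text);                 // if (a < b) { run(); }
console.log(html.slice(result.resumeAt)); // <p>ok</p>
```

In a loop like the one above, the parser could check `kRawTextElements[tagName]` after pushing the opening tag, store `result.text` as a text node, and assign `result.resumeAt` to `kMarkupPattern.lastIndex` so that the next `exec` call continues after the closing tag.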
Finally, it provides references for further reading and credits the author.
DaTaobao Tech
Official account of DaTaobao Technology