Build a Simple HTML Parser in JavaScript: Step-by-Step Guide

This article shows how to build a basic HTML parser in JavaScript. A regular expression finds the tags and a stack pairs them, turning an HTML string into a tree of nodes. The code examples cover initialization and the handling of opening, closing, and self‑closing tags.

Taobao Frontend Technology

Principle Overview

We need to implement a parse function that receives an HTML string and returns a tree structure representing the DOM.

const root = parse(`<div id="test" class="container" c="b"><div class="text-block"><span id="xxx">Hello World</span></div><img src="xx.jpg" /></div>`);
console.log(root);

Core Principles

Use regular expressions to match tags such as <tag class="tag" aa=""> and </tag>.

Match opening and closing tag pairs with a stack (LIFO) approach.
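To make the stack idea concrete before any parsing, here is a minimal sketch that checks whether a flat list of tag tokens is properly nested. The helper name isBalanced and the token format (a leading "/" marks a closing tag) are assumptions made for this illustration only; they are not part of the parser built below.

```javascript
// Hypothetical helper: checks that a flat list of tag tokens is properly nested.
// A token such as 'div' opens a tag; '/div' closes it.
function isBalanced(tokens) {
  const stack = [];
  for (const token of tokens) {
    if (token.startsWith('/')) {
      // A closing tag must match the most recently opened tag (LIFO).
      if (stack.pop() !== token.slice(1)) return false;
    } else {
      stack.push(token);
    }
  }
  // Anything left on the stack was opened but never closed.
  return stack.length === 0;
}

console.log(isBalanced(['div', 'span', '/span', '/div'])); // true
console.log(isBalanced(['div', 'span', '/div', '/span'])); // false
```

The parser uses the same push-on-open, pop-on-close pattern, but it stores node objects on the stack instead of tag names.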

Initialization

Define simple node types and helper functions:

// Initialize two node types
const nodeType = {
  TEXT: 'text',
  ELEMENT: 'element'
};
const frameflag = 'rootnode'; // simulated root tag
const createRange = (startPos, endPos) => {
  const frameFlagOffset = frameflag.length + 2;
  return [startPos - frameFlagOffset, endPos - frameFlagOffset];
};
function arrBack(arr) { return arr[arr.length - 1]; }
function parse(data) {
  const root = { tagName: '', children: [] };
  let currentParent = root;
  const stack = [root];
  let lastTextPos = -1;
  data = `<${frameflag}>${data}</${frameflag}>`;
  // ... parsing logic ...
  return root; // return the tree root, not the working stack
}
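Because the input is wrapped in a simulated root tag, every position the parser sees is shifted by the length of `<rootnode>`. The sketch below shows how createRange maps a position in the wrapped string back to the original string. The sample input and the variable names html, wrapped, start, and range are assumptions for this illustration.

```javascript
const frameflag = 'rootnode';
const createRange = (startPos, endPos) => {
  const frameFlagOffset = frameflag.length + 2; // '<rootnode>' is 10 characters
  return [startPos - frameFlagOffset, endPos - frameFlagOffset];
};

const html = '<p>hi</p>';
const wrapped = `<${frameflag}>${html}</${frameflag}>`;

// Locate '<p>' in the wrapped string, then translate back to the original.
const start = wrapped.indexOf('<p>');
const range = createRange(start, start + '<p>'.length);
console.log(range); // [ 0, 3 ]
console.log(html.slice(range[0], range[1])); // <p>
```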

Traversing and Extracting Tag Strings

Example HTML fragment to be parsed:

<div id="test" class="container" c="b">
  <div class="text-block">
    <span id="xxx">Hello World</span>
  </div>
  <img src="xx.jpg" />
</div>

The parser extracts each tag string in order using RegExp.prototype.exec. With the g flag, each call resumes from regex.lastIndex, so a loop visits every match in turn:

const regex = /foo/g;
const str = 'table football, foosball';
let matchArray;
while ((matchArray = regex.exec(str)) !== null) {
  console.log(`Found ${matchArray[0]}. Next starts at ${regex.lastIndex}.`);
}

Markup pattern for HTML tags:

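const kMarkupPattern = /<(\/?)([a-zA-Z][-.:0-9_a-zA-Z]*)((?:\s+[^>]*?(?:(?:'[^']*')|(?:"[^"]*"))?)*)\s*(\/?)>/g;
let match;
while ((match = kMarkupPattern.exec(data))) {
  const {0: matchText, 1: leadingSlash, 2: tagName, 3: attributes, 4: closingSlash} = match;
  // Positions of this tag within data, used later to build each node's range
  const tagEndPos = kMarkupPattern.lastIndex;
  const tagStartPos = tagEndPos - matchText.length;
  // process matchText, tagName, attributes, etc.
}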
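To see what the markup pattern captures, the following sketch runs it over a short string and labels each match as an opening, closing, or self‑closing tag. The helper listTags and the sample input are assumptions for this illustration; only kMarkupPattern comes from the article.

```javascript
const kMarkupPattern = /<(\/?)([a-zA-Z][-.:0-9_a-zA-Z]*)((?:\s+[^>]*?(?:(?:'[^']*')|(?:"[^"]*"))?)*)\s*(\/?)>/g;

// Hypothetical helper: lists each tag found in an HTML string with its kind.
function listTags(html) {
  const found = [];
  kMarkupPattern.lastIndex = 0; // a global regex keeps state between calls
  let match;
  while ((match = kMarkupPattern.exec(html))) {
    // Group 1 is the leading slash, group 2 the tag name, group 4 the trailing slash.
    const [, leadingSlash, tagName, , closingSlash] = match;
    const kind = leadingSlash ? 'close' : closingSlash ? 'self-closing' : 'open';
    found.push(`${kind} ${tagName}`);
  }
  return found;
}

console.log(listTags('<div id="a"><img src="x.jpg" /><span>Hi</span></div>'));
// [ 'open div', 'self-closing img', 'open span', 'close span', 'close div' ]
```

Text such as "Hi" never matches the pattern, which is why a parser has to recover text nodes from the gaps between consecutive tag matches.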

Handling Opening Tags

if (!leadingSlash) {
  const attrs = {};
  // Only id and class are parsed into attrs; all other attributes stay in rawAttrs.
  const kAttributePattern = /(?:^|\s)(id|class)\s*=\s*((?:'[^']*')|(?:"[^"]*")|\S+)/gi;
  for (let attMatch; (attMatch = kAttributePattern.exec(attributes));) {
    const {1: key, 2: val} = attMatch;
    const isQuoted = val[0] === '\'' || val[0] === '"';
    attrs[key.toLowerCase()] = isQuoted ? val.slice(1, -1) : val;
  }
  const currentNode = {
    tagName,
    attrs,
    rawAttrs: attributes.slice(1),
    type: nodeType.ELEMENT,
    range: createRange(tagStartPos, tagEndPos),
    children: []
  };
  currentParent.children.push(currentNode);
  currentParent = currentNode;
  stack.push(currentParent);
}

The stack is what makes tag pairing work. Because it is last in, first out, the element on top is always the innermost tag that is still open, which is the one the next closing tag should match.
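The attribute loop can be tried on its own. The sketch below wraps it in a function named parseIdAndClass, which is a hypothetical name for this illustration. It shows that only id and class end up in the result, and that unquoted values are accepted.

```javascript
// Hypothetical helper wrapping the attribute loop from the opening-tag handler.
function parseIdAndClass(attributes) {
  const attrs = {};
  const kAttributePattern = /(?:^|\s)(id|class)\s*=\s*((?:'[^']*')|(?:"[^"]*")|\S+)/gi;
  for (let attMatch; (attMatch = kAttributePattern.exec(attributes));) {
    const { 1: key, 2: val } = attMatch;
    // Strip the surrounding quotes when the value is quoted.
    const isQuoted = val[0] === '\'' || val[0] === '"';
    attrs[key.toLowerCase()] = isQuoted ? val.slice(1, -1) : val;
  }
  return attrs;
}

console.log(parseIdAndClass(' id="test" class="container" c="b"'));
// { id: 'test', class: 'container' }
console.log(parseIdAndClass(' id=main'));
// { id: 'main' }
```

The c="b" attribute is skipped because the pattern names only id and class. It is still available in rawAttrs on the node.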

Handling Closing and Self‑Closing Tags

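// Void elements never have a closing tag, so they are closed as soon as they open.
const kSelfClosingElements = { area: true, img: true /* ... */ };
if (leadingSlash || closingSlash || kSelfClosingElements[tagName]) {
  if (currentParent.tagName === tagName) {
    // Matching tag: record where the element ends, then return to its parent.
    currentParent.range[1] = createRange(-1, Math.max(lastTextPos, tagEndPos))[1];
    stack.pop();
    currentParent = arrBack(stack);
  } else {
    // Mismatch (malformed markup): this simplified parser pops one level and moves on.
    stack.pop();
    currentParent = arrBack(stack);
  }
}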
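The pieces above can be combined into one small runnable function. The following is a minimal sketch, not the article's full implementation, and it simplifies in several ways that are assumptions of this example: it skips the simulated root tag, ranges, and attributes; it collects text between tags into text nodes; and it ignores a closing tag that does not match instead of popping. The name miniParse and the short list of void elements are hypothetical.

```javascript
const nodeType = { TEXT: 'text', ELEMENT: 'element' };
const kMarkupPattern = /<(\/?)([a-zA-Z][-.:0-9_a-zA-Z]*)((?:\s+[^>]*?(?:(?:'[^']*')|(?:"[^"]*"))?)*)\s*(\/?)>/g;
// Assumed, abbreviated list of void elements for this sketch.
const kSelfClosingElements = { area: true, br: true, hr: true, img: true, input: true };

function miniParse(html) {
  const root = { tagName: '', type: nodeType.ELEMENT, children: [] };
  const stack = [root];
  let currentParent = root;
  let lastTextPos = 0; // end of the previous tag
  kMarkupPattern.lastIndex = 0; // reset the global regex for each call

  let match;
  while ((match = kMarkupPattern.exec(html))) {
    const [matchText, leadingSlash, tagName, , closingSlash] = match;
    const tagEndPos = kMarkupPattern.lastIndex;
    const tagStartPos = tagEndPos - matchText.length;

    // Text between the previous tag and this one becomes a text node.
    if (tagStartPos > lastTextPos) {
      const text = html.slice(lastTextPos, tagStartPos);
      currentParent.children.push({ type: nodeType.TEXT, text });
    }
    lastTextPos = tagEndPos;

    if (!leadingSlash) {
      // Opening tag: attach to the current parent and descend into it.
      const node = { tagName, type: nodeType.ELEMENT, children: [] };
      currentParent.children.push(node);
      currentParent = node;
      stack.push(node);
    }
    if (leadingSlash || closingSlash || kSelfClosingElements[tagName]) {
      // Pop only on a match; the root itself is never popped.
      if (currentParent.tagName === tagName && stack.length > 1) {
        stack.pop();
        currentParent = stack[stack.length - 1];
      }
    }
  }
  // Any text after the last tag.
  if (lastTextPos < html.length) {
    currentParent.children.push({ type: nodeType.TEXT, text: html.slice(lastTextPos) });
  }
  return root;
}

const tree = miniParse('<div id="test"><span>Hello World</span><img src="xx.jpg" /></div>');
const div = tree.children[0];
console.log(div.tagName); // div
console.log(div.children.map((child) => child.tagName)); // [ 'span', 'img' ]
console.log(div.children[0].children[0].text); // Hello World
```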

Conclusion

This article showed how to build a basic HTML parser in JavaScript, covering regex matching, stack management, and node construction. Further work is needed to handle script and style tags, and to implement proper Node subclasses such as Element, HTMLElement, Text, and Comment according to the W3C specification.
