Removing Useless Nodes From The DOM

For the third article in this series of short-and-sweet functions, I'd like to show you a simple function that I find indispensable when working with static HTML in the DOM. The function is called clean(), and its purpose is to remove comments and whitespace-only text nodes.

The function takes a single element reference as its argument, and removes all those unwanted nodes from inside it. The function operates directly on the element in question, because a function that's given an object in JavaScript receives a reference to the original object, not a copy of it.
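
To see what that means in practice, here is a minimal sketch that uses a plain object rather than a DOM node. The names addFlag, replace and settings are made up for illustration. Strictly speaking, the reference itself is passed by value, so changes made to the object are visible to the caller, but reassigning the parameter is not:

```javascript
// Hypothetical helper: modifies the object it is given
function addFlag(obj)
{
	obj.cleaned = true;
}

// Hypothetical helper: reassigning the parameter only changes the local variable
function replace(obj)
{
	obj = { cleaned: false };
}

var settings = { name: "example" };
addFlag(settings);
replace(settings);

console.log(settings.cleaned); // true
```

The clean function relies on the first behavior: it removes nodes from the element it receives, and the caller sees the result without needing a return value.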

Here's the clean function's code:

function clean(node)
{
	for(var n = 0; n < node.childNodes.length; n ++)
	{
		var child = node.childNodes[n];
		if
		(
			child.nodeType === 8 
			|| 
			(child.nodeType === 3 && !/\S/.test(child.nodeValue))
		)
		{
			node.removeChild(child);
			n --;
		}
		else if(child.nodeType === 1)
		{
			clean(child);
		}
	}
}

So to clean those unwanted nodes from inside the <body>, you would simply do this:

clean(document.body);

Or to clean the entire document, you could do this:

clean(document);

Although the usual reference would be an Element node, it could also be another kind of element-containing node, such as a #document. The function is also not restricted to working with HTML, and can operate on any other kind of XML DOM.

What The Function's For

When working with the DOM in JavaScript, we use standard properties like firstChild and nextSibling to get relative node references. But a complication arises because of the presence of whitespace in the DOM, such as we can see in this example:

<div>
	<h2>Shopping list</h2>
	<ul>
		<li>Washing-up liquid</li>
		<li>Zinc nails</li>
		<li>Hydrochloric acid</li>
	</ul>
</div>

For most modern browsers (all apart from IE8 or earlier), that would have the following DOM structure:

DIV
+ #text ("\n\t")
+ H2
| + #text ("Shopping list")
+ #text ("\n\t")
+ UL
| + #text ("\n\t\t")
| + LI
| | + #text ("Washing-up liquid")
| + #text ("\n\t\t")
| + LI
| | + #text ("Zinc nails")
| + #text ("\n\t\t")
| + LI
| | + #text ("Hydrochloric acid")
| + #text ("\n\t")
+ #text ("\n")

The line-breaks and tabs inside that tree appear as whitespace #text nodes. So for example, if we started with a reference to the <h2> element, then h2.nextSibling would not refer to the <ul> element; it would refer to the whitespace #text node (the line-break and tab) that comes before it. Or if we started with a reference to the <ul>, then ul.firstChild would not be the first <li>; it would be the whitespace before it.

HTML comments are also nodes, and most browsers also preserve them in the DOM — as they should, because it's not up to browsers to decide which nodes are important and which are not. But it's very rare for scripts to actually want the data in comments; it's far more likely that comments (and intervening whitespace) are unwanted “junk” nodes.

There are several ways of dealing with these nodes, for example, by iterating past them:

var ul = h2.nextSibling;
while(ul && ul.nodeType !== 1)
{
	ul = ul.nextSibling;
}
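
If you'd rather keep the whitespace and skip over it, that loop can be wrapped in a small helper. This is only a sketch, and nextElement is a hypothetical name; it returns null when there is no following element. Browsers from IE9 onward also provide the built-in nextElementSibling and firstElementChild properties, which do the same job for elements.

```javascript
// Hypothetical helper: returns the next sibling that is an element (nodeType 1),
// or null if there isn't one
function nextElement(node)
{
	var sibling = node.nextSibling;
	while(sibling && sibling.nodeType !== 1)
	{
		sibling = sibling.nextSibling;
	}
	return sibling || null;
}
```

With that in place, the earlier example becomes var ul = nextElement(h2), though every such lookup still has to go through the helper.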

But by far the simplest and most practical approach is to remove them. That's what the clean function does: it effectively normalizes the element's subtree, creating a model that matches our practical use of it and is the same across browsers.

Once that <div> is cleaned, those h2.nextSibling and ul.firstChild references will point to the expected elements, and its DOM will look like this:

DIV
+ H2
| + #text ("Shopping list")
+ UL
| + LI
| | + #text ("Washing-up liquid")
| + LI
| | + #text ("Zinc nails")
| + LI
| | + #text ("Hydrochloric acid")

How The Function Works

This is an example of a recursive function — a function that calls itself. Recursion is a very powerful feature, and means that the function can clean a subtree of any size and depth. The key to that behavior is the final condition:

else if(child.nodeType === 1)
{
	clean(child);
}

So each of the element's child-elements is passed back through the function, then each of their child-elements is passed back through as well ... and so on, for as many descendants as there are.

Within each instance, the function iterates through the element's childNodes collection, removing any #comment nodes (which have the nodeType 8), or any #text nodes (with nodeType 3) whose value is nothing but whitespace. The regular expression is actually an inverse test, looking for nodes which don't contain non-whitespace characters, because that's the simplest way of expressing the condition.
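
The whitespace test can be tried on plain strings, without any DOM. In this sketch, isBlank is a hypothetical name for the same expression the function uses:

```javascript
// Hypothetical wrapper around the test used in clean():
// true when the value contains no non-whitespace characters
function isBlank(value)
{
	return !/\S/.test(value);
}

console.log(isBlank("\n\t"));           // true
console.log(isBlank(""));               // true
console.log(isBlank("  Zinc nails\n")); // false
```

Note that an empty string also passes the test, since it contains no non-whitespace characters either.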

The function doesn't remove all whitespace, of course: any whitespace that's part of a #text node which also contains non-whitespace text is preserved. So the only #text nodes affected are those which contain nothing but whitespace.

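Note that the iterator has to query childNodes.length every time, rather than saving the length in advance, as is usually more efficient. We have to do this because we're removing nodes as we go along, which obviously changes the length of the collection. For the same reason, the counter is decremented after each removal, so that the node which moves into the vacated index isn't skipped.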
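
An alternative that avoids adjusting the counter is to iterate backwards, since removing a node then only shifts the indexes of nodes that have already been visited. The following is a sketch of that variation, not the version used in this article, and cleanBackwards is a hypothetical name:

```javascript
// Variation on clean() that walks the childNodes collection from the end,
// so a removal never shifts a node that hasn't been checked yet
function cleanBackwards(node)
{
	for(var n = node.childNodes.length - 1; n >= 0; n --)
	{
		var child = node.childNodes[n];
		if
		(
			child.nodeType === 8
			||
			(child.nodeType === 3 && !/\S/.test(child.nodeValue))
		)
		{
			node.removeChild(child);
		}
		else if(child.nodeType === 1)
		{
			cleanBackwards(child);
		}
	}
}
```

Here the length only needs to be read once, at the start of the loop. Which version to prefer is largely a matter of taste; the end result should be the same.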