---
layout: post
title: "The Landmine of Parsing HTML and Stripping HTML Comments"
date: 2008-11-10 -0800
comments: true
disqus_identifier: 18551
categories: [code, regex]
---

A while ago I wrote a blog post about how painful it is to [properly
parse an email
address](http://haacked.com/archive/2007/08/21/i-knew-how-to-validate-an-email-address-until-i.aspx "Validating an email address").
This post is kind of like that, except that this time, I take on HTML.

I’ve written about [parsing HTML with a regular
expression](http://haacked.com/archive/2005/04/22/Matching_HTML_With_Regex.aspx "Matching HTML with regular expressions")
in the past and pointed out that it’s extremely tricky and probably not
a good idea to use regular expressions in this case. In this post, I
want to strip out HTML comments. Why?

I had some code that uses a regular expression to strip comments from
HTML, but found one of those feared “pathological” cases in which it
seems to never complete and pegs my CPU at 100% in the meantime. I
figured I might as well look into a character-by-character approach to
stripping HTML comments.
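
To make the idea concrete, here is a minimal Python sketch of a single-pass scanner. It is not the code discussed in this post; the name `strip_comments_naive` is made up for illustration, and dropping everything after an unterminated comment is an assumption. Because it only ever moves forward through the input, it cannot backtrack the way a regular expression can.

```python
def strip_comments_naive(html: str) -> str:
    """Remove every <!-- ... --> span in one left-to-right pass."""
    out = []
    i = 0
    while i < len(html):
        if html.startswith("<!--", i):
            # Jump straight to the end of the comment.
            end = html.find("-->", i + 4)
            if end == -1:
                # Assumption: an unterminated comment swallows the rest.
                break
            i = end + 3
        else:
            out.append(html[i])
            i += 1
    return "".join(out)
```

This version knows nothing about tags or attribute values, so it removes anything that looks like a comment wherever it appears.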

It sounds easy at first, and my first attempt was roughly 34 lines of
procedural style code. But then I started digging into the edge cases.
Take a look at this:

```html
<p title="<!-- this is a comment-->">Test 1</p>
```

Should I strip that comment within the attribute value or not?
Technically, this isn’t valid HTML since the first angle bracket within
the attribute value should be encoded. However, the three browsers I
checked (IE 8, FF3, Google Chrome) all honor this markup and render the
following.

![funky
comment](http://haacked.com/images/haacked_com/WindowsLiveWriter/TheLandmineofParsingHTMLandStrippingHTML_E73B/funky-comment_3.png "funky comment")

Notice that when I put the mouse over “Test 1”, the browser rendered
the value of the *title* attribute as a tooltip. That’s not even the
funkiest case. Check out this bit, in which my comment is an unquoted
attribute value. Ugly!

```html
<p title=<!this-comment>Test 2</p>
```

Still, the browsers dutifully render it:

![funkier-comment](http://haacked.com/images/haacked_com/WindowsLiveWriter/TheLandmineofParsingHTMLandStrippingHTML_E73B/funkier-comment_3.png "funkier-comment") 

At this point, it might seem like I’m spending too much time worrying
about crazy edge cases, which is probably true. Should I simply strip
these comments even if they happen to be within attribute values, because
they’re technically invalid? It worries me a bit to impose a
different behavior than the browser does.

Just thinking out loud here, but what if the user can specify a style
attribute (bad idea) for an element and they enter:

`<!>color: expression(alert('test'))`

Which, fully rendered, yields:
`<p style="<!>color: expression(alert('test'))">`

If we strip out the comment, then suddenly, the style attribute might
lend itself to an [attribute based XSS
attack](http://jeremiahgrossman.blogspot.com/2007/07/attribute-based-cross-site-scripting.html "Attribute Based XSS").
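
A tiny Python sketch shows what the stripping step does to that markup. The helper `strip_empty_comments` is hypothetical and deliberately naive; it only illustrates the transformation and makes no claim about how any browser treats either string.

```python
def strip_empty_comments(markup: str) -> str:
    """Deliberately naive: delete every '<!>' wherever it appears."""
    return markup.replace("<!>", "")


before = """<p style="<!>color: expression(alert('test'))">"""
after = strip_empty_comments(before)

print(before)
print(after)
```

After stripping, the attribute value begins directly with `color: expression(...)`, which is exactly the change in meaning that makes stripping inside attribute values risky.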

I tried this on the three browsers I mentioned and nothing bad happened,
so maybe it’s a non-issue. But I figured it would probably make sense to
strip HTML comments only in the cases where the browser does. So I
decided not to strip any comments within an HTML tag, which means I have
to identify HTML tags. That starts to get a bit ugly, as \<foo \> is
assumed to be an HTML tag and not displayed, while \<çoo /\> is just
content and is displayed.
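
Here is a rough Python sketch of that rule, under some loudly stated assumptions: a tag is anything starting with `<` or `</` followed by an ASCII letter, a tag ends at the first `>` outside a quoted attribute value, and only `<!-- ... -->` and `<!>` count as comments. All function names are made up for this example, and this is not the code from this post.

```python
import string

ASCII_LETTERS = set(string.ascii_letters)


def _starts_tag(html: str, i: int) -> bool:
    """True if html[i:] looks like '<x' or '</x' with x an ASCII letter."""
    j = i + 1
    if j < len(html) and html[j] == "/":
        j += 1
    return j < len(html) and html[j] in ASCII_LETTERS


def _tag_end(html: str, start: int) -> int:
    """Index just past the '>' that closes the tag opening at `start`.

    Assumption: any quote character inside the tag opens a quoted value,
    and a '>' inside a quoted value does not close the tag.
    """
    quote = None
    i = start + 1
    while i < len(html):
        ch = html[i]
        if quote:
            if ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
        elif ch == ">":
            return i + 1
        i += 1
    return len(html)  # unterminated tag: treat the rest as tag text


def strip_comments_outside_tags(html: str) -> str:
    """Strip comments from content, but copy tags through untouched."""
    out = []
    i, n = 0, len(html)
    while i < n:
        if html.startswith("<!--", i):
            end = html.find("-->", i + 4)
            i = n if end == -1 else end + 3
        elif html.startswith("<!>", i):
            i += 3
        elif html[i] == "<" and _starts_tag(html, i):
            end = _tag_end(html, i)
            out.append(html[i:end])
            i = end
        else:
            out.append(html[i])
            i += 1
    return "".join(out)
```

With these assumptions, the two `title` examples earlier in the post pass through unchanged, while a comment sitting between tags is removed. Real browsers apply more rules than this when tokenizing a tag, so treat it as a starting point only.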

Before I show the code, I should clarify something. I’ve been a bit
imprecise here. Technically, a comment starts with `--`, but
I’ve referred to markup such as `<!>` as being a comment. Strictly
speaking it’s not, but it behaves like one in the sense that the browser DOM
recognizes it as such. With HTML you can have multiple comments between
the \<! and the \> delimiters according to [section 3.2.5 of RFC
1866](http://www.freesoft.org/CIE/RFC/1866/15.htm "Section 3.2.5 RFC 1866").

>     3.2.5. Comments
>
>        To include comments in an HTML document, use a comment declaration. A
>        comment declaration consists of `<!' followed by zero or more
>        comments followed by `>'. Each comment starts with `--' and includes
>        all text up to and including the next occurrence of `--'. In a
>        comment declaration, white space is allowed after each comment, but
>        not before the first comment.  The entire comment declaration is
>        ignored.
>
>           NOTE - Some historical HTML implementations incorrectly consider
>           any `>' character to be the termination of a comment.
>
>        For example:
>
>         <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
>         <HEAD>
>         <TITLE>HTML Comment Example</TITLE>
>         <!-- Id: html-sgml.sgm,v 1.5 1995/05/26 21:29:50 connolly Exp  -->
>         <!-- another -- -- comment -->
>         <!>
>         </HEAD>
>         <BODY>
>         <p> <!- not a comment, just regular old data characters ->
>         
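
The grammar in that section is small enough to sketch directly. The Python below is one reading of the quoted text, not a reference implementation: `<!`, then zero or more comments that each run from `--` to the next `--`, optional white space after each comment, then `>`. The function name and the convention of returning -1 on failure are assumptions made for the example.

```python
def match_comment_declaration(text: str, start: int = 0) -> int:
    """Return the index just past a comment declaration at `start`, or -1."""
    if not text.startswith("<!", start):
        return -1
    i = start + 2
    # Zero or more comments, each delimited by a pair of "--".
    while text.startswith("--", i):
        close = text.find("--", i + 2)
        if close == -1:
            return -1  # comment never closed
        i = close + 2
        # White space is allowed after each comment, not before the first.
        while i < len(text) and text[i].isspace():
            i += 1
    if i < len(text) and text[i] == ">":
        return i + 1
    return -1
```

Run against the examples in the RFC excerpt, this accepts `<!>` and `<!-- another -- -- comment -->` as whole declarations and rejects `<!- not a comment, just regular old data characters ->` as well as the `<!DOCTYPE ...>` line, which is a declaration but not a comment declaration.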

The code I wrote today was straight up old school procedural code with
no attempt to make it modular, maintainable, object oriented, etc… I
posted it [to refactormycode.com
here](http://refactormycode.com/codes/597-strip-html-comments "Refactor My Code")
with the unit tests I defined.

In the end, I might not use this code as I realized later that what I
really should be doing in the particular scenario I have is simply
stripping all HTML tags and comments. In any case, I hope to never have
to parse HTML again. ;)