---
layout: post
title: "The Landmine of Parsing HTML and Stripping HTML Comments"
date: 2008-11-10 -0800
comments: true
disqus_identifier: 18551
categories: [code, regex]
---

A while ago I wrote a blog post about how painful it is to [properly
parse an email
address](http://haacked.com/archive/2007/08/21/i-knew-how-to-validate-an-email-address-until-i.aspx "Validating an email address").
This post is kind of like that, except that this time, I take on HTML.

I've written about [parsing HTML with a regular
expression](http://haacked.com/archive/2005/04/22/Matching_HTML_With_Regex.aspx "Matching HTML with regular expressions")
in the past and pointed out that it's extremely tricky and probably not
a good idea to use regular expressions in this case. In this post, I
want to strip out HTML comments. Why?

I had some code that uses a regular expression to strip comments from
HTML, but found one of those feared pathological cases in which it
seems to never complete and pegs my CPU at 100% in the meantime. I
figured I might as well look into a character-by-character
approach to stripping HTML comments.

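For reference, a naive character-by-character pass might look something like the Python sketch below. This is not the code from this post: the function name `strip_comments_naive` is made up for illustration, and it only treats `<!--` ... `-->` as a comment, with no awareness of where that text appears.

```python
def strip_comments_naive(html: str) -> str:
    """Remove every <!-- ... --> span with a single left-to-right scan.

    Hypothetical helper for illustration only: it knows nothing about tags
    or attribute values, so it also strips comment-like text inside them.
    """
    out = []
    i = 0
    n = len(html)
    while i < n:
        if html.startswith("<!--", i):
            end = html.find("-->", i + 4)
            if end == -1:
                # Unterminated comment: drop the rest of the input.
                break
            i = end + 3  # skip past the closing "-->"
        else:
            out.append(html[i])
            i += 1
    return "".join(out)


print(strip_comments_naive("<p>before<!-- note -->after</p>"))
# <p>beforeafter</p>
```

Because the scan only ever moves forward, there is nothing to backtrack over, which is the appeal compared to a regular expression.
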
It sounds easy at first, and my first attempt was roughly 34 lines of
procedural-style code. But then I started digging into the edge cases.
Take a look at this:

```html
<p title="<!-- this is a comment-->">Test 1</p>
```

Should I strip that comment within the attribute value or not?
Technically, this isn't valid HTML since the first angle bracket within
the attribute value should be encoded. However, the three browsers I
checked (IE 8, FF3, Google Chrome) all honor this markup and render the
following.

![funky comment](http://haacked.com/images/haacked_com/WindowsLiveWriter/TheLandmineofParsingHTMLandStrippingHTML_E73B/funky-comment_3.png "funky comment")

Notice that when I put the mouse over "Test 1", the browser rendered
the value of the *title* attribute as a tooltip. That's not even the
funkiest case. Check out this bit, in which my comment is an unquoted
attribute value. Ugly!

```html
<p title=<!this-comment>Test 2</p>
```

Still, the browsers dutifully render it:

![funkier-comment](http://haacked.com/images/haacked_com/WindowsLiveWriter/TheLandmineofParsingHTMLandStrippingHTML_E73B/funkier-comment_3.png "funkier-comment")

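One way to see how a tolerant parser reads the first example is to feed it to Python's standard `html.parser` module. This is only a sketch: the `Recorder` class and its field names are invented for illustration, and Python's parser is not guaranteed to agree with any particular browser.

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Hypothetical helper: records start tags and comments as they are parsed."""

    def __init__(self):
        super().__init__()
        self.start_tags = []  # (tag, attrs) pairs
        self.comments = []    # text of each comment node

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, attrs))

    def handle_comment(self, data):
        self.comments.append(data)


recorder = Recorder()
recorder.feed('<p title="<!-- this is a comment-->">Test 1</p>')
recorder.close()

print(recorder.start_tags)  # the comment-like text shows up as the title value
print(recorder.comments)    # no comment node is reported
```

In other words, a parser that tracks quoted attribute values sees an attribute here, not a comment, which matches the tooltip behavior in the screenshot.
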
At this point, it might seem like I'm spending too much time worrying
about crazy edge cases, which is probably true. Should I simply strip
these comments even if they happen to be within attribute values, since
they're technically invalid? It worries me a bit to impose a
different behavior than the browser does.

Just thinking out loud here, but what if the user can specify a style
attribute (bad idea) for an element and they enter:

`<!>color: expression(alert('test'))`

Which fully rendered yields:

`<p style="<!>color: expression(alert('test'))">`

If we strip out the comment, then suddenly, the style attribute might
lend itself to an [attribute based XSS
attack](http://jeremiahgrossman.blogspot.com/2007/07/attribute-based-cross-site-scripting.html "Attribute Based XSS").
I tried this on the three browsers I mentioned and nothing bad happened,
so maybe it's a non-issue. But I figured it would probably make sense to
strip HTML comments only in the cases where the browser does. So I
decided not to strip any comments within an HTML tag, which means I have
to identify HTML tags. That starts to get a bit ugly, as \<foo \> is
assumed to be an HTML tag and not displayed, while \<çoo /\> is just
content and displayed.

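To make the "leave tags alone" rule concrete, here is a minimal Python sketch. It is not the code I posted; the names (`strip_comments_outside_tags` and the two helpers) are hypothetical, and the tag test is a simplifying assumption: a tag starts with `<` or `</` followed by an ASCII letter, which is why `<çoo />` falls through as plain content. It also only handles the `<!-- ... -->` form.

```python
import string

ASCII_LETTERS = set(string.ascii_letters)


def _is_tag_start(html: str, i: int) -> bool:
    """Assumption: a tag opens with '<' or '</' followed by an ASCII letter."""
    j = i + 1
    if j < len(html) and html[j] == "/":
        j += 1
    return j < len(html) and html[j] in ASCII_LETTERS


def _skip_tag(html: str, i: int) -> int:
    """Return the index just past the tag starting at i, honoring quoted values."""
    quote = None
    j = i + 1
    while j < len(html):
        ch = html[j]
        if quote:
            if ch == quote:
                quote = None  # closing quote of an attribute value
        elif ch in "\"'":
            quote = ch  # opening quote: ignore '>' until it closes
        elif ch == ">":
            return j + 1
        j += 1
    return len(html)  # unterminated tag: treat the rest as part of it


def strip_comments_outside_tags(html: str) -> str:
    out = []
    i = 0
    while i < len(html):
        if html.startswith("<!--", i):
            end = html.find("-->", i + 4)
            i = len(html) if end == -1 else end + 3
        elif html[i] == "<" and _is_tag_start(html, i):
            end = _skip_tag(html, i)
            out.append(html[i:end])  # copy the tag verbatim, comments and all
            i = end
        else:
            out.append(html[i])
            i += 1
    return "".join(out)


print(strip_comments_outside_tags('<p title="<!-- kept -->">Hi<!-- gone --></p>'))
# <p title="<!-- kept -->">Hi</p>
```

The comment-like text inside the quoted `title` value survives, while the comment in the element content is removed.
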
Before I show the code, I should clarify something. I've been a bit
imprecise here. Technically, a comment starts with the `--` characters, but
I've referred to markup such as `<!>` as being a comment. Technically
it's not, but it behaves like one in the sense that the browser DOM
recognizes it as such. With HTML you can have multiple comments between
the \<! and the \> delimiters according to [section 3.2.5 of RFC
1866](http://www.freesoft.org/CIE/RFC/1866/15.htm "Section 3.2.5 RFC 1866").

> 3.2.5. Comments
>
> To include comments in an HTML document, use a comment declaration. A
> comment declaration consists of `<!' followed by zero or more
> comments followed by `>'. Each comment starts with `--' and includes
> all text up to and including the next occurrence of `--'. In a
> comment declaration, white space is allowed after each comment, but
> not before the first comment. The entire comment declaration is
> ignored.
>
> NOTE - Some historical HTML implementations incorrectly consider
> any `>' character to be the termination of a comment.
>
> For example:
>
>     <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
>     <HEAD>
>     <TITLE>HTML Comment Example</TITLE>
>     <!-- Id: html-sgml.sgm,v 1.5 1995/05/26 21:29:50 connolly Exp -->
>     <!-- another -- -- comment -->
>     <!>
>     </HEAD>
>     <BODY>
>     <p> <!- not a comment, just regular old data characters ->

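The grammar in that quote is small enough to sketch directly. The Python function below is a hypothetical illustration (the name `comment_declaration_end` is mine, not from the RFC or from my posted code) that checks whether a comment declaration starts at a given position and, if so, returns where it ends. It follows the RFC wording rather than what browsers actually do.

```python
from typing import Optional


def comment_declaration_end(text: str, start: int = 0) -> Optional[int]:
    """Return the index just past a comment declaration at `start`, or None.

    Follows the quoted grammar: "<!" then zero or more "--...--" comments,
    each optionally followed by white space, then ">".
    """
    if not text.startswith("<!", start):
        return None
    i = start + 2
    while text.startswith("--", i):
        close = text.find("--", i + 2)  # the next "--" ends this comment
        if close == -1:
            return None  # unterminated comment
        i = close + 2
        while i < len(text) and text[i].isspace():
            i += 1  # white space is allowed after each comment
    if i < len(text) and text[i] == ">":
        return i + 1
    return None


for sample in ("<!>", "<!-- another -- -- comment -->", "<!- not a comment ->"):
    print(repr(sample), comment_declaration_end(sample))
```

The first two samples are accepted in full, while `<!- not a comment ->` is rejected because a single `-` does not open a comment.
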
The code I wrote today was straight-up old-school procedural code with
no attempt to make it modular, maintainable, object-oriented, etc. I
posted it [to refactormycode.com
here](http://refactormycode.com/codes/597-strip-html-comments "Refactor My Code")
with the unit tests I defined.

In the end, I might not use this code, as I realized later that what I
really should be doing in the particular scenario I have is simply
stripping all HTML tags and comments. In any case, I hope to never have
to parse HTML again. ;)