Lessons on a Thursday night: Using Hpricot with the tag vomit that is Microsoft Word HTML

Yes, Microsoft Word is a disaster. Its markup is easily the ugliest ever created, and it is so widely used that it makes me weep.
 
Here's a small snippet of terribleness:

<p style='mso-margin-top-alt:12.0pt;margin-right:0cm;margin-bottom:12.0pt; margin-left:0cm;line-height:14.25pt'><span style='font-size:10.0pt;font-family:"Lucida Grande"'>
Microsoft word ruins my life hardcore.</span></p>

Unfortunately, the biggest problem with this snippet of code is that it uses double quotes inside single quotes, e.g. 'font-family:"Lucida Grande", "sans-serif"'. As far as I can tell this is actually to spec, since HTML allows double quotes inside a single-quoted attribute value. Still, who the heck nests quotes like that? Well, Microsoft Word does.
 
Hpricot actually chokes on this. Here's what I learned: before ever passing Microsoft Word/Outlook-generated HTML to Hpricot, be sure to run this regex on the markup so that we kill these double quotes and avoid big-time disasters.
 

   output = input.gsub(/style='[^']*(font-family:)[^']*'/mi) { |sub| sub.gsub(/"/, '') }
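To see what the regex actually does, here is a minimal sketch that runs it over the snippet from above. The helper name strip_word_font_quotes is my own invention for illustration, not anything Hpricot provides, and the sketch stops short of the parsing step so it runs with nothing but plain Ruby.

```ruby
# Hypothetical helper (the name is an assumption, not part of Hpricot):
# finds every single-quoted style attribute that mentions font-family
# and deletes the double quotes inside it.
def strip_word_font_quotes(html)
  html.gsub(/style='[^']*(font-family:)[^']*'/mi) { |sub| sub.gsub(/"/, '') }
end

# The Word-generated sample from this post.
word_html = <<~HTML
  <p style='mso-margin-top-alt:12.0pt;margin-right:0cm;margin-bottom:12.0pt; margin-left:0cm;line-height:14.25pt'><span style='font-size:10.0pt;font-family:"Lucida Grande"'>
  Microsoft word ruins my life hardcore.</span></p>
HTML

cleaned = strip_word_font_quotes(word_html)
puts cleaned
# The span's attribute becomes:
#   style='font-size:10.0pt;font-family:Lucida Grande'
# The <p> style has no font-family, so it is left alone.
```

Only style attributes containing font-family are touched, so double-quoted attributes elsewhere in the document (an href, say) pass through unchanged. The cleaned string is what you would then hand to the parser.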
Hopefully through the magic of Google, this will save somebody some time. I know I searched far and wide and couldn't really find even one article on cleaning up Microsoft Word HTML, what with its disastrous mso tags and MsoNormal classes strewn all over the place.
 
It's still tag vomit, but at least it's tag vomit that won't end the life of your Hpricot parser.
4 responses
mso-margin top? margin-left:0cm? yeahh ok Microsoft
What if you ran it through HTMLtidy first?