JSON / ActiveSupport Rails gotcha: Avoid XSS exploits when passing HTML in JSON

Ran into a tricky situation today -- we're working on the ability to support JavaScript on Posterous blogs. One problem we saw with the Theme Editor was that </script> tags were breaking our JSON. Browsers see a </script> tag and interpret it as the end of the entire script block, rather than as text inside a JSON string. When you're dealing with user-generated content, that also opens your site up to a pretty serious XSS attack.

ActiveSupport has been modified to translate < and > into their Unicode escapes (\u003C and \u003E), which avoids this problem. However, if you, like many people, require the JSON gem, the ActiveSupport to_json implementation is replaced entirely.

The simple fix -- make your own String subclass that carries the necessary to_json method. Be sure to use the new class in place of the standard String wherever you want angle brackets escaped.

Here's the code:
# This is needed so that we can still access the original ActiveSupport version of JSON encoding.
# The JSON gem is faster but does not support automatic unicode conversion for < and >, which can cause
# problems for </script> in JSON output (the browser interprets it as exiting the script area, which results in an XSS exploit).
#
# e.g. EscapableJsonString.new('<no_xss>').to_json
# => "\u003Cno_xss\u003E"
#
class EscapableJsonString < String
  def to_json(options = nil) #:nodoc:
    # Escape every character matched by ActiveSupport's escape_regex using its ESCAPED_CHARS table.
    json = '"' + gsub(ActiveSupport::JSON::Encoding.escape_regex) { |s|
      ActiveSupport::JSON::Encoding::ESCAPED_CHARS[s]
    }
    json.force_encoding('ascii-8bit') if respond_to?(:force_encoding)
    # Rewrite each run of multi-byte UTF-8 sequences as \uXXXX escapes.
    json.gsub(/([\xC0-\xDF][\x80-\xBF]|
               [\xE0-\xEF][\x80-\xBF]{2}|
               [\xF0-\xF7][\x80-\xBF]{3})+/nx) { |s|
      s.unpack("U*").pack("n*").unpack("H*")[0].gsub(/.{4}/, '\\\\u\&')
    } + '"'
  end
end
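
If you'd rather not lean on ActiveSupport internals, the same idea can be sketched with nothing but the standard json library. This is a rough sketch, not what ActiveSupport does under the hood: script_safe_json is a made-up helper name, and it works on the whole generated document rather than on a String subclass. That is safe because <, > and & can only ever show up inside string values in JSON output.

```ruby
require 'json'

# Hypothetical helper (not part of Rails or the JSON gem): generate JSON as
# usual, then rewrite <, > and & as \uXXXX escapes. Those characters are never
# structural in JSON, so replacing them everywhere cannot change its meaning.
def script_safe_json(value)
  JSON.generate(value).gsub(/[<>&]/) { |c| format('\\u%04X', c.ord) }
end

payload = { 'html' => '</script><script>alert(1)</script>' }

raw  = JSON.generate(payload)     # still contains a literal </script>
safe = script_safe_json(payload)  # angle brackets are now \u003C and \u003E

puts raw
puts safe

# Any JSON parser turns the escapes back into the original characters.
puts JSON.parse(safe) == payload
```

The escaped output is still valid JSON, so nothing on the JavaScript side has to change; the browser just never sees a literal </script> inside the script block.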

Scaling Facebook vs Scaling Digg: It's a question of disk vs RAM

Facebook takes a Pull on Demand approach. To recreate a page or a display fragment they run the complete query. To find out if one of your friends has added a new favorite band Facebook actually queries all your friends to find what's new. They can get away with this, but only because of their awesome infrastructure.

But if you've ever wondered why Facebook has a 5,000 user limit on the number of friends, this is why. At a certain point it's hard to make Pull on Demand scale.

Another approach to find out what's new is the Push on Change model. In this model when a user makes a change it is pushed out to all the relevant users and the changes (in some form) are stored with each user. So when a user wants to view their updates all they need to access is their own account data. There's no need to poll all their friends for changes.

Really interesting article at High Scalability on ways to approach scaling your data store.

We use push on change as well, particularly for your reading list subscriptions. To be honest, it's cheaper. You use disk space to pre-compute things that would be expensive to ask the database for repeatedly. It allows you to just add disk -- even though disk is orders of magnitude slower than RAM.
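
To make the difference concrete, here's a toy in-memory sketch of both models. It is purely illustrative: FeedStore and all of its methods are made-up names, plain hashes stand in for the database and the disk, and it says nothing about how Facebook, Digg, or Posterous actually implement this.

```ruby
# Hypothetical toy model; hashes stand in for real storage.
class FeedStore
  def initialize
    @followers = Hash.new { |h, k| h[k] = [] } # author => users following them
    @following = Hash.new { |h, k| h[k] = [] } # user   => authors they follow
    @posts     = Hash.new { |h, k| h[k] = [] } # author => that author's posts
    @inbox     = Hash.new { |h, k| h[k] = [] } # user   => precomputed feed
  end

  def follow(user, author)
    @followers[author] << user
    @following[user] << author
  end

  def publish(author, text)
    @posts[author] << text
    # Push on Change: pay at write time by copying the update to every follower.
    @followers[author].each { |user| @inbox[user] << [author, text] }
  end

  # Pull on Demand: pay at read time by querying every followed author.
  def pull_feed(user)
    @following[user].flat_map { |author| @posts[author].map { |text| [author, text] } }
  end

  # Push on Change read path: a single lookup of the user's own data.
  def pushed_feed(user)
    @inbox[user]
  end
end

store = FeedStore.new
store.follow('alice', 'bob')
store.follow('alice', 'carol')
store.publish('bob', 'new favorite band')
store.publish('carol', 'new photo')

p store.pull_feed('alice')   # touches every author alice follows
p store.pushed_feed('alice') # touches only alice's own inbox
```

The trade is visible in publish: every update gets stored once per follower, which burns storage, but reading a feed no longer depends on how many people you follow.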

The Facebook approach is *really hard* to get right. It's costly because so much info just has to live in RAM, and could be one reason why it's much harder for Facebook to reach profitability than most other sites. If you have to add RAM to keep all user data in cache, that's a lot of hardware to keep going.

But when it comes to realtime, Facebook is as realtime as they come. They are some real engineering badasses.

It makes no sense at all that John Resig's JS book underperforms that other turd of a JS book everyone buys

Tracking my ranking over the past year, [Pro Javascript Techniques has] been consistently in the 10-20,000 range, with occasional dips into the < 10,000 range. JavaScript: The Definitive Guide is always < 5,000 (for comparison).
--John Resig via ejohn.org

Really? That is some really really weak sauce. Pro Javascript Techniques rocks. It's a great book, and totally useful.

JavaScript: The Definitive Guide, on the other hand, is a piece of turd. It routinely assumes you *already know* JavaScript, and its prose is almost unreadable.

I wonder why Resig's awesome book is completely dominated by sales of the abysmal O'Reilly book. Probably for two reasons: a) the title (one is way more universal and likely to be bought by aspirational JS developers) and b) O'Reilly doesn't usually put out crap, so it's very surprising when it does.