A story on finding an elusive security bug and managing it responsibly

Now that all the responsible disclosure processes have been followed through, I’d like to tell everyone a story of my very bad week last week. Don’t worry, it has a happy ending.

 

Part 1: Exposition

On May 5th we got a support request from a user who observed confusing behaviour in one of our systems. Our support staff immediately escalated it to me and my team sprung into action for what ended up being a 48-hour rollercoaster ride that ended with us reporting upstream to Django a security bug.

The bug, in a nutshell, is that when the following conditions lines up, a system could end up serving a request to one user that was meant for another:

– You are authenticating requests with cookies, OAuth or other authentication mechanisms
– The user is using any version of Internet Explorer or Chromeframe (to be more precise, anything with “MSIE” in the request user agent)
– You (or an ISP in the middle) are caching requests between Django and the internet (except Varnish’s default configuration, for reasons we’ll get to)
– You are serving the same URL with different content to different users

We rarely saw this combination of conditions because users of services provided by Canonical generally have a bias towards not using Internet Explorer, as you’d expect from a company who develops the world’s most used Linux distribution.

 

Part 2: Rising Action

Now, one may think that the bug is obvious, and wonder how it went unnoticed since 2008, but this really was one was one of those elusive “ninja-bugs” you hear about on the Internet and it took us quite a bit of effort to track it down.

In debugging situations such as this, the first step is generally to figure out how to reproduce the bug. In fact, figuring out how to reproduce it is often the lion’s share of the effort of fixing it.  However, no matter how much we tried we could not reproduce it. No matter what we changed, we always got back the right request. This was good, because it ruled out a widespread problem in our systems, but did not get us closer to figuring out the problem.

Putting aside reproducing it for a while, we then moved on to combing very carefully through our code, trying to find any hints of what could be causing this. Several of us looked at it with fresh eyes so we wouldn’t be tainted by having developed or reviewed the code, but we all still came up empty each and every time. Our code seemed perfectly correct.

We then went on to a close examination of all related requests to get new clues to where the problem was hiding. But we had a big challenge with this. As developers we don’t get access to any production information that could identify people. This is good for user privacy, of course, but made it hard to produce useful logs. We invested some effort to work around this while maintaining user privacy by creating a way to anonymise the logs in a way that would still let us find patterns in them. This effort turned up the first real clue.

We use Squid to cache data for each user, so that when they re-request the same data, it’s queued up right in memory and can be quickly served to them without having to recreate the data from the databases and other services. In those anonymized  Squid logs, we saw cookie-authenticated requests that didn’t contain an HTTP Vary header at all, where we expected it to have at the very least “Vary: Cookie” to ensure Squid would only serve the correct content all the time. So we then knew what was happening, but not why. We immediately pulled Squid out of the middle to stop this from happening.

Why was Squid not logging Vary headers? There were many possible culprits for this, so we got a *lot* of people were involved searching for the problem. We combed through everything in our frontend stack (Apache, Haproxy and Squid) that could sometimes remove Vary headers.

This was made all the harder because we had not yet fully Juju charmed every service, so could not easily access all configurations and test theories locally. Sometimes technical debt really gets expensive!

After this exhaustive search, we determined that nothing our code removed headers. So we started following the code up to Django middlewares, and went as far as logging the exact headers Django was sending out at the last middleware layer. Still nothing.

 

Part 3: The Climax

Until we got a break. Logs were still being generated, and eventually a pattern emerged. All the initial requests that had no Vary headers seemed for the most part to be from Internet Explorer. It didn’t make sense that a browser could remove headers that were returned from a server, but knowing this took us to the right place in the Django code, and because Django is open source, there was no friction in inspecting it deeply.  That’s when we saw it.

In a function called fix_IE_for_vary, we saw the offending line of code.

del response[‘Vary’]

We finally found the cause.

It turns out IE 6 and 7 didn’t have the HTTP Vary header implemented fully, so there’s a workaround in Django to remove it for any content that isn’t html or plain text. In hindsight, if Django would of implemented this instead as a middleware, even if default, it would have been more likely that this would have been revised earlier. Hindsight is always 20/20 though, and it easy to sit back and theorise on how things should have been done.

So if you’ve been serving any data that wasn’t html or plain text with a caching layer in the middle that implements Vary header management to-spec (Varnish doesn’t trust it by default, and checks the cookie in the request anyway), you may have improperly returned a request.

Newer versions if Internet Explorer have since fixed this, but who knew in 2008 IE 9 would come 3 years later?

 

Part 4: Falling Action

We immediately applied a temporary fix to all our running Django instances in Canonical and involved our security team to follow standard responsible disclosure processes. The Canonical security team was now in the driving seat and worked to assign a CVE number and email the Django security contact with details on the bug, how to reproduce it and links to the specific code in the Django tree.

The Django team immediately and professionally acknowledged the bug and began researching possible solutions as well as any other parts of the code where this scenario could occur. There was continuous communication among our teams for the next few days while we agreed on lead times for distributions to receive and prepare the security fix,

 

Part 5: Resolution

I can’t highlight enough how important it is to follow these well-established processes to make sure we keep the Internet at large a generally safe place.
To summarise, if you’re running Django, please update to the latest security release as quickly as possible, and disable any internal caching until then to minimise the chances of hitting this bug.

If you’re running squid and want to check if you could be affected, here’s a small python script to run against your logs we put together you can use as a base, you may need to tweak it based on your log format. Be sure to run it only against cookie-authenticated URLs, otherwise you will hit a lot of false positives.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: