• How we hire software developers at ShipHero

    I’ve been hiring folks in software development for about 20 years now, 15 of them remotely, so it’s no surprise I’ve accumulated some opinions over time.

    I wanted to share a bit of what we’ve done at ShipHero over the last 3 years, where we’ve experimented often with different approaches that have succeeded and failed in different ways. The more time goes by, the clearer it becomes to me that there’s no one-size-fits-all approach, mostly because companies, which are just groups of humans, work in surprisingly different ways.

    I do strongly believe that there’s one thing that will generally work better than average in healthy environments: testing for collaboration & communication skills. I don’t think that’s a surprising revelation to anyone who’s been building teams for a while; the reason interview processes so rarely take it heavily into account, I think, is that it takes a lot of time and energy to find out if someone’s good at collaborating.

    It did take me a long time to realize just how easy it is to test for technical skills compared to skills like collaborating with others, and how technical skills can be improved more quickly than a person’s understanding of how to work well with others. I think you can probably get most of what you need to know from a person’s technical experience in 30 minutes, but everything else just requires way more time to figure out. And the hard truth is, a person who’s able to work well within a team but lacks technical expertise has a much better chance of thriving over time than someone who’s brilliant but unable to build things with others.

    What we tried

    Once we agreed internally on that premise we did a multi-month experiment and changed the way we hired in a pretty drastic way, in an attempt to emphasize collaboration skills over everything else. We mapped out a week-long process:

    • Introduced a dozen questions into the application process and made submitting a CV optional. We didn’t look at CVs or names at all during the pre-selection process.
    • Because we were asking people to spend a lot of time on our process, we committed to paying people $500 if they went through the process regardless of whether they were offered a role or not. We wanted people to feel like we valued their time.
    • A brief introduction call to vet basic things like English fluency, plus a very shallow sanity check that the information they submitted somewhat reflected their experience.
    • A project to work on together that resembles the type of work done in our day-to-day. We expected the project to take a couple of days, with about 6-12 hours of actual work, and to require interaction over GitHub, Slack and email to ask questions, with a developer on our side collaborating on a few snippets of code to make the whole thing work. The goal here was to see how someone works asynchronously at their own pace while still being able to communicate about what they’re doing and why.
    • If the first project showed good collaboration and validated their technical skills, we moved on to a second, shorter and easier project that we’d try and do more or less in real-time. We’d build something small on top of the previous project and schedule a time range to work together on it. This could be over Slack, on a call, pairing, or a mix, whatever the candidate is comfortable with. The goal isn’t to see how someone works under stress (an interview process is a terrible place to test for that), but rather to validate that they’re not farming out the test to someone else and to see the person collaborate in a more real-time environment.
    • The last step is meeting the team they’d be working with, giving them an opportunity to see if the group of people they’d spend most of their time with is what they’re looking for. This usually goes both ways, but by the time someone’s spent that amount of time with us we’re fairly certain they’ll fit in well with the team.

    It took some time to get the process set up, and it took a lot of our developers’ time to support it. We loved it. The people who went through it and got to the end told us they enjoyed it. But very few people applied once they understood how much time they’d have to invest in our process, many (half or so) dropped off mid-way through, and in general people with a lot of experience didn’t apply.

    We were going through a heavy growth phase where we needed as many good developers as we could find, and while we felt our new process was much better, it attracted fewer people and wouldn’t allow us to hire at the rate we needed at the time. So we stripped the process down to the smallest piece we felt would still capture some of what was important to us while being less demanding on candidates’ time.

    What we’re doing today

    We provide some flexibility to each hiring manager in how they choose to run their interviewing process, but our general approach today is:

    • We kept the list of questions instead of CVs for applicants; it turned out to tell us much more about candidates than their CVs ever did. We turned what we used to look for in CVs into questions and let people answer for themselves upfront, instead of us trying to extrapolate and apply our own biases. The questions are listed at the end if you’re curious.
    • The hiring manager will have an introduction call with people who seem promising, same as before, basic vetting and let them ask whatever they want to know about the company. We’ll also explain the hiring process so they know what it entails.
    • We’ll send them a small project that takes anywhere from 2-6 hours to complete; they can do it at their own pace, and we encourage them to ask as many questions as they’d like. We don’t try to test for anything more than basic technical skills here; it’s much more about problem-solving and communication skills. We’d still extend an offer to someone even if they got blocked a dozen times throughout the project. People often get nervous while looking for a job, and there’s no advantage in filtering those people out.
    • Once they send us back the project, we’ll spend some time reviewing it and set up a call to discuss it. This is usually what resembles a technical interview the most, but instead of trying to talk about abstract technical concepts we can discuss something we’re both familiar with.
    • If it all went well, we’ll extend an offer.
    • All the people involved are folks you’d be working with. We don’t farm off any part of the process to an internal recruiter to filter out people.

    The process ends up being two or three 30-60 minute calls, with the project in between. It works fine; we generally hire people who end up sticking around and being great additions to their teams.
    Ideally we’d still run the longer-form process: we felt we learned more about people, but maybe more importantly we saw people who we initially did not think would be good candidates come out way ahead of others. It felt like the smarter way to hire. We haven’t figured out a way to make the math work; maybe it’ll be easier once we’re a more well-known company and more people are willing to invest time in the interviewing process, or maybe eventually we’ll find a way to sell people on the idea.

    And of course, we’re still hiring!

    List of questions to apply for a development role

    • A Cover Letter
    • How many years have you been developing software? How many of those with Python?
    • Ideally, what do you want to do at work most of the time?
    • How many people were on the teams you worked in in the past? (Provide a range (e.g., 2-3 people, 8-10, etc.). We mostly want to understand if you’ve worked primarily on your own, in small teams, or in large teams.)
    • What is the longest stretch of time you’ve been at the same company? (Do you like to work at companies for long periods of time (3+ years)? Do you prefer shorter periods?)
    • What country do you live in? (We’re generally ok with people moving around, this is to understand where you live right now so we know what timezone & legal framework we can expect.)
    • What is the largest scale of usage you’ve had on an application you have actively developed? (# of users, # of requests/sec, any metric that produces some scale. A service with 1,000 daily users? 1,000,000? Few users but millions of API requests?)
    • When was the last time you changed your mind about something important? Why did you change your mind?
    • What was the hardest part of your current/last role? (What was challenging for you?)
    • What’s a challenging project or bug you worked on? Why was it challenging?
  • The 6 year experiment

    This is a story about trying to accomplish two unrelated things with one solution: making your child bilingual and avoiding a daily fight about screen time. As you might know, TV is highly addictive for young children and there are some guesses that it hinders some aspects of early development, but more importantly it disconnects you from each other at an age where you’re establishing what type of relationship you might have, or at least what the default might be. Regardless of what those long-term studies conclude, not having to fight a tablet for a bit of quality time with my daughter seemed like a desirable thing.

    Bilingualism was important to me for a few reasons. It’s what I grew up with, and I personally feel it’s often given me a huge advantage in life, professionally but even more so on a personal level. Being able to read and comprehend natively in more than one language opens up entire new worlds for you to absorb, learn from and enjoy. It also lets you form deep relationships with millions (or billions, depending on the language mix) of other human beings, which would otherwise be much more challenging across a language barrier. It’s hard to overstate just how massive an advantage it is, from personal experience. There are also dozens of long-term studies that link being bilingual with all sorts of great things, ranging from strengthened cognitive abilities, increased creativity and better multi-tasking to reduced risks and effects of dementia and Alzheimer’s.
    I have so many great friends spread out all over the world in a big part because I could communicate with them at a comfortable level to form a deeper relationship, it feels almost unfair 🙂

    So, it felt important, and it felt like something that was basically free to acquire at an early age with the right set of conditions, and increasingly hard as time went by.
    However, the conditions that led me to be bilingual, which were essentially growing up in foreign countries where I did not speak the local language and attending international schools, weren’t available to my daughter when she was born, and we weren’t keen on moving somewhere far away to enable this specific advantage.

    But how does limiting screen time without making it a daily battle and being bilingual intersect you ask, 3 paragraphs in?
    Well, when I was a kid growing up in communist Poland, TV was not always fun. Often there would just be some really old cartoons in Polish that I could not understand, and I remember sort of watching them but mostly trying to figure out what else to do that would be more entertaining, and switching to that as soon as something came up. In contrast, I remember whenever I could watch something I understood well, how my brain just lit up, and how I would fight to the death anyone (and often tried) who would pry it away from me.
    So here was my thought: what if any screen time my daughter had was boring enough that it would shave off most of the addictive aspect of it, and would still give her something good in return other than entertainment?

    I think you see where I’m going with this 😉

    That’s how the “mostly unlimited screen time, but only in English” rule came to be. It was a gamble. It took some scary turns for a while as I learned first-hand how incredibly damaging and evil YouTube is with little kids, and there was some course correction to slowly reduce and drop screen time as she got closer to bedtime.
    As the years went by, one of the two goals had been pretty clearly accomplished: there was little or no fighting about screen time with our daughter, and she would often go days without watching anything. She also seemed to spend a lot less time than many other kids glued to TVs and tablets. But more importantly, I was not having to fight about it all the time, which I absolutely dreaded having to do.
    This, I feel and felt early on, was a huge success.

    The bilingualism took some more time, though. There were signs here and there that English was easy for her to pick up; she would sometimes surprise you by understanding something she overheard, but there were lots of doubts about whether it was having any meaningful impact. We sometimes read her books in English, but bedtime stories are such an intimate moment that things sort of kept gravitating back to Spanish on their own.
    However, in what felt like out of the blue, one day she could just understand everything. She could also have conversations with a lot of confidence, even if it required spanglishing in a word or three. This was around age 5. I don’t know if it was the right age of maturity, or whether it was the change in school where there was a stronger focus on English and she made a friend who could only really speak English, but once it surfaced, it all came together pretty quickly.
    Friends and family that would often switch to English to say certain things that kids weren’t supposed to understand suddenly found that my daughter would understand them perfectly. She understood the plots of the cartoons and movies she would watch. You could read her books and she’d understand and enjoy them. It felt like a pretty big change overnight.
    Now, for me personally, being bilingual has one particularly interesting aspect: some thought processes are in one language, some are in another. I suspect that’s one of the reasons it has some cognitive advantages. So what made it seem like the second goal had been a real and lasting success was noticing that most of the time when she was playing on her own, it was all in English. Her toys talk to each other in English, they fight with each other in English and they go on adventures in English.
    I feel that’s a pretty big sign there are internal thought processes happening in a second language.
    It also made me realize how much of what kids watch on a screen actually influences how they play. If she spent a few days at someone else’s house where they watched TV in Spanish, often her toys would switch languages for a little while before going back.

    So, all in all I would say this has been a huge success that I’m very proud of and felt it was worth writing about and sharing. I learned a lot about the intersection of screen time and languages which I would not have guessed, at least in my very specific anecdotal experiment.
    I also have to say that my inspiration for trying out long-term things like this came from the wonderful Francis Lacoste, who taught his baby sign language without either of them being deaf, and had such an amazing story out of it that he wrote a book about it.

    Now, some smart readers will be thinking: there’s a problem with where this is heading. I know, I know. As her comprehension of the second language starts to match her first, screen time will become as attractive as if we hadn’t done any of this. I’ve already seen hints of it. She’s binging on specific cartoons more often, and she’ll sometimes prioritize watching TV over other things, which wasn’t the case before.
    I don’t have a plan for this next phase yet. Maybe the fact that screen time wasn’t such a major thing in her first 6 years will make it a bit less addictive?
    We’ll see where this takes us, maybe I’ll think of a new hack soon enough.

  • On well executed releases and remote teams

    After some blood, sweat and tears, we finally brought Stacksmith into the world, yay!

    It’s been a lengthy and intense process that started with putting together a team to be able to build the product in the first place, and taking Bitnami’s experience and some existing tooling to make the cloud more accessible to everyone. It’s been a good week.

    However, I learnt something I didn’t quite grasp before: if you find really good people, focus on the right things, scope projects to an achievable goal and execute well, releases lack the explosion of emotions associated with big milestones. Compounded with the fact that the team that built the product all work remotely, launch day was pretty much uneventful.
    I’m very proud of what we’ve built, and we did it with a lot of care and attention. We agonized over trade-offs during the development process, did load testing for capacity planning, added metrics to get hints as to when the user experience would start to suffer, and did CI/CD from day one so deployments were well guarded against breaking changes and did not affect the user experience. We did enough, but not too much. We rallied the whole company a few weeks before release to try and break the service, asked people who hadn’t used it before to go through the whole process and document each step, and tried doing new and unexpected things with the product. The website was updated! The marketing messaging and material were discussed and tested, analysts were briefed, email campaigns were set up. All the basic checklists were completed. It’s uncommon to be able to align all the teams, timelines and incentives.
    What I learned this week is that if you do, releases are naturally boring  🙂

    I’m not quite sure what to do with that, there’s a sense of pride when rationalizing it, but I can’t help but feel that it’s a bit unfair that if you do things well enough the intrinsic reward seems to diminish.

    I guess what I’m saying is, good job, Bitnami team!

  • A year at Bitnami

    I’m a stone’s throw away from reaching my 1 year anniversary at Bitnami, so it feels like a good time to pause a bit and look back.

    After 8 years working at Canonical on a wide range of projects and roles, it was a very difficult step to take, riddled with uncertainty and anxiety about leaving behind so many things I had poured my heart and soul into for so many years, and more than anything else a once-in-a-lifetime epic team of people to work with.

    A year in, I’m overwhelmingly happy I made that decision.

    A lot of people expressed surprise I was joining Bitnami as either they hadn’t heard about them at all or they had but thought of them as a company that “made some installers or something”. However, Bitnami had been quietly but consistently growing in size, scope and revenue, all fueled by being organically profitable which is very rare nowadays in the tech world.

    Fast forward a year, and Bitnami is starting to move out of the shadows, and some of what’s been cooking for a while is getting some well-deserved time in the spotlight.

    Of the things that are public, I would say they fall into two buckets: Kubernetes & Packaging open source applications.

    Kubernetes

    The Kubernetes community has been growing at a healthy and inclusive pace for some time now, some would say it’s the hippest place to be right now.

    One of the things that was attractive to me when changing jobs was the possibility of using some new and interesting technologies more hands-on as part of my day-to-day job, and K8s was at the very top of my list. Shortly after I joined, we made a company decision to go all-in on K8s and began setting up our own clusters and migrating our internal services over. Aside from the amount of buzz and customer requests we had, once we started using it more hands-on it became obvious to us it would win over hearts and minds fairly quickly, and we doubled down on being all-in  🙂

    Aside from all the knowledge we gained by setting up, maintaining and upgrading our internal clusters, Bitnami acquired a small but very relevant start-up called Skippbox which brought over further expertise in training, but even more interesting was a project called Kubeless.

    Kubeless is a functions-as-a-service framework which has the advantage of being built on top of K8s-native objects, making it very easy to extend and interact with anything inside your cluster. That project has been a lot of fun to play with and is a natural addition to our internal clusters, fulfilling our stated goal of making it easy and enjoyable for our own development team to deliver software to production.

    It was a busy year, have I said it’s been a busy year? So, as well as all of that along came the Helm project. Once we heard “a packaging format for applications on K8s” we knew someone’s current iteration would be derailed  🙂

    We jumped in with Deis and helped get the project off the ground by applying our knowledge of how to package software to Helm and quickly produced the majority of the charts that the project launched with. It’s been getting a healthy string of contributions since then, which is as good as you can hope for.

    Because humans are visual creatures, no matter how technical, on the heels of the launch of Helm we took the lead on a new project code-named “Monocular”, which is a web-ui to search and navigate existing Helm charts and even deploy them to your cluster with one click. An app store of sorts for K8s applications.

    With all that K8s experience in our toolbelts, we started to realise there was a gap in how to consistently deploy applications across clusters. Why across clusters, you say? A very common pattern is to have at least a staging and a production environment, which in K8s you would likely want to model as different clusters. We also happen to internally provide a development cluster, as we do so much development on K8s and often need to test on larger machines or use specific features that minikube doesn’t satisfy. The way to do that in Helm is essentially to copy and paste your yaml files, which for a small number of clusters or apps is fine. For us this quickly grew out of control, and we realised we needed some amount of re-usability and flexibility when trying to use K8s features that Helm itself hadn’t exposed yet.

    It turned out we weren’t alone. Our friends over at hept.io and box.com had similar problems and were in fact trying to address them in a similar way (given there were a few ex-googlers in our ranks, jsonnet was picked as the library to help with re-usability), and so ksonnet was born. You can take a look at it more closely if you’re interested, but in essence it takes json & jsonnet templates and compiles them down to native K8s yaml files that you can track and feed directly into your cluster.


    Packaging open source applications

    This is probably the most underrated aspect of Bitnami, as the scale at which we operate isn’t very obvious, and there’s nobody else really to compare the company to.

    Let me try and give you some hints at that scale. At this exact point in time, you can find Bitnami-built assets as:

    • Windows, Linux and macOS installers
    • Amazon EC2 VMs
    • Azure VMs
    • Google Cloud VMs
    • Oracle Cloud VMs
    • Virtual Machines (OVAs)
    • Huawei Cloud VMs
    • Deutsche Telekom VMs
    • 1&1 VMs
    • GoDaddy VMs
    • Centurylink VMs
    • Docker containers
    • Eclipse Che containers
    • Docker-compose templates
    • Amazon Cloudformation templates
    • Azure ARM templates
    • Google deployment templates
    • Kubernetes Helm charts
    • …and more on its way  🙂

    That is 20 different target environments! Even if you just built one application for all those targets, it would be an interesting problem in itself. However, there’s more  🙂

    Bitnami has a catalog of 170+ open source applications. We don’t provide the full catalog in every environment, as it doesn’t always make sense (not everything works as a Docker container or a multi-tier application), and while I haven’t looked at the exact numbers, it likely averages out at around ~110 apps per target. That is 110 x 20 = 2,200 assets to build. That on its own should feel daunting for anyone who’s tried to build an application for more than one environment. But wait, there’s more!
    Bitnami’s mission is to make it easy for everyone to use open source software, and to reach more of “everyone” you need to support multiple major versions of these applications (because not everyone has migrated to Python 3 yet :), so that ends up at around 4,400 assets. Mind. Blown. But you know how it goes, there’s always more!
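    The back-of-the-envelope arithmetic above can be checked in a few lines (the ~110 average per target and the rough doubling for major versions are the post’s own estimates, not exact figures):

```python
targets = 20           # distinct target environments listed above
avg_apps = 110         # rough average of apps shipped per target
major_versions = 2     # rough estimate: about two supported major versions each

single_version = targets * avg_apps            # assets with one version per app
with_major_versions = single_version * major_versions

print(single_version, with_major_versions)  # 2200 4400
```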

    Building these images and templates is an interesting and hard problem, but the hardcore last-level-boss problem is doing so in a way where you can keep building them continuously so they are actually up-to-date all the time. In order to do that you have to track a lot of software (e.g., libssl, libc, openssh, php, java, rails) packaged in different ways (debs, rpms, gems, pip, etc.), so you end up watching thousands of different pieces of software overall, some of which can affect every single image (hello openssl & heartbleed!).
    To solve this problem there’s over a decade of code that’s been brewing, carefully structured metadata about how applications like to be configured in different scenarios, regression tests, tracking upstream releases and watching and matching CVEs. Over the last year there’s been a tight focus on taking all that work to streamline the tools to plan for continued growth as the landscape of software expands, and some refactoring to be able to shape it into a product that might be useful to others beyond our internal use.

    Daunting problems end up being the most fun to work on and this has been no exception. This is why I joined Bitnami, to lead this effort and make open source software a bit easier to use and access every day.

  • Developing and scaling Ubuntu One filesync, part 1

    Now that we’ve open sourced the code for Ubuntu One filesync, I thought I’d highlight some of the interesting challenges we had while building and scaling the service to several million users.

    The teams that built the service were roughly split into two: the foundations team, who was responsible for the lowest levels of the service (storage and retrieval of files, data model, client and server protocol for syncing) and the web team, focused on user-visible services (website to manage files, photos, music streaming, contacts and Android/iOS equivalent clients).
    I joined the web team early on and stayed with it until we shut the service down, so that’s where most of my stories will be focused.

    Today I’m going to focus on the challenge we faced when launching the Photos and Music streaming services. Given that by the time we launched them we had a few years of experience serving files at scale, our challenge turned out to be in presenting and manipulating each user’s metadata quickly, and being able to show it in appealing ways (showing music by artist or genre, and searching, for example). Photos was a similar story: people tended to have many thousands of photos and songs, and we needed to extract metadata, parse it, store it and then be able to present it back to users quickly in different ways. Easy, right? It is, until a certain scale  🙂
    Our architecture for storing metadata at the time was about 8 PostgreSQL master databases across which we sharded metadata (essentially your metadata lived on a different DB server depending on your user id), plus at least one read-only slave per shard. These were really beefy servers with a truckload of CPUs, more than 128GB of RAM and very fast disks (when reading this, remember this was 2009-2013; hardware specs seem tiny as time goes by!). However, no matter how big these DB servers got, given how busy they were and how much metadata was stored (for years we didn’t delete any metadata, so for every change to every file we duplicated the metadata), after a certain point we couldn’t get a simple listing of a user’s photos or songs (essentially, some of their files filtered by mimetype) in a reasonable time-frame (less than 5 seconds). As the service grew we added caches, indexes, optimized queries and code paths, but we quickly hit a performance wall that left us no choice but a much-feared major architectural change. I say much-feared because major architectural changes come with a lot of risk to running services that have low tolerance for outages or data loss; whenever you change something that’s already running in a significant way, you’re basically throwing out most of your previous optimizations. On top of that, as users we expect things to be fast and take it for granted. A 5-person team spending 6 months making things work the way users already expect isn’t really something you can brag about in the middle of a race with many other companies to capture a growing market.
    In the time since we had started the project, NoSQL had taken off and matured enough to be a viable alternative to SQL, and it seemed to fit many of our use cases much better (webscale!). After some research and prototyping, we decided to generate pre-computed views of each user’s data in a NoSQL DB (Cassandra), and to do that by extending our existing architecture instead of revamping it completely. Given our code was pretty well built into proper layers of responsibility, we hooked an async process up to the lowest layer of our code (database transactions) that would send messages to a queue whenever new data was written or modified. This meant essentially duplicating the metadata we stored for each user, but trading storage for computing is usually a good trade-off to make, both in cost and performance. So now we had a firehose queue of every change that went on in the system, and we could build a separate piece of infrastructure whose only focus would be to provide per-user metadata *fast* for any type of file, so we could build interesting and flexible user interfaces for people to consume their own content. The stated internal goals were: 1) fast responses (under 1 second), 2) less than 10 seconds between user action and UI update, and 3) complete isolation from existing infrastructure.
    Here’s a rough diagram of how the information flowed through the system:

    U1 Diagram

    It’s a little bit scary when looking at it like that, but in essence it was pretty simple: write each relevant change that happened in the system to a temporary table in PG in the same transaction that writes to the permanent table. That way you get, for free, transactional guarantees that you won’t lose any data on that layer, and you use PG’s built-in cache that keeps recently added records cheaply accessible.
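    A minimal sketch of that transactional write, using Python’s sqlite3 so the example is self-contained (the real system used PostgreSQL, and every table and column name here is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (user_id INTEGER, path TEXT, mimetype TEXT)")
conn.execute("CREATE TABLE txlog (user_id INTEGER, op TEXT, path TEXT)")

def record_file_change(conn, user_id, path, mimetype):
    # Both inserts happen in one transaction, so the txlog row exists
    # if and only if the permanent row does -- no change can be lost
    # at this layer without the whole write failing.
    with conn:
        conn.execute("INSERT INTO files VALUES (?, ?, ?)",
                     (user_id, path, mimetype))
        conn.execute("INSERT INTO txlog VALUES (?, 'put', ?)",
                     (user_id, path))

record_file_change(conn, 42, "/photos/cat.jpg", "image/jpeg")
rows = conn.execute("SELECT user_id, op, path FROM txlog").fetchall()
print(rows)  # [(42, 'put', '/photos/cat.jpg')]
```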
    Then we built a bunch of workers that looked through those rows, parsed them, sent them to a persistent queue in RabbitMQ and, once they got confirmation a message was queued, deleted it from the temporary PG table.
    Following that, we took advantage of Rabbit’s queue exchange features to build different types of workers that process the data differently depending on what it is (music was stored differently than photos, for example).
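    The drain-and-delete worker loop from the previous paragraphs might look roughly like this; sqlite3 and a plain in-memory Queue stand in for PG and RabbitMQ (a real broker would provide the publish confirmation the workers waited on), and the schema is again invented:

```python
import sqlite3
from queue import Queue

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE txlog (id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT)")
conn.executemany("INSERT INTO txlog (user_id, payload) VALUES (?, ?)",
                 [(1, "photo added"), (2, "song added")])
conn.commit()

queue = Queue()  # stand-in for a persistent RabbitMQ queue

def drain_txlog(conn, queue, batch_size=100):
    rows = conn.execute(
        "SELECT id, user_id, payload FROM txlog LIMIT ?",
        (batch_size,)).fetchall()
    for row_id, user_id, payload in rows:
        # Publish first; only delete the row once the broker has
        # accepted the message, so a crash can duplicate but never lose.
        queue.put((user_id, payload))
        conn.execute("DELETE FROM txlog WHERE id = ?", (row_id,))
    conn.commit()
    return len(rows)

drained = drain_txlog(conn, queue)
print(drained, queue.qsize())  # 2 2
```

    Deleting only after the publish means a crash mid-loop re-sends a few rows rather than dropping them, which downstream consumers can tolerate by being idempotent.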
    Once we completed all of this, accessing someone’s photos was a quick and predictable read operation that would give us all their data back in an easy-to-parse format that fit in memory. Eventually we moved all the metadata accessed from the website and REST APIs to these new pre-computed views, and the result was a significant reduction in load on the main DB servers, while now getting predictable sub-second request times for all types of metadata in a horizontally scalable system (just add more workers and Cassandra nodes).
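    The shape of those pre-computed views can be illustrated with a toy sketch, where a dict stands in for Cassandra and the row layout is invented: each worker type writes a denormalized per-user row, so reads become a single keyed lookup instead of a scan over file metadata filtered by mimetype:

```python
# (view_kind, user_id) -> list of denormalized records
views = {}

def index_photo(user_id, path, taken_at):
    # A "photos" worker appends straight to the user's view row.
    views.setdefault(("photos", user_id), []).append(
        {"path": path, "taken_at": taken_at})

def user_photos(user_id):
    # One predictable keyed read; no joins or filtering at request time.
    return views.get(("photos", user_id), [])

index_photo(7, "/photos/beach.jpg", "2012-06-01")
index_photo(7, "/photos/dog.jpg", "2012-06-02")
print(len(user_photos(7)))  # 2
```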

    All in all, it took about 6 months end-to-end, which included a prototype phase that used memcache as a key/value store.

    You can see the code that wrote and read from the temporary PG table if you branch the code and look under: src/backends/txlog/
    The worker code, as well as the web UI, is still not available, but it will be once we finish cleaning it up. I decided to write this up and publish it now because I believe the value is more in the architecture than in the code itself   🙂

  • Disassembling a DeLonghi eco310.r coffee machine

    After a few weeks of being coffee-deprived, I decided to disassemble my espresso machine and see if I could figure out why it leaked water while on, and didn’t have enough pressure to produce drinkable coffee.
    I live a bit on the edge of where other people do, so my water supply is from my own pump, 40 meters into the ground. It’s as hard as water gets. That was my main suspicion. I read a bit about it on the interwebz and learned about descaling, which I’d never heard about. I tried some of the home-made potions but nothing seemed to work.
    Long story short, I’m enjoying a perfect espresso as I write this.

    I wanted to share a bit with the internet people about what was hard to solve, and couldn’t find any instructions on. All I really did was disassemble the whole thing completely, part by part, clean them, and make sure to put it back together tightening everything that seemed to need pressure.
    I don’t have the time and energy to put together a step-by-step walk-through, so here’s the 2 tips I can give you:

    1) Remove ALL the screws. That’ll get you 95% of the way there. You’ll need a Phillips head, a Torx head, a flat head and some small-ish pliers.
    2) The knob that releases the steam looks unremovable and blocks you from getting the top lid off. It doesn’t screw off, you just need to pull upwards with some strength and care. It comes off cleanly and will go back on easily. Here’s a picture to prove it:

    DeLonghi eco310.r

    Hope this helps somebody!

  • Click packages and how they’ll empower upstreams

    As the pieces start to come together and we get closer to converging mobile and desktop in Ubuntu, Click packages running on the desktop start to feel like they will be a reality soon (Unity 8 brings us Click packages). I think it’s actually very exciting, and I thought I’d talk a bit about why that is.

    First off: security. The Ubuntu Security team have done some pretty mind-blowing work to ensure Click packages are confined in a safe, reliable but still flexible manner. Jamie has explained how and why in a very eloquent manner. This will only push further an OS that is already well known and respected for being a safe place to do computing for all levels of computer skills.
    My second favorite thing: simplification for app developers. When we started sketching out how Clicks would work, there was a very sharp focus on enabling app developers to have more freedom to build and maintain their apps, while still making it very easy to build a package. Clicks, by design, can’t express any external dependencies other than a base system (called a “framework”). That means that if your app depends on a fancy library that isn’t shipped by default, you just bundle it into the Click package and you’re set. You get to update it whenever it suits you as a developer, and have predictability over how it will run on a user’s computer (or device!). That opens up the possibility of shipping newer versions of a library, or just sticking with one that works for you. We exchange that freedom for some minor theoretical memory usage increases and extra disk space (if 2 apps end up including the same library), but with today’s computing power and disk space cost, it seems like a small price to pay to empower application developers.
    Building on top of my first 2 favorite things comes the third: updating apps outside of the Ubuntu release cycle and gaining control as an app developer. Because Click packages are safer than traditional packaging systems, and dependencies are more self-contained, app developers can ship their apps directly to Ubuntu users via the software store without the need for specialized reviewers to review them first. It’s also simpler to carry support for previous base systems (frameworks) in newer versions of Ubuntu, allowing app developers to ship the same version of their app to both Ubuntu users on the cutting edge of an Ubuntu development release, as well as the previous LTS from a year ago. There have been many cases over the years where this was an obvious problem, OwnCloud being the latest example of the tension that arises from the current approach where app developers don’t have control over what gets shipped.
    I have many more favorite things about Clicks; here are a few more:
    – You can create “fat” packages where the same binary supports multiple architectures
    – Updates between versions are transactional, so you never end up with a botched app update. No more holding your breath while an update installs, hoping your power doesn’t drop mid-way
    – Multi-user environments can have different versions of the same app without any problems
    – Because Clicks are so easy to introspect, and their proper confinement so easy to verify, the review process has been easy to automate, enabling the store to process new applications within minutes (if not seconds!) and make them available to users immediately

    The future of Ubuntu is exciting and it has a scent of a new revolution.

  • Engineering management

    I’m a few days away from hitting 6 years at Canonical and I’ve ended up doing a lot more management than anything else in that time. Before that I did a solid 8 years at my own company, doing anything from developing, project managing, product managing, engineering managing, sales and accounting.
    This time of the year is performance review time at Canonical, so it’s gotten me thinking a lot about my role and how my view on engineering management has evolved over the years.

    A key insight I picked up from a former boss, Elliot Murphy, was to view it as a support role for others to do their job, rather than a follow-the-leader approach. I had heard the phrase “As a manager, I work for you” a few times over the years, but it rarely seemed true, and it felt mostly like a nice concept to make people happy rather than something applied in practice in any meaningful way.

    Of all the approaches I’ve taken or seen, I believe the best one is a role where you’re there to unblock developers more than anything else. And unless you’re a bit power-hungry on some level, it’s probably the most enjoyable way of being a manager.

    It’s not to be applied blindly, though, I think a few conditions have to be met:
    1) The team has to be fairly experienced/senior/smart; if it isn’t, I think the approach breaks down too often
    2) You need to understand very clearly what needs doing and why, and you need to invest heavily and frequently in communicating it to the team, both the global context as well as how it applies to each person individually
    3) You need to build a relationship of trust with each person and need to trust them, because trust is always a 2-way street
    4) You need to be enough of an engineer to understand problems in depth when they’re explained, know when to defer to others’ judgment (which should be the common case when the team is generally smart and experienced) and be capable of tie-breaking in a technically savvy way
    5) Anyone whose ego doesn’t fit in a small, 100ml container should leave it at home

    There are many more things to do, but I think if you don’t have those five, everything else is hard to hold together. In general, if the team is smart and experienced, understands what needs doing and why, and like their job, almost everything else self-organizes.
    If it isn’t self-organizing well enough, walk through those 5 points; one or several must be misaligned. More often than not, it’s 2). Communication is hard, expensive and more of an art than a science. Most of the times things have seemed to stumble a bit, it’s been a failure of how I understood what we should be doing as a team, or a failure in how I communicated it to everyone else as it evolved over time.
    Second most frequent I think is 1), but that may vary more depending on your team, company and project.

    Oh, and actually caring about people and what you do helps a lot, but that helps a lot in life in general, so do that anyway regardless of your role  🙂

  • A story on finding an elusive security bug and managing it responsibly

    Now that all the responsible disclosure processes have been followed through, I’d like to tell everyone a story of my very bad week last week. Don’t worry, it has a happy ending.


    Part 1: Exposition

    On May 5th we got a support request from a user who observed confusing behaviour in one of our systems. Our support staff immediately escalated it to me, and my team sprang into action for what ended up being a 48-hour rollercoaster ride that ended with us reporting a security bug upstream to Django.

    The bug, in a nutshell, is that when the following conditions line up, a system could end up serving a request to one user that was meant for another:

    – You are authenticating requests with cookies, OAuth or other authentication mechanisms
    – The user is using any version of Internet Explorer or Chromeframe (to be more precise, anything with “MSIE” in the request user agent)
    – You (or an ISP in the middle) are caching requests between Django and the internet (except Varnish’s default configuration, for reasons we’ll get to)
    – You are serving the same URL with different content to different users

    We rarely saw this combination of conditions because users of services provided by Canonical generally have a bias against using Internet Explorer, as you’d expect from a company that develops the world’s most used Linux distribution.


    Part 2: Rising Action

    Now, one may think the bug is obvious, and wonder how it went unnoticed since 2008, but this really was one of those elusive “ninja-bugs” you hear about on the Internet, and it took us quite a bit of effort to track it down.

    In debugging situations such as this, the first step is generally to figure out how to reproduce the bug. In fact, figuring out how to reproduce it is often the lion’s share of the effort of fixing it.  However, no matter how much we tried we could not reproduce it. No matter what we changed, we always got back the right request. This was good, because it ruled out a widespread problem in our systems, but did not get us closer to figuring out the problem.

    Putting aside reproducing it for a while, we then moved on to combing very carefully through our code, trying to find any hints of what could be causing this. Several of us looked at it with fresh eyes so we wouldn’t be tainted by having developed or reviewed the code, but we all still came up empty each and every time. Our code seemed perfectly correct.

    We then went on to a close examination of all related requests to get new clues to where the problem was hiding. But we had a big challenge with this. As developers we don’t get access to any production information that could identify people. This is good for user privacy, of course, but made it hard to produce useful logs. We invested some effort to work around this while maintaining user privacy by creating a way to anonymise the logs in a way that would still let us find patterns in them. This effort turned up the first real clue.

    We use Squid to cache data for each user, so that when they re-request the same data, it’s kept right in memory and can be quickly served to them without having to recreate the data from the databases and other services. In those anonymised Squid logs, we saw cookie-authenticated requests that didn’t contain an HTTP Vary header at all, where we expected at the very least “Vary: Cookie” to ensure Squid would always serve the correct content to each user. So we then knew what was happening, but not why. We immediately pulled Squid out of the middle to stop it from happening.
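    For context on why that header matters: with “Vary: Cookie” a cache keys each entry on the Cookie value, so each user’s responses stay separate. Here’s a simplified sketch of what Django’s `patch_vary_headers` helper does, using a plain dict for the response headers (this is the idea, not the exact Django source):

```python
def patch_vary_headers(headers, newheaders):
    # Merge new values into an existing Vary header without creating
    # duplicates (a simplified take on django.utils.cache.patch_vary_headers).
    existing = [h.strip() for h in headers.get("Vary", "").split(",") if h.strip()]
    seen = {h.lower() for h in existing}
    for h in newheaders:
        if h.lower() not in seen:
            existing.append(h)
    headers["Vary"] = ", ".join(existing)

headers = {"Content-Type": "application/json"}
patch_vary_headers(headers, ["Cookie"])
# A spec-compliant cache in front will now key this response on the
# request's Cookie header, instead of sharing it across users.
```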

    Why was Squid not logging Vary headers? There were many possible culprits, so a *lot* of people got involved in searching for the problem. We combed through everything in our frontend stack (Apache, Haproxy and Squid) that could sometimes remove Vary headers.

    This was made all the harder because we had not yet fully Juju-charmed every service, so we could not easily access all configurations and test theories locally. Sometimes technical debt really gets expensive!

    After this exhaustive search, we determined that nothing in our stack removed the headers. So we started following the code up through Django’s middleware, going as far as logging the exact headers Django was sending out at the last middleware layer. Still nothing.


    Part 3: The Climax

    Then we got a break. Logs were still being generated, and eventually a pattern emerged: all the initial requests that had no Vary headers seemed, for the most part, to be from Internet Explorer. It didn’t make sense that a browser could remove headers that were returned from a server, but knowing this took us to the right place in the Django code, and because Django is open source, there was no friction in inspecting it deeply. That’s when we saw it.

    In a function called fix_IE_for_vary, we saw the offending line of code.

    del response['Vary']

    We finally found the cause.

    It turns out IE 6 and 7 didn’t fully implement the HTTP Vary header, so there’s a workaround in Django to remove it for any content that isn’t HTML or plain text. In hindsight, had Django implemented this as a middleware instead, even one enabled by default, it would have been more likely to be revised earlier. Hindsight is always 20/20 though, and it’s easy to sit back and theorise about how things should have been done.
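    For illustration, the workaround’s logic boils down to something like this (a simplified sketch of the behaviour described above, not the exact Django source):

```python
def fix_ie_for_vary(user_agent, content_type, headers):
    # Old IE mishandled Vary on non-page downloads, so the workaround
    # strips the header from any MSIE request whose content isn't HTML
    # or plain text.
    safe_types = ("text/html", "text/plain")
    if "MSIE" in user_agent and not content_type.startswith(safe_types):
        headers.pop("Vary", None)

headers = {"Vary": "Cookie", "Content-Type": "application/json"}
fix_ie_for_vary("Mozilla/4.0 (compatible; MSIE 7.0)", "application/json", headers)
# The Vary header is now gone, so any spec-compliant cache in front will
# happily serve this cached response to the next user who asks for the URL.
```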

    So if you’ve been serving any data that isn’t HTML or plain text through a caching layer that implements Vary header management to spec (Varnish doesn’t trust it by default, and checks the cookie in the request anyway), you may have served a response to the wrong user.

    Newer versions of Internet Explorer have since fixed this, but who knew in 2008 that IE 9 would come 3 years later?


    Part 4: Falling Action

    We immediately applied a temporary fix to all our running Django instances in Canonical and involved our security team to follow standard responsible disclosure processes. The Canonical security team was now in the driving seat and worked to assign a CVE number and email the Django security contact with details on the bug, how to reproduce it and links to the specific code in the Django tree.

    The Django team immediately and professionally acknowledged the bug and began researching possible solutions, as well as any other parts of the code where this scenario could occur. There was continuous communication between our teams over the next few days while we agreed on lead times for distributions to receive and prepare the security fix.


    Part 5: Resolution

    I can’t highlight enough how important it is to follow these well-established processes to make sure we keep the Internet at large a generally safe place.
    To summarise, if you’re running Django, please update to the latest security release as quickly as possible, and disable any internal caching until then to minimise the chances of hitting this bug.

    If you’re running Squid and want to check whether you could be affected, we put together a small Python script to run against your logs. You can use it as a base; you may need to tweak it based on your log format. Be sure to run it only against cookie-authenticated URLs, otherwise you will hit a lot of false positives.
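    That script isn’t reproduced here, but the core idea can be sketched in a few lines. This sketch assumes a Squid access log format where the cache result code (e.g. TCP_HIT) and the user agent both appear on each line, which depends on your logformat configuration:

```python
def find_suspect_lines(log_lines):
    # Flag cache hits served to an MSIE user agent: these are the
    # responses that may have had their Vary header stripped and could
    # therefore have been shared across users.
    return [line for line in log_lines
            if "TCP_HIT" in line and "MSIE" in line]

sample = [
    '1399300000.123 TCP_HIT/200 GET /api/files "Mozilla/4.0 (compatible; MSIE 7.0)"',
    '1399300001.456 TCP_MISS/200 GET /api/files "Mozilla/5.0 (X11; Linux) Firefox/29.0"',
]
suspects = find_suspect_lines(sample)
```

    Any hits it flags are worth cross-checking against which users actually received those responses.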

  • On open sourcing Ubuntu One filesync

    This week has been bitter-sweet. On the one hand, we announced that a project many of us had poured our hearts and minds into was going to be shut down. It’s made many of us sad and some of us haven’t even figured out what to do with their files yet    🙂

    On the other hand, we’ve been laser-focused on making Ubuntu on phones and tablets a success, and our attention has moved to making sure we have a platform that is rock-solid, scalable, secure and pleasant to use for developers and users alike. We just didn’t have the time to keep racing against other companies whose only focus is file syncing, which was very frustrating as we watched a project we were proud of fall behind. It was hard to keep feeling proud of the service, so shutting it down felt like the right thing to do.

    I am, however, very excited about open sourcing the server-side of the file syncing infrastructure. It’s a huge beast that contains many services and has scaled well into the millions of users.

    We are proud of the code that is being released and in many ways we feel that the code itself was successful despite the business side of things not turning out the way we hoped for.

    This will be a great opportunity for those of you who’ve been itching for an open source service for personal cloud syncing at scale; the code comes battle-tested and with a wide array of features.

    As usual, some people have taken this generous gesture “as an attempt to gain interest in a failing codebase”, which couldn’t be more wrong. The agenda here is to make Ubuntu for phones a runaway success, and in order to do that we need to double down on our efforts and focus on what matters right now.

    Instead of storing away those tens of thousands of expensive man-hours of work in an internal repository somewhere, we’ve decided to share that work with the world and allow others to build on top of it and benefit from it.

    It’s hard sometimes to watch some people trying to make a career out of painting everything Canonical does as inherently evil, but at the end of the day what matters is making open source available to the masses. That’s what we’ve been doing for a long time, and that’s the only thing that will count in the end.


    So in the coming months we’re going to be cleaning things up a bit, trying to release the code in the best shape possible and work out the details on how to best release it to make it useful for others.

    All of us who worked on this project for so many years are looking forward to sharing it and look forward to seeing many open source personal cloud syncing services blossoming from it.