
Swapping Engines Mid-Flight

A few months ago, I had the privilege of joining the product team for our First Design Sprint. Starting with a huge user pain-point, we used the Design Sprint process to arrive at a solution with a validated design (yay, classroom visits!), and a working prototype. If you’re curious about that process, I highly recommend you give that post a read. Long story short (and simplified): students practice on NoRedInk to gain mastery; the old way of calculating mastery frustrated students… a lot; the new way of calculating mastery feels much more fair.

This post is about what came after the design sprint:

We replaced the core functionality of our site with a completely new experience, without downtime, without a huge feature branch full of merge conflicts, and while backfilling 250 million records.

Actually, this post is only the first of two, in which I hope to discuss the strategies we did and didn’t use to build and deploy this feature. A future post will be a deep-dive into backfilling the 250 million rows without requiring weeks of wall time. I make no claim we did anything original or even unexpected. But, I hope reading this particular journey, and my missteps along the way, will bring together some pieces that help you in your own work.

The Omnibus Strategy

I’ve been working at NoRedInk for 4 years - since back when the engineering team was just a handful of us - and things have changed a lot. In the early days, when we had a big new feature we would:

  1. Start an omnibus feature branch
  2. Create feature branches off of the omnibus branch
  3. Review each feature branch, and merge it into the omnibus branch
  4. Resolve all the merge conflicts in the omnibus branch that crop up as other engineers merge code into master
  5. Then, deal with merge conflicts between the omnibus branch and any/all feature branches
  6. Keep creating, reviewing, and merging feature branches until the omnibus branch is fully featured
  7. QA the completed omnibus branch
  8. Merge the omnibus branch into master and deploy

As we added more team members, and our features got more complex, the merge conflicts became a nightmare. I had heard this could be avoided by using feature flags, but (though I’d never actually tried it) I’d decided that the resulting code complexity wasn’t worth it. Maybe I was right back when we had 3 engineers, but by the time we were 6+ - quite frankly - I was dead wrong.

The Flipper Strategy

Around year 3, we started using feature flags for large features (in particular, we use the Flipper gem) thanks to some polite prodding by the trio of Charles, Marica, and Rao. For the uninitiated, this produces code similar to the following all over your codebase:


# While the :new_thing flag is off, users only ever hit the old code path.
if FeatureFlag[:new_thing].enabled?
  do_fancy_new_thing()
else
  do_old_thing()
end

As long as that feature flag is turned off, your new code has no effect. The magical win you get when you write code that doesn’t affect users is that you can merge every little PR for your new feature directly into master! No extended merge conflicts. No branches off of branches. And because you’re using feature flags, you can have tests for both the new and the old functionality co-exist. Plus, when you’re ready, you can turn the new feature on (and back off) without a deploy.
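
Here’s a minimal sketch of what that co-existence can look like, assuming RSpec and Flipper’s in-memory adapter (the flag name and the assertions are made up for illustration, not our actual test suite):

require "flipper"
require "flipper/adapters/memory"

# Back Flipper with an in-memory adapter so the example is self-contained.
Flipper.configure do |config|
  config.default { Flipper.new(Flipper::Adapters::Memory.new) }
end

RSpec.describe "the :new_thing feature flag" do
  it "takes the old code path while the flag is off" do
    Flipper.disable(:new_thing)
    expect(Flipper.enabled?(:new_thing)).to be(false)
    # ...assert on do_old_thing's behavior here...
  end

  it "takes the new code path once the flag is on" do
    Flipper.enable(:new_thing)
    expect(Flipper.enabled?(:new_thing)).to be(true)
    # ...assert on do_fancy_new_thing's behavior here...
  end
end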

The new approach looks like this:

  1. Start a feature branch off of master
  2. Code up a small piece of your new feature, and put that functionality behind a feature flag. Make sure the old functionality still works
  3. Review that branch as if it were any other PR, except now we need to make sure both that the new functionality works and that the old functionality is unchanged
  4. Merge your PR into master

It’s almost exactly the same as development-as-usual.

Side note: you don’t need feature flags to merge not-yet-released code. As long as the new functionality is disabled (e.g. wrapped in if false) or a no-op (e.g. writing data to an as-yet-unused table), you’re in good shape. What feature flags give you is an easy way to toggle functionality in tests, during QA, and on production - so your “disabled” functionality can also be easily verified and tested.

Running Two Different Engines at Once

The first talk I heard about migrating between systems under heavy usage was a 2010 talk by Harry Heymann at Foursquare. They were moving from PostgreSQL to MongoDB while users were “checking in” ~1.6M times / day. They followed a pretty clean approach:

  1. Build the new system to run in parallel with the old system. Write to both systems, but keep reading from only the old system.
  2. Validate that the new system is running as expected. At this point, we’re confident all data moving forward is good.
  3. Backfill the new system.
  4. Swap! Start reading from the new system - and you’re live!
  5. Retire the old system.

“Swap!”, in our case, meant turning on the feature flag.
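
Our first pass at step 1 looked roughly like the sketch below (the method and class names are hypothetical, not our actual code). Notice that each engine computes and stores its own score:

def record_answer(student, correct:)
  # Each engine computes its own score...
  old_score = OldMasteryEngine.score(student, correct: correct)
  new_score = NewMasteryEngine.score(student, correct: correct)

  # ...and each datastore records its own engine's result.
  OldMasteryStore.write(student, old_score)  # still what users see
  NewMasteryStore.write(student, new_score)  # written, but not yet read

  old_score
end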

This seemed like the right approach. Even our usage numbers are comparable - our usage today is about 5x theirs in 2010.

The key difference for us is that Foursquare had two systems that were expected to work identically, while we have two systems designed to work completely differently. One example: if a student answers a question incorrectly on the site,

  - the old system would take away 50% of her mastery points;
  - the new system doesn’t take away any points, but requires her to get three questions correct in a row before she can get points in the future.

So, here’s the problem. Let’s imagine Susan is doing her homework while we’re writing to both systems. At this point, the “Old System” is still what users are seeing. The following are real mastery score calculations from both systems:


| Susan        | Old System Score | New System Score |
|--------------|------------------|------------------|
| initial      |         0        |         0        |
| correct      |        20        |        20        |
| incorrect    |        10        |        20        | Scores don't match anymore !!!
| correct      |        10        |        20        |
| correct      |        30        |        20        |
| correct      |        50        |        20        |
| correct      |        70        |        40        |
| correct      |        90        |        60        |
| correct      |       100  done! |        80        |

Great! Susan is done with her homework, and she has a grade of 100. Then tomorrow, we swap to the new system. Suddenly, her grade drops to an 80! I’ll let you imagine how furious students and teachers would be if we let that happen.
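
To make the new rule concrete, here’s a toy reconstruction (my own, not our production code) that reproduces the “New System Score” column above:

class NewMasteryToy
  POINTS_PER_CORRECT = 20
  MAX_SCORE = 100

  attr_reader :score

  def initialize
    @score = 0
    @streak_needed = 0  # correct answers still owed before points accrue again
  end

  def answer(correct)
    if !correct
      @streak_needed = 3  # no points lost, but a 3-in-a-row streak is now required
    elsif @streak_needed > 0
      @streak_needed -= 1  # burning down the required streak earns nothing yet
    else
      @score = [@score + POINTS_PER_CORRECT, MAX_SCORE].min
    end
    @score
  end
end

# Replaying Susan's session: one correct, one incorrect, then six correct.
susan = NewMasteryToy.new
[true, false, true, true, true, true, true, true].map { |c| susan.answer(c) }
# => [20, 20, 20, 20, 20, 40, 60, 80]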

We’re using feature flags to deploy new code right away, and we’re writing to both systems just like Foursquare… I just need everything to match when we flip the feature flag.

I came up with a plan. I’d run the backfill script on the historical data and all the recent data. That way, we’d overwrite all “New System” data so that it would perfectly match the “Old System” scores. Susan’s “New System Score” gets overwritten to be 100, and crisis averted. We’d just have to bring the site down for a couple hours on the weekend so there wouldn’t be any additional writes while the script was running.
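
In ActiveRecord terms, the plan amounted to something like this (a sketch with hypothetical model and column names, not the real backfill script):

# With the site down (so nothing else is writing), copy every
# old-system score over its new-system counterpart, in batches.
OldMasteryScore.find_each(batch_size: 1_000) do |old|
  NewMasteryScore
    .where(student_id: old.student_id, topic_id: old.topic_id)
    .update_all(score: old.score)
end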

Here’s Susan again:


| Susan        | Old System Score | New System Score |
|--------------|------------------|------------------|
| initial      |         0        |         0        |
| correct      |        20        |        20        |
| incorrect    |        10        |        20        | Scores don't match anymore !!!
| correct      |        10        |        20        |
| correct      |        30        |        20        |
| correct      |        50        |        20        |
| correct      |        70        |        40        |
| correct      |        90        |        60        |
| correct      |       100  done! |        80        |

        TAKE THE SITE DOWN FOR MAINTENANCE

| RUN SCRIPT   |       100        |       100        | Scores match again !!!

             BRING THE SITE BACK UP

There are two problems with this. One, my estimate of “a couple hours of downtime” turned out to be wildly optimistic (I’ll talk more about just how wildly in a future post). But more importantly, I was solving the wrong problem: there was no reason to let the scores get out of sync to begin with…

Running Two Different Engines in Sync

Foursquare had the right idea; I’d just been applying it wrong. We needed to sync up the two datastores first, and only afterwards start using the new calculation. The key was to write identical values to both datastores until turning on the feature flag. So, here’s the plan we actually used (changes in bold):

  1. Build the new datastore to run in parallel with the old system. **Write the values from the old system** to both datastores, but keep reading from only the old datastore.
  2. Validate that the new system is **recording the same values**. At this point, we’re confident all data moving forward is good.
  3. Backfill the new system.
  4. Swap! **Turn on the feature flag**: start reading from the new system, **and use the new calculation**.
  5. Retire the old system.

Now Susan’s scores will be identical in both systems, and there’s no need to bring the site down before swapping to the new system.
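
During the transition, the write path from the earlier sketch becomes something like this (again, the names are hypothetical): one score, computed by whichever engine the flag selects, written to both datastores.

def record_answer(student, correct:)
  # Whichever engine the flag selects computes the score...
  score =
    if FeatureFlag[:new_mastery].enabled?
      NewMasteryEngine.score(student, correct: correct)
    else
      OldMasteryEngine.score(student, correct: correct)
    end

  # ...and that single value goes to *both* datastores,
  # so they can never drift apart.
  OldMasteryStore.write(student, score)
  NewMasteryStore.write(student, score)

  score
end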

In Conclusion

So what have I learned? First, be careful what lessons you take from others’ experience. And if you think you need to take the site down to make a change, think again, very carefully.

If you notice anything I missed or got wrong, I’d love to hear about it and keep learning - please write to me and let me know. Thanks!


Josh Leven
@thejosh
Engineer at NoRedInk

