Not so long ago, I wrote about how we swapped out our old mastery engine for a shiny, new one without any downtime. In that post, I glossed over the problem of backfilling what was, at the time, about 250,000,000 rows. This post is a deep dive into what I learned, and the mistakes I made, while writing that backfill.
The Backfill
The backfill itself was relatively simple. For each record I needed two queries: one to calculate the current score, and one to create a record for the new engine with the same score. I built the simplest possible script, told the team we could start doing QA on our staging machine the next day, and went home to run the script on staging.
We have a Rails app, and I wrote the script as a rake task. Here is basically what I had:
task :sync_mastery_scores => :environment do
  StudentTopic.find_each do |student_topic|
    score = student_topic.calculate_mastery_score_the_old_way
    student_topic.update_attribute(:new_mastery_score, score)
  end
end
A script like this produces noticeable load against the production server, so we can’t run it during peak hours. There are about 8 hours every evening when usage is low enough to run it. I should have started by doing a little math: what would it take to get this done in ~8 hours?
( 8 hours / 250,000,000 rows ) / 2 queries/row = 0.0576 ms/query
That means I would need roughly 18 queries to complete every millisecond. With a round trip inside the datacenter at about 0.5ms, one approach might be to have each query run in 1ms and have 18 processes working in parallel. That, however, was not what I had built. Here’s what I saw:
New Mastery Records: |= | 0.00% (15) ETA: > 4 days
Did you catch it? Here it is again: ETA: > 4 days. That’s what the ProgressBar gem prints when the job is going to take so long that it doesn’t bother telling you exactly how long it thinks. But no problem, because that was actually just the first half of the script. There are really two datasets, and the second half is the larger one.
We had a goal of releasing the new mastery system in 2 weeks. There was just no way I could complete running this script on both staging and production in less than 2 weeks. I needed to speed it up.
The thing is, the old system stored mastery scores in a complex relational structure requiring several joins. At best I could get a score out in 80ms.
This may have been the point I started freaking out.
I tinkered a little that night, trying to speed up the calculate_mastery_score part by fetching all the records for a student at once. Students have 0-500 scores in the system, so this had the potential to help significantly. It did, in that the progress bar went from “so much time I’m not gonna bother” to “70 hours.” Reads were now batched per student, but every topic row still got its own UPDATE. Here’s what I had after that change:
Student.find_each do |student|
  scores = student.calculate_all_mastery_scores_the_old_way
  scores.each do |topic_id, score|
    StudentTopic.
      where(student_id: student.id, topic_id: topic_id). # a unique constraint
      update_all(new_mastery_score: score)
  end
end
N = 250,000,000
I was tired and getting nowhere, so I went to sleep and started the next day fresh. One thing I’m very grateful for is having such a talented and experienced team. First thing the next day, I reached out to @dgtized for advice.
I was thinking parallelization, but he pointed out my real problem was how big N was. I needed to fetch multiple students at the same time, and update multiple rows at the same time, and he had just the thing. Here’s an example stolen from this stackoverflow answer:
UPDATE table_name
SET a = CASE WHEN id = 1 THEN 'something'
             WHEN id = 2 THEN 'something else'
             WHEN id = 3 THEN 'another'
        END
WHERE id IN (1, 2, 3);
Bulk Queries
Armed with a way to do bulk updates, and determined to do more with each query, I rejiggered the SQL queries to fetch data in batches of students, and write all the updates for those students at once.
In order to select batches of students more efficiently, I skipped fetching Student records and instead iterated through a range of integers. The only way this approach could be a problem is if a large percentage of the ids in that range had no associated users; in that case, we would be constructing expensive bulk queries that did very little. Fortunately, that is not the case for us.
BATCH_SIZE = 50

max_id = Student.order("id desc").limit(1).pluck(:id).first
ids = ( 1..max_id )

ids.step(BATCH_SIZE) do |first_id|
  last_id = first_id + BATCH_SIZE - 1
  student_ids = ( first_id..last_id )

  # Fetch {primary_key_id => old_mastery_score} for every topic for each student_id
  score_by_primary_key = OldMasteryCalculator.data_for_student_ids(student_ids)
  StudentTopic.batch_update_new_mastery(score_by_primary_key)
end
In order to make the bulk update with CASE statements work, we need to fetch the primary key id for each record we are updating, along with its mastery score. We use that data to construct the following update query:
class StudentTopic
  def self.batch_update_new_mastery(score_by_primary_key)
    return if score_by_primary_key.empty?

    case_statement = "CASE id "
    score_by_primary_key.each do |id, score|
      case_statement.concat "WHEN #{id} THEN #{ActiveRecord::Base.sanitize(score)} "
    end
    # If the case statement misses a record, don't do anything (ELSE: no-op)
    case_statement.concat "ELSE new_mastery_score END"

    # The WHERE id IN (...) clause limits the update to just the rows we have scores for
    where(id: score_by_primary_key.keys).
      update_all("new_mastery_score = #{case_statement}")
  end
end
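To make the shape of the data concrete, here is a purely illustrative call with made-up ids and scores, along with roughly the SQL it produces (assuming the conventional student_topics table name):

# Illustrative only: two rows, ids and scores made up
StudentTopic.batch_update_new_mastery({ 1001 => 0.75, 1002 => 0.5 })

# Produces a single query along these lines:
#
#   UPDATE student_topics
#   SET new_mastery_score = CASE id
#                           WHEN 1001 THEN 0.75
#                           WHEN 1002 THEN 0.5
#                           ELSE new_mastery_score END
#   WHERE id IN (1001, 1002)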
If we revisit that math from before, it looks a whole lot different now (at 50 students per batch, the whole job works out to roughly 144,000 batches):
( 8 hours / ~144,000 batches ) / 2 queries/batch = 100 ms/query
Now that is a much more attainable goal! Thanks, @dgtized!
Batch Size
There’s definitely a sweet spot for the batch size. Make it too small, and you’re missing out on performance. Make it too large, and it can actually slow you down. You could improve the script to adapt while it’s running – measuring performance and adjusting the batch size over time – but I decided that was overkill in this case. Instead, I ran the script for 2 minutes at each of several values between 2 and 100 and compared how many students each run was able to update. That put 50 pretty close to optimal. I did later discover, however, that the newest student accounts (highest id values) had much more data, and 20 was a better value for the final id range of the script.
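The measurement itself doesn’t have to be fancy. Here’s a rough sketch of the idea, not the exact script I used; run_backfill_for is a hypothetical helper standing in for “run the real batch loop for a fixed amount of time and count the students it updated”:

TRIAL_SECONDS = 120

[2, 5, 10, 20, 50, 100].each do |batch_size|
  # run_backfill_for is hypothetical: process batches of `batch_size` students
  # starting at `start_id` until `duration` seconds pass, and return the number
  # of students updated
  students_updated = run_backfill_for(
    start_id: 1,
    batch_size: batch_size,
    duration: TRIAL_SECONDS
  )
  puts "batch_size=#{batch_size} => #{students_updated} students in #{TRIAL_SECONDS}s"
end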
Parallelize
I was feeling pretty good after the batching changes, and was ready to run the scripts again to see where we were. But first, I took the one script and split it into two, one for each table we needed to update, and then tried running each of them. We were at about 20 hours per script. Not bad, but not good enough.
From experience at this point, I knew the scripts slowed down as they reached students with higher ids. So, to add a little buffer, I figured a 3-4x speedup would get us to the point where each script was safely under 8 hours.
I still had two parallelization strategies at my disposal: multi-threaded and multi-process.
Multi-Process
When it comes to multiple threads vs multiple processes, at least in this case, multiple processes is the much easier approach. Running a rake task starts up a brand new, shiny Rails process. Running the same rake task in two terminals (or two tmux panes) starts up two totally separate Rails processes. Now, it doesn’t do us any good to have two processes running at the same time if they’re doing the same work, so the key change here is letting us tell the rake task which section of the work to do. That looks like this:
task :sync_mastery_scores, [:start_id, :max_id] => :environment do |_, args|
  # Rake passes task arguments in as strings
  start_id = args[:start_id].to_i
  max_id = args[:max_id].to_i
  ids = ( start_id..max_id )

  ids.step(BATCH_SIZE) do |first_id|
    last_id = first_id + BATCH_SIZE - 1
    student_ids = ( first_id..last_id )

    score_by_primary_key = OldMasteryCalculator.data_for_student_ids(student_ids)
    StudentTopic.batch_update_new_mastery(score_by_primary_key)
  end
end
Now, I can jump into tmux, open a couple panes, and start two processes:
|--------------------------------------------------------
|
| $ rake sync_mastery_scores[1,50000]
|
|--------------------------------------------------------
|
| $ rake sync_mastery_scores[50001,100000]
|
|--------------------------------------------------------
The EC2 instance I needed to run the script from had two cores, so two processes seemed like a reasonable number, but there’s nothing stopping us from starting three, or a dozen.
Note: There are a number of other ways to get two processes running (the parallel gem, IO::popen, fork, etc), but I wanted two independent panes for printing out the current progress, and this was the simplest way I knew of doing that.
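For reference, a fork-based version of the same split might look something like the sketch below. This is not what I ran; it only illustrates the alternative, and it assumes it is executed from inside an already-loaded Rails/Rake context, with each child re-establishing its own database connection after the fork:

# Sketch only: run the two id ranges in separate forked processes instead of
# separate tmux panes
ranges = [[1, 50_000], [50_001, 100_000]]

ranges.each do |start_id, max_id|
  fork do
    # Each child process needs its own database connection
    ActiveRecord::Base.establish_connection
    Rake::Task["sync_mastery_scores"].invoke(start_id, max_id)
  end
end

Process.waitall # wait for both children to finish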
Multi-Threaded
The multi-threaded approach is undeniably more complex. In fact, it led to its own blog post! I’m reprinting the final code below, and if you’re curious feel free to dive into the earlier post.
task :sync_mastery_scores, [:start_id, :max_id] => :environment do |_, args|
  start_id = args[:start_id].to_i
  max_id = args[:max_id].to_i
  ids = ( start_id..max_id )

  thread_each(n_threads: 2, ids: ids, batch_size: BATCH_SIZE) do |first_id, last_id|
    student_ids = ( first_id..last_id )

    score_by_primary_key = OldMasteryCalculator.data_for_student_ids(student_ids)
    StudentTopic.batch_update_new_mastery(score_by_primary_key)
  end
end
# Executes a block of code for all ids in a range, in batches of a
# specified size, and distributed across a specified number of threads
#
# @param n_threads, Fixnum, number of threads to create
# @param ids, Range, ids to be distributed between the threads
# @param batch_size, Fixnum, how many ids to pass into the block at a time
# @param block, Block, execute this block passing in a start_id and last_id for
#               each batch of ids
def thread_each(n_threads:, ids:, batch_size:, &block)
  # Touch these constants up front so Rails autoloading doesn't happen
  # concurrently inside the threads
  _preload = [
    StudentTopic,
    OldMasteryCalculator
  ]
  ActiveRecord::Base.connection_pool.release_connection

  threads = []
  ids_per_thread = (ids.size / n_threads.to_f).ceil

  (0...n_threads).each do |thread_idx|
    thread_first_id = ids.first + (thread_idx * ids_per_thread)
    thread_last_id = thread_first_id + ids_per_thread
    thread_ids = (thread_first_id...thread_last_id)

    threads.append Thread.new {
      puts "Thread #{thread_idx} | Starting!"
      thread_ids.step(batch_size) do |id|
        block.call(id, id + batch_size - 1)
      end
      puts "Thread #{thread_idx} | Complete!"
    }
  end

  threads.each { |t| t.abort_on_exception = true } # make sure the whole task fails if one thread fails
  threads.each { |t| t.join }                      # wait for all the threads to complete
end
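To make the splitting concrete, here is how the ranges work out for a hypothetical run over ids 1 through 100,000 with two threads:

# ids = (1..100_000), n_threads: 2
# ids_per_thread = (100_000 / 2.0).ceil  # => 50_000
#
# Thread 0 gets (1...50_001),       i.e. ids 1..50_000
# Thread 1 gets (50_001...100_001), i.e. ids 50_001..100_000
#
# Each thread then steps through its own range in batches of batch_size ids.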
One Last Thing About Logging
In the end, I was able to run the script successfully on our staging environment so our work could be tested. I could run two processes, each with two threads, and get the job done in about 8 hours.
Between running it on staging and running it on production, Marica noticed something on the staging server: about 2 GB of logs in an 8-hour period. The whole time the script had been running, it had been pumping out ream after ream of logging data to the filesystem. So before running the script on production, I added one little function and called it at the start of the script:
def turn_off_logging!
  nil_logger = Logger.new(nil)
  Rails.logger = nil_logger
  ActiveRecord::Base.logger = nil_logger
end
Some Final Considerations
I wish I could tell you exactly how long it finally took to run on production, but in the end I ran into a couple hiccups that prevented me from getting a good measurement.
For one, I discovered that running more than 4 tasks at a time (2 processes x 2 threads) was enough to affect performance of the production site, even at the very low usage we experience in the late evening.
In addition, I didn’t have a good way to resume from where I left off when something went wrong (e.g. slowing down production), so when I did hit a hiccup I had to kill the tasks and start them over.
To minimize the pain of “starting over”, I ran those 4 tasks over smaller chunks of the user space (user_ids 0-200,000, then 200,001-400,000, etc) over the course of two nights.
Using shorter-running tasks also helped me adjust when I discovered that backfill performance was slowing down as the student_ids got larger.
Because of how NoRedInk has changed over the years, early users produced much less mastery data than newer users.
Lowering the batch size for newer users helped to level out the backfill performance.
At the start of each chunk, I was able to do a trial run for a minute or two, judge the performance, and adjust the batch size to speed things up.
Fortunately, we designed the backfilling procedure to be idempotent (each run simply recomputes and overwrites new_mastery_score), so it was harmless to kill and restart the tasks to my heart's content!
Thanks
And that’s everything I learned. Thanks for coming along on this journey. Hopefully this will help you avoid some of the pitfalls I ran into along the way. If you notice anything I missed or got wrong, I’d love to hear about it and keep learning - please write to me and let me know. Thanks!
Josh Leven
@thejosh
Engineer at NoRedInk