Tag: production migration

  • A few years ago, at Wifinity, around 4 in the morning, I made one of the hardest decisions of my career: I rolled back a migration we’d spent months preparing for.

    We were migrating the production infrastructure of an ISP serving 200k+ customers. Traffic was already flowing through the new environment in read-only mode, the cutover was minutes from its point of no return, and we’d burned every minute of buffer trying to fix an issue that had surfaced. As we hit the edge of the rollback window, I made the call to bail.

    The rollback executed like the most rehearsed ballet production I’ve ever seen. We re-ran the migration cleanly a week later. Zero customer complaints.

    This post is about how rollback became the thing that made the migration work.

    The situation

    I’d inherited a production environment from a third-party agency that had been running our ISP’s software commingled with services for unrelated clients: one shared cluster, one database server hosting multiple databases, our services intermixed with theirs. There was no clean way to separate “our” infrastructure without fully migrating off.

    So we replicated the entire production stack into our own AWS environment, set up database replication between the two, and planned a single overnight cutover. The software was authenticating every Wi-Fi user across military bases, holiday parks, hotels and universities. Communication between the application and the network gear ran in both directions.

    And we had two overnight windows, a week apart, in which to do it. The expectation going in was that one window would be enough. I negotiated for two, because I like building wide margins of error into anything this consequential. If we missed both, the next opportunity wasn’t for another six months.

    Those two windows were both the earliest and the latest realistic options we had. The months on either side were full of problematic dates. It was tight enough that one of the windows fell on my wife’s birthday. She was understanding. Still, as an apology I bought her the Hogwarts LEGO castle. In our relationship it’s become a meme. She asks if work is making me miss her birthday again so she can pick out the next big apology LEGO set.

    The reverse runbook

    Every step of the forward runbook had a corresponding reverse step. About 120 steps in total. Some reverse steps were no-ops. Most weren’t. We wrote them in parallel with the forward steps. Writing the forward runbook and “thinking about rollback later” was never on the table.
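
    To make the pairing discipline concrete, here’s a minimal sketch in Python. The step names, commands, and the data structure are hypothetical illustrations, not our actual runbook; the point is that a step missing its reverse fails during review, not on migration night.

    ```python
    from dataclasses import dataclass

    @dataclass
    class Step:
        """One runbook entry: a forward action paired with its reverse."""
        name: str
        forward: str  # what to do during the cutover
        reverse: str  # what to do during rollback; a no-op must say so

    # Hypothetical entries for illustration; the real runbook had ~120.
    RUNBOOK = [
        Step("dns-lower-ttl",
             forward="drop TTL on portal DNS records to 60s",
             reverse="restore TTL to 3600s"),
        Step("app-read-only",
             forward="switch the application to read-only mode",
             reverse="re-enable writes against the old primary"),
        Step("dns-cutover",
             forward="point portal records at the new load balancer",
             reverse="point portal records back at the old one"),
    ]

    def check_rollback_coverage(runbook):
        # Because forward and reverse are written together, this check
        # fires while planning, never mid-migration.
        missing = [s.name for s in runbook if not s.reverse.strip()]
        if missing:
            raise ValueError(f"steps without a reverse: {missing}")

    check_rollback_coverage(RUNBOOK)
    ```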

    Treating rollback as a first-class artifact had design consequences.

    Engineer the point of no return to be as late as possible. The new infrastructure ran in read-only mode for as long as we could keep it there. Traffic flowed through it, customers got internet directly without going through the captive portal, and no new state was being written to the new system. Until we crossed the read-only-to-read-write boundary, the rollback was always available. We could test that things were working properly and still bail out.
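
    One way to make that boundary explicit is an application-level write gate, with a single, late runbook step that flips it. A hedged sketch, in which the flag name and mechanism are hypothetical rather than a description of what we actually ran:

    ```python
    import os

    # Hypothetical write gate. While the flag is set, every state-changing
    # path refuses to touch the new stack, so no state accumulates that
    # the old system would miss; rollback stays a pure traffic re-point.
    READ_ONLY = os.environ.get("CUTOVER_READ_ONLY", "1") == "1"

    class WriteBlockedError(RuntimeError):
        """Raised when a write is attempted before the point of no return."""

    def guard_write(operation: str) -> None:
        # Called at the top of every write path during the cutover window.
        if READ_ONLY:
            raise WriteBlockedError(
                f"refusing {operation!r}: still in the read-only phase")
    ```

    Flipping the flag off is then the literal point of no return, and everyone can see exactly where that step sits in the runbook.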

    Keep users online by default. Our internal principle: it’s better to give internet access and not charge for it than to not give internet access at all. With one exception for a customer with specific security requirements, that principle simplified a lot of judgment calls during the cutover.

    Handle the legally-required logs inside the rollback envelope. UK law required us to keep connection logs for a defined retention period. During the read-only phase we captured them via a separate path, in a separate format, in a separate location. We made the deliberate choice to never integrate those logs back into the new system. They sat in storage for the retention period and that was that. Sometimes the right call is to accept a small permanent ugliness in exchange for never having to think about it again.
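
    Since everything lived in AWS, the “write once, never read back, expire automatically” pattern maps naturally onto object storage with a lifecycle rule. A sketch under assumptions: the bucket, prefix, and 365-day figure below are placeholders, not our actual names or the actual mandated retention period.

    ```python
    import boto3

    s3 = boto3.client("s3")

    RETENTION_DAYS = 365  # placeholder, not the actual mandated period

    # Logs captured during the read-only phase land under one prefix and
    # age out on their own: nothing reads them back into the new system,
    # and nobody has to remember to delete them.
    s3.put_bucket_lifecycle_configuration(
        Bucket="cutover-connection-logs",  # hypothetical bucket name
        LifecycleConfiguration={
            "Rules": [{
                "ID": "expire-after-retention-period",
                "Filter": {"Prefix": "readonly-phase/"},
                "Status": "Enabled",
                "Expiration": {"Days": RETENTION_DAYS},
            }]
        },
    )
    ```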

    The best workflow tool was…

    The migration was more than 100 individual steps, each with its reverse counterpart, forming a dependency graph that split and merged, with different people on different teams performing different actions. Each merge point meant one person stopping and waiting for someone else to catch up. Violating these dependencies could, in some cases, have been catastrophic.

    We tried workflow tools that could express all of this, but in the end, the best tool was a spreadsheet. It let us easily reorder steps, which we did a lot during planning, and it was flexible enough to capture everything else.
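
    A spreadsheet doesn’t have to mean zero automation, though. A small script can re-validate the plan after every reordering. This sketch assumes a hypothetical CSV export with step_id and depends_on columns, which is not the format we used:

    ```python
    import csv
    from graphlib import TopologicalSorter, CycleError  # Python 3.9+

    def load_deps(path):
        """Read a CSV export with columns step_id and depends_on,
        where depends_on is a semicolon-separated list of step ids."""
        deps = {}
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                deps[row["step_id"]] = [
                    d for d in row["depends_on"].split(";") if d]
        return deps

    def check_plan(deps):
        try:
            order = list(TopologicalSorter(deps).static_order())
        except CycleError as exc:
            raise SystemExit(f"impossible plan, circular dependency: {exc}")
        # Steps with more than one dependency are the merge points where
        # one person stands by, waiting for another to catch up.
        merges = [step for step, d in deps.items() if len(d) > 1]
        print(f"{len(order)} steps ordered; merge points: {merges}")

    if __name__ == "__main__":
        check_plan(load_deps("runbook.csv"))  # hypothetical filename
    ```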

    Adversarial rehearsals

    We rehearsed the migration repeatedly over the months before the windows. The format was a mix: some live commands against the real environment, some test commands against a test environment that didn’t quite mirror production, and a lot of tabletop walkthroughs. We didn’t have a clean production analogue to play with, so we leaned on inspecting real systems and on injecting faults verbally: “assume this server is not responding from now on, what happens next, can we finish the migration?”

    Each rehearsal was more adversarial than the last. Late in the cycle we ran one where every engineer in the company was invited and asked to aggressively poke holes. They found issues. We kept rehearsing until nobody could think of any more.

    In the final rehearsal, I dropped from the call on purpose, without warning. My role was orchestrating the steps, and I wanted to know whether the team could continue without me. Someone picked up the orchestration within a minute and the rehearsal carried on. By the time of the real migration I knew the team didn’t need me present for it to succeed. You know, in case my own ISP decided to kill my internet that evening. We had contingency plans for losing other engineers too, but those were more complex, and in some scenarios the safe move was to roll back.

    A note on team shape. Most of my engineers had been at the company about six months. Almost nobody had deep tenure on the inherited system. The rehearsals were both a stress test of the plan and the team’s training course in how the system actually worked. The veterans we had (on the network side, plus a few from the original agency) brought essential know-how and surfaced questions we wouldn’t have thought to ask.

    The decision

    In the first window, a database incompatibility surfaced after traffic was already switched, with the new infrastructure running in read-only mode. We worked the problem. We burned every minute of buffer. As we hit the edge of the rollback window, I had to choose: keep trying to fix it, or bail.

    Even with all the preparation, the call was one of the hardest of my career. The competing voices in my head: maybe one more hour and we solve it. What if the rollback itself fails on something we missed? Which is riskier: solving an unknown unknown under time pressure, or executing a rehearsed rollback? And underneath all of that, the knowledge that I’d planned the second window as a buffer never to be used, and that consuming it meant I’d planned for two windows when I should have planned for three.

    I’d file that as a transferable lesson: buffer that gets used isn’t buffer. If your plan needs all N windows to succeed, you didn’t have margin; you had a fallback. Plan for one more than you think you need.

    I called the rollback. It ran beautifully. Smooth, quick, parallel across teams. We landed back on the original infrastructure right at the edge of our time budget. The rehearsals had paid off.

    The week between

    The week between the two windows broke down into roughly one day to understand and mitigate the database issue (a calm second look, away from the pressure of an active cutover, made it tractable) and the rest of the week spent double-checking and hunting for further unknown unknowns. We didn’t run more rehearsals.

    The mood I shielded my team from: I had dread going in. I was confident in the rollback if we needed it (we’d just proved it worked), but a second failure would no longer have been a tech problem. It would have been a business problem, with consequences I’d be answering for. I don’t think the team felt it. I made sure of that.

    The second window ran clean. We had a five-hour window booked. We were done in under three. One of the booked hours was budget reserved for a rollback we didn’t need.

    We monitored customer support tickets live during the cutover. Customers complain about Wi-Fi immediately. At one point during the migration the support monitor interrupted the call to announce a ticket about no internet access. Everyone’s heart stopped. We started checking systems. Then someone re-read the ticket carefully: it was a customer of one of our competitors who had misdialled us. Zero real complaints.

    I’m now ready for my next big migration, but I’ll try to avoid my wife’s birthday. LEGO is expensive.