Often we read stories of applications built by large teams and successfully deployed. War stories is nothing like that. The reality is that most developers work on legacy code with its own trademark idiosyncrasies. War stories is a series of snippets from my professional career of developing applications: how plans sometimes do not work out, and how we dealt with systems when they decided to misbehave. I do not mention the names of companies; let’s say these stories are fictional ;).
War stories: Ghost of the past
During the lifecycle of a project, sooner or later someone will propose a rewrite. The code base has reached a state where it no longer matches the requirements, or further development is hindered by choices made early in the project. Developers start complaining, and management is baffled at how such a proposal ever came about.
Rebuilding an application from the ground up is always risky. The existing system is not only a set of features but also a bundle of workarounds and hacks hidden beneath layers of code. I am not here to weigh the merits of a rewrite or any other strategy. This is a story of a hack, a launch delay and a bit of stress.
After 6 weeks of intensive development of the web application, we were ready to launch. We had spent two weeks testing our application and comparing its behaviour with the existing system to ensure the migration would be smooth for all the stakeholders. We did not modify the existing application’s deployment pipeline or any of the subdomains involved, so that the existing application would always be available for comparison or rollback.
Let’s hit the button.
Deployment was handled with Jenkins, so it was as simple as pressing a button. Once the application was deployed to Kubernetes, we went to the URL only to be greeted by a fabulous white page. Don’t panic, the pods are probably not ready yet, the issue will resolve itself in a minute. A minute later, we open the browser in incognito mode and it works.
Roses are Red, Violets are blue, so is production.
We start receiving emails, Slack messages and visits from people on the sales floor. Production is down. We do the typical developer thing: look, it works on my machine. We show that it works in Firefox, Opera and Edge. Then we get to Chrome, and we get a white page as well.
Panic. Let’s roll back. Where is DevOps? Not present. Long story.
Time to solve a problem under mild stress; it is only production that doesn’t work. Wait. Why does it work in incognito mode? This is a cache issue. We clear the application cache on the affected machines and all is good. Wait. How do our customers access the web application? Wait. Let’s find a machine that still has the cache. It is the service worker, with a one-month cache. This problem will resolve itself in one month. Problem solved.
How does one force all the machines to unregister the service worker that was part of the previous application? Nobody on the team knew how that service worker cache was set up, or how to set up a service worker. At that point, I had only heard about service workers; I had never worked on one or read about them in depth.
Let’s Google This.
How do I clear the cache remotely on everyone’s computer? Investigate whether I can inject any code. Wait. Is that Google Tag Manager loading first? I google how to remove a service worker, hastily copy the code and inject it via Tag Manager. Refresh. Page loads. After the initial fix, we improved the code to cover some edge cases and kept it in Google Tag Manager for the next month.
The story is full of bad practices. What happened in the span of a mere 20-30 minutes will always be engraved in my memory. At the same time, I am happy with the solution. I am aware that there are myriad other solutions we could have deployed at the time, but this one offered us the quickest way to get the code into production.
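For the curious, here is a minimal sketch of what such a snippet can look like. It is not the code we actually pasted that day; the helper name removeServiceWorkers and the small interfaces are made up for illustration, and clearing Cache Storage and reloading once are my own assumptions about sensible edge-case handling. It relies only on the standard browser calls navigator.serviceWorker.getRegistrations(), registration.unregister(), caches.keys() and caches.delete(). A Tag Manager custom tag takes plain JavaScript, so this would have to be compiled first.

```typescript
// Minimal structural types so the sketch compiles without the DOM lib.
interface RegistrationLike {
  unregister(): Promise<boolean>;
}
interface ServiceWorkerContainerLike {
  getRegistrations(): Promise<readonly RegistrationLike[]>;
}
interface CacheStorageLike {
  keys(): Promise<string[]>;
  delete(name: string): Promise<boolean>;
}

// Hypothetical helper: unregisters every service worker for this origin and,
// if a cache storage is passed in, deletes every named cache.
// Returns counts so the caller can decide whether a reload is worthwhile.
async function removeServiceWorkers(
  container: ServiceWorkerContainerLike,
  cacheStorage?: CacheStorageLike,
): Promise<{ unregistered: number; cachesDeleted: number }> {
  const registrations = await container.getRegistrations();
  const results = await Promise.all(registrations.map((r) => r.unregister()));
  const unregistered = results.filter(Boolean).length;

  let cachesDeleted = 0;
  if (cacheStorage) {
    const names = await cacheStorage.keys();
    const deleted = await Promise.all(names.map((n) => cacheStorage.delete(n)));
    cachesDeleted = deleted.filter(Boolean).length;
  }
  return { unregistered, cachesDeleted };
}

// Browser entry point; does nothing where service workers are unavailable.
const env = globalThis as any;
if (env.navigator?.serviceWorker) {
  removeServiceWorkers(env.navigator.serviceWorker, env.caches)
    .then(({ unregistered }) => {
      // Assumption: reload once so the page is fetched from the network.
      // With nothing left to unregister there is no second reload.
      if (unregistered > 0) env.location?.reload();
    })
    .catch(() => {
      // Best effort: a failure here should never break the page itself.
    });
}
```

Passing the container and the cache storage in as parameters keeps the helper testable outside a browser. The catch is that this only works if the injected script actually runs on the stale page, which in our case it did because Tag Manager loaded first.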
Prevent and Avoid
The incident was not caused by a single mistake or a single person within the organisation. Adding more steps to the deployment, or increasing the number of manual and automated tests, could have prevented the issue. The key word being could, since it is such a small edge case that nobody on the team predicted it. The only technical solution that could have helped stop the broken application from reaching everyone in production is a canary release. Canary releases have their own set of complexities and fun, and might introduce other problems (hello, multiple versions of the application running at the same time).
Let’s be honest, things happen. The question is, are you ready when they do?
It is a reflection of the culture, leadership and management team of the company.