This loss is definitely an attention grabber. Yes, you read it right.
$172,222 lost per second, for 45 straight minutes.

Incidents and glitches like this strike suddenly, and it can seem that no amount of preparation will prevent them. This company faced a loss so large that it was brought to the brink of bankruptcy, which raises the question: could DevOps have been the solution?

This is a classic infamous story, caused by “software glitches”, as some analysts call them.

So, what really are these “glitches” and what causes them? Can DevOps be a possible saviour?

Firstly, we need to understand that DevOps is not a magic wand solution.
Rather, it is a set of tools, practices and cultural changes, which have to be brought in over time.
It can also be considered a mind-set change, one that matures gradually, visibly demonstrates its value and can help prevent such situations from occurring.

Let’s review what happened and what could have been done to avoid it.

What Happened?

Knight Capital is a firm that specialises in executing trades for retail brokers.
SMARS is its automated, high speed, algorithmic router that sends orders into the market for execution.
New software code was rolled out in SMARS to enable customer participation in the New York Stock Exchange's Retail Liquidity Program (“RLP”), scheduled to commence on August 1, 2012.

During the deployment of the new code, however, one of the technicians did not copy the new code to one of the eight SMARS computer servers. No one realized that the older code had not been removed from the eighth server, nor that the new code had not been added. Notifications were not marked as “Error” and were ignored as a result.

In one of its attempts to address the problem, the firm uninstalled the new code from the seven servers where it had been deployed correctly. This action worsened the problem, leaving the older code active on all eight servers.

What Went Wrong

  • Issue 1: Code deployed to 7 out of 8 servers
    • One of the technicians missed deploying the new code to one of the eight servers
    • This is a manual error that could be avoided if deployment is automated: set up to deploy to a pre-defined “set of machines” that make up a deployment cluster.
    • This approach takes away the manual decision of where to deploy!
  • Issue 2: No review of deployments
    • The deployment was not reviewed from a bird’s eye view
    • Unless a modern deployment tool is used, there’s no easy way to figure out which component versions are deployed on each server.
  • Issue 3: Notifications that were not marked as errors were not taken seriously
    • Operations personnel get a ton of notification emails. Unless this barrage of emails is managed, with appropriate flags for alerts, it’s fairly easy to miss a critical event – as happened here.
  • Issue 4: Manual un-deploy steps carried out without taking cognizance of their impact
    • Without a consistent, defined process flow to deploy/un-deploy, erratic manual steps taken on the spur of the moment can cause more errors than they fix. And that’s exactly what happened here.
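
To make Issues 1 and 2 concrete, here is a minimal sketch of the idea, not a real deployment tool. The host names, the `FakeFleet` class and the version strings are all hypothetical, and the "servers" are simulated in memory so the example runs anywhere. The point is that the target cluster is defined once, every deployment goes to all of it, and a version check afterwards flags any server that was missed.

```python
from dataclasses import dataclass, field

# Hypothetical cluster definition: the eight hosts are named once, in one place.
PRODUCTION = [f"smars-{n:02d}" for n in range(1, 9)]


@dataclass
class FakeFleet:
    """In-memory stand-in for real servers (an assumption for this sketch)."""
    versions: dict = field(default_factory=dict)
    unreachable: set = field(default_factory=set)

    def install(self, host, version):
        if host in self.unreachable:
            return  # simulate a copy that silently never happened
        self.versions[host] = version


def find_drift(fleet, cluster, desired):
    """Return {host: actual_version} for every host not running `desired`."""
    return {
        host: fleet.versions.get(host)
        for host in cluster
        if fleet.versions.get(host) != desired
    }


def deploy(fleet, cluster, new_version):
    """Install on every host in the cluster, then verify the result."""
    for host in cluster:
        fleet.install(host, new_version)
    drift = find_drift(fleet, cluster, new_version)
    # The release counts as successful only if no host is out of line.
    return (not drift), drift


# One server quietly misses the rollout, as in the story above.
fleet = FakeFleet(
    versions={host: "1.0" for host in PRODUCTION},
    unreachable={"smars-08"},
)
ok, drift = deploy(fleet, PRODUCTION, "2.0")
print(ok, drift)  # False {'smars-08': '1.0'}
```

Nobody has to remember which machines belong to the cluster, and the post-deployment check turns a silent omission into an explicit failure.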

DevOps Postulation

A DevOps implementation, backed by cultural changes, is one that brings Development and Operations together to function as one cohesive unit. An approach of this kind is needed to prevent an incident like this from taking place.

From a tools standpoint, below is a list of leading tools in the DevOps deployment automation category:

All of these tools have the ability to mitigate the issues highlighted above.
They also offer features that can mitigate other issues of a similar nature.

Sample DevOps Illustration

Taking IBM UrbanCode Deploy as an example, a sample DevOps Illustration is presented below. Feature sets from this tool are linked to demonstrate actual functionality. This is not an endorsement of one tool over the other – similar features exist in all the above mentioned tools.

  • Issue 1: Code deployed to 7 out of 8 servers:
    • IBM UrbanCode Deploy can create “Environments”. Using this feature, an Environment called, say, Production can be created as a set of these 8 servers. Any deployment to Production will then roll out to all of these servers, based on their role.
  • Issue 2: No review of deployments
    • The “Multi-Tier Application Model” tracks which components make up an application so they can be deployed and tracked together. With Snapshots, it is easy to ensure that components that were tested together are released together.
    • A Desired-Vs-Original comparison can help you figure out the inconsistencies.
  • Issue 3: Notifications that were not marked as errors were not taken seriously
    • With Role assignments and “Quality Gates and Approvals”, only appropriate emails go to the right people, at each stage. These emails thus get the required attention.
  • Issue 4: Manual un-deploy steps carried out without taking cognizance of its impact
    • A “Process Designer” makes it easy to translate a cryptic manual process into an easy to understand automated flow.
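
The idea behind the Issue 3 mitigation can be sketched independently of any product. The following is an illustrative example only: the role names, severity levels, keywords and sample messages are all assumptions, not part of IBM UrbanCode Deploy or of the incident record. It shows notifications being classified by severity and routed only to the roles that need them, so that a critical message is not buried among routine ones.

```python
from collections import defaultdict

# Hypothetical severity-to-role mapping; a real team would define its own.
ROUTES = {
    "ERROR": ["on-call-operations", "release-manager"],
    "WARNING": ["on-call-operations"],
    "INFO": [],  # recorded only, nobody is emailed
}

# Hypothetical keywords that promote a message to ERROR even if it
# arrived without an error flag.
CRITICAL_KEYWORDS = ("disabled", "failed", "rejected")


def classify(message, declared_severity=None):
    """Pick a severity, escalating unflagged messages that look critical."""
    text = message.lower()
    if any(word in text for word in CRITICAL_KEYWORDS):
        return "ERROR"
    return declared_severity or "INFO"


def route(notifications):
    """Group (severity, message) pairs by the role that should receive them."""
    inbox = defaultdict(list)
    for message, declared in notifications:
        severity = classify(message, declared)
        for role in ROUTES.get(severity, []):
            inbox[role].append((severity, message))
    return dict(inbox)


result = route([
    ("Legacy order handler disabled on smars-08", None),  # sent unflagged
    ("Nightly backup finished", "INFO"),
    ("Disk usage at 80%", "WARNING"),
])
for role, items in result.items():
    print(role, items)
```

In this sample run the unflagged "disabled" message is escalated to ERROR and reaches both roles, while the routine backup notice reaches no inbox at all.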

So, where does all this lead us?

The story illustrated above is not unique.

Every day, we hear of “glitches” and “failures” – all of which can be traced back to one or more of the issues highlighted above.
The mitigations and solutions are not easy. After all, anything that can potentially save millions of dollars ought not to be easy!
DevOps is an exercise in foresight, and an effective implementation may come close to being the “Magic Wand”.
It takes time, patience and endurance to build the right solution.

This is the need of the hour.
DevOps cannot be “seen” as one tool that is deployed and DONE!
It’s a movement. It’s a culture.
It’s change.

It is also the way forward for businesses, not only to enhance overall effectiveness but also to be better prepared for sudden shocks like this one.

Author: Savinder Puri, DevOps expert at Zensar
