
26 January 2008

Asking Why in IT

Thanks to reader Alex for sending me a detailed article on how one IT system administrator used the five whys to solve a network connectivity problem. 

At 3:30 in the morning of January 10th, 2008, a shrill chirping woke up our system administrator, Michael Gorsuch, asleep at home in Brooklyn. It was a text message from Nagios, our network monitoring software, warning him that something was wrong. Michael logged onto his computer in the other room and discovered that one of the three data centers he runs, in downtown Manhattan, was unreachable from the Internet.

After a couple more occurrences, the culprit was identified.

The problem was something with the network switch. Michael temporarily took the switch out of the loop, connecting our router directly to Peer 1's router, and lo and behold, we were back on the Internet.  Michael spent some time doing a post-mortem, and discovered that the problem was a simple configuration problem on the switch. The switch that failed had been set to autonegotiate. This usually works, but not always, and on the morning of January 10th, it didn't.

After this experience he got to thinking about "uptime" in general, and the problems of outlier events.

Internet providers like Peer 1 like to guarantee the uptime of their services in terms of a Service Level Agreement, otherwise known as an SLA. A typical SLA might state something like "99.99% uptime." When you do the math, let's see, there are 525,949 minutes in a year, so that allows them 52.59 minutes of downtime per year. If they have any more downtime than that, the SLA usually provides for some kind of penalty.
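The arithmetic in that passage is easy to check. Here is a minimal sketch in Python, using the same 525,949-minute year as the quote; the function name and the error handling are illustrative assumptions, not anything from the original article.

```python
# Minutes in a year, as used in the quoted passage.
MINUTES_PER_YEAR = 525_949


def allowed_downtime_minutes(uptime_percent, minutes_in_period=MINUTES_PER_YEAR):
    """Return the downtime (in minutes) an SLA permits over the period.

    uptime_percent is the guaranteed uptime, e.g. 99.99 for "99.99% uptime".
    """
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return minutes_in_period * (100 - uptime_percent) / 100


if __name__ == "__main__":
    for sla in (99.9, 99.99, 99.999):
        budget = allowed_downtime_minutes(sla)
        print(f"{sla}% uptime allows about {budget:.2f} minutes of downtime per year")
```

Each extra "nine" cuts the downtime budget by a factor of ten, which is why the last minute of uptime is so much more expensive than the first.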

Keeping internet services online suffers from the problem of black swans. Nassim Taleb, who invented the term, defines it thus: "A black swan is an outlier, an event that lies beyond the realm of normal expectations." Almost all internet outages are unexpected unexpecteds: extremely low-probability outlying surprises. They're the kind of things that happen so rarely it doesn't even make sense to use normal statistical methods like "mean time between failure."

There must be a better way to deal with such events... and he discovered the five whys.

Somewhere between the "extremely unreliable" level of service, where it feels like stupid outages occur again and again and again, and the "extremely reliable" level of service, where you spend millions and millions of dollars getting an extra minute of uptime a year, there's a sweet spot, where all the expected unexpecteds have been taken care of.

To reach this sweet spot, we borrowed an idea from Sakichi Toyoda, the founder of Toyota. He calls it Five Whys. When something goes wrong, you ask why, again and again, until you ferret out the root cause. Then you fix the root cause, not the symptoms.

Applying that methodology, he identified a preventative approach.

  • Our link to Peer1 NY went down
  • Why? – Our switch appears to have put the port in a failed state
  • Why? – After some discussion with the Peer1 NOC, we speculate that it was quite possibly caused by an Ethernet speed / duplex mismatch
  • Why? – The switch interface was set to auto-negotiate instead of being manually configured
  • Why? – We were fully aware of problems like this, and have been for many years.  But - we do not have a written standard and verification process for production switch configurations.
  • Why? – Documentation is often thought of as an aid for when the sysadmin isn’t around or for other members of the operations team, whereas, it should really be thought of as a checklist.
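The fourth why points to a missing written standard and verification process. As a rough illustration of what "documentation as a checklist" could look like, the sketch below checks a switch's port settings against one rule: speed and duplex must be set explicitly, never left on auto-negotiate. The data layout, port names, and function names are all hypothetical; the article does not describe the team's actual standard or tooling.

```python
# Hypothetical written standard: every production port must have speed and
# duplex configured explicitly; "auto" (auto-negotiate) is not allowed.
REQUIRED_SETTINGS = ("speed", "duplex")
FORBIDDEN_VALUE = "auto"


def check_port(name, settings):
    """Return a list of human-readable violations for one port."""
    problems = []
    for key in REQUIRED_SETTINGS:
        value = settings.get(key)
        if value is None:
            problems.append(f"{name}: {key} is not set")
        elif str(value).lower() == FORBIDDEN_VALUE:
            problems.append(f"{name}: {key} is set to auto-negotiate")
    return problems


def check_switch(ports):
    """Check every port of a switch; ports maps port name -> settings dict."""
    problems = []
    for name in sorted(ports):
        problems.extend(check_port(name, ports[name]))
    return problems


if __name__ == "__main__":
    # Made-up example configuration, for illustration only.
    example_ports = {
        "uplink": {"speed": "auto", "duplex": "auto"},
        "server1": {"speed": "1000", "duplex": "full"},
    }
    for problem in check_switch(example_ports):
        print(problem)
```

Run before a switch goes into production, a check like this turns the standard into something that is verified rather than merely remembered.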

"Had we produced a written standard prior to deploying the switch and subsequently reviewed our work to match the standard, this outage would not have occurred," Michael wrote. "Or, it would occur once, and the standard would get updated as appropriate."

Not only are they fixing the root cause, they are telling their customers about the problem and solutions.  That creates value through increased confidence.

Instead of setting up a SLA for our customers, we set up a blog where we would document every outage in real time, provide complete post-mortems, ask the five whys, get to the root cause, and tell our customers what we're doing to prevent that problem in the future.

Wouldn't you appreciate a supplier that did this instead of simply filling out corrective action forms, probably documenting the corrective action for a problem that has occurred over and over again?



  • Copyright © 2004 - 2008
    Factory Strategies Group LLC.
    All rights reserved.