We cannot avoid human-made disasters, but we can prepare for them

Yury Niño
3 min read · May 25, 2020

This is the story of how a picture became an article :) I wanted to share with my coworkers an illustration I made to explain the roles involved in a Gameday (you can find it at the end of this post), but then I read this quote on Twitter:

Without context, data does not have meaning at all.

I decided that it needed some context. I started jotting down a few ideas, and by the end of my Sunday they had become an article :O. I hope you find these ideas and references useful and that you don’t get bored while reading the context, because the dessert, my picture, comes at the end :)

Disasters such as the Apollo 1 catastrophe in 1967, the Chernobyl nuclear accident in 1986 and the collapse of the Space building in Medellín have taught us that we, humans, are central to both the problem and the solution of incidents in engineering. As Andy Fleener mentioned in [2], “we are simultaneously the reason there’s a need and the solution to solve that need”.

These tragedies involved loss of situational awareness, procedural violations, regulatory shortfalls and managerial deficiencies. Even so, human error cannot be seen as the cause of a mishap; it is the symptom of deeper trouble [3]. This connection between humans and their tools, tasks and operational or organizational environment is known as Human Factors. Human Factors have been studied in many fields, as illustrated in the next figure.

As a result of applying psychology to the engineering and design of products, processes, and systems, some researchers have concluded that arriving at the edge of chaos allows us to see the entropy. According to Sidney Dekker, at the edge of chaos systems have tuned themselves to the point of maximum capability [3].

In this sense, in Chaos Engineering we try to reach this edge through simulated disaster exercises called Gamedays. A Gameday is a practice event hosted to conduct chaos experiments against components of a system in order to validate or invalidate a hypothesis about the system’s resiliency in real-world turbulent conditions.
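To make the idea of validating or invalidating a hypothesis a bit more concrete, here is a minimal Python sketch of a single chaos experiment. It is only an illustration under stated assumptions: the names `Experiment`, `run_experiment`, `ToyService` and `replica_loss_experiment` are hypothetical, they do not come from any chaos engineering tool, and the "system" is a toy object rather than a real application.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """A chaos experiment: a hypothesis plus the fault used to challenge it."""
    hypothesis: str
    steady_state: Callable[[], bool]   # True while the system behaves normally
    inject_fault: Callable[[], None]   # introduces the turbulent condition
    rollback: Callable[[], None]       # restores the system afterwards


def run_experiment(exp: Experiment) -> str:
    """Return 'aborted', 'validated' or 'invalidated' for one experiment."""
    # Never inject a fault into a system that is already unhealthy.
    if not exp.steady_state():
        return "aborted"
    exp.inject_fault()
    try:
        held = exp.steady_state()
    finally:
        # Always undo the fault, even if the check itself fails.
        exp.rollback()
    return "validated" if held else "invalidated"


class ToyService:
    """Hypothetical stand-in for a real system: a service with N replicas."""

    def __init__(self, replicas: int) -> None:
        self.total = replicas
        self.healthy = replicas

    def kill_replica(self) -> None:
        self.healthy = max(0, self.healthy - 1)

    def restore(self) -> None:
        self.healthy = self.total

    def is_serving(self) -> bool:
        return self.healthy > 0


def replica_loss_experiment(service: ToyService) -> Experiment:
    """Build the experiment 'the service survives losing one replica'."""
    return Experiment(
        hypothesis="The service keeps serving when one replica is lost",
        steady_state=service.is_serving,
        inject_fault=service.kill_replica,
        rollback=service.restore,
    )


if __name__ == "__main__":
    for replicas in (2, 1):
        result = run_experiment(replica_loss_experiment(ToyService(replicas)))
        print(f"{replicas} replica(s): hypothesis {result}")
```

With two replicas the hypothesis is validated, and with a single replica it is invalidated. In a real Gameday the fault would be injected into actual components and the steady-state check would rely on the team’s monitoring, but the shape of the exercise stays the same: check, inject, observe, roll back.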

Gamedays usually last between two and four hours, although one can take a whole day. They involve the teams who develop, operate, monitor, and/or secure an application, but not all of these roles are needed to run experiments. That is probably why there are so many styles and formats. Two of them, Dungeons & Dragons and Informed in Advance, were described by Russ Miles in [4].

It is very important to pick a Gameday style that fits the organization’s size, people’s skills, time, resources and budget. I think that, ideally, Gamedays should involve members from a combination of areas working collaboratively.

I told this whole story just to show an illustration that I made :) with the roles that should be included in a Gameday.

References

[1] https://interestingengineering.com/23-engineering-disasters-of-all-time

[2] Chaos Engineering Book.

[3] The Field Guide to Understanding ‘Human Error’ Book.

[4] Learning Chaos Engineering Book.

Written by Yury Niño

Cloud Infrastructure Engineer @Google. Chaos Engineer Advocate. Loves building software applications, DevOps, Security and SRE
