Stockholm Chaos & Resilience Engineering Day 2019

3rd European Chaos Engineering Day:
Wednesday 4 December 2019, 9:00 – 16:30
KTH Main Campus

Sponsored by the CASTOR Software Research Centre

2018 Edition / 2017 Edition



  • 8:30 – 9:00 Welcome coffee
  • 9:00 – 9:15 Opening
  • 9:15 – 10:00 Keynote “From Being Wrong(TM) to A Superpower in One Step” (Russ Miles)
  • 10:00 – 10:30 Coffee break
  • 10:30 – 11:30
  • 11:30 – 13:30 Lunch break
  • 13:30 – 14:30 Demo session
  • 14:30 – 15:00 Coffee break
  • 15:00 – 16:00
  • 16:00 – 16:15 Closing

Registration website:


[Keynote] Russ Miles, CEO of ChaosIQ, discusses the tools and techniques he uses to turn inevitably being wrong into being successful at being wrong. Being wrong can be turned to your advantage, and Russ shares stories of how this has happened, along with the challenges to look out for. Being wrong is often seen as the worst thing that can happen™, especially when you architect, build, and run business-critical applications and services. But the increased velocity of modern software development, together with the growing need for systems to be resilient, reliable, and right, has put exponentially more pressure on teams, and on architects in particular. Never before have software owners had such an opportunity, or such power, to be wrong. We need to get better at being wrong.

Barry O’Reilly is the founder of Black Tulip Technology and creator of Antifragile Systems Design. Barry is a CITA-P certified Architect who has held Chief Architect positions at Microsoft and iDesign. He has also been a startup CTO, the Worldwide Lead for the Solutions Architecture Community at Microsoft, and founder of the Swedish Azure User Group. He is also a trainer for IASA. He is currently embarking on a research career in the area of resilience, complexity science, and software engineering.

Barry will talk about techniques that allow us as architects to make pragmatic, evidence-based decisions about the boundaries and granularity of components for systems that will operate in complex contexts. He will present simple tools for quickly modelling systems, their context, and their ability to cope with fluctuation in that context, so that we can make decisions with confidence using critical thinking rather than the copy/paste approaches of pattern libraries.

Julien Bisconti is a site reliability engineer.

Abstract: The greatest source of complexity is not the services themselves, but the communication between them. These concerns can be addressed by integrating libraries, but that leads to library bloat. A service mesh is a network for services, not for bytes. It is an inter-service communication infrastructure that routes traffic by configuring a proxy running as a sidecar to each service. Starting to do chaos engineering can seem like a daunting task if one has never practiced it before. Doing chaos engineering with a service mesh is trivial and safe. It opens up a new level of what is possible with a small team. In this talk, we will outline what a service mesh is and how it helps us do chaos engineering, all running on Kubernetes. Once the concept of a service mesh is understood, starting to do chaos engineering becomes simpler and easier. The talk will include two demos: the first explains the inner workings of a service mesh (we will use Istio, one of the dominant service meshes), and the second shows an example of chaos engineering on top of Istio.

Long Zhang is a Ph.D. student in computer science at KTH Royal Institute of Technology, Sweden. His research focuses on chaos engineering, self-healing software, and antifragile systems. Long received his BE and ME degrees in software engineering from Harbin Institute of Technology, China. Before his Ph.D. studies, Long worked at Tencent as a software developer and project manager, responsible for the design and development of university-enterprise cooperation projects. In this talk, he will present RoyalChaos, a GitHub repo that contains the team’s research work on application-level chaos engineering. He will then briefly demonstrate the TripleAgent and POBS projects.

Markus Weninger is a teaching and research assistant at the Institute for System Software at the Johannes Kepler University Linz, Austria.

In this talk, he presents the AntTracks Analyzer, a memory analysis tool developed at the Johannes Kepler University Linz. Specifically, he will show how the tool guides users in analyzing memory leaks and high memory churn. The basic idea is that the tool automatically detects and highlights the most important information on the screen, explains why it is important, and suggests which next steps are appropriate based on these findings. This way, the user is guided through the whole analysis process and can explore the root cause of a problem even without prior experience.

Paris Carbone is an open source committer at the Apache Foundation and a senior computer scientist, currently serving as the leader of the “Continuous Deep Analytics” group at RISE.

Paris will talk about data stream processing pipelines that involve tens to hundreds of compute instances, which exchange messages and leave side effects on internal state as well as on external systems (databases, logs, etc.). Anything from a single partial failure (e.g., a process or network channel failure) to a complete datacenter disaster can produce incorrect side effects. To avoid this, Apache Flink has an underlying snapshotting mechanism that captures state changes correctly and transparently. This talk offers a rigorous overview of Flink’s state-of-the-art error avoidance approach, which has been serving hundreds of production pipelines in recent years. Paris further covers how the same mechanism can be exploited for many uses beyond fault tolerance, such as provenance, reconfiguration, debugging, pipeline migration, and external access isolation on top of stream pipelines.


  • Chaos engineering principles and tools
  • Chaos monkey, monkey testing in the field
  • DevOps tools and approaches
  • Site reliability engineering
  • Automated recovery and remediation
  • Software antifragility
  • Production support for monitoring distributed systems
  • Automatic software repair
  • Error and anomaly detection
  • Chaos & cloud elasticity / scalability
  • Continuous integration, testing and deployment

Practical information:

Language of the workshop: English

Meals: lunch & coffee for registered participants, wireless included 🙂

Organizing committee: Long Zhang, Martin Monperrus, Maria Berthelius