.. title: Kubernetes Failure Stories
.. slug: kubernetes-failure-stories
.. date: 2019/01/20 11:26:00
.. tags: kubernetes
.. link:
.. description:
.. previewimage: ../galleries/kubernetes-logo.png
.. type: text

.. image:: ../galleries/kubernetes-logo.png
   :class: left

I started to compile a `list of public failure/horror stories related to Kubernetes <https://github.com/hjacobs/kubernetes-failure-stories>`_.
It should make it easier for people tasked with operations to find outage reports to learn from.


.. TEASER_END

Since we started with Kubernetes at Zalando in 2016, we have collected many internal postmortems.
Docker bugs (`daemon unresponsive <https://github.com/moby/moby/issues/28889>`_, process stuck in pipe wait, ...) were a major pain point in the beginning, but Docker itself has become more mature and has not bitten us recently.
The biggest chunk of problems can be attributed to the nature of distributed systems and "cascading failures". For example, a Kubernetes API server outage should not affect running workloads, but `it did <https://github.com/zalando/skipper/issues/406>`_;
see also `our recent CoreDNS incident <https://twitter.com/sszuecs/status/1085292025895940097>`_.

We have shared some of our incidents and Kubernetes failures in talks:

* `Running Kubernetes in Production: A Million Ways to Crash Your Cluster - DevOpsCon Munich 2018 <https://www.slideshare.net/try_except_/running-kubernetes-in-production-a-million-ways-to-crash-your-cluster-devopscon-munich-2018>`_
* `Running Kubernetes in Production: A Million Ways to Crash Your Cluster - Container Camp UK 2018 <https://www.slideshare.net/try_except_/running-kubernetes-in-production-a-million-ways-to-crash-your-cluster-container-camp-uk>`_
* `Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRW meetup 2018 <https://www.slideshare.net/try_except_/kubernetes-on-aws-at-zalando-failures-learnings-devops-nrw>`_

My main motivation for giving such talks about failures is that **I want to hear more of them myself!** Nordstrom's `talk "101 Ways to Crash Your Cluster" at KubeCon 2017 <https://www.youtube.com/watch?v=xZO9nx6GBu0>`_ was my inspiration
(as you can even see from the similarity in talk titles ;-)). I hope to see more people share their postmortems and give failure talks.
Monzo's transparency and `public postmortem <https://community.monzo.com/t/resolved-current-account-payments-may-fail-major-outage-27-10-2017/26296/95>`_ are a great service to the community and something we should all strive for.

Compiling a List of Kubernetes Failure Stories
----------------------------------------------

On my quest to find more public Kubernetes failure stories, I discovered that either they are really hard to find (or my web search skills are lacking) or only very few have been published. Search terms I tried on `DuckDuckGo <https://duckduckgo.com/>`_ and Google:

* `kubernetes outage <https://duckduckgo.com/?q=kubernetes+outage>`_
* `kubernetes incident <https://duckduckgo.com/?q=kubernetes+incident>`_
* `kubernetes postmortem <https://duckduckgo.com/?q=kubernetes+postmortem>`_
* `kubernetes failure <https://duckduckgo.com/?q=kubernetes+failure>`_
* `kubernetes crash <https://duckduckgo.com/?q=kubernetes+crash>`_

I also tried various combinations, as well as "k8s", "kube-dns", and "kube-proxy" instead of "kubernetes". This did not yield many results, and most of the pages I found are really "success" stories that highlight how to prevent outages from happening.
That's boring!
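
As a minimal sketch, the term combinations above can be enumerated with a few lines of Python. The helper name ``search_urls`` and the two word lists are illustrative assumptions; only the DuckDuckGo query URL format is taken from the links above.

.. code-block:: python

   from itertools import product
   from urllib.parse import urlencode

   # Subjects and failure words mentioned in this post (illustrative lists).
   SUBJECTS = ["kubernetes", "k8s", "kube-dns", "kube-proxy"]
   FAILURE_WORDS = ["outage", "incident", "postmortem", "failure", "crash"]


   def search_urls(subjects=SUBJECTS, words=FAILURE_WORDS):
       """Return one DuckDuckGo search URL per (subject, failure word) pair."""
       return [
           # urlencode turns the space into "+", matching the links above.
           "https://duckduckgo.com/?" + urlencode({"q": f"{subject} {word}"})
           for subject, word in product(subjects, words)
       ]


   if __name__ == "__main__":
       for url in search_urls():
           print(url)

With the lists above, ``search_urls()`` produces 20 URLs, starting with the ``kubernetes outage`` query.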

The `compiled list of Kubernetes Failure Stories I found so far is available on GitHub <https://github.com/hjacobs/kubernetes-failure-stories>`_.
I hope to see many contributions to the list from the community, but I guess the hard part is encouraging people to publish their outage reports.
**Please contribute to the list** by opening an issue, creating a PR or `reaching out to me on Twitter <https://twitter.com/try_except_>`_!

.. image:: ../galleries/twitter-kubernetes-failure-stories.png
   :class: center
   :target: https://twitter.com/try_except_/status/1086582859224285184

What's Next
-----------

I'll be at a meetup in Hamburg in February to talk more about Kubernetes failures. Please join if you can: `"Let’s talk about Failures with Kubernetes!" meetup Hamburg <https://www.meetup.com/inovex-Meetup-Hamburg/events/258065688/>`_.

At Zalando, we will try to publish a write-up of our recent Kubernetes DNS incident and hopefully find a way to more systematically share postmortems with the community.
Sharing our failure stories is something we can all benefit from to harden our setups and help prioritize upstream issues.
"Production-readiness" is, from my perspective, still something mostly discussed behind closed doors (i.e. inside organizations). For example, `CPU CFS quota behavior and latency impact <https://www.slideshare.net/try_except_/optimizing-kubernetes-resource-requestslimits-for-costefficiency-and-latency-highload>`_ is not well known and not mentioned in `the docs <https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/>`_.
Let's change that!
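
To make the CFS quota point more concrete, here is a rough sketch of a simplified model (not taken from the linked slides or the Kubernetes docs). It assumes the default CFS period of 100 ms and that a CPU limit is translated into a per-period CPU time quota; the function names are hypothetical, and the model ignores scheduling details such as other processes competing for CPU.

.. code-block:: python

   import math

   CFS_PERIOD_US = 100_000  # assumed default CFS period: 100 ms


   def cfs_quota_us(cpu_limit_millicores, period_us=CFS_PERIOD_US):
       """CPU time (in microseconds) a container may use per CFS period."""
       return cpu_limit_millicores * period_us // 1000


   def wall_clock_us(cpu_needed_us, cpu_limit_millicores, threads=1,
                     period_us=CFS_PERIOD_US):
       """Estimate the wall-clock time needed to finish a burst of CPU work.

       Simplifying assumptions: the burst starts at the beginning of a
       period, the work splits evenly across `threads` cores, and the
       container is throttled until the next period once its quota is used.
       """
       if cpu_needed_us <= 0:
           return 0
       quota = cfs_quota_us(cpu_limit_millicores, period_us)
       # Periods in which the full quota is burned and the container is throttled.
       full_periods = (cpu_needed_us - 1) // quota
       remaining = cpu_needed_us - full_periods * quota
       return full_periods * period_us + math.ceil(remaining / threads)


   if __name__ == "__main__":
       print(cfs_quota_us(500))           # quota for a limit of 500m
       print(wall_clock_us(60_000, 500))  # 60 ms of CPU work under that limit

With a limit of ``500m``, a single-threaded request that needs 60 ms of CPU time takes about 110 ms of wall-clock time in this model: it runs for 50 ms, is throttled for the remaining 50 ms of the period, and finishes 10 ms into the next one.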

BTW: I'm also still looking for the first Istio failure talk...

.. image:: ../galleries/twitter-istio-horror-story.png
   :class: center
   :target: https://twitter.com/ipedrazas/status/979293422199738368

Some recommended talks/reads for Kubernetes in production:

* `Hardening Kubernetes Setups: War Stories from the Trenches of Production - Giant Swarm - KubeCon North America 2018 <https://www.youtube.com/watch?v=MTHj0_NdeeM>`_: not very deep, but mentions some good points to look out for
* `90 days of AWS EKS in Production - Graham Moore - blog post 2018 <https://kubedex.com/90-days-of-aws-eks-in-production/>`_: many tunable system parameters (which you probably should not copy 1:1 without understanding them), mentions important ``kube-dns`` scaling
* `Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latency - Zalando - Highload++ 2018 <https://www.youtube.com/watch?v=eBChCFD9hfs>`_: why you should consider disabling CPU throttling (CFS quota) in your cluster(s)
* `Kubernetes the very hard way at Datadog <https://www.youtube.com/watch?v=2dsCwp_j0yQ>`_: good insights into common (DNS issues, OOM) and less common (e.g. Datadog uses containerd and IPVS) challenges
* `Inside Kubernetes Resource Management (QoS) – Mechanics and Lessons from the Field - Michael Gasch - KubeCon Europe 2018 <https://www.youtube.com/watch?v=8-apJyr2gi0>`_: fundamental information on how Kubernetes resources work

UPDATE 2019-01-28
-----------------

I did a brief `write-up on what happened after posting this blog article on Hacker News </posts/tale-of-a-hacker-news-post.html>`_.