Maybe, but the point with containers and Kubernetes is to treat infrastructure like cattle, not pets.
If something blows up or dies, then with Kubernetes it's often faster to just tear down the entire namespace and bring it up again. If the entire cluster is dead, then just spin up a new cluster and run your yaml files on it and kill your old cluster.
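In practice that "tear down and re-apply" workflow is just a few commands. A rough sketch, assuming your manifests all live in a `./manifests/` directory and you provision clusters with something like eksctl (the names and tool are illustrative, not prescriptive):

```shell
# Nuke the broken namespace and rebuild it from declarative manifests
kubectl delete namespace my-app        # tears down everything in the namespace
kubectl create namespace my-app
kubectl apply -n my-app -f ./manifests/

# If the whole cluster is dead: stand up a fresh one, re-apply, kill the old one
eksctl create cluster --name my-app-v2     # any cluster provisioner works here
kubectl apply -f ./manifests/
eksctl delete cluster --name my-app-v1
```

This only works if everything about the workloads really is captured in those manifests, which is the whole point of the cattle approach.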
Treat it like cattle, when it doesn't serve your purpose anymore then shoot it.
This is one of the biggest advantages of Kubes, but often overlooked because traditional Ops people keep treating infrastructure like a pet.
The only thing you should treat like a pet is your persistence layer, which is presumably outside Kubes: something like DynamoDB, Firestore, CosmosDB, SQL Server, whatever.
This is not good engineering. If somebody told me this at a business, I'd no longer trust them with my infrastructure.
So you're saying that problems happen, and you consciously don't want to know about or solve them. In your view, a recurring problem is solved by constantly building new K8s clusters, and your whole infrastructure in them, every time!?
Simple example: a microservice that leaks memory... just let it keep restarting as it crashes?!
I remember at one of my first jobs, on a healthcare system for a hospital in India, their Java app was so poorly written that it kept leaking memory, bloated beyond what GC could help with, and crashed every morning around 11 AM and then again around 3 PM. The end users (doctors, nurses, pharmacists) knew about this behavior and took breaks during that time. Absolute bullshit engineering! Shame on those who wrote that shitty code, and shame on anyone reckless enough to suggest ever-rebuilding K8s clusters as the fix.
Yes, "let it keep restarting while it crashes and while I investigate the issue" is MUCH preferred to "everything's down and my boss is on my ass to fix the memory issue."
The bug exists either way, but in one world my site is still up while I fix the bug and prioritize it against other work and in another world my site is hard-down.
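For what it's worth, the "keep restarting while I investigate" posture is something you configure deliberately, not something you just hope for. A minimal illustration (all names and numbers here are hypothetical): a memory limit so the leaking container gets OOM-killed and restarted instead of starving the node, plus multiple replicas so the other pods keep serving during each restart:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: leaky-service          # hypothetical service name
spec:
  replicas: 3                  # the other replicas keep serving while one restarts
  selector:
    matchLabels:
      app: leaky-service
  template:
    metadata:
      labels:
        app: leaky-service
    spec:
      containers:
      - name: app
        image: example/leaky-service:1.0   # placeholder image
        resources:
          limits:
            memory: "512Mi"    # the leak hits this cap -> OOMKilled -> restart
          requests:
            memory: "256Mi"
      # restartPolicy defaults to Always for Deployment pods
```

The memory limit is the key part: without it, the leak can take the whole node down with it rather than just the one pod.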
That only works if the bug actually gets fixed. Once you've normalized the idea that restarting the cluster fixes a problem, all of a sudden you don't have a problem anymore. So now your motivation to get the bug properly fixed has gone away.
Sometimes feeling a little pain helps get things done.
You and I wish that's what happened in real life. Instead, people now normalize the behavior thinking it'll sort itself out automatically over time without ever trying to fix it.
Self-healing systems are good but only if you have someone who is keeping track of the repeated cuts to the system.
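Keeping track of those repeated cuts can be as simple as watching restart counts and last-termination reasons. A sketch (the restart threshold of 5 is arbitrary, and the pod name is a placeholder):

```shell
# List pods whose first container has restarted more than 5 times
kubectl get pods --all-namespaces \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
  | awk '$3 > 5'

# Check why a given pod last died (e.g. OOMKilled points at a memory leak)
kubectl describe pod <pod-name> | grep -A3 'Last State'
```

In a real setup you'd alert on this (e.g. on restart-rate metrics) rather than eyeball it, but the point stands: self-healing without restart accounting is just hiding the bleeding.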
This is something that has been bothering me for the last couple of years. I consistently work with developers who no longer care about performance issues, assuming that k8s and the ops team will take care of it by adding more CPU or RAM or just restarting. What happened to writing reliable code that performed well?
Business incentives. It's a classic incentive tension: spend more time on nicer code that does the same thing, or build more features. Code expands to fill its performance budget, and all that.
At least on the backend you can quantify the cost fairly easily. If you bring it up to your business people, they'll notice the easy win and then push the devs to write more efficient code.
If it's a small $$ difference, though, the devs are probably prioritizing correctly.
I've witnessed the same thing; however, there is nothing mutually exclusive about having performant code running in Kubernetes. There's a trade-off between performance and productivity, and maintaining a sense of pragmatism is a good skill to have (that's directed at those who use scaling up/out as an excuse for being lax about performance).
Nothing is this black and white. I tried to emphasise a simple philosophy: life gets a lot easier if you make things easily replaceable. That was the message I tried to convey. Of course, if there is a deep problem with something it needs proper investigation and fixing, but that's an actual code/application problem.
That's not what cattle vs pets is. Treating your app as cattle means that it deploys, terminates, and re-deploys with minimal thought at the time of where and how. Your app shouldn't care which Kubernetes node it gets deployed to. There shouldn't be some stateful infrastructure that requires hand-holding (e.g. logging into a named instance to restart a specific service). Sometimes network partitions happen, a disk starts going bad, or some other funky state happens and you kill the Kubernetes pod and move on.
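That "kill the pod and move on" step really is a one-liner, because the Deployment's controller immediately reschedules a replacement somewhere else (the pod name and label below are illustrative):

```shell
# Delete the misbehaving pod; the ReplicaSet spins up a fresh one on another node
kubectl delete pod web-7d4b9c-xk2lp

# Watch the replacement come up
kubectl get pods -l app=web -w
```

Contrast that with the pet model, where you'd SSH into a named box and nurse a specific process back to health.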
You should try to fix mem leaks and other issues like the one you described, and sometimes you truly do need pets. Many apps can benefit from being treated like cattle, however.
When cattle are sick, you need to heal them, not shoot them in the head and bring in new cattle. If your software behaves badly, you need to FIX THE SOFTWARE.
Just doing the old "restart everything" routine is typical Windows-admin behavior and a recipe for bad, unstable systems.
Kubernetes absolutely does do strange things, crashes on strange things, and doesn't tell you about it.
I like the system, but pretending it's this unbelievably great thing is an exaggeration.