
I think the post should have been titled “Production is hard” (as the author gets to later on). Pick any technology out there: from Python, to C++, to K8s, to Linux… Do the analogous “Hello world” using such technologies and run the program on your laptop. Easy. You congratulate yourself and move on.

Production is another story. Suddenly your program that wasn’t checking for errors breaks. The memory you didn’t manage properly now becomes a problem. Your algorithm doesn’t cut it anymore. Etc.



> Suddenly your program that wasn’t checking for errors breaks. The memory you didn’t manage properly now becomes a problem.

Yeah, nobody deployed anything and ran it for months, even years, before Kubernetes.


Yes, they did, but did it really take less effort than running it on Kubernetes?


Definitely if they weren't running very complex services or their business tolerated an occasional maintenance window.

The particular way Kubernetes can bite you is that it makes it much easier to start with far more complex setups - not necessarily much harder than starting with a simple setup! - but then you have to maintain and operate those complex setups forever, even if you don't need them that much.

If you're growing your team and your services, having a much bigger, more complicated toolbox for people to reach into on day 1 gives you way more chance of building expensive painful tech debt.


I think it may appear so because Kubernetes promotes good practices. Do logging, do metrics, do traces. That list quickly grows and while these are good practices, there's a real cost to implement them. But I wouldn't agree that Kubernetes means building tech debt - on the contrary, if you see the tech debt, k8s makes it easier to get rid of it, but that of course takes time and if you don't do it regularly that tech debt is only gonna grow.


I just rarely see people tackling greenfield problems with the discipline to choose to do Kubernetes without also choosing to do "distributed" in a broader, complexity-multiplying way.

If not for Kubernetes (and particularly Kube + cloud offerings) I really doubt they'd do all the setup necessary to get a bunch of distributed systems/services running with such horizontal sprawl.


The operational value of ~weekly ~30 min "maintenance windows" is wildly underestimated by teams today, and its business cost is wildly overestimated.


I'm going to diverge from sibling comments: it depends. As the article points out, k8s may really simplify deploys for devs, while giving them autonomy. But it isn't always worth it.


Yes, until you've scaled enough that it isn't. If you're deploying a dev or staging server, or even prod for your first few thousand users, then you can get by with a handful of servers and stuff. But a lot of stuff that works well on one or three servers starts working less well on a dozen servers, and it's around that point that the up-front difficulty of k8s starts to pay off with lower long-term difficulty.


Whatever crossover point might exist for Kubernetes it's not at a dozen servers, at the low end it's maybe 50. The fair comparison isn't against "yolo scp my php to /var/www", but any of the disciplined orchestration/containerization tools other than Kubernetes.

I ran ~40 servers across 3 DCs with about 1/3 of my time going to ops using salt and systemd.

The next company, we ran about 80 in one DC with one dedicated ops/infra person also embedded in the dev team + "smart hands" contracts in the DC. Today that runs in large part on Kubernetes; it's now about 150 servers and takes basically two full ops people disconnected from our dev completely, plus some unspecified but large percentage of a ~10 person "platform team", with a constant trickle of unsatisfying issues around storage, load balancing, operator compatibility, etc. Our day-to-day dev workflow has not gotten notably simpler.


No it didn't. You ended up with each site doing things differently. You'd go somewhere and they would have a magical program with a cute name written by a founder that distributed traffic, scheduled jobs and did autoscaling. It would have weird quirks and nobody understood it.

Or you wouldn't have it at all. You'd have a nice simple infra and no autoscaling and deploys would be hard and involve manually copying files.


People think it took less effort.

Right up until you needed to do one of the very many things k8s implements.

For example, in multiple previous employers, we had cronjobs: you just set up a cronjob on the server, I mean, really, how hard is that to do?

And that server was a single point of failure: we can't just spin up a second server running crond, obviously, as then the job runs twice. Something would need to provide some sort of locking, then the job would need to take advantage of that, we'd need the job to be idempotent … all of which, except the last, k8s does out of the box. (And it mostly forces your hand on the last.)
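For illustration, here is a minimal sketch of what k8s gives you for this out of the box (the job name, image, and command are all made up): the CronJob controller schedules each run onto whichever node is healthy, so there's no single crond box to lose, and `concurrencyPolicy: Forbid` keeps a new run from starting while the previous one is still going.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report        # hypothetical job name
spec:
  schedule: "0 3 * * *"
  concurrencyPolicy: Forbid   # don't start a new run while one is still running
  jobTemplate:
    spec:
      backoffLimit: 2         # retries are why the job itself still must be idempotent
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: report
            image: registry.example.com/report:latest  # hypothetical image
            command: ["/app/run-report"]
```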

Need to reboot for security patches? We just didn't do that, unless it was something like Heartbleed where it was like "okay we have to". k8s permits me to evict workloads while obeying PDB — in previous orgs, "PDBs" (hell, we didn't even have a word to describe the concept) were just tribal knowledge known only by those of us who SRE'd enough stuff to know how each service worked, and what you needed to know to stop/restart it, and then do that times waaay too many VMs. With k8s, a daemonset can just handle things generically, and automatically.

Need to deploy? Pre-k8s, that was just bespoke scripts, e.g., in something like Ansible. If a replica failed to start after deployment, did the script cease deployment? Not the first time it brought everything down, it didn't: it had to grow that feature by learning the hard way. (Although I suppose you can decide that you don't need that readiness check in k8s, but it's at least a hell of a lot easier to get off the ground with.)

Need a new VM? What are the chances that the current one actually matches the Ansible, and wasn't snowflaked? (All it takes is one dev, and one point in time, doing one custom command!)

The list of operational things that k8s supports that are common amongst "I need to serve this, in production" things goes on.

The worst part of k8s thus far has been Azure's half-aaS'd version of it. I've been pretty satisfied with GKE, but I've only recently gotten to know it and I've not pushed it quite as hard as AKS yet. So we'll see.


I have 0 results for PDB SRE on multiple search engines. What does it mean?


I think it's Pod Disruption Budgets, a kubernetes redundancy/resiliency related concept.


> k8s permits me to evict workloads while obeying PDB — in previous orgs, "PDBs" (hell, we didn't even have a word to describe the concept)

Odd, the parent makes it seem like resource budgets weren’t a thing before k8s.


I've never heard the term "resource budget" used to describe this concept before. Got a link?

That'd be an odd set of words to describe it. To be clear, I'm not talking about budgeting RAM or CPU, or trying to determine do I have enough of those things. A PodDisruptionBudget describes the manner in which one is permitted to disrupt a workload: i.e., how can I take things offline?

Your bog simple HTTP REST API service, for example, might have 3 replicas, behind like a load balancer. As long as any one of those replicas is up, it will continue to serve. That's a "PodDisruptionBudget", here, "at least 1 must be available". (minAvailable: 1, in k8s's terms.)

A database that, e.g., might be using Raft, would require a majority to be alive in order to serve. That would be a minAvailable of "51%", roughly.

So, some things I can do with the webservice, I cannot do with the DB. PDBs encode that information, and since it is in actual data form, that then lets other things programmatically obey that. (E.g., I can reboot nodes while ensuring I'm not taking anything offline.)

( https://kubernetes.io/docs/tasks/run-application/configure-p... )
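Concretely, the web-service example above is only a few lines of YAML (the names and labels here are made up); the part other tooling obeys programmatically is just `minAvailable`:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb          # hypothetical name
spec:
  minAvailable: 1        # "at least 1 replica must stay up" from the example above;
                         # the Raft-style DB would instead use minAvailable: 51%
  selector:
    matchLabels:
      app: api           # hypothetical label selecting the service's pods
```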


A PDB is a good example of Kubernetes's complexity escalation. It's a problem that arises when you have dynamic, controller-driven scheduling. If you don't need that you don't need PDBs. Most situations don't need that. And most interesting cases where you want it, default PDBs don't cover it.


> A PDB is a good example of Kubernetes's complexity escalation. It's a problem that arises when you have dynamic, controller-driven scheduling. If you don't need that you don't need PDBs. Most situations don't need that.

No, and that's my point: PDBs exist always. Whether your org has a term for it, or whether you're aware of them is an entirely different matter.

When I did work comprised of services running on VMs, there was still a (now, spiritual) PDB associated with each service. I cannot just take out nodes willy-nilly, or I will be the cause of the next production outage.

In practice, I was just intimately familiar with the entire architecture, out of necessity, and so I knew what actions I could and could not take. But it was not unheard of for a less-cautious or less-skilled individual to act before thinking. And it inhibits automation: automation needed to be aware of the PDB, and honestly we'd probably just hard-code the needs on a per-service basis. PDBs, as k8s structures them, solve the problem far more generically.


> we'd probably just hard-code the needs on a per-service basis.

For 99% of situations this is a better decision. For, idk, at least 20% of the remaining 1%, PDBs won't handle it anyway.


Sounds like a PDB isn’t a resource budget then. We were using that concept in ESX farms 20 years ago, but it seems PDBs are more what SREs would describe as minimum availability.


Maybe 'minimum instance availability', to be specific that we're referring to instances of a service rather than an SLA.


"Pre-k8s, that was just bespoke scripts, e.g., in something like Ansible"

How are these 'bespoke scripts' but helm charts are not 'bespoke'? Or do you consider them 'bespoke-but-better'?


Because they're completely different things you're comparing. The functionality that I describe as having to have built out as part of Ansible (needing to check that the deploy succeeded, and not move on to the next VM if not) is not present in any Helm chart (as that's not the right layer / doesn't make sense), as it's part of the deployments controller's logic. Every k8s Deployment (whether from a Helm chart or not) benefits from it, and doesn't need to build it out.


> needing to check that the deploy succeeded, and not move on to the next VM if not

It's literally just waiting for a port to open and maybe check for an HTTP response, or run an arbitrary command until non-zero status; all the orch tools can do that in some way.


… there's a difference between "can do it" and "is provided."

In the case of either k8s or VMs, I supply the health check. There's no getting around that part, really.

But that's it in the case of k8s. I'm not building out the logic to do the check, or the logic to pause a deployment if a check fails: that is inherent to the deployments controller. That's not the case with Ansible/Salt/etc.¹, and I end up re-inventing portions of the deployments controller every time. (Or, far more likely, it just gets missed/ignored until the first time it causes a real problem.)

¹and that's not what these tools are targeting, so I'm not sure it's really a gap, per se.


Yes, way less effort


Yep. Still doing it today. Very large scale Enterprise systems with complex multi-country/multi-organisational operational rules running 24/7. Millions of lines of code. No Kubernetes. No micro-services. No BS. It’s simple. It works. And it has worked for 30+ years.


Yes.


Yeah... at small to medium scale anyway.


Yes. Much.


Yes.


Yes.


Well then you could just summarize the whole article in its title, because there isn't anything else in it.


Quoting Bryan Cantrill: production is war.


And war is hell.


I'm a simple man...I see a Bryan Cantrill quote and I upvote.


Production is only as hard as you make it.


Production is as easy as you (and your users) will tolerate.



