
Except that Kubernetes, together with all the additional components it needs (you mentioned Longhorn, plus a database such as PostgreSQL, Cilium, and others), has a multitude of ways to fail unpredictably during updates, and each of these projects carries plenty of hidden, minor bugs. Their GitHub issue trackers are very long.

I'm not suggesting that this isn't a viable solution, but I would prefer to be well-prepared and have a team of experts in their respective fields who are willing to take on-call duty. That team would include a specialist for Longhorn or Ceph, since storage is extremely important; someone to set up and maintain a high-availability PostgreSQL cluster with an operator and automated, thoroughly tested backups; and another person for the complexities of eBPF/Cilium networking, which is also crucial, because if your cluster network fails you have an immediate major outage.
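
To make "thoroughly tested backups" concrete, here is a minimal sketch of the idea: a backup is only trusted once it has actually been restored somewhere and checked. The connection strings, paths, and database names below are made-up placeholders, and a real setup would more likely lean on an operator's built-in backup and WAL-archiving machinery rather than a script like this.

    #!/usr/bin/env python3
    # Hypothetical sketch: dump a PostgreSQL database, then prove the dump is
    # usable by restoring it into a scratch database. All connection strings
    # and paths are placeholders, not a real environment.
    import datetime
    import pathlib
    import subprocess

    SOURCE_DSN = "postgresql://backup_user@db.internal:5432/app"             # assumption
    SCRATCH_DSN = "postgresql://backup_user@db.internal:5432/restore_check"  # assumption
    BACKUP_DIR = pathlib.Path("/var/backups/postgres")                       # assumption

    def main():
        stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        dump_file = BACKUP_DIR / f"app-{stamp}.dump"

        # Take a compressed, custom-format dump.
        subprocess.run(
            ["pg_dump", "--format=custom", f"--file={dump_file}", SOURCE_DSN],
            check=True,
        )

        # Restore into the scratch database; --clean/--if-exists drop whatever
        # the previous verification run left behind. If this step fails, the
        # backup is not trustworthy and someone should get paged.
        subprocess.run(
            ["pg_restore", "--clean", "--if-exists",
             f"--dbname={SCRATCH_DSN}", str(dump_file)],
            check=True,
        )

    if __name__ == "__main__":
        main()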

Certainly, you can claim that you have enough experience to manage all these systems yourself, but when do you plan to sleep if you're on call 24/7/365? So you either need a highly competent team of domain experts, which also comes at a significant cost, or you opt for cloud services where all of this management is taken care of for you. Of course, that management is baked into the price, which is why it's more expensive than bare metal.



I feel like this is a pretty short-sighted solution.

> you either need a highly competent team of domain experts, which also incurs a significant cost

You can build this team and domain knowledge from the ground up. Internal documentation goes a long, long way too. Having worked at bare-metal shops off and on over the last decade, I've found that documentation capturing the why and the how is extremely important.

> you opt for cloud services where all of this management is taken care of for you

Is it though? One of the biggest differences between a bare-metal shop and a cloud shop is what happens when shit hits the fan. When the cloud goes down, everyone sits around twiddling their thumbs while customer money flies away; there isn't anything anyone can do. Maybe there will be discussions about standing up staging/test envs in more regions ... but that's expensive. And when the cloud does come back, you'll spend more hours rebooting things to get everything back into a good state, maybe even shipping code.

When bare metal goes down, a team of highly competent people who know it inside and out gives you minute-by-minute status updates, predictions on when it will be back, and so on. They can bring systems back online in whatever order they choose rather than a random one, so you can get core services running again pretty quickly while everything else stays degraded.

Like all things in software, there are tradeoffs.


Based on my experience, uptime for Generally Available (GA) services in the cloud typically ranges between 99.9% and 99.95% for a single region (99.9% allows at most about 8 hours and 46 minutes of downtime per year). This aligns with my long-term experience as a Google Cloud Platform (GCP) user. If you use preview or beta versions of services, reliability may be lower, but then you're taking on that risk yourself.
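
The downtime budgets are easy to work out yourself; a quick back-of-the-envelope calculation (assuming a 365-day year):

    # Downtime budget per year for common availability targets.
    HOURS_PER_YEAR = 365 * 24  # 8760

    for sla in (0.999, 0.9995, 0.9999):
        downtime_h = HOURS_PER_YEAR * (1 - sla)
        print(f"{sla:.2%} uptime -> {downtime_h:.2f} h/year ({downtime_h * 60:.0f} min)")

That works out to roughly 8 h 46 min per year at 99.9%, 4 h 23 min at 99.95%, and about 53 min at 99.99%.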

Should you require greater than 99.95% uptime for particularly critical operations, a multi-region approach, such as multi-region storage buckets, is advisable. It's also worth mentioning that I have never experienced a full 8 hours of downtime at once in a given year; thanks to the redundancy provided by availability zones within each region, it has usually been a matter of increased error rates or elevated latency. Just make sure your network calls have a retry mechanism and you should be fine in almost all cases.
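
A minimal sketch of such a retry, using exponential backoff with jitter (the endpoint URL and the use of the requests library are assumptions; managed client SDKs usually ship configurable retry policies you can use instead):

    import random
    import time

    import requests  # assumption: any HTTP client with timeouts works here

    TRANSIENT = {429, 500, 502, 503, 504}

    def get_with_retry(url, attempts=5, base_delay=0.5, timeout=10):
        """GET with exponential backoff and full jitter for transient failures."""
        for attempt in range(attempts):
            try:
                resp = requests.get(url, timeout=timeout)
                if resp.status_code not in TRANSIENT:
                    return resp  # success, or a non-retryable error
            except requests.RequestException:
                pass  # connection reset, timeout, etc. -- treat as transient
            if attempt < attempts - 1:
                # Sleep a random amount up to an exponentially growing cap.
                time.sleep(random.uniform(0, base_delay * 2 ** attempt))
        raise RuntimeError(f"giving up on {url} after {attempts} attempts")

    # Usage (hypothetical object URL):
    # blob = get_with_retry("https://storage.example.com/bucket/object").content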


On my first day ever in production on AWS, we had an entire 9 hours of downtime[1]... so maybe I'm a bit biased, especially because we had another outage not long after that one[2], on Christmas freaking Eve. Before moving to AWS, resolution timelines could be given to stakeholders within minutes of discovering an issue. After moving to AWS, we played darts and looked like fools, because there was nothing we could do while the company hemorrhaged money.

The cloud is much more mature these days, as is DevOps in general, but major outages still happen. If you run your own cloud, you'll still have major outages; you can't really escape them.

If you have the expertise, or can get it, to do this in-house, you should. Just look at the US and its inability to build an inexpensive rocket, or even just to manufacture goods: it outsourced (basically) everything to the point where it relies on the rest of the world for basic necessities.

You gotta think long-term, not short-term.

[1]: https://aws.amazon.com/message/680342/

[2]: https://aws.amazon.com/message/680587/



