
Except that Kubernetes, together with all the additional components it needs (you mentioned Longhorn, plus a database such as PostgreSQL, Cilium, and others), has a multitude of ways to fail unpredictably during updates, and each of these projects carries plenty of hidden, minor bugs. Their GitHub issue trackers are very long.

I'm not suggesting that this isn't a viable solution, but I would prefer to be well-prepared and have a team of experts in their respective fields who are willing to take on-call duty. That team would include a specialist for Longhorn or Ceph, since storage is extremely important; someone to set up and maintain a high-availability PostgreSQL cluster with an operator and automated, thoroughly tested backups; and another person for the complexities of eBPF/Cilium networking, which is also crucial, because if your cluster network fails you have an immediate major outage.
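
To make "thoroughly tested backups" concrete, here is a minimal sketch of the idea: a backup is only trusted once it has actually been restored somewhere and checked. The connection strings, paths, and database names below are made-up placeholders, and a real setup would more likely lean on an operator's built-in backup and WAL-archiving machinery rather than a script like this.

    #!/usr/bin/env python3
    # Hypothetical sketch: dump a PostgreSQL database, then prove the dump is
    # usable by restoring it into a scratch database. All connection strings
    # and paths are placeholders, not a real environment.
    import datetime
    import pathlib
    import subprocess

    SOURCE_DSN = "postgresql://backup_user@db.internal:5432/app"             # assumption
    SCRATCH_DSN = "postgresql://backup_user@db.internal:5432/restore_check"  # assumption
    BACKUP_DIR = pathlib.Path("/var/backups/postgres")                       # assumption

    def main():
        stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        dump_file = BACKUP_DIR / f"app-{stamp}.dump"

        # Take a compressed, custom-format dump.
        subprocess.run(
            ["pg_dump", "--format=custom", f"--file={dump_file}", SOURCE_DSN],
            check=True,
        )

        # Restore into the scratch database; --clean/--if-exists drop whatever
        # the previous verification run left behind. If this step fails, the
        # backup is not trustworthy and someone should get paged.
        subprocess.run(
            ["pg_restore", "--clean", "--if-exists",
             f"--dbname={SCRATCH_DSN}", str(dump_file)],
            check=True,
        )

    if __name__ == "__main__":
        main()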

Certainly, you can claim that you have enough experience to manage all these systems yourself, but when do you plan to sleep if you're on call 24/7/365? So you either need a highly competent team of domain experts, which also comes at a significant cost, or you opt for cloud services where all of this management is taken care of for you. Of course, that management is baked into the price, which is why it's more expensive than bare metal.



I feel like this is a pretty short-sighted solution.

> you either need a highly competent team of domain experts, which also incurs a significant cost

You can build this team and domain knowledge from the ground up. Internal documentation goes a long, long way too. Having worked at bare-metal shops off and on over the last decade, I've found that documentation capturing the why and the how is extremely important.

> you opt for cloud services where all of this management is taken care of for you

Is it though? One of the biggest differences between a bare-metal shop and a cloud shop is what happens when shit hits the fan. When the cloud goes down, everyone sits around twiddling their thumbs while customer money flies away; there isn't anything anyone can do. Maybe there will be discussions about standing up staging/test envs in more regions ... but that's expensive. And when the cloud does come back, you'll spend more hours rebooting things to get everything back into a good state, maybe even shipping code.

When bare metal goes down, a team of highly competent people who know it inside and out gives you minute-by-minute status updates, predictions on when it will be back, and so on. They can bring systems back online in whatever order they choose rather than a random one, so you can get core services running again pretty quickly while everything else stays degraded.

Like all things in software, there are tradeoffs.


Based on my experience, uptime for Generally Available (GA) services in the cloud typically ranges between 99.9% and 99.95% for a single region (99.9% allows at most about 8 hours and 46 minutes of downtime per year). This aligns with my long-term experience as a Google Cloud Platform (GCP) user. If you use preview or beta versions of services, reliability may be lower, but then you're taking on that risk yourself.
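
The downtime budgets are easy to work out yourself; a quick back-of-the-envelope calculation (assuming a 365-day year):

    # Downtime budget per year for common availability targets.
    HOURS_PER_YEAR = 365 * 24  # 8760

    for sla in (0.999, 0.9995, 0.9999):
        downtime_h = HOURS_PER_YEAR * (1 - sla)
        print(f"{sla:.2%} uptime -> {downtime_h:.2f} h/year ({downtime_h * 60:.0f} min)")

That works out to roughly 8 h 46 min per year at 99.9%, 4 h 23 min at 99.95%, and about 53 min at 99.99%.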

Should you require greater than 99.95% uptime for particularly critical operations, a multi-region approach, such as multi-region storage buckets, is advisable. It's also worth mentioning that I have never experienced a full 8 hours of downtime at once in a given year; thanks to the redundancy provided by availability zones within each region, it has usually been a matter of increased error rates or elevated latency. Just make sure your network calls have a retry mechanism and you should be fine in almost all cases.
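
A minimal sketch of such a retry, using exponential backoff with jitter (the endpoint URL and the use of the requests library are assumptions; managed client SDKs usually ship configurable retry policies you can use instead):

    import random
    import time

    import requests  # assumption: any HTTP client with timeouts works here

    TRANSIENT = {429, 500, 502, 503, 504}

    def get_with_retry(url, attempts=5, base_delay=0.5, timeout=10):
        """GET with exponential backoff and full jitter for transient failures."""
        for attempt in range(attempts):
            try:
                resp = requests.get(url, timeout=timeout)
                if resp.status_code not in TRANSIENT:
                    return resp  # success, or a non-retryable error
            except requests.RequestException:
                pass  # connection reset, timeout, etc. -- treat as transient
            if attempt < attempts - 1:
                # Sleep a random amount up to an exponentially growing cap.
                time.sleep(random.uniform(0, base_delay * 2 ** attempt))
        raise RuntimeError(f"giving up on {url} after {attempts} attempts")

    # Usage (hypothetical object URL):
    # blob = get_with_retry("https://storage.example.com/bucket/object").content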


On my first day ever in production on AWS, we had an entire 9 hours of downtime[1]... so maybe I'm a bit biased, especially because we had another outage not long after that one[2], on Christmas freaking Eve. Before moving to AWS, resolution timelines could be given to stakeholders within minutes of discovering an issue. After moving to AWS, we played darts and looked like fools, because there was nothing we could do while the company hemorrhaged money.

The cloud is much more mature these days, as is DevOps in general, but major outages still happen. If you run your own cloud, you'll still have major outages; you can't really escape them.

If you have the expertise, or can get it, to do this in-house, you should. Just look at the US and its inability to build an inexpensive rocket, or even just to manufacture goods: it outsourced (basically) everything to the point where it relies on the rest of the world for basic necessities.

You gotta think long-term, not short-term.

[1]: https://aws.amazon.com/message/680342/

[2]: https://aws.amazon.com/message/680587/



