
AWS isn't just IaaS; it's also PaaS.

So for most use cases it will be significantly easier to manage than bare metal.

Because much of it is managed for you: object stores, databases, etc.



Setting up k3s: 2 hours

Setting up Garage for obj store: 1 hour.

Setting up Longhorn for storage: 15 minutes.

Setting up db: 30 minutes.

Setting up Cilium with a pool of IPs to use as a load balancer: 45 minutes.

All in: ~5 hours, and I'm ready to deploy while spending about 300 bucks a month just renting bare-metal servers.
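For the curious, the k3s + Cilium part of that list looks roughly like this (a sketch assuming the upstream k3s installer and the cilium CLI; flags, Helm values, and CRD field names vary by version, and the IP range is just an example):

    # Install k3s without its bundled CNI, service load balancer, and ingress,
    # so Cilium can take over networking and load balancing
    curl -sfL https://get.k3s.io | sh -s - server \
      --flannel-backend=none --disable-network-policy \
      --disable=servicelb --disable=traefik

    # Install Cilium as the CNI (newer cilium CLIs pass Helm values with --set)
    cilium install --set kubeProxyReplacement=true \
      --set l2announcements.enabled=true

    # Give Cilium a pool of IPs to hand out to LoadBalancer services
    # (you also need a CiliumL2AnnouncementPolicy, omitted here)
    kubectl apply -f - <<'EOF'
    apiVersion: cilium.io/v2alpha1
    kind: CiliumLoadBalancerIPPool
    metadata:
      name: lb-pool
    spec:
      blocks:
        - cidr: 192.0.2.0/27   # example range; use IPs routable in your network
    EOF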

AWS, for far less compute and the same capabilities: approximately 800-1000 bucks a month, and about 3 hours of setup -- and we aren't even counting egress costs yet.

So, for two extra hours on your initial setup, you can save a ridiculous amount of money. Maintenance is actually less work than AWS too.

(source: I'm working on a YouTube video)


You should stick to making YouTube videos, then.

Because there is a world of difference between installing some software and making it robust enough to support a multi-million-dollar business. I would be surprised if you could set up and test a proper highly available database with automated backups in < 30 minutes.


When you are making multiple millions of dollars, that's when you spend the money on the cloud. There is a spot somewhere between “survive on cloud credits” and “survive in the cloud” where “survive on bare metal” makes far more sense. The transitions are hard, but they are worth it.


What do you mean? You install postgres on 2+ machines and configure them.

Installing software is exactly what multi-million dollar companies do.

Backups are not hard either. There are many open-source setups out there, and building your own is not that complex.
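A bare-bones primary/standby pair looks roughly like this (a sketch assuming Debian/Ubuntu packages and Postgres 15; the IPs and password are placeholders, and real setups add TLS, pooling, and failover tooling on top):

    # On the primary (10.0.0.1): allow a standby to replicate
    sudo -u postgres psql -c "CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'changeme';"
    echo "host replication replicator 10.0.0.2/32 scram-sha-256" | \
      sudo tee -a /etc/postgresql/15/main/pg_hba.conf
    # make sure listen_addresses in postgresql.conf covers the standby's network
    sudo systemctl reload postgresql

    # On the standby (10.0.0.2): clone the primary and start streaming
    sudo systemctl stop postgresql
    sudo -u postgres rm -rf /var/lib/postgresql/15/main
    sudo -u postgres pg_basebackup -h 10.0.0.1 -U replicator \
      -D /var/lib/postgresql/15/main -R -X stream -P   # password via ~/.pgpass or prompt
    sudo systemctl start postgresql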


Doing this right is easier said than done. I worked at one company that ran their own Postgres instances on EC2 instead of using RDS. Big mistake. The configuration was so screwed up they couldn't even take a backup that wasn't corrupted. They had "experts" working on this for months.


> Backups are not hard either.

I guess you've never set one up, because I've seen numerous attempts at this and none took less than a month.


Backups are not hard. I've set them up, and restored from them, numerous times in my career. Even PITR is not hard.

It takes about 10-15 minutes to apply the configs and install the stuff, then maybe another 10-15 minutes to run a simple test and verify things.
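For illustration, with pgBackRest shipping WAL to an S3-compatible bucket (one tool among several; wal-g or plain pg_basebackup plus archive_command work similarly, and the stanza name, paths, and timestamp here are placeholders):

    # postgresql.conf on the primary:
    #   archive_mode = on
    #   archive_command = 'pgbackrest --stanza=main archive-push %p'

    # Take and verify a full backup
    pgbackrest --stanza=main stanza-create
    pgbackrest --stanza=main backup
    pgbackrest --stanza=main check

    # The actual test: restore into a scratch data directory at a point in time,
    # start Postgres against it, and run a sanity query
    pgbackrest --stanza=main restore --pg1-path=/srv/pg-restore-test \
      --type=time --target='2024-01-01 12:00:00' --target-action=promote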


> You install postgres on 2+ machines and configure them.

Oh sweet summer child... let me know how that goes for you.


It's easy to stand this stuff up initially, but the real work is in scaling, automating, testing, and documenting all of it, and many places don't have the people, the skills, or both to do that easily.

Also, with EKS, you get literally all of this (except Cilium and Longhorn, if you need those, which you don't if you use vpc-cni and eks-csi) in ~8 minutes, and it comes with node autoscaling, tie-ins to IAM, and a bunch of other stuff for free. This is perfect for a typical lean engineering team that doesn't really do platform work but needs to out of necessity, and/or a platform team that's just getting ramped up on k8s.

You also don't need to test your automation for k8s upgrades or maintain etcd with EKS, which can be big time-savers.
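For comparison, the managed path really is about one command (a sketch with eksctl; the cluster name, region, and instance sizes are placeholders):

    # Creates the control plane, a managed node group, VPC networking, and the IAM wiring
    eksctl create cluster --name demo --region us-east-1 \
      --nodegroup-name workers --nodes 3 --node-type m6i.large --managed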

(FWIW I love Kubernetes and have made courses/workshops on exactly this kind of work.)


Except that Kubernetes, along with all the additional components it needs (for example the Longhorn you mentioned, a database such as PostgreSQL, Cilium, and others), has a multitude of ways to fail unpredictably during updates, and there can be many hidden, minor bugs. The GitHub issue lists for these projects are very long.

I'm not suggesting this isn't a viable solution, but I would prefer to be well prepared and have a team of experts in their respective fields who are willing to take on-call duty. That team would include a specialist for Longhorn or Ceph, since storage is extremely important; one to set up and maintain a high-availability PostgreSQL database with an operator and automated, thoroughly tested backups; and another for eBPF/Cilium networking complexities, which is also crucial, because if your cluster network fails, you have an immediate major outage.

Certainly, you can claim that you have sufficient experience to manage all these systems yourself, but when do you plan to sleep if you're on call 24/7/365? So you either need a highly competent team of domain experts, which also incurs a significant cost, or you opt for cloud services where all of this management is taken care of for you. Of course, that service is baked into the price, which is why it's more expensive than bare metal.


I feel like this is a pretty short-sighted solution.

> you either need a highly competent team of domain experts, which also incurs a significant cost

You can build this team and the domain knowledge from the ground up. Internal documentation goes a long, long way too. Having worked at bare-metal shops off and on over the last decade, I can say that documentation capturing the why and the how is extremely important.

> you opt for cloud services where all of this management is taken care of for you

Is it though? One of the biggest differences between a bare-metal shop and a cloud shop is what happens when shit hits the fan. When the cloud goes down, everyone sits around twiddling their thumbs while customer money flies away. There isn't anything anyone can do. Maybe there will be discussions about standing up staging/test envs in more regions ... but that's expensive. When the cloud does come back, you'll spend more hours rebooting things to get everything back into a good state, maybe even shipping code.

When bare metal goes down, a team of highly competent people who know it inside and out gives you minute-by-minute status updates, predictions on when it will be back, and so on. They can bring systems back online in whatever order they want instead of a random one, so you can get core services back pretty quickly while everything else runs degraded.

Like all things in software, there are tradeoffs.


Based on my experience, uptime for the Generally Available (GA) services in the cloud typically ranges between 99.9% and 99.95% for a single region; at 99.9% that allows at most about 8 hours and 46 minutes of downtime per year (0.1% of 8,760 hours). This aligns with my long-term experience as a Google Cloud Platform (GCP) user. If you use preview or beta versions of services, the reliability may be lower, but then you're taking on that risk yourself.

Should you require greater than 99.95% uptime for particularly critical operations, then opting for a multi-region approach, such as using multi-region storage buckets, is advisable. It's also worth mentioning that I have never experienced a full 8 hours of downtime at once in a given year. It has usually been a case of increased error rates or heightened latency due to the inherent redundancy provided by availability zones within each region. Just make sure your network calls have a retry mechanism and you should be fine in almost all cases.
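The retry part is cheap to get right; for plain HTTP calls even curl's built-in flags cover it (a sketch; the URL is a placeholder, and client libraries have equivalent knobs):

    # Retry transient failures; curl backs off between attempts (1s, 2s, 4s, ...)
    curl --fail --retry 5 --retry-all-errors --retry-max-time 60 \
      https://example.com/api/health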


On my first day ever in production on AWS, we had an entire 9 hours of downtime[1]... so maybe I'm a bit biased, especially because we had another one not too long after that[2], on Christmas freaking Eve. Before moving to AWS, we could give stakeholders resolution timelines within minutes of discovering an issue. After moving to AWS, we played darts and looked like fools, because there was nothing we could do while the company hemorrhaged money.

The cloud is much more mature these days, as is DevOps in general, but major outages still happen. If you run your own cloud, you'll still have major outages. You can't really escape them.

If you have the expertise, or can get it, to do something in-house, you should do it. Just look at the US and its inability to build an inexpensive rocket, or even just manufacture goods. It outsourced (basically) everything to the point where it is reliant on the rest of the world for basic necessities.

You gotta think long-term, not short-term.

[1]: https://aws.amazon.com/message/680342/

[2]: https://aws.amazon.com/message/680587/


Setting up a db with backups, PITR, and periodic backup tests in 30 minutes? You made my day :) Also, how about controlling who has access to the servers, or tamper-protected activity logs? That's just scratching the surface.


Yeah, it is pretty straightforward to set it up in about 30-45 minutes. I have a cookbook I've been building since 2004, and I keep it updated. Maybe one day I'll publish it, but mostly, it is for me.

> how about controlling who has access to servers

It depends on what you want to control access to, and how. If you are talking about SSH, you can bind accounts to a GitHub team with about 20 lines of bash deployed to all the servers. Actually, I have a daemonset that keeps it updated.

Thus only people in my org, on a specific team, can access a server. If I kick them out of the team, they lose access to the servers.

I have something similar for k8s, but it is a bit more complicated and took quite a bit of time to get right.
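Not the exact 20 lines, but the general pattern is an sshd AuthorizedKeysCommand that pulls the team's public keys from GitHub (the org, team slug, and token path below are placeholders, and mapping GitHub logins to local users is left out):

    #!/usr/bin/env bash
    # /usr/local/bin/github-team-keys, wired up in sshd_config with:
    #   AuthorizedKeysCommand /usr/local/bin/github-team-keys
    #   AuthorizedKeysCommandUser nobody
    set -euo pipefail
    ORG="my-org" TEAM="infra" TOKEN="$(cat /etc/github-token)"

    # List the team's members, then print each member's public keys
    curl -fsS -H "Authorization: Bearer ${TOKEN}" \
      "https://api.github.com/orgs/${ORG}/teams/${TEAM}/members" |
      jq -r '.[].login' |
      while read -r login; do
        curl -fsS "https://github.com/${login}.keys"
      done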

> tamper protected activity logs?

    sudo chattr +au /path/to/file

will force Linux to only allow the file to be read and written in append mode, and deletions become soft-deletes (assuming you are using a supported filesystem). Things like shell history files, logs, etc. make a ton of sense to set that way. There are probably a couple of edge cases I'm forgetting, but that will get you at least 80% of the way there.
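Quick demo of the append-only half (the 'a' attribute; 'u' can be added too, but whether the filesystem actually keeps deleted contents around varies, so treat it as best-effort):

    sudo chattr +a /var/log/auth.log
    lsattr /var/log/auth.log                      # the 'a' flag is now set
    sudo sh -c 'echo wiped > /var/log/auth.log'   # fails: Operation not permitted
    sudo sh -c 'echo entry >> /var/log/auth.log'  # appending still works
    # removing the flag again requires root: chattr -a /var/log/auth.log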


Debugging longhorn outages on a monthly basis: priceless.


I haven't had any issues with Longhorn in a long time. Most issues I run into are network-related; in fact, I tend to find a lot of Cilium bugs (still waiting on one to be fixed before I can upgrade, actually).


I hope you like slogging through the depths of Kubernetes api-resources and trudging through a trillion lines of logs across a million different pods.



