Stop Optimizing Your AWS Bill

Disclaimer: I hate writing. I’m using AI to get my ideas onto paper. The opinions, experience, and numbers are mine. The grammar is not.

Everyone is arguing about whether your company needs Kubernetes.

I use Kubernetes. I’ve been using it since 2015, back when GKE was barely out of beta and half the documentation was “just read the source.” I’ve run it in production at scale for years. It is a very different tool today than it was then. More mature, more stable, genuinely easier to operate.

So I am not here to dunk on Kubernetes.

I am here to ask a different question: why are you running it on AWS?

Actually, let me go further. Why are you on a public hyperscaler at all?

I’m the CTO of Rave. I’ve spent over a decade building and scaling global infrastructure. GCP, AWS, OCI, bare metal. I’ve run production on all of them. I currently run Kubernetes on OCI. It works. But increasingly I find myself doing the math on what we are actually paying for versus what we are getting, and the numbers are getting harder to justify.

Not just for us. For everyone.

Sometimes the right optimization is not tuning the thing. It is turning the thing off.

That sounds reckless until you realize how much infrastructure work is built around protecting assumptions nobody has questioned in years. AWS became one of those assumptions. AI makes that harder to justify, because the cost of exploring the alternative collapsed.

The wrong layer

I saw a representative case recently: Series A startup, first DevOps hire, six engineers, B2B SaaS expecting 500 customers year one, SOC 2 requirement, 18 months of runway, $4,000/mo cloud budget.

The replies were full of people debating ECS vs K3s vs Fargate.

They’re optimizing the wrong layer.

That company does not have a Kubernetes problem. It has an infrastructure economics problem. Whether they pick ECS, K3s, Fargate, EKS, or Nomad, the bigger question is still sitting underneath the whole discussion: why is the default assumption AWS?

Four bare metal servers get you a lot of machine now. Real cores. Real RAM. Real NVMe disks. Not EBS volumes with provisioned IOPS charges. Not a maze of VPCs, security groups, NAT gateways, and per-GB toll booths. Spend $1,200/mo on compute, then use the rest of the budget for backups, monitoring, a CDN, and a Vanta subscription to automate SOC 2 evidence collection.

Under budget. Room to breathe. More predictable. Less billing theater.

This is the part cloud discourse keeps skipping. We talk endlessly about orchestration layers and almost never about whether the substrate is worth the premium.

The AWS tax

AWS is not expensive because of the compute.

AWS is expensive because of everything around the compute. The networking. The logging. The storage IOPS. The NAT gateways. The data transfer fees designed to make sure you never leave.

I wrote a whole post breaking down the egress tax, and the numbers are genuinely absurd. 88% cost reduction moving the same traffic from AWS to OCI. That is not a rounding error. That is a business model built on you not reading your bill.

Take a very normal AWS setup trying to get from an $18,000/mo bill down to $12,600. Everyone starts hunting for instance rightsizing, reserved instances, savings plans, and Fargate tweaks. Fine. Do that. But then look at the line items nobody wants to talk about.

NAT gateway processing 15 TB at roughly $0.045/GB is $675/mo.

That is not compute. That is not storage. That is not delivering value to anyone. That is a toll booth between your private subnet and the internet.

CloudWatch ingesting 800 GB of logs at $0.50/GB is another $400/mo.

That is not observability. That is rent. You could run a Grafana stack on one of your own boxes and get better dashboards for the cost of the electricity.

Then add load balancer LCUs, cross-AZ traffic, EBS IOPS, public IPv4 charges, data transfer out, and whatever else the pricing page turned into a product surface this quarter.

This is the AWS tax. It is not one big line item. It is fifty small ones that each seem reasonable in isolation and add up to you spending $18,000/mo on infrastructure that a $3,000/mo setup can often handle better.

Let that sink in. $0.045 per gigabyte just to leave your private subnet.
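If you want to check the toll booth math yourself, here is a minimal sketch. The volumes, rates, and the $18,000 bill are the figures from this section. The helper name `monthly_cost` and the decimal-terabyte conversion are my assumptions for illustration, so plug in your own bill before drawing conclusions.

```python
# Usage-priced line items from the example above.
# GB_PER_TB assumes decimal terabytes (1 TB = 1,000 GB), which is what
# makes 15 TB at $0.045/GB come out to $675.
GB_PER_TB = 1_000


def monthly_cost(volume_gb: float, rate_per_gb: float) -> float:
    """Return the monthly charge for a line item billed per GB."""
    return volume_gb * rate_per_gb


nat_gateway = monthly_cost(15 * GB_PER_TB, 0.045)  # NAT gateway data processing
cloudwatch = monthly_cost(800, 0.50)               # CloudWatch log ingestion

toll_total = nat_gateway + cloudwatch
bill = 18_000  # total monthly bill in the example

print(f"NAT gateway: ${nat_gateway:,.2f}/mo")
print(f"CloudWatch:  ${cloudwatch:,.2f}/mo")
print(f"Combined:    ${toll_total:,.2f}/mo ({toll_total / bill:.1%} of the bill)")
```

Two line items, roughly 6% of the bill, and neither one is compute or storage. Add the rest of the list to the same script and the pattern gets clearer.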

Who pays the tax

This is the part that makes it more than a vendor rant.

The AWS tax does not stop at your AWS bill. It moves downstream. It becomes the $20/mo subscription for a note-taking app. The $30/mo subscription for dictation software. The per-seat charge on some internal tool your company barely uses but cannot quite cancel. The pile of small monthly charges everyone is getting tired of paying.

Not every subscription is expensive because of infrastructure. Payroll matters. Support matters. Product matters. Customer acquisition definitely matters.

But infrastructure sets a floor. If your product costs $18,000/mo to run when it could cost $3,000/mo, that $15,000 does not vanish. Early on, maybe investors pay for it. Later, customers pay for it. If they do not, your margin does.
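To make the floor concrete, here is a rough sketch of infrastructure cost per customer. The $18,000 and $3,000 figures are the ones above. The customer count is an assumption: I am borrowing the 500-customer number from the earlier startup example purely for illustration, and `infra_floor_per_customer` is a made-up helper, not anyone's pricing model.

```python
def infra_floor_per_customer(monthly_infra_cost: float, customers: int) -> float:
    """Return the infrastructure cost each customer has to cover per month."""
    if customers <= 0:
        raise ValueError("customers must be a positive number")
    return monthly_infra_cost / customers


CUSTOMERS = 500  # hypothetical customer count, not a measured figure

hyperscaler_floor = infra_floor_per_customer(18_000, CUSTOMERS)
lean_floor = infra_floor_per_customer(3_000, CUSTOMERS)

print(f"Floor at $18,000/mo: ${hyperscaler_floor:.2f} per customer per month")
print(f"Floor at $3,000/mo:  ${lean_floor:.2f} per customer per month")
print(f"Difference:          ${hyperscaler_floor - lean_floor:.2f} per customer per month")
```

Under that assumption, the difference is money every customer pays before payroll, support, or product cost anything.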

We have normalized a world where every tiny SaaS tool wants a credit card and a monthly subscription. Some of that is because software is valuable. Fine. I like paying for good software. But some of it is because teams built on cost structures where NAT gateways, managed logs, cross-AZ traffic, load balancer metrics, managed storage IOPS, and egress fees became part of the unit economics.

The cloud bill became the pricing model.

That is why this matters. Not because AWS is evil. Not because bare metal is romantic. Because inflated infrastructure costs get passed to real people and real businesses who are already drowning in subscriptions.

If you can run the same product for 75% less, you have choices. Lower prices. More runway. Better margins. More support. More engineering time spent on the product instead of the bill.

That is the why.

Bare metal is not nostalgia

The usual response is that bare metal is some kind of regression. Like the only two choices are modern cloud-native infrastructure or racking servers in a closet next to a dying UPS.

That is not the world anymore.

You can rent serious bare metal from companies that know what they are doing. You can get redundant power, decent networks, remote hands, private VLANs, IPMI, fast disks, and enough bandwidth that your cloud bill starts looking obscene. You can automate the whole thing with Terraform, Ansible, Talos, Kubernetes, Nomad, or whatever boring tool gets the job done.

Bare metal does not mean artisanal infrastructure. It means you stopped pretending every workload needs a hyperscaler attached to it.

Most SaaS infrastructure is boring. Web servers. Queues. Databases. Object storage. Logs. Metrics. Background jobs. CDN. Backups. Deployments. The work is important, but the shape is not exotic.

If your workload is steady-state, your database fits on machines you can afford, and your traffic is predictable enough to plan capacity, the hyperscaler premium deserves scrutiny. Not because AWS is bad at running infrastructure. AWS is very good at running infrastructure. That is not the question.

The question is whether they are earning the markup on your workload.

Reliability is still your problem

“But what about reliability? Failover? Disaster recovery?”

Fair. I am not going to hand-wave this away.

When you run Kubernetes at scale, you learn to respect failure modes. Hardware fails. Disks die. Network cards get weird. Datacenter power goes out. Humans push bad config at 2 AM and suddenly the thing that was “obviously safe” is taking production with it.

I’ve dealt with all of it.

Bare metal does not eliminate those problems. It makes them explicit. You need backups. You need restore drills. You need monitoring that tells you when replication is broken before the primary disappears. You need a real answer for RPO and RTO. You need to know which services are stateless, which ones are stateful, and what happens when the stateful ones catch fire.

But here is the part people miss: the public cloud does not eliminate those problems either. It abstracts them. You still get regional outages. You still get paged when a managed service misbehaves. You still need backups, restore drills, monitoring, failover plans, and someone who understands the architecture well enough to debug it when the abstraction leaks.

The difference is that when bare metal fails, you usually understand why. A disk died. A host died. A switch died. A route changed. When us-east-1 goes sideways, you are reading a status page that says “increased error rates” for six hours while your customers ask why the app is broken.

At a 75-80% cost reduction, you have budget to do reliability properly. Not vibes. Not whitepaper reliability. Actual reliability. More machines. More disks. Offsite backups. A second provider. Disaster recovery drills. Cold standby. Warm standby. Whatever your business actually needs.

Reliability is not something you buy once from a cloud provider and stop thinking about. It is a discipline. If you are going to be responsible for it either way, you should at least understand what you are paying for.

OCI is the middle ground

If bare metal feels like too big a leap, OCI is the obvious middle ground.

I have moved workloads from AWS to Oracle Cloud and seen 60%+ savings on the same architecture. Not magic. Not a special deal. Just a pricing model that is not designed to nickel-and-dime you on data transfer.

Flat-rate egress with a generous free tier. No NAT gateway processing fees. No cross-AZ surprise charges. The networking works and they do not charge you extra for the privilege.

OCI is not perfect. I have written about the sharp edges. IAM can be maddening. OCIDs are giant opaque strings stapled to everything. The console has improved but still has moments where you wonder if the product manager has ever met a human. The documentation is better than it was, but it still has gaps.

Three years ago, that was a serious reason to hesitate.

Today, it is a weaker argument.

OCI got better. The rough edges are smaller. The tooling is better. The docs are better. The platform is still more raw than AWS in places, but raw is not the same thing as unusable.

And AI changed the landscape.

You can ask a model to write the Terraform. You can ask it to explain the IAM policy. You can give it an AWS architecture and ask it to translate the pieces into OCI terms. You can paste a route table, a security list, a DRG attachment, or an error from the CLI and get a useful debugging path in minutes.

That does not make infrastructure easy. It does not replace judgment. It does not mean you should blindly apply whatever code falls out of a chat window.

But it absolutely reduces the discovery tax.

The thing that used to require a senior cloud engineer, three vendor calls, and two days of documentation spelunking is now often a 20-minute conversation with a model and a pricing page. That matters. The gap between “AWS is familiar” and “OCI is viable” got a lot smaller, and the bill did not.
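As a starting point for the kind of AWS-to-OCI translation described above, here is a small sketch of a lookup table. The mappings are rough equivalents, not exact ones, and the table is deliberately incomplete. The names `AWS_TO_OCI` and `translate` are mine, for illustration only. Verify each mapping against the current OCI documentation before building on it.

```python
# Approximate AWS -> OCI equivalents. These are starting points for a
# migration conversation, not guarantees of feature parity.
AWS_TO_OCI = {
    "VPC": "VCN (Virtual Cloud Network)",
    "Security Group": "Network Security Group (NSG)",
    "Transit Gateway": "Dynamic Routing Gateway (DRG)",
    "EC2": "Compute",
    "EBS": "Block Volume",
    "S3": "Object Storage",
    "EKS": "OKE",
}

UNMAPPED = "no mapping listed here, check by hand"


def translate(aws_components: list[str]) -> dict[str, str]:
    """Map each AWS component to its rough OCI counterpart, flagging gaps."""
    return {name: AWS_TO_OCI.get(name, UNMAPPED) for name in aws_components}


# A typical boring stack, plus one AWS-native service with no entry above.
stack = ["VPC", "EC2", "EBS", "S3", "DynamoDB"]
for aws_name, oci_name in translate(stack).items():
    print(f"{aws_name:10} -> {oci_name}")
```

The unmapped entries are the useful part. They tell you which pieces of your architecture need actual judgment instead of a rename.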

So yes, AWS still makes sense if your product is deeply tied to AWS-native services and ripping them out would be self-harm. If your whole architecture is DynamoDB, Lambda, SQS, EventBridge, IAM, and managed analytics, do not torch the business to prove a point.

But most companies are not that special.

Most companies are running a web app, a database, a queue, object storage, logs, metrics, and some workers. That stack does not automatically justify hyperscaler pricing.

Storage and CDN are the same story

The storage story is the same everywhere.

Everyone defaults to S3 because it is what they know. S3 is excellent. It is also not a law of physics.

Backblaze B2 is a quarter of the price with S3-compatible APIs. Wasabi is flat-rate with no egress fees. Cloudflare R2 exists for exactly the reason everyone hates paying to move their own data. These are not toy services. They are production services for companies that bothered to look at the bill.

CDN is the same. CloudFront is fine. Bunny.net is excellent. Cloudflare is excellent. Fastly is excellent if you need what Fastly does. The default does not need to be “whatever AWS happens to sell next to EC2.”

If you are on a provider in the Bandwidth Alliance, origin-to-edge traffic can be free. If your workload moves real data, that one sentence can change your architecture.

This is the broader point. Stop treating the hyperscaler bundle as inevitable. Compute from one place. Object storage from another. CDN somewhere else. Monitoring from the thing that does monitoring best. Payments from the thing that does payments best. There is no prize for putting every line item on the same invoice.

The prize is margin. Or lower prices. Or more runway. Or all three.

The narrative changed before

I have been in this industry long enough to watch the narrative shift a few times.

First, you needed your own data center. Then you needed the cloud. Then you needed Kubernetes on the cloud. Then you needed serverless on the cloud. Each shift came with real technical merit and a healthy dose of vendor marketing.

The cloud solved real problems. I am not pretending otherwise. It made provisioning faster. It made global infrastructure accessible. It gave small teams primitives that used to require serious capital and dedicated ops teams.

But the pendulum swung too far. We turned “cloud is useful” into “cloud is the default for everything,” and then we stopped doing the math.

That is the part I cannot get past.

If you are tiny and moving fast, the premium might be worth it. If managed services are the thing letting your team ship, pay the premium and ship. If your architecture genuinely benefits from hyperscaler primitives, use them.

But if you are spending five figures a month on ordinary infrastructure and your plan is to shave 30% off the bill with reserved instances, you are probably solving the wrong problem.

You are negotiating with the toll booth instead of asking why the toll road is mandatory.
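Here is a quick sketch of that comparison, using the numbers from earlier in this post: the $18,000/mo bill, the $12,600 target you get from a 30% optimization pass, and the $3,000/mo setup that can often handle the same workload. The $3,000 figure is my estimate for an ordinary workload, not a guarantee, and `savings` is a hypothetical helper.

```python
def savings(current: float, target: float) -> tuple[float, float]:
    """Return (monthly dollars saved, fraction of the current bill saved)."""
    saved = current - target
    return saved, saved / current


BASELINE = 18_000   # current monthly bill in the example
OPTIMIZED = 12_600  # after reserved instances, rightsizing, and so on
ALTERNATIVE = 3_000  # assumed cost on a different substrate

for label, target in [("Optimize the bill", OPTIMIZED),
                      ("Change the substrate", ALTERNATIVE)]:
    monthly, fraction = savings(BASELINE, target)
    print(f"{label:22} saves ${monthly:,.0f}/mo "
          f"(${monthly * 12:,.0f}/yr, {fraction:.0%})")
```

Both are real savings. One of them is a negotiation with the toll booth. The other is a different road.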

The actual question

I run Kubernetes. I like Kubernetes. This is not a Kubernetes post.

This is a substrate post.

The platform underneath your orchestration layer matters more than anyone wants to admit. The big three hyperscalers have gotten very good at making sure you never do the math on alternatives. They wrap the bill in enough complexity that every line item feels defensible and the total feels inevitable.

It is not inevitable.

Bare metal is viable. OCI is viable. Alternative object storage is viable. Alternative CDNs are viable. Hybrid architectures are viable. AI made the rough edges less scary. The tooling got better. The economics were already better.

Stop optimizing your AWS bill.

Start questioning whether you need it.