How we cut our AWS bill by $150,000: Proven strategies for cost optimization

Arun S

Engineering

Introduction

As organizations increasingly move to the cloud, managing and optimizing costs becomes a priority. To meet this growing demand, Amazon Web Services (AWS) offers businesses a wide range of services and tools that help reduce infrastructure costs while maintaining performance and scalability.

At Xflow, we’ve implemented several strategies for saving costs on AWS, focusing on components such as CloudWatch, RDS, EC2, EKS, MSK (managed Kafka), Kinesis Data Analytics, and more.

In this blog, we’ll share the best practices and methods that helped us cut our monthly AWS bill from ~$25,000 to ~$11,500: a saving of ~$12,500 per month after adjusting for additional costs, or ~$150,000 per year.


Xflow: Our engineering architecture

Before we dive into the specifics of how we cut costs, here’s a peek into Xflow’s engineering architecture and where we focused our cost-saving efforts:

Key points:

  • APIs are built using API Gateway
  • Requests from API Gateway are routed to EKS clusters
  • Transaction data is read from and written to MySQL
  • Search data is read from and written to OpenSearch
  • Data is propagated from MySQL to OpenSearch using Debezium with Kafka Connect for change data capture
  • Messages are stored in Kafka
  • Messages are processed using Flink
  • Kafka and Flink are used for asynchronous messaging between systems such as Risk and Transaction Monitoring
  • A combination of CloudWatch and New Relic is used as our observability platform


Now, let’s explore the specific strategies that helped us reduce our AWS expenses.


CloudWatch

We were ingesting ~16,000 metrics and ~1,500 GB of log data per month, and the CloudWatch bill for this came to ~$4,700. Given how many components we ran and how critical they were, we did not want to sacrifice either the number of metrics or the volume of logs. In fact, we expected both to grow, since APM was also on our roadmap.

So we started looking at more cost-effective alternatives. Comparing New Relic with CloudWatch, we found that New Relic offered more value at a much lower cost, with a more transparent cost structure: CloudWatch billing is spread across many individual cost components.
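
To see where a CloudWatch bill actually goes, it helps to break it down by usage type. Below is a minimal sketch of that breakdown using the Cost Explorer API via boto3; the date range is illustrative, and the caller needs Cost Explorer read permissions.

```python
# Break down CloudWatch spend by usage type (metrics, log ingestion,
# dashboards, alarms, ...) with the Cost Explorer API. Dates are
# illustrative placeholders.
import boto3

ce = boto3.client("ce")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-02-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    Filter={"Dimensions": {"Key": "SERVICE", "Values": ["AmazonCloudWatch"]}},
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
)

# Print each cost component for the month, highest first.
for period in response["ResultsByTime"]:
    groups = sorted(
        period["Groups"],
        key=lambda g: float(g["Metrics"]["UnblendedCost"]["Amount"]),
        reverse=True,
    )
    for group in groups:
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        print(f"{group['Keys'][0]}: ${amount:,.2f}")
```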

Upon moving to New Relic, we selected the Standard plan, limited to 5 full-access users. This worked for us since our team had only 15 members, and only a few needed full access to features like APM (Application Performance Monitoring). The rest of the team used basic access, which came at no additional cost.

The table below shows a cost comparison between CloudWatch and New Relic:

Comparison           CloudWatch (per month)   New Relic (per month)
Metrics Ingestion    $3,700                   $525
Logs Ingestion       $1,000                   $300
User Cost            $0                       $406
Total Cost           $4,700                   $1,231

Another advantage of New Relic was the $100,000 in credits we received for the first 12 months. We didn’t come close to exhausting them; we used less than 25%, which reduced our effective cost even further.

Additional notes:

  • At scale, CloudWatch may work out cheaper, since its pricing is tiered on ingestion volume and the unit cost drops significantly at higher tiers
  • We chose New Relic, but comparable observability platforms such as SigNoz also worked out to be in a similar cost range

RDS

Like most startups on AWS, we use RDS, specifically MySQL, and it is a significant contributor to our AWS bill. We took the following steps to reduce the cost:

  1. Downsizing instances

For most workloads, we had initially provisioned 2xlarge instances. After assessing performance against our actual requirements, we realized we could switch to xlarge in many cases without issues, and we eventually eliminated 2xlarge instances entirely. Even though we received plenty of alerts and query timeouts right after switching, most workloads ran fine on the smaller size. We were also using xlarge instances in a number of places where they weren’t needed, and we have since moved most of those workloads to large instances. Per month, this saved ~$960 for the 2xlarge-to-xlarge move, ~$280 for 2xlarge-to-large, and ~$150 for xlarge-to-large.
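
As a rough illustration of the utilization check behind a downsizing decision, here is a minimal sketch using boto3 and CloudWatch’s standard RDS metrics. The instance name and the 50% threshold are placeholders, not our actual values.

```python
# Pull two weeks of CPUUtilization for one RDS instance and flag it as
# a downsizing candidate if it never peaks past 50%.
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "prod-mysql-1"}],
    StartTime=now - timedelta(days=14),
    EndTime=now,
    Period=3600,  # hourly datapoints
    Statistics=["Average", "Maximum"],
)

datapoints = stats["Datapoints"]
if datapoints:
    avg = sum(d["Average"] for d in datapoints) / len(datapoints)
    peak = max(d["Maximum"] for d in datapoints)
    print(f"avg={avg:.1f}% peak={peak:.1f}%")
    # An instance that never peaks past ~50% can likely run one size
    # down and still keep some headroom.
    if peak < 50:
        print("Candidate for downsizing one instance size")
```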

  2. Moving to latest-generation instances

We were using m5 instance types. We soon realized that the Graviton-based m6g instances were cheaper by ~$0.02 per hour for large and ~$0.15 per hour for xlarge. With ~10 large instances and ~2 xlarge instances, our net monthly savings came to ~$100 for large and ~$206 for xlarge.
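
The switch itself is a one-line instance-class change. A hedged sketch with boto3 follows; the instance identifier is a placeholder, and since m6g is ARM-based, engine-version compatibility should be checked first.

```python
# Change an RDS instance class from m5 to m6g. With ApplyImmediately
# left False, RDS applies the change in the next maintenance window.
import boto3

rds = boto3.client("rds")
rds.modify_db_instance(
    DBInstanceIdentifier="prod-mysql-1",  # hypothetical instance name
    DBInstanceClass="db.m6g.large",       # was db.m5.large
    ApplyImmediately=False,               # wait for the maintenance window
)
```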

  3. Removing idle instances

We identified idle instances that were running but not being used; removing them saved us ~$180 per month.
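
One way to spot such instances, sketched below under the assumption that “idle” means no client connections for two weeks, is to scan the DatabaseConnections metric for every instance in the account.

```python
# Flag RDS instances with zero DatabaseConnections over 14 days as
# removal candidates.
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
rds = boto3.client("rds")
now = datetime.now(timezone.utc)

for db in rds.describe_db_instances()["DBInstances"]:
    name = db["DBInstanceIdentifier"]
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="DatabaseConnections",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": name}],
        StartTime=now - timedelta(days=14),
        EndTime=now,
        Period=86400,  # daily datapoints
        Statistics=["Maximum"],
    )
    peak = max((d["Maximum"] for d in stats["Datapoints"]), default=0)
    if peak == 0:
        print(f"{name}: no connections in 14 days -- candidate for removal")
```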

  4. Purchasing Reserved Instances

Purchasing Reserved Instances helped us lower costs significantly. AWS offers several options here on commitment period and payment terms. We opted for a 1-year term with no upfront payment: on the commitment side, we wanted the flexibility to switch easily if our requirements changed during or after the year, and on the payment side, the extra discount for paying upfront was not significant enough to justify it. With the 1-year, no-upfront commitment, we saved ~$525 per month for large instances and ~$525 for xlarge instances.
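
For reference, here is a sketch of how a matching 1-year, no-upfront offering can be looked up via boto3 before purchase; the class and engine mirror our setup, but treat the snippet as illustrative.

```python
# List 1-year, no-upfront RI offerings for db.m6g.large MySQL. The
# returned offering ID is what gets passed to
# rds.purchase_reserved_db_instances_offering(...).
import boto3

rds = boto3.client("rds")
offerings = rds.describe_reserved_db_instances_offerings(
    DBInstanceClass="db.m6g.large",
    Duration="31536000",        # 1 year, in seconds
    OfferingType="No Upfront",
    ProductDescription="mysql",
    MultiAZ=False,
)

for o in offerings["ReservedDBInstancesOfferings"]:
    hourly = (o["RecurringCharges"][0]["RecurringChargeAmount"]
              if o["RecurringCharges"] else 0.0)
    print(o["ReservedDBInstancesOfferingId"],
          f"upfront=${o['FixedPrice']:,.2f}",
          f"hourly=${hourly}")
```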

Summary of savings:

Action                                        Savings (per month)
Downsizing 2xlarge to xlarge                  ~$960
Downsizing 2xlarge to large                   ~$280
Downsizing xlarge to large                    ~$150
Latest generation: m5.large to m6g.large      ~$100
Latest generation: m5.xlarge to m6g.xlarge    ~$206
Removing idle instances                       ~$180
Reserved Instances: m6g.large                 ~$525
Reserved Instances: m6g.xlarge                ~$525
Total savings                                 ~$2,900

Kinesis Data Analytics

We use Kafka as our message broker and Flink for message processing. Since we did not want to take on the burden of deploying and managing Flink clusters ourselves, we had decided to use the AWS managed offering, Kinesis Data Analytics (KDA). We also ran multiple independent deployments segregated by repository, which required multiple KDA deployments.

In total, we had ~22 pipelines, each priced at ~$0.256 per hour, amounting to ~$4,200 per month. We then considered deploying Flink ourselves using the Flink Kubernetes Operator on our EKS cluster, and we managed to move the entire KDA workload there: 3 t3a.2xlarge spot instances in our test environment and 3 t3a.2xlarge instances under a savings plan in production. Our cost came down to ~$500, a saving of ~$3,700!
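
To give a feel for the operator route, here is a hedged sketch of what one migrated pipeline looks like: each KDA application becomes a FlinkDeployment custom resource on the cluster. The names, image, resources, and jar path are illustrative, and the sketch assumes the Flink Kubernetes Operator is already installed.

```python
# Create a FlinkDeployment custom resource via the Kubernetes Python
# client. The operator then spins up the JobManager/TaskManager pods
# and submits the job jar.
from kubernetes import client, config

config.load_kube_config()

flink_deployment = {
    "apiVersion": "flink.apache.org/v1beta1",
    "kind": "FlinkDeployment",
    "metadata": {"name": "risk-pipeline", "namespace": "flink"},
    "spec": {
        "image": "flink:1.17",
        "flinkVersion": "v1_17",
        "serviceAccount": "flink",
        "jobManager": {"resource": {"cpu": 0.5, "memory": "1024m"}},
        "taskManager": {"resource": {"cpu": 1, "memory": "2048m"}},
        "job": {
            "jarURI": "local:///opt/flink/usrlib/risk-pipeline.jar",
            "parallelism": 2,
            "upgradeMode": "savepoint",  # restore from savepoint on upgrades
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="flink.apache.org",
    version="v1beta1",
    namespace="flink",
    plural="flinkdeployments",
    body=flink_deployment,
)
```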


EC2

  1. Right-sizing pods

We analyzed CPU utilization for our Kubernetes workloads on EKS and found that many EC2 instances were underutilized. We standardized the CPU and memory allocations (requests and limits) for our Java and Node.js services, leading to a more balanced distribution of pods across nodes.
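
As one piece of that review, the sketch below sums the CPU requests of the pods scheduled on each node to show how unevenly capacity was reserved; actual usage figures would come from an observability stack, so treat this as a minimal illustration.

```python
# Sum per-node CPU requests across all pods to spot imbalanced or
# over-reserved nodes.
from collections import defaultdict
from kubernetes import client, config

def to_millicores(cpu: str) -> int:
    # Kubernetes CPU quantities are millicores ("250m") or cores ("2").
    return int(cpu[:-1]) if cpu.endswith("m") else int(float(cpu) * 1000)

config.load_kube_config()
v1 = client.CoreV1Api()

requested = defaultdict(int)
for pod in v1.list_pod_for_all_namespaces().items:
    if not pod.spec.node_name:
        continue  # skip pods that are not scheduled yet
    for container in pod.spec.containers:
        res = container.resources.requests or {}
        if "cpu" in res:
            requested[pod.spec.node_name] += to_millicores(res["cpu"])

for node, millicores in sorted(requested.items()):
    print(f"{node}: {millicores}m CPU requested")
```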


  2. Right-sizing instances

After optimizing pod configurations, we reviewed the EC2 instances themselves. We had been using the c5 family (a 1:2 vCPU-to-memory ratio) and switched to the t3a family for its 1:4 ratio, which matched our Java-based workload requirements better. The change reduced costs by nearly 50%, and the burstable nature of t3a fit our traffic patterns well.


  3. Savings plan

We opted for a 1-year, no-upfront savings plan for all our compute requirements, similar to our approach with RDS.


  4. Spot instances

Our clusters serving user data are small, with most services running a single pod per availability zone, so a Spot interruption could take a service out entirely. We therefore use Spot Instances only in our testing environment and for asynchronous message processing, where interruptions are tolerable.
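
For completeness, here is a hedged sketch of how a Spot node group for the test cluster can be created through the EKS API; the cluster name, subnets, and role ARN are placeholders.

```python
# Create an EKS managed node group backed by Spot capacity.
import boto3

eks = boto3.client("eks")
eks.create_nodegroup(
    clusterName="test-cluster",
    nodegroupName="async-workers-spot",
    capacityType="SPOT",                  # vs. "ON_DEMAND"
    instanceTypes=["t3a.2xlarge"],
    subnets=["subnet-aaa", "subnet-bbb"],  # placeholder subnet IDs
    nodeRole="arn:aws:iam::123456789012:role/eks-node-role",
    scalingConfig={"minSize": 1, "maxSize": 6, "desiredSize": 3},
)
```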


Overall, these optimizations helped us save ~$2,800.


Kafka vs SQS

We use Kafka for messaging and rely on the AWS managed service (MSK) to reduce operational overhead. While SQS is not a perfect replacement for Kafka, it is a cheaper alternative; the cost difference is significant enough that self-hosting Kafka on our Kubernetes cluster is the only way to achieve comparable costs.

Switching to SQS could save us ~$1,100 per month, but we preferred Kafka due to our familiarity with it and how deeply it is integrated into our architecture. We’re currently exploring self-hosting options to reduce costs further.
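
The comparison comes down to pricing models: SQS charges per request, while MSK charges per broker-hour plus storage regardless of message volume. A back-of-the-envelope sketch of that trade-off follows; the traffic numbers and unit prices are illustrative placeholders, not our actual figures, so plug in current pricing for your region.

```python
# Rough monthly cost model: request-priced SQS vs. capacity-priced MSK.
MESSAGES_PER_MONTH = 500_000_000

# SQS: each message costs at least one send and one receive request.
SQS_PRICE_PER_MILLION = 0.40  # illustrative unit price
sqs_monthly = (MESSAGES_PER_MONTH * 2 / 1_000_000) * SQS_PRICE_PER_MILLION

# MSK: broker-hours plus storage, independent of message count.
BROKERS = 3
BROKER_HOURLY = 0.21          # illustrative (kafka.m5.large-class)
STORAGE_GB, STORAGE_GB_MONTH = 1000, 0.10
msk_monthly = BROKERS * BROKER_HOURLY * 730 + STORAGE_GB * STORAGE_GB_MONTH

print(f"SQS: ${sqs_monthly:,.0f}/month, MSK: ${msk_monthly:,.0f}/month")
```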


Best practices to save costs on cloud-based services

  • Upgrade to the latest generation instances whenever possible, as they are often cheaper
  • Avoid running EKS and similar clusters on versions nearing end-of-life in extended support mode, which is billed at a premium
  • Optimize network costs by using Internet Gateways instead of NAT Gateways for public endpoint access
  • Self-host services wherever operationally feasible
  • Track costs per component regularly, on a weekly, biweekly, or monthly basis, to monitor trends and understand any changes (see the sketch after this list)
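
For the last point, a minimal sketch of per-component tracking with the Cost Explorer API is shown below; the weekly date range is illustrative.

```python
# Group a week of spend by service and print the biggest line items.
import boto3

ce = boto3.client("ce")
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-01-08"},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

totals = {}
for period in response["ResultsByTime"]:
    for group in period["Groups"]:
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        totals[group["Keys"][0]] = totals.get(group["Keys"][0], 0) + amount

for service, amount in sorted(totals.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{service}: ${amount:,.2f}")
```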

In a future post, we will discuss some of our ongoing cost control measures and their impact on our cloud bill.