Validating that Cluster Costs are Justified

Introduction

Two of the biggest concerns with cloud costs are limited visibility into cloud spending and the risk of cost overruns caused by misconfigured resources or inefficient resource allocation.

RealTheory provides near-real-time cost visibility at every level of your Kubernetes environment, letting you drill down from the cluster level to each individual object. It can also show whether the resources allocated in your Kubernetes environment align with the resources your applications actually use.

Prerequisites

To perform this procedure you must have the following:

A user account in your organization's RealTheory account

Procedure

Log in to your organizational account using one of the following methods:
- SSO (single sign-on) - if configured for your organizational account
- Email address, password, and account name - if SSO is not configured
Note: If you are unsure how to complete this step, contact the system administrator for your RealTheory account.
Navigate to Reports > Total Cost.
Verify that the breadcrumb filter is set to Reports > Total Cost > Clusters.

This view will list the total cost of each cluster you are authorized to view.
By default, the period selector shows the cost for the last 7 days; you can change the time period to 24 hours, 31 days, or 1 year if the 7-day period is not appropriate.
Use the Cluster column to identify each cluster; the cluster at the top of the report is the highest-cost cluster for the selected time period.
Double-click the cluster row of most interest.
The report will list the namespaces from highest to lowest cost for the selected cluster:
The higher cost of the my-app-dev namespace combined with its lower Utilization % indicates that investigating this namespace could yield important insights.
Double-click the namespace row of most interest.

The report will list the deployments from highest to lowest cost for the selected cluster and namespace:
Focusing on the highest-cost deployments, review the Utilization % column for each.

Deployments with a higher cost and a lower utilization might indicate that the deployment is overallocated.
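The comparison of cost against utilization can also be sketched outside the report. The following is a minimal illustration, not RealTheory's own calculation: it assumes that non-utilized cost can be approximated as total cost multiplied by (1 - utilization), and the `CostRow` type, the function names, and the sample rows are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CostRow:
    name: str
    total_cost: float       # cost for the selected period, in account currency
    utilization_pct: float  # value from the Utilization % column (0-100)


def non_utilized_cost(row: CostRow) -> float:
    """Estimate the cost not backed by utilization (assumed formula)."""
    return row.total_cost * (1.0 - row.utilization_pct / 100.0)


def rank_by_non_utilized_cost(rows: list[CostRow]) -> list[CostRow]:
    """Return rows sorted so the largest estimated non-utilized cost is first."""
    return sorted(rows, key=non_utilized_cost, reverse=True)


# Hypothetical rows for illustration; not taken from any real report.
rows = [
    CostRow("checkout", total_cost=120.0, utilization_pct=80.0),
    CostRow("batch-report", total_cost=100.0, utilization_pct=35.0),
    CostRow("frontend", total_cost=40.0, utilization_pct=50.0),
]

for row in rank_by_non_utilized_cost(rows):
    print(f"{row.name}: {non_utilized_cost(row):.2f} estimated non-utilized cost")
```

In this sample, the deployment with the highest total cost is not the one with the highest estimated non-utilized cost, which is why cost and utilization are worth reading together.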
To gain more insight, click on the row for the object of most interest.

This will open the Information Panel and display the This Report tab, which shows the Total, Utilized, and Non-utilized costs associated with the selected object for the selected time period:
Click on the chart to open a full size version of the chart where you can interact with the different aspects of cost-related data as follows:
- Total Cost - Total, utilized, and non-utilized costs for the object:
- Utilization - Total, memory, and CPU utilization for the object:
- Efficiency - a cost-weighted analysis of the utilization of the object:
From these charts, we can see the following:
- CPU utilization is around 2%
- Memory utilization is around 65% - 70%
- Total utilization is around 35%
This results in an efficiency (cost-weighted utilization) of around 10% for the last 7 days.
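To see how a total utilization of around 35% and an efficiency of around 10% can both follow from the same CPU and memory figures, consider the sketch below. It assumes that total utilization is a simple average of CPU and memory utilization and that efficiency weights each by its share of cost; the exact formulas RealTheory uses are not stated here. The 88/12 cost split is a hypothetical value chosen only to illustrate the effect of CPU costing much more than memory.

```python
def simple_average_utilization(cpu_util_pct: float, mem_util_pct: float) -> float:
    """Unweighted mean of CPU and memory utilization (assumed definition)."""
    return (cpu_util_pct + mem_util_pct) / 2.0


def cost_weighted_utilization(
    cpu_util_pct: float,
    mem_util_pct: float,
    cpu_cost: float,
    mem_cost: float,
) -> float:
    """Utilization weighted by each resource's share of cost (assumed definition)."""
    total_cost = cpu_cost + mem_cost
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    return (cpu_util_pct * cpu_cost + mem_util_pct * mem_cost) / total_cost


# CPU ~2% and memory ~65-70% come from the charts; the cost split is hypothetical.
cpu_util, mem_util = 2.0, 67.5
cpu_cost, mem_cost = 88.0, 12.0

print(f"simple average: {simple_average_utilization(cpu_util, mem_util):.2f}%")
print(f"cost-weighted:  {cost_weighted_utilization(cpu_util, mem_util, cpu_cost, mem_cost):.2f}%")
```

Under these assumptions, the low CPU utilization dominates the cost-weighted figure because most of the spend is on CPU.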
You can use the other time periods to review historical data and identify usage patterns, peak loads, and times of underutilization to gain more insight.
The charts suggest that the workloads associated with this deployment are probably well provisioned on memory but possibly overprovisioned on CPU. Because the cost of CPU is significantly higher than the cost of memory, this is a possible reason why the cost of this deployment is high relative to its utilization.
Further investigation is likely warranted to confirm that the CPU is overprovisioned, but in under a minute you have gained valuable insight into why this cluster is the most expensive cluster in your environment.
You can examine other tabs in the Information Panel on the deployment report view to gain additional insights.
If the analysis of the deployment identifies an issue of interest, click the deployment name link in the Information Panel to view the deployment in the visualization:
In the visualization, double-click the deployment (it will be highlighted) to drill in to its workloads and change the perspective selector from Health to Cost to see which workloads might need more analysis:

In the pod visualization, we can see that all of the pods are under cost pressure.
By reviewing the metrics for just one pod in this deployment, we can see that CPU usage is around 10% most of the time, with one short spike of about 600% each day:
This could indicate that we are spending more than necessary to accommodate the spikes and that there are opportunities for cost savings through options such as the following:
- breaking the work performed during the spikes into smaller tasks that could be distributed more evenly throughout the day
- scaling resources dynamically using Kubernetes Horizontal Pod Autoscaling (HPA) to automatically accommodate increased demand during peak periods
- right-sizing resource requests and limits for the pod based on its actual resource requirements during peak and non-peak periods
- conducting thorough performance testing and profiling to identify any inefficiencies or bottlenecks within the application
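The usage pattern described above can be quantified with a short sketch. The hourly samples below are hypothetical and only mimic the shape seen in the pod metrics (about 10% most of the time, one spike of about 600% per day); the function name and the 100% burst threshold are assumptions made for illustration.

```python
import statistics


def summarize_cpu_usage(samples_pct: list[float], burst_threshold_pct: float = 100.0) -> dict:
    """Summarize CPU usage samples (in percent) to show how rare the bursts are."""
    if not samples_pct:
        raise ValueError("at least one sample is required")
    bursts = [s for s in samples_pct if s > burst_threshold_pct]
    return {
        "mean_pct": statistics.fmean(samples_pct),
        "median_pct": statistics.median(samples_pct),
        "peak_pct": max(samples_pct),
        "burst_fraction": len(bursts) / len(samples_pct),
    }


# Hypothetical data: 7 days of hourly samples, one spike per day.
one_day = [10.0] * 23 + [600.0]
week = one_day * 7

summary = summarize_cpu_usage(week)
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

A large gap between the median and the peak, combined with a small burst fraction, suggests that capacity sized for the peak sits idle most of the time, which is the situation the options above aim to address.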
If the analysis of the deployments did not identify any significant discrepancy between cost and utilization, continue your investigation by using the Reports > Total Cost > Deployments breadcrumb to select another object type of interest.
Typically, reviewing objects in the following order will quickly identify discrepancies between cost and utilization that might be contributing to higher than expected costs:
- ReplicaSets
  Note: You can also double-click the deployment row of most interest to list the ReplicaSets for the selected cluster, namespace, and deployment.
- StatefulSets
- DaemonSets
If, after reviewing the charts and the Information Panel tabs for these objects as you did for deployments, the utilization of each object is within an acceptable range, review the nodes to determine whether the high cluster costs are related to overprovisioned nodes.
In the breadcrumb, click Namespaces and then select Nodes.
Repeat the cost-to-utilization analysis for the highest cost node(s) in the cluster.
If node utilization is within an acceptable range, the cluster costs are justified and no immediate action is needed; continue monitoring.
If node utilization is not within an acceptable range, you might have too many nodes running or scheduling constraints that prevent pods from being scheduled to a node; consider the following:
- Review the scaling policies configured for your cluster and, if necessary, adjust the scaling thresholds and policies to better match the workload requirements
- Review the distribution of workloads across the nodes in the cluster and consider rebalancing workloads or adjusting pod scheduling and affinity configurations to distribute workloads more evenly across nodes
- Evaluate the resource requests and limits set for pods running in your cluster and adjust them as needed to prevent overprovisioning and to optimize resource utilization across nodes
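When evaluating requests across nodes, it can help to compare the CPU requested by the pods on each node with the CPU the node can allocate. The sketch below assumes you have already collected those values by some other means; the node names, the data, and the 50% threshold are hypothetical. It relies only on the standard Kubernetes CPU quantity notation, in which `500m` means 0.5 of a core. Note that it measures requests against allocatable capacity, not actual usage.

```python
def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as '500m' or '2' to cores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000.0  # millicores to cores
    return float(quantity)


def requested_fraction(pod_requests: list[str], allocatable: str) -> float:
    """Fraction of a node's allocatable CPU that is claimed by pod requests."""
    capacity = parse_cpu(allocatable)
    if capacity <= 0:
        raise ValueError("allocatable CPU must be positive")
    return sum(parse_cpu(q) for q in pod_requests) / capacity


# Hypothetical nodes and pod CPU requests, for illustration only.
nodes = {
    "node-a": {"allocatable": "4", "requests": ["500m", "1", "250m"]},
    "node-b": {"allocatable": "4", "requests": ["2", "1500m"]},
}

LOW_REQUEST_THRESHOLD = 0.5  # assumed cutoff; choose one that fits your environment

for name, info in nodes.items():
    fraction = requested_fraction(info["requests"], info["allocatable"])
    note = "lightly requested" if fraction < LOW_REQUEST_THRESHOLD else "ok"
    print(f"{name}: {fraction:.0%} of allocatable CPU requested ({note})")
```

Nodes where only a small fraction of allocatable CPU is requested are candidates for consolidation, subject to the scheduling and affinity constraints noted above.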