Clouds are attractive for their flexibility. Need a powerful computing cluster for eight hours? Rent it in three clicks, finish the task, and shut the machines down. Unfortunately, many people misunderstand this ideology of cloud resources and are disappointed when they see the bill at the end of the month.
To optimize costs, you need to start by collecting good statistics. I will briefly describe the appropriate tools for this.
The key principle of saving is to turn off everything unnecessary and minimize the reserve as much as possible. Stop thinking in the paradigm of a local server in the company's infrastructure. If you can automate the expansion and shutdown of cloud resources depending on the load, even better.
We will consider situations where fixed tariffs are more profitable and where the pay-as-you-go (PAYG) concept wins, plus what excess you can turn off and where resources are most often wasted. Let's go through the main types of resources: CPU, RAM, virtual disks, networks, and backups.
Collecting information
Before taking any optimization actions, you first need to understand how your machines are loaded. So the right place to start is a monitoring system, in order to understand the nature of the load precisely. Moreover, the load should be analyzed at different time scales. Monthly averages are useful for judging the overall degree of over-reservation or resource shortage, and for predicting roughly when current capacity will stop being enough if the load is growing steadily. Data over several days reveals diurnal fluctuations, usually tied to the life cycles of time zones, as well as sharp single bursts when resources run short.
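As a minimal sketch of this two-scale analysis (assuming you can export per-minute CPU samples from your monitoring system as a list of percentages), the same data can be averaged over the whole period and inspected for daily peaks:

```python
# Sketch: analyzing the same CPU samples at two scales.
# `samples` is assumed to be one value per minute, in percent.

def monthly_average(samples):
    """Overall average: useful for judging reservation headroom."""
    return sum(samples) / len(samples)

def daily_peaks(samples, per_day=1440):
    """Peak per day: reveals diurnal bursts that the average hides."""
    return [max(samples[i:i + per_day])
            for i in range(0, len(samples), per_day)]

load = [20] * 1440 + [20] * 1300 + [95] * 140  # two days, one burst
print(round(monthly_average(load), 1))  # 23.6 -- a modest average...
print(daily_peaks(load))                # [20, 95] -- but day two spiked
```

A low average combined with high daily peaks is exactly the profile where a fixed tariff sized to the average would fail and PAYG starts to look attractive.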
How to monitor
There is a huge number of data collection and visualization systems to choose from, but I would like to highlight the key ones:
- Zabbix is the good old Swiss Army knife of monitoring systems. The graphs are poor and ugly, for my taste.
- Kibana is part of the ELK stack. It is not suitable for every type of data, but it often lets you build very complex visualizations, including cartographic ones.
- Grafana is a very flexible tool that lets you intuitively build beautiful dashboards. To me, it offers the widest variety of graph types.
- The cloud vendor's own panels. They are often forgotten, but they can give you a unique low-level picture of your machines' resource consumption.
How to optimize the tariff
The cloud itself is designed to save money, but you can optimize costs further by consuming cloud resources properly. Start by estimating how uniform the load on your machines is. If a virtual machine is loaded steadily, everything is fine, and a tariff with fixed limits and a small power reserve will be the most beneficial.
During peak load only one core was involved, so you can save by not paying for a second one.
You can take two cores instead of four without any loss of performance.
If the VM is idle more than 10% of the time, you should consider the PAYG model. But then the machine must actually be turned off when not in use. For example, once a day a node turns on, builds the project, and turns off again.
All cores are evenly loaded. Most likely, a fixed-payment tariff will be advantageous.
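The 10% idle rule above can be turned into a small helper. This is a sketch, assuming CPU samples exported from monitoring and an arbitrary idle threshold of 5% CPU:

```python
IDLE_THRESHOLD = 5.0  # percent CPU below which the VM counts as idle (assumption)

def idle_fraction(samples, threshold=IDLE_THRESHOLD):
    """Fraction of samples where the VM was effectively doing nothing."""
    return sum(1 for s in samples if s < threshold) / len(samples)

def suggest_tariff(samples):
    """More than 10% idle time -> PAYG with power-off; otherwise fixed."""
    return "PAYG" if idle_fraction(samples) > 0.10 else "fixed"

print(suggest_tariff([1] * 30 + [60] * 70))  # idle 30% of the time -> PAYG
print(suggest_tariff([55] * 100))            # steadily loaded -> fixed
```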
Usually, the tactics of financial optimization look like this: divide the elements of your infrastructure into those with stable consumption and those with unstable, peaky loads. Move the elements with constant consumption to a fixed tariff and the unstable ones to PAYG, so that single bursts can be smoothed out inexpensively.
We do the same with test and experimental machines. Most often it is more profitable to pay for them based on consumed resources rather than a fixed tariff. If you find a machine that consumes 1% of its allocated capacity, turn it off and transfer it to PAYG.
You need to understand that in the cloud you pay not for the load on the processor but for the very fact of using cores. If you are on a PAYG plan and do not need a VM at the moment (for example, you have run the tests you needed and no longer use it), it is more logical to turn it off and save on the cost of virtual memory and processors.
Some clients use the vCloud Director API to turn VMs on and off on a schedule, saving even more on consumed resources. An even better approach is to use orchestration to control the cloud, turning nodes on and off as the load changes.
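A scheduled power toggle could be sketched as below. The `powerOn`/`powerOff` action names follow the vCloud Director REST API, but treat the exact URL layout, the schedule format, and the authentication header as assumptions to verify against your provider's documentation:

```python
import datetime

# Hypothetical schedule: the build node runs from 02:00 to 03:00 daily.
SCHEDULE = {"on_hour": 2, "off_hour": 3}

def desired_action(now, schedule=SCHEDULE):
    """Return 'powerOn', 'powerOff', or None for the current hour."""
    if now.hour == schedule["on_hour"]:
        return "powerOn"
    if now.hour == schedule["off_hour"]:
        return "powerOff"
    return None

def power_url(base, vapp_id, action):
    # vCloud Director-style power action endpoint (verify for your version).
    return f"{base}/api/vApp/{vapp_id}/power/action/{action}"

action = desired_action(datetime.datetime(2020, 1, 1, 2, 15))
if action:
    url = power_url("https://vcd.example.com", "vapp-42", action)
    # requests.post(url, headers={"x-vcloud-authorization": token})  # sketch only
    print(url)
```

Run from cron or any scheduler, such a script implements the "turn on, build, turn off" cycle from the example above without anyone logging into the panel.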
At the same time, you still pay for virtual disks either way. If you no longer need a VM, it is better to remove it. If you use it very rarely, move it to cheaper storage while it is off. You can also save on virtual disks by deploying a node from a ready-made template each time instead of keeping a finished VM turned off, and by keeping the templates themselves in cheap storage.
Do not allocate more memory to a VM than is required. Very often cloud users configure VMs empirically: "Well, throw in about this much memory and that many cores." Yet RAM is usually the most expensive resource in the cloud.
With a PAYG plan, it makes sense to configure the VM so that it optimally utilizes the resources allocated to it, keeping a performance margin for peak loads. That reserve should not be x4 or x10 of your application's average consumption: that is simply irrationally expensive. Limits are always determined individually for each task, but most often you should aim for a reserve of no more than 25%.
Another important resource is virtual disks. There are several types of virtual disks that differ in speed and price; as you would expect, the faster the drive, the more expensive it is. Therefore, saving on this parameter should begin with analyzing your data and separating it into "cold" and "hot".
The wrong option. The VM allocation policy is configured so that the VM's swap, configuration files, and disks are placed on expensive storage unless you explicitly specify otherwise.
A VM can simultaneously use different types of disks placed on fast and slow storage. Accordingly, if you store data archives or logs, it makes sense to place them on slow disks. Databases are critical to IOPS, so we send them to fast storage.
The correct option. Each data type gets its own disk speed.
Now we need to decide which disk to place the VM itself on. The vCloud Director platform has a peculiarity: at launch it reserves disk space equal to the VM's RAM, and you pre-select the storage type for it. The OS itself is not too IOPS-critical: most components are loaded into RAM, and the disk sits idle. However, if you economize and place it on a slow disk, you will get long reboots and slow wake-ups from sleep mode due to the low read speed of the swap and configuration files.
Consider this factor if it is critical for you to bring the virtual machine back up quickly after a reboot. In other cases, you can save here.
Disks for the database
Very often, database disks are taken "for growth." This is a typical mistake in cloud systems. If you need 60 GB, take that much plus a small margin. A notional 200 GB of fast disk will just burn money while sitting idle.
Unlike a physical server, there is no problem gradually expanding the disk as necessary. Just do not forget to set up monitoring with disk-overflow triggers so as not to miss the moment when it is time to expand. If you want to do it really nicely, you can automatically grow the space as it fills up through the API.
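The trigger logic behind such auto-expansion is simple. In this sketch the 80% trigger, the 10 GB step, and the 500 GB cap are all assumptions to tune per workload:

```python
def next_disk_size(used_gb, total_gb, trigger=0.8, step_gb=10, max_gb=500):
    """Return a new disk size if usage crossed the trigger, else None.
    The threshold, step, and cap are illustrative assumptions.
    Remember: most platforms only grow disks, never shrink them."""
    if used_gb / total_gb < trigger:
        return None
    return min(total_gb + step_gb, max_gb)

print(next_disk_size(50, 60))  # 83% full -> propose growing to 70 GB
print(next_disk_size(30, 60))  # 50% full -> None, leave as is
```

The actual resize call would go through your provider's API once `next_disk_size` returns a value; the cap exists so a runaway log cannot silently inflate your bill.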
The only minus of this approach is that you cannot shrink disks: changes go only toward expansion. If you still need to reduce a disk, the procedure involves transferring the data to a smaller volume and deleting the old disk. Do not forget about backups and thorough testing at this stage.
Do not forget to dump logs onto separate slow storage so they do not eat precious space on the fast database disk. Logs tend to absorb space very quickly, especially in the case of MS SQL. It is also highly desirable to exclude them from regular backup images.
We do the same with the network: do not chase round numbers like 1, 10, or 100 megabits. If your average channel load is around 40 Mbit/s, take 45. Take exactly as many resources as you need.
Check whether you can control this parameter yourself. With us, the client cannot change it on his own, so pay-as-you-go does not apply here and the channel width can be changed once a month. However, there are clouds where this setting is adjusted in real time.
A cloud based on vCloud Director has a great feature: NSX Edge, a virtual router. It is free and can replace more expensive solutions that require additional capacity and licenses. It includes a load balancer that, in simple scenarios, can stand in for solutions such as HAProxy or the Citrix NetScaler virtual appliance. There is no need to buy licenses for commercial products, and you do not pay for NSX Edge resources: it is included by default.
If necessary, NSX Edge can be scaled. It can also work as a VPN of three types:
- IPsec Site-to-Site VPN tunnel, for organizing a secure channel between the cloud and an office or other clouds hosting the client's resources.
- SSL VPN, for client access from mobile devices and personal computers, avoiding more expensive solutions like Check Point, Cisco, or Fortigate.
- L2 VPN, the same as IPsec Site-to-Site, but connecting at the L2 level.
There are a few more ways to save here.
First, carefully select the optimal processor type in terms of frequency and number of cores. Everything depends on the licensing features of the software that runs on the machine. For example, an AD server or terminal server is convenient to keep on hypervisors with many mid-range cores. For servers running software licensed per core, the frequency of a single core becomes important. Such machines are more expensive, but more profitable in terms of performance per core.
It is not at all necessary to use the same processors on all of your machines. On the contrary, a combination of instances with core frequencies of, say, 2.4 GHz and 3.1 GHz may be more economically attractive.
Hybrid architectures can also be used. If you have your own infrastructure and do not want to move everything, a mixed setup works well: part runs in the office, part in the cloud, and cloud resources are taken only to smooth out the peak load on your own infrastructure.
Please note: you can reduce RAM and CPU only when the VM is turned off, otherwise the system might panic. You can expand resources on a running machine if the OS supports it and hot-add mode is enabled.
When using a backup service, the client pays for two things: the number of VMs and the amount of data on disk. Again, we start with analysis.
Think about whether you really need year-old copies of Active Directory, or whether one to two weeks of daily incremental backups is enough. Decide how long corrupted data may go undetected, and configure backup retention accordingly. For worker machines that only run a service, it usually makes no sense to store backups longer than a week. For a database, you may need to restore from a copy six months old if someone corrupts the data and it goes unnoticed for a while.
The second point is recovery speed. If the service has redundancy, you can limit yourself to a full backup plus incremental copies. In that case, restoring means taking the last full copy and then replaying all the changes from the incremental copies step by step, which takes a long time. But with working backup nodes, the service keeps functioning while one of the nodes is restored.
If for some reason there is no redundancy, consider storing a current full backup without increments. It eats more space, but it restores much faster and minimizes downtime. The difference in speed can be one and a half to two times.
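A toy model makes the trade-off concrete. All the rates and sizes below are illustrative assumptions, not measurements:

```python
def restore_time(full_gb, increments, restore_rate_gbph=100, apply_rate_gbph=40):
    """Toy model: restore the full copy, then replay each incremental.
    Replaying increments is assumed slower per GB than a bulk restore."""
    hours = full_gb / restore_rate_gbph
    hours += sum(inc / apply_rate_gbph for inc in increments)
    return hours

full_only = restore_time(200, [])           # single current full backup
full_plus_chain = restore_time(200, [10] * 6)  # full + six daily increments
print(round(full_only, 2), round(full_plus_chain, 2))  # 2.0 vs 3.5 hours
```

With these assumed rates, the incremental chain takes 1.75x as long to restore, which is in line with the one-and-a-half to two times difference mentioned above; the full-only scheme trades that time for extra storage.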
If you need to keep backups for a long time, consider cold storage: usually tape libraries with slow recovery. This option is ideal for archival data. A good example is Amazon Glacier, where deployment from an old backup starts three to five hours after the request: roughly speaking, your data has to be physically found, taken off the shelf, and read. But the cost per gigabyte of storage becomes much more attractive.
We can use facility storage on the "Single" tariff. It has a low storage cost with a progressive discount: the more data you store, the cheaper each gigabyte becomes.
You can save on cloud resources only if you carefully analyze the load and minimize reserve capacity. Start by collecting high-quality statistics, then smoothly trim resources down to the necessary minimum.
If a node's consumption is stable and does not change, move that task to your own infrastructure or choose a fixed tariff.
Do not forget about hybrid schemes: using cloud resources to smooth out irregular peak loads fits almost perfectly into the pay-as-you-go concept.