I'm deploying my cluster on the Google Cloud Kubernetes service. It already has a few nodes. I also need a GPU server from Google Cloud to work with my cluster. The GPU instance continuously processes incoming traffic (bandwidth should be up to 1 Gb/s) and sends the results to the cluster nodes (this outgoing bandwidth should be even higher than the incoming bandwidth).
The most critical things for me in the project:
1) bandwidth between the nodes inside the cluster;
2) bandwidth between a node and the GPU server;
3) bandwidth between the GPU server and the world;
4) bandwidth between a node and the world.
The minimum acceptable bandwidth for each node is 1 Gb/s for both download and upload. When I run speed tests, they show download speeds of 100-680 Mb/s and upload speeds of 67-138 Mb/s for the same node at the same time (the screenshots below were taken 30 seconds apart). So the current bandwidth is too low and unstable. I need stable bandwidth of at least 1 Gb/s.
I tried to find a technical specification or pricing for bandwidth in the Google Cloud docs. But the technical specifications cover only CPU/GPU/RAM/disk, not bandwidth, and the only pricing in the docs is for traffic per month.
TL;DR:
How can I get a stable 1 Gb/s or more of bandwidth for each of the cluster nodes, the GPU instance, and any other Google Cloud virtual machine? Is there a Google Cloud service that provides more than 1 Gb/s of bandwidth? Is there a Google Cloud solution or service for handling heavy Internet traffic?
P.S. The speed tests were run with:
npx speedo-cli
I fear that you can't get any bandwidth commitments on shared infrastructure. If you have (a lot of) cash, using sole-tenant nodes[1] with all the parts of your architecture on the same tenant can help eliminate interference from other customers' workloads. But even in this case, there is no commitment on network bandwidth. And, for now, GPUs aren't supported with this solution.
Since Aleksi's answer there have been some changes to the per-VM egress cap/throttle. It is still computed as 2 Gbit/s * NumberOfvCPUs, but the maximum is now 32 Gbit/s (when the VM is created with min_cpu_platform of skylake or better) and there is a minimum of 10 Gbit/s for VMs with 2 or more vCPUs.
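To make the arithmetic concrete, here is a minimal sketch of the cap as described above. The function name and the `skylake_or_better` flag are hypothetical, and the 16 Gbit/s ceiling used when the platform condition is not met is an assumption carried over from the older figure quoted in Aleksi's answer, not something confirmed here.

```python
def egress_cap_gbps(vcpus: int, skylake_or_better: bool = True) -> float:
    """Rough per-VM egress cap in Gbit/s, following the rule described above.

    Assumptions (not confirmed by any official source in this thread):
    - 2 Gbit/s per vCPU
    - a floor of 10 Gbit/s for VMs with 2 or more vCPUs
    - a ceiling of 32 Gbit/s on Skylake or better, otherwise 16 Gbit/s
    """
    if vcpus < 1:
        raise ValueError("a VM needs at least one vCPU")
    cap = 2.0 * vcpus
    if vcpus >= 2:
        cap = max(cap, 10.0)  # minimum for VMs with 2+ vCPUs
    ceiling = 32.0 if skylake_or_better else 16.0
    return min(cap, ceiling)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32):
        print(n, "vCPUs ->", egress_cap_gbps(n), "Gbit/s cap")
```

Remember that this is only the cap. Actual throughput can be well below it.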
It wasn't clear to me what the endpoints were for your speed test, but one of the (many) limits to the throughput of a TCP connection is:
Throughput <= WindowSize / RoundTripTime
One would expect the GPU instance and the node(s) to be located close to one another, so the round-trip time between them should be small, but that limit may come into play for traffic between the GPU instance or a node and the world.
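As a rough illustration of that bound, the sketch below computes the ceiling for a given window and round-trip time, and the window needed to reach a target rate. The function names and the sample numbers (a 64 KiB window, a 100 ms round trip) are made up for the example and are not measurements from your setup.

```python
def max_tcp_throughput_bps(window_bytes: float, rtt_seconds: float) -> float:
    """Upper bound on one TCP connection: WindowSize / RoundTripTime, in bit/s."""
    return window_bytes * 8 / rtt_seconds


def window_needed_bytes(target_bps: float, rtt_seconds: float) -> float:
    """Window size (bytes) needed so the bound above reaches target_bps."""
    return target_bps * rtt_seconds / 8


if __name__ == "__main__":
    # Hypothetical numbers: 64 KiB window, 100 ms round trip.
    bound = max_tcp_throughput_bps(64 * 1024, 0.100)
    print(f"bound: {bound / 1e6:.2f} Mbit/s")

    # Window required to sustain 1 Gbit/s over the same 100 ms path.
    need = window_needed_bytes(1e9, 0.100)
    print(f"window needed: {need / 1e6:.1f} MB")
```

With those sample numbers a single connection tops out at roughly 5 Mbit/s, and sustaining 1 Gbit/s over a 100 ms path would need a window of about 12.5 MB. A speed test against a distant endpoint can therefore report far less than the VM's cap.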
Beyond that, understanding what was happening for the variable throughput likely calls for packet traces, definitely at the sending side, preferably at the receiving side as well. Just the first 96 bytes of each packet would be sufficient in this sort of case. It would be one of the things a support organization would request.
There's no guarantee really, especially when it comes to traffic to/from networks outside GCP. Here are a few things you can do to maximize bandwidth though:
Increase the number of CPU cores per instance:
caps are dependent on the number of vCPUs that a virtual machine instance has. Each core is subject to a 2 Gbits/second (Gbps) cap for peak performance. Each additional core increases the network cap, up to a theoretical maximum of 16 Gbps for each virtual machine. source
Note that the 2 Gbps per vCPU cap represents a theoretical limit using internal networks:
The cap is a limit that can't be exceeded and doesn't indicate the actual throughput of your egress traffic. There is no guarantee that your traffic will achieve the maximum throughput, which depends on many factors other than the cap. source
In case of traffic between VMs (i.e., cases 1 and 2 in your question) make sure the VMs are located in the same zone and you're using internal IPs:
Any time you transfer data or communicate between VMs, you can achieve max performance by always using the internal IP to communicate. In many cases, the difference in speed can be drastic. source
You can measure the throughput between your VMs with iperf. For advanced use cases you can try fine-tuning the TCP window size in your VMs.
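One way to experiment with the window size from application code, rather than system-wide, is to request larger socket buffers before connecting. This is only a sketch under assumptions: the helper name and the 4 MiB figure are hypothetical, the kernel may clamp or adjust whatever you request, and on Linux setting these options by hand turns off buffer auto-tuning for that socket.

```python
import socket


def make_tuned_socket(buffer_bytes: int) -> socket.socket:
    """Create a TCP socket and request larger send/receive buffers.

    The buffers should be set before connect() so they can influence the
    window negotiated for the connection. The OS is free to clamp the value.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buffer_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buffer_bytes)
    return sock


if __name__ == "__main__":
    sock = make_tuned_socket(4 * 1024 * 1024)  # hypothetical 4 MiB request
    # Read back what the kernel actually granted; it may differ from the request.
    print("SO_SNDBUF:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
    print("SO_RCVBUF:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
    sock.close()
```

Reading the values back shows what you were actually granted, which is worth checking before concluding that a larger window made no difference.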
Finally, one benchmark observed that GCP network throughput is 81x more variable than on AWS. Naturally this just reflects one benchmark, but you might find it worthwhile to test other providers yourself.