We are having a very strange behavior with unacceptable big latency for communication within a kubernetes cluster (GKE). The latency is jumping between 600ms and 1s for a endpoint that has a Memorystore get/store action and a CloudSQL query. The same setup running locally in dev enivornment (although without k8s) is not showing this kind of latency.
About our architecture: We are running a k8s cluster on GKE using terraform and service / deployment (yaml) files for the creation (I added those below). We're running two node APIs (koa.js 2.5). One API is exposed with an ingress to the public and connects via a nodeport to the API pod.
The other API pod is private reachable through an internal loadbalancer from google. This API is connected to all the resource we need (CloudSQL, Cloud Storage).
Both APIs are also connected to a Memorystore (Redis).
The communication between those pods is secured with self-signed server/client certificates (which isn't the problem, we already removed it temporarily to test).
We checked the logs and saw that the request from the public API to the private one is taking about 200ms only to reach it. Also the response to the public API from the private one took about 600ms (messured from the point when the whole business logic of the private API went throw until we received that response back at the pubilc API)
We're really out of things to try... We already connected all the Google Cloud resources to our local environment which didn't show that kind of bad latency.
In a complete local setup the latency is only about 1/5 to 1/10 of what we see in the cloud setup. We also tried to ping the private POD from the public one which was in the 0.100ms area.
Do you have any ideas where we can further investigate ? Here is the terraform script about our Google Cloud setup
// Configure the Google Cloud provider
provider "google" {
project = "${var.project}"
region = "${var.region}"
}
data "google_compute_zones" "available" {}
# Ensuring relevant service APIs are enabled in your project. Alternatively visit and enable the needed services
resource "google_project_service" "serviceapi" {
service = "serviceusage.googleapis.com"
disable_on_destroy = false
}
resource "google_project_service" "sqlapi" {
service = "sqladmin.googleapis.com"
disable_on_destroy = false
depends_on = ["google_project_service.serviceapi"]
}
resource "google_project_service" "redisapi" {
service = "redis.googleapis.com"
disable_on_destroy = false
depends_on = ["google_project_service.serviceapi"]
}
# Create a VPC and a subnetwork in our region
resource "google_compute_network" "appnetwork" {
name = "${var.environment}-vpn"
auto_create_subnetworks = "false"
}
resource "google_compute_subnetwork" "network-with-private-secondary-ip-ranges" {
name = "${var.environment}-vpn-subnet"
ip_cidr_range = "10.2.0.0/16"
region = "europe-west1"
network = "${google_compute_network.appnetwork.self_link}"
secondary_ip_range {
range_name = "kubernetes-secondary-range-pods"
ip_cidr_range = "10.60.0.0/16"
}
secondary_ip_range {
range_name = "kubernetes-secondary-range-services"
ip_cidr_range = "10.70.0.0/16"
}
}
# GKE cluster setup
resource "google_container_cluster" "primary" {
name = "${var.environment}-cluster"
zone = "${data.google_compute_zones.available.names[1]}"
initial_node_count = 1
description = "Kubernetes Cluster"
network = "${google_compute_network.appnetwork.self_link}"
subnetwork = "${google_compute_subnetwork.network-with-private-secondary-ip-ranges.self_link}"
depends_on = ["google_project_service.serviceapi"]
additional_zones = [
"${data.google_compute_zones.available.names[0]}",
"${data.google_compute_zones.available.names[2]}",
]
master_auth {
username = "xxxxxxx"
password = "xxxxxxx"
}
ip_allocation_policy {
cluster_secondary_range_name = "kubernetes-secondary-range-pods"
services_secondary_range_name = "kubernetes-secondary-range-services"
}
node_config {
oauth_scopes = [
"https://www.googleapis.com/auth/compute",
"https://www.googleapis.com/auth/devstorage.read_only",
"https://www.googleapis.com/auth/logging.write",
"https://www.googleapis.com/auth/monitoring",
"https://www.googleapis.com/auth/trace.append"
]
tags = ["kubernetes", "${var.environment}"]
}
}
##################
# MySQL DATABASES
##################
resource "google_sql_database_instance" "core" {
name = "${var.environment}-sql-core"
database_version = "MYSQL_5_7"
region = "${var.region}"
depends_on = ["google_project_service.sqlapi"]
settings {
# Second-generation instance tiers are based on the machine
# type. See argument reference below.
tier = "db-n1-standard-1"
}
}
resource "google_sql_database_instance" "tenant1" {
name = "${var.environment}-sql-tenant1"
database_version = "MYSQL_5_7"
region = "${var.region}"
depends_on = ["google_project_service.sqlapi"]
settings {
# Second-generation instance tiers are based on the machine
# type. See argument reference below.
tier = "db-n1-standard-1"
}
}
resource "google_sql_database_instance" "tenant2" {
name = "${var.environment}-sql-tenant2"
database_version = "MYSQL_5_7"
region = "${var.region}"
depends_on = ["google_project_service.sqlapi"]
settings {
# Second-generation instance tiers are based on the machine
# type. See argument reference below.
tier = "db-n1-standard-1"
}
}
resource "google_sql_database" "core" {
name = "project_core"
instance = "${google_sql_database_instance.core.name}"
}
resource "google_sql_database" "tenant1" {
name = "project_tenant_1"
instance = "${google_sql_database_instance.tenant1.name}"
}
resource "google_sql_database" "tenant2" {
name = "project_tenant_2"
instance = "${google_sql_database_instance.tenant2.name}"
}
##################
# MySQL USERS
##################
resource "google_sql_user" "core-user" {
name = "${var.sqluser}"
instance = "${google_sql_database_instance.core.name}"
host = "cloudsqlproxy~%"
password = "${var.sqlpassword}"
}
resource "google_sql_user" "tenant1-user" {
name = "${var.sqluser}"
instance = "${google_sql_database_instance.tenant1.name}"
host = "cloudsqlproxy~%"
password = "${var.sqlpassword}"
}
resource "google_sql_user" "tenant2-user" {
name = "${var.sqluser}"
instance = "${google_sql_database_instance.tenant2.name}"
host = "cloudsqlproxy~%"
password = "${var.sqlpassword}"
}
##################
# REDIS
##################
resource "google_redis_instance" "redis" {
name = "${var.environment}-redis"
tier = "BASIC"
memory_size_gb = 1
depends_on = ["google_project_service.redisapi"]
authorized_network = "${google_compute_network.appnetwork.self_link}"
region = "${var.region}"
location_id = "${data.google_compute_zones.available.names[0]}"
redis_version = "REDIS_3_2"
display_name = "Redis Instance"
}
# The following outputs allow authentication and connectivity to the GKE Cluster.
output "client_certificate" {
value = "${google_container_cluster.primary.master_auth.0.client_certificate}"
}
output "client_key" {
value = "${google_container_cluster.primary.master_auth.0.client_key}"
}
output "cluster_ca_certificate" {
value = "${google_container_cluster.primary.master_auth.0.cluster_ca_certificate}"
}
The service and deployment of the private API
# START CRUD POD
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: crud-pod
labels:
app: crud
spec:
template:
metadata:
labels:
app: crud
spec:
containers:
- name: crud
image: eu.gcr.io/dev-xxxxx/crud:latest-unstable
ports:
- containerPort: 3333
env:
- name: NODE_ENV
value: develop
volumeMounts:
- [..MountedConfigFiles..]
# [START proxy_container]
- name: cloudsql-proxy
image: gcr.io/cloudsql-docker/gce-proxy:1.11
command: ["/cloud_sql_proxy",
"-instances=dev-xxxx:europe-west1:dev-sql-core=tcp:3306,dev-xxxx:europe-west1:dev-sql-tenant1=tcp:3307,dev-xxxx:europe-west1:dev-sql-tenant2=tcp:3308",
"-credential_file=xxxx"]
volumeMounts:
- name: cloudsql-instance-credentials
mountPath: /secrets/cloudsql
readOnly: true
# [END proxy_container]
# [START volumes]
volumes:
- name: cloudsql-instance-credentials
secret:
secretName: cloudsql-instance-credentials
- [..ConfigFilesVolumes..]
# [END volumes]
# END CRUD POD
-------
# START CRUD SERVICE
apiVersion: v1
kind: Service
metadata:
name: crud
annotations:
cloud.google.com/load-balancer-type: "Internal"
spec:
type: LoadBalancer
loadBalancerSourceRanges:
- 10.60.0.0/16
ports:
- name: crud-port
port: 3333
protocol: TCP # default; can also specify UDP
selector:
app: crud # label selector for Pods to target
# END CRUD SERVICE
And the public one (including ingress)
# START SAPI POD
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: sapi-pod
labels:
app: sapi
spec:
template:
metadata:
labels:
app: sapi
spec:
containers:
- name: sapi
image: eu.gcr.io/dev-xxx/sapi:latest-unstable
ports:
- containerPort: 8080
env:
- name: NODE_ENV
value: develop
volumeMounts:
- [..MountedConfigFiles..]
volumes:
- [..ConfigFilesVolumes..]
# END SAPI POD
-------------
# START SAPI SERVICE
kind: Service
apiVersion: v1
metadata:
name: sapi # Service name
spec:
selector:
app: sapi
ports:
- port: 8080
targetPort: 8080
type: NodePort
# END SAPI SERVICE
--------------
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
name: dev-ingress
annotations:
kubernetes.io/ingress.global-static-ip-name: api-dev-static-ip
labels:
app: sapi-ingress
spec:
backend:
serviceName: sapi
servicePort: 8080
tls:
- hosts:
- xxxxx
secretName: xxxxx
We fixed the issue by removing the @google-cloud/logging-winston from our logTransport. For some reason it blocked our traffic so that we got such bad latency.