I have a Kubernetes service that collects models. The system that builds these models is a Python Dataproc job.
-> I need a way to push the result of the Dataproc job to the model collection service.
Question: How do I access the service in the Kubernetes cluster from within Dataproc? What are my options?
Do I need an ingress controller? Is it possible to access the Kubernetes network (including DNS) from within Dataproc? Would it be an option to do it through gcloud (forwarding), although that seems inelegant from within Python?
Dataproc and GKE nodes are all GCE VMs; by default they can reach services in the same VPC network through internal IP, internal DNS, or hostname. Unfortunately, Pods run in another layer of virtual network on top of the VPC network, which is not directly accessible from VMs.
There are several options for making GKE services accessible to Dataproc nodes (and GCE VMs in general). I would suggest putting the GKE cluster and the Dataproc cluster in the same VPC network, then creating a Kubernetes Service of type LoadBalancer with the annotation cloud.google.com/load-balancer-type: "Internal" in front of your Pods. VMs in the same VPC can then access the service through its internal IP. See this doc for more details.
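As a rough sketch, such a Service manifest could look like the one below; the name model-collector, the app label, and the ports are placeholders you would adapt to your deployment:

    apiVersion: v1
    kind: Service
    metadata:
      name: model-collector                 # placeholder name
      annotations:
        # request an internal (VPC-only) load balancer instead of a public one
        cloud.google.com/load-balancer-type: "Internal"
    spec:
      type: LoadBalancer
      selector:
        app: model-collector                # must match the labels on your Pods
      ports:
      - port: 80                            # port exposed on the internal IP
        targetPort: 8080                    # port your container listens on (assumption)

Once the internal load balancer is provisioned, kubectl get service model-collector shows its internal IP in the EXTERNAL-IP column, and your Python Dataproc job can push the model to http://<internal-ip>/ with any HTTP client (e.g. requests).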