Connect to a DB hosted within a Kubernetes Engine cluster from a PySpark Dataproc job

7/25/2018

I am a new Dataproc user and I am trying to run a PySpark job that uses the MongoDB connector to retrieve data from a MongoDB replica set hosted within a Google Kubernetes Engine cluster.

Is there a way to achieve this, given that my replica set is not supposed to be accessible from outside the cluster without using a port-forward or something similar?
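
For reference, the read side of the job looks roughly like this (a minimal sketch; the host name, database/collection names, and connector version are placeholders, and the in-cluster DNS name is exactly what Dataproc cannot reach):

    from pyspark.sql import SparkSession

    # Placeholder URI: "mongodb-0.mongodb.default.svc.cluster.local" is the kind of
    # in-cluster DNS name the replica set advertises, which only resolves inside GKE.
    uri = "mongodb://mongodb-0.mongodb.default.svc.cluster.local:27017/mydb.mycollection?replicaSet=rs0"

    spark = (
        SparkSession.builder
        .appName("mongo-read")
        .config("spark.mongodb.input.uri", uri)
        .getOrCreate()
    )

    # Requires the connector on the classpath, e.g. a job submitted with
    # --properties spark.jars.packages=org.mongodb.spark:mongo-spark-connector_2.11:2.3.0
    df = spark.read.format("mongo").load()
    df.show()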

-- Cedric Morent
apache-spark
google-cloud-dataproc
google-kubernetes-engine
mongodb
pyspark

2 Answers

8/13/2018

Just expose your MongoDB service in GKE and you should be able to reach it from within the same VPC network.

Take a look at this post for reference.

You should also be able to automate the service exposure through an init script.
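
For example, here is a minimal sketch using the Kubernetes Python client to expose the database through an internal load balancer (the service name, namespace, port, and the app: mongodb pod selector are assumptions; a replica set typically needs one such service per member):

    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() when run inside the cluster

    # An internal LoadBalancer gets a private IP on the VPC instead of a public one.
    service = client.V1Service(
        metadata=client.V1ObjectMeta(
            name="mongodb-internal",  # assumed service name
            annotations={"cloud.google.com/load-balancer-type": "Internal"},
        ),
        spec=client.V1ServiceSpec(
            type="LoadBalancer",
            selector={"app": "mongodb"},  # assumed pod label
            ports=[client.V1ServicePort(port=27017, target_port=27017)],
        ),
    )

    client.CoreV1Api().create_namespaced_service(namespace="default", body=service)

Once the service has its private IP, point the Spark connector's MongoDB URI at that address.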

-- Notauser
Source: StackOverflow

7/26/2018

In this case I assume that by "outside" you mean the internet or networks other than your GKE cluster's. If you deploy your Dataproc cluster on the same network as your GKE cluster and expose the MongoDB service to that internal network, you should be able to connect to the database from your Dataproc job without exposing it outside the network.

You can find more information in this link on how to create a Cloud Dataproc cluster with internal IP addresses.
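
As a rough sketch with the google-cloud-dataproc Python client (the project ID, region, cluster name, and subnetwork are placeholders), an internal-IP-only cluster on the shared network could be created like this:

    from google.cloud import dataproc_v1

    region = "us-central1"  # assumed region
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    cluster = {
        "project_id": "my-project",  # assumed project
        "cluster_name": "pyspark-mongo",
        "config": {
            "gce_cluster_config": {
                # Same VPC subnetwork as the GKE cluster; no external IPs.
                "subnetwork_uri": "default",
                "internal_ip_only": True,
            }
        },
    }

    operation = client.create_cluster(
        request={"project_id": "my-project", "region": region, "cluster": cluster}
    )
    operation.result()  # blocks until the cluster is up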

-- Milad
Source: StackOverflow