GCP - spark on GKE vs Dataproc

1/31/2019

Our organisation has recently moved its infrastructure from AWS to Google Cloud Platform, and I figured Dataproc clusters would be a good solution for running our existing Spark jobs. But when comparing the pricing, I also realised that I could just fire up a Google Kubernetes Engine cluster and install Spark on it to run Spark applications.

Now my question is: how do "running Spark on GKE" and using Dataproc compare? Which would be the better option in terms of autoscaling, pricing, and infrastructure? I've read Google's documentation on GKE and Dataproc, but there isn't enough there to be sure of the advantages and disadvantages of using one over the other.

Any expert opinion will be extremely helpful.

Thanks in advance.

-- user1411837
google-cloud-dataproc
google-cloud-platform
google-kubernetes-engine
pyspark

2 Answers

1/31/2019

Spark on Dataproc is proven and in use at many organizations. It's not fully managed, though: you can automate cluster creation, teardown, job submission, etc. through the GCP API, but it's still another stack you have to manage.
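As a rough sketch of that automation, the whole create / submit / tear-down cycle can be scripted with the `gcloud` CLI (cluster name, region, and job path here are placeholders, not from the question):

```shell
#!/bin/sh
# Spin up a short-lived Dataproc cluster, run a PySpark job, and delete it.
# --max-idle tells Dataproc to auto-delete the cluster after 30 idle minutes,
# which is one way to keep clusters "ephemeral".
gcloud dataproc clusters create my-ephemeral-cluster \
    --region=us-central1 \
    --num-workers=2 \
    --max-idle=30m

# Submit an existing PySpark job stored in a GCS bucket (placeholder path).
gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/my_job.py \
    --cluster=my-ephemeral-cluster \
    --region=us-central1

# Explicit teardown if you don't want to wait for the idle timeout.
gcloud dataproc clusters delete my-ephemeral-cluster \
    --region=us-central1 --quiet
```

The same calls are available programmatically via the Dataproc REST API or client libraries if you'd rather not shell out to `gcloud`.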

Spark on GKE is newer: Spark started adding Kubernetes support in version 2.4, and Google updated Kubernetes for the preview a couple of days ago, Link
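For context, the Spark 2.4 Kubernetes support works by pointing `spark-submit` at the Kubernetes API server, which then launches driver and executor pods. A minimal sketch (the cluster endpoint and container image are placeholders you would substitute with your GKE cluster's values):

```shell
#!/bin/sh
# Run the bundled SparkPi example on Kubernetes (Spark 2.4 native support).
# k8s:// prefix tells spark-submit to schedule driver/executors as pods.
spark-submit \
    --master k8s://https://<gke-apiserver-endpoint>:443 \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=3 \
    --conf spark.kubernetes.container.image=<your-spark-docker-image> \
    local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
```

The `local://` scheme means the jar is already inside the container image, so you have to build and push a Spark image (Spark ships a `bin/docker-image-tool.sh` helper for this) before submitting.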

If I had to run jobs in a production environment today, I would just go with Dataproc. Otherwise you could experiment with Docker yourself and see how it fares, but I think Spark on Kubernetes needs a little more time to become stable. From a purely cost perspective, the Docker route would be cheaper, as you can share resources with your other services.

-- skjagini
Source: StackOverflow

3/14/2019

Adding my two cents to the above answer.

  • I would favor Dataproc because it's managed and supports Spark out of the box. No hassles. More importantly, it's cost-optimized: you may not need clusters running all the time, and Dataproc lets you use ephemeral clusters.
  • With GKE, I would need to explicitly delete the cluster and recreate it when necessary, which takes additional care.
  • I could not find any direct GCP service offering for data lineage. In that case, I would probably use Apache Atlas with the Spark-Atlas-Connector on a Spark installation I manage myself, and then running Spark on GKE, with all the control in my hands, would be a compelling choice.
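To illustrate the self-managed lineage setup mentioned above: the Spark-Atlas-Connector hooks into Spark through listener configuration passed at submit time. This is a sketch based on the connector's documented listener classes; the jar path and Atlas endpoint are assumptions, not from the answer:

```shell
#!/bin/sh
# Submit a Spark job with the Spark-Atlas-Connector registered as listeners,
# so SQL query plans and streaming queries are reported to Apache Atlas.
# Paths and the atlas-application.properties location are placeholders.
spark-submit \
    --jars /opt/spark-atlas-connector/spark-atlas-connector-assembly.jar \
    --files /etc/atlas/atlas-application.properties \
    --conf spark.extraListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
    --conf spark.sql.queryExecutionListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
    my_job.py
```

With Dataproc you would have to install and wire this up yourself anyway, so the choice between GKE and Dataproc matters less for lineage than the fact that neither provides it as a managed feature.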
-- Raghavendra Prakash
Source: StackOverflow