Kubernetes vs. Spark vs. Spark on Kubernetes

1/17/2020

So I have a use case where I will stream about 1,000 records per minute from Kafka. I just need to dump these records in raw form into a NoSQL DB, or something like a data lake for that matter. I ran this through two approaches.

Approach 1: Create Kafka consumers in Java and run them as three different containers in Kubernetes. Since all the containers are in the same Kafka consumer group, they all contribute to reading from the same Kafka topic and dumping the data into the data lake. This works pretty quickly for the volume of workload I have.
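
A minimal sketch of what each such consumer could look like, using the plain Kafka Java client; the broker address, topic name, group id, and the writeToDataLake sink call are simplified placeholders:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class RawDumpConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka:9092");   // placeholder broker address
            props.put("group.id", "raw-dump");              // same group id in all three containers
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("events"));      // placeholder topic name
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        writeToDataLake(record.value());    // hypothetical sink call
                    }
                }
            }
        }

        private static void writeToDataLake(String payload) {
            // Stub: in practice this appends the raw record to the data lake / NoSQL DB.
        }
    }

Because all three containers share the group id, Kafka splits the topic's partitions among them automatically.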

Approach 2: I then created a Spark cluster and ran the same Java logic to read from Kafka and dump the data into the data lake.
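
In Spark terms, the same raw dump can be expressed as a Structured Streaming job along these lines; this is a sketch, assuming the spark-sql-kafka connector is on the classpath, and the broker, topic, and output paths are placeholders:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class RawDumpSparkJob {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                    .appName("raw-dump")
                    .getOrCreate();

            // Read the topic as a streaming DataFrame.
            Dataset<Row> records = spark.readStream()
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "kafka:9092")  // placeholder broker
                    .option("subscribe", "events")                    // placeholder topic
                    .load();

            // Dump the raw values to the data lake as files; paths are placeholders.
            records.selectExpr("CAST(value AS STRING) AS value")
                    .writeStream()
                    .format("parquet")
                    .option("path", "s3a://lake/raw/")
                    .option("checkpointLocation", "s3a://lake/checkpoints/raw/")
                    .start()
                    .awaitTermination();
        }
    }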

Observations: The performance of the Kubernetes approach was, if not better, at least equal to that of a Spark job running in cluster mode.

So my question is: what is the real use case for using Spark over Kubernetes the way I am using it, or even Spark on Kubernetes? Is Spark only going to rise and shine on much heavier workloads, say something on the order of 50,000 records per minute, or in cases where some real-time processing needs to be done on the data before dumping it to the sink? Spark has more cost associated with it, so I need to make sure I use it only if it would scale better than the Kubernetes solution.

-- Ray S
apache-spark
kubernetes

2 Answers

1/17/2020

Running Spark inside Kubernetes is only recommended when you have a lot of expertise doing it: Kubernetes doesn't know it's hosting Spark, and Spark doesn't know it's running inside Kubernetes, so you will need to double-check every feature you decide to run.

For your workload, I'd recommend sticking with Kubernetes. The elasticity, performance, monitoring tools, and scheduling features, plus the huge community support, add up well in the long run.

Spark is an open-source, scalable, massively parallel, in-memory execution engine for analytics applications, so it will really shine when your load becomes more processing-demanding. It simply doesn't have much room to rise and shine if you are only dumping data, so keep it simple.

-- willrof
Source: StackOverflow

1/17/2020

If your case is only to archive/snapshot/dump records, I would recommend you look into Kafka Connect.
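
With an off-the-shelf sink connector you write only configuration, no consumer code. As one possible sketch, assuming the Confluent S3 sink connector (the bucket, region, and topic names are placeholders):

    name=raw-dump-sink
    connector.class=io.confluent.connect.s3.S3SinkConnector
    tasks.max=3
    topics=events
    s3.bucket.name=my-data-lake
    s3.region=us-east-1
    storage.class=io.confluent.connect.s3.storage.S3Storage
    format.class=io.confluent.connect.s3.format.json.JsonFormat
    flush.size=1000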

If you need to process the records you stream, e.g. aggregate or join streams, then Spark comes into play. For this case you may also look into Kafka Streams.
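
As an illustration of that kind of processing, here is a minimal Kafka Streams sketch that counts records per key; the application id, broker, and topic names are placeholders:

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.Produced;

    import java.util.Properties;

    public class CountsPerKeyApp {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "counts-per-key");  // placeholder app id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");   // placeholder broker
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> events = builder.stream("events");         // placeholder topic

            // Aggregate: count records per key and publish to an output topic.
            events.groupByKey()
                  .count()
                  .toStream()
                  .to("event-counts", Produced.with(Serdes.String(), Serdes.Long()));

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }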

Each of these frameworks has its own trade-offs and performance overheads, but in any case you save a lot of development effort by using tools made for the job rather than developing your own consumers. These frameworks also already handle most failure scenarios, scaling, and configurable delivery semantics, and they have enough config options to tune the behaviour for most cases you can imagine. Just choose the available integration and you're good to go! And of course, beware of open-source bugs ;).

Hope it helps.

-- Aliaksandr Sasnouskikh
Source: StackOverflow