Docker application support in Hadoop YARN

8/25/2015

I need to process a large (terabytes) set of data (mainly images). I was thinking of using Hadoop YARN with HDFS to process this data. The idea is to ingest all the data into HDFS and then submit a Hadoop job to process it. YARN will deploy the processing application close to the data and process it there. This works fine if my processing application is a "jar" file. If my image processing application is a Docker image instead, is it possible to submit a job to YARN so that the submitted application is a Docker image (and not a jar file)? YARN would have to deploy the application (Docker image) on the data nodes to start processing.

I checked the Docker Container Executor, but it launches YARN containers inside Docker containers, and the application (job) is still a jar file, as seen in the above link.

Google Kubernetes seems to fit my need (deploying and managing Docker images in a cluster), but it does not provide "HDFS-like" storage (hence "move the app to the data rather than the data to the app" doesn't apply).

Please let me know if there is any cluster manager framework that can deploy standard application packages (jar, rpm, Docker containers, etc.) in a cluster so they can access a shared/distributed data store.

Thanks in advance.

-- dealbitte
cluster-computing
docker
hadoop
kubernetes
yarn

1 Answer

8/25/2015

The current Docker executor in YARN is not very good because, as far as I know, you need to replace the whole container executor, and at least at the time it was introduced this was a cluster-wide setting.
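To illustrate what "cluster-wide" means here: in Hadoop 2.6, the (since deprecated) DockerContainerExecutor was enabled by swapping the executor class for the whole NodeManager in yarn-site.xml, affecting every job on the cluster. A rough sketch (the docker binary path is a placeholder for your installation):

```xml
<!-- yarn-site.xml: replaces the container executor for the entire NodeManager,
     so all containers on this node run through Docker -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.DockerContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.docker-container-executor.exec-name</name>
  <value>/usr/bin/docker</value>
</property>
```

Because this replaces the executor globally rather than per application, it doesn't directly give you what you asked for (submitting one Docker image as a job alongside ordinary jar jobs).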

Hortonworks is doing some work around Docker: http://hortonworks.com/blog/docker-kubernetes-apache-hadoop-yarn/. You didn't mention that blog post, so I'm posting it here.

-- Janne Valkealahti
Source: StackOverflow