Using AWS EMRFS in Apache Spark hosted on EC2

12/24/2018

If I am running Spark on EC2 (or in Kubernetes), can I use S3/EMRFS in place of HDFS? Is this production-ready, and does it read and process data from S3 in parallel?

Thanks in advance

-- Pragmatic
amazon-emr
amazon-s3
aws-eks
hdfs
kubernetes

2 Answers

12/26/2018

EMR uses a closed-source S3 connector, "EMRFS", with proprietary features. You don't get to see the source, can't get support from anyone else, and can't use it except when you run on EMR. For independent apps, the s3a connector is great, but it is not a full replacement for HDFS.
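
A minimal sketch of what pointing a standalone Spark app at s3a might look like, assuming PySpark and a `hadoop-aws` artifact whose version matches the Hadoop build your Spark distribution uses. The helper names, the version string, and the bucket path are hypothetical; credentials are assumed to come from the environment (for example an EC2 instance role).

```python
def s3a_spark_conf(hadoop_aws_version, endpoint=None, max_connections=100):
    """Return Spark properties for reading s3a:// paths outside EMR.

    hadoop_aws_version is an assumption you must supply: it has to match
    the Hadoop version bundled with your Spark distribution.
    """
    conf = {
        # Pulls the S3A connector (hadoop-aws) onto the classpath.
        "spark.jars.packages": f"org.apache.hadoop:hadoop-aws:{hadoop_aws_version}",
        # Upper bound on the S3A HTTP connection pool.
        "spark.hadoop.fs.s3a.connection.maximum": str(max_connections),
    }
    if endpoint:
        # Optional: pin a specific S3 endpoint.
        conf["spark.hadoop.fs.s3a.endpoint"] = endpoint
    return conf


def build_session(conf, app_name="s3a-demo"):
    """Create a SparkSession with the given properties applied."""
    # Imported lazily so s3a_spark_conf works without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Hypothetical usage (bucket name and version are placeholders):
#   spark = build_session(s3a_spark_conf("3.3.4"))
#   df = spark.read.parquet("s3a://example-bucket/data/")
```

Note the `s3a://` scheme in the path; `s3://` paths are what EMRFS handles on EMR.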

-- Steve Loughran
Source: StackOverflow

12/24/2018

No, EMRFS is for EMR only; it is the easy way to make S3 look like part of HDFS. On EC2 you can connect to S3, but that is less easy than with EMR. S3 is not tightly coupled to EC2. Yes, parallelism is applied, but not according to MapReduce-style data locality, that is, workers are not co-located with data nodes.
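
As a rough illustration of why reads from S3 can still run in parallel without locality: Spark divides large splittable files into byte ranges, and each range can be read by a separate task. The sketch below only does that arithmetic. The function name is hypothetical, and the 128 MB split size is an assumption (it matches the documented default of `spark.sql.files.maxPartitionBytes`); real Spark also packs small files together and takes core count into account, so actual task counts will differ.

```python
import math

DEFAULT_SPLIT_BYTES = 128 * 1024 * 1024  # assumed split size, 128 MB


def estimate_read_tasks(object_sizes_bytes, split_bytes=DEFAULT_SPLIT_BYTES):
    """Rough count of byte-range splits for a set of splittable S3 objects.

    Simplification: every object is cut independently into chunks of at
    most split_bytes; no packing of small files is modelled.
    """
    return sum(math.ceil(size / split_bytes) for size in object_sizes_bytes)
```

Under these assumptions, a single 1 GiB object would be read as 8 ranges, each fetched over the network by whichever executor picks up the task.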

-- thebluephantom
Source: StackOverflow