If I am running Spark on EC2 (or in Kubernetes), can I use S3/EMRFS in place of HDFS? Is this production-ready, and does it use parallelism to read and process data from S3?
Thanks in advance
EMR uses "EMRFS", a closed-source S3 connector with proprietary features. You don't get to see the source, can't get support from anyone else, and don't get to use it except when you run on EMR. For independent apps, the s3a connector is great, but it is not a full replacement for HDFS.
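For what it's worth, here is a minimal sketch of how a standalone PySpark app (on EC2 or Kubernetes) might be pointed at the s3a connector. The helper names, the bucket path in the comment, the hadoop-aws version, and the connection pool size are my own assumptions for illustration; the hadoop-aws version must match the Hadoop build your Spark distribution ships with, and the credentials provider class shown is the AWS SDK v1 one used by the 3.3.x line.

```python
def s3a_spark_conf(hadoop_aws_version="3.3.4"):
    """Return Spark settings for using s3a:// paths outside EMR.

    The version default is an assumption; match it to your Hadoop build.
    """
    return {
        # Pulls in the s3a connector and its AWS SDK dependency at startup.
        "spark.jars.packages": f"org.apache.hadoop:hadoop-aws:{hadoop_aws_version}",
        # Resolves credentials from env vars, instance profile, etc.
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
        # Illustrative value: a larger HTTP pool helps many concurrent tasks.
        "spark.hadoop.fs.s3a.connection.maximum": "100",
    }


def build_session(app_name="s3a-demo"):
    """Create a SparkSession with the s3a settings applied (needs pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    builder = SparkSession.builder.appName(app_name)
    for key, value in s3a_spark_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def count_rows(spark, path):
    """Read a Parquet dataset from a path such as s3a://some-bucket/events/."""
    return spark.read.parquet(path).count()
```

Calling `count_rows(build_session(), "s3a://some-bucket/events/")` would then read straight from S3 with no HDFS involved; the bucket name is hypothetical.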
No, EMRFS is for EMR only; it is the easy way to make S3 look like part of HDFS. On EC2 you can still connect to S3, but it is less easy than with EMR, because S3 is not tightly coupled to EC2. Yes, parallelism is applied, but not according to MapReduce-style data locality, that is, tasks are not scheduled so that the worker and the data node are the same machine.
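To illustrate the parallelism point, here is a simplified sketch of how a large object can be cut into byte ranges that tasks read independently. This is not Spark's actual planning code (which also weighs file-open cost and default parallelism, and only splits splittable formats); the function name is hypothetical, and the 128 MiB figure is the documented default of `spark.sql.files.maxPartitionBytes`.

```python
# Default of spark.sql.files.maxPartitionBytes (128 MiB).
DEFAULT_MAX_PARTITION_BYTES = 128 * 1024 * 1024


def plan_splits(object_size, max_partition_bytes=DEFAULT_MAX_PARTITION_BYTES):
    """Return (offset, length) byte ranges covering one object.

    Simplified: each range can go to any executor, since S3 offers no
    notion of a "local" replica the way an HDFS data node does.
    """
    splits = []
    offset = 0
    while offset < object_size:
        length = min(max_partition_bytes, object_size - offset)
        splits.append((offset, length))
        offset += length
    return splits
```

Under these assumptions a 1 GiB object yields eight ranges, so up to eight tasks can read it at once, each over its own connection to S3 rather than from a local disk.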