In a cloud environment, what's a good way, using which technologies, to have a client upload a file to be processed?

7/21/2018

In a cloud environment that provides client companies with REST API services to store and update their customers' information (phone numbers, etc.), I'm looking for a way for a newly joining client company to pass a file (or a set of files) containing all their customers. The files may contain millions of customer records.

Suppose the idea is that the files may be uploaded to a certain folder, and once a file is detected, an import process starts. Suppose also that there exists a service in the cloud that can create a customer from a request containing the details. Suppose each file is limited to something like 1GB.

I've heard that Yarn or Kubernetes may be used, but I can't really see how they would be used, or what the advantage of using them is.

This import process could be done in pure Java as follows: folder-watching code in Java can easily detect the new file in the folder and invoke a process that reads the records of the file(s) and, for each record, creates a request object and calls the service that creates the customer.
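To make the pure-Java idea concrete, here is a minimal sketch using the JDK's `java.nio.file.WatchService`. It assumes one customer record per line, and the `createCustomer` callback is a hypothetical stand-in for the call to the existing customer-creation service; the class and method names are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.function.Consumer;

final class FolderImport {

    /**
     * Streams the file line by line (so a 1GB file is never held in memory)
     * and passes each non-blank line to createCustomer.
     * Assumption: one customer record per line.
     * Returns the number of records handed to the callback.
     */
    static long importFile(Path file, Consumer<String> createCustomer) throws IOException {
        long count = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                createCustomer.accept(line); // hypothetical: build a request and call the service
                count++;
            }
        }
        return count;
    }

    /**
     * Blocks and imports every file that appears in the folder.
     * Caveat: ENTRY_CREATE can fire before the upload has finished, so a real
     * version would need a convention such as "upload under a temporary name,
     * then rename" before reading the file.
     */
    static void watch(Path folder, Consumer<String> createCustomer)
            throws IOException, InterruptedException {
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            folder.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            while (true) {
                WatchKey key = watcher.take(); // wait for the next batch of events
                for (WatchEvent<?> event : key.pollEvents()) {
                    if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
                        continue; // events were lost; a real version would rescan the folder
                    }
                    Path created = folder.resolve((Path) event.context());
                    importFile(created, createCustomer);
                }
                if (!key.reset()) {
                    break; // folder is no longer accessible
                }
            }
        }
    }
}
```

Calling `FolderImport.watch(folder, record -> ...)` with the upload folder and a callback that posts to the customer service would run the whole import in a single process on a single machine.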

So what's the advantage of using Yarn or Kubernetes, over pure Java, in doing a task like this? And are there other alternative technologies that can be used for this task?

-- inor
cloud
java
kubernetes
rest
yarn

1 Answer

7/21/2018

In a cloud environment, you want your Java service to be "highly available" and, when dealing with "millions of customer records" per client, even "secure." This is where Kubernetes and Yarn come in.

If you are running one VM, with a Java process saving sensitive customer data unencrypted to the local file system, what happens when:

  • the VM is compromised by an attacker. All data is compromised.
  • the Java process crashes. New customers can't be onboarded.
  • the VM crashes. New customers can't be onboarded, and onboarding work in progress is lost.
  • the process that does the importing of customer data crashes.

You get the idea: there are countless failure and compromise scenarios.

Kubernetes and Yarn, in different ways, support architectural patterns that allow you to run multiple Java upload and import processes across a collection of VMs so that there can be sensible handling for the various failure cases, and sensible custodial machinery for the sensitive aspects of this process, at scale, with live data.
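As one illustration of what "sensible handling" of a crash could look like at the application level, here is a rough sketch of a checkpointed import that a restarted or replacement process can resume. This is an assumption about one possible pattern, not something Kubernetes or Yarn provide by themselves: the orchestrator would restart the worker, and code like this would let it continue where the previous one stopped. The names `ResumableImport`, `resume`, and `createCustomer` are hypothetical, and the checkpoint file is assumed to live on storage that survives the loss of the VM.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

final class ResumableImport {

    /** Returns the number of lines already imported, or 0 if there is no checkpoint yet. */
    static long readCheckpoint(Path checkpoint) throws IOException {
        if (!Files.exists(checkpoint)) {
            return 0L;
        }
        String text = new String(Files.readAllBytes(checkpoint), StandardCharsets.UTF_8).trim();
        return text.isEmpty() ? 0L : Long.parseLong(text);
    }

    static void writeCheckpoint(Path checkpoint, long lineNo) throws IOException {
        Files.write(checkpoint, Long.toString(lineNo).getBytes(StandardCharsets.UTF_8));
    }

    /**
     * Imports the file, skipping lines that an earlier run already recorded
     * as done, and saves progress every batchSize lines.
     * Records after the last checkpoint may be sent again after a crash
     * (at-least-once delivery), so the customer service is assumed to
     * tolerate duplicate requests.
     * Returns the number of records processed by this run.
     */
    static long resume(Path file, Path checkpoint, int batchSize,
                       Consumer<String> createCustomer) throws IOException {
        long done = readCheckpoint(checkpoint);
        long lineNo = 0;
        long processed = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                if (lineNo <= done) {
                    continue; // handled by an earlier run
                }
                createCustomer.accept(line); // hypothetical call to the customer service
                processed++;
                if (lineNo % batchSize == 0) {
                    writeCheckpoint(checkpoint, lineNo);
                }
            }
        }
        writeCheckpoint(checkpoint, lineNo); // whole file is done
        return processed;
    }
}
```

With something like this in place, several worker processes spread across VMs could each take a different file, and a worker that dies only costs the records since its last checkpoint.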

-- Jonah Benton
Source: StackOverflow