I have a Neo4j service, but before the deployment starts up, I need to pre-fill it with data (about 2GB). Currently, I have a Kubernetes Job that transforms the data from a CSV and formats it for the database using the neo4j-admin tool. It saves the formatted data to a persistent volume. After waiting for the job to complete, I mount the volume in the Neo4j container, and the container is effectively read-only on this data for the rest of its life.
Is there a better way to do this more automatically?
I don't want to have to wait for the job to complete to run another command to create the Neo4j deployment. I looked into initContainers, but that isn't suitable because I don't want to redo the data filling when a pod is re-created. I just want subsequent pods to read from the same persistent volume. Is there a way to wait for the job to complete first?
I assume your Neo4j application data won't be updated by the Neo4j deployment, since you said the deployment treats the volume as read-only.
If that is the case, why do you want Kubernetes to do the data loading? Use object storage such as S3 or Azure Data Lake, and have a data pipeline keep that storage up to date. Many tools provide pipeline features, for example Oozie and Airflow.
Your deployment can then refer to the object storage through a PersistentVolumeClaim, provided your cluster has a storage driver that exposes the object store as a volume.
Jobs can't natively create new objects once they finish (and because the container exits on its own, a preStop hook won't run to invoke further actions), so you might want to monitor the API objects instead.
Programmatically querying the API to determine when the Job has finished, and then creating your Deployment object, might be a feasible, automated way to do it.
Doing it this way, you don't have to worry about redoing the data processing as you would with initContainers, because you only create the deployment and mount the already existing volume.
Also, the official Go client library (client-go) lets you run this check either inside the cluster, in a pod, or externally.
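To make the "wait for the Job, then create the Deployment" idea concrete, here is a minimal sketch in Go. To stay dependency-free it talks to the Kubernetes REST API over plain HTTP instead of using client-go, so it is not the client library approach mentioned above, just the same logic. It assumes `kubectl proxy` is running locally to handle authentication. The Job name `neo4j-import` and the `default` namespace in the usage text are hypothetical. It polls the Job and treats a `Complete` condition with status `True` as success and a `Failed` condition as an error.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// jobObject mirrors only the fields of a batch/v1 Job that this sketch reads.
type jobObject struct {
	Status struct {
		Conditions []struct {
			Type   string `json:"type"`
			Status string `json:"status"`
		} `json:"conditions"`
	} `json:"status"`
}

var errJobFailed = errors.New("job reported a Failed condition")

// jobComplete reports whether the Job JSON carries a Complete=True condition.
// It returns errJobFailed when the Job carries a Failed=True condition.
func jobComplete(body []byte) (bool, error) {
	var job jobObject
	if err := json.Unmarshal(body, &job); err != nil {
		return false, err
	}
	for _, c := range job.Status.Conditions {
		if c.Status != "True" {
			continue
		}
		switch c.Type {
		case "Complete":
			return true, nil
		case "Failed":
			return false, errJobFailed
		}
	}
	return false, nil
}

// waitForJob polls the Job URL until the Job completes, fails, or the
// timeout expires. Transient HTTP errors are retried on the next tick.
func waitForJob(url string, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Get(url)
		if err == nil {
			body, readErr := io.ReadAll(resp.Body)
			resp.Body.Close()
			if readErr == nil && resp.StatusCode == http.StatusOK {
				done, parseErr := jobComplete(body)
				if parseErr != nil {
					return parseErr
				}
				if done {
					return nil
				}
			}
		}
		time.Sleep(interval)
	}
	return errors.New("timed out waiting for job")
}

func main() {
	if len(os.Args) < 2 {
		// Hypothetical example URL; assumes `kubectl proxy` on 127.0.0.1:8001.
		fmt.Println("usage: waitjob http://127.0.0.1:8001/apis/batch/v1/namespaces/default/jobs/neo4j-import")
		return
	}
	if err := waitForJob(os.Args[1], 5*time.Second, 30*time.Minute); err != nil {
		fmt.Fprintln(os.Stderr, "not creating the deployment:", err)
		os.Exit(1)
	}
	fmt.Println("job complete; the Neo4j Deployment can be created now")
}
```

Once this exits successfully, the next step in your script can create the Deployment. If you only need this from a shell, `kubectl wait --for=condition=complete job/<name>` blocks in a similar way without any custom code.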