I have a base image and several orthogonal "dimensions" of completely static overlays that map into data directories, each dimension with several options, which I want to permute to produce the final container(s) in my deployments. As a simplified example, the base image (X) will need one of each of (A,B,C), (P,D,Q), and (K,L,M) at deployment time. What I'm doing now is building a separate image for each permutation I actually end up needing: e.g. XADM, XBDK, etc. The problem is that as the number of dimensions of static data overlays expands and the number of choices inside each dimension grows, I run into serious combinatorial explosion: it might take 10 minutes for our CI/CD system to build each image (some of the overlays are large), and since the base image is what changes most often, the layers don't cache well.
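To make the current setup concrete, here is a minimal sketch of one per-permutation build (the registry name, overlay layout, and /data/* paths are illustrative, not my real ones):

```dockerfile
# Hypothetical per-permutation Dockerfile, parameterized by build args, e.g.:
#   docker build --build-arg DIM1=A --build-arg DIM2=D --build-arg DIM3=M -t app:XADM .
FROM registry.example.com/app-base:latest

ARG DIM1=A
ARG DIM2=D
ARG DIM3=M

# One COPY per dimension; each overlay is a large directory of read-only data.
# Because the frequently rebuilt base image sits below these layers, every base
# rebuild invalidates the big COPY layers for every permutation.
COPY overlays/dim1/${DIM1}/ /data/dim1/
COPY overlays/dim2/${DIM2}/ /data/dim2/
COPY overlays/dim3/${DIM3}/ /data/dim3/
```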
Thoughts so far:
- generate each layer (ABCPDQKLM) as a separate container that populates a volume, which then gets mounted by each of my X containers (see the first sketch after this list). This is fine, though I NEVER need the layers to be writable and don't especially want to pay for persistent storage associated with volumes that feel like they should be superfluous.
- reorder my layers from slowest- to fastest-changing (see the second sketch after this list). I can get some improvement from doing this, but I still hit the combinatorics issue: I probably still have to build all the combinations I need, but at least my CI/CD build time improves. I think it results in poorer layer caching overall, but trading off space for time might be reasonable, and the per-tenant result is still good and doesn't incur any volume storage during deployment.
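For the volume-populating idea (first bullet above), this is roughly what I have in mind; the image name, tag, and paths are hypothetical:

```dockerfile
# Hypothetical data-only image for overlay A of dimension 1: it carries just the
# overlay payload and, when run, copies it into whatever volume is mounted at /target.
FROM busybox:1.36
COPY overlays/dim1/A/ /payload/
CMD ["sh", "-c", "cp -a /payload/. /target/"]
```

An X container would then mount the same volume read-only at the expected /data path (e.g. a shared named volume, or an init container filling an emptyDir in Kubernetes).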
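For the layer-reordering idea (second bullet above), a multi-stage sketch of what "slowest-to-fastest changing" could look like; it assumes the base image's content (and whatever runtime it needs) can be lifted wholesale from a single directory like /app, which may not hold for a real base image:

```dockerfile
# Hypothetical "overlays first, base last" build for permutation ADM.
# Stage 1: the big, slow-changing overlays become the bottom layers.
FROM alpine:3.19 AS overlays
COPY overlays/dim1/A/ /data/dim1/
COPY overlays/dim2/D/ /data/dim2/
COPY overlays/dim3/M/ /data/dim3/

# Stage 2: copy the frequently-changing base content in last, so a base rebuild
# only touches this final layer instead of invalidating the large overlay layers.
FROM overlays
COPY --from=registry.example.com/app-base:latest /app/ /app/
```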
I'm not happy about either option (or my current solution). Any ideas would be welcome.
Edits/Questions:
- "static" means read-only, but as a practical matter, the A/B/C overlays might each be a few 100MB of directory structure to be mounted/present in a specific place in the container's file system. In every case, it is data that is going to be used (even memory-mapped!) by the programs in the base image, so it needs to be at least very effectively cached near each of the CPUs that is going to be using it. I like the performance characteristics of having the data baked into the containers, but perhaps I should be trusting the storage layer more to keep the data properly cached/replicated near the real CPUs. Doing so means trading off registry space charges against PV storage charges, but that may be a minor consideration.
- Basically, each "dimension" is a type of trained machine learning model. I need to compose the dimensions by choosing the right set of trained models to fit the domain required for each of many production tenants.