Is it feasible to store Cassandra data on another distributed file system such as MapR or HDFS?

4/27/2020

I just want to understand the impact of storing Apache Cassandra data on another distributed file system.

For example, let's say I have a 5-node Hadoop cluster with a replication factor of 3.

Similarly, for Cassandra I have a 5-node cluster with a replication factor of 3 for all keyspaces. All data will be stored at an HDFS location, with the same mount path on every node.

For example, node-0's Cassandra data directory is "/data/user/cassandra-0/"

and the Cassandra logs directory is "/data/user/cassandra-0/logs/".
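
In cassandra.yaml terms, the layout I am describing would look roughly like this (just a sketch; the subdirectory names are only illustrative, the point is that everything lives on the same HDFS-backed mount):

    # cassandra.yaml excerpt (sketch) - everything under the shared mount on node-0
    data_file_directories:
        - /data/user/cassandra-0                               # SSTables, as in the example above
    commitlog_directory: /data/user/cassandra-0/commitlog      # illustrative subdirectory on the same mount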

With this kind of architecture, I need comments on the following points:

  1. As suggested in the DataStax documentation, the Cassandra data and commitlog directories should be different, which is not possible in this case. With the default configuration the Cassandra commitlog size is capped at 8192 MB. So, as per my understanding, if I have a 1 TB disk and it gets full or hits any disk-level error, will that stop the entire Cassandra cluster? (The relevant settings are sketched after this list.)

  2. The second question is related to the underlying storage mechanism. With two levels of data distribution, i.e. a replication factor of 3 for HDFS and 3 for Cassandra, will the same data (SSTables) be stored in 9 locations? That is a significant loss of storage; please advise on this.
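
For reference, these are the cassandra.yaml settings question 1 is about (a rough sketch; 8192 MB is the cap mentioned above, and the policy values shown are, to my knowledge, the 3.x defaults):

    # cassandra.yaml excerpt (sketch) - settings relevant to question 1
    commitlog_directory: /data/user/cassandra-0/commitlog   # DataStax recommends a separate physical disk
    commitlog_total_space_in_mb: 8192   # default cap: the smaller of 8192 MB and 1/4 of the commitlog volume
    disk_failure_policy: stop           # 3.x default: on a disk error the affected node stops gossip and client transports
    commit_failure_policy: stop         # 3.x default: same behaviour for commitlog write failures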

-- andy
cassandra
cassandra-3.0
datastax
distributed-computing
kubernetes

1 Answer

4/28/2020

Cassandra doesn't support out-of-the-box storage of data on non-local file systems such as HDFS. You could theoretically hack the source code to support this, but it makes no sense: Cassandra handles replication itself and doesn't need an additional file system layer.
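
As a rough sketch, the usual setup is to point Cassandra at node-local disks and let the keyspace replication factor provide the redundancy (the paths below are just the packaged defaults, adjust as needed):

    # cassandra.yaml excerpt (sketch) - node-local disks, default package paths
    data_file_directories:
        - /var/lib/cassandra/data            # local disk on each node
    commitlog_directory: /var/lib/cassandra/commitlog
    # Redundancy comes from the keyspace replication factor (3 in the question),
    # not from the file system. Stacking Cassandra RF 3 on top of HDFS RF 3 would
    # indeed mean roughly 3 x 3 = 9 physical copies of each SSTable, as suspected
    # in question 2.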

-- Alex Ott
Source: StackOverflow