Data should be managed at the distributed persistence layer

Why Data Problems Cannot Be Solved At the Storage Layer

When applications move to production, one common requirement is to support Disaster Recovery (DR) or High Availability (HA), so that if one site suffers an outage, the application can continue to operate without disruption. A true DR/HA solution, by definition, requires the application to span multiple sites, with each site residing in a different availability zone, data center, or cloud region. This in turn requires the data to be replicated or shared across sites.

Behind the scenes, DR typically refers to the active-standby setup, where the data is one-way replicated from the active site to the standby site. HA, in contrast, is an active-active setup, where the data is shared between two or more active sites.

The conventional wisdom is that data replication or sharing should be solved at the storage level. After all, the storage layer handles the read/write IO between data and block devices, so it can easily keep track of which blocks have been updated or have not yet been copied to the remote site(s). Historically, the storage layer may have been the only way to handle DR/HA, because applications and their data persistence layers (e.g., databases) were not designed to replicate or share data across sites while tolerating network latency and partitions.

The storage layer continues to innovate. Today most modern storage vendors provide proprietary hardware/software stacks that manage storage replication, with data compression, deduplication, multi-tier storage, incremental replication schemes, etc. (Note that this is still mostly one-way data replication. Cross-site data sharing is a lot harder to deal with, especially with simultaneous writers on multiple sites.)

However, a lot has changed above the storage layer. There’s a saying that everybody thinks that whatever layer of the stack they sit at is where the real problems are, and that every layer above them is easy, presumably because the layers above are just dumb consumers… Well, not any more! As the software world has evolved, more and more intelligence has been built into the consumer/client side, which requires less interaction with the provider side and thus improves performance. The application’s data persistence service has been further refined and subdivided into multiple specialized middleware services, such as SQL databases, NoSQL databases, message queues, and object stores. With more logic built into these services, one shortcoming of the underlying storage layer becomes obvious: it is only aware of read/write IO at the file or block level, and has no knowledge of the transactions, caches, or other intelligence that sit above it. To keep the replicated data consistent, the database service would have to commit its transactions and flush its caches before storage replication starts. Without tapping into the database service, the storage layer cannot know when it is safe to replicate. And with the proliferation of different kinds of application persistence services, it is hard for the storage service to be transaction-aware for every one of them.
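
The consistency gap described above can be illustrated with a toy model. This is only a sketch under stated assumptions: BlockDevice, transfer_steps, and replicate are hypothetical names invented for the illustration, and the "transaction" is just two block writes that belong together. The point is that a replication pass that copies blocks as they are can capture the state between the two writes.

```python
class BlockDevice:
    """Hypothetical stand-in for a volume: a flat list of integer blocks."""

    def __init__(self, num_blocks):
        self.blocks = [0] * num_blocks

    def write(self, index, value):
        self.blocks[index] = value


def transfer_steps(device, src, dst, amount):
    """One logical transaction made of two block writes.

    Yields after each write so the caller can interleave a replication pass.
    """
    device.write(src, device.blocks[src] - amount)  # debit
    yield "debited"
    device.write(dst, device.blocks[dst] + amount)  # credit
    yield "credited"


def replicate(primary, replica):
    """Block-level copy: it sees blocks, not transaction boundaries."""
    replica.blocks = list(primary.blocks)


primary = BlockDevice(2)
replica = BlockDevice(2)
primary.write(0, 100)  # account A starts with 100
primary.write(1, 0)    # account B starts with 0

steps = transfer_steps(primary, src=0, dst=1, amount=30)
next(steps)                        # debit is on disk, credit is not
replicate(primary, replica)        # replication fires mid-transaction
mid_total = sum(replica.blocks)    # money is missing on the replica

next(steps)                        # the transaction completes on the primary
replicate(primary, replica)        # replication after the commit
final_total = sum(replica.blocks)  # the replica is consistent again

print(mid_total, final_total)
```

If the first site failed right after the first replication pass, the standby would come up with a balance that never existed from the application's point of view. Only the layer that knows where the transaction starts and ends can avoid that.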

Things get even more complicated when the persistence service is distributed. Take etcd, a key component of Kubernetes, as an example. Kubernetes uses etcd as its persistence layer for all desired state and configuration. In a highly available (multi-master) setup, Kubernetes typically runs 3 or 5 etcd nodes as one cluster. Behind the scenes, etcd uses the Raft consensus algorithm to elect one node as the leader, with the remaining nodes acting as followers. All write operations go through the leader, which replicates each change to the followers and treats it as committed once a majority of nodes (a quorum) has persisted it. The leader is not static: it periodically sends heartbeat messages to the followers, and a follower that stops hearing from the leader starts a new election. If the leader goes down or the cluster is partitioned, the side that still holds a majority (more than half of the cluster’s nodes) elects a new leader and continues to operate. This kind of distributed and dynamic activity makes it very difficult for a storage layer to track updates.
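
The majority rule in the etcd example comes down to simple arithmetic. The sketch below is not etcd code; majority, can_elect_leader, and tolerated_failures are hypothetical helper names used only to show why clusters of 3 or 5 nodes are the usual choice.

```python
def majority(cluster_size):
    """Smallest node count that is strictly more than half of the cluster."""
    return cluster_size // 2 + 1


def can_elect_leader(cluster_size, reachable_nodes):
    """A partition can elect a leader only if it holds a majority."""
    return reachable_nodes >= majority(cluster_size)


def tolerated_failures(cluster_size):
    """How many nodes can be lost while a majority still remains."""
    return cluster_size - majority(cluster_size)


for size in (3, 4, 5):
    print(size, majority(size), tolerated_failures(size))

# A 5-node cluster split 3/2: only the 3-node side keeps operating.
print(can_elect_leader(5, 3), can_elect_leader(5, 2))
```

A 4-node cluster needs 3 nodes for a majority and so tolerates only one failure, the same as a 3-node cluster, which is why an even node count adds cost without adding resilience.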

Instead of having the storage layer try to solve the multi-site data replication and sharing problem, why not let the application data persistence services solve it themselves? In the cloud-native world, many of the new generation of application persistence services are distributed, and some tolerate enough network latency and partitioning to be deployed across multiple sites. For relational databases, there is a new class of services, NewSQL, in which a single logical RDBMS can span multiple sites: CockroachDB (PostgreSQL compatible), TiDB (MySQL compatible), YugabyteDB, and Google’s Cloud Spanner service. For NoSQL, given the simpler, often key-value based data model, multi-site deployment is generally easier to support; many popular solutions such as Cassandra and MongoDB support it, as do cloud services like Azure Cosmos DB. Message buses like Kafka and NATS support multi-site deployment as well. For object stores, services such as MinIO and even EdgeFS can be a good fit for multi-site. Many of these services not only allow multi-site deployments but also build in sophisticated consensus algorithms such as Paxos and Multi-Raft to ensure data consistency and availability, and they can survive a network partition as long as more than half of the cluster nodes are still alive and interconnected. It is far more efficient to build data consensus mechanisms on the persistence service’s own data structures than to force them into the pigeonholes of files and blocks that the storage layer can see.
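
The "more than half of the cluster nodes" condition has a practical consequence for how nodes are spread across sites. The following is a minimal sketch, assuming a majority-based consensus service; the site names and the functions survives_site_loss and fragile_sites are hypothetical and exist only for this illustration.

```python
def survives_site_loss(nodes_per_site, lost_site):
    """True if the nodes outside lost_site still form a strict majority."""
    total = sum(nodes_per_site.values())
    remaining = total - nodes_per_site[lost_site]
    return remaining > total / 2


def fragile_sites(nodes_per_site):
    """Sites whose loss would leave the cluster without a majority."""
    return [
        site
        for site in nodes_per_site
        if not survives_site_loss(nodes_per_site, site)
    ]


# Hypothetical layouts for a 5-node cluster.
two_sites = {"site-a": 3, "site-b": 2}
three_sites = {"site-a": 2, "site-b": 2, "site-c": 1}

print(fragile_sites(two_sites))    # the site holding 3 of 5 nodes is a single point of failure
print(fragile_sites(three_sites))  # no single site loss removes the majority
```

With only two sites, one of them always holds at least half of the nodes, so losing it stops the cluster. Surviving the loss of any one site with a majority-based service therefore takes at least three sites, which is worth factoring into a multi-site design.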

To conclude, modern application persistence services are more intelligent than before. Data problems can be solved well in the application persistence services themselves, instead of relying on the underlying storage layer. However, there is no free lunch here: applications will need to be refactored to use these new cloud-native persistence services.

Time to modernize!
