Pivotal Engineering Journal

Technical articles from Pivotal engineers.

Storing Stateful Data that Outlives a Container or a Cluster; Optimizing for Local Volumes

Storage for the data of stateful Kubernetes apps must outlive a Kubernetes container or a cluster, and can be optimized with Local Volumes; the third blog of a series of 4 on Stateful Kubernetes Apps

(This blog is the third installment of a four-part series)

Kubernetes can automatically provision “remote persistent” volumes with random names

Several types of storage volumes have built-in Kubernetes storage classes that enable dynamic provisioning: a remote persistent volume is created as necessary when a container is spun up for the first time. This provisioning is useful where the cluster has a limited lifetime, such as a development cluster. In such an environment, containers can come and go within the cluster, and any re-created container will remount the persistent disk that was previously created for it, as long as the cluster lives. The names of dynamically generated volumes are chosen by the underlying provisioner and are typically random strings.
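As a minimal sketch of dynamic provisioning, the claim below asks a storage class for a volume and lets Kubernetes create it on demand. The claim name `dev-data` and the class name `standard` are assumptions for illustration; the classes actually available depend on the cluster.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dev-data                # hypothetical claim name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard    # assumed class name; list real ones with "kubectl get storageclass"
  resources:
    requests:
      storage: 10Gi
```

The PersistentVolume created for such a claim typically receives a generated name, which is the random string described above.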

Storage that is automatically provisioned is also, in general, deleted by Kubernetes when the claim bound to the corresponding PersistentVolume is deleted, such as when the cluster is torn down. In other words, the default “Reclaim Policy” of a dynamically provisioned volume instructs Kubernetes to delete the volume when it is no longer claimed. This can be changed in the storage class specification.
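One place to change this is the storage class itself. The sketch below assumes the in-tree GCE persistent disk provisioner and a hypothetical class name `retained-disk`; newer clusters may use a CSI driver as the provisioner instead. Volumes provisioned through this class would be kept, rather than deleted, once their claim goes away.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: retained-disk               # hypothetical class name
provisioner: kubernetes.io/gce-pd   # assumption: GCE persistent disks
parameters:
  type: pd-standard
reclaimPolicy: Retain               # keep the underlying disk after the claim is deleted
```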

Even when retained, these volumes would be hard to track and remount in a new cluster because their names are typically generated as a random string.

StatefulSet offers a predictable, automated naming pattern with a default “retain” policy

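StatefulSets offer the attribute “volumeClaimTemplates”, which controls the names of the generated claims: each claim is named after its template and the pod it belongs to, and each pod is named after the StatefulSet plus an ordinal. Deleting or scaling down a StatefulSet does not delete the claims it created, so the claims and their volumes are retained by default. (The separate attribute “persistentVolumeReclaimPolicy” belongs to the PersistentVolume; it defaults to “Retain” for manually created volumes, while dynamically provisioned volumes inherit the policy of their storage class.) StatefulSets can therefore easily generate storage with well-defined names that persists past the lifetime of the initial set.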
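A minimal sketch of such a StatefulSet follows. The names `my-db` and `pgdata` and the mount path are hypothetical, and a headless Service named `my-db` is assumed to exist separately.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-db                     # hypothetical name
spec:
  serviceName: my-db              # assumed headless Service, defined elsewhere
  replicas: 2
  selector:
    matchLabels:
      app: my-db
  template:
    metadata:
      labels:
        app: my-db
    spec:
      containers:
        - name: db
          image: gcr.io/my-project/my-image
          volumeMounts:
            - name: pgdata        # must match the claim template name below
              mountPath: /pgdata  # hypothetical mount path
  volumeClaimTemplates:
    - metadata:
        name: pgdata
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
```

With two replicas, the generated claims would be named `pgdata-my-db-0` and `pgdata-my-db-1`, so a re-created pod with the same ordinal finds the same claim again.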

Reaching outside of Kubernetes to create volumes

In production, a typical deployment strategy requires that storage for long-lived data persists no matter what happens with any cluster using that data (assuming no process explicitly deletes the data). It is best to provision such storage with names that make tracking and content identification easy. This can be done within a given IaaS platform, and the resulting volume can then be provided to the Kubernetes cluster, to be mounted by name.

For example, in Google Cloud Platform, a command like

gcloud compute disks create --size=20GB my-sample-volume-for-content-xyz

will create a volume with the given name. This volume will continue to exist until it is explicitly deleted from the Google Cloud project, regardless of what happens to any cluster. One way to access this volume within Kubernetes is to refer to it by name, such as with a pod yaml like:

apiVersion: v1
kind: Pod
metadata:
 name: my-host
 labels:
   app: my-app
spec:
 hostname: my-host
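 containers:
 - name: gpdb
   image: gcr.io/my-project/my-image
   volumeMounts:
   - name: pgdata # must match the volume name below
     mountPath: /pgdata # hypothetical path; use the data directory the image expects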
 volumes:
 - name: pgdata
   persistentVolumeClaim:
     claimName: my-claim-gce
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
 name: my-claim-gce
 labels:
   app: my-app
spec:
 accessModes:
   - ReadWriteOnce
 storageClassName: "" # the storageClassName has to be specified but can be empty
 resources:
   requests:
     storage: 10Gi
 selector:
   matchLabels:
     app: my-app
---
apiVersion: v1
kind: PersistentVolume
metadata:
 name: my-host-pv
 labels:
   app: my-app 
spec:
 capacity:
   storage: 10Gi
 accessModes:
   - ReadWriteOnce
 gcePersistentDisk:
   pdName: my-sample-volume-for-content-xyz # this is the linkage with a pre-created volume

Local Persistent Volumes may offer performance gains, at the cost of complexity

Kubernetes 1.10 added, as a beta feature, access to Local Persistent Volumes.

Particularly in “raw block” mode, local persistent volumes can offer a significant performance gain, but at the cost of deployment challenges. If a stateful app’s performance depends on storage throughput, this trade-off may be worth investigating. For example, Salesforce has described their preference for local persistent volumes.

Local persistent volumes are, by definition, local to the nodes on which they have been physically attached and mounted. This contrasts with remote persistent volumes, wherein Kubernetes causes a container to perceive a mounted volume, but the network mediates the communication between container and volume. In other words, a remote persistent volume can be easily remounted on another node, while a local persistent volume cannot. Therefore, when stateful data is already present on a local persistent volume, a stateful app must help the Kubernetes system schedule the appropriate container on the appropriate node that has the appropriate local data. Managing this topology is much more complex and much less flexible than having a remote persistent volume where any node can generally mount any remote volume.
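To make the node binding concrete, here is a minimal sketch of a local persistent volume. The class name `local-storage`, the volume name, the disk path, and the node name `node-1` are all assumptions for illustration.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage                      # hypothetical class name
provisioner: kubernetes.io/no-provisioner  # local volumes are created by hand, not provisioned
volumeBindingMode: WaitForFirstConsumer    # delay binding until a pod is scheduled
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-node-1                    # hypothetical volume name
spec:
  capacity:
    storage: 100Gi
  volumeMode: Filesystem                   # "Block" would request raw block mode
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd1                  # hypothetical path on the node
  nodeAffinity:                            # ties the volume to the node that holds the disk
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-1                   # hypothetical node name
```

The `nodeAffinity` section is what tells the scheduler that any pod claiming this volume has to run on that particular node.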

Rescheduling containers onto the nodes where their data already resides

Kubernetes provides some automatic affinity when a container is replaced within an existing deployment. Remote persistent volumes that were mounted when a container was initially launched will generally be matched and remounted to the recreated container, on any node, while the original deployment is still in effect.

However, when a wholesale change happens, such as when a Kubernetes cluster is wiped and a new one is recreated, how can an app find any existing data, particularly in light of local volumes that cannot be moved?

One strategy is to use DaemonSets to investigate all nodes and attach labels that will help Kubernetes assign containers to an appropriate location.

In other words, the steps include:

  • A short-lived daemon runs on each node, perhaps as a privileged container, investigating any storage found (particularly local), mounting, initializing and validating as necessary, and finally labeling the node appropriately
  • The stateful app’s orchestration (e.g., an operator) adds selectors to container specifications to ensure each stateful container will be scheduled on a node that matches its storage expectation
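The second step can be sketched as a node selector on the pod. The label key and value `example.com/local-data: content-xyz` are hypothetical; the assumption is that the labeling daemon from the first step has already applied that label to the node holding the data.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-host
spec:
  nodeSelector:
    example.com/local-data: content-xyz   # hypothetical label set by the labeling daemon
  containers:
    - name: gpdb
      image: gcr.io/my-project/my-image
```

A pod with this selector stays pending until some node carries the matching label.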

This kind of deployment might fail if there is a gap in the storage, such as a local volume gone missing. At such times, manual intervention may be necessary.

Stateful Apps, a 4-part series