Kubernetes can use RBD images as persistent block storage for Linux containers. However, only one container can safely mount an RBD volume in read-write mode. If multiple containers write to the same RBD volume without higher-level coordination, data corruption is likely, as reported in a recent case.
One intuitive solution is to have the persistent block storage provider restrict client mounts. For instance, Google Compute Engine’s Persistent Disk allows only one read-write mount.
Another approach is fencing. An RBD image writer must hold an exclusive lock on the image while it is mounted. If the writer fails to acquire the lock, it is safe to assume the image is in use by another client, and the writer should not attempt to mount the RBD volume. As a result, only one writer can use the image at a time, which prevents the data corruption.
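The lock-then-mount rule can be sketched with the rbd command line. This is only an illustration, not the code from the pull request: the function name fenced_map is hypothetical, and the lock ID format is an assumption modeled on the lock list output shown in the test below.

```shell
#!/bin/sh
# Sketch of lock-based fencing for an RBD image.
# fenced_map is a hypothetical helper, not part of Kubernetes or Ceph.
fenced_map() {
    pool=$1
    image=$2
    # Assumed lock ID format: a fixed prefix plus this node's name.
    lock_id="kubelet_lock_magic_$(uname -n)"

    # "rbd lock add" takes an exclusive advisory lock and fails if
    # another client already holds one on this image.
    if ! rbd lock add "$image" "$lock_id" --pool "$pool"; then
        echo "image $pool/$image is locked by another client; not mapping" >&2
        return 1
    fi

    # The lock is held, so it is safe to map the image as a block device.
    rbd map "$image" --pool "$pool"
}
```

Calling fenced_map kube foo on two hosts should let only the first caller map the image. RBD locks are advisory, so this fences only clients that also take the lock before mapping. The sketch does not release the lock if the map step fails.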
This lock-based fencing is what the RBD volume mount fencing pull request implements for Kubernetes. I ran the following test and found that it fixes the mount race.
I have two Fedora 21 hosts. Each one has my fix loaded and runs as a local cluster:
# ./hack/local-up-cluster.sh
# start the rest of the kubectl routines
Then each local cluster creates a Pod that uses an RBD volume:
# ./cluster/kubectl.sh create -f examples/rbd/rbd.json
Check the lock on the RBD image:
# rbd lock list foo --pool kube
There is 1 exclusive lock on this image.
Locker      ID                      Address
client.4494 kubelet_lock_magic_host 10.16.154.78:0/1026846
On both clusters, get the Pod status. One cluster has a running Pod, while the other shows its Pod as pending.
Running Pod:
# ./cluster/kubectl.sh get pod
NAME      READY     REASON    RESTARTS   AGE
rbd       1/1       Running   0          5m
The other Pod:
# ./cluster/kubectl.sh get pod
NAME      READY     REASON                                                    RESTARTS   AGE
rbd3      0/1       Image: kubernetes/pause is ready, container is creating   0          4m
When I delete the running Pod, the second one immediately starts running.
So with this fix, Pods do get fenced off.
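The hand-off seen when the running Pod is deleted suggests that the lock is released on teardown. A rough sketch of such a release step follows. It is an assumption about the flow, not the pull request's code: fenced_unmap is a hypothetical helper, and it parses the lock list output in the three-column form shown above.

```shell
#!/bin/sh
# Sketch: unmap the device, then release this node's advisory lock.
# fenced_unmap is a hypothetical helper, not part of Kubernetes or Ceph.
fenced_unmap() {
    pool=$1
    image=$2
    device=$3
    # Assumed lock ID format, matching the one used when locking.
    lock_id="kubelet_lock_magic_$(uname -n)"

    # Stop using the image before giving up the lock.
    rbd unmap "$device" || return 1

    # Look up the locker (for example client.4494) that holds our lock ID.
    locker=$(rbd lock list "$image" --pool "$pool" |
        awk -v id="$lock_id" '$2 == id { print $1 }')

    # Nothing to release if no lock with our ID is listed.
    [ -n "$locker" ] || return 0

    rbd lock remove "$image" "$lock_id" "$locker" --pool "$pool"
}
```

For example, fenced_unmap kube foo /dev/rbd0 would unmap the device and then drop the lock, which lets a waiting writer on another host acquire it.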