Use Azure File Storage in Kubernetes

My recent work on integrating Microsoft Azure File Storage with Kubernetes storage is available for testing.

Azure File Storage is basically an SMB 3.0 file share. Whenever a VM needs a file share, you can create one from your storage account. There is one limitation for Linux VMs: because the kernel CIFS implementation lacks encryption support, a Linux VM must be colocated with the file share in the same Azure region. Thus, for now, Kubernetes hosts must run in Azure Compute VMs to access their Azure file shares.
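
Under the hood this is a plain CIFS mount. For reference, mounting a share manually from a Linux VM in the same region looks roughly like the following; the storage account name, share name, and account key are placeholders, and the vers= option depends on what your kernel's CIFS module supports:

# install the CIFS mount helper (Fedora/CentOS; use apt-get on Debian/Ubuntu)
yum install -y cifs-utils
mkdir -p /mnt/azurefile
# mount the Azure file share; username is the storage account, password is the account key
mount -t cifs //mystorageaccount.file.core.windows.net/myshare /mnt/azurefile \
    -o vers=3.0,username=mystorageaccount,password=<storage-account-key>,dir_mode=0777,file_mode=0777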

It is also possible to use Azure Block Blob storage for Kubernetes, though that will require more effort and new APIs from Azure.

 

Data Volumes for Containers

The last part of the trilogy is a survey of storage technologies currently on the market that supply data volumes for Containers.

In one of the previous posts, I listed innovations and technologies sparked by the introduction of Docker’s volume plugins. A rough categorization of these technologies is as follows.

  • Enablement. Obviously most of those volume plugins fall into this category. They connect various storage backends (Glusterfs, Ceph RBD, NAS, Cloud Storage, etc.) to Containers’ mount namespaces so Containers can store and retrieve data from these backends (a usage sketch follows this list).
  • Data Protection. Some technologies enable data protection (backup, snapshot, and replication), a core storage function. I have yet to spot any innovation beyond the traditional data protection though.
  • Mobility. Some technologies assist containers’ mobility. Containers can relocate to new homes and find data right there. I am not so certain all these technologies work flawlessly though.
  • Provisioning. Rexray’s blog did a better job explaining this than I could.
  • Multi-tenancy. Some claim to support it, such as BlockBridge. A demo video is available (thanks to Ilya for the pointer); it appears block storage is created based on the tenant’s credentials.
  • Security (or something like that). I haven’t spotted much here; is that omission my fault?
  • Performance. Is there any innovation out there that speeds up Containers’ storage performance?
  • Isolation. Any innovation out there to keep noisy neighbors quiet?
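
To make the Enablement category concrete, here is roughly what consuming a volume plugin looks like from the Docker CLI. The driver name rexray and the volume name mydata are placeholders; the corresponding plugin daemon must already be installed and running on the host:

# create data in a named volume backed by an external volume plugin
docker run --rm --volume-driver=rexray -v mydata:/data busybox sh -c 'echo hello > /data/hello'
# a second container (possibly on another host, depending on the backend) sees the same data
docker run --rm --volume-driver=rexray -v mydata:/data busybox cat /data/hello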

In Kubernetes, volume drivers live in a similar dimension. We have achieved significant progress: we support many on-premise and cloud storage types, and the list keeps growing. We are addressing issues that container users and infrastructure administrators care about, such as provisioning, security, and multi-tenancy. These technologies can help different Container Engine deployments (Docker, rkt, hyper, etc.).

Storage Issues in Containers

(Continued from last post on Performance in Virtualized Storage)

Storage issues in Containers are somewhat different from those in hypervisors. Docker uses two kinds of drivers: storage drivers, used for container images, and volume drivers, used for so-called data volumes.

Storage drivers are responsible for translating images into a Container’s root filesystem. Docker supports device-mapper, AUFS, OverlayFS, Btrfs, and, recently, s3 storage drivers. Storage drivers usually support snapshots (though these may be emulated) and thin provisioning.
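
For instance, you can check which storage driver a daemon is using and, if you are experimenting, start it with a different one. The overlay choice below is just an example; availability depends on your kernel:

# show the storage driver currently in use
docker info | grep 'Storage Driver'
# start the daemon with a specific storage driver (example: OverlayFS)
docker daemon --storage-driver=overlay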

Not all drivers are the same. My colleague Jeremy Eder has benchmarked storage drivers extensively in his blog.

Most of the performance issues, also expressed as Problem 6 in this LWN article, are caused by (false) sharing: one container’s I/O activity is felt by others in the shared underlying storage, a.k.a. the noisy neighbor problem.

Naturally, solutions invariably concentrate on jailbreaking shared storage.

IceFS, despite what the name suggests, was originally meant for hypervisors. Nonetheless, the idea is rich enough to shed light on container storage. IceFS provides physical and namespace isolation for hypervisor (and potentially container) consumers. Such isolation improves reliability and lessens the noisy neighbor problem. I have yet to spot snapshot and thin provisioning support for possible Docker adoption, though.

SpanFS is like IceFS on isolation but more aggressive: I/O stacks are also isolated, so buffer allocation and scheduling for different containers are completely separated (locks and noise? no more!). The result is astounding: certain microbenchmarks showed it to be 10x faster than ext4.

Split-Level I/O is somewhat along that line too: I/O stacks are not only isolated but also tagged for each process/container/VM, so priority tagging and resource accounting are well under control. This corrects the priority inversion and priority ignorance caused by noisy neighbors.

Performance in Virtualized Storage

When presenting a block device to a guest VM, hypervisors (ESX and QEMU/KVM) usually emulate the block device on top of a file in the host filesystem, illustrated roughly as follows:

[Figure: guest block device emulated as an image file on the host filesystem]
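
For concreteness, a minimal QEMU/KVM invocation in this style keeps the guest’s block device as an image file on the host filesystem; the paths, sizes, and memory below are arbitrary:

# the guest's "disk" is just a qcow2 file living on the host filesystem
qemu-img create -f qcow2 /var/lib/libvirt/images/guest.qcow2 20G
# present that file to the guest as a virtio block device
qemu-system-x86_64 -enable-kvm -m 2048 \
    -drive file=/var/lib/libvirt/images/guest.qcow2,if=virtio,format=qcow2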

Architecture-wise, this is a clean approach: it separates storage from the hypervisor.

However, as pointed out in [1], for write-heavy and latency-sensitive workloads, this architecture delivers suboptimal performance (as low as half of wire speed).

There are some projects aiming for better performance:

  • VirtFS. VirtFS essentially bypasses the block storage stack in the guest VM. This project, however, appears inactive.
  • Ploop. Strictly speaking, Ploop is not designed for hypervisors. It nonetheless embodies some of the same ideas: layout awareness, page cache bypass, etc.
  • As in [2], special handling of dirty pages using a journaling device. This helps some (mostly write-heavy) workloads.
  • TCMU userspace passthrough. This could be a more general framework, though it needs a performance focus.

Similarly, object storage systems that use a file store are also concerned about performance loss in the nested filesystem layers. Ceph is exploring this direction with NewStore and seemingly favors owning block allocation, thus bypassing the file store.

References

  1. “Understanding performance implications of nested file systems in a virtualized environment”, D. Le, H. Huang, H. Wang, FAST 2012
  2. “Host-side filesystem journaling for durable shared storage”, A. Hatzieleftheriou, S. V. Anastasiadis, FAST 2015

Latest Status on Docker Mount Propagation Development

Some recent heated discussion on this topic compels me to write up the latest development.

Docker has donated libcontainer to the Open Container Project, so my Docker pull request has moved to runc as well.

Here is the build and test walkthrough.


# get my code
git clone https://github.com/rootfs/docker
cd docker
git checkout rootfs_mount
make
# replace docker with my docker binary
ln -fs `pwd`/bundles/1.8.0-dev/binary/docker-1.8.0-dev /usr/bin/docker
# clean up docker images
mv /var/lib/docker /var/lib/docker.old
# !!!! edit /usr/lib/systemd/system/docker.service and change MountFlags=slave to MountFlags=shared !!!

# reload/restart docker service
for SERVICES in docker; do systemctl stop $SERVICES; systemctl reload $SERVICES; systemctl daemon-reload; systemctl restart $SERVICES; systemctl enable $SERVICES; systemctl status $SERVICES; done
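
# (my own sanity check, optional) the root mount should now show as shared so that
# container mounts can propagate back; the PROPAGATION column needs a reasonably
# recent util-linux findmnt
findmnt -o TARGET,PROPAGATION /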

# Build a docker image that contains the nfs-utils rpm; below is my Dockerfile
# cat Dockerfile
# FROM centos
# RUN yum install -y nfs-utils

# build the image
docker build -t  centos-nfs .

# run it in privileged mode with slave (not shared) propagation using --rootmount=slave, and mount an NFS filesystem

docker run -ti --net=host --rootmount=slave --privileged -v /test:/test centos-nfs mount -t nfs nfs-server:/home/git/dev /test/mnt -o nolock,vers=3
mount |grep /test/mnt

[Empty output: the NFS mount did not propagate back to the host]

# run it in privileged mode with shared propagation using --rootmount=shared, and mount an NFS filesystem
docker run -ti --net=host --rootmount=shared --privileged -v /test:/test centos-nfs mount -t nfs nfs-server:/home/git/dev /test/mnt -o nolock,vers=3
mount |grep /test/mnt

nfs-server:/home/git/dev on /test/mnt type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.16.154.75,mountvers=3,mountport=20048,mountproto=udp,local_lock=all,addr=10.16.154.75)

RBD Volume Mount Fencing in Kubernetes

Kubernetes can use RBD images as persistent block storage for Linux containers. However, only one container can mount an RBD volume in read-write mode. If multiple containers write to the same RBD volume without higher-level coordination, data corruption will likely occur, as reported in a recent case.

One intuitive solution is to have the persistent block storage provider restrict client mounts. For instance, Google Compute Engine’s Persistent Disk allows only one read-write mount.

Another approach is fencing. An RBD image writer must hold an exclusive lock on the image while it is mounted. If the writer fails to acquire the lock, it is safe to assume the image is in use by someone else, and the writer shouldn’t attempt to mount the RBD volume. As a result, only one writer can use the image at a time, and there is no more data corruption.
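
Conceptually, the fencing boils down to the rbd advisory lock commands. Here is a rough sketch of the idea using the rbd CLI; the pool kube and image foo match the test below, and the lock id string is arbitrary:

# try to take an exclusive advisory lock before mapping and mounting the image;
# this fails if another writer already holds a lock
rbd lock add kube/foo kubelet_lock_magic_$(hostname) || exit 1
# ... map, mount, and use the image ...
# on unmount, look up the locker (e.g. client.4494 from 'rbd lock list') and release the lock
rbd lock list kube/foo
rbd lock remove kube/foo kubelet_lock_magic_$(hostname) client.4494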

This is what the RBD volume mount fencing pull request does for Kubernetes. I ran the following test and found that it fixes the mount racing problem.

I have two Fedora 21 hosts. Each loads my fix and runs as a local cluster:


# ./hack/local-up-cluster.sh

# start the rest of the kubectl routines

Then each local cluster creates a Pod using an RBD volume:


#./cluster/kubectl.sh create -f examples/rbd/rbd.json
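
For reference, the Pod definition is along these lines; the monitor address, pool, image, user, and keyring path below come from my test cluster, so yours will differ:

cat > /tmp/rbd-pod.json <<'EOF'
{
  "apiVersion": "v1",
  "kind": "Pod",
  "metadata": { "name": "rbd" },
  "spec": {
    "containers": [{
      "name": "rbd-rw",
      "image": "kubernetes/pause",
      "volumeMounts": [{ "name": "rbdpd", "mountPath": "/mnt/rbd" }]
    }],
    "volumes": [{
      "name": "rbdpd",
      "rbd": {
        "monitors": ["10.16.154.78:6789"],
        "pool": "kube",
        "image": "foo",
        "user": "admin",
        "keyring": "/etc/ceph/keyring",
        "fsType": "ext4",
        "readOnly": false
      }
    }]
  }
}
EOF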

Watch RBD image lock:


# rbd lock list foo --pool kube
There is 1 exclusive lock on this image.
Locker       ID                        Address
client.4494  kubelet_lock_magic_host   10.16.154.78:0/1026846


On both clusters, get the Pod status. I see that one cluster has a running Pod while the other cluster’s Pod stays pending.

Running Pod:


# ./cluster/kubectl.sh get pod
NAME      READY     REASON    RESTARTS   AGE
rbd       1/1       Running   0          5m


The other Pod:


# ./cluster/kubectl.sh get pod
NAME      READY     REASON                                                     RESTARTS   AGE
rbd3      0/1       Image: kubernetes/pause is ready, container is creating   0          4m

Then I delete the running Pod, and the second one immediately becomes Running.

So with this fix, Pods do get fenced off.

Share Host’s Mount Namespace with Docker Containers

This is a follow-up to my previous post about using a Super Privileged Container (SPC) to mount a remote filesystem on the host. That approach drew criticism for hacking into mount helpers.

I made a Docker patch so that the Docker daemon doesn’t isolate the host’s mount namespace from containers. Containers are thus able to see and update the host’s mount namespace. This feature is turned on through a Docker client option --hostns=true.

A running instance looks like the following:

First start a container and set --hostns=true:

#docker run --privileged --net=host --hostns=true -v /:/host -i -t centos bash

On another terminal, after the container is up and you get a bash shell, inspect the container’s mount namespace:

# pid=`ps -ef |grep docker |grep -v run|grep -v grep|awk '{print $2}'`; b=`ps -ef |grep bash|grep ${pid}|awk '{print $2}'`; cat /proc/${b}/mountinfo

And below, I spotted the following line, indicating the container and host share the same mount namespace.

313 261 253:1 / /host rw,relatime shared:1 - ext4 /dev/mapper/fedora--server_host-root rw,data=ordered
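
As an aside, a less fragile way to find the container’s bash PID, assuming you started the container with a name (here a hypothetical centos-shell), is docker inspect:

# print the PID of the container's init process (bash in this case)
docker inspect --format '{{.State.Pid}}' centos-shell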

Then, in the container’s shell, install the glusterfs-fuse package and mount a remote Glusterfs volume:

# yum install glusterfs-fuse attr -y
# mount -t glusterfs gluster_server:kube_vol /host/shared

Go back to the host terminal and check whether the host can see the Glusterfs volume:

# findmnt |grep glusterfs |tail -1
└─/shared gluster_server:kube_vol fuse.glusterfs rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072

So far so good!

A Tale of Two Virtualizations

In my previous post on Intel’s Clear Linux project, I had a few questions on how Intel got KVM to move fast enough to match containers. Basically, Clear Linux aims to make hypervisors first-class citizens in the container world.

Today I looked into another, similar technology called hyper. hyper positions itself as a hypervisor-agnostic, high-performing, and secure alternative to Docker and KVM. Love it or hate it, hyper is able to run both a hypervisor and a container in its own environment.

The architecture, as I gathered from the source, is shared with Docker: a CLI client interacts with a hyper daemon through REST. The daemon, by invoking the QEMU and Docker engines, creates/destroys/deletes either VMs or containers. hyper understands Docker images (it uses the Docker daemon API), QEMU (it directly execs QEMU commands with a well-tuned configuration), and Pods (which appear similar to Kubernetes Pods, except for the QEMU provisioning).

hyper comes with hyperstart, a replacement for init(1), aiming for fast startup. To use hyperstart, you have to bake an initrd.

With these two similar initiatives converging hypervisors and containers, I am now daydreaming of a near future in which we don’t have to make trade-offs between VMs and containers within a single framework (KVM or Docker).

Kubernetes Storage Options

Below is what’s already in, what’s queued, and what’s missing.

TYPE              FORMAT   DURATION     PROVIDER                 NOTE
EmptyDir          File     Ephemeral    Local host               Available
HostDir           File     Persistent   Local host               Available
GitRepo           File     Persistent   Git repository           Available
GCE PD            Block    Persistent   GCE                      Available
AWS EBS           Block    Persistent   AWS                      Available
NFS               File     Persistent   NFS Server               Available
iSCSI             Block    Persistent   iSCSI target provider    Available
Glusterfs         File     Persistent   Glusterfs Servers        Available
Ceph RBD          Block    Persistent   Ceph Cluster             Available
Ceph FS           File     Persistent   Ceph Cluster             Pull Request
OpenStack Cinder  Block    Persistent   Cinder                   Pull Request
CIFS              File     Persistent   CIFS Server              MISSING
LVM               Block    Persistent   Local                    MISSING

Yet Another Containerized Ceph Cluster

I have a working container that helps set up a test bed to verify my pull requests to add Ceph RBD and Ceph FS as persistent volumes for Kubernetes.

The (single-node) container is based on CentOS 6 and uses ceph-deploy to create the cluster and monitor. There were some OSD placement group (pg) problems, and Sebastien Han helped resolve them. A single run.sh is your friend to create and start the container.
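
If you want to try it, the flow is roughly the following; the image tag and container name are placeholders for whatever run.sh uses in your checkout:

# build the image from the repository's Dockerfile, then create and start the container
docker build -t ceph-single-node .
./run.sh
# once the monitor and OSDs are up, check cluster health from inside the container
docker exec -it ceph-single-node ceph -s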

And this is not the end of the story. In the coming days, we are going to experiment with deploying Ceph on Kubernetes, for real or for play. Stay tuned.