Use Azure File Storage in Kubernetes

My recent work on integrating Microsoft Azure File Storage with Kubernetes storage is available for testing.

Azure File Storage is basically an SMB 3.0 file share. Whenever a VM needs a file share, you can create one from your storage account. There is one limitation for Linux VMs: because the kernel CIFS implementation lacks encryption support, a Linux VM must be colocated with the file share in the same Azure region. Thus, for now, Kubernetes hosts must run in Azure Compute VMs to access their Azure file shares.
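
Under the hood this is a plain CIFS mount. For reference, mounting a share manually from a Linux VM in the same region looks roughly like the following; the storage account name, share name, and account key are placeholders, and the vers= option depends on what your kernel's CIFS module supports:

# install the CIFS mount helper (Fedora/CentOS; use apt-get on Debian/Ubuntu)
yum install -y cifs-utils
mkdir -p /mnt/azurefile
# mount the Azure file share; username is the storage account, password is the account key
mount -t cifs //mystorageaccount.file.core.windows.net/myshare /mnt/azurefile \
    -o vers=3.0,username=mystorageaccount,password=<storage-account-key>,dir_mode=0777,file_mode=0777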

It is also possible to use Azure Block Blob storage for Kubernetes, though that will require more effort and new APIs from Azure.

 

Data Volumes for Containers

The last part of the trilogy is a survey of storage technologies currently on the market that supply data volumes for Containers.

In one of the previous posts, I listed innovations and technologies sparked by the introduction of Docker’s volume plugins. A rough categorization of these technologies is as follows.

  • Enablement. Obviously most of those volume plugins fall into this category. They connect various storage backends (Glusterfs, Ceph RBD, NAS, Cloud Storage, etc.) to Containers’ mount namespaces so Containers can store and retrieve data from these backends (a usage sketch follows this list).
  • Data Protection. Some technologies enable data protection (backup, snapshot, and replication), a core storage function. I have yet to spot any innovation beyond the traditional data protection though.
  • Mobility. Some technologies assist containers’ mobility. Containers can relocate to new homes and find data right there. I am not so certain all these technologies work flawlessly though.
  • Provisioning. Rexray’s blog did a better job explaining this than I could.
  • Multi-tenancy. Some claim to support it, such as BlockBridge. A demo video is available (thanks to Ilya for the pointer); it appears block storage is created based on the tenant’s credentials.
  • Security (or something like that). I haven’t spotted much here; is that omission my fault?
  • Performance. Is there any innovation out there that speeds up Containers’ storage performance?
  • Isolation. Any innovation out there to keep noisy neighbors quiet?
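
To make the Enablement category concrete, here is roughly what consuming a volume plugin looks like from the Docker CLI. The driver name rexray and the volume name mydata are placeholders; the corresponding plugin daemon must already be installed and running on the host:

# create data in a named volume backed by an external volume plugin
docker run --rm --volume-driver=rexray -v mydata:/data busybox sh -c 'echo hello > /data/hello'
# a second container (possibly on another host, depending on the backend) sees the same data
docker run --rm --volume-driver=rexray -v mydata:/data busybox cat /data/hello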

In Kubernetes, volume drivers live in a similar dimension. We have achieved significant progress: we support many on-premise and cloud storage types, and the list keeps growing. We are addressing issues that container users and infrastructure administrators care about, such as provisioning, security, and multi-tenancy. These technologies can help different Container Engine deployments (Docker, rkt, hyper, etc.).

Storage Issues in Containers

(Continued from last post on Performance in Virtualized Storage)

Storage issues in Containers are somewhat different from those in hypervisors. Docker uses two kinds of drivers: storage drivers, used for container images, and volume drivers, used for so-called data volumes.

Storage drivers are responsible for translating images into a Container’s root filesystem. Docker supports device-mapper, AUFS, OverlayFS, Btrfs, and, recently, s3 storage drivers. Storage drivers usually support snapshots (though these may be emulated) and thin provisioning.
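
For instance, you can check which storage driver a daemon is using and, if you are experimenting, start it with a different one. The overlay choice below is just an example; availability depends on your kernel:

# show the storage driver currently in use
docker info | grep 'Storage Driver'
# start the daemon with a specific storage driver (example: OverlayFS)
docker daemon --storage-driver=overlay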

Not all drivers are the same. My colleague Jeremy Eder has benchmarked storage drivers extensively in his blog.

Most of the performance issues, also expressed as Problem 6 in this LWN article, are caused by (false) sharing: one container’s I/O activity is felt by others in the shared underlying storage, a.k.a. the noisy neighbor problem.

Naturally, solutions invariably concentrate on jailbreaking shared storage.

IceFS, despite what the name suggests, was originally meant for hypervisors. Nonetheless, the idea is rich enough to shed light on container storage. IceFS provides physical and namespace isolation for hypervisor (and potentially container) consumers. Such isolation improves reliability and lessens the noisy neighbor problem. I have yet to spot snapshot and thin provisioning support for possible Docker adoption, though.

SpanFS is like IceFS on isolation but more aggressive: I/O stacks are also isolated, so buffer allocation and scheduling for different containers are completely separated (locks and noise? no more!). The result is astounding: certain microbenchmarks showed it to be 10x faster than ext4.

Split-Level I/O is somewhat along that line too: I/O stacks are not only isolated but also tagged for each process/container/VM, so priority tagging and resource accounting are well under control. This corrects the priority inversion and priority ignorance caused by noisy neighbors.

Performance in Virtualized Storage

When presenting a block device to a guest VM, hypervisors (ESX and QEMU/KVM) usually emulate the block device on top of a file in the host filesystem, illustrated roughly as follows:

[Figure: guest block device emulated as an image file on the host filesystem]
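
For concreteness, a minimal QEMU/KVM invocation in this style keeps the guest’s block device as an image file on the host filesystem; the paths, sizes, and memory below are arbitrary:

# the guest's "disk" is just a qcow2 file living on the host filesystem
qemu-img create -f qcow2 /var/lib/libvirt/images/guest.qcow2 20G
# present that file to the guest as a virtio block device
qemu-system-x86_64 -enable-kvm -m 2048 \
    -drive file=/var/lib/libvirt/images/guest.qcow2,if=virtio,format=qcow2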

Architecture-wise, this is a clean approach: it separates storage from the hypervisor.

However, as pointed out in [1], for write-heavy and latency-sensitive workloads, this architecture delivers suboptimal performance (as low as half of wire speed).

There are some projects aiming for better performance:

  • VirtFS. VirtFS essentially bypasses the block storage stack in the guest VM. This project, however, appears inactive.
  • Ploop. Strictly speaking, Ploop is not designed for hypervisors. It nonetheless embodies some of the same ideas: layout awareness, page cache bypass, etc.
  • As in [2], special handling of dirty pages using a journaling device. This helps some (mostly write-heavy) workloads.
  • TCMU userspace passthrough. This could be a more general framework, though it needs a performance focus.

Similarly, object storage systems that use a file store are also concerned about performance loss in the nested filesystem layers. Ceph is exploring this direction with NewStore and seemingly favors owning block allocation, thus bypassing the file store.

References

  1. “Understanding performance implications of nested file systems in a virtualized environment”, D. Le, H. Huang, H. Wang, FAST 2012
  2. “Host-side filesystem journaling for durable shared storage”, A. Hatzieleftheriou, S. V. Anastasiadis, FAST 2015

Latest Status on Docker Mount Propagation Development

Some recent heated discussion on this topic compels me to write up the latest development.

Docker has donated libcontainer to the Open Container Project, so my Docker pull request has moved to runc as well.

Here is the build and test walkthrough.


# get my code
git clone https://github.com/rootfs/docker
cd docker
git checkout rootfs_mount
make
# replace docker with my docker binary
ln -fs `pwd`/bundles/1.8.0-dev/binary/docker-1.8.0-dev /usr/bin/docker
# clean up docker images
mv /var/lib/docker /var/lib/docker.old
# !!!! edit /usr/lib/systemd/system/docker.service and change MountFlags=slave to MountFlags=shared !!!

# reload/restart docker service
for SERVICES in docker; do systemctl stop $SERVICES; systemctl reload $SERVICES; systemctl daemon-reload; systemctl restart $SERVICES; systemctl enable $SERVICES; systemctl status $SERVICES; done
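
# (my own sanity check, optional) the root mount should now show as shared so that
# container mounts can propagate back; the PROPAGATION column needs a reasonably
# recent util-linux findmnt
findmnt -o TARGET,PROPAGATION /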

# Build a docker image that contains the nfs-utils rpm; below is my Dockerfile
# cat Dockerfile
# FROM centos
# RUN yum install -y nfs-utils

# build the image
docker build -t  centos-nfs .

# run it in privileged mode with slave (not shared) propagation using --rootmount=slave, and mount an NFS filesystem

docker run -ti --net=host --rootmount=slave --privileged -v /test:/test centos-nfs mount -t nfs nfs-server:/home/git/dev /test/mnt -o nolock,vers=3
mount |grep /test/mnt

[Empty output: the NFS mount did not propagate back to the host]

# run it in privileged mode with shared propagation using --rootmount=shared, and mount an NFS filesystem
docker run -ti --net=host --rootmount=shared --privileged -v /test:/test centos-nfs mount -t nfs nfs-server:/home/git/dev /test/mnt -o nolock,vers=3
mount |grep /test/mnt

nfs-server:/home/git/dev on /test/mnt type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.16.154.75,mountvers=3,mountport=20048,mountproto=udp,local_lock=all,addr=10.16.154.75)

RBD Volume Mount Fencing in Kubernetes

Kubernetes can use RBD images as persistent block storage for Linux containers. However, only one container can mount an RBD volume in read-write mode. If multiple containers write to the same RBD volume without higher-level coordination, data corruption will likely occur, as reported in a recent case.

One intuitive solution is to have the persistent block storage provider restrict client mounts. For instance, Google Compute Engine’s Persistent Disk allows only one read-write mount.

Another approach is fencing. An RBD image writer must hold an exclusive lock on the image while it is mounted. If the writer fails to acquire the lock, it is safe to assume the image is in use by someone else, and the writer shouldn’t attempt to mount the RBD volume. As a result, only one writer can use the image at a time, and there is no more data corruption.
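
Conceptually, the fencing boils down to the rbd advisory lock commands. Here is a rough sketch of the idea using the rbd CLI; the pool kube and image foo match the test below, and the lock id string is arbitrary:

# try to take an exclusive advisory lock before mapping and mounting the image;
# this fails if another writer already holds a lock
rbd lock add kube/foo kubelet_lock_magic_$(hostname) || exit 1
# ... map, mount, and use the image ...
# on unmount, look up the locker (e.g. client.4494 from 'rbd lock list') and release the lock
rbd lock list kube/foo
rbd lock remove kube/foo kubelet_lock_magic_$(hostname) client.4494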

This is what the RBD volume mount fencing pull request does for Kubernetes. I ran the following test and found that it fixes the mount racing problem.

I have two Fedora 21 hosts. Each loads my fix and runs as a local cluster:


# ./hack/local-up-cluster.sh

# start the rest of the kubectl routines

Then each local cluster creates a Pod using an RBD volume:


#./cluster/kubectl.sh create -f examples/rbd/rbd.json
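
For reference, the Pod definition is along these lines; the monitor address, pool, image, user, and keyring path below come from my test cluster, so yours will differ:

cat > /tmp/rbd-pod.json <<'EOF'
{
  "apiVersion": "v1",
  "kind": "Pod",
  "metadata": { "name": "rbd" },
  "spec": {
    "containers": [{
      "name": "rbd-rw",
      "image": "kubernetes/pause",
      "volumeMounts": [{ "name": "rbdpd", "mountPath": "/mnt/rbd" }]
    }],
    "volumes": [{
      "name": "rbdpd",
      "rbd": {
        "monitors": ["10.16.154.78:6789"],
        "pool": "kube",
        "image": "foo",
        "user": "admin",
        "keyring": "/etc/ceph/keyring",
        "fsType": "ext4",
        "readOnly": false
      }
    }]
  }
}
EOF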

Watch RBD image lock:


# rbd lock list foo --pool kube
There is 1 exclusive lock on this image.
Locker       ID                        Address
client.4494  kubelet_lock_magic_host   10.16.154.78:0/1026846


On both clusters, get the Pod status. I see that one cluster has a running Pod while the other cluster’s Pod stays pending.

Running Pod:


# ./cluster/kubectl.sh get pod
NAME      READY     REASON    RESTARTS   AGE
rbd       1/1       Running   0          5m


The other Pod:


# ./cluster/kubectl.sh get pod
NAME      READY     REASON                                                     RESTARTS   AGE
rbd3      0/1       Image: kubernetes/pause is ready, container is creating   0          4m

Then I delete the running Pod, and the second one immediately becomes Running.

So with this fix, Pods do get fenced off.

Share Host’s Mount Namespace with Docker Containers

This is a follow-up to my previous post about using a Super Privileged Container (SPC) to mount a remote filesystem on the host. That approach drew criticism for hacking into mount helpers.

I made a Docker patch so that the Docker daemon doesn’t isolate the host’s mount namespace from containers. Containers are thus able to see and update the host’s mount namespace. This feature is turned on through a Docker client option --hostns=true.

A running instance looks like the following:

First start a container and set --hostns=true:

#docker run --privileged --net=host --hostns=true -v /:/host -i -t centos bash

On another terminal, after the container is up and you get a bash shell, inspect the container’s mount namespace:

# pid=`ps -ef |grep docker |grep -v run|grep -v grep|awk '{print $2}'`; b=`ps -ef |grep bash|grep ${pid}|awk '{print $2}'`; cat /proc/${b}/mountinfo

And below, I spotted the following line, indicating the container and host share the same mount namespace.

313 261 253:1 / /host rw,relatime shared:1 - ext4 /dev/mapper/fedora--server_host-root rw,data=ordered
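
As an aside, a less fragile way to find the container’s bash PID, assuming you started the container with a name (here a hypothetical centos-shell), is docker inspect:

# print the PID of the container's init process (bash in this case)
docker inspect --format '{{.State.Pid}}' centos-shell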

Then, in the container’s shell, install the glusterfs-fuse package and mount a remote Glusterfs volume:

# yum install glusterfs-fuse attr -y
# mount -t glusterfs gluster_server:kube_vol /host/shared

Go back to the host terminal and check whether the host can see the Glusterfs volume:

# findmnt |grep glusterfs |tail -1
└─/shared gluster_server:kube_vol fuse.glusterfs rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072

So far so good!

A Tale of Two Virtualizations

In my previous post on Intel’s Clear Linux project, I had a few questions on how Intel got KVM to move fast enough to match containers. Basically, Clear Linux aims to make hypervisors first-class citizens in the container world.

Today I looked into another, similar technology called hyper. hyper positions itself as a hypervisor-agnostic, high-performing, and secure alternative to Docker and KVM. Love it or hate it, hyper is able to run both a hypervisor and a container in its own environment.

The architecture, as I gathered from the source, is shared with Docker: a CLI client interacts with a hyper daemon through REST. The daemon, by invoking the QEMU and Docker engines, creates/destroys/deletes either VMs or containers. hyper understands Docker images (it uses the Docker daemon API), QEMU (it directly execs QEMU commands with a well-tuned configuration), and Pods (which appear similar to Kubernetes Pods, except for the QEMU provisioning).

hyper comes with hyperstart, a replacement for init(1), aiming for fast startup. To use hyperstart, you have to bake an initrd.

With these two similar initiatives converging hypervisors and containers, I am now daydreaming of a near future in which we don’t have to make trade-offs between VMs and containers within a single framework (KVM or Docker).

Kubernetes Storage Options

Below is what’s already in, what’s queued, and what’s missing.

TYPE              FORMAT   DURATION     PROVIDER                 NOTE
EmptyDir          File     Ephemeral    Local host               Available
HostDir           File     Persistent   Local host               Available
GitRepo           File     Persistent   Git repository           Available
GCE PD            Block    Persistent   GCE                      Available
AWS EBS           Block    Persistent   AWS                      Available
NFS               File     Persistent   NFS Server               Available
iSCSI             Block    Persistent   iSCSI target provider    Available
Glusterfs         File     Persistent   Glusterfs Servers        Available
Ceph RBD          Block    Persistent   Ceph Cluster             Available
Ceph FS           File     Persistent   Ceph Cluster             Pull Request
OpenStack Cinder  Block    Persistent   Cinder                   Pull Request
CIFS              File     Persistent   CIFS Server              MISSING
LVM               Block    Persistent   Local                    MISSING

Yet Another Containerized Ceph Cluster

I have a working container that helps set up a test bed to verify my pull requests to add Ceph RBD and Ceph FS as persistent volumes for Kubernetes.

The (single-node) container is based on CentOS 6 and uses ceph-deploy to create the cluster and monitor. There were some OSD placement group (pg) problems, and Sebastien Han helped resolve them. A single run.sh is your friend to create and start the container.
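
If you want to try it, the flow is roughly the following; the image tag and container name are placeholders for whatever run.sh uses in your checkout:

# build the image from the repository's Dockerfile, then create and start the container
docker build -t ceph-single-node .
./run.sh
# once the monitor and OSDs are up, check cluster health from inside the container
docker exec -it ceph-single-node ceph -s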

And this is not the end of the story. In the coming days, we are going to experiment with deploying Ceph on Kubernetes, for real or for play. Stay tuned.