Use Glusterfs as Persistent Storage in Kubernetes

Following my previous post on mounting Glusterfs inside a Super Privileged Container, I am pleased to announce that Glusterfs can also be used as persistent storage for Kubernetes.

My recent Kubernetes pull request makes Glusterfs a new Kubernetes volume plugin. As explained in the example POD, there are a number of advantages to using Glusterfs.

First, mount storms can be alleviated. Since Glusterfs is a scale-out filesystem, a mount can be dispatched to any replica. This is especially helpful at scale: containers may start simultaneously on hundreds or thousands of nodes, with every host mounting the remote filesystem at the same time. Such a mount storm leads to latency or even service unavailability. With this Glusterfs volume plugin, however, mounts are balanced across different Gluster hosts, which alleviates the storm.

Second, HA is built into this Glusterfs volume plugin. As seen in the example POD, an array of Glusterfs hosts can be provided. The kubelet picks one randomly and mounts from there. If that host is unresponsive, the kubelet goes to the next, and so on, until a mount succeeds. This mechanism thus requires no third-party solution (e.g. DNS round robin).

The last feature of this Glusterfs volume plugin is support for using a Super Privileged Container to enable the Kubernetes host to mount. This is illustrated by the helper utility in the example POD.
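
The failover behavior in the second point can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the plugin's actual code: the function name mount_with_failover and the try_mount callback are hypothetical, and wrapping around to the start of the host list after a random starting point is one possible reading of "goes to the next".

```python
import random
from typing import Callable, Optional, Sequence


def mount_with_failover(
    hosts: Sequence[str],
    try_mount: Callable[[str], bool],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Start at a random host and try each host once until a mount succeeds.

    try_mount is a caller-supplied (hypothetical) callback that attempts the
    mount against one host and returns True on success.
    Returns the host that mounted, or None if every host failed.
    """
    if not hosts:
        return None
    rng = rng or random.Random()
    # A random starting point spreads mounts across hosts, which is what
    # relieves the mount storm when many nodes mount at the same time.
    start = rng.randrange(len(hosts))
    for offset in range(len(hosts)):
        host = hosts[(start + offset) % len(hosts)]
        if try_mount(host):
            return host
    return None
```

For example, mount_with_failover(["gluster1", "gluster2"], lambda h: h != "gluster1") returns "gluster2" whichever host is tried first, because the unresponsive host is skipped.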

How to Mount Glusterfs on Docker Host?

Background

A Docker host (such as CoreOS or Red Hat Atomic Host) is usually a minimal OS without the Gluster client package. If you want to mount a Gluster filesystem, it is quite hard to do so on the host.

Solution

I worked out a solution: create a Super Privileged Container (SPC), run mount in the SPC’s namespace, but create the mount in the host’s namespace. The idea is to inject my own mount before mount(2) is called so that the namespace can be reset first; thanks to Colin for the mount patch idea. Since I don’t want to patch any existing utility, I followed Sage Weil’s suggestion and used ld.preload instead. The same idea can be applied to gluster, nfs, cephfs, and so on, once we update the switch here. The code is at my repo. The Docker image is hchen/install-glusterfs-on-fc21.

How it works

First, pull my Docker image:

# docker pull hchen/install-glusterfs-on-fc21

Then run the image in Super Privileged Container mode:

# docker run --privileged -d --net=host -e sysimage=/host -v /:/host -v /dev:/dev -v /proc:/proc -v /var:/var -v /run:/run hchen/install-glusterfs-on-fc21

Get the container’s PID:

# docker inspect --format '{{.State.Pid}}' <your_container_id>

My PID is 865. I use this process’s mount namespace to run the mount; note that /mnt is in the host’s namespace:

# nsenter --mount=/proc/865/ns/mnt mount -t glusterfs <your_gluster_brick>:<your_gluster_volume> /mnt

Now you can check on your Docker host to see the Glusterfs mount at /mnt.
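
If you script these steps, the nsenter command above can be assembled from the PID that docker inspect reports. The helper below is a minimal sketch: the function name nsenter_mount_cmd and the example server and volume names are hypothetical, and it only builds the argument list rather than running it.

```python
import shlex
from typing import List


def nsenter_mount_cmd(pid: int, server: str, volume: str, mountpoint: str) -> List[str]:
    """Build the nsenter + mount argument list used in the step above.

    pid is the SPC container's PID (from docker inspect); server and volume
    identify the Gluster volume; mountpoint is the target directory, which
    ends up in the host's namespace as described in the post.
    """
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    return [
        "nsenter", f"--mount=/proc/{pid}/ns/mnt",
        "mount", "-t", "glusterfs", f"{server}:{volume}", mountpoint,
    ]


# Hypothetical server and volume names, PID 865 as in the example above.
cmd = nsenter_mount_cmd(865, "gluster1.example.com", "myvol", "/mnt")
print(shlex.join(cmd))
```

Building the command as a list avoids shell quoting problems if you later hand it to subprocess.run.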

iSCSI as on-premise Persistent Storage for Kubernetes and Docker Container

Why iSCSI Storage?

iSCSI has been widely adopted in data centers. It is the default implementation for OpenStack Cinder. Cinder defines a common block storage interface so storage vendors can supply their own plugins to present their storage products to Nova compute. As it happens, most of the vendor-supplied plugins use iSCSI.

Containers: How to Persist Data to iSCSI Storage?

Persisting data inside a container can be done in two ways.

Container sets up iSCSI session

The iSCSI session is initiated inside the container, and iSCSI traffic goes through Docker NAT to the external iSCSI target. This approach doesn’t require the host’s support and is thus portable. However, the Container is likely to suffer from suboptimal network performance, because Docker NAT doesn’t deliver good performance, as researchers at IBM found. Since iSCSI is highly sensitive to network performance, delay or jitter will cause iSCSI connection timeouts and retries. This approach is thus not preferred for mission-critical services.

Host sets up iSCSI session

The host initiates the iSCSI session, attaches the iSCSI disk, mounts the filesystem on the disk to a local directory, and shares the filesystem with the Container. This approach doesn’t need Docker NAT and should therefore perform better than the first approach. It is implemented in the iSCSI persistent storage for Kubernetes, discussed below.
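
To make the host-side sequence concrete, here is a rough sketch of the commands a host could run, assuming the open-iscsi iscsiadm tool and the udev /dev/disk/by-path naming for iSCSI disks. The function names are hypothetical, and this is not necessarily what the Kubernetes plugin executes internally. It assumes an IPv4 address or hostname for the portal.

```python
from typing import List

DEFAULT_ISCSI_PORT = 3260


def with_port(portal: str) -> str:
    """Append the default iSCSI port when the portal has none (IPv4/hostname only)."""
    return portal if ":" in portal else f"{portal}:{DEFAULT_ISCSI_PORT}"


def host_attach_plan(
    portal: str, iqn: str, lun: int, fs_type: str, mount_dir: str, read_only: bool = False
) -> List[List[str]]:
    """Return the commands for: discover the target, log in, mount the disk."""
    portal = with_port(portal)
    # udev exposes an attached iSCSI LUN under this by-path name.
    device = f"/dev/disk/by-path/ip-{portal}-iscsi-{iqn}-lun-{lun}"
    mount_cmd = ["mount", "-t", fs_type]
    if read_only:
        mount_cmd += ["-o", "ro"]
    mount_cmd += [device, mount_dir]
    return [
        ["iscsiadm", "-m", "discovery", "-t", "sendtargets", "-p", portal],
        ["iscsiadm", "-m", "node", "-T", iqn, "-p", portal, "--login"],
        mount_cmd,
    ]
```

The mounted directory can then be shared with a container as a host volume, which is the step that removes Docker NAT from the iSCSI data path.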

What is Kubernetes?

Kubernetes is an open source Linux Container orchestrator developed by Google, Red Hat, and others. Kubernetes creates, schedules, monitors, and deletes containers across a cluster of Linux hosts. Kubernetes groups Containers into “pods”, which are declared in JSON files.

How Do Containers Persist Data in Kubernetes?

A Container running MySQL needs persistent storage so the database can survive. The persistent storage can be either on the local host or, ideally, on shared storage that every host in the cluster can access, so that when the container is migrated it can find the persisted data on the new host. Currently Kubernetes provides three storage volume types: empty_dir, host_dir, and GCE Persistent Disk.

  • empty_dir. empty_dir is not meant to be long lasting. When the pod is deleted, the data on empty_dir is lost.
  • host_dir. host_dir presents a directory on the host to the container. Container sees this directory through a local mountpoint. Steve Watts has written an excellent blog on provisioning NFS to containers by way of host_dir.
  • GCE Persistent Disk. You can also use the persistent storage service available at Google Compute Engine. Kubernetes allows containers to access data residing on GCE Persistent Disk.

iSCSI Disk: a New Persistent Storage for Kubernetes

On-premise enterprise data centers and OpenStack providers have already invested in iSCSI storage. When they deploy Kubernetes, it is logical that they want Containers to access data living on iSCSI storage. It is thus desirable for Kubernetes to support iSCSI disk based persistent volumes.

Implementation

My Kubernetes pull request provides a solution to this end. As seen in this high level architecture, when the kubelet creates the pod on the node (previously known as minion), it logs in to the iSCSI target and mounts the specified disks to the container’s volumes. Containers can then access the data on the persistent storage. Once the container is deleted and the iSCSI disks are no longer used, the kubelet logs out of the target.

A Kubernetes pod can use an iSCSI disk as persistent storage for read and write. As exhibited in this pod example, the pod declares two containers, both of which use iSCSI LUNs. Container iscsipd-ro mounts the read-only ext4 filesystem backed by iSCSI LUN 0 to /mnt/iscsipd, and Container iscsipd-rw mounts the read-write xfs filesystem backed by iSCSI LUN 1 to /mnt/iscsipd.
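
As a rough illustration of the two volumes in that pod, the sketch below builds volume declarations as JSON. The field names (targetPortal, iqn, lun, fsType, readOnly) are an assumption taken from the iscsi volume source in later upstream Kubernetes releases and may not match the schema in this branch; the helper name iscsi_volume is hypothetical. The portal and IQN values are the ones that appear in the mount output later in this post.

```python
import json
from typing import Any, Dict


def iscsi_volume(
    name: str, portal: str, iqn: str, lun: int, fs_type: str, read_only: bool
) -> Dict[str, Any]:
    """Build one volume entry. Field names are assumed, not taken from the branch."""
    if lun < 0:
        raise ValueError("lun must be non-negative")
    return {
        "name": name,
        "iscsi": {
            "targetPortal": portal,
            "iqn": iqn,
            "lun": lun,
            "fsType": fs_type,
            "readOnly": read_only,
        },
    }


PORTAL = "10.16.154.81:3260"
IQN = "iqn.2014-12.world.server:storage.target1"

volumes = [
    iscsi_volume("iscsipd-ro", PORTAL, IQN, 0, "ext4", True),   # read-only ext4 on LUN 0
    iscsi_volume("iscsipd-rw", PORTAL, IQN, 1, "xfs", False),   # read-write xfs on LUN 1
]
print(json.dumps(volumes, indent=2))
```

Check the pod example in the repository for the exact field names before reusing this fragment.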

How to Use it?

Here is how I set up Kubernetes with iSCSI persistent storage. I use Fedora 21 on the Kubernetes node. First, get my GitHub repo:

# git clone -b iscsi-pd-merge https://github.com/rootfs/kubernetes

Then build and install on the Kubernetes master and node. Install the iSCSI initiator on the node:

# yum -y install iscsi-initiator-utils

Then edit /etc/iscsi/initiatorname.iscsi and /etc/iscsi/iscsid.conf to match your iSCSI target configuration. I mostly follow these instructions to set up the iSCSI initiator and these instructions to set up the iSCSI target. Once you have installed the iSCSI initiator and the new Kubernetes, you can create a pod based on my example. In the pod JSON, you need to provide the portal (the iSCSI target’s IP address, plus the port if it is not the default 3260), the target’s iqn, the lun, the type of the filesystem that has been created on the lun, and the readOnly boolean. Once your pod JSON is ready, create the pod from the Kubernetes master:

# cluster/kubectl.sh create -f your_new_pod.json

Here is my command and output:

    # cluster/kubectl.sh create -f examples/iscsi-pd/iscsi-pd.json 
    current-context: ""
    Running: cluster/../cluster/gce/../../_output/local/bin/linux/amd64/kubectl create -f examples/iscsi-pd/iscsi-pd.json
    iscsipd
    # cluster/kubectl.sh get pods
    current-context: ""
    Running: cluster/../cluster/gce/../../_output/local/bin/linux/amd64/kubectl get pods
    POD                                    IP                  CONTAINER(S)        IMAGE(S)                 HOST                      LABELS              STATUS
    iscsipd                                172.17.0.6          iscsipd-ro          kubernetes/pause         fed-minion/10.16.154.75   <none>              Running
                                                           iscsipd-rw          kubernetes/pause                                                    

On the Kubernetes node, I got these lines in the mount output:

    # mount | grep kub
    /dev/sdb on /var/lib/kubelet/plugins/kubernetes.io/iscsi-pd/iscsi/10.16.154.81:3260/iqn.2014-12.world.server:storage.target1/lun/0 type ext4 (ro,relatime,stripe=1024,data=ordered)
    /dev/sdb on /var/lib/kubelet/pods/4ab78fdc-b927-11e4-ade6-d4bed9b39058/volumes/kubernetes.io~iscsi-pd/iscsipd-ro type ext4 (ro,relatime,stripe=1024,data=ordered)
    /dev/sdc on /var/lib/kubelet/plugins/kubernetes.io/iscsi-pd/iscsi/10.16.154.81:3260/iqn.2014-12.world.server:storage.target1/lun/1 type xfs (rw,relatime,attr2,inode64,noquota)
    /dev/sdc on /var/lib/kubelet/pods/4ab78fdc-b927-11e4-ade6-d4bed9b39058/volumes/kubernetes.io~iscsi-pd/iscsipd-rw type xfs (rw,relatime,attr2,inode64,noquota)
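
Each device shows up twice in this output: once under a plugin-wide path keyed by portal, IQN, and LUN, and once under a per-pod volume directory. The sketch below reconstructs both path shapes from the output above. The function names are hypothetical, and the layout is inferred from this output only, not from the plugin source.

```python
KUBELET_ROOT = "/var/lib/kubelet"
PLUGIN_NAME = "iscsi-pd"  # as it appears in the mount output above


def global_mount_path(portal: str, iqn: str, lun: int) -> str:
    """Plugin-wide mount point: one per (portal, IQN, LUN)."""
    return (
        f"{KUBELET_ROOT}/plugins/kubernetes.io/{PLUGIN_NAME}"
        f"/iscsi/{portal}/{iqn}/lun/{lun}"
    )


def pod_volume_path(pod_uid: str, volume_name: str) -> str:
    """Per-pod mount point that the container's volume is backed by."""
    return (
        f"{KUBELET_ROOT}/pods/{pod_uid}"
        f"/volumes/kubernetes.io~{PLUGIN_NAME}/{volume_name}"
    )


print(global_mount_path("10.16.154.81:3260", "iqn.2014-12.world.server:storage.target1", 0))
print(pod_volume_path("4ab78fdc-b927-11e4-ade6-d4bed9b39058", "iscsipd-ro"))
```

Keying the plugin-wide path on portal, IQN, and LUN gives every disk a unique location on the node that does not depend on any one pod.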

Running docker inspect, I found that the Containers mounted the host directory into their /mnt/iscsipd directory.

    # docker ps
    CONTAINER ID        IMAGE                     COMMAND                CREATED             STATUS              PORTS                    NAMES
    cc9bd22d9e9d        kubernetes/pause:latest   "/pause"               3 minutes ago       Up 3 minutes                                 k8s_iscsipd-rw.12d8f0c5_iscsipd.default.etcd_4ab78fdc-b927-11e4-ade6-d4bed9b39058_e3f49dcc                               
    a4225a2148e3        kubernetes/pause:latest   "/pause"               3 minutes ago       Up 3 minutes                                 k8s_iscsipd-ro.f3c9f0b5_iscsipd.default.etcd_4ab78fdc-b927-11e4-ade6-d4bed9b39058_3cc9946f                               
    4d926d8989b3        kubernetes/pause:latest   "/pause"               3 minutes ago       Up 3 minutes                                 k8s_POD.8149c85a_iscsipd.default.etcd_4ab78fdc-b927-11e4-ade6-d4bed9b39058_c7b55d86                                      
    # docker inspect --format '{{.Volumes}}' cc9bd22d9e9d
    map[/mnt/iscsipd:/var/lib/kubelet/pods/4ab78fdc-b927-11e4-ade6-d4bed9b39058/volumes/kubernetes.io~iscsi-pd/iscsipd-rw /dev/termination-log:/var/lib/kubelet/pods/4ab78fdc-b927-11e4-ade6-d4bed9b39058/containers/iscsipd-rw/cc9bd22d9e9db3c88a150cadfdccd86e36c463629035b48bdcfc8ec534be8615]
    # docker inspect --format '{{.Volumes}}' a4225a2148e3
    map[/dev/termination-log:/var/lib/kubelet/pods/4ab78fdc-b927-11e4-ade6-d4bed9b39058/containers/iscsipd-ro/a4225a2148e38afc1a50a540ea9fe2e747886f1011ac5b3be4badee938f2fc5f /mnt/iscsipd:/var/lib/kubelet/pods/4ab78fdc-b927-11e4-ade6-d4bed9b39058/volumes/kubernetes.io~iscsi-pd/iscsipd-ro]
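
To check these mappings in a script rather than by eye, the map[...] text printed by docker inspect can be split into container-path to host-path pairs. This is a minimal sketch with a hypothetical function name; it assumes, as in the output above, that the paths contain no spaces and that the container path contains no colon.

```python
from typing import Dict


def parse_volumes_map(text: str) -> Dict[str, str]:
    """Parse 'map[/container/path:/host/path ...]' into a dict."""
    text = text.strip()
    if not (text.startswith("map[") and text.endswith("]")):
        raise ValueError("expected text of the form map[...]")
    volumes: Dict[str, str] = {}
    for entry in text[len("map["):-1].split():
        # Split on the first colon: left is the container path, right the host path.
        container_path, sep, host_path = entry.partition(":")
        if not sep:
            raise ValueError(f"malformed entry: {entry!r}")
        volumes[container_path] = host_path
    return volumes


sample = (
    "map[/mnt/iscsipd:/var/lib/kubelet/pods/4ab78fdc-b927-11e4-ade6-d4bed9b39058"
    "/volumes/kubernetes.io~iscsi-pd/iscsipd-rw]"
)
print(parse_volumes_map(sample)["/mnt/iscsipd"])
```

The host path printed for /mnt/iscsipd is the per-pod volume directory that also appears in the mount output.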