Kubernetes Storage Options

The list below shows what's already in, what's queued, and what's missing.

Volume            Type   Lifetime    Backend                Status
EmptyDir          File   Ephemeral   Local host             Available
HostDir           File   Persistent  Local host             Available
GitRepo           File   Persistent  Git repository         Available
GCE PD            Block  Persistent  GCE                    Available
AWS EBS           Block  Persistent  AWS                    Available
NFS               File   Persistent  NFS Server             Available
iSCSI             Block  Persistent  iSCSI target provider  Available
Glusterfs         File   Persistent  Glusterfs Servers      Available
Ceph RBD          Block  Persistent  Ceph Cluster           Available
Ceph FS           File   Persistent  Ceph Cluster           Pull Request
OpenStack Cinder  Block  Persistent  Cinder                 Pull Request
CIFS              File   Persistent  CIFS Server            MISSING
LVM               Block  Persistent  Local                  MISSING

NAS Service Indeed Coming to AWS

Amazon has finally added a NAS service to AWS.

I am not surprised by the news: my previous post roughly outlined an architecture for a private offering. Amazon surely implements NAS in a different way: under the hood, they don’t need to use S3 or EBS as storage media. In fact, EFS is much like OpenStack Manila.

Now the question is what will happen to NetApp and the smaller cloud-based NAS vendors, who charge double fees: AWS usage plus NAS usage.

Use Glusterfs as Persistent Storage in Kubernetes

Following my previous post on mounting Glusterfs inside a Super Privileged Container, I am pleased to announce that Glusterfs can also be used as persistent storage for Kubernetes.

My recent Kubernetes pull request makes Glusterfs a new Kubernetes volume plugin. As explained in the example POD, there are a number of advantages to using Glusterfs.

First, mount storms can be alleviated. Since Glusterfs is a scale-out filesystem, a mount can be dispatched to any replica. This is especially helpful at scale, considering you may have containers started simultaneously on hundreds or thousands of nodes, with each host mounting the remote filesystem at the same time. Such a mount storm leads to latency or even service unavailability. In this Glusterfs volume plugin, however, mounts are balanced across different Gluster hosts, and the mount storm is thus alleviated.

Second, HA is built into this Glusterfs volume plugin. As seen in the example POD, an array of Glusterfs hosts can be provided; the kubelet picks one at random and mounts from there. If that host is unresponsive, the kubelet goes to the next one, and so on, until a mount succeeds. This mechanism thus requires no third-party solution (e.g. DNS round robin).
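The pick-one-then-fall-through behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the plugin's actual code: the function name mount_with_failover, the try_mount callback, and the host names are all hypothetical, and a real implementation would invoke the mount helper instead of a callback.

```python
import random
from typing import Callable, Optional, Sequence


def mount_with_failover(
    hosts: Sequence[str],
    try_mount: Callable[[str], bool],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Return the host a mount succeeded on, or None if every host failed.

    hosts:     candidate Glusterfs hosts (hypothetical names in the demo).
    try_mount: hypothetical callback that attempts a mount and reports success.
    """
    if not hosts:
        return None
    rng = rng or random.Random()
    # A random starting point spreads simultaneous mounts across hosts.
    start = rng.randrange(len(hosts))
    for offset in range(len(hosts)):
        host = hosts[(start + offset) % len(hosts)]
        if try_mount(host):
            return host
        # Host unresponsive: fall through to the next one in the list.
    return None


if __name__ == "__main__":
    down = {"gluster-1"}  # pretend this host is unresponsive
    chosen = mount_with_failover(
        ["gluster-1", "gluster-2", "gluster-3"],
        lambda host: host not in down,
    )
    print("mounted from", chosen)
```

Because every node draws its own starting point, a burst of simultaneous mounts is spread over the host list rather than landing on the first entry, and a dead host costs only one failed attempt per node.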

The last feature of this Glusterfs volume is support for using a Super Privileged Container to let the Kubernetes host perform the mount. This is illustrated by the helper utility in the example POD.

Vault 2015 Notes: Second Day Afternoon

The afternoon session: Ted Ts’o’s talk on the lazytime mount option. On the surface, his talk shared much with the paper I wrote years back. He emphasized tail latency over average latency (and I agree). After spending time showing where the latency irregularity comes from, he pointed out that mtime updates were problematic. Not flushing mtime sounds scary, but Ted said other information (i_size, etc.) could help you find out whether files were modified. And if i_size does not change (as with a database), the application usually doesn’t care about mtime. He added dirty flags to the inode as a hint for fdatasync (no mtime change) or fsync (mtime change). ext4 is currently the lazytime-compliant filesystem. He ended with an ftrace demo. His multithreaded random-write fio benchmark on a RAM disk showed double the bandwidth, and lockstat showed that lock contention on the journal went away. He also mentioned removing the DIO read lock after eliminating the chance of reading stale data on the write path. That dioread_nolock (?) option enabled ext4 to do DIO reads in parallel at raw Flash speed.
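To make the fdatasync/fsync distinction concrete, here is a small user-space sketch. It is not Ted's kernel change, only an illustration of the two calls an application would make; the helper names overwrite_in_place and append_and_sync are hypothetical. It assumes a POSIX system and falls back to fsync where os.fdatasync is unavailable.

```python
import os
import tempfile


def overwrite_in_place(path: str, offset: int, data: bytes) -> None:
    """Database-style write: the file size does not change."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.pwrite(fd, data, offset)
        # fdatasync flushes the data and only the metadata needed to read it
        # back; a timestamp-only change does not have to be written out.
        sync = getattr(os, "fdatasync", os.fsync)
        sync(fd)
    finally:
        os.close(fd)


def append_and_sync(path: str, data: bytes) -> None:
    """Appending write: i_size changes, so flush data and metadata."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND)
    try:
        os.write(fd, data)
        os.fsync(fd)  # flushes data plus inode metadata, timestamps included
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"0123456789")
    overwrite_in_place(tmp.name, 2, b"AB")
    append_and_sync(tmp.name, b"!")
    with open(tmp.name, "rb") as f:
        print(f.read())
    os.unlink(tmp.name)
```

The first helper matches the database case from the talk: the size stays the same, so a data-only flush is enough for the application.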

The next topic was loaded with buzzwords: Multipath, PCI-e, NVM. The speaker showed a chart pointing out that software was the last place left to reduce latency in the NVM era.

Vault 2015 Notes: Second Day Morning

Maxim’s FUSE improvement talk. The writeback cache was the first slide when I arrived. The writeback cache reduced write latency and allowed parallel writeback processing. Writes accumulated in the page cache, and kernel writeback would kick off the actual I/O. I vaguely heard “tripled”. The performance comparison showed both baseline vs. improvement (parity to ~30% better) and commodity vs. Dell EQL SAN (mixed). Future improvements included eliminating the global lock, variable message size, multi-queue, and NUMA affinity. The FUSE daemon might be able to talk to multiple queues in /dev/fuse and thus avoid contention. Oracle was said to be submitting patches to do just those things. The patches were said to improve performance quite a bit. Ben England from Red Hat asked about zero copy inside FUSE. Maxim pondered kernel bypass for a second but hesitated to come to a conclusion. Jeff Darcy asked whether FUSE API changes were needed to take advantage of these features; the answer seemed to be not much. The second and following questions were on invalidating the writeback cache while one client still held it; the answer seemed to be “it depends” (expect “stale” data). The writeback cache could be disabled, but only at the volume level.

Anand’s talk on Glusterfs and NFS Ganesha. Ganesha has become much better since I last worked on it: stackable FSALs, RDMA (libmooshika), dynamic exports. His focus was on CMAL (cluster manager abstraction layer), i.e. making active/active NFS heads possible. And you don’t need a clustered filesystem to use the CMAL framework. CMAL is able to migrate the service IP. The clustered Ganesha with Glusterfs used a VIP and Pacemaker/Corosync (could it scale?). Each Ganesha node is notified by a DBUS message to initiate migration. The active/active tricks seemed to be embedded in the NLM protocol (for v3, via SM_NOTIFY) and in STALE_CLIENTID/STALE_STATEID (for v4). Jeff Layton didn’t object to such an architecture. Anand’s next topic was pNFS with Glusterfs, File Layout of course; anonymous FDs were mentioned. This appeared to be a more economical and scalable alternative. Questions covered Ganesha vs. in-kernel NFS server performance parity and cluster scalability.

Venky’s Glusterfs compliance topic started in a low-key tone. But think about it: there are many opportunities in his framework. BitRot detection, tiering, dedupe, and compression were quickly touched on. It is easy to double that list and point to a use case. The new Glusterfs journal features a callback mechanism and supports a richer format. The “log mining” is done on individual bricks, so it could require some programming to get the volume-level picture (especially for distributed volumes). The metadata journals contain enough information that, say, if you want to run forensics utilities, they could be very helpful for plotting the data lifecycle.