Use Glusterfs as Persistent Storage in Kubernetes

Following my previous post on mounting Glusterfs inside a Super Privileged Container, I am pleased to announce that Glusterfs can also be used as persistent storage for Kubernetes.

My recent Kubernetes pull request makes Glusterfs a new Kubernetes volume plugin. As explained in the example POD, there are a number of advantages to using Glusterfs.

First, mount storms can be alleviated. Since Glusterfs is a scale-out filesystem, a mount can be dispatched to any replica. This helps with scaling, considering that containers may start simultaneously on hundreds or thousands of nodes, with each host mounting the remote filesystem at the same time. Such a mount storm leads to latency or even service unavailability. With this Glusterfs volume plugin, however, mounts are balanced across different Gluster hosts, and the mount storm is thus alleviated.

Second, HA is built into this Glusterfs volume plugin. As seen in the example POD, an array of Glusterfs hosts can be provided; the kubelet picks one at random and mounts from there. If that host is unresponsive, the kubelet goes to the next one, and so on, until a mount succeeds. This mechanism thus requires no third-party solution (e.g. DNS round robin).
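The pick-one-then-fall-through behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the plugin's actual code: the function name `mount_with_failover`, the `try_mount` callback, and the host names are all made up for the example.

```python
import random
from typing import Callable, Optional, Sequence


def mount_with_failover(
    hosts: Sequence[str],
    try_mount: Callable[[str], bool],
    rng: Optional[random.Random] = None,
) -> str:
    """Return the first host that mounts successfully, or raise.

    `try_mount` is a hypothetical callback that attempts the real mount
    and returns True on success.
    """
    if not hosts:
        raise ValueError("no Glusterfs hosts given")
    rng = rng or random.Random()
    # Start at a random host so simultaneous mounts spread over the cluster.
    start = rng.randrange(len(hosts))
    for i in range(len(hosts)):
        # Walk the list from the random start, wrapping around at the end.
        host = hosts[(start + i) % len(hosts)]
        if try_mount(host):
            return host
    raise RuntimeError("all Glusterfs hosts failed to mount")


if __name__ == "__main__":
    down = {"gluster-1"}  # pretend this host is unresponsive
    chosen = mount_with_failover(
        ["gluster-1", "gluster-2", "gluster-3"],
        lambda host: host not in down,
    )
    print("mounted from", chosen)
```

The random starting point is what spreads the load during a mount storm; the loop over the remaining hosts is what gives HA without DNS round robin.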

The last feature of this Glusterfs volume plugin is support for using a Super Privileged Container to enable the Kubernetes host to mount. This is illustrated by the helper utility in the example POD.

Vault 2015 Notes: Second Day Afternoon

The afternoon session started with Ted Ts’o’s talk on the lazytime mount option. On the surface, his talk shared much with the paper I wrote years back. He emphasized tail latency over average latency (and I agree). After spending time showing where the latency irregularity came from, he pointed out that mtime updates were problematic. Not flushing mtime sounded scary, but Ted said other information (i_size, etc.) could help you find out whether files were modified. And if i_size did not change (as with databases), the application usually didn’t care about mtime. He added dirty flags to the inode as hints for fdatasync (no mtime change) or fsync (mtime change). ext4 is currently the new lazytime-compliant filesystem. He ended with an ftrace demo. His multithreaded random-write fio benchmark on a RAM disk showed doubled bandwidth, and lockstat showed that lock contention on the journal went away. He also mentioned removing the DIO read lock after eliminating the chance of reading stale data on the write path. That dioread_nolock (?) option enabled ext4 to do DIO reads in parallel at raw Flash speed.
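To make the fdatasync vs. fsync distinction concrete from the application side, here is a small Python sketch of the database-like case: overwrite bytes in place (so the file size does not change) and then flush. It says nothing about how the lazytime patches work inside the kernel; the helper name `overwrite_in_place` is my own, and `os.fdatasync` is assumed to be available (it is on Linux).

```python
import os
import tempfile


def overwrite_in_place(path: str, offset: int, data: bytes,
                       data_only: bool = True) -> None:
    """Overwrite bytes at `offset` without growing the file, then flush."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.pwrite(fd, data, offset)
        if data_only:
            # Flush the data plus only the metadata needed to read it back;
            # a timestamp-only change does not have to hit the disk now.
            os.fdatasync(fd)
        else:
            # Flush data and all inode metadata, timestamps included.
            os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"0123456789")
        path = f.name
    overwrite_in_place(path, 2, b"AB")                   # fdatasync path
    overwrite_in_place(path, 6, b"CD", data_only=False)  # fsync path
    with open(path, "rb") as f:
        print(f.read())
    os.unlink(path)
```

An application that never looks at mtime, as in the database example from the talk, can stay on the fdatasync path.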

The next topic was loaded with buzzwords: Multipath, PCI-e, NVM. The speaker showed a chart suggesting that software was the last remaining place to reduce latency in the NVM era.

Vault 2015 Notes: Second Day Morning

Maxim’s FUSE improvement talk. The writeback cache was on the first slide I saw when I arrived. The writeback cache reduced write latency and enabled parallel writeback processing. It accumulated dirty pages in the page cache, and kernel writeback would kick off the actual I/O. I vaguely heard “tripled”. The performance comparison showed both baseline vs. improved (~30% better) and commodity vs. Dell EQL SAN (mixed). Future improvements included eliminating the global lock, variable message size, multi-queue, and NUMA affinity. The FUSE daemon might be able to talk to multiple queues in /dev/fuse and thus avoid contention. Oracle was said to be submitting patches to do just those things. The patches were said to improve performance quite a bit. Ben England from Red Hat asked about zero copy inside FUSE. Maxim pondered kernel bypass for a second but hesitated to come to a conclusion. Jeff Darcy asked whether FUSE API changes were needed to take advantage of these features; the answer seemed to be “not much”. The second and following questions were on invalidating the writeback cache while one client still held it; the answer seemed to be “it depends” (expect “stale” data). The writeback cache could be disabled, but only at the volume level.

Anand’s talk on Glusterfs and NFS Ganesha. Ganesha has become much better since I last worked on it: stackable FSALs, RDMA (libmooshika), dynamic exports. His focus was on CMAL (cluster manager abstraction layer), i.e. making active/active NFS heads possible. And you don’t need a clustered filesystem to use the CMAL framework. CMAL is able to migrate service IPs. The clustered Ganesha with Glusterfs used VIPs and Pacemaker/Corosync (could it scale?). Each Ganesha node is notified by a DBUS message to initiate migration. The active/active tricks seemed to be embedded in the protocols: NLM (for v3, via SM_NOTIFY) and STALE_CLIENTID/STALE_STATEID (for v4). Jeff Layton didn’t object to such an architecture. Anand’s next topic was pNFS with Glusterfs: File Layout of course, and anonymous FDs were mentioned. This appeared to be a more economical and scalable alternative. There were questions on Ganesha vs. in-kernel NFS server performance parity and on cluster scalability.

Venky’s Glusterfs compliance topic started in a low-key tone. But think about it: there are many opportunities in his framework. BitRot detection, tiering, dedupe, and compression were quickly covered. It is easy to double that list and point to a use case. The new Glusterfs journal features a callback mechanism and supports a richer format. The “log mining” is done on individual bricks, so it could require some programming to get the volume-level picture (especially for distributed volumes). The metadata journals contain enough information that, if you want to run forensics utilities, they could be very helpful for plotting the data lifecycle.

Vault 2015 Notes: First Day Afternoon

Afternoon talks were brain tests. There were many good and interesting topics. I started with Sage’s librados talk. In addition to RGW, RBD, and CephFS, Ceph’s librados is also open to developers/users. Sage’s talk was to promote librados to app developers. In fact, it is the building block for RBD, RGW, and CephFS. He started with simple Hello World type snippets, then moved on to more complicated atomic compound and conditional models, followed by models on K/V values (random access, structured data). The new RADOS methods run inside the I/O path (as a .so file) on a per-object basis. This is very interesting: you can implement plugins that add value to your data, e.g. checksum, archive, replication, encryption, etc. The watch/notify mechanism was extensively reviewed; cache invalidation could be implemented on top of it. He mentioned dynamic objects in Lua from Noah Watkins, which used a Lua client wrapper for librados and made programming RADOS classes easy; VAULTAIRE (preserving all data points, no MRTG; a data vault for metrics); ZLOG – CORFU (a high-performance distributed shared log for flash ???); radosfs (hey, not my RadosFS); glados (a Gluster FS xlator on RADOS); iRODS; Synnefo; a Dropbox-like app; and libradosstriper. He concluded the talk with a list of others in the CAP space: Gluster, Swift, Riak, Cassandra.

The next talk was on NFSv4.2 and beyond. It was interesting to see the NFSv4 timeline: 12 years from the creation of the working group to production. But labeled NFS was much accelerated. Security labels went into RHEL 7, supporting SELinux enforced by the server. Sparse file support was in kernel 3.18 but not in RHEL; it reduced network traffic by not sending holes, which is good for virtualization. Space reservation (fallocate) was in 3.19, not in RHEL yet. Server-side copy (no glibc support yet?) and IO hints (IO_ADVISE) were also covered. If you have an idea, supply a patch and an RFC.
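For readers who have not played with sparse files and space reservation, here is what the two features look like on a local file from Python. This is only a sketch of the file-level semantics, under the assumption of a Linux-like system where `os.posix_fallocate` is available; it does not exercise NFSv4.2 itself, and the helper names `make_sparse` and `reserve` are mine.

```python
import os
import tempfile


def make_sparse(path: str, size: int) -> None:
    """Create a file of `size` bytes that is one big hole (no data written)."""
    with open(path, "wb") as f:
        f.truncate(size)


def reserve(path: str, size: int) -> None:
    """Ask the filesystem to allocate `size` bytes up front."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        sparse = os.path.join(d, "sparse.img")
        reserved = os.path.join(d, "reserved.img")
        make_sparse(sparse, 1 << 20)
        reserve(reserved, 1 << 20)
        for p in (sparse, reserved):
            st = os.stat(p)
            # st_blocks counts 512-byte units actually allocated; how the
            # two files compare depends on the underlying filesystem.
            print(p, "size:", st.st_size, "allocated:", st.st_blocks * 512)
```

Both files report the same size; the difference is in how many blocks the filesystem has actually set aside, which is exactly the information a server can use to avoid sending holes over the wire.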

The last talk on Ceph today (4 in a row!) was from SanDisk. The 512TB InfiniFlash was mentioned. He explained a collection of patches to the Ceph OSD that made an all-flash OSD perform 6~7x better on reads. The code is in Hammer. He said TCMalloc increased contention in the sharded thread pool; this did not happen with jemalloc. My poor eyesight spotted ~350K read IOPS at queue depth 100, and they were said to saturate the box (which was 780K IOPS and 7Gb/s).

Also, during the breakout, I peeked into Facebook’s storage box: a 30-bay 1U server, fan-only cooling (and still able to run without A/C!), no visible vibration reducer.

Vault 2015 Notes: First Day Morning

After surviving the morning commute, I found myself 10 minutes late for the first talk.

The first talk was a joint topic on different aspects of current and future storage systems: Persistent Memory, Multiqueue (a new IO scheduler was mentioned), SMR, the SCSI queue tree (better maintenance), the LIO/SCST merger, iSCSI performance (reconciling multiqueue and multi-connection conflicts by proposing a new IETF iSCSI extension for Linux), and kernel rescan.

The second topic, from SanDisk, was about Data Center architectures. It came as a revelation to me that Data Centers were being consolidated into different resource pools and scaling granularities. It made me think of the recent industry consolidation: Avago’s big acquisitions made it relevant as a fabrics provider, SanDisk’s ascent into Enterprise storage was also a leapfrog, and multiple storage vendors had acquired some sort of data management outfit (Pentaho/HDS for instance). The talk reviewed heterogeneous replication (one copy on SSD, more on HDD) and erasure coding on Flash. SanDisk’s contributions/patches to Ceph and NoSQL improved performance severalfold, further reducing the price/performance gap.

The next session, on Btrfs, was interesting, though I missed most of it due to limited seating in the room. I vaguely remember Chris being excited about CRC verification, improved scrub code, upcoming inline dedup, sub-volume quotas, new tests that made critical issues consistently reproducible, less write amplification using RocksDB, etc. I also had a good time learning how Facebook used and improved Glusterfs.

The pNFS talk was mostly about the basics, but Christopher did catch my attention when he mentioned using SCSI-3 reservations for fencing during error handling, and mentioned the projects/products I worked on before. He then went on to explain how his new pNFS server was structured and coded. The server used XFS and heavily reused the existing code base (like direct IO, no separate layout modules, etc). The performance was said to scale linearly. And yes, he did mention the omission of small files through the pNFS protocol. The source code is in kernel 4.0.

NAS Server in the Cloud?

What does it mean, really?

Cloud evangelists forecast that the future of the data center is in the Cloud, yet I am not convinced that this leads to the demise of storage servers. I believe storage vendors will find a new home for their products: the Cloud.

In fact, NetApp, a storage vendor that sells NAS boxes, has already transformed itself into an AWS server image provider. The server image provides the same functions as the NAS boxes do.

Why do people still need storage servers, even in the Cloud?


Cloud storage services like S3 and Swift are object stores, while most enterprise applications still work with file- and block-based storage. Shifting a data center into the Cloud must first deal with these API-level differences.


Cloud storage technologies may vary from one vendor to another, in the absence of industry-wide protocols. This poses a migration risk for Cloud hoppers. In contrast, NFS/CIFS/iSCSI/FC protocols are found in all storage servers; as long as in-cloud storage servers exist, such migration risk is much diminished.


It is undeniable that storage vendors like EMC, NetApp, HP, Hitachi, and IBM pride themselves on technologies (and patents) that Cloud storage doesn’t yet have. Their value proposition won’t evaporate any time soon.

What does it look like?

My very rough component level comparison is illustrated here.


Tachyon 0.6.0 Coming to Apache Bigtop

BIGTOP-1722 is now resolved. Tachyon 0.6.0 is now released in Apache Bigtop.

Tachyon has released the long-anticipated 0.6.0 version. This release is loaded with many new features, including hierarchical storage layers, Vagrant deployment (AWS EC2, OpenStack, Docker, VirtualBox), a Netty server, and many bug fixes (including using Glusterfs as the under filesystem).

I will write a few tutorials on deploying Tachyon 0.6.0 and running MapReduce tasks after I come back from Vault and Spark Summit East.