GlusterFS – Simplicity is the Key!


** GlusterFS

*GlusterFS is a powerful clustered file system that runs in user space and uses FUSE to hook into the kernel's VFS layer.

*Filesystem in Userspace (FUSE) lets non-privileged users create their own file systems without editing kernel code. Users run the file system code in user space, while the FUSE module provides only a “bridge” to the actual kernel interfaces.

* Although GlusterFS is a file system, it stores the data on tried and tested disk file systems such as ext3, ext4, and XFS.

* No separate metadata server is needed, because file locations are computed with the Elastic Hashing Algorithm.

Startup Guide

Quick Setup


Distribution depends on the order in which the bricks are listed in the volume create command. In the example below, with replica 2, g1:brick1 and g2:brick1 form one replica pair and g1:brick2 and g2:brick2 form another; files are then distributed across the two pairs.

gluster volume create gv0_vol replica 2 transport tcp g1:/data/brick1/gv0 g2:/data/brick1/gv0 g1:/data/brick2/gv0 g2:/data/brick2/gv0  force
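
The pairing rule can be sketched in a few lines of Python. This is only an illustration, assuming that every group of consecutive bricks (as many as the replica count) forms one replica set; the function name replica_sets is hypothetical and is not part of any Gluster tool.

```python
def replica_sets(bricks, replica):
    """Group bricks into replica sets in the order given on the command line."""
    if replica < 1 or len(bricks) % replica != 0:
        raise ValueError("number of bricks must be a multiple of the replica count")
    return [bricks[i:i + replica] for i in range(0, len(bricks), replica)]


# Brick list copied from the create command above.
bricks = [
    "g1:/data/brick1/gv0", "g2:/data/brick1/gv0",
    "g1:/data/brick2/gv0", "g2:/data/brick2/gv0",
]
for n, group in enumerate(replica_sets(bricks, 2)):
    print(f"replica set {n}: {group}")
```

Listing the bricks in a different order would change which bricks mirror each other, so it is worth checking that each set spans two different servers.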

** Distribution is based on file name hashing


Data migration
gluster volume replace-brick volume gluster3:/mnt/volume gluster6:/mnt/gluster start
gluster volume rebalance test-volume fix-layout start force

Geo replication
ssh-copy-id   ***.***.***.***

gluster volume create gv0_rep replica 2 transport tcp g2:/data/rep1/gv0 g2:/data/rep2/gv0


Replicated volume Vs Geo-Replication


Elastic Hashing Algorithm

*Gluster designed a system which does not separate metadata from data, and which does not rely on any separate metadata server, whether centralized or distributed.
* In the Gluster algorithmic approach, we take a given pathname/filename (which is unique in any directory tree) and run it through the hashing algorithm. Each pathname/filename produces a numerical result, and that result determines where the file is stored.
* A simpler scheme would be to store files the way a library shelves books, in alphabetical order.
* An alphabetic algorithm would never work in practice, because file names are not spread evenly across the alphabet; that is why a hash is used instead. People familiar with hash algorithms will know that hash functions are generally chosen for properties such as determinism (the same starting string will always result in the same ending hash) and uniformity (the ending results tend to be uniformly distributed mathematically).
* Storage system servers can be added or removed on-the-fly with data automatically rebalanced across the cluster
*File system configuration changes are accepted at runtime and propagated throughout the cluster allowing changes to be made dynamically as workloads fluctuate or for performance tuning.
* The number of bricks should be a multiple of the replica count for a distributed replicated volume
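
The determinism and uniformity properties described above can be seen in a small Python sketch. It is not Gluster's actual algorithm: MD5 is used here only as a stand-in hash, the 32-bit hash space is split into equal ranges, and the names name_hash and pick_subvolume are hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 32  # the hash space is divided into one range per subvolume


def name_hash(filename):
    """Deterministic 32-bit hash of a file name (MD5 is a stand-in, not Gluster's hash)."""
    digest = hashlib.md5(filename.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def pick_subvolume(filename, subvolume_count):
    """Return the index of the subvolume whose hash range contains the name."""
    range_size = HASH_SPACE // subvolume_count
    return min(name_hash(filename) // range_size, subvolume_count - 1)


# Count how 10,000 made-up file names spread over two subvolumes.
counts = [0, 0]
for i in range(10000):
    counts[pick_subvolume(f"file-{i}.dat", 2)] += 1
print(counts)
```

The same name always maps to the same subvolume, and the two counters should end up close to each other, which is what lets any client find a file without asking a metadata server.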


* NFS is traditionally difficult to scale and to make highly available; GlusterFS can provide both.

* If the file is not where the hash code calculates to, an extra lookup operation must be performed, adding slightly to latency.
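
A rough Python sketch of that fallback is shown below. It is a simplified model, assuming each subvolume is just a set of file names and that a miss is resolved by asking the other subvolumes one by one; CRC32 is a stand-in hash and the names hashed_index and locate are hypothetical.

```python
import zlib


def hashed_index(filename, count):
    """Stand-in placement hash (CRC32), not Gluster's real hash function."""
    return zlib.crc32(filename.encode("utf-8")) % count


def locate(filename, subvolumes):
    """Return (index, lookups) for the subvolume holding the file, or (None, lookups)."""
    first = hashed_index(filename, len(subvolumes))
    if filename in subvolumes[first]:
        return first, 1              # common case: found where the hash points
    lookups = 1
    for i, contents in enumerate(subvolumes):
        if i == first:
            continue
        lookups += 1                 # every extra lookup adds latency
        if filename in contents:
            return i, lookups
    return None, lookups


subvolumes = [set(), set(), set()]
# One file sits where its hash points, the other was left on a neighbouring subvolume.
subvolumes[hashed_index("report.txt", 3)].add("report.txt")
subvolumes[(hashed_index("moved.txt", 3) + 1) % 3].add("moved.txt")
print(locate("report.txt", subvolumes))
print(locate("moved.txt", subvolumes))
```

The second call needs more than one lookup, which is the latency cost mentioned above.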


Self Healing 

Previously, self-healing had to be triggered manually. There is now a self-heal daemon that runs in the background and automatically initiates self-healing every 10 minutes on any files which require healing.


gluster volume heal gv123 statistics


Split Brain

A file is said to be in split-brain when the copies of the same file on the different bricks that make up the replica pair have mismatching data and/or metadata, such that they conflict with each other and automatic healing is not possible. In this scenario, you decide which copy is correct (the source) and which one needs healing (the sink) by looking at the mismatching files.
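
The difference between a healable file and a split-brain file can be sketched in Python. This is a deliberately simplified model, assuming each brick in a two-way replica only records whether it saw changes that the other brick missed; the real bookkeeping holds more state, and the function name classify is hypothetical.

```python
def classify(a_blames_b, b_blames_a):
    """Classify a two-way replica from each brick's 'the other copy is stale' marker."""
    if a_blames_b and b_blames_a:
        return "split-brain"      # both copies claim newer changes: a human must choose
    if a_blames_b:
        return "heal A -> B"      # A is the source, B is the sink
    if b_blames_a:
        return "heal B -> A"      # B is the source, A is the sink
    return "in sync"


print(classify(True, False))
print(classify(True, True))
```

Automatic healing works as long as only one side accuses the other; once both do, there is no safe way to pick a source automatically.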

*When a client sees bricks disconnect and reconnect, a file can be modified on different bricks at different times while the other brick in the replica is offline. These situations lead to split-brain: the file becomes unusable and manual intervention is required to fix it.

* Client-side quorum is implemented to minimize split-brains.
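
The idea behind client-side quorum can be sketched in Python. The rule below follows my understanding of the documented "auto" and "fixed" quorum types (a strict majority of bricks must be up, with the first brick acting as tie-breaker when exactly half are up); treat that as an assumption to check against your Gluster version. The function name writes_allowed is hypothetical.

```python
def writes_allowed(bricks_up, quorum_count=None):
    """bricks_up: list of booleans, one per brick in the replica set, in brick order."""
    up = sum(bricks_up)
    total = len(bricks_up)
    if quorum_count is not None:      # "fixed" style: at least this many bricks up
        return up >= quorum_count
    if up * 2 > total:                # "auto" style: strict majority is up
        return True
    # Exactly half up: allowed only if the first brick is one of them.
    return up * 2 == total and bricks_up[0]


print(writes_allowed([True, True, False]))   # 2 of 3 bricks up
print(writes_allowed([False, True]))         # only the second brick of a pair is up
```

Refusing writes in the second case is what stops the two copies from being modified independently, at the cost of availability while the first brick is down.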




* /opt/iozone/bin/iozone -+m ioz.cfg -+h <hostip> -w -c -e -i 0 -+n -r 64k -s 1g -t 2

* gluster volume top aurora01 write-perf bs 64 count 1 brick cent01:/disk1/brick list-cnt 10



* To get back an accidentally deleted file, run a rebalance.

* If you are using distributed-replicate, the maximum file size is the capacity available on an individual brick.

*In versions before 3.6, bricks were treated as equal regardless of size, and each would have been assigned an equal share of files.
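
The effect of size-aware placement can be sketched in Python by dividing a 32-bit hash space either equally or in proportion to brick size. This is an illustration only: the real layout is computed by Gluster itself and differs in detail, and the brick names, sizes, and the function name layout are hypothetical.

```python
HASH_SPACE = 2 ** 32


def layout(brick_sizes, weighted=True):
    """Return {brick: (start, end)} hash ranges, with end exclusive."""
    total = sum(brick_sizes.values())
    names = list(brick_sizes)
    ranges, start = {}, 0
    for i, name in enumerate(names):
        share = brick_sizes[name] / total if weighted else 1 / len(names)
        # The last brick takes the remainder so the ranges cover the whole space.
        end = HASH_SPACE if i == len(names) - 1 else start + int(HASH_SPACE * share)
        ranges[name] = (start, end)
        start = end
    return ranges


sizes = {"small": 1, "large": 3}       # made-up capacities, e.g. in TB
print(layout(sizes, weighted=False))   # equal shares, regardless of size
print(layout(sizes, weighted=True))    # shares proportional to size
```

With equal shares the small brick receives half of the files and fills up first; with weighted shares it receives roughly a quarter of them.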
