Docker CaaS

Containers-as-a-Service is a model where IT organizations and developers can work together to build, ship and run their applications anywhere.

Traditionally there were two ways to build, test, package and run applications: bare-metal deployment and virtual deployment.

Bare-metal deployment: difficult to migrate, clone, scale and back up, but offers better performance and reliability.

Virtual deployment: easy to migrate, clone, scale and back up, but has issues with portability, performance and management.

Ready-to-use cloud services from various vendors, in the form of IaaS, PaaS and SaaS, simplified deploying, using and selling applications, but gaps remained.

IaaS provides API access to compute, storage and network resources, and to the configuration needed to automate datacenter infrastructure, e.g. AWS (EC2), Google Cloud Platform, Azure, Joyent.

PaaS providers build on IaaS. A PaaS provides a self-service portal for managing computing infrastructure and allows developers to develop, test and deploy applications, which increases developer productivity, e.g. Google App Engine, Heroku, OpenShift, Salesforce.

Below are some of the benefits of PaaS to application developers:

  • They don’t have to invest in physical infrastructure
  • Makes development possible for ‘non-experts’; with some PaaS offerings anyone can develop an application. PaaS provides the OS, server software, database, storage and network tools needed to design, develop and host it.
  • Flexibility; customers can have control over the tools that are installed within their platforms and can create a platform that suits their specific requirements.
  • Adaptability; Features can be changed if circumstances dictate that they should.
  • Teams in various locations can work together; as an internet connection and web browser are all that is required, developers spread across several locations can work together on the same application.

SaaS uses the web to deliver applications that are managed by a third-party vendor and whose interface is accessed on the client side, e.g. Google Apps, Cisco WebEx etc. SaaS replaces traditional on-device software.

[Figure: cloud computing layers]

Docker CaaS (Containers-as-a-Service) allows any Docker container to run on the platform. It fills the void between IaaS (Infrastructure-as-a-Service), which requires a lot more system administration and configuration, and PaaS (Platform-as-a-Service), which is typically very limiting in terms of language support and libraries.

Containers are here to transform how we build, test, ship and run applications securely on any infrastructure.

Containers as a service (CaaS) is a paid offering from cloud providers that includes compute resources, a container engine, and container orchestration tools.

Developers can use the framework, via API or a web interface, to facilitate and manage containers and application deployment.

There are two keywords in the container world:
– Container Orchestration
– Container as a Service

There are also many overlapping projects on the market that cover both.
e.g. For Container Orchestration we can use:
Amazon ECS
Docker Swarm
Apache Mesos
Azure Container Service (ACS supports two orchestration engines – Docker Swarm and Mesosphere DCOS)

For CaaS we can use the projects below:
Amazon ECS (Docker or Kubernetes)
Google Container Engine (Kubernetes based)
Docker Universal Control Plane (Docker swarm/compose based)
CoreOS Tectonic (Rocket based)
Project Magnum (OpenStack)
Joyent’s Triton (Zones based)
Rackspace’s Carina (Docker swarm Based)
ContainerX (CaaS startup acquired by Cisco)
StackEngine (acquired by Oracle)

Docker UCP
Enterprise-grade on-premises service for managing and deploying Dockerized distributed applications in any on-premises or virtual cloud environment. Its built-in security features, like LDAP/AD integration and role-based access control (RBAC), allow IT teams to comply with industry security regulations.

• GUI management for apps, containers, nodes, networks, images and volumes / built in Docker Compose
• Out of the box High Availability
• LDAP/AD Integration
• Role based access control for teams & orgs
• SSO and push/pull of images from DTR (Docker Trusted Registry) directly from within UCP
• Out of the box TLS
• Docker native stack with Swarm, Compose, CS engine and DTR
• Monitoring and logging of UCP users & events

GKE (Google Container Engine)
• Users can interact with Google Container Engine using the gcloud command line interface or the Google Cloud Platform Console.
• A Container Engine cluster is a group of Compute Engine instances running Kubernetes.
• Google Cloud includes Google Cloud Platform (GKE + Kubernetes + GCR (Google Container Registry) + Google Cloud Shell ($ gcloud)) + GSuite (Gmail, Maps, machine learning tools, Android APIs etc.)
• Google Container Engine users organize one or more containers into pods that represent logical groups of related containers. Similarly, network proxies, bridges and adapters might be organized into the same pod.
• Google Container Engine includes a replication controller that allows users to run their desired number of pod duplicates at any given time.


Kubernetes Truly HA Cluster

Kubernetes Concepts 

  • SkyDNS is the DNS add-on that resolves service names to service IPs.
  • Jobs (kind:Job) are complementary to Replication Controllers. A Replication Controller manages pods which are not expected to terminate (e.g. web servers), and a Job manages pods that are expected to terminate (e.g. batch jobs). A Job can also be used to run multiple pods in parallel and one can control the parallelism.
  • Endpoints are simply a collection of pod_ip:port pairs.
  • Port: the abstracted Service port. A Service is backed by a group of pods, which are exposed through endpoints.
  • TargetPort: the port on which the container accepts traffic.
  • NodePort: when a new service of type NodePort is created in the cluster, kube-proxy opens a port on all the nodes (the NodePort). Connections to that port are proxied to the pods using selectors and labels.
  • Services are a “layer 4” (TCP/UDP over IP) construct. In Kubernetes v1.1 the Ingress API was added (beta) to represent “layer 7” (HTTP) services.
  • A service defines a set of pods and a means by which to access them, such as a single stable IP address (cluster IP or VIP) and a corresponding DNS name.
  • A replication controller ensures that a specified number of pod replicas are running at any one time. It provides both scaling and failover. Pods managed this way can reach each other inside the cluster.
  • A selector is an expression that matches labels in order to identify related resources, such as which pods are targeted by a load-balanced service.
  • A label is a key/value pair that is attached to a resource (e.g. pod).
  • A pod is a co-located group of containers and volumes.
  • A namespace defines a scope for resources, resource policies and resource constraints/limits (CPU, memory etc.).
  • By default Kubernetes creates a Deployment (the newer replacement for the RC) for pods if no RC is defined. A Deployment supports rollback to a previous deployment, which was missing in the RC.
  • Kube-proxy is responsible for implementing a form of virtual IP (clusterIP). In Kubernetes v1.0 the proxy was purely in userspace. In Kubernetes v1.1 an iptables proxy was also added.
    • Proxy-mode userspace: kube-proxy watches the Kubernetes master for the addition and removal of Service and Endpoints objects. For each Service it opens a port (randomly chosen) on the local node. Any connections to this “proxy port” are proxied to one of the Service’s backend Pods (as reported in Endpoints).
    • Proxy-mode iptables: kube-proxy watches the Kubernetes master for the addition and removal of Service and Endpoints objects. For each Service it installs iptables rules which capture traffic to the Service’s clusterIP (which is virtual) and Port and redirect that traffic to one of the Service’s backend sets. For each Endpoints object it installs iptables rules which select a backend Pod.
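The relationship between labels, selectors, Services and Endpoints described above can be sketched in a few lines of Python. This is only a toy model, not how Kubernetes is implemented; the pod names, IPs and port numbers are made up for illustration.

```python
# Toy model of how a Service selects pods by label and exposes endpoints.
# All names, IPs and ports below are hypothetical.

pods = [
    {"name": "web-1", "ip": "10.244.1.5", "labels": {"app": "web", "tier": "frontend"}},
    {"name": "web-2", "ip": "10.244.2.7", "labels": {"app": "web", "tier": "frontend"}},
    {"name": "db-1", "ip": "10.244.1.9", "labels": {"app": "db"}},
]

service = {
    "name": "web",
    "selector": {"app": "web"},
    "port": 80,           # abstracted Service port
    "target_port": 8080,  # port the container accepts traffic on
    "node_port": 30080,   # port opened on every node
}


def matches(selector, labels):
    """A pod matches when every selector key/value pair appears in its labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def endpoints(service, pods):
    """Endpoints are the pod_ip:port pairs that back the Service."""
    return [
        "{}:{}".format(pod["ip"], service["target_port"])
        for pod in pods
        if matches(service["selector"], pod["labels"])
    ]


print(endpoints(service, pods))
```

Traffic arriving on the Service port (or on the NodePort of any node) would be forwarded to one of the listed endpoints; the db pod is left out because its labels do not match the selector.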


  • Security in Kubernetes applies to 4 types of consumers (3 infrastructure consumer types and 1 service consumer type):
    • When a human accesses the cluster (e.g. using kubectl), they are authenticated by the apiserver as a particular User Account.
    • All infrastructure components (kubelets, kube-proxies, controllers, scheduler) should have an infrastructure user that they can authenticate with and be authorized to perform only the functions they require against the apiserver.
    • Processes in containers inside pods can also contact the apiserver. When they do, they are authenticated as a particular Service Account. This covers inter-container and container-to-apiserver communication.
    • When a consumer outside the cluster contacts a service through kube-proxy, it is authenticated against the Service Account by the service itself.
  • The apiserver is responsible for performing authentication and authorization for users of the Kubernetes infrastructure, e.g. kubectl.
  • Kubelet handles locating and authenticating to the apiserver
  • A secret stores sensitive data, such as authentication tokens/certificates, which can be made available to containers/application upon request.
  • Namespace is a mechanism to partition resources created by users into a logically named group.
  • A security context is a set of constraints applied to a container/pod in order to achieve the following goals:
    • Ensure a clear isolation between the container and the underlying host it runs on, using the user namespaces feature of Docker.
    • Limit the ability of the container to negatively impact the infrastructure or other containers, using Docker features such as the ability to add or remove Linux capabilities.

Security Implementation :

  • Create a secure image registry server.
  • Run the apiserver with HTTPS and ABAC authorization.
  • Configure kubelet and kube-proxy to contact the apiserver on its HTTPS port.
  • kube-proxy maintains iptables routing from the clusterIP (VIP) to the NodePort. We can define iptables firewall rules (e.g. allowed sources) to avoid insecure access.
  • A pod runs in a security context under a service account that is defined by an administrator, and the secrets a pod has access to are limited by that service account.
  • For infrastructure users, security would be implemented as below to secure apiserver access:
    • Create namespace -> Set cluster name and override cluster-level properties for this namespace -> Set credentials for the cluster and user in the namespace -> Create a security context for the “Cluster+Namespace+User” combination
  • For service consumers:
    • Create service account -> secure it with a secret -> Create service under the service account -> Create pods belonging to the service
    • Define iptables rules for service access
  • Create the certificates below in /srv/kubernetes/
    • First a CA is created, the result is a cert/key pair (ca.crt/ca.key). You can use easyrsa to generate your PKI or OpenSSL
    • Then a certificate is requested and signed using this CA (server.cert/server.key), it will be used
      • by the api server to enable HTTPS and verify service account tokens
      • by the controller manager to sign service account tokens, so that pods can authenticate against the API using these tokens
    • Another certificate is requested and signed (kubecfg.crt/kubecfg.key) using the same CA, you can use it to authenticate your clients

Kubernetes HA Cluster



  • flannel is used because we want an overlay network. Alternatives to flannel are Open vSwitch or any other SDN tool.
  • While configuring cluster/ubuntu/ we should make sure that the private IP ranges do not conflict with the private IPs of the datacenter. We can use any of these ranges – (10/8 prefix) – (172.16/12 prefix) – (192.168/16 prefix)
  • As of Kubernetes 1.3, DNS is a built-in service (based on SkyDNS) launched automatically by the addon manager as a “cluster add-on” (/etc/kubernetes/addons). DNS is used to resolve hostnames into machine IPs.
  • Etcd cluster: etcd provides both a TTL on objects and a compare-and-swap operation, which together can implement an election algorithm. Kubernetes uses both features for master election and HA.
  • Unelected instances can watch “/election” (or some other well-known key) and, if it is empty, become elected by writing their ID to it. The written value is given a TTL that removes it after a set interval, and the elected instance must rewrite it periodically to remain elected. Because etcd’s compare-and-swap operation is atomic, there is no risk of a clash between two instances going undetected.
  • Podmaster:
    • Podmaster’s job is to implement a master election protocol using etcd “compare and swap”. If the apiserver node wins the election, it starts the master component it is managing (e.g. the scheduler); if it loses the election, it ensures that any master components running on the node (e.g. the scheduler) are stopped.
    • Podmaster is a small utility written in Go that uses etcd’s atomic “CompareAndSwap” functionality to implement master election. The first master to reach the etcd cluster wins the race and becomes the master node, marking itself with an expiring key that it periodically extends. If another instance finds the key has expired, it attempts to take over using an atomic request. If an instance is the current master, it copies the scheduler and controller-manager manifests into the kubelet directory, and if it isn’t, it removes them. As all it does is copy files, it could be used for anything that requires leader election, not just Kubernetes!
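The election scheme above (a key with a TTL plus an atomic compare-and-swap) can be illustrated with a small Python sketch. `FakeEtcd` is a hypothetical in-memory stand-in written for this example, not the etcd client API, and time is passed in explicitly so the behaviour is easy to follow.

```python
class FakeEtcd:
    """In-memory stand-in for etcd with TTL keys and compare-and-swap."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry time)

    def get(self, key, now):
        value, expires = self._data.get(key, (None, 0.0))
        # An expired key behaves as if it were empty.
        return value if now < expires else None

    def compare_and_swap(self, key, expected, new, ttl, now):
        """Write `new` only if the current value equals `expected`."""
        if self.get(key, now) != expected:
            return False
        self._data[key] = (new, now + ttl)
        return True


def try_lead(store, node_id, ttl, now, key="/election"):
    """Return True if node_id holds the lease after this attempt."""
    current = store.get(key, now)
    if current is None:
        # Key is empty or expired: try to take over atomically.
        return store.compare_and_swap(key, None, node_id, ttl, now)
    if current == node_id:
        # We are the master: extend our own lease.
        return store.compare_and_swap(key, node_id, node_id, ttl, now)
    return False


store = FakeEtcd()
print(try_lead(store, "master-a", ttl=10, now=0))   # master-a takes the empty key
print(try_lead(store, "master-b", ttl=10, now=5))   # lease is still held by master-a
print(try_lead(store, "master-b", ttl=10, now=11))  # lease expired, master-b takes over
```

In a podmaster-like setup, the instance for which `try_lead` returns True would copy the scheduler and controller-manager manifests into the kubelet directory, and every other instance would remove them.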



  • Docker failover using monit
  • Kubelet failover using monit
  • Kube Master Process (apiserver, scheduler and controller) failover using kubelet
  • Kube Worker Process (Kube-proxy) failover using monit
  • Master Node Failover using podmaster and Loadbalancer
  • Etcd failover using etcd cluster

The easiest way to implement an HA Kubernetes cluster is to start with an existing single-master cluster. The instructions at describe easy installation for single-master clusters on a variety of platforms.

Now start using guide below



ZFS – Dedup/Compression is the core


1) RAID-Z1 is similar to RAID 5 (allows one disk to fail), RAID-Z2 is similar to RAID 6 (allows two disks to fail) and RAID-Z3 (allows three disks to fail). The need for RAID-Z3 arose recently because RAID configurations with future disks (say 6–10 TB) may take a long time to repair, the worst case being weeks.

2) ZFS has no equivalent of the fsck repair tool common on Unix filesystems. Instead, ZFS has a repair tool called “scrub”.

3) ZFS – data is compressed first, then deduplicated.

4) Logical Data (original size of data without compression or dedup)
The amount of space logically consumed by a filesystem. This does not factor in compression, and can be viewed as the theoretical upper bound on the amount of space consumed by the filesystem. Copying the filesystem to another appliance using a different compression algorithm will not consume more than this amount. This statistic is not explicitly exported and can generally only be computed by taking the amount of physical space consumed and multiplying by the current compression ratio.
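As a quick worked example of that last calculation, the sketch below multiplies physical usage by the compression ratio. The numbers are made up; on a real system they would come from the `used` and `compressratio` properties of the dataset.

```python
def logical_bytes(physical_bytes, compress_ratio):
    """Estimate logical size from physical usage and the compression ratio."""
    return physical_bytes * compress_ratio


GIB = 1024 ** 3
physical = 40 * GIB  # hypothetical: 40 GiB consumed on disk
ratio = 1.75         # hypothetical: compression ratio of 1.75x

print(logical_bytes(physical, ratio) / GIB)  # 70.0 (GiB of logical data)
```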

*Installation and other tips


2) modinfo zfs

3) zpool add zpool-2 raidz /dev/sdc1 /dev/sdc2 /dev/sdc3

4) RAID-Z configurations with single-digit groupings of disks should perform better.

5) zpool replace will copy all of the data from the old disk to the new one. After this operation completes,  the old disk is disconnected from the vdev.

6) Although additional vdevs can be added to a pool, the layout of the pool cannot be changed.

7) ZFS deduplication is in-band, which means deduplication occurs when you write data to disk and impacts both CPU and memory resources. Deduplication tables (DDTs) consume memory and eventually spill over and consume disk space. At that point, ZFS has to perform extra read and write operations for every block of data on which deduplication is attempted. This causes a reduction in performance.

8) zdb -bb zpool-1 | grep -i 'file\|directory\|LSIZE' | grep -v DSL | grep -v object

9) zpool list or df -k

10) zdb -dd zpool-1 | grep plain

* Gluster with ZFS 




Glusterfs – Simplicity is the Key !!


** GlusterFS

*GlusterFS is a powerful cluster filesystem written in user space which uses FUSE to hook itself with VFS layer.

*Filesystem in Userspace (FUSE) lets non-privileged users create their own file systems without editing kernel code. Users run file system code in user space while the FUSE module provides only a “bridge” to the actual kernel interfaces.

* Though GlusterFS is a File System, it uses already tried and tested disk file systems like ext3, ext4, xfs, etc. to store the data.

* No metadata server, because it uses the Elastic Hashing Algorithm

Startup Guide

Quick Setup


Distribution depends on the brick order given to the volume create command. In the example below, g1:brick1 and g2:brick1 form one replica pair and g1:brick2 with g2:brick2 form another; files are distributed across the two pairs.

gluster volume create gv0_vol replica 2 transport tcp g1:/data/brick1/gv0 g2:/data/brick1/gv0 g1:/data/brick2/gv0 g2:/data/brick2/gv0  force

** Distribution is based on file name hashing
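To make the pairing rule concrete, here is a minimal Python sketch that groups the bricks from the create command above into replica sets in the order they were listed. The function name `replica_sets` is made up for this example; it only mirrors the grouping described in the text and is not Gluster code.

```python
def replica_sets(bricks, replica):
    """Group consecutive bricks into replica sets, in command-line order."""
    if len(bricks) % replica != 0:
        raise ValueError("brick count must be a multiple of the replica count")
    return [bricks[i:i + replica] for i in range(0, len(bricks), replica)]


bricks = [
    "g1:/data/brick1/gv0",
    "g2:/data/brick1/gv0",
    "g1:/data/brick2/gv0",
    "g2:/data/brick2/gv0",
]

for pair in replica_sets(bricks, 2):
    print(pair)
```

Each printed pair holds identical copies of its files, and the file name hash decides which pair a given file lands on. Listing the bricks so that each pair spans g1 and g2 keeps both copies of a file on different servers.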


Data migration
gluster volume replace-brick volume gluster3:/mnt/volume gluster6:/mnt/gluster start
gluster volume rebalance test-volume fix-layout start force

Geo replication
ssh-copy-id   ***.***.***.***

gluster volume create gv0_rep replica 2 transport tcp g2:/data/rep1/gv0 g2:/data/rep2/gv0


Replicated volume Vs Geo-Replication


Elastic Hashing Algorithm

* Gluster designed a system which does not separate metadata from data, and which does not rely on any separate metadata server, whether centralized or distributed.
* In the Gluster algorithmic approach, we take a given pathname/filename (which is unique in any directory tree) and run it through the hashing algorithm. Each pathname/filename results in a unique numerical result.
* One could imagine storing files the way a library does (in alphabetical order).
* An alphabetic algorithm would never work in practice, which is why a hash is used instead.
People familiar with hash algorithms will know that hash functions are generally chosen for properties such as determinism (the same starting string will always result in the same ending hash), and uniformity (the ending results tend to be uniformly distributed mathematically).
* Storage system servers can be added or removed on-the-fly with data automatically rebalanced across the cluster
*File system configuration changes are accepted at runtime and propagated throughout the cluster allowing changes to be made dynamically as workloads fluctuate or for performance tuning.
* The number of bricks should be a multiple of the replica count for a distributed replicated volume
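The two hash properties mentioned above, determinism and uniformity, can be checked with a short Python sketch. MD5 and a simple modulo are used here purely as stand-ins; the actual hash function and layout logic in Gluster are different, and the file names are made up.

```python
import hashlib
from collections import Counter


def pick_subvolume(filename, count):
    """Map a file name to one of `count` subvolumes via a hash (stand-in logic)."""
    digest = hashlib.md5(filename.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % count


# Determinism: the same name always maps to the same subvolume,
# so any client can locate a file without asking a metadata server.
print(pick_subvolume("report.txt", 4) == pick_subvolume("report.txt", 4))

# Uniformity: many names spread roughly evenly over the subvolumes.
names = ["file-{}.dat".format(i) for i in range(10000)]
spread = Counter(pick_subvolume(name, 4) for name in names)
print(sorted(spread.items()))
```

An alphabetical scheme would fail the second check in practice, because real file names cluster around a few prefixes.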


* NFS is traditionally difficult to scale and to make highly available; GlusterFS can do both.

* If the file is not where the hash code calculates to, an extra lookup operation must be performed, adding slightly to latency.


Self Healing 

Previously, this self healing needed to be triggered manually, however there is now a self-heal daemon which runs in the background, and automatically initiates self-healing every 10 minutes on any files which require healing.


gluster volume heal gv123 statistics


Split Brain

A file is said to be in split-brain when the copies of the same file on the different bricks that constitute the replica pair have mismatching data and/or metadata, such that they conflict with each other and automatic healing is not possible. In this scenario, you can decide which is the correct file (source) and which is the one that needs healing (sink) by looking at the mismatching files.

*When a client is witnessing brick disconnections, a file could be modified on different bricks at different times while the other brick is off-line in the replica. These situations lead to split-brain and the file becomes unusable and manual intervention is required to fix this issue.

* Client-side quorum is implemented to minimize split-brains.
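Why a quorum helps can be shown with a simplified majority rule in Python. This is only a sketch of the idea: the real client-side quorum options in Gluster have more cases (for example for two-brick replicas), and the function name is made up.

```python
def has_quorum(bricks_up, replica_count):
    """Simplified majority rule: more than half of the replicas are reachable."""
    return 2 * bricks_up > replica_count


# A replica-3 set split by a network partition into 2 + 1 bricks:
print(has_quorum(2, 3))  # True: the side that sees two bricks may keep writing
print(has_quorum(1, 3))  # False: the side that sees one brick must refuse writes
```

Since at most one side of a partition can hold a majority, the same file cannot be modified independently on both sides, which is exactly the situation that leads to split-brain.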




* /opt/iozone/bin/iozone -+m ioz.cfg -+h <hostip> -w -c -e -i 0 -+n -r 64k -s 1g -t 2

* gluster volume top aurora01 write-perf bs 64 count 1 brick cent01:/disk1/brick list-cnt 10



* To get back the accidentally deleted file run rebalance

* If you are using distributed-replicate, the max file size is the capacity available on an individual brick .

*In versions before 3.6, the two bricks were treated as equal regardless of size, and would have been assigned an equal share of files.


OpenStack Swift for Disaster Recovery



Swift is evolving so that a single cluster can be distributed over multiple, geographically dispersed sites joined via high-latency network connections.
Disaster Recovery is the mechanism for continued operations when you have multiple Swift environments in various locations. In this context DR means continuing workload operations in an alternative deployment, the recovery target cloud.

OpenStack Swift in itself has architecture to deal with disasters by way of data replication to Zones that are distributed across datacenter. Swift can uniquely place replicas according to drives, nodes, racks, PDUs, network segments and datacenter rooms.

A new concept of “Region” is introduced in Swift. A Region is bigger than a Zone and extends the concept of Tiered Zones. The proxy nodes will have an affinity to a Region and be able to optimistically write to storage nodes based on the storage nodes’ Region. Affinity makes the proxy server prefer local backend servers for object PUT requests over non-local ones.

More reading:-

OpenStack Install Guide


** Some distinguish HA from DR by networking scope – LAN for HA and WAN for DR. In the cloud context a better distinction is probably the autonomy of management.
** To add more capacity to the cluster, add the new capacity to the ring with increased weight.
** To add more regions to the cluster, change the ring and increase the replica count by a fractional amount, e.g. 3 -> 3.1.
** Replication traffic needs to be bandwidth-limited across WAN links, both for responsiveness and for cost.
** Objects (the actual data) can help in recreating the entire Swift setup after the proxy server has been recovered. A simple rebalance of the rings can be used to redistribute the data to nodes added/recovered as part of disaster recovery and mitigation.

** To check the replication location
swift-get-nodes -a /etc/swift/object.ring.gz AUTH_adasdbd771e3cd5f2da exampledir examplefile.txt


** To check disk utilization across the cluster
swift-recon -d --top 10

** To monitor cpu/memory
top -b -d 5 -u swift

** To monitor the replication and bandwidth utilization use speedometer on ubuntu
speedometer -b -r eth0 -t eth1
speedometer -b -r eth1

The graph below from speedometer shows that data is transferred to Swift from the client on eth0; once the transfer completes, replication to the remote site starts on eth1.



Software Defined Networking (SDN) and OpenStack

Software-defined networking (SDN) is an approach to networking in which control is decoupled from hardware and given to a software application called a controller.

1) SDN is :
a) Separation of data and control planes and a vendor-agnostic interface (e.g. OpenFlow) between the two.
b) A well-defined API for the networking (3rd parties can develop and sell network control and management apps).
c) Network virtualization (underlying network infrastructure is abstracted from the applications, no vendor lock-in).

2)  SDN is Not :
a) Only Implementing Network Functions in Software or on Virtual Machine
b) Only Programmable Proprietary APIs for Network Device or Management System

3) The SDN Controller has complete control of the SDN Datapaths.

4) SDN Stack: 



a) At the bottom, the data plane is comprised of network elements, whose SDN Datapaths expose their capabilities through the Control-Data-Plane Interface (CDPI) Agent.

b) On top, SDN Applications exist in the application plane, and communicate their requirements via NorthBound Interface (NBI) Drivers. In the middle, the SDN Controller translates these requirements and exerts low-level control over the SDN Datapaths, while providing relevant information up to the SDN Applications.

c) The Management & Admin plane is responsible for setting up the network elements, assigning the SDN Datapaths their SDN Controller, and configuring policies defining the scope of control given to the SDN Controller or SDN Application.

d) This SDN network architecture can coexist with a non-SDN network, especially for the purpose of a migration to a fully enabled SDN network.

** Openstack Integration with SDN

1) OpenStack Neutron is  a networking-as-a-service project within the OpenStack cloud computing initiative.

2) Neutron is an application-level abstraction of networking that relies on plug-in implementations to map the abstraction(s) to reality.

3) Neutron includes a set of APIs, plug-ins and authentication/authorization control software that enable interoperability and orchestration of network devices and technologies (including routers, switches, virtual switches and SDN controllers) within infrastructure-as-a-service  environments.

Example SDN Plug-ins :-

** OpenDaylight
OpenDaylight is an open source SDN project with a modular, pluggable, and flexible controller platform at its core. This controller is implemented strictly in software and is contained within its own Java Virtual Machine (JVM). As such, it can be deployed on any hardware and operating system platform that supports Java. OpenDaylight has a driver for Neutron.

** OpenFlow based

1) OpenFlow-based networking systems are one possible mechanism to be used by a plug-in to deliver a Neutron abstraction.

** More Reading

Network Function Virtualization

SDN is focused on the separation of the network control layer from its forwarding layer, while NFV decouples the network functions, such as network address translation (NAT), firewalling, intrusion detection, domain name service (DNS), caching, etc., from proprietary hardware appliances, so they can run in software. Both concepts can be complementary, although they can exist independently.



OpenStack Object Storage (SWIFT)


Swift is a multi-tenant, highly scalable and durable object storage system that was designed to store large amounts of unstructured data at low cost via a RESTful http API.

The main advantage of object storage is very low implementation cost compared to enterprise-grade storage, while ensuring scalability and data redundancy.


More Reading about SWIFT (multi cloud support)

Some Basic commands to work with SWIFT setup
1) python -c 'import swift; print(swift.__version__)'
2) GET auth token

curl -k -v -H 'X-Storage-User: system:root' -H 'X-Storage-Pass: testpass' http://xx.xx.xx.xx:8080/auth/v1.0

< HTTP/1.1 200 OK
< X-Storage-Url: http://xx.xx.xx.xx:8080/v1/AUTH_system
< X-Storage-Token: AUTH_2fbd62f8d6fc4ccd8a90d6a07

3) Delete file using curl

curl -v -X DELETE -i -H "X-Auth-Token: $OS_AUTH_TOKEN" http://xx.xx.xx.xx:8080/v1/AUTH_2fbd62f8d6fc4ccd8a90d6a07/myfiles/aaa.txt

4) Put a file using curl

curl -v -X PUT -i -T aaa.txt -H "X-Auth-Token: $OS_AUTH_TOKEN" http://xx.xx.xx.xx:8080/v1/AUTH_2fbd62f8d6fc4ccd8a90d6a07/myfiles/aaa.txt

5) Running swift bench for SWIFT setup for performance numbers
   With keystone
swift-bench -V 2.0 -A http://xx.xx.xx.xx:35357/v2.0 -U admin:admin -K admin
   Without keystone
swift-bench -A -U test:tester -K testing

6) To check if Kernel is stuck
ps axo pid,wchan:32,cmd
sudo strace -p PID

7) swift commands to check/upload/download an object

swift -V 2.0 -A http://xx.xx.xx.xx:35357/v2.0 -U service:test -K testpass stat
swift -V 2.0 -A http://xx.xx.xx.xx:35357/v2.0 -U service:test -K testpass upload myfiles proxy.log
swift -V 2.0 -A http://xx.xx.xx.xx:35357/v2.0 -U service:test -K testpass download myfiles
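The curl calls in steps 2 to 4 above can also be expressed with the Python standard library. The sketch below only builds the requests and does not send them; the host name `swift.example`, the token value and the helper names are placeholders invented for this example.

```python
from urllib.request import Request


def auth_request(base_url, user, key):
    """Build the v1.0 auth request; the response headers carry the token."""
    headers = {"X-Storage-User": user, "X-Storage-Pass": key}
    return Request(base_url + "/auth/v1.0", headers=headers, method="GET")


def object_request(storage_url, token, container, name, method, body=None):
    """Build a PUT/DELETE/GET request for one object in a container."""
    url = "{}/{}/{}".format(storage_url.rstrip("/"), container, name)
    return Request(url, data=body, headers={"X-Auth-Token": token}, method=method)


# Hypothetical endpoint and token, for illustration only.
auth = auth_request("http://swift.example:8080", "system:root", "testpass")
put = object_request("http://swift.example:8080/v1/AUTH_system", "example-token",
                     "myfiles", "aaa.txt", "PUT", body=b"hello")
delete = object_request("http://swift.example:8080/v1/AUTH_system", "example-token",
                        "myfiles", "aaa.txt", "DELETE")

print(put.get_method(), put.full_url)
print(delete.get_method(), delete.full_url)
```

Passing any of these objects to `urllib.request.urlopen` would perform the call against a reachable cluster; the storage URL and token would normally be read from the `X-Storage-Url` and `X-Storage-Token` headers of the auth response.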

SWIFT’s Object Placement Strategy (The Ring)

Swift uses a data structure called the “Ring” to map the URL of an object to a particular location in the cluster where the object is stored. It is a static mapping that cannot be changed on the fly.

0) Replica placement is also handled by the ring.

The ring data structure consists of three top-level fields: a list of devices in the cluster, a list of lists of device ids indicating partition-to-device assignments, and an integer indicating the number of bits to shift an MD5 hash to calculate the partition for the hash.
a) P0{"devs": [{"device": "sd0", "id": 0, "ip": "", "meta": "", "port": 6001, "region": 1, "replication_ip": "", "replication_port": 6001, "weight": 200.0, "zone": 1},
b) e.g. Partition Assignment List _replica2part2dev: _replica2part2dev[2][7989] = device for 3rd replica of partition 7989

0.2) The Ring maintains this mapping using zones, devices, partitions, and replicas. Each partition in the Ring is replicated three times by default across the cluster, and the locations for a partition are stored in the mapping maintained by the Ring. The Ring is also responsible for determining which devices are used for handoff should a failure occur.

0.3) For a given partition number, each replica’s device will not be in the same zone as any other replica’s device.

0.4) The ring builder assigns each replica of each partition to the device that desires the most partitions at that point while keeping it as far away as possible from other replicas. The ring builder prefers to assign a replica to a device in a region that has no replicas already; should there be no such region available, the ring builder will try to find a device in a different zone; if not possible, it will look on a different server; failing that, it will just look for a device that has no replicas; finally, if all other options are exhausted, the ring builder will assign the replica to the device that has the fewest replicas already assigned. Note that assignment of multiple replicas to one device will only happen if the ring has fewer devices than it has replicas.

0.5) To check ring detail use swift-ring-builder command

e.g. swift-ring-builder /tmp/container.builder list_parts z1

0.6) more details
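The three ring fields described above (device list, replica-to-partition-to-device table, and shift value) are enough to sketch a lookup in Python. This is a simplified illustration with a tiny ring of 16 partitions and made-up devices; a real Swift ring also mixes a cluster-wide secret prefix/suffix into the hashed path, which is left out here.

```python
import hashlib
import struct

PART_POWER = 4                # 2**4 = 16 partitions (tiny, for illustration)
PART_SHIFT = 32 - PART_POWER  # number of bits to shift the hash

# Hypothetical device list.
devs = [
    {"id": 0, "device": "sd0", "zone": 1},
    {"id": 1, "device": "sd1", "zone": 2},
    {"id": 2, "device": "sd2", "zone": 3},
]

# replica2part2dev[r][p] = id of the device holding replica r of partition p
replica2part2dev = [
    [(part + replica) % len(devs) for part in range(2 ** PART_POWER)]
    for replica in range(3)
]


def get_partition(account, container, obj):
    """Hash the object path and keep only the top PART_POWER bits."""
    path = "/{}/{}/{}".format(account, container, obj)
    digest = hashlib.md5(path.encode("utf-8")).digest()
    return struct.unpack_from(">I", digest)[0] >> PART_SHIFT


def get_nodes(account, container, obj):
    """Return the partition and the devices holding each replica of it."""
    part = get_partition(account, container, obj)
    return part, [devs[row[part]] for row in replica2part2dev]


part, nodes = get_nodes("AUTH_system", "myfiles", "aaa.txt")
print(part, [node["device"] for node in nodes])
```

Every proxy and storage node holding the same ring file computes the same partition and device list for a given path, which is why no central lookup service is needed.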

1) Regions, zones, servers and drives form a hierarchy for data placement.
1.1) Regions are used only when distributing a cluster over geographic sites.
1.2) A zone is defined as a unique domain of something that can fail, such as power or a networking segment.

2) OpenStack Swift places three copies of every object across the cluster in as unique-as-possible locations: first by region, then zone, then server, then drive.
A quorum is required — at least two of the three writes must be successful before the client is notified that the upload was successful.

3) As a distributed storage system, the ring is deployed to every node in the cluster, both proxies and object servers.

4) All objects have their own metadata.

5) The Ring maps Partitions to physical locations of object/container/account on disk.
An account database contains the list of containers in that account. A container database contains the list of objects in that container.

6) After Object placement the Container database is updated asynchronously to reflect that there is a new object in it.

9) The Container Server’s primary job is to handle listings of objects. It does not know where those objects are, just what objects are in a specific container.
The listings are stored as SQLite database files, and replicated across the cluster similar to how objects are.
Statistics are also tracked that include the total number of objects, and total storage usage for that container.
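
A toy version of such a listing database can be sketched with Python's sqlite3 module. The table layout and function names below are invented for illustration and are not the schema the Container Server actually uses.

```python
import sqlite3


def open_container_db(path=":memory:"):
    """Open (or create) a toy listing database for one container."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS object ("
        "name TEXT PRIMARY KEY, size INTEGER NOT NULL)"
    )
    return conn


def record_object(conn, name, size):
    """Add an object to the listing, or update its size if it exists."""
    conn.execute(
        "INSERT OR REPLACE INTO object (name, size) VALUES (?, ?)",
        (name, size),
    )
    conn.commit()


def list_objects(conn, prefix=""):
    """Return object names in sorted order, optionally filtered by prefix."""
    rows = conn.execute("SELECT name FROM object ORDER BY name")
    return [name for (name,) in rows if name.startswith(prefix)]


def container_stats(conn):
    """Return the object count and total bytes used for the container."""
    count, used = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(size), 0) FROM object"
    ).fetchone()
    return {"object_count": count, "bytes_used": used}
```

Note that the database only knows names and sizes; nothing in it says where an object is stored, which matches the division of labour described above.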

10) The Account Server is very similar to the Container Server, except that it is responsible for listings of containers rather than objects.

11) If a replicator detects that a remote drive has failed, the replicator uses the get_more_nodes interface for the ring to choose an alternate node with which to synchronize.

13) When a disk fails, replica data is automatically distributed to the other zones to ensure there are three copies of the data.


1) After Grizzly, the default token format is PKI instead of UUID. To see tokens in short form, change the provider and format in keystone.conf to UUID. Note, however, that PKI tokens are more secure, since the service can trust where the token is coming from, and more efficient, since the service does not have to validate the token on every request as it does for UUID tokens.

2) Sample keystonerc
export OS_SERVICE_TOKEN=b83d2580bf023
export OS_SERVICE_ENDPOINT=http://xx.xx.xx.xx:35357/v2.0
export OS_USERNAME=admin
export OS_PASSWORD=asmin
export OS_TENANT_NAME=admin
export OS_AUTH_URL=http://xx.xx.xx.xx:35357/v2.0

3) Sample commands to connect to the Keystone DB and remove all token entries. An oversized token table could slow performance during a performance test.

mysql -u root -p
show databases;
use keystone;
show tables;
mysql> SELECT COUNT(*) FROM token;
| COUNT(*) |
| 1931349 |
1 row in set (1.64 sec)

keystone-manage token_flush

DELETE FROM token WHERE expires <= NOW();


mysql -u "root" "-popenstack" "keystone" -e "truncate token;"
-rw-rw---- 1 mysql mysql 6621757440 May 2 11:06 ibdata1

mysql -u "root" "-popenstack" "keystone" -e "show table status;"
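
With close to two million rows, a single DELETE of every expired token can run for a long time. One possible alternative is to delete in small batches, sketched below in Python. Everything here is an assumption made for illustration: the sketch uses sqlite3 as a stand-in for MySQL, flush_expired is a hypothetical helper, and the token table is reduced to an integer expires column.

```python
import sqlite3


def flush_expired(conn, now, batch_size=1000):
    """Delete rows with expires <= now in batches; return rows removed.

    conn: sqlite3 connection with a table token(expires INTEGER, ...)
    now:  current time in the same unit as the expires column
    """
    removed = 0
    while True:
        cur = conn.execute(
            "DELETE FROM token WHERE rowid IN ("
            "SELECT rowid FROM token WHERE expires <= ? LIMIT ?)",
            (now, batch_size),
        )
        conn.commit()  # commit each batch so each transaction stays short
        if cur.rowcount == 0:
            return removed
        removed += cur.rowcount
```

Each loop iteration removes at most batch_size rows and commits, so other queries get a chance to run between batches.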

========================FILE PUT CALL FLOW====================

*******************Proxy server – Check Bucket *******************************
Apr 1 00:15:16 vm2 proxy-server Authenticating user token
Apr 1 00:15:16 vm2 proxy-server Removing headers from request environment: X-Identity-Status,X-Domain-Id,X-Domain-Name,X-Project-Id,X-Project-Name,X-Project-Domain-Id,X-Project-Domain-Name,X-User-Id,X-User-Name,X-User-Domain-Id,X-User-Domain-Name,X-Roles,X-Service-Catalog,X-User,X-Tenant-Id,X-Tenant-Name,X-Tenant,X-Role
Apr 1 00:15:16 vm2 proxy-server Storing 1d7b54a9453148109bb6dd token in memcache
Apr 1 00:15:16 vm2 proxy-server Using identity: {'roles': [u'admin'], 'user': u'test', 'tenant': (u'9d6496dd85bc406e94ae26eebf', u'service')} (txn: txcda0af985c5f46c5b3656-00533)
Apr 1 00:15:16 vm2 proxy-server allow user with role admin as account admin (txn: txcda0af985c5f46c5b3656-00533a6784) (client_ip: xx.xx.xx.xx)

Apr 1 00:15:16 vm2 proxy-server xx.xx.xx.xx xx.xx.xx.xx 01/Apr/2014/07/15/16 PUT /v1/AUTH_9d6496dd85bc406e94ae26eebf3ff317/tempfile12 HTTP/1.0 201 - - 1d7b54a9453148109bb6dd2628022334 - - - txcda0af985c5f46c5b3656-00533a6784 - 0.0651 - -

Apr 1 00:15:16 vm2 container-server - - [01/Apr/2014:07:15:16 +0000] "PUT /sdb2/15/AUTH_9d6496dd85bc406e94ae26eebf3ff317/tempfile12" 201 - "txcda0af985c5f46c5b3656-00533a6784" "PUT http://xx.xx.xx.xx:8080/v1/AUTH_9d6496dd85bc406e94ae26eebf3ff317/tempfile12" "proxy-server 4797" 0.0246
Apr 1 00:15:16 vm2 account-server - - [01/Apr/2014:07:15:16 +0000] "PUT /sdb2/711/AUTH_9d6496dd85bc406e94ae26eebf3ff317/tempfile12" 201 - "txcda0af985c5f46c5b3656-00533a6784" "PUT; "container-server 4795" 0.0141 ""

**********************Proxy server – check File *********************************
Apr 1 00:15:16 vm2 proxy-server Authenticating user token
Apr 1 00:15:16 vm2 proxy-server Removing headers from request environment: X-Identity-Status,X-Domain-Id,X-Domain-Name,X-Project-Id,X-Project-Name,X-Project-Domain-Id,X-Project-Domain-Name,X-User-Id,X-User-Name,X-User-Domain-Id,X-User-Domain-Name,X-Roles,X-Service-Catalog,X-User,X-Tenant-Id,X-Tenant-Name,X-Tenant,X-Role
Apr 1 00:15:17 vm2 proxy-server Storing f62461c755de4b418b868cd473aa60cc token in memcache
Apr 1 00:15:17 vm2 proxy-server Using identity: {'roles': [u'admin'], 'user': u'ceph', 'tenant': (u'9d6496dd85bc406e94ae26eebf3ff317', u'service')} (txn: tx60e267bb98664bca84185-00533a6784)
Apr 1 00:15:17 vm2 proxy-server allow user with role admin as account admin (txn: tx60e267bb98664bca84185-00533a6784) (client_ip: xx.xx.xx.xx)
Apr 1 00:15:17 vm2 proxy-server xx.xx.xx.xx xx.xx.xx.xx 01/Apr/2014/07/15/17 HEAD /v1/AUTH_9d6496dd85bc406e94ae26eebf3ff317/tempfile12/expirer.error HTTP/1.0 404 - - f62461c755de4b418b868cd473aa60cc - - - tx60e267bb98664bca84185-00533a6784 - 0.0260 - -

Apr 1 00:15:17 vm2 object-server - - [01/Apr/2014:07:15:17 +0000] "HEAD /sdb2/644/AUTH_9d6496dd85bc406e94ae26eebf3ff317/tempfile12/expirer.error" 404 - "HEAD http://xx.xx.xx.xx:8080/v1/AUTH_9d6496dd85bc406e94ae26eebf3ff317/tempfile12/expirer.error" "tx60e267bb98664bca84185-00533a6784" "proxy-server 4797" 0.0013

**********************Proxy server – PUT File *********************************
Apr 1 00:15:17 vm2 proxy-server Authenticating user token
Apr 1 00:15:17 vm2 proxy-server Removing headers from request environment: X-Identity-Status,X-Domain-Id,X-Domain-Name,X-Project-Id,X-Project-Name,X-Project-Domain-Id,X-Project-Domain-Name,X-User-Id,X-User-Name,X-User-Domain-Id,X-User-Domain-Name,X-Roles,X-Service-Catalog,X-User,X-Tenant-Id,X-Tenant-Name,X-Tenant,X-Role
Apr 1 00:15:17 vm2 proxy-server Returning cached token f62461c755de4b418b868cd473aa60cc
Apr 1 00:15:17 vm2 proxy-server Using identity: {'roles': [u'admin'], 'user': u'ceph', 'tenant': (u'9d6496dd85bc406e94ae26eebf3ff317', u'service')} (txn: tx6ec6e2950e424897bb9e3-00533a6785)
Apr 1 00:15:17 vm2 proxy-server allow user with role admin as account admin (txn: tx6ec6e2950e424897bb9e3-00533a6785) (client_ip: xx.xx.xx.xx)
Apr 1 00:15:17 vm2 proxy-server xx.xx.xx.xx xx.xx.xx.xx 01/Apr/2014/07/15/17 PUT /v1/AUTH_9d6496dd85bc406e94ae26eebf3ff317/tempfile12/expirer.error HTTP/1.0 201 - - f62461c755de4b418b868cd473aa60cc - - - tx6ec6e2950e424897bb9e3-00533a6785 - 0.0476 - -

Apr 1 00:15:17 vm2 object-server - - [01/Apr/2014:07:15:17 +0000] "PUT /sdb2/644/AUTH_9d6496dd85bc406e94ae26eebf3ff317/tempfile12/expirer.error" 201 - "PUT http://xx.xx.xx.xx:8080/v1/AUTH_9d6496dd85bc406e94ae26eebf3ff317/tempfile12/expirer.error" "tx6ec6e2950e424897bb9e3-00533a6785" "proxy-server 4797" 0.0248
Apr 1 00:15:17 vm2 container-server - - [01/Apr/2014:07:15:17 +0000] "PUT /sdb2/15/AUTH_9d6496dd85bc406e94ae26eebf3ff317/tempfile12/expirer.error" 201 - "tx6ec6e2950e424897bb9e3-00533a6785" "PUT http://xx.xx.xx.xx:8080/sdb4/644/AUTH_9d6496dd85bc406e94ae26eebf3ff317/tempfile12/expirer.error" "obj-server 4784" 0.0004
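
Every request in the logs above carries a transaction ID (the tx... value), which is what ties a proxy-server line to the object-server and container-server lines it triggered. A small Python sketch for grouping log lines by that ID could look like this. The regular expressions are my own guess based on the sample lines shown here, not an official log format definition, and group_by_transaction is a hypothetical helper.

```python
import re
from collections import defaultdict

# Patterns inferred from the sample log lines; treat them as assumptions.
TXN_RE = re.compile(r"\btx[0-9a-f]+-[0-9a-f]+\b")
SERVER_RE = re.compile(
    r"\b(proxy-server|object-server|container-server|account-server)\b"
)


def group_by_transaction(lines):
    """Return {transaction id: [server names in log order]}."""
    flows = defaultdict(list)
    for line in lines:
        txn = TXN_RE.search(line)
        server = SERVER_RE.search(line)  # first match is the logging daemon
        if txn and server:
            flows[txn.group(0)].append(server.group(1))
    return dict(flows)
```

Lines without a transaction ID, such as "Authenticating user token", are simply skipped, so the result shows which servers took part in each PUT or HEAD.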



