Hadoop !!

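Hadoop, often grouped with NoSQL technologies, began as a spin-off of the Apache Nutch project and was inspired by Google’s papers on the Google File System and MapReduce.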

Apache Hadoop is an open source framework that allows for distributed processing of large data sets across clusters of computers.

Apache Hadoop has these major modules:
Hadoop Distributed File System (HDFS): A distributed file system for high-throughput access to large sets of data.

Hadoop Common: The common utilities that support the other Hadoop modules.

Hadoop YARN: A framework for job scheduling and cluster resource management. YARN is a generic platform that can run any distributed application; MapReduce 2 (MR2) is one such application that runs on top of YARN.

MapReduce 2: A YARN-based system for parallel processing of large sets of data.

[Figure: Hadoop]

Thanks to http://www.youtube.com/playlist?list=PL9ooVrP1hQOHrhnO86Z9m9tDi91W2d1b6

Hadoop stack:

[Figure: YARN]

Hadoop is best suited for:
Processing unstructured data
Complex parallel information processing
Large Data Sets/Files
Machine Learning
Critical, fault-tolerant data processing
Reports not needed in real time
Queries that cannot be expressed in SQL
Data processing jobs that need to run faster

How Hadoop processes data (MapReduce; a rough analogy is a Java servlet)
1) Hadoop provides the MapReduce framework for processing the stored big data. The important innovation of MapReduce
is the ability to take a query over a dataset, divide it, and run it in parallel over multiple nodes.

2) You can run your indexing job by sending your code to each of the dozens of servers in your cluster,
and each server operates on its own little piece of the data. Results are then delivered back to you as a unified whole.
Hence the name MapReduce: you map the operation out to all of those servers and then you reduce the results back into a single result set.
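
To make the map-then-reduce idea concrete, here is a minimal word-count sketch in plain Python. It does not use the Hadoop APIs: the "servers" are just threads in one process, and the names map_phase, shuffle, reduce_phase and word_count are made up for this illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of lines."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_outputs):
    """Group the values emitted by all mappers under their key."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines, num_workers=3):
    # Each worker gets its own slice of the input (stand-in for a cluster node).
    chunks = [lines[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


counts = word_count(["big data big cluster", "data node", "big node"])
print(counts)
```

Each worker only ever sees its own chunk; the single result set appears only after the grouped values are reduced, which is the same shape a real MapReduce job has.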

How Hadoop stores files (HDFS)
1) Hadoop lets you store files bigger than what can be stored on one particular node or server.

2) When you load your organization’s data into Hadoop, the software breaks
that data into pieces and spreads them across your different servers. There’s no one place where
you go to talk to all of your data; Hadoop keeps track of where the data resides. And because
multiple copies of each piece are stored, data on a server that goes offline or dies can be
automatically re-replicated from a known good copy.

3) Hadoop is designed to run on a large number of machines that don’t share any memory or disks.

4) Each server must have access to the data. This is the role of HDFS, the Hadoop Distributed File System.
HDFS ensures data is replicated with redundancy across the cluster.
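
The storage points above can be sketched as a toy model in Python. This is not how HDFS is implemented: the block size is deliberately tiny, the round-robin placement is a simplification chosen for this example, and the node names and function names are hypothetical. It assumes the replication factor is no larger than the number of nodes.

```python
BLOCK_SIZE = 4      # bytes per block; real HDFS blocks are far larger
REPLICATION = 3     # copies kept of every block


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def handle_node_failure(placement, dead_node, live_nodes, replication=REPLICATION):
    """Drop the dead node and re-replicate its blocks onto other live nodes."""
    for holders in placement.values():
        if dead_node in holders:
            holders.remove(dead_node)
            # The surviving holders are the "known good copies" to copy from.
            candidates = [n for n in live_nodes if n not in holders]
            if candidates and len(holders) < replication:
                holders.append(candidates[0])
    return placement


nodes = ["node1", "node2", "node3", "node4"]
data = b"hello hadoop!"
blocks = split_into_blocks(data)
placement = place_replicas(len(blocks), nodes)

# Simulate node2 going offline.
live = [n for n in nodes if n != "node2"]
placement = handle_node_failure(placement, "node2", live)
print(placement)
```

The placement dictionary plays the part of the bookkeeping that tracks where each piece lives; after the failure every block is back to three copies without the client doing anything.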

Hadoop programming
Programming Hadoop at the MapReduce level means working with the Java APIs and manually loading data files into HDFS.
Working directly with the Java APIs can be tedious and error prone, and it restricts usage of Hadoop to Java programmers. The Hadoop ecosystem offers two solutions for making Hadoop programming easier: Pig and Hive.

1) Pig is a programming language that simplifies the common tasks of working with Hadoop: loading data, expressing transformations on the data, and storing the final results.
2) Hive enables Hadoop to operate as a data warehouse. It superimposes structure on data in HDFS and then permits queries over the data using a familiar SQL-like syntax.
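
The idea of superimposing structure on raw files can be illustrated with a small Python sketch. This is not Hive or Pig code: the sample data, the column names and the helper functions are all invented for the example, and the query in the comment uses a hypothetical table called orders.

```python
import csv
import io
from collections import Counter

# Raw tab-delimited text, standing in for a file sitting in HDFS.
RAW = "alice\tsales\t100\nbob\tsales\t250\ncarol\tops\t75\n"

# The structure is declared separately and applied only when the data is read.
SCHEMA = [("name", str), ("dept", str), ("amount", int)]


def read_with_schema(raw_text, schema):
    """Yield one dict per line, naming and typing the fields per the schema."""
    reader = csv.reader(io.StringIO(raw_text), delimiter="\t")
    for fields in reader:
        yield {col: cast(value) for (col, cast), value in zip(schema, fields)}


def total_by_dept(rows):
    """Rough equivalent of: SELECT dept, SUM(amount) FROM orders GROUP BY dept"""
    totals = Counter()
    for row in rows:
        totals[row["dept"]] += row["amount"]
    return dict(totals)


totals = total_by_dept(read_with_schema(RAW, SCHEMA))
print(totals)
```

The stored file never changes; only the reader knows about columns and types, which is what lets a SQL-like query run over plain files.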

Build

Single-node cluster setup guide:
http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-common/SingleCluster.html
To fix build issues, apply this patch:
https://issues.apache.org/jira/secure/attachment/12614482/HADOOP-10110.patch

==================================================

OpenStack – Sahara with Hadoop
1) Sahara = managing Hadoop + provisioning infrastructure + tools
●  Create and manage clusters
●  Define and run analysis jobs
●  All through a programmatic interface or a web console
2) Sahara REST API: http://docs.openstack.org/developer/sahara/restapi/rest_api_v1.0.html

[Figure: Sahara]

Thanks to http://www.slideshare.net/spinningmatt/sahara-dev-nation-2014

============================================

Hadoop ecosystem:

[Figure: Hadoop Ecosystem]

Thanks to  http://techblog.baghel.com/index.php?itemid=132

What’s next?

Try out the Hortonworks Sandbox: http://hortonworks.com/products/hortonworks-sandbox/

 
