Hadoop is a spin-off of the Apache Nutch project, inspired by Google's MapReduce and GFS papers.
Apache Hadoop is an open-source framework that allows for distributed processing of large data sets across clusters of computers.
Apache Hadoop has these major modules:
Hadoop Distributed File System (HDFS): A distributed file system providing high-throughput access to large data sets.
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop YARN: A framework for job scheduling and cluster resource management. YARN is a generic platform that can run any distributed application; MapReduce 2 (MR2) is one such application that runs on top of YARN.
MapReduce 2 (MR2): A YARN-based system for parallel processing of large data sets.
Thanks to http://www.youtube.com/playlist?list=PL9ooVrP1hQOHrhnO86Z9m9tDi91W2d1b6
Hadoop stack:
Hadoop is best suited for:
Processing unstructured data
Complex parallel information processing
Large Data Sets/Files
Critical, fault-tolerant data processing
Reports not needed in real time
Queries that cannot be expressed in SQL
Data processing jobs that need to run faster
How Hadoop processes data (MapReduce; a rough analogy is a Java servlet)
1) Hadoop provides the MapReduce framework for processing the stored big data. The key innovation of MapReduce
is the ability to take a query over a data set, divide it, and run it in parallel over multiple nodes.
2) You can run your indexing job by sending your code to each of the dozens of servers in your cluster,
and each server operates on its own little piece of the data. Results are then delivered back to you as a unified whole.
That is the essence of MapReduce: you map the operation out to all of those servers and then reduce the results back into a single result set.
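The map/shuffle/reduce flow described above can be sketched in a few lines of plain Python. This is a toy single-process simulation of the classic word-count job, not the Hadoop API; the function names and the in-memory "splits" are illustrative stand-ins for the pieces of data each server would hold.

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}

# Each "split" stands in for the slice of data one server operates on.
splits = ["big data big clusters", "big data processing"]
mapped = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle_phase(mapped))
print(counts)  # {'big': 3, 'data': 2, 'clusters': 1, 'processing': 1}
```

In real Hadoop, the map calls run in parallel on the nodes that hold each split, and the shuffle moves data across the network; the logic, however, is exactly this.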
How Hadoop stores files (HDFS)
1) Hadoop lets you store files bigger than what can be stored on one particular node or server.
2) When you want to load all of your organization's data into Hadoop, the software breaks
that data into pieces that it then spreads across your different servers. There's no single place where
you go to talk to all of your data; Hadoop keeps track of where each piece resides. And because
multiple copies are stored, data on a server that goes offline or dies can be automatically replicated
from a known good copy.
3) Hadoop is designed to run on a large number of machines that don’t share any memory or disks.
4) Each server must have access to the data. This is the role of HDFS, the Hadoop Distributed File System.
HDFS ensures data is replicated with redundancy across the cluster.
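The split-and-replicate idea behind HDFS can be sketched as follows. This is a simplified Python model with made-up helper names: real HDFS uses a large default block size (128 MB in recent versions), a NameNode to track block locations, and a rack-aware placement policy rather than the round-robin used here.

```python
def split_into_blocks(data, block_size):
    """Break a file's bytes into fixed-size blocks, as HDFS does on load."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (round-robin for illustration)."""
    return {
        i: [nodes[(i + r) % len(nodes)] for r in range(replication)]
        for i in range(len(blocks))
    }

# A tiny "file" and cluster; block_size=4 bytes stands in for 128 MB.
blocks = split_into_blocks(b"0123456789", block_size=4)
nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(blocks, nodes, replication=3)
print(placement)  # block index -> the three nodes holding a replica
```

If node1 dies, every block it held still exists on two other nodes, so the cluster can re-replicate from a known good copy; that is the fault tolerance point 2 describes.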
Programming Hadoop at the MapReduce level means working with the Java APIs and manually loading data files into HDFS.
Working directly with the Java APIs can be tedious and error prone, and it restricts Hadoop to Java programmers. The Hadoop ecosystem offers two tools that make this easier: Pig and Hive.
1) Pig provides a high-level scripting language, Pig Latin, that simplifies the common tasks of working with Hadoop: loading data, expressing transformations on the data, and storing the final results.
2) Hive enables Hadoop to operate as a data warehouse. It superimposes structure on data in HDFS and then permits queries over the data using a familiar SQL-like syntax.
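What "superimposing structure" means can be shown with a toy Python sketch (this is a conceptual analogy, not HiveQL): the raw file stays as unstructured lines, a schema is layered on top at read time, and queries then work against named columns.

```python
# Raw lines as they might sit in an HDFS file: no structure, just text.
raw = ["alice,30,NYC", "bob,25,SF", "carol,35,NYC"]

# Superimpose a schema on the raw data, as Hive's CREATE TABLE does;
# the column names here are made up for the example.
schema = ("name", "age", "city")
table = [dict(zip(schema, line.split(","))) for line in raw]

# Conceptual equivalent of: SELECT name FROM people WHERE city = 'NYC'
result = [row["name"] for row in table if row["city"] == "NYC"]
print(result)  # ['alice', 'carol']
```

Hive compiles the real SQL-like query into MapReduce jobs over the files in HDFS, so the data itself never has to be reshaped into a database format first.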
OpenStack – Sahara with Hadoop
1) Sahara = managing Hadoop + provisioning infrastructure + tools
● Create and manage clusters
● Define and run analysis jobs
● All through a programmatic interface or a web console
2) Sahara ReST API http://docs.openstack.org/developer/sahara/restapi/rest_api_v1.0.html
Thanks to http://techblog.baghel.com/index.php?itemid=132
What's next…
Try out http://hortonworks.com/products/hortonworks-sandbox/