Tuesday, May 26, 2015

Hadoop: The Definitive Guide

1. Sample programs:
http://www.hadoopbook.com/
1+.
Cloudera's distribution for Hadoop (CDH) is a comprehensive Apache Hadoop-based management platform, and Cloudera Enterprise
includes the tools, platform, and support necessary to use Hadoop in production.
This edition uses the new MapReduce API for most of the examples. The major change in Hadoop 2.0 is the new MapReduce runtime, MapReduce 2.
1.2+
This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis
system. The storage is provided by HDFS and analysis by MapReduce. There are other
parts to Hadoop, but these capabilities are its kernel.
1.3+
For updating a small proportion of records in a database, a traditional
B-Tree (the data structure used in relational databases, which is limited by the
rate at which it can perform seeks) works well. For updating the majority of a database, a B-Tree
is less efficient than MapReduce, which uses Sort/Merge to rebuild the database.
2.
In many ways, MapReduce can be seen as a complement to a Relational Database Management
System (RDBMS).
MapReduce is a good fit for problems that need to analyze the whole dataset
in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries
or updates, where the dataset has been indexed to deliver low-latency retrieval and
update times of a relatively small amount of data.
MapReduce suits applications where
the data is written once and read many times, whereas a relational database is good for
datasets that are continually updated.
MapReduce works well on unstructured or semistructured
data because it is designed to interpret the data at processing time.
Normalization poses problems for MapReduce because it makes reading a record a
nonlocal operation, and one of the central assumptions that MapReduce makes is that
it is possible to perform (high-speed) streaming reads and writes.
2+
MapReduce is a linearly scalable programming model. The programmer writes two
functions—a map function and a reduce function—each of which defines a mapping
from one set of key-value pairs to another. These functions are oblivious to the size of
the data or the cluster that they are operating on, so they can be used unchanged for a
small dataset and for a massive one. More important, if you double the size of the input
data, a job will take twice as long to run. But if you also double the size of the cluster, a job
will run as fast as the original one. This is not generally true of SQL queries.
MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated
hardware running in a single data center with very high aggregate bandwidth interconnects.
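To make the two functions concrete, here is a minimal word-count sketch written against the new MapReduce API (the class and job names are illustrative, not the book's sample code): the map function turns (offset, line) pairs into (word, 1) pairs, and the reduce function turns each (word, list of counts) pair into a (word, total) pair.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: (line offset, line text) -> (word, 1) for every word in the line.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce: (word, [1, 1, ...]) -> (word, total count).
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because each map task processes an independent input split, doubling the input data roughly doubles the number of map tasks, and doubling the cluster lets those tasks finish in the same elapsed time, which is the linear scaling described above.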
3.
Although this may change in the future, these are areas where HDFS is not a good
fit today:
Low-latency data access
Applications that require low-latency access to data, in the tens of milliseconds
range, will not work well with HDFS. Remember, HDFS is optimized for delivering
a high throughput of data, and this may be at the expense of latency.
Lots of small files
Because the namenode holds filesystem metadata in memory, the number of files a filesystem can hold is limited by the amount of memory on the namenode.
Multiple writers, arbitrary file modifications
Files in HDFS may be written to by a single writer. Writes are always made at the
end of the file. There is no support for multiple writers or for modifications at
arbitrary offsets in the file. (These might be supported in the future, but they are
likely to be relatively inefficient.)
4.
The namenode keeps a reference to every file and block in the filesystem in memory,
which means that on very large clusters with many files, memory becomes the limiting
factor for scaling.
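As a rough rule of thumb from the book, each file, directory, and block takes about 150 bytes of namenode memory. At that rate, one million files, each stored in a single block, amounts to roughly two million objects (one file object plus one block object per file), or on the order of 300 MB of namenode memory.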
5.
The new API makes extensive use of context objects that allow the user code to
communicate with the MapReduce system. The new Context, for example, essentially
unifies the role of the JobConf, the OutputCollector, and the Reporter from
the old API.
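A minimal sketch of the difference (the mapper bodies below are illustrative, not from the book's sample code): in the old org.apache.hadoop.mapred API, output and progress reporting arrive as separate OutputCollector and Reporter arguments, while in the new org.apache.hadoop.mapreduce API a single Context handles writing output, reporting status, and reading the job configuration.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapreduce.Mapper;

// Old API: output and progress reporting are passed in as separate arguments,
// and configuration arrives via configure(JobConf) (inherited from MapReduceBase).
class OldApiMapper extends MapReduceBase
    implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    output.collect(new Text(value.toString()), new IntWritable(1));
    reporter.setStatus("processed one record");
  }
}

// New API: a single Context object is used to emit output, report status,
// and read the job configuration.
class NewApiMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // "my.example.property" is an illustrative configuration key, not a real Hadoop one;
    // the lookup only shows the Context playing the JobConf role.
    String unused = context.getConfiguration().get("my.example.property");
    context.write(new Text(value.toString()), new IntWritable(1));
    context.setStatus("processed one record");
  }
}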
6.
There are two types of nodes that control the job execution process: a jobtracker and
a number of tasktrackers. The jobtracker coordinates all the jobs run on the system by
scheduling tasks to run on tasktrackers.
7.
Low-latency data access
HDFS is optimized for delivering a high throughput of data, which may come at the expense of latency; HBase (Chapter 13) is currently a better choice for applications that require low-latency access, in the tens of milliseconds range.
8.
An HDFS cluster has two types of nodes operating in a master-worker pattern: a namenode
(the master) and a number of datanodes (workers). The namenode manages the
filesystem namespace. It maintains the filesystem tree and the metadata for all the files
and directories in the tree.
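To see the client side of this arrangement, here is a minimal sketch along the lines of the book's FileSystemCat example (the path in the comment is illustrative): the FileSystem client asks the namenode for the file's metadata and block locations, and the stream returned by open() then reads the block data from the datanodes.

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Copies a file from HDFS to standard output, e.g.
//   hadoop FileSystemCat hdfs://namenode/path/to/file.txt   (illustrative path)
public class FileSystemCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    // The namenode resolves the path to block locations; datanodes serve the bytes.
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}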
