
Using Hadoop on a Cluster

Hadoop is a framework for MapReduce-style distributed computation. MapReduce performs data decomposition, i.e., it runs the same "function" in parallel on different parts of a huge data set ('map') and then combines the results of those independent computations ('reduce'). Hadoop is a Java implementation of MapReduce: it automatically splits the data set and spawns map and reduce tasks that run on different physical machines. It also provides a distributed file system, the Hadoop Distributed File System (HDFS), to enable high-performance I/O for the MapReduce tasks.
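
To make the map and reduce steps concrete, here is a minimal single-machine sketch of word count, the canonical MapReduce example, written as a plain Unix pipeline in the same shape Hadoop Streaming uses (input.txt is just a placeholder file; Hadoop does the equivalent work, but spread across machines):

# 'map': emit a (word, 1) pair for every word in the input;
# the sort plays the role of the shuffle, grouping identical keys together;
# the final awk is the 'reduce': it sums the counts for each word
tr -cs 'A-Za-z' '\n' < input.txt \
  | awk 'NF { print $0 "\t1" }' \
  | sort \
  | awk -F'\t' '{ count[$1] += $2 } END { for (w in count) print w, count[w] }'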

In this post, we will look specifically at ways to get Hadoop running on compute clusters that have a network file system. The Hadoop Cluster Setup guide explains how to set up Hadoop to run on a compute cluster. Michael Noll's guide delves deeper into this and explains the bare-bones configuration needed to set up a multi-node Hadoop cluster.

However, both of them assume that the cluster machines have sizeable individual storage and that Hadoop can be unpacked separately on every machine in the cluster.

Often, in real-world clusters:
  • There is very limited storage space per compute node but vast shared network drives. Hence, the Hadoop installation needs to be stored on such drives.
  • Not all users need Hadoop. Hence, unpacking Hadoop on every machine (and maintaining those individual installations) or running Hadoop daemons on all nodes continuously may not be necessary.
  • Individual users who need Hadoop will typically run their jobs on a subset of the cluster's nodes, decided by a dynamic resource allocator.
Hence, there is a need to dynamically start Hadoop on only the allocated nodes and run jobs on them, with a single Hadoop installation sitting on a network drive.

We can accomplish this manually as follows, with a bare minimum of configuration (Hadoop has a deluge of configuration parameters). All of the steps described can be suitably automated.
  • Download and unpack Hadoop to a location on your network file system. Let us call this location HADOOP_HOME; all scripts referred to below are relative to HADOOP_HOME. Michael Noll's guide has good information on prerequisites such as password-less ssh that are needed for basic operation.
  • Change the JAVA_HOME variable in conf/hadoop-env.sh to point to your installation of Java 6.
  • The compute cluster you are running on usually provides a way to find the hostnames of the compute nodes you are allocated once you submit a job. Choose one node as the master node and add it to the conf/masters file, and add all nodes (including the one picked as master) to conf/slaves; a sketch of generating these files from the scheduler's node list appears after this list. These files are used by bin/start-dfs.sh and bin/start-mapred.sh to start the HDFS and MapReduce daemons respectively.
  • Next, we need to enter some basic configuration in conf/hadoop-site.xml (a minimal example appears after this list). Once again, refer to Michael Noll's page for these configurations.
  • If the individual storage available on each compute node is sufficient for your job, override the 'hadoop.tmp.dir' property to a path on the compute node (the default is /tmp/hadoop-${user.name}). Then simply run the following commands on the master (name) node to get the cluster started: bin/hadoop namenode -format, bin/start-dfs.sh, bin/start-mapred.sh.
  • If the compute nodes do not have sufficient storage for your job, hadoop.tmp.dir has to point at the shared network drive, and we can no longer start the HDFS daemons automatically with bin/start-dfs.sh: every DataNode (the slave daemon of HDFS) would use the same value for the hadoop.tmp.dir property and try to obtain a lock on the same directory. Consequently, only one of them succeeds and all the other DataNode daemons die. Unfortunately, the variable substitutions allowed in Hadoop config files do not let us make the daemons pick different directories.
  • In such a scenario, we need to create a separate configuration directory, with its own hadoop-site.xml, for each DataNode daemon (everything except the hadoop.tmp.dir property can be the same for all DataNodes) and start the daemons explicitly on the slave nodes with the bin/hadoop-daemon.sh script. Use bin/hadoop namenode -format on the master node to format HDFS. "bin/hadoop-daemon.sh --config <config-dir> start namenode" on the master node runs the NameNode daemon there, and "bin/hadoop-daemon.sh --config <config-dir> start datanode" on each slave starts a DataNode daemon, picking up the hadoop-site.xml from that node's config directory. A sketch of this per-node setup appears at the end of this section.
  • You can still use bin/start-mapred.sh to start the MapReduce daemons, the JobTracker and the TaskTrackers.
  • Submit any MapReduce job; see the word-count example in the sketches below.
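
As one way to automate the masters/slaves step above, here is a rough sketch. It assumes HADOOP_HOME is exported as an environment variable pointing at the installation, and that a PBS/Torque-style scheduler lists the allocated hostnames in $PBS_NODEFILE (typically one line per allocated core); substitute whatever mechanism your cluster provides:

cd $HADOOP_HOME

# unique hostnames allocated to this job ($PBS_NODEFILE usually repeats a host once per core)
sort -u "$PBS_NODEFILE" > conf/slaves

# pick the first allocated node as the master; it will run the NameNode and JobTracker
head -n 1 conf/slaves > conf/masters
MASTER=$(head -n 1 conf/masters)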
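
Continuing the sketch (reusing $MASTER from above), a minimal conf/hadoop-site.xml for the "compute nodes have enough local storage" case could be generated as follows. The property names are the usual ones from Michael Noll's guide; the ports 54310/54311 and the paths are just illustrative choices, and the backslash keeps ${user.name} literal so that Hadoop, not the shell, expands it:

cat > conf/hadoop-site.xml <<EOF
<?xml version="1.0"?>
<configuration>
  <!-- scratch space on each compute node's local disk -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-\${user.name}</value>
  </property>
  <!-- the NameNode -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://$MASTER:54310</value>
  </property>
  <!-- the JobTracker -->
  <property>
    <name>mapred.job.tracker</name>
    <value>$MASTER:54311</value>
  </property>
</configuration>
EOF

# format HDFS and bring up the daemons from the master node
bin/hadoop namenode -format
bin/start-dfs.sh
bin/start-mapred.sh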
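
As a quick smoke test, you could run the word-count example that ships with Hadoop. The examples jar sits in HADOOP_HOME and its exact name depends on your Hadoop version (hence the wildcard); the local text directory and the HDFS paths are arbitrary placeholders:

# copy a local directory of text files into HDFS and run the bundled word-count job
bin/hadoop dfs -put /some/local/textdir input
bin/hadoop jar hadoop-*-examples.jar wordcount input output
bin/hadoop dfs -cat output/part-00000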

All of these steps can be automated by using "ssh <remote-machine> <command>" to start the daemons remotely, as in the last sketch below.
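
Finally, here is one possible end-to-end sketch of the "no sufficient local storage" case: it creates a per-node config directory and a per-node hadoop.tmp.dir on the network drive, starts the NameNode on the master, and uses ssh to start a DataNode on every slave with bin/hadoop-daemon.sh. The layout (conf-<hostname> and tmp/<hostname> under HADOOP_HOME) and the helper function are just one arbitrary way to do it; the script assumes it runs on the master node and that conf/masters, conf/slaves and the basic conf/hadoop-site.xml were already set up as described above:

cd $HADOOP_HOME
MASTER=$(head -n 1 conf/masters)

# helper: print a minimal hadoop-site.xml whose hadoop.tmp.dir is given as $1
# (same illustrative NameNode/JobTracker addresses as in the earlier sketch)
write_site() {
  printf '<?xml version="1.0"?>\n<configuration>\n'
  printf '  <property><name>hadoop.tmp.dir</name><value>%s</value></property>\n' "$1"
  printf '  <property><name>fs.default.name</name><value>hdfs://%s:54310</value></property>\n' "$MASTER"
  printf '  <property><name>mapred.job.tracker</name><value>%s:54311</value></property>\n' "$MASTER"
  printf '</configuration>\n'
}

# one config directory and one tmp directory per node, all on the network drive
for NODE in $(sort -u conf/masters conf/slaves); do
  rm -rf conf-$NODE && cp -r conf conf-$NODE
  mkdir -p tmp/$NODE
  write_site "$HADOOP_HOME/tmp/$NODE" > conf-$NODE/hadoop-site.xml
done

# format HDFS and start the NameNode on the master
bin/hadoop --config $HADOOP_HOME/conf-$MASTER namenode -format
bin/hadoop-daemon.sh --config $HADOOP_HOME/conf-$MASTER start namenode

# start one DataNode per slave over ssh, each pointed at its own config directory
for SLAVE in $(cat conf/slaves); do
  ssh $SLAVE "cd $HADOOP_HOME && bin/hadoop-daemon.sh --config $HADOOP_HOME/conf-$SLAVE start datanode"
done

# the MapReduce daemons (JobTracker and TaskTrackers) can still go through start-mapred.sh
bin/start-mapred.sh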
