
Using Hadoop on a Cluster

Hadoop is a framework for MapReduce distributed computations. MapReduce performs data decomposition, i.e., it runs the same "function" in parallel on different parts of a huge data set (the 'map' phase) and then combines the results of those independent computations (the 'reduce' phase). Hadoop is a Java implementation of MapReduce. It automatically splits the dataset and spawns map and reduce tasks that run on different physical machines. It also provides a high-performance file system, the Hadoop Distributed File System (HDFS), to enable high-performance I/O for the MapReduce tasks.
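As an analogy (this is plain Unix, not Hadoop itself), the map-shuffle-reduce flow can be mimicked with shell pipes for a word count:

```shell
# map: emit one word per line; shuffle: sort groups identical keys together;
# reduce: aggregate (count) the values for each key.
echo "to be or not to be" | tr ' ' '\n' | sort | uniq -c
# prints (counts are left-padded by uniq -c):
#   2 be
#   1 not
#   1 or
#   2 to
```

Hadoop does the same thing, except each pipeline stage runs as many parallel tasks spread over the cluster, with HDFS holding the data in between.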

In this post, we will specifically look at ways to get Hadoop running on compute clusters that use a network file system. The Hadoop Cluster Setup guide explains how to set up Hadoop on a compute cluster. Michael Noll's guide goes into this more deeply and explains the bare-bones configuration needed to set up a multi-node Hadoop cluster.

However, both guides assume that the cluster machines have sizeable individual storage, so that a Hadoop installation can be unpacked separately on every machine in the cluster.

Real-world clusters often look different:
  • There is very limited storage per compute node but vast shared network drives, so the Hadoop installation needs to live on those drives.
  • Not all users need Hadoop, so unpacking Hadoop on every machine (and maintaining all those individual installations) or running Hadoop daemons on all nodes continuously may be unnecessary.
  • Individual users who need Hadoop will typically run their jobs on a subset of the cluster's nodes, decided by a dynamic resource allocator.
Hence, we need to be able to dynamically start Hadoop on only the allocated nodes and run jobs on them, from a single Hadoop installation sitting on a network drive.

We can accomplish this manually as follows, with the bare minimum configuration (Hadoop has a deluge of configuration parameters). All the steps described can be suitably automated.
  • Download and unpack Hadoop to a location on your network file system; let us call this location HADOOP_HOME. All scripts referred to below are relative to HADOOP_HOME. Michael Noll's guide has good information on prerequisites such as password-less ssh that are needed for basic operation.
  • Change the JAVA_HOME variable in conf/hadoop-env.sh to point to your installation of Java 6.
  • The compute cluster you are running on usually provides a way to learn the hostnames of the compute nodes you are allocated once you submit a job. Choose one node as the master and add it to the conf/masters file. Add all nodes (including the one picked as master) to conf/slaves. These files are used by bin/start-dfs.sh and bin/start-mapred.sh to start the HDFS and MapReduce daemons respectively.
  • We also need to enter some basic configuration in conf/hadoop-site.xml. Once again, refer to Michael Noll's page for these settings.
  • If your compute nodes have some individual storage, and whatever individual storage is available is sufficient for your job, override the 'hadoop.tmp.dir' property to a path on the compute node; the default is /tmp/hadoop-${user.name}. Then simply run the following on the master node to get the cluster started: bin/hadoop namenode -format, then bin/start-dfs.sh and bin/start-mapred.sh.
  • If the compute nodes do not have sufficient storage for your job, we cannot start the HDFS daemons automatically with bin/start-dfs.sh: each DataNode (the slave daemon of HDFS) would use the same value for the hadoop.tmp.dir property and try to obtain a lock on that directory. Consequently, only one of them would succeed and all the other DataNode daemons would die. Unfortunately, the variable substitution allowed in Hadoop config files does not let the daemons pick different directories.
  • In such a scenario, we need to create a separate conf/hadoop-site.xml for each DataNode daemon (everything except the hadoop.tmp.dir property can be the same for all DataNodes) and start the daemons explicitly on the slave nodes with the bin/hadoop-daemon.sh script. Use bin/hadoop namenode -format on the master node to format HDFS. Then run "bin/hadoop-daemon.sh --config <config-dir> start namenode" on the master node to start the NameNode daemon there. Running "bin/hadoop-daemon.sh --config <config-dir> start datanode" on each slave starts the DataNode daemon on that node, picking up the hadoop-site.xml from the given config directory.
  • You can still use bin/start-mapred.sh to start the MapReduce daemons, the JobTracker and TaskTrackers.
  • Submit any MapReduce job.
All of these steps can be automated by using "ssh remotemachine command" to start the daemons remotely.
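The per-node configuration and remote daemon startup described above can be sketched as a single script. This is a minimal sketch, not a drop-in tool: HADOOP_HOME, CONF_ROOT, the node names, the NameNode/JobTracker ports (9000/9001), and the /scratch local path are all assumptions you would adapt to your cluster, and the ssh lines are shown commented out since they need a live cluster.

```shell
#!/bin/sh
# Sketch: generate one config directory per node on a shared drive, each
# overriding hadoop.tmp.dir to a node-local path, so every DataNode locks
# its own directory. All paths, hostnames, and ports here are assumptions.
HADOOP_HOME=${HADOOP_HOME:-/nfs/apps/hadoop}
CONF_ROOT=${CONF_ROOT:-/tmp/hadoop-confs}    # per-node conf dirs live here
MASTER=${MASTER:-node01}
SLAVES=${SLAVES:-"node01 node02 node03"}     # e.g. from the resource allocator

for node in $SLAVES; do
    confdir="$CONF_ROOT/$node"
    mkdir -p "$confdir"
    # Everything except hadoop.tmp.dir is identical across nodes.
    cat > "$confdir/hadoop-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://$MASTER:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>$MASTER:9001</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/scratch/$node/hadoop-tmp</value>
  </property>
</configuration>
EOF
done

# On a live cluster, the following (commented) lines would then format HDFS
# and start one daemon per node, each picking up its own config directory:
#   ssh "$MASTER" "$HADOOP_HOME/bin/hadoop --config $CONF_ROOT/$MASTER namenode -format"
#   ssh "$MASTER" "$HADOOP_HOME/bin/hadoop-daemon.sh --config $CONF_ROOT/$MASTER start namenode"
#   for node in $SLAVES; do
#       ssh "$node" "$HADOOP_HOME/bin/hadoop-daemon.sh --config $CONF_ROOT/$node start datanode"
#   done
```

Since the config directories live on the network file system, each node can read its own directory without any per-node installation.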


