
Using Hadoop on a Cluster

Hadoop is a framework for MapReduce distributed computations. MapReduce performs data decomposition, i.e., it runs the same function in parallel on different parts of a huge data set ('map') and then combines the results of those independent computations ('reduce'). Hadoop is a Java implementation of MapReduce. It automatically splits the data set and spawns map and reduce tasks that run on different physical machines. It also provides a distributed file system, the Hadoop Distributed File System (HDFS), to enable high-performance I/O for those tasks.
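To make the map and reduce steps concrete, here is a minimal single-process sketch in Python. It is not Hadoop code and does not use Hadoop's API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration. It only shows the shape of the computation: map each chunk independently, group the emitted pairs by key, then reduce each group.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of the input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the emitted values by key (a framework does this between map and reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk is mapped independently of the others; on a cluster these
    # calls would run on different machines.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    chunks = ["the quick brown fox", "the lazy dog", "the fox"]
    print(word_count(chunks))
```

Because no call to map_phase depends on any other chunk, the chunks can be processed on different machines, which is exactly what Hadoop arranges for us.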

In this post, we will look specifically at ways to get Hadoop running on compute clusters with a network file system. The Hadoop Cluster Setup guide explains how to set up Hadoop on a compute cluster. Michael Noll's guide goes deeper and explains the bare-bones configuration needed to set up a multi-node Hadoop cluster.

However, both assume that the cluster machines have sizeable local storage and that a Hadoop installation can be unpacked individually on every machine in the cluster.

Often, in real-world clusters:
  • There is very limited storage space per compute node but vast shared network drives. Hence, we need to store the Hadoop installation on such a drive.
  • Not all users need Hadoop. Hence, unpacking Hadoop on all machines or running Hadoop daemons continuously on all nodes may be unnecessary, quite apart from the burden of maintaining the individual installations.
  • Individual users who need Hadoop will often run their jobs on a subset of the nodes in the cluster, chosen by a dynamic resource allocator.
Hence, we need to be able to start Hadoop dynamically on only the allocated nodes and run jobs on them, with a single Hadoop installation sitting on a network drive.

We can accomplish this manually as follows, with the bare minimum of configuration (Hadoop has a deluge of configuration parameters). All the steps described can be automated.
  • Download and unpack Hadoop to a location in your network file system. Let us call this location HADOOP_HOME. All scripts referred to below are relative to HADOOP_HOME. Michael Noll's guide has good information on configuring things such as password-less ssh, which are needed for basic operation.
  • Change the JAVA_HOME variable in conf/ to point to your installation of Java 6.
  • Once you submit a job, the compute cluster you are running on usually provides a way to find out the hostnames of the compute nodes you have been allocated. Choose one node as the master node and add it to the conf/masters file. Add all nodes (including the one picked as master) to conf/slaves. These files are used by bin/ and bin/ to start the HDFS and MapReduce daemons respectively.
  • We also need to enter some basic configuration in conf/hadoop-site.xml. Once again, refer to Michael Noll's page for these settings.
  • If the local storage available on your compute nodes is sufficient for your job, override the 'hadoop.tmp.dir' property to point to a path on the compute node. The default is /tmp/hadoop-${}. Then simply run the following commands on the master node to get the cluster started: bin/hadoop namenode -format, bin/, bin/
  • If the compute nodes do not have sufficient storage for your job, hadoop.tmp.dir has to point to the network drive, and then the HDFS daemons cannot be started automatically with the start script. Each DataNode (the slave daemon of HDFS) will use the same value for the hadoop.tmp.dir property and try to obtain a lock on that directory. Consequently, only one of them will succeed and all the other DataNode daemons will die. Unfortunately, the variable substitutions allowed in Hadoop config files do not let us make the daemons pick different directories.
  • In such a scenario, we need to create a separate conf/hadoop-site.xml configuration file for the DataNode daemon on each slave (everything except the hadoop.tmp.dir property can be the same for all DataNodes) and start the daemons explicitly by running the bin/ script on each of those nodes. Use bin/hadoop namenode -format on the master node to format HDFS. Use "bin/ --config start namenode" on the master node to run the NameNode daemon on it. On a slave node, "bin/ --config start datanode" will start the DataNode daemon on that node, picking up the hadoop-site.xml config file from the directory passed to --config.
  • You can still use bin/ to start the MapReduce daemons: TaskTracker and JobTracker.
  • Submit any MapReduce job.
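The file-writing parts of the steps above can be sketched in Python. This is a rough illustration under stated assumptions, not a tested deployment script: the hostnames (node01 and so on), the conf-<host> directory naming, the /shared/hadoop-tmp path and the fs.default.name value are all hypothetical, and the helper names are made up. It writes conf/masters and conf/slaves from the list of allocated nodes, then creates one config directory per node whose hadoop-site.xml differs only in hadoop.tmp.dir.

```python
import os
import tempfile
from xml.sax.saxutils import escape


def write_host_files(conf_dir, hosts):
    """Write conf/masters (first host) and conf/slaves (all hosts, master included)."""
    os.makedirs(conf_dir, exist_ok=True)
    with open(os.path.join(conf_dir, "masters"), "w") as f:
        f.write(hosts[0] + "\n")
    with open(os.path.join(conf_dir, "slaves"), "w") as f:
        f.write("\n".join(hosts) + "\n")


def site_xml(properties):
    """Render a hadoop-site.xml document from a dict of property names to values."""
    lines = ['<?xml version="1.0"?>', "<configuration>"]
    for name, value in properties.items():
        lines += ["  <property>",
                  "    <name>{}</name>".format(escape(name)),
                  "    <value>{}</value>".format(escape(value)),
                  "  </property>"]
    lines.append("</configuration>")
    return "\n".join(lines) + "\n"


def write_per_node_configs(base_dir, hosts, shared_props, tmp_root):
    """Create one config directory per host; only hadoop.tmp.dir differs between them."""
    conf_dirs = {}
    for host in hosts:
        node_conf = os.path.join(base_dir, "conf-" + host)  # hypothetical naming scheme
        os.makedirs(node_conf, exist_ok=True)
        props = dict(shared_props)
        # Give every DataNode its own directory so they do not fight over one lock.
        props["hadoop.tmp.dir"] = "{}/{}".format(tmp_root, host)
        with open(os.path.join(node_conf, "hadoop-site.xml"), "w") as f:
            f.write(site_xml(props))
        conf_dirs[host] = node_conf
    return conf_dirs


if __name__ == "__main__":
    hosts = ["node01", "node02", "node03"]  # hypothetical allocated nodes
    shared = {"fs.default.name": "hdfs://node01:54310"}  # illustrative value only
    with tempfile.TemporaryDirectory() as hadoop_home:  # stand-in for HADOOP_HOME
        write_host_files(os.path.join(hadoop_home, "conf"), hosts)
        dirs = write_per_node_configs(hadoop_home, hosts, shared, "/shared/hadoop-tmp")
        for host in hosts:
            print(host, "->", dirs[host])
```

A real per-node config directory would probably also need copies of the other files from conf/, since the daemon reads its whole configuration from the directory given to --config. The sketch only covers hadoop-site.xml.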
All these tasks can be automated by using "ssh remotemachine command" to start the daemons remotely.
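As a hedged sketch of that automation, the Python below builds one ssh command line per node that starts a DataNode with its own config directory. The daemon script name is not given in the text above, so it is left as the placeholder bin/DAEMON_SCRIPT and must be replaced with the real script from your Hadoop installation. The hostnames and paths are hypothetical as well. By default the function only prints the commands (dry run) instead of executing them.

```python
import shlex
import subprocess


def remote_command(host, hadoop_home, script, conf_dir, action, daemon):
    """Build the argv for: ssh host 'cd HADOOP_HOME && script --config conf_dir action daemon'."""
    remote = "cd {} && {} --config {} {} {}".format(
        shlex.quote(hadoop_home), shlex.quote(script), shlex.quote(conf_dir),
        shlex.quote(action), shlex.quote(daemon))
    return ["ssh", host, remote]


def start_datanodes(conf_dirs, hadoop_home, script, dry_run=True):
    """Start one DataNode per host, each with its own config directory."""
    commands = [remote_command(host, hadoop_home, script, conf, "start", "datanode")
                for host, conf in sorted(conf_dirs.items())]
    for argv in commands:
        if dry_run:
            # Only show what would be run.
            print(" ".join(shlex.quote(arg) for arg in argv))
        else:
            # Requires password-less ssh to every host.
            subprocess.run(argv, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical hosts and paths; bin/DAEMON_SCRIPT is a placeholder, not a real script name.
    conf_dirs = {"node02": "/shared/hadoop/conf-node02",
                 "node03": "/shared/hadoop/conf-node03"}
    start_datanodes(conf_dirs, "/shared/hadoop", "bin/DAEMON_SCRIPT")
```

Quoting each argument with shlex.quote keeps paths with spaces or shell metacharacters from being misread by the remote shell.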


