

Showing posts from May, 2009

Using Hadoop on a Cluster

Hadoop is a framework for MapReduce distributed computations. MapReduce performs data decomposition: it runs the same "function" in parallel on different parts of a huge data set (the 'map' step) and then combines the results of those independent computations (the 'reduce' step). Hadoop is a Java implementation of MapReduce. It automatically splits the data set and spawns map and reduce tasks that run on different physical machines. It also provides a high-performance file system, the Hadoop Distributed File System (HDFS), to enable high-performance I/O for the map and reduce tasks.

In this post, we will look specifically at ways to get Hadoop running on compute clusters that use a network file system. The Hadoop Cluster Setup guide explains how to set up Hadoop to run on a compute cluster. Michael Noll's guide delves deeper and explains the bare-bones configuration needed to set up a multi-node Hadoop cluster. But both of them assume that the cluster mac
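To make the map/reduce decomposition concrete, here is a minimal single-machine sketch in plain Java: no Hadoop APIs, and the class and method names are illustrative, not part of Hadoop. It splits the input into chunks, counts words per chunk (the 'map' step), and merges the partial counts (the 'reduce' step) — Hadoop's value is doing exactly this, but with the chunks and tasks spread across machines and HDFS.

```java
import java.util.*;

public class MapReduceSketch {
    // "Map" step: count words in one chunk of the data set.
    // In Hadoop, each such call would run as a task on some cluster node.
    static Map<String, Integer> mapChunk(List<String> chunk) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : chunk)
            for (String word : line.split("\\s+"))
                if (!word.isEmpty())
                    counts.merge(word, 1, Integer::sum);
        return counts;
    }

    // "Reduce" step: merge the independent partial counts into a total.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> p : partials)
            p.forEach((word, count) -> total.merge(word, count, Integer::sum));
        return total;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("a b a", "b c", "a c c");
        // Hadoop would split the data and run the maps on different machines;
        // here we just split it in two and run them sequentially.
        List<Map<String, Integer>> partials = Arrays.asList(
                mapChunk(data.subList(0, 2)),
                mapChunk(data.subList(2, 3)));
        System.out.println(reduce(partials)); // prints {a=3, b=2, c=3}
    }
}
```

Because each `mapChunk` call touches only its own chunk and its own output map, the map calls are independent — which is what lets Hadoop schedule them on separate machines without coordination until the reduce step.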