Thursday, January 12, 2012

Running Hadoop Pig Scripts with MapR Demo VM


What is Hadoop Pig?

Apache Pig is a platform for analyzing large data sets. Pig's language, Pig Latin, lets you specify a sequence of data transformations such as merging data sets, filtering them, and applying functions to records or groups of records. Pig comes with many built-in functions, but you can also write your own user-defined functions for special-purpose processing.

Pig Latin programs run in a distributed fashion on a cluster (programs are compiled into Map/Reduce jobs and executed using Hadoop). For quick prototyping, Pig Latin programs can also run in "local mode" without a cluster (all processing takes place in a single local JVM).

How to install?

Download VMware Player at
http://downloads.vmware.com/d/info/desktop_downloads/vmware_player/3_0
Download MapR Demo VM at
http://package.mapr.com/releases/v1.2.0/vmdemo/MapR-VM-1.2.0.12140GA-1-m3.tar.bzip2

1. Extract the archive using
bunzip2 MapR-VM-1.2.0.12140GA-1-m3.tar.bzip2 (this prints some errors, but you can ignore them)
tar -xvf MapR-VM-1.2.0.12140GA-1-m3.tar (if the output file ends in tar.out rather than .tar, that is still fine; pass that file name to tar)
2. After untarring, transfer the extracted files to a Windows machine
3. Install VMware Player and start the Demo VM


How to Run it?

local mode
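
In local mode Pig runs in a single JVM and reads its input from the local filesystem. A minimal sketch of the invocation, assuming the Pig script from this post has been saved under the hypothetical name failed_students.pig and the Unit1 input file is in the current directory:

```shell
# Hypothetical file name for the Pig script; substitute your own.
SCRIPT=failed_students.pig

# -x local selects local mode: one JVM, local filesystem, no cluster needed.
if command -v pig >/dev/null 2>&1 && [ -f "$SCRIPT" ]; then
    pig -x local "$SCRIPT"
else
    echo "pig or $SCRIPT not found; nothing was run" >&2
fi
```

The guard only keeps the snippet from failing on a machine where Pig or the script is missing; on the Demo VM the single pig command is what matters.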


mapreduce mode
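
In mapreduce mode Pig compiles the script into Map/Reduce jobs and reads its input from the cluster filesystem, so the input file has to be copied there first. A rough sketch, assuming the hypothetical script name failed_students.pig, a local input file named Unit1, and that the pig and hadoop commands are on the PATH of the Demo VM:

```shell
# Hypothetical file names; substitute your own.
SCRIPT=failed_students.pig
INPUT=Unit1

if command -v pig >/dev/null 2>&1 && command -v hadoop >/dev/null 2>&1 \
        && [ -f "$SCRIPT" ] && [ -f "$INPUT" ]; then
    # Copy the input into the cluster filesystem. A relative destination
    # resolves against the current user's home directory on the cluster.
    hadoop fs -put "$INPUT" "$INPUT"
    # -x mapreduce selects cluster execution (this is also Pig's default mode).
    pig -x mapreduce "$SCRIPT"
else
    echo "pig, hadoop, $SCRIPT or $INPUT not found; nothing was run" >&2
fi
```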


Problem to Solve

The input file holds each student's marks in Unit Test 1 for the subjects Telugu, Hindi, English, Maths, Science and Social, in that order. The class teacher wants to find all the students who failed. A student counts as failed when any one of the six marks is below 35.
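
To try this out you need an input file in that layout. The snippet below creates a small one; the three names and all the marks are made up purely for illustration and are not the data used in this post. The file name Unit1 matches the name the Pig script loads.

```shell
# Made-up sample rows, tab-separated:
# Name, Telugu, Hindi, English, Maths, Science, Social
printf 'Ravi\t78\t65\t82\t91\t74\t69\n'   > Unit1
printf 'Sita\t45\t30\t55\t60\t48\t52\n'  >> Unit1
printf 'Kiran\t88\t92\t79\t34\t85\t90\n' >> Unit1
```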



Pig Script to solve this

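-- Load the tab-separated marks file: student name followed by the six subject marks.
A = LOAD 'Unit1' USING PigStorage('\t') AS (Name:chararray, Telugu:int, Hindi:int, English:int, Maths:int, Science:int, Social:int);

-- Keep every student who scored below 35 in at least one subject.
B = FILTER A BY Telugu < 35 OR Hindi < 35 OR English < 35 OR Maths < 35 OR Science < 35 OR Social < 35;

-- Print the failed students to the console.
DUMP B;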
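
For a small file, the filter condition can be sanity-checked outside Pig. The sketch below applies the same "any mark below 35" test with awk. It is only a cross-check, not a replacement for the Pig script, and the sample rows it creates (when no Unit1 file exists yet) are made up for illustration.

```shell
# Create a made-up sample file only if there is no Unit1 yet.
[ -f Unit1 ] || printf 'Ravi\t78\t65\t82\t91\t74\t69\nSita\t45\t30\t55\t60\t48\t52\nKiran\t88\t92\t79\t34\t85\t90\n' > Unit1

# Fields 2 to 7 are the six marks; print a row if any of them is below 35.
awk -F'\t' '$2 < 35 || $3 < 35 || $4 < 35 || $5 < 35 || $6 < 35 || $7 < 35' Unit1 > failed.txt
cat failed.txt
```

With the made-up rows, Sita (Hindi 30) and Kiran (Maths 34) are printed, and Ravi is not.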


Here is the Output of the Pig Script