Import a table into the Hadoop ecosystem using Pig and Hive. Define the purpose of Pig and Hive, and explain your understanding of the process and what it is used for in a Big Data environment. Be sure to include all supporting screenshots as necessary.
Step 1 – Reading

  1. Read Chapter 5: MapReduce Details for Multimachine Clusters (in “Pro Hadoop”; Books24x7).

 
Step 2 – Task 3 – Processing data using Pig and Hive

  1. Hive

**Please ensure that all required services are running in Cloudera Manager before starting this task.**

  1. Open a terminal window and type the following:

$ nano employees                         (this opens a text editor with the filename employees)
 
Enter the following records, one per line, with no spaces after the commas (stray spaces would keep the numeric columns from parsing cleanly in Hive):
Mary,7038121129,VA,42000
Tim,3145558989,TX,86000
Bob,3429091122,MN,75500
Manisha,7096664242,WV,94200
Aditya,2021119765,CA,39000
Xinwuei,4129098787,OH,57600
 
Press CTRL+X and follow the prompts to save changes. Then copy the file into HDFS:
$ hadoop fs -put employees
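(Optional) You can confirm that the file actually landed in HDFS before going further. Assuming the Cloudera quickstart VM's default home directory of /user/cloudera:

$ hadoop fs -ls /user/cloudera          (the employees file should appear in the listing)
$ hadoop fs -cat employees              (prints the six records back from HDFS)

When the file is in place, start the Hive shell: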
$ hive
 

  2. Please enter the following commands in Hive (a worked example follows the list):

hive> CREATE DATABASE databasename;          (where databasename can be ANY name)
hive> SHOW DATABASES;
hive> USE databasename;          (switch to the new database so the table is created inside it)
hive> CREATE TABLE tablename (colname1 datatype, colname2 datatype, colname3 datatype, . . .) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';          (where tablename can be ANY name, each datatype must be appropriate to the column data, and each column name must reflect the column contents)
hive> DESCRIBE tablename;          (where tablename is the name of the table you have just created)
hive> LOAD DATA INPATH 'employees' INTO TABLE tablename;          (this moves the employees file from HDFS into the table)
hive> SELECT * FROM tablename;          (this shows the contents of the table)
hive> SELECT count(*) FROM tablename;          (this counts the number of rows in the table)
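For reference, a worked version of this session might look like the following. The database name staff, the table name emp, and the column names and types are example choices, not required ones (note that the ten-digit phone numbers overflow INT, so BIGINT is the safer type for that column):

hive> CREATE DATABASE staff;
hive> SHOW DATABASES;
hive> USE staff;
hive> CREATE TABLE emp (name STRING, phone BIGINT, state STRING, salary INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> DESCRIBE emp;
hive> LOAD DATA INPATH 'employees' INTO TABLE emp;
hive> SELECT * FROM emp;          (should return the six rows entered earlier)
hive> SELECT count(*) FROM emp;          (should return 6)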
 

  3. Please discuss the process you have just completed and upload your results with related screenshots as needed.

  2. Pig

  1. In a terminal, use nano to create two text files named “file1” and “file2” with the following values, respectively:

 
file1:
10,30,0
0,5,10
10,20,10

file2:
5,1,10
10,5,0
20,20,20
 
Press CTRL+X and follow the prompts to save each file. Then copy both files into HDFS and start Pig:
$ hadoop fs -put file1
$ hadoop fs -put file2
$ pig
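By default, the pig command starts the Grunt shell in MapReduce mode, reading and writing HDFS. Pig also supports a local mode if you want to experiment against the local filesystem first, though this task assumes the default:

$ pig -x local          (optional: runs Pig against local files instead of the cluster)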

  2. In Pig, enter the following commands (expected results for each relation are sketched after the list):

grunt> A = LOAD '/user/cloudera/file1' USING PigStorage(',') AS (a1:int,a2:int,a3:int);
grunt> DUMP A;          (relation names are case-sensitive in Pig, so use A here, not a)
grunt> B = LOAD '/user/cloudera/file2' USING PigStorage(',') AS (b1:int,b2:int,b3:int);
grunt> DUMP B;
grunt> DESCRIBE A;
grunt> DESCRIBE B;
grunt> C = UNION A, B;
grunt> DUMP C;
grunt> SPLIT C INTO D IF $0 == 0, E IF $0 == 10;
grunt> DUMP D;
grunt> DUMP E;
grunt> F = FILTER C BY $1 > 5;
grunt> DUMP F;
grunt> ILLUSTRATE F;
grunt> G = GROUP C BY $2;
grunt> DUMP G;
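As a sanity check, the contents of each relation can be worked out by hand from the two input files (Pig does not guarantee tuple order, so your DUMP output may list rows in a different sequence):

C (UNION of A and B):         (10,30,0), (0,5,10), (10,20,10), (5,1,10), (10,5,0), (20,20,20)
D (rows of C with $0 == 0):   (0,5,10)
E (rows of C with $0 == 10):  (10,30,0), (10,20,10), (10,5,0)
F (rows of C with $1 > 5):    (10,30,0), (10,20,10), (20,20,20)
G (C grouped by $2):          (0,{(10,30,0),(10,5,0)}), (10,{(0,5,10),(10,20,10),(5,1,10)}), (20,{(20,20,20)})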

  3. Please discuss the operations you performed in Pig and explain the relational logic behind each result (UNION, SPLIT, FILTER, GROUP).

 
Step 3 – Task 3 – Report
 
Write a report (4-6 pages) that includes:

  • A cover page and table of contents following APA standards.
  • A short research report on the other components of the Hadoop platform: Hive and Pig.
  • The file-creation and data-loading steps, with a discussion of your understanding of the process and its purpose, along with supporting screenshots.
  • The Hive and Pig results you generated, along with supporting screenshots.

 
