Practicum – Week 2 Journal Entry

Objectives

1. Gain in-depth, hands-on experience with big data tools (Hive, Spark RDDs, and Spark SQL).

2. Solve challenging big data processing tasks by finding highly efficient solutions; very inefficient solutions will lose marks.

3. Experience processing three different types of real data:

a. Standard multi-attribute data (video game sales data)

b. Time series data (Twitter feed)

c. Bag-of-words data

4. Practice reading API documentation to find the best API calls for your problem. The API descriptions for Hive, Spark, and Spark SQL are listed below. For Spark, look especially under RDD, which has many useful API calls.

a) [Hive] https://cwiki.apache.org/confluence/display/Hive/LanguageManual

b) [Spark] http://spark.apache.org/docs/latest/api/scala/index.html#package

c) [Spark SQL] https://spark.apache.org/docs/latest/sql-programming-guide.html

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset

https://spark.apache.org/docs/latest/api/sql/index.html

· Hint: If you are not sure what a Spark API call does, write a small example and try it in the Spark shell.
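
As an illustration of the hint, here is a minimal sketch of probing a single RDD call (`reduceByKey`) on a tiny made-up dataset. The data, the application name `api-probe`, and the variable names are assumptions for illustration only and are not part of the assignment. Inside the Spark shell a session already exists, so `getOrCreate()` should simply return it.

```scala
import org.apache.spark.sql.SparkSession

// In the Spark shell a session already exists; getOrCreate() reuses it.
// Outside the shell this starts a local session (assumed setup).
val spark = SparkSession.builder()
  .appName("api-probe")   // hypothetical name, for illustration
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Tiny made-up dataset of (key, value) pairs.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// The call being probed: reduceByKey merges the values that share a key.
val summed = pairs.reduceByKey(_ + _)

// collect() is safe here only because the result is tiny.
// Sort by key so the printed order is stable.
summed.collect().sortBy(_._1).foreach(println)
// prints:
// (a,4)
// (b,2)
```

Trying a call on three or four hand-written records like this makes it easy to check the result by eye before running it on the full data files.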

Submission checklist

· Ensure that all of your solutions read their input from the full data files (not the small example versions)

· Check that all of your solutions run without crashing on the CloudxLab interface.

· Delete all output files

· Zip the whole assignment folder and submit via LMS

Copying, Plagiarism

This is an individual assignment. You are not permitted to work as part of a group when completing this assignment.

Plagiarism is the submission of somebody else’s work in a manner that gives the impression that the work is your own. For individual assignments, plagiarism includes the case where two or more students work collaboratively on the assignment. The Department of Computer Science and Computer Engineering treats plagiarism very seriously. When it is detected, penalties are strictly imposed.

Expected quality of solutions