Practicum – Week 2 Journal Entry
Objectives
1. Gain in-depth, hands-on experience with big data tools (Hive, Spark RDDs, and Spark SQL).
2. Solve challenging big data processing tasks by finding highly efficient solutions; very inefficient solutions will lose marks.
3. Gain experience processing three different types of real data:
a. Standard multi-attribute data (video game sales data)
b. Time series data (Twitter feed)
c. Bag of words data
4. Practice using programming APIs to find the best API calls to solve your problem. The API documentation for Hive, Spark, and Spark SQL is listed below. For Spark, look especially under RDD, which offers many useful API calls.
a) [Hive] https://cwiki.apache.org/confluence/display/Hive/LanguageManual
b) [Spark] http://spark.apache.org/docs/latest/api/scala/index.html#package
c) [Spark SQL] https://spark.apache.org/docs/latest/sql-programming-guide.html
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
https://spark.apache.org/docs/latest/api/sql/index.html
· Hint: If you are not sure what a Spark API call does, write a small example and try it in the Spark shell.
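As an illustration of the hint above, here is a minimal sketch of probing one RDD call (`reduceByKey`) on a tiny hand-made dataset. The platform names and sales figures are made up for illustration and are not taken from the assignment data. Inside the Spark shell, `spark` and `sc` already exist, so the builder line is only needed when running elsewhere; `local[*]` and the app name are assumptions.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists and getOrCreate() simply returns it.
// master("local[*]") and the app name are assumptions for running elsewhere.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("api-probe")
  .getOrCreate()
val sc = spark.sparkContext

// Tiny, made-up (platform, sales) pairs; small enough to check by hand.
val sales = sc.parallelize(Seq(("Wii", 10.0), ("PS2", 4.0), ("Wii", 2.5)))

// reduceByKey merges all values that share a key using the given function.
val totals = sales.reduceByKey(_ + _).collect().toMap

// Print the per-key sums so they can be compared with a hand calculation.
println(totals)
```

With an input this small, the result can be verified by hand (Wii should sum to 12.5 and PS2 to 4.0) before the same call is applied to the full data files.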
Submission checklist
· Ensure that all of your solutions read their input from the full data files (not the small example versions).
· Check that all of your solutions run without crashing on the CloudxLab interface.
· Delete all output files.
· Zip the whole assignment folder and submit via LMS.
Copying, Plagiarism
This is an individual assignment. You are not permitted to work as part of a group when writing this assignment.
Plagiarism is the submission of somebody else’s work in a manner that gives the impression that the work is your own. For individual assignments, plagiarism includes the case where two or more students work collaboratively on the assignment. The Department of Computer Science and Computer Engineering treats plagiarism very seriously. When it is detected, penalties are strictly imposed.
Expected quality of solutions