Prerequisites:
Knowledge of a general-purpose programming language, such as Java, Python, Ruby, or C++, or concurrent enrollment in STSCI 4060; STSCI 5060 or basic SQL knowledge; STSCI 5010 or basic knowledge of SAS programming; STSCI 3520 or STSCI 4030 or basic knowledge of R programming.
Permission of instructor required.
Enrollment preference given to: students in the MPS Program in Applied Statistics.
This course covers the concepts, challenges, and industry trends of big data, along with its management and analysis using the Hadoop system. Topics include: basics of the Apache Hadoop platform and Hadoop ecosystem; the Hadoop Distributed File System (HDFS); MapReduce, a parallel programming model for distributed processing of large data sets, and its alternatives; common big data tools, such as Pig (a procedural data-flow language for parallel data processing on Hadoop), Hive (a declarative SQL-like language for querying data stored in Hadoop), HBase (a widely used column-oriented NoSQL database built on HDFS), and YARN (Hadoop's cluster resource manager); case studies; and integration of Hadoop with statistical software packages, e.g., SAS and R.