Note: This is archived material from January 2013.

INFX 598 C/D: Introduction to Data Science

Winter 2013
University of Washington School of Information
Lectures: Tuesday and Thursday 1:30-3:20, MGH 420

Instructor: Prof. Joshua Blumenstock, Suite 370 Mary Gates Hall

Course Description

This course offers students an introduction to the growing field of "Data Science" as practiced by leading data scientists in industry and research. As “big data” becomes the norm in modern business and research environments, there is a growing demand for individuals who are able to make decisions and derive meaningful insight from large-scale, heterogeneous data. This requires a diverse mix of skills, from data munging and ETL; to distributed storage and computing; to machine learning and econometrics; to effective visualization and communication.

Through a combination of hands-on exercises and guest lectures by experts in the field, this course provides an overview of several key concepts, skills, and technologies used by practicing data scientists. While the lectures will be intelligible to a general audience, successful completion of assignments requires college-level exposure to statistics and programming.

Prerequisites

A data scientist is often referred to as someone who knows more statistics than a computer scientist and more computer science than a statistician. Students enrolled in the course must have college-level exposure to both statistics and programming. Students who do not meet the following requirements, and in particular the programming requirement, will find it difficult to satisfactorily complete problem sets, and should consider taking the course at a later date.

Programming: Students should be able to program comfortably in a high-level programming language like Java, Python, PHP, or C/C#/C++. Note that HTML, JavaScript, and VBA are not sufficient in this context. “Comfortably” implies that students should be able to write simple programs from scratch, like a web scraper, or a text parser, or a simple game of Scrabble or tic-tac-toe. Real-world programming experience (e.g. an intensive summer programming internship) will be as useful as formal CS coursework.

Statistics: Students should have had prior exposure to regression methods and should understand concepts of hypothesis testing and statistical significance. Experience with R or MATLAB will be especially valuable in completing the problem sets.

Course Outline:

Section 1: Introduction to Data Science

January 8: What is Data Science?

January 10: Industry Overview

Section 2: Identifying questions and developing an empirical framework

January 15: Developing an Empirical Framework

January 17: Business experiments

Section 3: Data capture, munging, storage, and organization

January 22: Data Capture and ETL

January 24: Storage and Organization: Databases, Scalable SQL and NoSQL

Section 4: Basic analytics

January 29: Distributions, t-tests, and the importance of basic statistics

January 31: How far can basic statistics get you?

February 5: Regression

Section 5: Advanced analytics

February 7: Machine Learning 1: Introduction

February 12: Machine Learning 2: Supervised Learning

February 14: Machine Learning 3: Unsupervised Learning

February 19: Social Network Analysis

Visualizing and Communicating Data

February 21: Visualizing Quantitative Information

February 26: Communicating Results Effectively

February 28: Flex session

Scaling to terabytes and petabytes

March 5: Scaling: What works and what doesn’t (and what might in the future)

March 7: Common applications and tools: MapReduce, Hadoop, Hive, and Alternatives

Perspectives from industry and academia

March 12: Group Project Presentations

March 14: The future of Data Science and Big Data

Grading

Students are required to read the material assigned each week before section, and will be expected to actively engage in discussion of these materials. Course grades will be based on a final group project, two problem sets, discussion leadership, and overall classroom participation. The two problem sets will be programming assignments, using R and Python. The final project will be done in groups of 3-4 students; further details will be provided at the start of the quarter.

Group final project: 40%
Technical problem set 1: 20%
Technical problem set 2: 20%
Class Discussion Leadership: 10%
Class Participation: 10%
Extra Credit: 5%
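
As a rough illustration of how the weights above combine, here is a minimal Python sketch. The function and variable names are hypothetical, and it assumes that extra credit is simply added on top of the 100% base; the syllabus does not state how extra credit is applied.

```python
# Weights (in percentage points) copied from the grading breakdown above.
WEIGHTS = {
    "group_project": 40,
    "problem_set_1": 20,
    "problem_set_2": 20,
    "discussion_leadership": 10,
    "participation": 10,
}
# Assumption: extra credit is additive, so the maximum possible total is 105.
EXTRA_CREDIT_WEIGHT = 5


def course_grade(scores, extra_credit=0.0):
    """Return the weighted course percentage.

    `scores` maps each component name in WEIGHTS to the fraction of
    credit earned (0.0 to 1.0); `extra_credit` is the fraction of the
    extra-credit review earned.
    """
    base = sum(weight * scores[name] for name, weight in WEIGHTS.items())
    return base + EXTRA_CREDIT_WEIGHT * extra_credit


if __name__ == "__main__":
    example = {
        "group_project": 0.9,
        "problem_set_1": 0.8,
        "problem_set_2": 0.7,
        "discussion_leadership": 1.0,
        "participation": 1.0,
    }
    print(course_grade(example, extra_credit=1.0))
```

Keeping the weights as whole percentage points makes it easy to check that the five required components sum to 100.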

Grading Policy

All assignments are to be submitted on Canvas by 12pm on the due date.
Assignments turned in *up to* 24 hours after the due date will be penalized 20%.
Any assignments turned in more than 24 hours late will receive no credit.
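
The late policy above can be sketched in Python as follows. This is only an illustration under two assumptions not stated in the syllabus: the 20% penalty is applied by multiplying the earned score by 0.8, and a submission exactly 24 hours late still counts as "up to 24 hours". The function name is hypothetical.

```python
from datetime import datetime, timedelta


def late_multiplier(due, submitted):
    """Return the fraction of the earned score that is kept.

    Assumptions: on-time work keeps full credit, work up to (and
    including) 24 hours late keeps 80%, and anything later keeps none.
    """
    if submitted <= due:
        return 1.0
    if submitted - due <= timedelta(hours=24):
        return 0.8
    return 0.0


if __name__ == "__main__":
    # Problem Set 1 is listed as due on February 7; assignments are due at 12pm.
    due = datetime(2013, 2, 7, 12, 0)
    for submitted in (
        datetime(2013, 2, 7, 11, 30),  # on time
        datetime(2013, 2, 8, 9, 0),    # less than 24 hours late
        datetime(2013, 2, 9, 9, 0),    # more than 24 hours late
    ):
        print(submitted, late_multiplier(due, submitted))
```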

A Note on Programming Languages

Most in-class code demonstrations will involve either R or Python. Guest lecturers may give examples in other languages but will explain what the code means. Homework assignments will generally require R or Python. If you would prefer to use a different language, you may do so, but you will be “on your own” and should not expect technical support.

Academic Integrity Policy

Discussion with instructors and classmates is encouraged, but each student must turn in individual, original work and cite sources where appropriate.

Readings and Class Discussion

Required and optional readings will be announced in class and posted on the course website. Students will be assigned by the instructor into groups of 2-4, and each group will be responsible for guiding classroom discussion for a given week (two classes). Unless otherwise noted by the instructor, a discussion will last roughly 10-20 minutes, and will consist of (i) a 5-10 minute summary of all required and optional readings for that week, and (ii) 10-15 minutes of open discussion, moderated by the discussion leaders. Discussion leaders are also responsible for posting a 500-1000 word entry on the class blog/wiki that summarizes the required and optional readings for the week, as well as any salient points from the in-class discussion. This post must be made by 12pm on the Friday following the second class.

Over the course of the quarter, we will host several prominent visiting speakers. On days when one or more speakers are present, there will be no discussion of the readings (in such instances, all readings for a week will be discussed on the day without a speaker). Instead, the discussion leaders will be responsible for researching the visiting speaker, meeting the speaker and walking him/her to the classroom, and moderating the classroom discussion and Q&A following the speaker’s presentation. Moderators should prepare several intelligent, challenging questions to ask the speaker immediately after the talk finishes, and if ever there is a lull in the discussion.

Assignments

Group Project (40%): Students will form groups of 2-4, and complete one of the following assignments. Each group must submit a 200-300 word description of the proposed project by 12pm on January 24.

Analysis of a dataset “in the wild”
Sectoral Analysis

Problem Set 1 (20 points): Programming Assignment 1: Data processing and basic analytics
The goal of this problem set is to familiarize yourself with implementing basic statistical analysis on a small dataset using R. It is important that you read Chapters 1-3 carefully, and work to understand exactly how the authors are using R to manipulate the underlying data. If you merely hack your way through the exercises without working through the material in the chapters, you will come away with a very incomplete understanding of how to perform such analysis.
Assignment: After reading Chapters 1-3 (pp 1-64) of Everitt & Hothorn, complete the following exercises (2 points each, 18 points total): 1.1, 1.2, 1.3, 1.4; 2.1, 2.3, 2.4; 3.1, 3.2. Extra credit (2 points): 3.3. Submit two files:

A polished .pdf with your solutions. 2 points will be given for overall presentation, so take care to appropriately label your graphs and tables. At the top of your submission, write your name and the names of any students with whom you worked, if applicable. Also write an estimate of how long, in hours, it took you to complete the exercise; this will not affect your grade and is used only for calibration.
Your commented R code as a plain text file

Problem Set 2 (20 points): Programming Assignment 2: Advanced analytics and visualization

Details TBD

Classroom Discussion Leadership (10%):

See description in above section, “Readings and Class Discussion”

Extra Credit (5%): Several popular books have been written on topics thematically relevant to this course. Excerpts from some of these books are required reading. For extra credit, write a short (400-800 words) review of one of the books listed below. The review should summarize the core points of the book and provide a critical analysis that draws attention to strengths and shortcomings of the book. Please do not review a book that you have previously read.

Davenport & Harris. Competing on Analytics
Davenport, Harris, & Morison. Analytics at Work
Ian Ayres. Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart
Steven Baker. The Numerati
Thomas Redman. Data Driven: Profiting from Your Most Important Business Asset
Sasha Issenberg. The Victory Lab
Nate Silver. The Signal and the Noise

Detailed Syllabus

Section 1: Introduction to Data Science

January 8: What is Data Science?

Readings:

Executive summary of: McKinsey Global Institute. Big data: The next frontier for innovation, competition, and productivity

January 10: Industry Overview

Guest Speaker: Bob Davis (General Manager, Office 365, Microsoft)
DUE: Background and interests (online quiz through Canvas)

Readings:

Thomas Davenport (2006). “Competing on Analytics”, Harvard Business Review, Jan. 2006, Vol. 84 Issue 1, pp. 99-107.
Chapter 1 of Provost & Fawcett: Data Science for Business

Section 2: Identifying questions and developing an empirical framework

January 15: Business experiments

Readings:

Andrew Gelman: “There are four ways to get fired from Caesars”
Anderson & Simester (2011). “A Step-By-Step Guide to Smart Business Experiments”, Harvard Business Review, pp. 99-105
Davenport (2009). “How to Design Smart Business Experiments”, Harvard Business Review pp. 69-76.
Ariely (2004). “Why Businesses Don’t Experiment”, Harvard Business Review, p. 34
Bertrand et al. (2009). “Does Ad Content Affect Consumer Demand?” Alliance, 14:3, p.18
INTRODUCTION TO: Bertrand M.; Karlan D.; Mullainathan S.; Shafir E.; Zinman J. (2010) “What's advertising content worth? Evidence from a consumer credit marketing field experiment” Quarterly Journal of Economics, 125(1) pp. 263-306
[Optional] Bertrand M.; Karlan D.; Mullainathan S.; Shafir E.; Zinman J. (2010) “What's advertising content worth? Evidence from a consumer credit marketing field experiment” Quarterly Journal of Economics, 125(1) pp. 263-306
[Optional] Selections from Gerber & Green: Field Experiments

January 17: Developing an Empirical Framework

Guest Speaker: Scott Golder (Staff Sociologist, Context Relevant)

Readings:

Chapter 2 of Provost & Fawcett: Data Science for Business
Whom the Gods Would Destroy, They First Give Real-time Analytics
[Optional] Chapter 4 of Bernard: Social Research Methods
[Optional] Alamar and Mehrotra, “Beyond ‘Moneyball’: The rapidly evolving world of sports analytics”

Section 3: Data capture, munging, storage, and organization

January 22: Data Capture and ETL

Guest Speaker: Andrew Borthwick (Principal Scientist and Director of Data Research, Intelius)

Readings:

Chapters 1 & 2 of Spector, Data Manipulation with R
"Duplicate Record Detection: A Survey", by Elmagarmid et al.
"Record Linkage: Similarity Measures and Algorithms" by Koudas et al.
[Optional] Andrew’s work at Intelius on blocking and on pairwise decision making.
Bibliographic database assignment? De-dup the authors

January 24: Storage and Organization: Databases, Scalable SQL and NoSQL

DUE: 1-paragraph group project description
Guest Speaker: Bill Howe (Director of Research, Scalable Data Analytics, eScience Institute and Affiliate Assistant Professor, Department of Computer Science and Engineering, UW)

Readings:

Chapters 1 & 2 of: A Handbook of Statistical Analyses Using R
Stonebraker et al. (2010). “MapReduce and Parallel DBMSs: Friends or Foes?” Online at http://database.cs.brown.edu/papers/stonebraker-cacm2010.pdf
Rick Cattell, “Scalable SQL and NoSQL Data Stores”, SIGMOD Record, December 2010 (39:4)
[Optional] Cohen et al. (2009) “MAD Skills: New Analysis Practices for Big Data”

Section 4: Basic analytics

January 29: Distributions, t-tests, and the importance of basic statistics

Readings:

Chapter 3 of: A Handbook of Statistical Analyses Using R
Chapters 1 & 2 of Freedman, Pisani, and Purvis: Statistics
[Optional] Chapters 6 & 19 of Freedman, Pisani, and Purvis: Statistics
[Optional] H. Stern: “Statistics and the College Football Championship,” The American Statistician, 2004.

January 31: How far can basic statistics get you?

Readings:

Chapters 1 & 2 of Freedman: Statistical Models: Theory and Practice
Watch http://www.gapminder.org/videos/the-joy-of-stats/
[Optional] Watch http://www.ted.com/talks/lies_damned_lies_and_statistics_about_tedtalks.html
[Optional] Excerpts from Huff: How to Lie With Statistics
[Optional] Chapters TBD of Torgo: Data Mining with R: Learning with Case Studies

February 5: Regression

Readings:

Chapters 3 & 4 of Provost & Fawcett: Data Science for Business
Chapter 6 of: A Handbook of Statistical Analyses Using R
[Optional] Excerpts from Kennedy: A Guide to Econometrics

Section 5: Advanced analytics

February 7: Machine Learning 1: Introduction

DUE: Problem Set 1

Readings:

Chapter 1 of: Hastie, Tibshirani, Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition.
Chapter 1 of: Mining of Massive Datasets.
[Optional] Haydn Shaughnessy, “How Semantic Clustering Helps Analyze Consumer Attitudes”

February 12: Machine Learning 2: Supervised Learning

Readings:

Chapter 8 of The Signal and the Noise: “Less and Less and Less Wrong” (Bayes’ Theorem)
[Optional] C. Haruechaiyasak: “A Tutorial on Naive Bayes Classification”, 2008.
[Optional] “Polonium: Tera-Scale Graph Mining and Inference for Malware Detection.” Duen Horng (Polo) Chau et al. Proceedings of SIAM International Conference on Data Mining (SDM) 2011. April 28-30, 2011. Mesa, Arizona
SK-Learn User Guide, Sections

February 14: Machine Learning 3: Unsupervised Learning

Readings:

Chapter 7 [Read 7.1, 7.2, 7.3] of: Mining of Massive Datasets
[Optional] Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein (1998). “Cluster analysis and display of genome-wide expression patterns” Proceedings of the National Academy of Sciences. Vol. 95 pp. 14863-14868
[Optional] Rajkumar Venkatesan (2007). “Cluster Analysis for Segmentation”, Darden Business Publishing

February 19: Social Network Analysis

Guest Speaker: TBD

Readings:

[Optional] N. Godbole et al.: “Large-scale sentiment analysis for news and blogs”, International Conference on Weblogs and Social Media, 2007.
[Optional] L. Page et al.: “The PageRank citation ranking: Bringing order to the web”, Stanford, 1999.
[Optional] Albert R, Jeong H, Barabasi AL. (1999): “Diameter of the World Wide Web”. Nature, 401:130-131

Visualizing and Communicating Data

February 21: Visualizing Quantitative Information

Guest Speaker: Jock Mackinlay (Senior Director, Visual Analysis, Tableau Software)

Readings:

Excerpts from Edward Tufte, “The Visual Display of Quantitative Information”
WSJ Guide to Information Visualization
[Optional] Excerpts from Beautiful Data

February 26: Communicating Results Effectively

Readings:

“How to Lie with Charts” and “How to Lie with Maps”
TBD

February 28: Text Mining

Readings:

Scaling to terabytes and petabytes

March 5: Scaling: What works and what doesn’t (and what might in the future)

Readings:

Comparing Pig Latin and SQL for Constructing Data Processing Pipelines
Cloudera overview

March 7: Common applications and tools: MapReduce, Hadoop, Hive, and Alternatives

Readings:

Chapter 2 (“Large-Scale File Systems and Map-Reduce”) of: Mining of Massive Datasets
[Optional] Hiemstra and Hauff, “MapReduce for Information Retrieval: Let’s Quickly Test This on 12 TB of Data”, In: M. Agosti et al. (Eds.): CLEF2010, pp. 64-69, 2010
[Optional] Dean and Ghemawat, “MapReduce: A Flexible Data Processing Tool”, Communications of the ACM. January 2010.

Perspectives from industry and academia

March 12: Group Project Presentations

Readings:

Review Chapters of: McKinsey Global Institute. Big data: The next frontier for innovation, competition, and productivity.
Group Project Reports

March 14: The Future and Ethics of Data Science and Big Data

Guest Speaker: Bob Davis (General Manager, Office 365, Microsoft)

Readings:

Chris Anderson, “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”, Wired magazine.
Counterpoint to Anderson TBD