INFO251 Spring 2021
(all readings and lecture recordings are available on bCourses)
January 19: Introduction
- Introductions
- Nuts and bolts of the class: structure, homework, policies, learning objectives
Required Readings (students are already expected to have this level of familiarity with Python):
- Chapters 3-5 and Chapter 9 of McKinney (2013): Python for Data Analysis. O’Reilly Media, Inc.
- Install Python, IPython, and the numerical analysis libraries on your laptop and bring it to class. I highly recommend the Anaconda distribution; if you prefer to assemble the packages yourself, make sure you have python, ipython notebook, numpy, scipy, and matplotlib.
- Read and complete at least the "Introduction" to the following Python tutorial: http://interactivepython.org/courselib/static/pythonds/index.html
- Watch 10-minute tour of pandas: https://www.youtube.com/watch?v=dcqPhpY7tWk
- Strongly recommended: Read and complete lessons 1-7 of Learn Pandas (https://bitbucket.org/hrojas/learn-pandas)
January 20 (Lab): NO LAB TODAY
- There is no lab section on Jan 20; please do not show up!
January 21: Experimental Methods for Causal Inference
- A-B testing, Business Experiments, Randomized Control Trials
- Counterfactuals and Control Groups
- Correlation and Causation
- Experimental design and statistical power
Required Readings:
- Chapters 2-3 of Khandker et al. (2010), “Handbook on Impact Evaluation”
- Introduction (pp. 263-269) to: Bertrand et al. (2010) “What's advertising content worth? Evidence from a consumer credit marketing field experiment” Quarterly Journal of Economics, 125(1), pp. 263-306
Optional Readings:
- Pages 1-47 of: E. Duflo, M. Kremer and R. Glennerster (2006). "Using Randomization in Development Economics Research: A Toolkit"
- Athey & Imbens (2016) The Econometrics of Randomized Experiments
- Lin, M., Lucas, H.C., Shmueli, G., 2013. Research Commentary—Too Big to Fail: Large Samples and the p-Value Problem. Information Systems Research 24, 906–917. doi:10.1287/isre.2013.0480
- Anderson & Simester (2011). “A Step-By-Step Guide to Smart Business Experiments”, Harvard Business Review, pp. 99-105
- Ariely (2004). “Why Businesses Don’t Experiment”, Harvard Business Review, p. 34
- Kohavi, R., Longbotham, R., Sommerfield, D. and Henne, R. Controlled experiments on the Web: Survey and practical guide. Data Mining and Knowledge Discovery 18 (2009), 140–181.
- Reiley, D., Rao, J.M. & Lewis, R.A. (2011) Here, There, and Everywhere: Correlated Online Behaviors Can Lead to Overestimates of the Effects of Advertising. WWW 2011.
January 26: Impact Evaluation
- Research designs for impact evaluation
- Identifying assumptions
- Difference-in-Differences
Required Readings:
- Sections 1-3 of Schultz: School subsidies for the poor
- Varian, H.R., 2016. Causal inference in economics and marketing. PNAS 113, 7310–7315. doi:10.1073/pnas.1510479113
Optional Readings:
- David Albouy: Lecture notes on Differences in Differences Estimation
- Lewis, R., Rao, J.M. & Reiley, D.H. (2015) Measuring the effects of advertising: The digital frontier. In: Economic Analysis of the Digital Economy. University of Chicago Press. pp. 191–218.
- Jensen, R., 2007. The Digital Provide: Information (Technology), Market Performance, and Welfare in the South Indian Fisheries Sector. The Quarterly Journal of Economics 122, 879–924.
January 27 (Lab): Python and Pandas
- Programming paradigms
- Working with data
- Crash course in python
January 28: Regression and Impact Evaluation
- Regression and causal inference
- Interactions and heterogeneity
- Fixed and random effects
Required Readings:
- Chapter 5 of Khandker et al. (2010), “Handbook on Impact Evaluation”
- Lecture notes on “Fixed Effects Models”
Optional Readings:
- A more systematic treatment: Gerber, A.S., Green, D.P., 2012. Field Experiments: Design, Analysis, and Interpretation. W. W. Norton & Company, New York.
February 2: Non-Experimental Methods for Causal Inference
- Instrumental Variables
Required Readings:
- Chapter 6 of Khandker et al. (2010), “Handbook on Impact Evaluation”
Optional Readings:
- Chapter 10 of Stock & Watson (2010) on “Instrumental Variables”
- Angrist, J.; Krueger, A. (2001). "Instrumental Variables and the Search for Identification: From Supply and Demand to Natural Experiments". Journal of Economic Perspectives 15(4): 69–85.
- Duflo (2001). Schooling and Labor Market Consequences of School Construction in Indonesia: Evidence from an Unusual Policy Experiment
- A more systematic treatment: Kennedy, P., 2008. A Guide to Econometrics. 6th ed. Wiley-Blackwell, Malden, MA.
- Alexandre Belloni, Victor Chernozhukov, and Christian Hansen (2011): “LASSO Methods for Gaussian Instrumental Variables Models.” arXiv:1012.1297 [stat.ME], http://arxiv.org/abs/1012.1297
- Jason Hartford, Greg Lewis, Kevin Leyton-Brown, Matt Taddy (2017): “Deep IV: A Flexible Approach for Counterfactual Prediction.” Proceedings of the 34th International Conference on Machine Learning, PMLR 70, 1414–1423
February 3 (Lab): Regression and Hypothesis Testing
- T-tests and regressions with Python
- Dummy variables, interaction terms, and fixed effects
- Instrumental variables
February 4: Non-Experimental Methods for Causal Inference, continued
- Regression discontinuity
Required Readings:
- Chapter 7 of Khandker et al. (2010), “Handbook on Impact Evaluation”
Optional Readings:
- Read a simplified example RD analysis in Python
- Buddelmeyer & Skoufias (2004). An Evaluation of the Performance of Regression Discontinuity Design on PROGRESA.
- Solis, A., 2017. Credit Access and College Enrollment. Journal of Political Economy 125, 562–622. doi:10.1086/690829
February 9: Intro to Machine Learning
- Supervised and unsupervised learning
- Representation
- Evaluation
- Optimization
- Generalization and overfitting
- Training and test data
- Cross-validation and bootstrapping
- Evaluation and baselines
- Features and feature selection
Required Readings:
- Chapters 1 & 2 of Daume (in preparation). A course in machine learning
- Chapter 5 of Witten, Frank, Hall: Data Mining
Optional Readings:
- Mullainathan, S., Spiess, J., 2017. Machine Learning: An Applied Econometric Approach. Journal of Economic Perspectives 31, 87–106. https://doi.org/10.1257/jep.31.2.87
- Syed, A. (2011). A review of cross validation and adaptive model selection.
February 10 (Lab): Computational Efficiency
- Vectorized computation
February 11: Nearest Neighbors
- Instance-based learning
- Nearest neighbors
- Curse of dimensionality
Required Readings:
- Chapter 3 of Daume (in preparation). A course in machine learning
Optional Readings:
- Chapter 13 (sections 13.1 - 13.3) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning
- Chapter 6 of Provost & Fawcett: Data Science for Business
February 16: Gradient Descent
- Cost functions
- Gradient descent
- Convexity
Required Readings:
- Chapter 7 of Daume (in preparation). A course in machine learning
Optional Readings:
- Chapter 5 of Schutt & O’Neill (2013): Doing Data Science
February 17 (Lab): ML Experiments in Python
- Random numbers, training and test data
- Built-in methods for cross validation
- Comparing different measures of performance
February 18: Regularization and Linear Models, part 1
- Regularization
- Ridge and Lasso
- Logistic regression
- Support vector machines
- Kernel methods
Required Readings:
- Chapter 7 of Daume (in preparation). A course in machine learning
Optional Readings:
- Chapter 6 (section 6.2) of James et al. (2016): Introduction to Statistical Learning
- This post on interpreting logistic regression results
- Chapter 3 (sections 3.3 and 3.4) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning
February 23: Regularization and Linear Models, part 2
- Regularization
- Ridge and Lasso
- Logistic regression
- Support vector machines
- Kernel methods
Same readings as above
February 24 (Lab): Linear models and Regularization
- Lasso vs. Ridge
- Cross-validation to find optimal regularization parameter
- Computational efficiency revisited
February 25: Naive Bayes
- Probability review: Bayes rule, independence, distributions
- Generative models and Naive Bayes
- Maximum likelihood estimation and smoothing
Required Readings:
- Chapter 4 of Schutt & O’Neill (2013): Doing Data Science
- Reread section 4.2 of Witten, Frank, Hall: Data Mining
- Michael Collins’s lecture notes on Naïve Bayes (especially pp. 1-4)
Optional Readings:
- Paul Graham (2002) on “Better Bayesian Filtering”.
- Kevin Murphy's example of Bayes' Rule for medical diagnosis
March 2: Mid-Semester Quiz
- Quiz #1
March 3 (Lab): Gradient descent (continued)
- Gradient descent
- Naive Bayes
March 4: Decision Trees
- Building decision trees
- Information gain
Required Readings:
- Chapter 8 of James et al. (2016): Introduction to Statistical Learning
- Chapter 13 of Daume (in preparation). A course in machine learning
Optional Readings:
- Chapter 9 (section 9.2) and Chapter 15 of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd edition, 10th printing)
March 9: Random Forests
- Regression Trees
- Random Forests
- Boosting
- Feature Importance
Optional Readings:
- Feature importance measures for random forest: blog post
- A Kaggle master explains gradient boosting
March 10 (Lab): Neural networks
- Intro to TensorFlow
March 11: Neural Networks, part 1
- Biological underpinnings
- The perceptron
- Rosenblatt's algorithm
Required Readings:
- Chapters 4 and 10 of Daume (in preparation). A course in machine learning
Optional Readings:
- Chapter 11 (sections 11.3-11.4) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning
March 16: Neural Networks, part 2
- Multilayer networks
- Backpropagation
Required Readings:
- Chapters 4 and 10 of Daume (in preparation). A course in machine learning
Optional Readings:
- Chapter 11 (sections 11.3-11.4) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning
- We will review these videos by Grant Sanderson on backpropagation in class:
- https://www.youtube.com/watch?v=Ilg3gGewQ5U
- https://www.youtube.com/watch?v=tIeHLnjs5U8
March 17 (Lab): Deep Learning
- Naive Bayes
March 18: Deep Learning, part 1
- What is "deep" about deep learning?
- Auto-encoders
- Convolutional Neural Networks
- RNNs / LSTM Networks
Required Readings:
- Andrew Ng's lecture notes on sparse autoencoders
- UFLDL's Deep Learning tutorial
Optional Readings:
- Single-Layer Neural Networks and Gradient Descent
- A step-by-step backpropagation tutorial
- Tutorial on ConvNets
- Understanding LSTM Networks
- Dean (2018). The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design
--- SPRING BREAK ---
March 30: Bias in ML
- High-profile ML failures
- Sources of bias
- Notions of fairness
Required Readings:
- Obermeyer, Powers, Vogeli and Mullainathan. 2019. Dissecting racial bias in an algorithm used to manage the health of populations. https://science.sciencemag.org/content/366/6464/447
Optional Readings:
- Solon Barocas, Moritz Hardt, Arvind Narayanan. 2020. Fairness and Machine Learning: Limitations and Opportunities. https://fairmlbook.org (Chapters 1, 2, and 5)
March 31 (Lab): Fair ML
- Download lab 1 here: https://colab.research.google.com/drive/1yYHoLqbM5in4T801mQ083XGpRqHwtFhm
- Download lab 2 here: https://colab.research.google.com/drive/1-kMXYl7LPX1qTnBdBi-ueYx15umPzM24
April 1: Fair ML
- Formalization
- Identifying bias
- Fairness constraints
- Technical "solutions"
Required Readings:
- Mulligan, Kroll, Kohli & Wong. 2019. This Thing Called Fairness: Disciplinary Confusion Realizing a Value in Technology. https://dl.acm.org/doi/10.1145/3359221
April 6: Common practical issues
- Bias-variance tradeoff
- Feature engineering
- Imbalanced data
Required Readings:
- Chapters 5 & 6 of Daume (in preparation). A course in machine learning
Optional Readings:
- A plain-English tutorial on the bias-variance tradeoff
- Chapters 1-3 of Mastering Feature Engineering (early release)
- Chapter 2 of James et al. (2017). An Introduction to Statistical Learning
- Andrew Gelman on Missing Data Imputation
- He, H., Garcia, E.A., 2009. Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering 21, 1263–1284. doi:10.1109/TKDE.2008.239
- Lakkaraju, H., Kleinberg, J., Leskovec, J., Ludwig, J., Mullainathan, S., 2017. The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables. KDD 2017, 275–284. https://doi.org/10.1145/3097983.3098066
April 7 (Lab): Supervised learning practicalities
- X
April 8: Common practical issues (Part 2)
- Imbalanced data
- Missing data
- Multi-class classification
- Model and feature selection
Required Readings:
- He, H., Garcia, E.A., 2009. Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering 21, 1263–1284. doi:10.1109/TKDE.2008.239
Optional Readings:
- Python tutorial on Cost-Sensitive Decision Trees for Imbalanced Classification
- Python tutorial on softmax classification
- Multinomial response models, from Rodríguez, G. (2007). Lecture Notes on Generalized Linear Models.
April 13: Supervised Learning Wrap-Up
- Modelling Trade-Offs
- Comparing classifiers
- Guiding principles
Required Readings:
- Chapter 13 of Daume (in preparation). A course in machine learning
- Domingos, “A Few Useful Things to Know about Machine Learning.” Communications of the ACM, 55 (10), 78-87, 2012.
Optional Readings:
- Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Yu, P.S., Zhou, Z.-H., Steinbach, M., Hand, D.J., Steinberg, D., 2008. “Top 10 Algorithms in Data Mining”. Knowledge and Information Systems 14, 1–37. doi:10.1007/s10115-007-0114-2
April 14 (Lab): TBD
- X
April 15: Unsupervised learning
- Cluster analysis
- Dimensionality Reduction
- Principal Component Analysis
- Case study: Eigenfaces
- Other methods for dimensionality reduction: SVD, NNMF, LDA
Required Readings:
- Chapter 7 of Leskovec, Rajaraman, and Ullman (2014): Mining of Massive Datasets
Optional Readings:
- Watch Pedro Domingos talk about the curse of dimensionality (segment 4 of week 4)
- Chapter 11 (sections 11.1 – 11.3) Leskovec, Rajaraman, and Ullman (2014): Mining of Massive Datasets.
- Chapter 15 of Daume (in preparation). A course in machine learning
- Justin Grimmer and Gary King. 2011. “General Purpose Computer-Assisted Clustering and Conceptualization.” Proceedings of the National Academy of Sciences. Copy at http://j.mp/2qzYYj2
- Chapter 6 of Provost & Fawcett: Data Science for Business
- Chapter 14 (sections 14.2, 14.5 - 14.10) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd edition, 10th printing)
- Turk & Pentland (1991) “Eigenfaces for Recognition”
April 20: Recommender Systems
- The Netflix challenge
- Content-based methods
- Learning features and parameters
- Nearest-neighbor collaborative filtering
Recommended Readings:
- Chapter 8 of Schutt & O’Neill (2013): Doing Data Science
- Domingos, “A Few Useful Things to Know about Machine Learning.” Communications of the ACM, 55 (10), 78-87, 2012.
- Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Yu, P.S., Zhou, Z.-H., Steinbach, M., Hand, D.J., Steinberg, D., 2008. “Top 10 Algorithms in Data Mining”. Knowledge and Information Systems 14, 1–37. doi:10.1007/s10115-007-0114-2
Optional Readings:
- The Guardian (2017), "How algorithms are pushing the tech giants into the danger zone"
- Chapter 9 of Leskovec, Rajaraman, and Ullman (2014): Mining of Massive Datasets.
- Yehuda Koren (2009) “The BellKor Solution to the Netflix Grand Prize”
- Resnick et al (1994) “GroupLens: an open architecture for collaborative filtering of netnews”, CSCW ’94, pp. 175-186
- RM Bell, Y Koren (2007) “Lessons from the Netflix prize challenge”, ACM SIGKDD Explorations Newsletter
April 21 (Lab): Unsupervised learning
- k-Means clustering
- Dimensionality reduction: PCA
April 22: Machine learning and causal inference
- ML for measurement
- Inference after selection
- Selecting among many controls
- Selecting among many instruments
- Machine learning heterogeneous treatment effects
Required Readings:
- Section 4 of: Athey, S., 2018. The impact of machine learning on economics, in: The Economics of Artificial Intelligence: An Agenda. University of Chicago Press, pp. 507–547.
- Athey, S., Imbens, G., 2019. Machine Learning Methods Economists Should Know About. arXiv:1903.10075.
Optional Readings:
- Athey, S., Imbens, G., 2016. Recursive partitioning for heterogeneous causal effects. PNAS 113, 7353–7360. https://doi.org/10.1073/pnas.1510489113
- Athey, S., M. Bayati, N. Doudchenko, G. Imbens, and K. Khosravi (2017) "Matrix Completion Methods for Causal Panel Data Models." http://arXiv.org/abs/1710.10251
- Belloni, A., Chernozhukov, V., Hansen, C., 2014. High-Dimensional Methods and Inference on Structural and Treatment Effects. Journal of Economic Perspectives 28, 29–50. https://doi.org/10.1257/jep.28.2.29
- Same authors (2011): “LASSO Methods for Gaussian Instrumental Variables Models.” arXiv:1012.1297 [stat.ME], http://arxiv.org/abs/1012.1297
- Chernozhukov, V., Hansen, C., Spindler, M., 2015. Valid Post-Selection and Post-Regularization Inference: An Elementary, General Approach. Annual Review of Economics 7, 649–688. https://doi.org/10.1146/annurev-economics-012315-015826
- Künzel, S.R., Sekhon, J.S., Bickel, P.J., Yu, B., 2019. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences 116, 4156–4165.
- Sands and Gilchrist (Medium Post): Best of Both Worlds: An Applied Intro to ML For Causal Inference
- Taylor, J., Tibshirani, R.J., 2015. Statistical learning and selective inference. Proceedings of the National Academy of Sciences 112, 7629–7634.
- Wager, S., Du, W., Taylor, J., Tibshirani, R.J., 2016. High-dimensional regression adjustments in randomized experiments. PNAS 113, 12673–12678.
April 27: Applied ML - start to finish
- Data => Features
- Training and cross-validation
- Evaluating performance
- Extensions
Optional Readings:
- Blumenstock et al (2015): Predicting Poverty with Mobile Phone Metadata
- Aiken et al. (2020): Targeting Development Aid with Machine Learning and Mobile Phone Data: Evidence from an Anti-Poverty Intervention in Afghanistan
April 28 (Lab): No Lab
April 29: Summary
- Recap / summary
- Quiz #2