Data Mining: Learning from Large Data Sets

Overview

Many scientific and commercial applications require us to obtain insights from massive, high-dimensional data sets. In this graduate-level course, students will learn to apply, analyze and evaluate principled, state-of-the-art techniques from statistics, algorithms and discrete and convex optimization for learning from such large data sets.

Topics

  • Dealing with large data (Data centers; Map-Reduce/Hadoop; Amazon Mechanical Turk)
  • Fast nearest neighbor methods (Shingling, locality sensitive hashing)
  • Online learning (Online optimization and regret minimization, online convex programming, applications to large-scale Support Vector Machines)
  • Multi-armed bandits (exploration-exploitation tradeoffs, applications to online advertising and relevance feedback)
  • Active learning (uncertainty sampling, pool-based methods, label complexity)
  • Dimension reduction (random projections, nonlinear methods)
  • Data streams (Sketches, coresets, applications to online clustering)
  • Recommender systems

News

  • Class moved to CAB G11.
  • First class on 18.2. Tutorials start in the week of 24.2.
  • No recitations on Friday 18.4. You are welcome to join on Tuesday 15.4.
  • No recitations on the week of 21.4. - 25.4.
  • Project report template available here.
  • Project scores and solutions available here.

Details

  • VVZ Information: See here.
  • Recitations:
    • Tue 13-14 in CAB G61. Last names starting with A-E
    • Tue 14-15 in ETZ J 91. Last names starting with F-M
    • Tue 15-16 in ETZ J 91. Last names starting with N-S
    • Fri 14-15 in NO C 6. Last names starting with T-Z
  • Textbook: A. Rajaraman, J. Ullman. Mining of Massive Data Sets. Available here.

Homeworks

  • Self Assessment Questions [pdf]
  • Homework 1 [pdf]. Data set [zip]. Solution [zip].
  • Homework 2 [pdf]. Solution [zip].
  • Homework 3 [pdf]. Solution [pdf].
  • Homework 4 [pdf]. Solution [pdf].
  • Homework 5 [pdf]. Solution [pdf].
  • Homework 6 [pdf]. Solution [pdf].
  • Homework 7 [pdf]. Solution [pdf].

Lecture Notes

  • February 18: Introduction, MapReduce [pdf]
  • February 25: Approximate Retrieval; Min-hashing [pdf]
  • March 4: Locality sensitive hashing [pdf]
  • March 11: SVMs; online convex programming [pdf]
  • March 18: (Parallel) stochastic gradient descent [pdf]
  • March 25: Scaling up Kernel Machines [pdf]
  • April 1: Active Learning [pdf]
  • April 8: Large scale unsupervised learning (Online k-means, coresets) [pdf]
  • April 15: Large scale unsupervised learning continued (coresets, GMMs) [pdf]
  • April 22: Exploration--exploitation tradeoffs (k-armed bandits, upper confidence sampling) [pdf]
  • May 6: Contextual bandits [pdf]
  • May 13: Submodular functions (properties, algorithms and applications) [pdf]
  • May 27: Recommending sets (structured prediction, online submodular optimization) [pdf]

Recitations

  • Feb 25: Apache Hadoop: MapReduce, HDFS and Hadoop Streaming [pdf]
  • Mar 4: LSH & Approximate NN [pdf]
  • Mar 11: SVM [pdf]. Code [zip]
  • April 1: Active Learning [pdf]
  • April 15: Unsupervised Learning [pdf]
  • May 6: Exploration-Exploitation [pdf]
  • May 13: Submodular Functions [pdf]
  • May 27: Online Submodular Optimization [pdf]

Relevant Readings

  • Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.
  • Jure Leskovec, Eric Horvitz. Planetary-Scale Views on a Large Instant-Messaging Network. In Proceedings WWW 2008 [pdf]
  • Manuel Gomez Rodriguez, Jure Leskovec, Andreas Krause. Inferring Networks of Diffusion and Influence, In Proc. ACM Conference on Knowledge Discovery in Databases (KDD), 2010 [pdf]
  • James Hays, Alexei A. Efros. Scene Completion Using Millions of Photographs. ACM Transactions on Graphics (SIGGRAPH 2007). August 2007, vol. 26, No. 3.
  • Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan and Uri Shaft. When is "Nearest Neighbor" Meaningful?. Proc. 7th International Conference on Database Theory (ICDT 1999) LNCS 1540: 217-235.
  • Aristides Gionis, Piotr Indyk, Rajeev Motwani. Similarity Search in High Dimensions via Hashing [ps]. 25th International Conference on Very Large Databases (VLDB) , 1999
  • Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. Proc. International Conference on Machine Learning 2003 [pdf]
  • Elad Hazan, Amit Agarwal, Satyen Kale. Logarithmic Regret Algorithms for Online Convex Optimization. Machine Learning Journal Volume 69 , Issue 2-3 Pages: 169 - 192 [pdf]
  • Martin Zinkevich, Markus Weimer, Alex Smola, Lihong Li. Parallelized Stochastic Gradient Descent. Advances in Neural Information Processing Systems 23, 2010 [pdf]
  • Feng Niu, Benjamin Recht, Christopher Re, and Stephen J. Wright. HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. Advances in Neural Information Processing Systems 24, 2011 [pdf]
  • Ali Rahimi, Ben Recht. Random Features for Large-Scale Kernel Machines. Advances in Neural Information Processing Systems 20, 2007 [pdf]
  • Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou. Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison. Advances in Neural Information Processing Systems 25, 2012 [pdf]
  • Ji Zhu, Saharon Rosset, Trevor Hastie, Rob Tibshirani. L1 norm support vector machines. Advances in Neural Information Processing Systems 2003 [html]
  • John Duchi, Shai Shalev-Shwartz, Yoram Singer, Tushar Chandra. Efficient Projections onto the l1-Ball for Learning in High Dimensions. ICML 2008 [pdf]
  • Nathan Ratliff, J. Andrew (Drew) Bagnell, and Martin Zinkevich. (Online) Subgradient Methods for Structured Prediction. AISTATS 2007 [pdf]
  • Prateek Jain, Sudheendra Vijayanarasimhan, Kristen Grauman. Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning. Advances in Neural Information Processing Systems 2003 [pdf]
  • Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Proceedings of the Seventeenth International Conference on Machine Learning 2000 [ps]
  • Dan Feldman, Morteza Monemizadeh, Christian Sohler. A PTAS for k-Means Clustering Based on Weak Coresets. Proc. 23th Annu. ACM Symposium on Computational Geometry (SoCG) 2007 [pdf]
  • Chris Bishop. Pattern Recognition and Machine Learning. Springer, 2006. Chapter 9
  • Percy Liang, Dan Klein. Online EM for Unsupervised Models. Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL) 2009 [pdf ]
  • Dan Feldman, Matthew Faulkner, Andreas Krause. Scalable Training of Mixture Models via Coresets. In Proc. Neural Information Processing Systems, 2011 [pdf ]
  • Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning Journal, 2002. [pdf]
  • Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW-10, Raleigh, NC, April, 2010. [pdf]
  • Khalid El-Arini, Gaurav Veda, Dafna Shahaf and Carlos Guestrin. Turning Down the Noise in the Blogosphere. In Proc. of 15th International Conference on Knowledge Discovery and Data Mining (KDD 2009) [pdf]
  • Andreas Krause, Daniel Golovin. Submodular Function Maximization. Survey [pdf]
  • Matthew Streeter, Daniel Golovin. An Online Algorithm for Maximizing Submodular Functions. In Proc. 23rd Annual Conference on Neural Information Processing Systems (NIPS), 2008 [pdf]
  • Matthew Streeter, Daniel Golovin, Andreas Krause. Online Learning of Assignments. In Proc. 23rd Annual Conference on Neural Information Processing Systems (NIPS), 2009 [pdf]
  • Yisong Yue, Carlos Guestrin. Linear Submodular Bandits and their Application to Diversified Retrieval. Neural Information Processing Systems (NIPS), December, 2011 [pdf]

Discussion Forum

We maintain a discussion board at the VIS inforum. Use it to ask questions of general interest and interact with other students of this class. We regularly visit the board to provide answers.

Related Courses