Data Mining: Learning from Large Data Sets

Overview

Many scientific and commercial applications require us to obtain insights from massive, high-dimensional data sets. In this graduate-level course, we will study principled, state-of-the-art techniques from statistics, algorithms, and discrete and convex optimization for learning from such large data sets. The course covers both theoretical foundations and practical applications.

Topics covered

  • Dealing with large data (Data centers; Map-Reduce/Hadoop; Amazon Mechanical Turk)
  • Fast nearest neighbor methods (Shingling, locality sensitive hashing)
  • Online learning (Online optimization and regret minimization, online convex programming, applications to large-scale Support Vector Machines)
  • Multi-armed bandits (exploration-exploitation tradeoffs, applications to online advertising and relevance feedback)
  • Active learning (uncertainty sampling, pool-based methods, label complexity)
  • Dimension reduction (random projections, nonlinear methods)
  • Data streams (Sketches, coresets, applications to online clustering)
  • Recommender systems
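As a small taste of the fast nearest neighbor methods listed above, the sketch below estimates Jaccard similarity between shingle sets via min-hashing. This is an illustrative toy (names and parameters are our own, not course material): each document is shingled into character k-grams, and the fraction of agreeing min-hash signature components estimates the Jaccard similarity of the shingle sets.

```python
import random

def make_hash_funcs(n, prime=2147483647, seed=0):
    """Build n random affine hash functions h(s) = (a*hash(s) + b) mod p."""
    rng = random.Random(seed)
    funcs = []
    for _ in range(n):
        a, b = rng.randrange(1, prime), rng.randrange(prime)
        funcs.append(lambda s, a=a, b=b: (a * hash(s) + b) % prime)
    return funcs

def minhash_signature(shingles, hash_funcs):
    """Signature component i = minimum of hash function i over the set."""
    return [min(h(s) for s in shingles) for h in hash_funcs]

def estimated_jaccard(sig1, sig2):
    """Fraction of agreeing components estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

def shingle(text, k=3):
    """Set of character k-grams of the text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

docs = ["the quick brown fox", "the quick brown dog"]
hs = make_hash_funcs(100)
s1, s2 = (minhash_signature(shingle(d), hs) for d in docs)
print(estimated_jaccard(s1, s2))
```

With 100 hash functions the estimate concentrates around the true Jaccard similarity of the two shingle sets; locality sensitive hashing then bands these signatures so that similar documents collide with high probability.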

News

  • No class on March 20.

Details

  • VVZ Information: See here.
  • Recitations:
    • Thu 15-16 in NO C 6. Last names starting with A-M
    • Fri 14-15 in NO C 6. Last names starting with N-Z
  • Textbook: A. Rajaraman, J. Ullman. Mining of Massive Datasets. Available here.

Homeworks

  • Homework 1 [pdf]. Supplemental material [zip]
  • Homework 2 [pdf]. Supplemental material [zip]
  • Homework 3 [pdf]
  • Homework 4 [pdf]
  • Homework 5 [pdf]. Supplemental material [py]

Solutions

  • Homework 1 [pdf]. Implementation: [zip]
  • Homework 2 [pdf]. Implementation: [zip]
  • Homework 3 [pdf]
  • Homework 4 [pdf]. Implementation: [py]
  • Homework 5 [pdf]. Implementation: [py]

Lecture Notes

  • Feb 21: Introduction [pdf]
  • March 6: Min-hashing [pdf]
  • March 13: Locality sensitive hashing [pdf]
  • March 27: SVMs; online convex programming [pdf]
  • April 3: Parallel learning; other loss functions and regularizers [pdf]
  • April 17: Regression, model selection [pdf]
  • April 24: Active Learning [pdf]
  • May 8: Large scale unsupervised learning (Online k-means, coresets) [pdf]
  • May 15: Large scale unsupervised learning (Online EM, coresets, anomaly detection) [pdf]
  • May 22: Exploration-exploitation tradeoffs (k-armed bandits, upper confidence sampling) [pdf]
  • May 29: Exploration-exploitation tradeoffs in recommender systems; submodular functions [pdf]

Recitations

  • March 8: Nearest neighbor search [pdf]
  • March 15: Probability & linear algebra [pdf]
  • March 22: LSH & HW1 [pdf]
  • March 29: Online SVM [pdf]

Old Exams

  • Data Mining Exam, 2011 Spring [pdf]

Relevant Readings

  • Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.
  • James Hays, Alexei A. Efros. Scene Completion Using Millions of Photographs. ACM Transactions on Graphics (SIGGRAPH 2007). August 2007, vol. 26, No. 3.
  • Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan and Uri Shaft. When is "Nearest Neighbor" Meaningful?. Proc. 7th International Conference on Database Theory (ICDT 1999) LNCS 1540: 217-235.
  • Aristides Gionis, Piotr Indyk, Rajeev Motwani. Similarity Search in High Dimensions via Hashing [ps]. 25th International Conference on Very Large Databases (VLDB), 1999
  • Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. Proc. International Conference on Machine Learning 2003 [pdf]
  • Martin Zinkevich, Markus Weimer, Alex Smola, Lihong Li. Parallelized Stochastic Gradient Descent. Advances in Neural Information Processing Systems 23, 2010 [pdf]
  • Ji Zhu, Saharon Rosset, Trevor Hastie, Rob Tibshirani. L1 norm support vector machines. Advances in Neural Information Processing Systems 2003 [html]
  • Prateek Jain, Sudheendra Vijayanarasimhan, Kristen Grauman. Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning. Advances in Neural Information Processing Systems 2010 [pdf]
  • Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Proceedings of the Seventeenth International Conference on Machine Learning 2000 [ps]
  • Chris Bishop. Pattern Recognition and Machine Learning. Springer, 2006. Chapter 9
  • Percy Liang, Dan Klein. Online EM for Unsupervised Models. Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL) 2009 [pdf]
  • Dan Feldman, Morteza Monemizadeh, Christian Sohler. A PTAS for k-Means Clustering Based on Weak Coresets. Proc. 23rd Annu. ACM Symposium on Computational Geometry (SoCG) 2007 [pdf]
  • Dan Feldman, Matthew Faulkner, Andreas Krause. Scalable Training of Mixture Models via Coresets. In Proc. Neural Information Processing Systems, 2011 [pdf]
  • Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning Journal, 2002. [pdf]
  • Niranjan Srinivas, Andreas Krause, Sham Kakade, Matthias Seeger. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. In Proc. International Conference on Machine Learning (ICML), 2010 [pdf]
  • Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW-10, Raleigh, NC, April, 2010. [pdf]
  • Khalid El-Arini, Gaurav Veda, Dafna Shahaf and Carlos Guestrin. Turning Down the Noise in the Blogosphere. In Proc. of 15th International Conference on Knowledge Discovery and Data Mining (KDD 2009) [pdf]
  • Matthew Streeter, Daniel Golovin, Andreas Krause. Online Learning of Assignments. In Proc. 23rd Annual Conference on Neural Information Processing Systems (NIPS), 2009 [pdf]
  • Andreas Krause, Carlos Guestrin. Beyond Convexity - Submodularity in Machine Learning. Tutorial at ICML 2008. [pdf]

Related Courses