Data Mining: Learning from Large Data Sets


Many scientific and commercial applications require us to obtain insights from massive, high-dimensional data sets. In this graduate-level course, we will study principled, state-of-the-art techniques from statistics, algorithms, and discrete and convex optimization for learning from such large data sets. The course will cover both theoretical foundations and practical applications.

Topics covered

  • Dealing with large data (data centers; MapReduce/Hadoop; Amazon Mechanical Turk)
  • Fast nearest neighbor methods (shingling, locality-sensitive hashing)
  • Online learning (online optimization and regret minimization, online convex programming, applications to large-scale Support Vector Machines)
  • Multi-armed bandits (exploration-exploitation tradeoffs, applications to online advertising and relevance feedback)
  • Active learning (uncertainty sampling, pool-based methods, label complexity)
  • Dimension reduction (random projections, nonlinear methods)
  • Data streams (sketches, coresets, applications to online clustering)
  • Recommender systems
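
To give a flavor of the fast nearest neighbor topic listed above, here is a minimal sketch of shingling followed by MinHash, one common building block of locality-sensitive hashing for Jaccard similarity. It is an illustration only and not taken from the course material: the function names, the word-level shingle size k=3, the choice of 128 hash functions, and the use of salted BLAKE2b as the hash family are all assumptions made for this example.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word k-grams) of text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle (or none if the text is empty).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(shingle_set, num_hashes=128):
    """MinHash signature of a non-empty set of strings.

    For each of num_hashes salted hash functions (an assumed hash family),
    keep the minimum hash value over all shingles in the set.
    """
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimate_similarity(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature entries."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Hypothetical example documents.
doc_a = shingles("the quick brown fox jumps over the lazy dog")
doc_b = shingles("the quick brown fox leaps over the lazy dog")

print("exact Jaccard:   ", jaccard(doc_a, doc_b))
print("MinHash estimate:", estimate_similarity(minhash_signature(doc_a),
                                               minhash_signature(doc_b)))
```

The signatures have a fixed length regardless of document size, so they can be compared cheaply; the estimate approaches the exact Jaccard similarity as the number of hash functions grows. A full locality-sensitive hashing scheme would additionally split each signature into bands and hash the bands into buckets to find candidate pairs, which is omitted here.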