**Theory and Algorithms in Data Science**

This seminar provides a forum for researchers working in foundational areas of Data Science
(and related fields). All talks are held at the Alan
Turing Institute. The next talk is on **29 Sep 2017** from **14h**
until **15h** in the **Enigma** room.

The organizers are
Mihai Cucuringu,
Armin Eftekhari,
Vidit Nanda and
Hemant Tyagi.
We represent the
Low-dimensional structure as well as the
Topology and geometry research interest groups within the Turing Institute. **If you'd like to speak in this seminar,
please email Vidit at vnanda@turing.ac.uk**

# Upcoming Talks

## 29 Sep 2017: **Greg Ongie**, University of Michigan

**Speaker**: Greg Ongie, University of Michigan

**Title**: Learning non-linear models with missing data

**Abstract**: Linear models have had great success for inference problems with missing data.
A key example is low-rank matrix completion, where the missing entries of a matrix are imputed by
exploiting low-dimensional linear structure. However, many real-world datasets fail to have linear
structure but still have low-dimensional non-linear structure that can be exploited. We introduce a
systematic framework for learning low-dimensional non-linear models in problems in missing data.
The approach has close connections to subspace clustering, manifold learning, and kernel methods
in machine learning. As a special case, we focus on the problem of matrix completion where we
assume data belongs to a low-dimensional algebraic variety, i.e., each data point is a solution to
a system of polynomial equations, and study the sampling requirements of this model. We propose an
efficient matrix completion algorithm that is able to recover synthetically generated data up to
the predicted sampling complexity bounds and outperforms traditional low-rank matrix completion
and subspace clustering approaches on real datasets in computer vision and genomics.

**Room and Time**: Enigma, 14h-15h

## 03 Oct 2017: **He Sun**, University of Bristol

**Speaker**: He Sun, University of Bristol

**Title**: Heat kernels in graphs: A journey from random walks to geometry, and back

**Abstract**: Heat kernels are one of the most fundamental concepts in physics and mathematics.
In physics, the heat kernel is a fundamental solution of the heat equation and connects the Laplacian
operator to the rate of heat dissipation. In spectral geometry, many fundamental techniques are based
on heat kernels. In finite Markov chain theory, heat kernels correspond to continuous-time random walks
and constitute one of the most powerful techniques in estimating the mixing time.

In this talk, we will briefly discuss this line of research and its relation to heat kernels in graphs. In particular, we will see how heat kernels are used to design the first nearly-linear time algorithm for finding clusters in real-world graphs. Some interesting open questions will be addressed in the end.

**Speaker Bio**: He Sun is a senior lecturer in Algorithms and Complexity in the School of
Informatics, University of Edinburgh. Prior to this, he held positions at the Max Planck Institute
for Informatics from 2010 to 2015, and the University of Bristol from 2015 to 2017. His main research
interests range over the fields of spectral graph theory, applied probability and statistics, matrix
analysis, and machine learning.

**Room and Time**: TBD, 14h-15h

## 09 Oct 2017: **Martin Lotz**, University of Manchester

**Speaker**: Martin Lotz, University of Manchester

**Title**: TBD

**Abstract**: TBD

**Speaker Bio**: Martin Lotz is a Lecturer in Mathematics at the University of Manchester. Previous to his
appointment at Manchester he was a post-doctoral researcher at the City University of Hong Kong, at the University
of Oxford, and as a Research Fellow at the University of Edinburgh, supported by a Leverhulme Trust and Seggie
Brown Fellowship. He studied Mathematics at the ETH (Swiss Federal Institute of Technology) in Zurich, and the
University of Paderborn, where he finished his doctorate in Algebraic Complexity Theory. His interests span a
wide range of mathematics and its applications. His current work explores connections between probability,
optimization, and applications in signal processing, data analysis and inverse problems. Previous work has been
in algebraic complexity theory, computational algebra and geometry, foundations of numerical analysis, and
combinatorics.

**Room and Time**: Augusta, 14h-15h

## 17 Oct 2017: **Varun Kanade**, University of Oxford

** Speaker **: Varun Kanade, University of Oxford

** Title **: From which world is your graph?

**Abstract**: Discovering statistical structure from links is a fundamental problem in the analysis of social networks. Choosing a misspecified model, or equivalently, an incorrect inference algorithm will result in an invalid analysis or even falsely uncover patterns that are in fact artifacts of the model. This work focuses on unifying two of the most widely used link-formation
models: the stochastic blockmodel (SBM) and the small world (or latent space) model (SWM). Integrating techniques from kernel learning, spectral graph theory, and nonlinear dimensionality reduction, we develop the first statistically sound polynomial-time algorithm to discover latent patterns in sparse graphs for both models. When the network comes from an SBM, the algorithm outputs a block structure. When it is from an SWM, the algorithm outputs estimates of each node's latent position. We report experiments performed on a twitter dataset collected between October and November 2016. Based on joint work with Cheng Li, Zhenming Liu and Felix Wong.

**Speaker Bio**: Varun Kanade is a Research Lecturer in the Department of Computer Science, Oxford University. He has been a Simons Postdoctoral Fellow at the University of California, Berkeley and an FSMP postdoctoral fellow at ENS, Paris. He obtained his Ph.D. from Harvard University in 2012. His research interests are in machine learning and theoretical computer science and focuses on mathematical foundations of machine learning, in particular computational and statistical tradeoffs and improving our understanding of methods that have been successful in practice.

**Room and Time**: Augusta, 14h-15h

## 31 Oct 2017: **Dino Sejdinovic**, University of Oxford

** Speaker **: Dino Sejdinovic,
University of Oxford

**Title**: TBD

**Abstract**: TBD

**Room and Time**: TBD, 14h-15h

## 07 Nov 2017: **Quentin Berthet**, University of Cambridge

** Speaker **: Quentin Berthet, University of Cambridge

** Title **: Bandit Optimization with Upper-Confidence Frank-Wolfe

**Abstract**: We consider the problem of bandit optimization, inspired by stochastic optimization and online learning problems with bandit feedback. In this problem, the objective is to minimize a global loss function of all the actions, not necessarily a cumulative loss. This framework allows us to study a very general class of problems, with applications in statistics, machine learning, and other fields. To solve this problem, we introduce the Upper-Confidence Frank-Wolfe algorithm, inspired by techniques for bandits and convex optimization. We show upper bounds on the optimization error of this algorithm over various classes of functions, and discuss the optimality of these results.

**Speaker Bio**: Quentin Berthet is a Turing Fellow and a Lecturer in the Statslab, in the DPMMS at the University of Cambridge, since 2015. He is a former student of the Ecole Polytechnique, received a Ph.D. from Princeton University in 2014, and was a CMI postdoctoral fellow at Caltech. His research interests are in Statistics, Optimization and Computer Science.

**Room and Time**: Augusta, 14h-15h

## 14 Nov 2017: **Piotr Fryzlewicz**, London School of Economics

** Speaker **: Piotr Fryzlewicz,
London School of Economics

**Title**: TBD

**Abstract**: TBD

**Room and Time**: TBD, 14h-15h

## 21 Nov 2017: **Rodrigo Mendoza Smith**, University of Oxford

** Speaker **: Rodrigo Mendoza Smith, University of Oxford

** Title **: TBD

**Abstract**: TBD

**Speaker Bio**: Rodrigo Mendoza Smith is a DPhil candidate in Applied Mathematics working with Prof. Jared Tanner. He is affiliated with the Numerical Analysis Group in Oxford and with the Alan Turing Institute in London. He is working on the design of provably fast numerical algorithms for sparse and low-rank data models. His research interests lie in numerical computing, compressed-sensing, optimisation, machine-learning, network theory and computational algebraic topology. His motivation and long term goal is to contribute to the understanding of Artificial Intelligence and its interplay with geometry, statistics and computability

**Room and Time**: Augusta, 14h-15h

## 28 Nov 2017: **Adilet Otemissov**, University of Oxford

** Speaker **: Adilet Otemissov, University of Oxford

** Title **: TBD

**Abstract**: TBD

**Speaker Bio**:

**Room and Time**: Augusta, 14h-15h

## 20 Dec 2017: **Rajen Shah**, University of Cambridge

** Speaker **: Rajen Shah, University of Cambridge

** Title **: The xyz algorithm for fast interaction search in high-dimensional data

**Abstract**: When performing regression on a dataset with p variables, it is often of interest to go beyond using main effects and include interactions as products between individual variables. However, if the number of variables p is large, as is common in genomic datasets, the computational cost of searching through O(p^2) interactions can be prohibitive. In this talk I will introduce a new randomised algorithm called xyz that is able to discover interactions with high probability and under mild conditions has a runtime that is subquadratic in p. The underlying idea is to transform interaction search into a much simpler closest pair problem. We will see how strong interactions can be discovered in almost linear time, whilst finding weaker interactions requires O(p^u) operations for 1 < u < 2 depending on their strength. An application of xyz to a genome-wide association study shows how more than 10^11 interactions can be screened in minutes using a standard laptop. This is joint work with Gian Thanei and Nicolai Meinshausen (ETH Zurich).

**Speaker Bio**: Rajen Shah is a lecturer in Statistics at the Statistical Laboratory, University of Cambridge, having completed his PhD under the supervision of Richard Samworth (also at Cambridge) in 2014. His main research interests are in high-dimensional statistics and large-scale data analysis. Rajen is currently an Associate Editor for the Journal of the Royal Statistical Society Series B, and a Turing Fellow.

**Room and Time**: Augusta, 14h-15h

# Past Talks

## 19 Sep 2017: **Richard Samworth**, University of Cambridge

** Speaker **: Richard Samworth, University of Cambridge

** Title **: High dimensional change point estimation via sparse projection

**Abstract**: Change points are a very common feature of big data that arrive in the form of a data stream. We study high dimensional time series in which, at certain time points, the mean structure changes in a sparse subset of the co-ordinates. The challenge is to borrow strength across the co-ordinates to detect smaller changes than could be observed in any individual component series. We propose a two-stage procedure called 'inspect' for estimation of the change points: first, we argue that a good projection direction can be obtained as the leading left singular vector of the matrix that solves a convex optimization problem derived from the cumulative sum transformation of the time series. We then apply an existing univariate change point estimation algorithm to the projected series. Our theory provides strong guarantees on both the number of estimated change points and the rates of convergence of their locations, and our numerical studies validate its highly competitive empirical performance for a wide range of data-generating mechanisms. Software implementing the methodology is available in the R package 'InspectChangepoint'.

**Speaker Bio**: Richard Samworth obtained his PhD in Statistics from the University of Cambridge in 2004, and has remained in Cambridge since, becoming a professor in 2013 and the Professor of Statistical Science in 2017. His main research interests are in nonparametric and high-dimensional statistics, including problems in shape-constrained inference, classification, variable selection and changepoint estimation, among others.

**Room and Time**: Augusta, 14h-15h.