Theory and Algorithms in Data Science

This seminar provides a forum for researchers working in foundational areas of Data Science (and related fields). All talks are held at the Alan Turing Institute. The next talk is on 29 Sep 2017 from 14h until 15h in the Enigma room.

The organizers are Mihai Cucuringu, Armin Eftekhari, Vidit Nanda and Hemant Tyagi. We represent the Low-dimensional structure as well as the Topology and geometry research interest groups within the Turing Institute. If you'd like to speak in this seminar, please email Vidit at

Upcoming Talks

29 Sep 2017: Greg Ongie, University of Michigan

Greg Speaker: Greg Ongie, University of Michigan
Title: Learning non-linear models with missing data

Abstract: Linear models have had great success for inference problems with missing data. A key example is low-rank matrix completion, where the missing entries of a matrix are imputed by exploiting low-dimensional linear structure. However, many real-world datasets fail to have linear structure but still have low-dimensional non-linear structure that can be exploited. We introduce a systematic framework for learning low-dimensional non-linear models in problems in missing data. The approach has close connections to subspace clustering, manifold learning, and kernel methods in machine learning. As a special case, we focus on the problem of matrix completion where we assume data belongs to a low-dimensional algebraic variety, i.e., each data point is a solution to a system of polynomial equations, and study the sampling requirements of this model. We propose an efficient matrix completion algorithm that is able to recover synthetically generated data up to the predicted sampling complexity bounds and outperforms traditional low-rank matrix completion and subspace clustering approaches on real datasets in computer vision and genomics.

Room and Time: Enigma, 14h-15h

03 Oct 2017: He Sun, University of Bristol

He Speaker: He Sun, University of Bristol
Title: Heat kernels in graphs: A journey from random walks to geometry, and back

Abstract: Heat kernels are one of the most fundamental concepts in physics and mathematics. In physics, the heat kernel is a fundamental solution of the heat equation and connects the Laplacian operator to the rate of heat dissipation. In spectral geometry, many fundamental techniques are based on heat kernels. In finite Markov chain theory, heat kernels correspond to continuous-time random walks and constitute one of the most powerful techniques in estimating the mixing time.

In this talk, we will briefly discuss this line of research and its relation to heat kernels in graphs. In particular, we will see how heat kernels are used to design the first nearly-linear time algorithm for finding clusters in real-world graphs. Some interesting open questions will be addressed in the end.

Speaker Bio: He Sun is a senior lecturer in Algorithms and Complexity in the School of Informatics, University of Edinburgh. Prior to this, he held positions at the Max Planck Institute for Informatics from 2010 to 2015, and the University of Bristol from 2015 to 2017. His main research interests range over the fields of spectral graph theory, applied probability and statistics, matrix analysis, and machine learning.

Room and Time: TBD, 14h-15h

09 Oct 2017: Martin Lotz, University of Manchester


Speaker: Martin Lotz, University of Manchester
Title: TBD

Abstract: TBD

Speaker Bio: Martin Lotz is a Lecturer in Mathematics at the University of Manchester. Previous to his appointment at Manchester he was a post-doctoral researcher at the City University of Hong Kong, at the University of Oxford, and as a Research Fellow at the University of Edinburgh, supported by a Leverhulme Trust and Seggie Brown Fellowship. He studied Mathematics at the ETH (Swiss Federal Institute of Technology) in Zurich, and the University of Paderborn, where he finished his doctorate in Algebraic Complexity Theory. His interests span a wide range of mathematics and its applications. His current work explores connections between probability, optimization, and applications in signal processing, data analysis and inverse problems. Previous work has been in algebraic complexity theory, computational algebra and geometry, foundations of numerical analysis, and combinatorics.

Room and Time: Augusta, 14h-15h

17 Oct 2017: Varun Kanade, University of Oxford


Speaker : Varun Kanade, University of Oxford
Title : From which world is your graph?

Abstract: Discovering statistical structure from links is a fundamental problem in the analysis of social networks. Choosing a misspecified model, or equivalently, an incorrect inference algorithm will result in an invalid analysis or even falsely uncover patterns that are in fact artifacts of the model. This work focuses on unifying two of the most widely used link-formation models: the stochastic blockmodel (SBM) and the small world (or latent space) model (SWM). Integrating techniques from kernel learning, spectral graph theory, and nonlinear dimensionality reduction, we develop the first statistically sound polynomial-time algorithm to discover latent patterns in sparse graphs for both models. When the network comes from an SBM, the algorithm outputs a block structure. When it is from an SWM, the algorithm outputs estimates of each node's latent position. We report experiments performed on a twitter dataset collected between October and November 2016. Based on joint work with Cheng Li, Zhenming Liu and Felix Wong.

Speaker Bio: Varun Kanade is a Research Lecturer in the Department of Computer Science, Oxford University. He has been a Simons Postdoctoral Fellow at the University of California, Berkeley and an FSMP postdoctoral fellow at ENS, Paris. He obtained his Ph.D. from Harvard University in 2012. His research interests are in machine learning and theoretical computer science and focuses on mathematical foundations of machine learning, in particular computational and statistical tradeoffs and improving our understanding of methods that have been successful in practice.

Room and Time: Augusta, 14h-15h

31 Oct 2017: Dino Sejdinovic, University of Oxford


Speaker : Dino Sejdinovic, University of Oxford

Title: TBD

Abstract: TBD

Room and Time: TBD, 14h-15h

07 Nov 2017: Quentin Berthet, University of Cambridge


Speaker : Quentin Berthet, University of Cambridge
Title : Bandit Optimization with Upper-Confidence Frank-Wolfe

Abstract: We consider the problem of bandit optimization, inspired by stochastic optimization and online learning problems with bandit feedback. In this problem, the objective is to minimize a global loss function of all the actions, not necessarily a cumulative loss. This framework allows us to study a very general class of problems, with applications in statistics, machine learning, and other fields. To solve this problem, we introduce the Upper-Confidence Frank-Wolfe algorithm, inspired by techniques for bandits and convex optimization. We show upper bounds on the optimization error of this algorithm over various classes of functions, and discuss the optimality of these results.

Speaker Bio: Quentin Berthet is a Turing Fellow and a Lecturer in the Statslab, in the DPMMS at the University of Cambridge, since 2015. He is a former student of the Ecole Polytechnique, received a Ph.D. from Princeton University in 2014, and was a CMI postdoctoral fellow at Caltech. His research interests are in Statistics, Optimization and Computer Science.

Room and Time: Augusta, 14h-15h

14 Nov 2017: Piotr Fryzlewicz, London School of Economics


Speaker : Piotr Fryzlewicz, London School of Economics

Title: TBD

Abstract: TBD

Room and Time: TBD, 14h-15h

21 Nov 2017: Rodrigo Mendoza Smith, University of Oxford


Speaker : Rodrigo Mendoza Smith, University of Oxford
Title : TBD

Abstract: TBD

Speaker Bio: Rodrigo Mendoza Smith is a DPhil candidate in Applied Mathematics working with Prof. Jared Tanner. He is affiliated with the Numerical Analysis Group in Oxford and with the Alan Turing Institute in London. He is working on the design of provably fast numerical algorithms for sparse and low-rank data models. His research interests lie in numerical computing, compressed-sensing, optimisation, machine-learning, network theory and computational algebraic topology. His motivation and long term goal is to contribute to the understanding of Artificial Intelligence and its interplay with geometry, statistics and computability

Room and Time: Augusta, 14h-15h

28 Nov 2017: Adilet Otemissov, University of Oxford


Speaker : Adilet Otemissov, University of Oxford
Title : TBD

Abstract: TBD

Speaker Bio:

Room and Time: Augusta, 14h-15h

20 Dec 2017: Rajen Shah, University of Cambridge


Speaker : Rajen Shah, University of Cambridge
Title : The xyz algorithm for fast interaction search in high-dimensional data

Abstract: When performing regression on a dataset with p variables, it is often of interest to go beyond using main effects and include interactions as products between individual variables. However, if the number of variables p is large, as is common in genomic datasets, the computational cost of searching through O(p^2) interactions can be prohibitive. In this talk I will introduce a new randomised algorithm called xyz that is able to discover interactions with high probability and under mild conditions has a runtime that is subquadratic in p. The underlying idea is to transform interaction search into a much simpler closest pair problem. We will see how strong interactions can be discovered in almost linear time, whilst finding weaker interactions requires O(p^u) operations for 1 < u < 2 depending on their strength. An application of xyz to a genome-wide association study shows how more than 10^11 interactions can be screened in minutes using a standard laptop. This is joint work with Gian Thanei and Nicolai Meinshausen (ETH Zurich).

Speaker Bio: Rajen Shah is a lecturer in Statistics at the Statistical Laboratory, University of Cambridge, having completed his PhD under the supervision of Richard Samworth (also at Cambridge) in 2014. His main research interests are in high-dimensional statistics and large-scale data analysis. Rajen is currently an Associate Editor for the Journal of the Royal Statistical Society Series B, and a Turing Fellow.

Room and Time: Augusta, 14h-15h

Past Talks

19 Sep 2017: Richard Samworth, University of Cambridge


Speaker : Richard Samworth, University of Cambridge
Title : High dimensional change point estimation via sparse projection

Abstract: Change points are a very common feature of big data that arrive in the form of a data stream. We study high dimensional time series in which, at certain time points, the mean structure changes in a sparse subset of the co-ordinates. The challenge is to borrow strength across the co-ordinates to detect smaller changes than could be observed in any individual component series. We propose a two-stage procedure called 'inspect' for estimation of the change points: first, we argue that a good projection direction can be obtained as the leading left singular vector of the matrix that solves a convex optimization problem derived from the cumulative sum transformation of the time series. We then apply an existing univariate change point estimation algorithm to the projected series. Our theory provides strong guarantees on both the number of estimated change points and the rates of convergence of their locations, and our numerical studies validate its highly competitive empirical performance for a wide range of data-generating mechanisms. Software implementing the methodology is available in the R package 'InspectChangepoint'.

Speaker Bio: Richard Samworth obtained his PhD in Statistics from the University of Cambridge in 2004, and has remained in Cambridge since, becoming a professor in 2013 and the Professor of Statistical Science in 2017. His main research interests are in nonparametric and high-dimensional statistics, including problems in shape-constrained inference, classification, variable selection and changepoint estimation, among others.

Room and Time: Augusta, 14h-15h.