# Theory and Algorithms in Data Science

This seminar provides a forum for researchers working in foundational areas of Data Science (and related fields). All talks are held at the Alan Turing Institute. For a complete list of talks held in previous years, please click here.

The organizers are Mihai Cucuringu and Vidit Nanda. We represent the Low-dimensional structure as well as the Topology and geometry research interest groups within the Turing Institute. If you'd like to speak in this seminar, please email Vidit at vnanda@turing.ac.uk

# Upcoming Talks

## 09 Oct 2019: Vardan Papyan, Stanford University

Speaker : Vardan Papyan,
Title : Traces of Class/Cross-Class Structure Pervade Deep Learning Spectra

Abstract: Numerous researchers recently applied empirical spectral analysis to the study of modern deep learning classifiers. We identify and discuss an important formal class/cross-class structure and show how it lies at the origin of the many visually striking features observed in deepnet spectra, some of which were reported in recent articles and others unveiled here for the first time. These include spectral outliers and small but distinct bumps often seen beyond the edge of a "main bulk". The structure we identify organizes the coordinates of deepnet features and back-propagated errors, indexing them as an NxC or NxCxC array. Such arrays can be indexed by a two-tuple (i,c) or a three-tuple (i,c,c'), where i runs across the indices of the train set; c runs across the class indices and c' runs across the cross-class indices. This indexing naturally induces C class means, each obtained by averaging over the indices i and c' for a fixed class c. The same indexing also naturally defines C^2 cross-class means, each obtained by averaging over the index i for a fixed class c and a cross-class c'. We develop a formal process of spectral attribution, which is used to show the outliers are attributable to the C class means; the small bump next to the "main bulk" is attributable to between-cross-class covariance; and the "main bulk" is attributable to within-cross-class covariance. Formal theoretical results validate our attribution methodology.

We show how the effects of the class/cross-class structure permeate not only the spectra of deepnet features and backpropagated errors, but also the gradients, Fisher Information matrix and Hessian, whether these are considered in the context of an individual layer or the concatenation of them all. The Kronecker or Khatri-Rao product of the class means in the features and the class/cross-class means in the backpropagated errors approximates the class/cross-class means in the gradients. These means of gradients then create C and C^2 outliers in the spectrum of the Fisher Information matrix, which is the second moment of these gradients. The outliers in the Fisher Information matrix spectrum then create outliers in the Hessian spectrum. We explain the significance of this insight by proposing a correction to KFAC, a well known second-order optimization algorithm for training deepnets.

Room and Time: Augusta, 14h - 15h

# Recent Talks

## 23 Aug 2019: Charalampos Tsourakakis, Boston University

Speaker : Charalampos Tsourakakis, Boston University
Title : Near-Optimal Learning of Joint Alignments and Clusters with a Faulty Oracle

Abstract: In this talk we will discuss state-of-the-art results on two important problems, clustering and learning joint alignments (e.g., of images and shapes) using noisy queries. For the clustering problem, we consider the following model for two clusters: we are allowed to query any pair of nodes whether they belong to the same cluster or not, but the answer to the query is corrupted with some probability 0 < q < 1/2. Let \delta=1-2q be the bias. We provide an algorithm that recovers all signs correctly with high probability in the presence of noise with O(n log n / \delta^2 + \log^2 n/\delta^6) queries. This is the best known result for this problem for all but tiny \delta. We also provide another algorithm that performs O(n\log n/\delta^4) queries, and uses breadth first search as its main algorithmic primitive. Then we will present our results for the problem of learning joint alignments. The goal is to recover n discrete variables x_i in {0, ..., k-1} (up to some global offset) given noisy observations of a set of their pairwise differences {x_i - x_j mod k}; specifically, with probability 1/k+\delta for some \delta > 0 one obtains the correct answer, and with the remaining probability one obtains a uniformly random incorrect answer. We provide an efficient algorithm that learns the joint alignment with high probability. We will conclude with some open problems.

Room and Time: Augusta, 14h - 15h

## 27 Jun 2019: Jon Bloom, Broad Institute

Speaker : Jon Bloom, Broad Institute (MIT/Harvard)
Title : Regularized linear autoencoders, the Morse theory of loss, and backprop in the brain

Abstract: When trained to minimize the distance between the data and its reconstruction, linear autoencoders (LAEs) learn the subspace spanned by the top principal directions but cannot learn the principal directions themselves. We prove that L2-regularized LAEs are symmetric at all critical points and learn the principal directions as the left singular vectors of the decoder. We smoothly parameterize the critical manifold and relate the minima to the MAP estimate of probabilistic PCA. Finally, we consider implications for PCA algorithms, computational neuroscience, and the algebraic topology of deep learning. To appear at ICML 2019. Project resources: https://github.com/danielkunin/Regularized-Linear-Autoencoders

Room and Time: Augusta, 15h - 16h

## 17 Jun 2019: Renaud Lambiotte, University of Oxford

Speaker : Renaud Lambiotte, University of Oxford
Title : Higher-Order Networks

Abstract: Network science provides powerful analytical and computational methods to describe the behaviour of complex systems. From a networks viewpoint, the system is seen as a collection of elements interacting through pairwise connections. Canonical examples include social networks, neuronal networks or the Web. Importantly, elements often interact directly with a relatively small number of other elements, while they may influence large parts of the system indirectly via chains of direct interactions. In other words, networks allow for a sparse architecture together with global connectivity. Compared with mean-field approaches, network models often have greater explanatory power because they account for the non-random topologies of real-life systems. However, new forms of high-dimensional and time-resolved data have now also shed light on the limitations of these models. In this talk, I will review recent advances in the development of higher-order network models, which account for different types of higher-order dependencies in complex data. Those include temporal networks, where the network is itself a dynamical entity and higher-order Markov models, where chains of interactions are more than a combination of links.

Room and Time: Augusta, 10h - 11h

## 11 Jun 2019: Jia Chen, University of York

Speaker : Jia Chen, University of York
Title : Estimating Latent Group Structure in Time-Varying Coefficient Panel Data Models

Abstract: This paper studies the estimation of latent group structures in heterogeneous time-varying coefficient panel data models. While allowing the coefficient functions to vary over cross sections provides a good way to model cross-sectional heterogeneity, it reduces the degree of freedom and leads to poor estimation accuracy when the time-series length is short. On the other hand, in a lot of empirical studies, it is not uncommon to find that heterogeneous coefficients exhibit group structures where coefficients belonging to the same group are similar or identical. This paper aims to provide an easy and straightforward approach for estimating the underlying latent groups. This approach is based on the hierarchical agglomerative clustering (HAC) of kernel estimates of the heterogeneous time-varying coefficients when the number of groups is known. We establish the consistency of this clustering method and also propose a generalised information criterion for estimating the number of groups when it is unknown. Simulation studies are carried out to examine the finite sample properties of the proposed clustering method as well as the post-clustering estimation of the group-specific time-varying coefficients. The simulation results show that our methods give comparable performance as the penalised-sieve-estimation-based classifier-LASSO approach by Su et al. (2018), but are computationally easier. An application to a panel study of economic growth is also provided.

Room and Time: Augusta, 13h - 14h

## 16 May 2019: Ioannis Kosmidis, University of Warwick

Speaker : Ioannis Kosmidis, University of Warwick
Title : Reduced-bias inference in regression modelling

Abstract: This talk focuses on a unified theoretical and algorithmic framework for reducing bias in the estimation of statistical models from a practitioners point of view. The talk will briefly discuss how shortcomings of classical estimators can be overcome via reduction of bias, and provide a few illustrations for well-used statistical models with tractable likelihoods, including regression models with categorical responses and Beta regression. New results will then be presented on the use of bias reduction methods when modelling longitudinal and clustered data with regression models. The substantial effect that the bias of the variance components can have on inference in the presence of heterogeneity motivates the application of the framework to deliver higher-order corrective methods for generalised linear mixed models. We ent will present the challenges in doing so along with resolutions stemming from current research.

Room and Time: Augusta, 14h - 15h

## 09 May 2019: Degui Li, University of York

Speaker : Degui Li, University of Bristol
Title : Detection of Multiple Structural Breaks in Large Covariance Matrices

Abstract: This talk is about multiple structural breaks in large covariance matrices of high-dimensional time series variables satisfying the approximate factor model structure. The breaks in the second-order structure of the common components are due to sudden changes in either factor loadings or covariance of latent factors, requiring appropriate transformation of the factor models to facilitate estimation of the (transformed) common factors and factor loadings via the classical principal component analysis. With the estimated factors and idiosyncratic errors, an easy-to-implement CUSUM-based detection technique is introduced to consistently estimate the location and number of breaks and correctly identify whether they originate in the common or idiosyncratic error components. The algorithms of Wild Binary Segmentation and Wild Sparsified Binary Segmentation are used to estimate the breaks in the common and idiosyncratic error components, respectively. Under some mild conditions, the asymptotic properties of the proposed methodology are derived with near-optimal rates (up to a logarithmic factor) achieved for the estimated change points. Some numerical studies are conducted to examine the finite-sample performance of the developed method and its comparison with other existing approaches. This is a joint work with Yu-Ning Li (Zhejiang University) and Piotr Fryzlewicz (London School of Economics).

Room and Time: Augusta, 14h - 15h

## 30 Apr 2019: Haeran Cho, University of Bristol

Speaker : Haeran Cho, University of Bristol
Title : Model selection in change-point problems

Abstract: Lately, there has been a surge of interest in research for computationally fast and statistically efficient methods for change-point detection, as nonstationarities frequently observed in real-life datasets are often attributed to structural breaks in the underlying stochastic properties. In multiple change-point detection, model selection via estimating the total number of change-points poses as a challenge, particularly when the dimensionality of the data is large. In this talk, I will address the model selection in change-point problems in two different settings: when p = 1 where one can benefit from localised application of an information criterion, and when p is large where the change-point detection problem can be translated to that of detecting pervasive and latent 'factors'.

Room and Time: Enigma, 1430h - 1530h

## 26 Mar 2019: David Barber, University College London

Speaker : David Barber, University College London

Abstract: For distributions p and q with different support, the divergence D(p||q) may not exist. We define a spread divergence on modified p and q and describe sufficient conditions for the existence of such a divergence. We give examples of using a spread divergence to train implicit generative models, including linear models (Principal Components Analysis and Independent Components Analysis) and non-linear models (Deep Generative Networks). We show how a general privacy preserving machine learning mechanism is a natural application of the spread divergence.

Room and Time: Augusta, 14h - 15h

## 12 Mar 2019: Jack Jewson, University of Warwick

Speaker : Jack Jewson, University of Warwick
Title : Doubly Robust Bayesian Inference for Non-Stationary Streaming Data with beta-Divergences

Abstract: We present the first robust Bayesian Online Changepoint Detection algorithm through General Bayesian Inference (GBI) with beta-divergences. The resulting inference procedure is doubly robust for both the parameter and the changepoint (CP) posterior, with linear time and constant space complexity. We provide a construction for exponential models and demonstrate it on the Bayesian Linear Regression model. In so doing, we make two additional contributions: Firstly, we make GBI scalable using Structural Variational approximations that are exact as beta goes to 0. Secondly, we give a principled way of choosing the divergence parameter beta by minimizing expected predictive loss on-line. Reducing False Discovery Rates of CPs from over 90% to 0% on real world data, this offers the state of the art.

Joint work with Jeremias Knoblauch and Theo Damoulas

Room and Time: Enigma, 13h - 14h

## 07 Mar 2019: Edgar Dobriban, University of Pennsylvania

Speaker : Edgar Dobriban, University of Pennsylvania
Title : How to deal with big data? Understanding large-scale distributed regression

Abstract: Modern massive datasets pose an enormous computational burden to practitioners. Distributed computation has emerged as a universal approach to ease the burden: Datasets are partitioned over machines, which compute locally, and communicate short messages. Distributed data also arises due to privacy reasons, such as with medical databases. It is important to study how to do statistical inference and machine learning in a distributed setting. In this talk, we present results about one-step parameter averaging in statistical linear models under data parallelism. We do linear regression on each machine, and take a weighted average of the parameters. How much do we lose compared to doing linear regression on the full data? Here we study the performance loss in estimation error, test error, and confidence interval length in high dimensions, where the number of parameters is comparable to the training data size. We discover several key phenomena. First, averaging is not optimal, and we find the exact performance loss. Second, different problems are affected differently by the distributed framework. Estimation error and confidence interval length increases a lot, while prediction error increases much less. These results match numerical simulations and a data analysis example. To derive these results, we rely on recent results from random matrix theory, where we also develop a new calculus of deterministic equivalents as a tool of broader interest.

Room and Time: Augusta, 14h-15h

## 28 Feb 2019: Seok Young Hong, University of Nottingham

Speaker : Seok Young Hong, University of Nottingham
Title : Nonparametric estimation of infinite order regression and its application to finance.

Abstract: In this walk, we study nonparametric estimation of the infinite order regression with stationary and weakly dependent data. We propose a Nadaraya-Watson type estimator that operates with an infinite number of conditioning variables. We propose a bandwidth sequence that shrinks the effects of long lags, so the influence of all conditioning information is modelled in a natural and flexible way, and the econometric issues of omitted information bias and specification error are effectively handled. We establish the asymptotic properties of the estimator under a wide range of static and dynamic regressions framework, thereby allowing various kinds of conditioning variables to be used. We establish pointwise/uniform consistency and the CLTs. We show that the convergence rates are at best logarithmic, and depend on the smoothness of the regression, the distribution of the marginal regressors and their dependence structure in a non-trivial way via the Lambert W function. As an application, the proposed theory is applied to investigate an important question in finance. We report some new empirical evidence on the dynamics of risk-return relation over time and its link with the macroeconomy, and also add supporting evidence for explaining some major puzzles in finance.

Room and Time: Enigma, 14h-15h

## 21 Feb 2019: Weining Wang, City, University of London

Speaker : Weining Wang, City, University of London
Title : LASSO-Driven Inference in Time and Space

Abstract: We consider the estimation and inference in a system of high-dimensional regression equations allowing for temporal and cross-sectional dependency in covariates and error processes, covering rather general forms of weak dependence. A sequence of large-scale regressions with LASSO is applied to reduce the dimensionality, and an overall penalty level is carefully chosen by a block multiplier bootstrap procedure to account for multiplicity of the equations and dependencies in the data. Correspondingly, oracle properties with a jointly selected tuning parameter are derived. We further provide high-quality de-biased simultaneous inference on the many target parameters of the system. We provide bootstrap consistency results of the test procedure, which are based on a general Bahadur representation for the $Z$-estimators with dependent data. Simulations demonstrate good performance of the proposed inference procedure. Finally, we apply the method to quantify spillover effects of textual sentiment indices in a financial market and to test the connectedness among sectors.

Room and Time: Augusta, 14h - 15h

## 19 Feb 2019: Tengyao Wang, University College London

Speaker : Tengyao Wang, University College London
Title : Isotonic regression in general dimensions

Abstract: We study the least squares regression function estimator over the class of real-valued functions on [0,1]^d that are increasing in each coordinate. For uniformly bounded signals and with a fixed, cubic lattice design, we establish that the estimator achieves the minimax rate of order n^{-min(2/(d+2),1/d)} in the empirical L_2 loss, up to poly-logarithmic factors. Further, we prove a sharp oracle inequality, which reveals in particular that when the true regression function is piecewise constant on k hyperrectangles, the least squares estimator enjoys a faster, adaptive rate of convergence of (k/n)^{min(1,2/d)}, again up to poly-logarithmic factors. Previous results are confined to the case d \leq 2. Finally, we establish corresponding bounds (which are new even in the case d=2) in the more challenging random design setting. There are two surprising features of these results: first, they demonstrate that it is possible for a global empirical risk minimisation procedure to be rate optimal up to poly-logarithmic factors even when the corresponding entropy integral for the function class diverges rapidly; second, they indicate that the adaptation rate for shape-constrained estimators can be strictly worse than the parametric rate.

Room and Time: Augusta, 11h-12h

## 14 Feb 2019: Lyudmila Grigoryeva, Universität Konstanz

Speaker : Lyudmila Grigoryeva, Universität Konstanz
Title : Reservoir Computing (RC). Part II: Applicability of RC systems with unbounded inputs and their empirical performance.

Abstract: We show an extension of the results presented in the previous talk to unbounded inputs which is a particularly natural setup for stochastic signals. We complement the universality results with an empirical assessment of performance in applications to time series forecasting. In particular, we focus on predicting financial realized covariance matrices which is a question of much importance for practitioners and researchers. We examine the empirical performance of RC in comparison with many conventional state-of-the-art econometric models for various realized covariance estimators, periods, and dimensions. We show that universal RC families consistently demonstrate superior predictive ability for various empirical designs. We conclude the presentation outlining the work in progress and future lines of research.

Room and Time: Augusta, 14:40h-15:15h

## 14 Feb 2019: Juan-Pablo Ortega, Universität St. Gallen + Centre National de Recherche Scientifique

Speaker : Juan-Pablo Ortega, U St Gallen + CNRS
Title : Reservoir Computing (RC). Part I: RC paradigm and universal approximation properties of RC systems.

Abstract: A relatively recent family of dynamic machine learning paradigms known collectively as reservoir computing is presented which is capable of unprecedented performances in the forecasting of deterministic and stochastic processes. We then focus on the universal approximation properties of the most widely used families of reservoir computers in applications. These results are a much awaited generalization to the dynamic context of previous well-known static results obtained in the context of neural networks. The universal approximation properties with respect to L^\infty and L^p -type criteria of three important families of reservoir computers with stochastic discrete-time semi-infinite inputs are shown. First, it is proved that linear reservoir systems with either polynomial or neural network readout maps are universal. More importantly, it is proved that the same property holds for two families with linear readouts, namely, state-afine systems and echo state networks. The linearity in the readouts is a key feature in supervised machine learning applications. It guarantees that these systems can be used in high-dimensional situations and in the presence of large datasets.

Room and Time: Augusta, 14h - 14:35h

## 07 Feb 2019: Xinghao Qiao, London School of Economics

Speaker : Xinghao Qiao, London School of Economics
Title : A General Theory for Large-Scale Curve Time Series via Functional Stability Measure

Abstract: Modelling a large bundle of curves arises in a broad spectrum of real applications. However, existing literature relies primarily on the critical assumption of independent curve observations. In this talk, we provide a general theory for large-scale Gaussian curve time series, where the temporal and cross-sectional dependence across multiple curve observations exist and the number of functional variables, p, may be large relative to the number of observations, n. We propose a novel functional stability measure for multivariate stationary processes based on their spectral properties and use it to establish some useful concentration bounds on the sample covariance matrix function. These concentration bounds serve as a fundamental tool for further theoretical analysis, in particular, for deriving nonasymptotic upper bounds on the errors of the regularized estimates in high dimensional settings. As functional principle component analysis (FPCA) is one of the key techniques to handle functional data, we also investigate the concentration properties of the relevant estimated terms under a FPCA framework. To illustrate with an important application, we consider vector functional autoregressive models and develop a regularization approach to estimate autoregressive coefficient functions under the sparsity constraint. Using our derived nonasymptotic results, we investigate the theoretical properties of the regularized estimate in a "large p, small n" regime. The finite sample performance of the proposed method is examined through simulation studies.

Room and Time: Augusta, 14h-15h

## 05 Feb 2019: Clifford Lam, London School of Economics

Speaker : Clifford Lam, London School of Economics
Title : Testing for High-Dimensional White Noise

Abstract: Testing for white noise is a classical yet important problem in statistics, especially for diagnostic checks in time series modeling and linear regression. For high-dimensional time series in the sense that the dimension p is large in relation to the sample size T, the popular omnibus tests including the multivariate Hosking and Li-McLeod tests are extremely conservative, leading to substantial power loss. To develop more relevant tests for high-dimensional cases, we propose a portmanteau-type test statistic which is the sum of squared singular values of the first q lagged sample autocovariance matrices. It, therefore, encapsulates all the serial correlations (upto the time lag q) within and across all component series. Using the tools from random matrix theory and assuming both p and T diverge to infinity, we derive the asymptotic normality of the test statistic under both the null and a specific VMA(1) alternative hypothesis. As the actual implementation of the test requires the knowledge of three characteristic constants of the population cross-sectional covariance matrix and the value of the fourth moment of the standardized innovations, non-trivial estimations are proposed for these parameters and their integration leads to a practically usable test. Extensive simulation confirms the excellent finite-sample performance of the new test with accurate size and satisfactory power for a large range of finite (p, T) combinations, therefore ensuring wide applicability in practice. We also talk about alternative tests using maximal correlations. Results can be used even for nonlinear processes and non-iid data.

Room and Time: Augusta, 14h-15h

## 29 Jan 2019: Xiaowen Dong, University of Oxford

Speaker : Xiaowen Dong, University of Oxford
Title : Learning Quadratic Games on Networks

Abstract: Individuals, or organisations, cooperate with or compete against one another in a wide range of practical situations. In the economics literature, such strategic interactions are often modelled as games played on networks, where an individual's payoff depends not only on her action but also that of her neighbours. The current literature has largely focused on analysing the characteristics of network games in the scenario where the structure of the network, which is represented by a graph, is known beforehand. It is often the case, however, that the actions of the players are readily observable while the underlying interaction network remains hidden. In this talk, I will introduce two novel frameworks for learning, from the observations on individual actions, network games with linear-quadratic payoffs, and in particular the structure of the interaction network. Both frameworks are based on the Nash equilibrium of such games and involve solving a joint optimisation problem for the graph structure and the individual marginal benefits. We test the proposed frameworks on both synthetic and real world examples, and show that they can effectively and more accurately learn the games than the baselines. The proposed approach has both theoretical and practical implications for understanding strategic interactions in a network environment.

Room and Time: TBA, 14h - 15h