Final Report Summary - DISTRIBUTION TESTING (Algorithms for Testing Properties of Distributions)
Data that is most naturally viewed as samples from a probability distribution abounds. Examples include purchase records and other financial transactions, scientific measurements, sensor nets, the World Wide Web, and various types of network traffic records. In many cases there is no explicit description of the distribution; the only access to it is via samples. Even so, in order to make effective use of such data, one must understand whether the underlying probability distribution has various properties. It is often desirable to perform a variety of analyses on such data, including trend analysis, finding correlations, and detecting broad-scale anomalous events. For example, is the distribution of the samples uniform, independent, or of high entropy, or does it have other features that determine its shape or behavior?
Surprisingly, though many of these problems have been well studied in the statistics, data mining and machine learning communities, little attention has been paid to their sample and time complexity when the distributions are over very large domains. For many properties, standard techniques such as the chi-squared test and "plug-in" algorithms, which approximate the probability that the distribution assigns to each domain element, yield algorithms whose sample complexity is at least linear in the size of the domain. However, it can be crucial to find algorithms that use a number of samples that is sublinear in the size of the domain. A body of work is now emerging which considers such questions and has shown that sublinear sample complexity is indeed possible in many cases.
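To make the contrast concrete, the following is a minimal sketch (in Python, with illustrative names of our own choosing) of the plug-in approach mentioned above, here applied to entropy estimation: the empirical frequency of every observed element is plugged directly into the entropy formula.

    import math
    from collections import Counter

    def plugin_entropy(samples):
        # Build the empirical distribution from the samples and plug its
        # frequencies straight into the Shannon entropy formula.
        counts = Counter(samples)
        total = len(samples)
        entropy = 0.0
        for c in counts.values():
            p = c / total            # empirical probability of this element
            entropy -= p * math.log2(p)
        return entropy

    # With far fewer samples than domain elements, most probabilities are
    # missed or badly under-estimated, which is why plug-in estimators
    # typically need a number of samples at least linear in the domain size.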
This new direction has already yielded results on fundamental problems which are both surprising and potentially useful. In particular, testers have been designed for several properties and tasks, such as closeness of two distributions, independence, entropy estimation, and estimating the number of distinct elements that have reasonable weight in the distribution.
In this project we have pursued this line of research and pushed its boundaries in two ways: we have introduced new models, and we have given new, more efficient algorithms in both the existing and the new models. Whereas most prior work considers testing properties of a single distribution, with efficiency measured in terms of the size of the domain, several of our results concern collections of distributions.
In [LRR11] we have initiated the study of testing properties of collections of distributions, as might occur when the distributions come from readings of many sensors in a sensor net, or when considering the distributions of words across many different text files. We have considered the complexity of testing whether all the distributions in the collection are similar, and whether the distributions are clusterable. Depending on the exact model in use, we have given upper and lower bounds on the sample complexity, which are tight in many cases.
In [LRR12] we have studied the problem of testing whether all members of a collection of distributions have the same mean. We give nearly matching upper and lower bounds for this problem, which are sublinear in the number of distributions in the collection.
In [ILR12] we have considered distributions that can be well represented as k-histograms, where a k-histogram is described by a list of k intervals and k corresponding probability values. k-histograms are widely used in database systems to give concise representations of distributions.
We consider the following problem: given a collection of samples from a distribution p, find a k-histogram that is close to p according to the commonly used L2 distance measure. We give time- and sample-efficient algorithms for this problem. We further provide algorithms that distinguish distributions that are k-histograms from distributions that are far from every k-histogram.
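As an illustration of the problem itself (not of the sample- and time-efficient algorithm of [ILR12]), the sketch below fits the best k-piecewise-constant approximation to an empirical distribution under squared-L2 error by straightforward dynamic programming; the function name and interface are our own.

    import numpy as np

    def best_k_histogram_error(emp, k):
        # emp: length-n vector of empirical probabilities over an ordered domain.
        # Returns the smallest squared-L2 error achievable by any k-histogram,
        # i.e. any partition of the domain into k intervals, each assigned a
        # single probability value.
        n = len(emp)
        pref = np.concatenate(([0.0], np.cumsum(emp)))               # prefix sums
        pref2 = np.concatenate(([0.0], np.cumsum(np.square(emp))))   # prefix sums of squares

        def interval_error(i, j):
            # Squared-L2 cost of flattening emp[i:j] to its mean value.
            s, s2, length = pref[j] - pref[i], pref2[j] - pref2[i], j - i
            return s2 - s * s / length

        INF = float("inf")
        cost = [[INF] * (k + 1) for _ in range(n + 1)]
        cost[0][0] = 0.0
        for j in range(1, n + 1):                 # best cost for the first j elements
            for m in range(1, k + 1):             # using exactly m intervals
                cost[j][m] = min(cost[i][m - 1] + interval_error(i, j) for i in range(j))
        return cost[n][k]

In practice, emp would be formed by counting sample frequencies; the point of the algorithms in [ILR12] is to achieve the same goal far more efficiently in both samples and running time.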
In [RX11] we have studied the problem of testing whether a distribution is k-wise independent over a general domain and have given non-trivial upper bounds. In [BFRV11] we have studied the problem of testing whether a distribution is monotone when the domain is an arbitrary partially ordered set. We have shown that the sample complexity can be nearly linear in the domain size for some partial orders; on the other hand, we have developed techniques for constructing testers whose complexity depends on parameters of the domain. These techniques give the best known testers for monotonicity of distributions over the Boolean cube.
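For intuition about the k-wise independence property studied in [RX11], the following naive check of the k = 2 case compares empirical joint frequencies with products of empirical marginals; it is only an illustration of the property, not the tester of [RX11], and its cost grows with the domain size. All names are illustrative.

    from collections import Counter
    from itertools import combinations

    def max_pairwise_dependence(samples):
        # samples: list of equal-length tuples drawn from the distribution.
        # Returns the largest gap between an empirical joint frequency and the
        # product of the corresponding empirical marginals over all coordinate
        # pairs; for a pairwise independent distribution this gap tends to 0.
        m = len(samples)
        n = len(samples[0])
        worst = 0.0
        for i, j in combinations(range(n), 2):
            marg_i = Counter(x[i] for x in samples)
            marg_j = Counter(x[j] for x in samples)
            joint = Counter((x[i], x[j]) for x in samples)
            for a in marg_i:
                for b in marg_j:
                    gap = abs(joint[(a, b)] / m - (marg_i[a] / m) * (marg_j[b] / m))
                    worst = max(worst, gap)
        return worst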
We have also achieved results that are of interest in the field of algorithms. In the setting of dynamic algorithms, in [OR10] we show how to maintain a large matching and a small vertex cover under deletions and insertions of edges such that the update time is polylogarithmic. Previous algorithms required update time polynomial in the size of the graph.
In [ORRR11] we have given a nearly optimal sublinear-time algorithm for approximating the size of a minimum vertex cover in a graph. Our running time is nearly linear in the average degree of the graph, improving on the bound of Yoshida et al., which required O(d^4) time.
In [RTVX11] we have developed a model of local algorithms which allows one to achieve consistent query access to parts of solutions of large optimization problems without having to compute the entire solution. We have given such local algorithms for problems including maximal independent set, radio broadcast coloring, hypergraph coloring, and k-CNF satisfiability.
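To convey the flavour of such local algorithms, the sketch below answers a single query "is vertex v in the maximal independent set?" by recursing only on lower-ranked neighbours, so that different queries give mutually consistent answers without ever building the whole set. This is a simple random-ordering sketch under our own naming, not the exact construction of [RTVX11].

    import random

    def in_mis(v, graph, rank, cache=None):
        # graph: dict mapping each vertex to a list of neighbours.
        # rank:  a fixed random priority per vertex, e.g.
        #        rank = {u: random.random() for u in graph}.
        # Vertex v joins the independent set exactly when no neighbour of
        # strictly smaller rank does, which mimics the greedy algorithm run
        # on the random order while touching only a small part of the graph.
        if cache is None:
            cache = {}
        if v in cache:
            return cache[v]
        answer = all(not in_mis(u, graph, rank, cache)
                     for u in graph[v] if rank[u] < rank[v])
        cache[v] = answer
        return answer

Because the random ranks are fixed once, repeated queries to different vertices always describe parts of one and the same maximal independent set.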
In [RRSW11] we have shown that one can achieve extremely efficient sublinear-time algorithms for estimating the influence of a monotone Boolean function.
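To fix what is being estimated, the total influence of a Boolean function f on {0,1}^n is n times the probability that flipping a random coordinate of a random input changes the value of f. The baseline sketch below estimates this quantity by direct sampling; it is shown only to illustrate the definition and is far less sample-efficient than the algorithm of [RRSW11] for monotone functions. The names are our own.

    import random

    def estimate_influence(f, n, num_samples=100000):
        # Total influence of f: n * Pr over a random input x and a random
        # coordinate i that flipping the i-th bit of x changes f(x).
        flips = 0
        for _ in range(num_samples):
            x = [random.randint(0, 1) for _ in range(n)]
            i = random.randrange(n)
            y = list(x)
            y[i] ^= 1                      # flip the i-th coordinate
            if f(x) != f(y):
                flips += 1
        return n * flips / num_samples

    # Example: majority of 3 bits has total influence 1.5.
    # print(estimate_influence(lambda x: int(sum(x) >= 2), 3))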