This work combines geometric data analysis with uncertainty by focusing on representing distributions of outputs, not just the most likely or expected values. This approach presents two main challenges, and this research has made major progress toward solutions on both fronts.

Representation. The first challenge is the representation of the uncertain data. This work was among the first in computational geometry to propose the locational model of uncertainty, in which the input is not a set of fixed points but a set of points, each described by a probability distribution over where the point might be. This general model can be discretized into an indecisive model, where each point takes one of a finite number of possible locations; this work showed how well the latter approximates the former. The indecisive model can also represent multiple views of each object of study, arising from repeated sensing observations or noisy measurements. By formalizing these models, it has become possible to ask and answer many geometric analytical questions on data in this form. Arguably, however, the answers to these questions should themselves reflect the uncertainty in the input. This research shows that such output representations are feasible and need not be too complex: their size depends only on the desired accuracy and the complexity of the views of the distribution, not on the size, richness, or skewness of the input data or its uncertainty. This result has enabled applications ranging from nearest-neighbor searching and extreme-point detection to medical imaging on noisy brain DTI data and distributed threshold monitoring of fluctuating weather data.
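To make the indecisive model concrete, the following minimal sketch (in Python; the names and parameters are illustrative assumptions, not taken from the papers) discretizes a locational model, where each point is a continuous distribution, into an indecisive one, where each point carries a finite set of candidate locations:

```python
import numpy as np

rng = np.random.default_rng(0)

def discretize(mean, cov, k):
    # Replace one continuous (locational) uncertain point by k sampled
    # candidate locations, yielding one indecisive point.
    return rng.multivariate_normal(mean, cov, size=k)

# Hypothetical input: three uncertain 2D points, each discretized to an
# indecisive point with k = 5 possible locations (a (5, 2) array per point).
uncertain_input = [
    discretize([0.0, 0.0], 0.1 * np.eye(2), k=5),
    discretize([2.0, 1.0], 0.3 * np.eye(2), k=5),
    discretize([1.0, 3.0], 0.2 * np.eye(2), k=5),
]
```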

Computation. Once represented, an additional challenge is to compute efficiently on this uncertain data, in particular to calculate these rich output representations. A general result shows that a simple randomized Monte Carlo process can create a discrete output distribution of the uncertainty with the guarantees described above; the samples underlying this distribution form a small representative subset of the data, and the construction of such a subset (a coreset) was the first such result for uncertain data. Another work shows that computing exact answers is much less efficient, for some problems provably so; it describes a dichotomy of hardness in which LP-type problems remain polynomial, but related non-LP-type problems are #P-hard.
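As an illustration of such a Monte Carlo process, the self-contained sketch below (hypothetical Python, not the authors' code; the indecisive input and the bounding-box statistic are assumptions chosen for illustration) repeatedly instantiates an uncertain input and records a geometric statistic, so the recorded values form a discrete approximation of the output distribution:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical indecisive input: three uncertain points, each restricted to a
# small set of candidate locations (the possible positions of that point).
indecisive_points = [
    np.array([[0.0, 0.0], [0.1, 0.2], [-0.2, 0.1]]),
    np.array([[2.0, 1.0], [2.3, 0.8]]),
    np.array([[1.0, 3.0], [0.9, 3.2], [1.2, 2.7]]),
]

def sample_instance(points):
    # Draw one concrete point set: each indecisive point independently picks
    # one of its candidate locations uniformly at random.
    return np.array([locs[rng.integers(len(locs))] for locs in points])

def bbox_diameter(pts):
    # A stand-in geometric statistic: diagonal of the axis-aligned bounding box.
    return float(np.linalg.norm(pts.max(axis=0) - pts.min(axis=0)))

# Monte Carlo: instantiate the uncertain input many times and record the
# statistic; the recorded values are a discrete output distribution.
samples = np.array([bbox_diameter(sample_instance(indecisive_points))
                    for _ in range(1000)])
print(np.quantile(samples, [0.05, 0.5, 0.95]))  # e.g., summarize with quantiles
```

Any other geometric statistic (for instance, the radius of a smallest enclosing ball) could be substituted for the bounding-box diameter without changing the structure of the process.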