Schedule for: 16w2693 - Robustness Theory and Methodology: Recent Advances and Future Directions
Beginning on Friday, September 2 and ending on Sunday, September 4, 2016
All times in Banff, Alberta time, MDT (UTC-6).
Friday, September 2 | |
---|---|
16:00 - 19:30 |
Check-in begins (Front Desk – Professional Development Centre - open 24 hours) ↓ Note: the Lecture rooms are available after 16:00. (Front Desk – Professional Development Centre) |
19:30 - 22:30 |
Wine and cheese social ↓ Wine and cheese are provided at no cost. Photos of Wiens and his students and colleagues will be on display in the background.
Beverages and a small assortment of snacks are also available from BIRS in the lounge on a cash honour system. (Corbett Hall Lounge (CH 2110)) |
Saturday, September 3 | |
---|---|
07:00 - 09:00 |
Breakfast ↓ A buffet breakfast is served daily between 7:00am and 9:00am in the Vistas Dining Room, the top floor of the Sally Borden Building. Note that BIRS does not pay for meals for 2-day workshops. (Vistas Dining Room) |
09:00 - 09:10 | Giseon Heo: Welcome (TCPL 201) |
09:10 - 10:00 |
Xiaojian Xu: Robustification in a Statistical Process ↓ Robustness in statistics can be defined as the ability of an inferential statistic or procedure to remain reliable and resistant under departures from assumptions. Since statistics is the science of a process of collecting data, analyzing data, drawing conclusions, and evaluating/re-evaluating the aforementioned process, robustification can be considered at any stage of a statistical process, such as robust design of experiments, robust sampling, robust estimation, and robust testing procedures. On the inference side, a brief summary will be given of robust statistics against outliers, small departures from the assumed parametric distribution, and possible misspecification of the assumed mean response structure and/or variance structure in regression models. In particular, the path of development and recent advances in robustification at the stage of designing an experiment will be discussed. (TCPL 201) |
10:00 - 10:30 | Coffee Break (TCPL Foyer) |
10:30 - 11:00 |
Matus Maciak: Flexibility and Robustness from ROBUST ↓ Flexibility and robustness are two important aspects to keep in mind when estimating a regression model: the model should be flexible enough to adapt to the underlying structure we want to estimate while, on the other hand, remaining free of complicated and unrealistic assumptions. There are of course many different approaches to achieving both. In our work we focus on change-point detection and estimation in nonparametric regression models while allowing for heavy-tailed error distributions and even dependent observations. In order to decide whether a change-point is statistically relevant for the model, we also introduce a statistical test which can be used to draw a proper decision. Finally, we present some examples showing that robustness and flexibility can indeed play a crucial role in regression estimation. (TCPL 201) |
11:00 - 11:30 |
Sanjoy Sinha: Robust designs for generalized linear mixed models ↓ Generalized linear mixed models (GLMMs) are commonly used for analyzing clustered, correlated discrete data, including binary and count responses, such as longitudinal data and repeated measurements. We explore techniques for the design of experiments, where the design issue is formulated as the choice of the values of the predictor(s) for GLMMs. We investigate sequential design methodologies for the case where the fitted model is possibly of an incorrect parametric form. We assess the performance of the proposed design using a small simulation study. (TCPL 201) |
11:30 - 13:00 |
Lunch ↓ A buffet lunch is served daily between 11:30am and 1:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building. Note that BIRS does not pay for meals for 2-day workshops. (Vistas Dining Room) |
13:00 - 13:20 |
Group Photo ↓ Meet in foyer of TCPL to participate in the BIRS group photo. The photograph will be taken outdoors, so dress appropriately for the weather. Please don't be late, or you might not be in the official group photo! (TCPL Foyer) |
13:30 - 14:00 |
Julie Zhou: Minimax regression designs and challenges ↓ I will briefly introduce optimality criteria for regression designs. When model misspecification is possible, a minimax approach can be used to construct designs that are robust against small model deviations. Several model misspecifications and some challenges in finding minimax designs will be discussed, and one application will be presented. (TCPL 201) |
14:00 - 14:30 |
Rui Hu: Robust Discrimination Designs over Hellinger Neighbourhoods ↓ To aid in the discrimination between two, possibly nonlinear, regression models, we study the construction of experimental designs. Since each of these two models might be only approximately specified, robust "maximin" designs are proposed. The rough idea is as follows. We impose neighbourhood structures on each regression response to describe the uncertainty in the specifications of the true underlying models. We determine the least favourable, in terms of Kullback-Leibler divergence, members of these neighbourhoods. Optimal designs are those maximizing this minimum divergence. Sequential, adaptive approaches to this maximization are studied, and asymptotic optimality is established. (TCPL 201) |
14:30 - 15:00 |
Zhichun Zhai: A Robust PCA-SVM-Based Feature Selection Method for Big Data Classification: A Genetic Algorithm Approach ↓ Feature selection has become indispensable for detecting true underlying patterns in big data that may also contain irrelevant or redundant information. When performing feature selection based on principal component analysis (PCA), the usual convention is to use only a few top principal components as the most representative. Some of these top components, however, might play little role in a support vector machine (SVM) classifier and thus do not necessarily yield efficient classification. It is therefore critical to devise a systematic approach that selects the principal components most relevant to the SVM classifier and then selects features based on those chosen components. This paper presents such a two-step feature selection scheme. First, using a genetic algorithm, we select a subset of the principal components with nonzero eigenvalues. Then, based on this subset, we calculate the contribution of each feature; from among the top-contributing features, we use a genetic algorithm to select a subset with relatively high SVM classification accuracy. This method incorporates PCA and genetic algorithms into feature selection and is both robust and efficient for datasets whose dimensionality is much larger than the number of instances, as is typical of big data, where the dimensionality is already considerably large. Furthermore, we test the effectiveness of this new approach on ten well-known datasets: simulation results show that our method improves classification accuracy on all of them. This is joint work with Rui Hu and Giseon Heo. (TCPL 201) |
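As a rough illustration of the component-selection step, here is a minimal NumPy sketch of a genetic algorithm searching over subsets of principal components. The fitness function is a placeholder (variance captured minus a size penalty) standing in for the SVM cross-validation accuracy used in the talk; all names, data, and settings here are our own assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 30 samples, 100 features (dimensionality >> sample size).
X = rng.normal(size=(30, 100))
Xc = X - X.mean(axis=0)

# PCA via SVD; keep all components with nonzero singular values.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
nonzero = s > 1e-10
eig = s[nonzero] ** 2          # eigenvalues (up to a 1/(n-1) factor)
n_pc = int(nonzero.sum())      # number of usable principal components

def fitness(mask):
    # Placeholder fitness: variance captured by the selected components
    # minus a small penalty per component, standing in for SVM accuracy.
    if mask.sum() == 0:
        return 0.0
    return eig[mask.astype(bool)].sum() / eig.sum() - 0.01 * mask.sum()

# Minimal genetic algorithm over binary masks of components.
pop = rng.integers(0, 2, size=(20, n_pc))
for gen in range(40):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[::-1][:10]]          # selection
    cuts = rng.integers(1, n_pc, size=10)
    children = np.array([np.concatenate([parents[i % 10][:c],
                                         parents[(i + 1) % 10][c:]])
                         for i, c in enumerate(cuts)])     # crossover
    flip = rng.random(children.shape) < 0.02
    children = np.where(flip, 1 - children, children)      # mutation
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(m) for m in pop])]           # selected subset
```

In the paper's actual scheme, the placeholder fitness would be replaced by the classification accuracy of an SVM trained on the projection of the data onto the selected components.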
15:00 - 15:30 | Coffee Break (TCPL Foyer) |
15:30 - 16:00 |
Ivor Cribben: A linear modeling framework for statistical inference on networks ↓ Existing approaches for conducting statistical inference between groups (patients and controls) of networks can be divided into two sets. Both begin by estimating graphs for each subject in both groups. The first set of methods carries out pairwise edge tests between the groups of patients and controls, while the second set summarizes the networks using a graph metric and then tests for the equality of the graph metrics between the two groups. However, these methods fail to test the entire network directly and do not take into account the (possible) temporal autocorrelation and heterogeneity of the data across subjects. In this work, we introduce a flexible general linear modeling framework for conducting statistical inference on networks that incorporates these features. This novel method accounts for the temporal autocorrelation in a nonparametric and subject-specific manner, and estimates subject-specific variances using iterative least squares and residual maximum likelihood estimation. We apply the new model to a resting-state functional magnetic resonance imaging (fMRI) study to compare the brain networks of typical and reading-impaired young adults, in order to characterize the resting-state networks related to reading processes. We also compare the performance of the model to methods that do not account for temporal autocorrelation or heterogeneity across subjects, using an extensive simulation study. (TCPL 201) |
16:00 - 16:30 |
Pengfei Li: Sample-size calculation for tests of homogeneity ↓ Mixture models are widely used to explain excessive variation in observations that is not captured by standard parametric models, and they lead to suggestive latent structures. The hypothetical latent structure often needs critical examination based on experimental data. It is therefore important to know the sample size needed to ensure a reasonable chance of success. We investigate this issue for the EM-test. We obtain a simple sample-size formula and an associated simulation-based calibration procedure, and we demonstrate via data examples and simulation studies that they provide useful guidance for several common mixture models. (TCPL 201) |
16:30 - 17:00 |
Zhide Fang: Cross-annotation in NGS metagenomic functional profiling and correction ↓ Accurate functional profiling is one of the important steps in many metagenomic studies. Profiling approaches based on read counts from next-generation sequencing techniques may suffer from the problem of cross-annotation. We propose a statistical method to address cross-annotation in the functional profiling of a metagenome. Applications to in vitro-simulated metagenomic samples, samples simulated by a bioinformatic tool, and a real-world dataset show that the method successfully corrects the cross-annotations. (TCPL 201) |
17:30 - 20:30 |
Dinner ↓ Address: 211 Banff Ave, Banff
Phone: 403-985-6688 (Bamboo Garden in Banff) |
Sunday, September 4 | |
---|---|
07:00 - 09:00 | Breakfast (Vistas Dining Room) |
09:00 - 09:20 |
Lucy Gao: Distributionally Robust Multinomial Logistic Regression ↓ Multinomial logistic regression and its regularized variants are popular multi-class classifiers in fields such as neuroimaging and genome sciences. However, as real data are often noisy and/or contaminated, robust statistical methods are desirable. Building on Shafieezadeh-Abadeh et al. (2015), which proposes distributionally robust logistic regression, we propose distributionally robust multinomial logistic regression (DRMLR). We define a ball of probability distributions, under the Wasserstein metric, centred on the empirical distribution of the training samples, and then minimize the worst-case expected log loss, with the expectation taken with respect to distributions in the Wasserstein ball. DRMLR can be fit via a tractable convex reformulation of the optimization problem, and distributionally robust estimates of its misclassification rate can likewise be found by solving tractable convex optimization problems. DRMLR contains classical logistic regression, a form of regularized multinomial logistic regression, and distributionally robust logistic regression as special cases. (TCPL 201) |
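The worst-case objective described in the abstract can be sketched, in our own notation (not necessarily the authors'), as

```latex
\hat{\theta} \in \arg\min_{\theta}\; \sup_{Q \,:\, W(Q,\, \hat{P}_n) \le \varepsilon} \mathbb{E}_{Q}\!\left[\, \ell_{\theta}(X, Y) \,\right],
```

where $\hat{P}_n$ is the empirical distribution of the $n$ training samples, $W$ is the Wasserstein metric, $\varepsilon$ is the radius of the Wasserstein ball, and $\ell_{\theta}$ is the multinomial log loss with parameter $\theta$.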
09:20 - 09:40 |
Yue Yin: Minimax design criterion for fractional factorial designs ↓ In this talk we consider an A-optimal minimax design criterion for mixed-level fractional factorial designs. The linear model usually includes all the main effects and some specified interactions among the factors, and we use a requirement set to denote those effects. The A-optimal minimax criterion minimizes the maximum trace of the mean squared error matrix of the least squares estimator of the effects in the model, where the maximum is taken over small possible departures from the requirement set. The A-optimal minimax design is therefore robust against misspecification of the requirement set. Various design properties will be presented for two-level and mixed-level fractional factorial designs. An example is given comparing A-optimal, D-optimal, E-optimal, A-optimal minimax and D-optimal minimax designs. (TCPL 201) |
09:40 - 10:00 |
Nadezda Frolova: Estimation of extreme value dependence in time series data ↓ In recent years there has been increasing interest in extreme value analysis, which is particularly useful for studying financial and climate data. Various statistical methods have been developed for estimating extreme value dependence in time series, and the field is still growing. In my presentation, I focus on two approaches: the extremogram and cross-extremogram, and quantile regression. I use both to study the dependence between high spot electricity prices across the states included in Australia's National Electricity Market. (TCPL 201) |
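For readers unfamiliar with the extremogram, a minimal empirical version of the upper-tail extremogram, P(X_{t+h} > u | X_t > u) with u a high sample quantile, can be sketched as follows. This is a toy illustration with simulated data, not the presenter's code:

```python
import numpy as np

def sample_extremogram(x, lags, q=0.95):
    """Empirical upper-tail extremogram: for each lag h, the fraction of
    exceedances of the q-th sample quantile that are followed, h steps
    later, by another exceedance."""
    x = np.asarray(x, dtype=float)
    u = np.quantile(x, q)
    exceed = x > u
    out = {}
    for h in lags:
        joint = np.sum(exceed[:-h] & exceed[h:])  # both t and t+h exceed
        base = np.sum(exceed[:-h])                # t exceeds
        out[h] = joint / base if base > 0 else np.nan
    return out

# Toy example: an AR(1) series has clustered extremes, so the
# extremogram at small lags exceeds the unconditional tail probability.
rng = np.random.default_rng(1)
n = 20000
eps = rng.standard_normal(n)
x = np.empty(n)
x[0] = eps[0]
for t in range(1, n):
    x[t] = 0.8 * x[t - 1] + eps[t]

rho = sample_extremogram(x, lags=[1, 2, 5, 10])
```

For an independent series the extremogram at every positive lag would be close to 1 - q; values well above that indicate extremal clustering, which is exactly the dependence structure of interest in electricity price spikes.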
10:00 - 10:30 | Coffee Break (TCPL Foyer) |
10:30 - 10:50 |
Matthew Pietrosanu: Point Clouds and Heatmaps: A Practical Approach to Multidimensional Persistent Homology for Robust Shape Recognition ↓ Persistent homology is a technique in algebraic and computational topology useful for recovering the underlying topology of a given dataset, and has a wide range of applications in computer vision, statistics, genomics, and beyond. This technique suffers, however, from its inability to simultaneously consider multiple parameters describing a dataset, a mathematically difficult and unsolved problem. In particular, this prevents persistent homology from distinguishing between distinct yet topologically equivalent shapes, such as circles and ellipses, that could otherwise be differentiated by examining both scale and curvature.
In this presentation, we put forth a novel extension of persistent homology to two parameters, which we call Heatmap Pseudo-Bifiltration. Furthermore, we develop a robust statistical test to detect differences between point-cloud datasets on the basis of scale, curvature, and topological structure. The effect of sampling variability and noise on the results of this technique will be examined, particularly in the context of point-cloud curvature estimation in arbitrary dimensions. (TCPL 201) |
10:50 - 11:10 |
Yi Zhou: Persistent Homology on Time Series ↓ In this study, we applied topological data analysis (TDA) and the theory of persistent homology to time series. We computed cross-correlation matrices and the partial correlation matrix of multivariate time series and transformed them into distance matrices. After applying persistent homology to the distance matrices, we solved unsupervised learning problems (investigating clusters and loops) for well water level data, and supervised learning problems for polysomnography (PSG) data, building a model to predict the Obstructive Apnea-Hypopnea Index (OAHI, 3% desaturation) of new incoming participants. These solutions are based on the topological features of the datasets, and the model proved effective in prediction. (TCPL 201) |
11:10 - 11:30 |
Berhanu Wubie: Clustering Survival Data using Random Forest and Persistent Homology ↓ Clustering survival data to assess patients' survival experience is important and critical for clinicians seeking to provide good quality of life for patients. We cluster using random forest (RF) based on partitioning around medoids, and using persistent homology (PH), a topological data analysis technique, for cluster identification and feature extraction. In this work we considered two different datasets: the kidney and liver data. Both methods work well at identifying groups of patients with different survival experiences, accounting for factors associated with the prediction of survival experience. The clusters formed were assessed and found to differ significantly in their survival experience. Further investigation of feature extraction using PH revealed that patients in some groups showed poorer survival experience than patients in other groups. Clustering using RF and PH results in a promising exploration of groups within patients that gives insight into patient handling and valuable information on how to provide better quality service to patients who need more attention. (TCPL 201) |
11:30 - 12:00 |
Checkout by Noon ↓ 2-day workshop participants are welcome to use BIRS facilities (Corbett Hall Lounge, TCPL, Reading Room) until 15:00 on Sunday, although participants are still required to checkout of the guest rooms by 12 noon. There is no coffee break service on Sunday afternoon, but self-serve coffee and tea are always available in the 2nd floor lounge, Corbett Hall. (Front Desk – Professional Development Centre) |