[3IA Young Women Researchers] AI with Aude Sportisse

Published on March 8, 2022. Updated on March 11, 2022.


What is your research topic?

My research topic broadly concerns the development of machine learning techniques. Since October 2021, I have been working on semi-supervised learning with Charles Bouveyron and Pierre-Alexandre Mattei in the Maasai team at the Inria centre at Université Côte d’Azur. Previously, I did a PhD thesis on the handling of missing values under the supervision of Claire Boyer and Julie Josse at Sorbonne Université.



Could you briefly explain it?

Missing values can be thought of as unavailable entries in a data set, and are often coded as NA (for Not Available) in programming languages. They can occur for many reasons: unanswered questions in a survey, lost data, sensing machines that fail, aggregation of multiple sources, etc. Classical statistical methods cannot be directly applied to data sets that contain missing values (just keep in mind that computing NA + 1 is impossible). In data science, most people delete all the rows (or all the columns) that contain missing values. Most of the time, this naive strategy is unfortunately not suitable: (i) deleting entire rows or columns can cause a huge loss of information, and (ii) it is rare that a sub-population of the data is representative of the general population. This last point raises the question of the information contained in a missing value. If the process that causes the data to be missing, called the missing-data mechanism, depends on the data values themselves, the missing values are said to be informative. This is, for example, the case when rich people are less inclined to reveal their income. My research work focuses on this case of missing values, which is the most realistic but also the most challenging one.
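
To make this concrete, here is a toy sketch in Python (where pandas encodes NA as NaN); the small table is invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy survey data: NaN plays the role of NA (Not Available)
df = pd.DataFrame({
    "age":    [25,    np.nan, 40,     31],
    "income": [30000, 52000,  np.nan, np.nan],
})

print(np.nan + 1)   # nan: arithmetic with a missing value is undefined
print(df.dropna())  # naive deletion: only 1 of the 4 rows survives
```

If high incomes were more often missing than low ones, the surviving rows would under-represent rich respondents: this is exactly the informative case described above.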


Can you illustrate with an example?  

In my PhD work, I focused on developing methods that handle informative missing values in supervised learning, when we have access to an outcome variable (for example, one indicating the health status of a patient) and many features (clinical measurements), and in unsupervised learning, when we only have access to the features.
My postdoctoral work explores a new dimension of the problem of informative missing values. In semi-supervised learning, we have access to the features but the outcome variable is missing for part of the data. In real life, although the amount of available data is often huge, labeling the data is costly and time-consuming. This is particularly true for image data sets: images are available in large quantities in image banks, but they are most of the time unlabeled. It is therefore necessary to ask experts (doctors, if they are medical images) to label them (assign them a class, an output variable). In this context, people are more inclined to label images of classes that are easy to recognize. The unlabeled data are thus informative missing values, because the unavailability of the labels depends on the values of the labels themselves. Typically, the goal of semi-supervised learning is to learn predictive models using all the data (labeled and unlabeled). However, classical methods do not take the missing-data mechanism into account and lead to biased estimates if the missing values are informative. We aim to design new semi-supervised algorithms that come with theoretical guarantees and that handle possibly informative missing labels.
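
As a rough sketch of this setting (of the setting only, not of our algorithms), one can simulate informative missing labels and feed them to a classical semi-supervised method that ignores the mechanism, here scikit-learn's SelfTrainingClassifier; the missingness probabilities are made up for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Informative missing labels: class 1 ("hard to recognize") is left
# unlabeled far more often than class 0, so missingness depends on
# the label value itself.
rng = np.random.default_rng(0)
p_missing = np.where(y == 1, 0.9, 0.3)
y_observed = np.where(rng.random(500) < p_missing, -1, y)  # -1 = unlabeled

# A classical method that ignores the mechanism; with informative
# missing labels, its estimates are biased towards the well-labeled class.
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000)).fit(X, y_observed)
print((model.predict(X) == y).mean())  # accuracy against the true labels
```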


Can you tell us about an important result?

One of the key results in the missing-data literature is that the missing-data mechanism should be taken into account in the statistical analysis when the missing values are informative, whereas it can be ignored for the other types of missing values. This can be done by modeling the mechanism directly, but that often requires strong assumptions on the data and can lead to time-consuming algorithms. It can also be done by accounting for the mechanism indirectly, for example by adding the locations of the missing values as extra information in the statistical analysis.
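
A minimal sketch of that indirect route (a generic illustration, not a specific published method), reusing an invented toy table: the missingness mask is appended to the features so that a downstream model can exploit where the values are missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25,    np.nan, 40,     31],
    "income": [30000, 52000,  np.nan, np.nan],
})

# The mask records the locations of the missing values
mask = df.isna().astype(int).add_suffix("_missing")

# Simple mean imputation for the values, plus the mask as extra features
X = pd.concat([df.fillna(df.mean()), mask], axis=1)
print(X)
```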


What are the challenges related to this topic?

The main challenge in my research field is to find solutions that are both theoretically sound and suitable for real data applications. An ideal algorithm would be one that handles heterogeneous data (i.e., continuous and categorical variables) containing different types of missing values, for several purposes (prediction, imputation, clustering). To achieve this, one of the key issues is knowing how to test the missing-data mechanism, because in a real data set, even after discussions with experts, it may be difficult to distinguish informative missing values, which require a specific treatment, from the others. Of course, there is also a real challenge in terms of computing time and accessibility of the algorithms. For example, for applications in medicine, diagnoses must be made quickly and doctors must have an easy-to-use application. Finally, the algorithms must be safe and interpretable; an important challenge is also to be able to quantify the errors made by a machine learning algorithm.


What are the real-world impacts, issues?

Advances in the handling of missing values can have a huge impact in the real world. Missing data are present in almost every real data set, simply because it is very difficult to collect clean data.
For example, my PhD work was motivated by a real data set, the Traumabase data set, which contains clinical measurements (250 variables) on 20,000 traumatized patients, i.e. patients who have suffered a trauma (car accident, fall, etc.). The aim is to assist medical doctors in their decision-making in emergency situations. For example, given a patient's pre-hospital features, could we predict the risk of a hemorrhagic shock? This data set raises many other challenges, such as identifying relevant groups of patients sharing similarities, or imputing the missing values to get a complete data set.
My postdoctoral work is also motivated by a public health application: predicting and monitoring the response to cancer treatment, using a clinical data set involving both clinical and biological tabular data and medical images. As it is time-consuming for doctors to label each medical image, the data set contains both labeled and unlabeled images. By combining image classification with other tasks on different sources (including tabular data sets), the final goal is to predict whether the treatment will be beneficial for the patient or not. Predicting the response to treatment is essential for practicing personalized medicine and is therefore a key ingredient to improve the well-being of patients, especially for cancer, whose treatments are very burdensome.