[3IA Young Women Researchers] AI with Giulia Marchello

Published on March 8, 2022 Updated on March 10, 2022

What is your research topic?

My research project involves the development of statistical learning methods for dynamic co-clustering of high-dimensional, heterogeneous count data, with an application to pharmacovigilance.


Could you briefly explain it?

In data science there is an increasing need to group the observations and the features of high-dimensional sparse matrices simultaneously. Machine learning models that perform this kind of operation are called co-clustering methods and are used in a wide range of applications (e.g. natural language processing, recommender systems, biomedical data). However, although this field has been greatly expanded by many notable methods introduced in the last few decades, the development of dynamic co-clustering methods remains almost unexplored territory. Count data modeling is another area that is still underdeveloped but whose importance is gradually increasing in a number of domains. Therefore, the goal of my research project is to develop a dynamic co-clustering method that can automate the analysis process by identifying deep changes that may occur in the evolution of the data structure or in the way existing groups interact, allowing fast visualization and interpretation of the results.


Can you illustrate with an example?  

The application area that we chose for this project is pharmacovigilance, whose main activity is the detection of safety signals about drugs. The method currently used, i.e. manual expert detection of safety signals, although unavoidable, has the disadvantage of being incomplete because of the workload it involves and of requiring a significant amount of data before a critical event can be detected. This is why developing automatic methods for safety signal detection is currently a major issue in pharmacovigilance. Since adverse effect notifications can be viewed as count data observed over time, co-clustering may play an important role in summarizing the information carried by pharmacovigilance data and in identifying patterns of interest. It would indeed be of interest to cluster both the drugs and the adverse reactions along the temporal dimension to assist medical experts in the retrospective detection of safety signals. For this reason, in this research project we investigate the use of model-based co-clustering as a tool for automatic safety signal detection.


Can you tell us about an important result?

To demonstrate the value of this approach for pharmacovigilance, we ran a large-scale retrospective experiment on a 10-year dataset of adverse drug reactions (ADRs) from the Regional Center of Pharmacovigilance (RCPV), located at the University Hospital of Nice (France). The center covers an area of over 2.3 million inhabitants and receives notifications about ADRs through different channels: a website form that anyone can freely fill in and send, phone calls, emails, medical visits at the hospital units, etc. A time horizon of about 10 years is considered, from January 1st, 2010 to September 30th, 2020, with the month as the unit of time. The overall dataset is made up of 44,269 declarations, each reporting the market name of the drug, the notified adverse effect, the channel used for the declaration and its origin, as well as an identification number and the reception date. The resulting dataset contains 542 drugs, 586 ADRs and 129 months, with 13,363 non-zero entries.
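To make the shape of this dataset concrete, here is a minimal sketch of how declarations could be turned into sparse monthly counts. It is only an illustration: the record layout `(drug, adr, reception_date)`, the function names and the toy declarations are assumptions, not the actual RCPV data format. It also assumes that the 13,363 non-zero entries refer to cells of the drugs × ADRs × months array, in which case only about 0.03% of the cells are non-zero.

```python
from collections import Counter
from datetime import date

# Start of the observation window stated in the text.
START = date(2010, 1, 1)

def month_index(d, start=START):
    """Zero-based number of months between `start` and the date `d`."""
    return (d.year - start.year) * 12 + (d.month - start.month)

def build_counts(declarations):
    """Count declarations per (drug, ADR, month) cell.

    `declarations` is an iterable of (drug, adr, reception_date) tuples.
    This layout is hypothetical. Only non-zero cells are stored.
    """
    counts = Counter()
    for drug, adr, received in declarations:
        counts[(drug, adr, month_index(received))] += 1
    return counts

# Toy declarations, invented for illustration only.
declarations = [
    ("drug_A", "headache", date(2010, 1, 15)),
    ("drug_A", "headache", date(2010, 1, 28)),
    ("drug_B", "nausea", date(2020, 9, 30)),
]
counts = build_counts(declarations)

# Number of monthly intervals between January 2010 and September 2020.
n_months = month_index(date(2020, 9, 30)) + 1

# Share of non-zero cells, using the dimensions reported in the text.
density = 13_363 / (542 * 586 * 129)
```

Storing only the non-zero cells is a natural choice here, since a dense array of this size would be almost entirely zeros.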

The interest of this application lies not only in summarizing the massive amount of adverse reaction data but also in identifying possible unexpected phenomena, such as atypical side effects of certain types of drugs. During the year 2017, the flow of notifications to the pharmacovigilance center behaved in an extremely unusual way. In that year, there was an unexpected rise in ADR reports concerning two specific drugs: Mirena® and Lévothyrox®. Media coverage of Mirena® peaked in May 2017, which resulted in a massive wave of ADR reports from patients to French RCPVs. In addition, spontaneous reports about Lévothyrox® represented almost 90% of all the spontaneous notifications that the Nice center received from patients in 2017. One can understand how difficult it is to work with such data, which contain signals of very different amplitudes. Indeed, behind those very visible effects, many ADR signals need to be detected for obvious public health reasons. This application revealed that, in addition to the Lévothyrox® health crisis, which had the widest media coverage, other major events occurred. The first one concerns the Médiator® health crisis, which took place in 2009-2010. Other unexpected variations in notifications were also detected, such as an under-notification of bleeding-related ADRs during the Lévothyrox® crisis. In conclusion, we were able not only to identify clusters that are coherent with retrospective knowledge, such as the Lévothyrox® and Mirena® crises, but also to detect an under-notification of bleeding ADRs during the Lévothyrox® crisis, of which health professionals were unaware.


What are the challenges related to this topic?

The results discussed in the previous questions were obtained by fitting a model called the dynamic latent block model (dLBM). This method was particularly useful in identifying structural changes or patterns of interest in the data. However, the main limitation of dLBM is that it can be used only as a retrospective tool for medical authorities to recognize drug and adverse effect behaviors that might be missed by the human eye, even an expert one. To overcome this limitation, we are developing an online model that will be able to run on the flow of ADR declarations. This model, named stream-dLBM, will be able to detect deep changes in the evolution of declarations in near real time. One of the fundamental assumptions of stream-dLBM is that each element is allowed to change its cluster membership over the considered time period. As a consequence, the interpretation of the results is highly intuitive: a change in the affiliation of elements to clusters means that a breaking point has been detected, which opens the way for further analysis to identify the reason. The application of stream-dLBM would allow the early detection of any atypical and/or unnatural development in the progression of reports and might help experts summarize, in an automated and easily interpretable way, the large number of adverse effect reports received periodically.
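The interpretation rule described above (a change of cluster membership signals a breaking point) can be sketched in a few lines. This is not the stream-dLBM inference itself, only an illustration of how its output could be read. The input format (one dictionary per time step, mapping each element to a cluster label), the function name and the toy labels are assumptions, and the sketch assumes that cluster labels are comparable from one time step to the next.

```python
def membership_changes(assignments):
    """Return the time steps at which at least one element changes cluster.

    `assignments` is a list with one dict per time step, mapping an element
    (a drug or an ADR) to its cluster label. This format is hypothetical.
    The result is a list of (time_step, {element: (old_label, new_label)}).
    """
    changes = []
    for t in range(1, len(assignments)):
        previous, current = assignments[t - 1], assignments[t]
        # Compare only elements observed at both consecutive time steps.
        moved = {
            element: (previous[element], label)
            for element, label in current.items()
            if element in previous and previous[element] != label
        }
        if moved:
            changes.append((t, moved))
    return changes

# Toy example, invented for illustration: drug_A switches cluster at step 2.
assignments = [
    {"drug_A": 0, "drug_B": 1},
    {"drug_A": 0, "drug_B": 1},
    {"drug_A": 2, "drug_B": 1},
]
breaks = membership_changes(assignments)
```

Each reported time step is a candidate breaking point that an expert could then examine to understand why the element moved.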


What are the real-world impacts, issues?

As mentioned earlier, my research work involves the application of dynamic co-clustering methods to pharmacovigilance. For this reason, one of our main goals is to develop methods able to automate the process of signal detection related to the reporting of adverse effects of drugs and vaccines. Through the development of model-based co-clustering methods, we try to provide results that are as interpretable as possible in order to meet the needs of medical experts. One issue we encountered when dealing with the ADR dataset is drug annotation. We sometimes found the same medicine annotated in different ways. Also, several medicines contain the same active ingredient and should therefore have approximately the same side effects. To overcome this issue and to prevent the same medicine from being counted more than once when reported under slightly different names, we decided to use the name of the molecule of the drug, namely the international nonproprietary name (INN).
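As an illustration of this harmonization step, the sketch below maps differently annotated market names to a single INN. The lookup table, the function names and the normalization rules (removing accents, symbols and dosage, then keeping the first word) are assumptions made for this example and would be too crude for real data. An actual pipeline would rely on a complete drug reference database.

```python
import re
import unicodedata

# Small illustrative lookup from normalized market name to INN.
# A real application would use a full drug reference database.
BRAND_TO_INN = {
    "levothyrox": "levothyroxine",
    "mirena": "levonorgestrel",
    "mediator": "benfluorex",
}

def normalize(name):
    """Lowercase a drug name, drop accents, symbols and digits, keep the first word."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    letters_only = re.sub(r"[^a-z ]", " ", without_accents.lower())
    tokens = letters_only.split()
    return tokens[0] if tokens else ""

def to_inn(name):
    """Return the INN for a market name, or the normalized name if unknown."""
    key = normalize(name)
    return BRAND_TO_INN.get(key, key)

# Three annotations of the same medicine collapse to one molecule name.
variants = ["Lévothyrox®", "LEVOTHYROX 50 µg", "levothyrox"]
inns = {to_inn(v) for v in variants}
```

With such a mapping applied before counting, reports filed under slightly different names contribute to the same row of the data.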