We are pleased to share new publications by 3IA Côte d’Azur researchers.
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2026, Denver (USA)
- Illustrator’s Depth: Monocular Layer Index Prediction for Image Decomposition by N. Maruani (3IA Ph.D.), P. Zhang, S. Chaudhuri, M. Fisher, N. Zhao, V. G. Kim, P. Alliez (3IA Chairholder), M. Desbrun, W. Yifan
Abstract: We introduce Illustrator's Depth, a novel definition of depth that addresses a key challenge in digital content creation: decomposing flat images into editable, ordered layers. Inspired by an artist's compositional process, illustrator's depth assigns a layer index to each pixel, forming an interpretable image decomposition through a discrete, globally consistent ordering of elements optimized for editability. We also train a neural network on a curated dataset of layered vector graphics to predict layering directly from raster inputs. Our layer index inference unlocks a range of powerful downstream applications. In particular, it significantly outperforms state-of-the-art baselines for image vectorization while also enabling high-fidelity text-to-vector-graphics generation, automatic 3D relief generation from 2D images, and intuitive depth-aware editing. By reframing depth from a physical quantity to a creative abstraction, illustrator's depth prediction offers a new foundation for editable image decomposition.
Read the paper
- MoVie: Broaden Your Views with Human Motion for Action Detection by D. Yang, M. Ali, X. Yu, X. Shen, Q. Kong, G. Francesca, F. Brémond (3IA Chairholder)
Abstract: Human action detection in videos requires both semantic recognition and accurate modeling of motion. While recent video foundation models have advanced visual semantics, they still struggle to capture complex and compositional actions due to their limited ability to represent motion.
Human skeleton sequences, which explicitly describe the body structure and movement, provide valuable physical and geometric motions that complement RGB videos. However, combining video and skeleton modalities faces two key challenges: (i) label-driven skeleton features are too coarse to describe fine-grained motion and (ii) skeleton motion and RGB video lie in heterogeneous feature spaces, so current fusion strategies often cause feature interference.
To address these, we propose MoVie, a unified Motion-Video processing framework that uses structured human motion as a bridge between the two signals. We first propose a Structural Motion Projection module that decomposes motion into primitive components using a learnable motion dictionary, to produce fine-grained descriptors. Then, we design a Motion-guided Feature Regularization mechanism that aligns visual features with motion through an orthogonality-based transformation, so that fine-grained motion cues can guide visual representations without collapsing semantic diversity. Extensive evaluations on Toyota Smarthome Untrimmed, Charades, Multi-THUMOS and PKU-MMD datasets demonstrate that MoVie significantly improves state-of-the-art action detection performance.
Read the paper / With supplementary
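As a rough illustration of the primitive-decomposition idea behind MoVie's Structural Motion Projection, the sketch below projects a toy skeleton sequence onto a fixed random dictionary via least squares. All names, sizes, and the least-squares solver are illustrative stand-ins, not the paper's learnable module:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: T frames of J-joint 3D poses flattened to D = J*3 dims.
T, D, K = 16, 12, 4                        # K atomic motion primitives (toy sizes)
primitives = rng.standard_normal((K, D))   # stand-in for a learned motion dictionary
motion = rng.standard_normal((T, D))       # a raw skeleton motion sequence

# Project each frame onto the primitive dictionary (least squares), giving
# per-frame coefficients: a fine-grained descriptor of "how much" of each
# primitive the observed motion contains.
coeffs, *_ = np.linalg.lstsq(primitives.T, motion.T, rcond=None)
coeffs = coeffs.T                          # shape (T, K)

reconstruction = coeffs @ primitives       # approximate the motion from primitives
print(coeffs.shape)                        # (16, 4)
```

In the paper the dictionary is learned end-to-end; here a fixed random one merely shows the shape of the decomposition.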
- PRISM: Learning a Shared Primitive Space for Transferable Skeleton Action Representation by D. Yang, Y. Wang, S. Shao, F. Brémond (3IA Chairholder), J. Wang
Abstract: Real-world human action understanding remains challenging due to long-tailed label distributions, compositional motion patterns, and viewpoint variations. Existing skeleton-based methods often lack a structured and transferable representation of motion, and task-specific models for generation, classification, and detection are usually trained independently, resulting in fragmented pipelines and limited cross-task generalization.
We present PRISM, a PRImitive-centric Skeleton Modeling framework that learns a shared motion representation from a motion generation objective and transfers it to perception tasks. PRISM represents each action sequence as a trajectory in a primitive coefficient space, which captures how a set of learned atomic motion primitives contribute to the observed motion.
A structured decomposition module learns this representation in a physically grounded and view-invariant manner via motion generation. Instead of enforcing joint or unified training across tasks, PRISM provides a single primitive-centric representation that can be sequentially transferred to downstream classification and frame-wise detection through lightweight task heads.
This representation introduces structure, compositionality, and improved generalization across distinct supervisions. PRISM consistently improves performance on long-tailed and multi-label datasets and enables interpretable reasoning over compositional and rare actions. Extensive experimental results show that the structured primitive space serves as a transferable and robust foundation for diverse action understanding tasks in real-world datasets.
Read the paper
- B-MoE: A Body-Part-Aware Mixture-of-Experts 'All Parts Matter' Approach to Micro-Action Recognition by N. Poddar, A. Reka, D-L. Borza, S. Majhi, M. Balazia, A. Das, F. Brémond (3IA Chairholder)
Abstract: Micro-actions (fleeting, low-amplitude motions such as glances, nods, or minor posture shifts) carry rich social meaning but remain difficult for current action recognition models to recognize due to their subtlety, short duration, and high inter-class ambiguity. In this paper, we introduce B-MoE, a Body-part-aware Mixture-of-Experts framework designed to explicitly model the structured nature of human motion. In B-MoE, each expert specializes in a distinct body region (head, body, upper limbs, lower limbs), and is based on the lightweight Macro–Micro Motion Encoder (M3E) that captures long-range contextual structure and fine-grained local motion. A cross-attention routing mechanism learns inter-region relationships and dynamically selects the most informative regions for each micro-action. B-MoE uses a dual-stream encoder that fuses these region-specific semantic cues with global motion features to jointly capture spatially localized cues and temporally subtle variations that characterize micro-actions. Experiments on three challenging benchmarks (MA-52, SocialGesture, and MPII-GroupInteraction) show consistent state-of-the-art gains, with improvements in ambiguous, underrepresented, and low-amplitude classes.
Read the paper
29th Annual Conference on Artificial Intelligence and Statistics (AISTATS), May 2026, Tangier (Morocco)
- Reconciling Communication Compression and Byzantine-Robustness in Distributed Learning by D. Gupta, A. Honsell, C. Xu, N. Gupta, G. Neglia (3IA Chairholder)
Abstract: Distributed learning enables scalable model training over decentralized data, but remains hindered by Byzantine faults and high communication costs. While both challenges have been studied extensively in isolation, their interplay has received limited attention. Prior work has shown that naively combining communication compression with Byzantine-robust aggregation can severely weaken resilience to faulty nodes. The current state-of-the-art, Byz-DASHA-PAGE, leverages a momentum-based variance reduction scheme to counteract the negative effect of compression noise on Byzantine robustness. In this work, we introduce RoSDHB, a new algorithm that integrates classical Polyak momentum with a coordinated compression strategy. Theoretically, RoSDHB matches the convergence guarantee of Byz-DASHA-PAGE under the standard (G,B)-gradient dissimilarity model, but relies on milder assumptions. Empirically, RoSDHB demonstrates stronger robustness while achieving substantial communication savings compared to Byz-DASHA-PAGE.
Read the paper
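To make the ingredients concrete, here is a minimal toy round combining classical Polyak (heavy-ball) momentum, top-k compression, and a coordinate-wise median aggregator, a standard Byzantine-robust choice. This is an illustrative assembly of the named components, not RoSDHB itself, and every parameter is invented:

```python
import numpy as np

def top_k_compress(v, k):
    """Keep the k largest-magnitude coordinates, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def polyak_momentum(m_prev, grad, beta=0.9):
    """Classical heavy-ball momentum update on a worker's local gradient."""
    return beta * m_prev + (1.0 - beta) * grad

# Toy round: 5 honest workers, 2 Byzantine workers, 8-dimensional model.
rng = np.random.default_rng(1)
dim, k = 8, 3
honest = [rng.standard_normal(dim) + 1.0 for _ in range(5)]  # gradients near 1
byzantine = [np.full(dim, 100.0) for _ in range(2)]          # adversarial inputs

momenta = [polyak_momentum(np.zeros(dim), g) for g in honest + byzantine]
compressed = [top_k_compress(m, k) for m in momenta]

# Coordinate-wise median: one standard Byzantine-robust aggregation rule.
update = np.median(compressed, axis=0)
print(update)
```

The point of the toy: momentum smooths local gradients, compression cuts communication, and the robust aggregator keeps the two adversarial workers from dominating the update.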
- Stick-Breaking Embedded Topic Model with Continuous Optimal Transport for Online Analysis of Document Streams by F. Granese, S. Villata, C. Bouveyron
Abstract: Online topic models are unsupervised algorithms to identify latent topics in data streams that continuously evolve over time. Although these methods naturally align with real-world scenarios, they have received considerably less attention from the community compared to their offline counterparts, due to specific additional challenges. To tackle these issues, we present SB-SETM, an innovative model extending the Embedded Topic Model (ETM) to process data streams by merging models formed on successive partial document batches. To this end, SB-SETM (i) leverages a truncated stick-breaking construction for the topic-per-document distribution, enabling the model to automatically infer from the data the appropriate number of active topics at each timestep; and (ii) introduces a merging strategy for topic embeddings based on a continuous formulation of optimal transport adapted to the high dimensionality of the latent topic space. Numerical experiments show SB-SETM outperforming baselines on simulated scenarios. We extensively test it on a real-world corpus of news articles covering the Russian-Ukrainian war throughout 2022-2023.
Read the paper
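The truncated stick-breaking construction mentioned in point (i) can be sketched in a few lines; `alpha` and the truncation level `K` below are illustrative hyperparameters, not values from the paper:

```python
import numpy as np

def stick_breaking(alpha, K, rng):
    """Truncated stick-breaking: draw K-1 Beta(1, alpha) fractions and
    convert them into a distribution over (at most) K topics."""
    v = rng.beta(1.0, alpha, size=K - 1)
    v = np.append(v, 1.0)                 # last piece takes the remainder
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining                  # topic proportions, summing to 1

rng = np.random.default_rng(0)
pi = stick_breaking(alpha=2.0, K=10, rng=rng)
print(pi.sum())                           # 1.0 (up to float error)
```

In SB-SETM this construction parameterizes the topic-per-document distribution, so near-zero proportions effectively switch topics off, which is how the model infers the number of active topics at each timestep.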
23rd European Semantic Web Conference (ESWC), May 2026, Dubrovnik (Croatia)
- Link Prediction or Perdition: the Seeds of Instability in Knowledge Graph Embeddings by G. Méroué (3IA Ph.D. student), F. Gandon (3IA Chairholder), and P. Monnin
Abstract: Knowledge graph embedding models (KGEMs) constitute the main link prediction approach to complete knowledge graphs. Standard evaluation protocols emphasize rank-based metrics such as MRR or Hits@K, but usually overlook the influence of random seeds on result stability. Moreover, these metrics conceal potential instabilities in individual predictions and in the organization of embedding spaces. In this work, we conduct a systematic stability analysis of multiple KGEMs across several datasets. We find that high-performance models actually produce divergent predictions at the triple level and highly variable embedding spaces. By isolating stochastic factors (i.e., initialization, triple ordering, negative sampling, dropout, hardware), we show that each independently induces instability of comparable magnitude. Furthermore, for a given model, hyperparameter configurations with better MRR are not guaranteed to be more stable. Finally, although voting is a known remediation mechanism, it provides only a limited improvement in stability. These findings highlight critical limitations of current benchmarking protocols, and raise concerns about the reliability of KGEMs for knowledge graph completion.
Read the paper
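The rank-based metrics in question, and the triple-level instability they can hide, are easy to illustrate. The per-triple ranks below are invented, and the agreement measure is a simplified stand-in for the paper's stability analysis:

```python
import numpy as np

def mrr(ranks):
    """Mean reciprocal rank of the true entities."""
    return float(np.mean(1.0 / np.asarray(ranks)))

def hits_at_k(ranks, k=10):
    """Fraction of triples whose true entity is ranked in the top k."""
    return float(np.mean(np.asarray(ranks) <= k))

# Invented per-triple ranks from two runs differing only in the random seed.
ranks_seed_a = [1, 3, 2, 50, 7, 1, 120, 4]
ranks_seed_b = [2, 1, 3, 8, 45, 1, 9, 200]

print(mrr(ranks_seed_a), mrr(ranks_seed_b))              # similar aggregates
print(hits_at_k(ranks_seed_a), hits_at_k(ranks_seed_b))  # identical Hits@10

# Yet the runs disagree on WHICH triples land in the top 10: aggregate
# metrics can match while individual predictions diverge.
agree = float(np.mean((np.asarray(ranks_seed_a) <= 10)
                      == (np.asarray(ranks_seed_b) <= 10)))
print(agree)
```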
International Symposium on Biomedical Imaging (ISBI), April 2026, London (England)
- Resource-Efficient Automatic Refinement of Segmentations via Weak Supervision from Light Feedback by A. de Langlais, B. Billot, T. Aguilar Vidal, M-O. Gauci (3IA Fellow), H. Delingette (3IA Chairholder)
Abstract: Delineating anatomical regions is a key task in medical image analysis. Manual segmentation achieves high accuracy but is labor-intensive and prone to variability, thus prompting the development of automated approaches. Recently, a breadth of foundation models has enabled automated segmentations across diverse anatomies and imaging modalities, but these may not always meet the clinical accuracy standards. While segmentation refinement strategies can improve performance, current methods depend on heavy user interactions or require fully supervised segmentations for training. Here, we present SCORE (Segmentation COrrection from Regional Evaluations), a weakly supervised framework that learns to refine mask predictions only using light feedback during training. Specifically, instead of relying on dense training image annotations, SCORE introduces a novel loss that leverages region-wise quality scores and over/under-segmentation error labels. We demonstrate SCORE on humerus CT scans, where it considerably improves initial predictions from TotalSegmentator, and achieves performance on par with existing refinement methods, while greatly reducing their supervision requirements and annotation time.
Code
Read the paper
19th Conference of the European Chapter of the Association for Computational Linguistics (EACL), March 2026, Rabat (Morocco)
- Stakeholder Suite: A Unified AI Framework for Mapping Actors, Topics and Arguments in Public Debates (Demo track) by M. Chenene, J. Rouhier, J. Daniélou, M. Sarkar, and E. Cabrio (3IA Chairholder)
Abstract: Public debates surrounding infrastructure and energy projects involve complex networks of stakeholders, arguments, and evolving narratives. Understanding these dynamics is crucial for anticipating controversies and informing engagement strategies, yet existing tools in media intelligence largely rely on descriptive analytics with limited transparency. This paper presents Stakeholder Suite, a framework deployed in operational contexts for mapping actors, topics, and arguments within public debates. The system combines actor detection, topic modeling, argument extraction and stance classification in a unified pipeline. Tested on multiple energy infrastructure projects as a case study, the approach delivers fine-grained, source-grounded insights while remaining adaptable to diverse domains. The framework achieves strong retrieval precision and stance accuracy, producing arguments judged relevant in 75% of pilot use cases. Beyond quantitative metrics, the tool has proven effective for operational use: helping project teams visualize networks of influence, identify emerging controversies, and support evidence-based decision-making.
Read the paper
- CacheNotes: Task-Aware Key-Value Cache Compression for Reasoning-Intensive Knowledge Tasks by G. Corallo, O. Weller, F. Petroni, P. Papotti (3IA Chairholder)
Abstract: Integrating external knowledge into Large Language Models (LLMs) is crucial for many real-world applications, yet current methods like Retrieval-Augmented Generation (RAG) face limitations with broad, multi-source queries, while long-context models are computationally prohibitive.
We introduce CACHENOTES: Task-Aware Key-Value Cache Compression. Given a task description and a corpus, CACHENOTES first generates a sequence of Compression-Planning-Tokens (CPTs), an offline task-focused distillation pass that identifies and organizes key information from the corpus. These CPTs are then used to guide a one-time compression of the corpus into a compact, reusable KV cache, which is then used alone at inference time to efficiently answer diverse, reasoning-intensive queries — eliminating repeated retrieval or context expansion.
Experiments on LongBench show that, on Question-Answering tasks at a 20× compression, CACHENOTES outperforms RAG by over 8 F1 points and reduces latency by over 4×. On RULER, it surpasses previous query-agnostic compression methods by 55 points, narrowing the gap to query-aware compression approaches. Additional results on real-world enterprise and synthetic datasets demonstrate its strong performance on multi-hop and broad-coverage queries.
Read the paper
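As a generic illustration of KV-cache compression (not CacheNotes' CPT-guided procedure), the sketch below keeps the top-scoring 1/20 of cached token positions under an invented importance score:

```python
import numpy as np

def compress_kv(keys, values, importance, ratio=20):
    """Generic KV-cache compression: keep the top 1/ratio fraction of token
    positions by an importance score. The score here is a stand-in;
    CacheNotes instead plans the compression with task-focused CPTs."""
    n_keep = max(1, keys.shape[0] // ratio)
    keep = np.argsort(importance)[-n_keep:]
    keep.sort()                         # preserve positional order
    return keys[keep], values[keep]

rng = np.random.default_rng(0)
T, d = 200, 16                          # 200 cached tokens, head dim 16
K, V = rng.standard_normal((T, d)), rng.standard_normal((T, d))
scores = rng.random(T)                  # e.g. accumulated attention mass

K_small, V_small = compress_kv(K, V, scores, ratio=20)
print(K_small.shape)                    # (10, 16): 20x fewer cached tokens
```

The payoff described in the abstract comes from doing this once, offline, and reusing the compact cache for every query.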
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), March 2026, Tucson (USA)
- Denoise, Divide, Distill, and Predict (D3P): Towards Forecasting Long-horizon Real-world Anomaly from Normalcy by Q. Mérilleau, S. Majhi, A. Dantcheva (3IA Chairholder), Q. Kong, L. Garattoni, G. Francesca, F. Brémond (3IA Chairholder)
Abstract: Forecasting abnormal human behavior (AHB) in unconstrained real-world environments is critical for enabling proactive safety interventions. Unlike short-term anomaly detection, long-horizon forecasting offers a vital reaction window but remains underexplored due to three core challenges: (i) noisy, complex human–agent interactions; (ii) weak temporal coupling between normal observations and distant anomalies; and (iii) data scarcity limiting the scalability of autoregressive models.
To address these, we propose D3P (Denoise, Divide, Distill, and Predict), a novel encoder–decoder framework that bridges denoised pasts with distilled autoregressive futures. Our Differential Past Encoder (DiPE) disentangles scene-level and object-level dynamics via differential attention, suppressing irrelevant interactions and enhancing discriminative cues. The Distilled Future Auto-Regressive Decoder (D-FAD) adopts a divide-and-conquer strategy, segmenting future queries into temporal chunks for sequential prediction, while leveraging distillation to balance robustness and latency.
We validate our approach on the AHB-F benchmark, the only dataset dedicated to abnormal behavior forecasting, and further integrate D-FAD with several state-of-the-art methods. In all cases, our framework consistently outperforms prior work in both forecasting accuracy and computational efficiency.
Read the paper
- MuSACo: Multimodal Subject-Specific Selection and Adaptation for Expression Recognition with Co-Training by M. O. Zeeshan, N. Gillet, A. Lameiras Koerich, M. Pedersoli, F. Brémond (3IA Chairholder), E. Granger
Abstract: Personalized expression recognition (ER) involves adapting a machine learning model to subject-specific data for improved recognition of expressions with considerable interpersonal variability. Subject-specific ER can benefit significantly from multi-source domain adaptation (MSDA) methods — where each domain corresponds to a specific subject — to improve model accuracy and robustness.
Despite promising results, state-of-the-art MSDA approaches often overlook multimodal information or blend sources into a single domain, limiting subject diversity and failing to explicitly capture unique subject-specific characteristics.
To address these limitations, we introduce MuSACo, a multimodal subject-specific selection and adaptation method for ER based on co-training. It leverages complementary information across multiple modalities and multiple source domains for subject-specific adaptation. This makes MuSACo particularly relevant for affective computing applications in digital health, such as patient-specific assessment for stress or pain, where subject-level nuances are crucial.
MuSACo selects source subjects relevant to the target and generates pseudo-labels using the dominant modality for class-aware learning, in conjunction with a class-agnostic loss to learn from less confident target samples. Finally, source features from each modality are aligned, while only confident target features are combined.
Experimental results on challenging multimodal ER datasets — BioVid, StressID, and BAH — show that MuSACo outperforms UDA (blending) and state-of-the-art MSDA methods.
Code
Read the paper
Transactions on Machine Learning Research, February 2026
- When Are Two Scores Better Than One? Investigating Ensembles of Diffusion Models by Raphaël Razafindralambo (3IA Ph.D.), Rémy Sun, Damien Garreau, Frederic Precioso, Pierre-Alexandre Mattei (3IA Deputy Scientific Director and Chairholder)
Abstract: Diffusion models now generate high-quality, diverse samples, with an increasing focus on more powerful models. Although ensembling is a well-known way to improve supervised models, its application to unconditional score-based diffusion models remains largely unexplored. In this work we investigate whether it provides tangible benefits for generative modelling. We find that while ensembling the scores generally improves the score-matching loss and model likelihood, it fails to consistently enhance perceptual quality metrics such as FID on image datasets. We confirm this observation across a breadth of aggregation rules, using Deep Ensembles and Monte Carlo Dropout, on CIFAR-10 and FFHQ. We investigate possible explanations for this discrepancy, such as the link between score estimation and image quality. We also look into tabular data through random forests, and find that one aggregation strategy outperforms the others. Finally, we provide theoretical insights into the summing of score models, which shed light not only on ensembling but also on several model composition techniques (e.g. guidance).
Python code
Read the paper / See the video
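The score summing analyzed in the paper can be illustrated with two analytic Gaussian score functions standing in for trained score networks; averaging their estimates is one of the aggregation rules studied (the specific biases below are invented for the illustration):

```python
import numpy as np

# Two "models" giving slightly different estimates of a Gaussian data score
# s(x) = -(x - mu) / sigma**2 (analytic stand-ins for two trained score nets).
def score_a(x):
    return -(x - 0.1) / 1.0

def score_b(x):
    return -(x + 0.1) / 1.0

def ensemble_score(x):
    # Averaging the two score estimates: the simplest aggregation rule.
    return 0.5 * (score_a(x) + score_b(x))

x = np.linspace(-3, 3, 7)
print(ensemble_score(x))    # equals -x: the two opposite biases cancel
```

In this toy case the ensemble recovers the unbiased score exactly, which mirrors the paper's finding that score averaging helps score-matching quantities, even though better scores do not automatically translate into better FID.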
IEEE/CVF International Conference on Computer Vision (ICCV), October 2025, Honolulu, Hawaii (USA)
- Mixture of Experts Guided by Gaussian Splatters Matters: A new Approach to Weakly-Supervised Video Anomaly Detection by G. D’Amicantonio, S. Majhi, Q. Kong, L. Garattoni, G. Francesca, F. Brémond (3IA Chairholder), E. Bondarev
Abstract: Video Anomaly Detection (VAD) is a challenging task due to the variability of anomalous events and the limited availability of labeled data. Under the Weakly-Supervised VAD (WSVAD) paradigm, only video-level labels are provided during training, while predictions are made at the frame level. Although state-of-the-art models perform well on simple anomalies (e.g., explosions), they struggle with complex real-world events (e.g., shoplifting). This difficulty stems from two key issues: (1) the inability of current models to address the diversity of anomaly types, as they process all categories with a shared model, overlooking category-specific features; and (2) the weak supervision signal, which lacks precise temporal information, limiting the ability to capture nuanced anomalous patterns blended with normal events. To address these challenges, we propose Gaussian Splatting-guided Mixture of Experts (GS-MoE), a novel framework that employs a set of expert models, each specialized in capturing specific anomaly types. These experts are guided by a temporal Gaussian splatting loss, enabling the model to leverage temporal consistency and enhance weak supervision. The Gaussian splatting approach encourages a more precise and comprehensive representation of anomalies by focusing on temporal segments most likely to contain abnormal events. The predictions from these specialized experts are integrated through a mixture-of-experts mechanism to model complex relationships across diverse anomaly patterns. Our approach achieves state-of-the-art performance, with a 91.58% AUC on the UCF-Crime dataset, and demonstrates superior results on XD-Violence and MSAD datasets. By leveraging category-specific expertise and temporal guidance, GS-MoE sets a new benchmark for VAD under weak supervision.
Read the paper / With supplementary
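The temporal Gaussian guidance and expert mixing can be caricatured in a few lines. The Gaussian centers, widths, and gating rule below are invented stand-ins, not the GS-MoE loss or architecture:

```python
import numpy as np

def gaussian_weights(T, mu, sigma):
    """Temporal Gaussian 'splat': a soft window over T frames centered at mu."""
    t = np.arange(T)
    w = np.exp(-0.5 * ((t - mu) / sigma) ** 2)
    return w / w.sum()

# Hypothetical per-frame anomaly scores from 3 experts on a 16-frame clip.
rng = np.random.default_rng(0)
T, E = 16, 3
expert_scores = rng.random((E, T))

# Each expert focuses on the temporal segment most likely to contain its
# anomaly type, modeled here by one Gaussian per expert (toy parameters).
splats = np.stack([gaussian_weights(T, mu, 2.0) for mu in (3, 8, 12)])

# Simple mixture-of-experts combination: gate each expert by how much
# anomaly mass its Gaussian window captures, then mix the frame scores.
gates = (expert_scores * splats).sum(axis=1)
gates = gates / gates.sum()
frame_pred = gates @ expert_scores        # final per-frame anomaly scores
print(frame_pred.shape)                   # (16,)
```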
- Scaling Action Detection: AdaTAD++ with Transformer-Enhanced Temporal-Spatial Adaptation (with supplementary) by T. Agrawal, A. Ali, A. Dantcheva (3IA Chairholder), F. Brémond (3IA Chairholder)
Abstract: Temporal Action Detection (TAD) is essential for analyzing long-form videos by identifying and segmenting actions within untrimmed sequences. While recent innovations like Temporal Informative Adapters (TIA) have improved resolution, memory constraints still limit large video processing. To address this, we introduce AdaTAD++, an enhanced framework that decouples temporal and spatial processing within adapters, organizing them into independently trainable modules. Our novel two-step training strategy first optimizes for high temporal and low spatial resolution, then vice versa, allowing the model to utilize both high spatial and temporal resolutions during inference, while maintaining training efficiency. Additionally, we incorporate a more sophisticated temporal module capable of capturing long-range dependencies more effectively than previous methods. Experiments on benchmark datasets, including ActivityNet-1.3, THUMOS14, and EPIC-Kitchens 100, demonstrate that AdaTAD++ achieves state-of-the-art performance. We also explore various adapter configurations, discussing their trade-offs regarding resource constraints and performance, providing valuable insights into their optimal application.
Read the paper
MultiMediate Challenge: Multi-modal Group Behaviour Analysis for Artificial Mediation, part of the 33rd ACM International Conference on Multimedia, October 2025, Dublin (Ireland)
- MultiMediate'25: Cross-cultural Multi-domain Engagement Estimation by D. S. Withanage Don, M. Funk, M. Balazia, H. Qiu, S. Okada, F. Brémond (3IA Chairholder), J. Alexandersson, A. Bulling, E. André, P. Müller
Abstract: Estimating the momentary level of participant engagement is an important prerequisite for assistive systems that support human interactions. Previous work has addressed this task in within-domain evaluation scenarios, i.e. training and testing on the same dataset. This is in contrast to real-life scenarios where domain shifts between training and testing data frequently occur. With MultiMediate '25, we present the first challenge addressing multi-domain engagement estimation. As training data, we utilise the NOXI database of dyadic novice-expert interactions. In addition to within-domain test data, we add two new test domains. First, we introduce recordings following the NOXI protocol but covering languages that are not present in the NOXI training data. Second, we collected novel engagement annotations on the MPIIGroupInteraction dataset, which consists of group discussions between three to four people. In this way, MultiMediate '25 evaluates the ability of approaches to generalise across factors such as language and cultural background, group size, task, and screen-mediated vs. face-to-face interaction. This paper describes the MultiMediate '25 challenge, presents baseline results, and discusses selected challenge solutions.
Read the paper