Publications of the month - May 2026

Research

Published on May 20, 2026 – Updated on May 22, 2026

AI Cluster 3IA Côte d'Azur's publications of May 2026

We are pleased to share the latest publications by 3IA Côte d'Azur researchers.

35th International Joint Conference on Artificial Intelligence (IJCAI-ECAI 2026), August 2026, Bremen

PEACE 2.0: Grounded Explanations and Counter-Speech for Combating Hate Expressions
Greta Damo (3IA Ph.D. student), Stéphane Petiot (3IA Techpool Engineer), Elena Cabrio (3IA Chairholder), Serena Villata (3IA Scientific Director and Chairholder)

Abstract: The increasing volume of hate speech on online platforms poses significant societal challenges. While the Natural Language Processing community has developed effective methods to automatically detect the presence of hate speech, responses to it, called counter-speech, are still an open challenge. We present PEACE 2.0, a novel tool that, besides analysing and explaining why a message is considered hateful or not, also generates a response to it. More specifically, PEACE 2.0 has three main new functionalities, all leveraging a Retrieval-Augmented Generation (RAG) pipeline: i) grounding hate speech explanations in evidence and facts, ii) automatically generating evidence-grounded counter-speech, and iii) exploring the characteristics of counter-speech replies. By integrating these capabilities, PEACE 2.0 enables in-depth analysis and response generation for both explicit and implicit hateful messages.

Read the paper

28th IAPR International Conference on Pattern Recognition (ICPR 2026), August 2026, Lyon

Dual the Reasoning Double the Insight with TambI: A Self-Supervised Framework for Skeleton Action Representation
Mahmoud Ali, Snehashis Majhi, Di Yang, Quan Kong, Gianpiero Francesca, François Brémond (3IA Chairholder)

Abstract: Self-supervised learning has shown great promise for skeleton-based action recognition, especially with contrastive methods. However, existing approaches rely on single-stream motion encoders. This may fail to fully capture both spatial and temporal details, which are critical for real-world generalization. To address this, we propose TambI, a novel self-supervised framework for skeleton action representation learning. We introduce a novel Dual reasoning module to learn complementary skeleton motion representations. Subsequently, we further design a dual objective learning for an indirect contrastive strategy: (1) an Instance Consistency Loss that aligns representations across models and preserves motion details, and (2) a refined contrastive loss using multiple positive samples to enhance feature discrimination. Extensive experiments on six benchmark datasets, including laboratory (NTU-RGB+D, PKU-MMD) and real-world (Toyota SmartHome, Penn Action, Posetics) settings, demonstrate state-of-the-art performance and superior generalizability.

Read the paper

Forty-Third International Conference on Machine Learning (ICML 2026), July 2026, Seoul

Variance-Reduced (ε,δ)−Unlearning using Forget Set Gradients
Martin Van Waerebeke, Giovanni Neglia (3IA Chairholder), Kevin Scaman, Marco Lorenzi (3IA Chairholder), El-Mahdi El-Mhamdi

Abstract: In machine unlearning, (ε,δ)−unlearning is a popular framework that provides formal guarantees on the effectiveness of the removal of a subset of training data, the forget set, from a trained model. For strongly convex objectives, existing first-order methods achieve (ε,δ)−unlearning, but they only use the forget set to calibrate injected noise, never as a direct optimization signal. In contrast, efficient empirical heuristics often exploit the forget samples (e.g., via gradient ascent) but come with no formal unlearning guarantees. We bridge this gap by presenting the Variance-Reduced Unlearning (VRU) algorithm. To the best of our knowledge, VRU is the first first-order algorithm that directly includes forget set gradients in its update rule, while provably satisfying (ε,δ)−unlearning. We establish the convergence of VRU and show that incorporating the forget set yields strictly improved rates, i.e., a better dependence on the achieved error compared to existing first-order (ε,δ)−unlearning methods. Moreover, we prove that, in a low-error regime, VRU asymptotically outperforms any first-order method that ignores the forget set. Experiments corroborate our theory, showing consistent gains over both state-of-the-art certified unlearning methods and over empirical baselines that explicitly leverage the forget set.
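
For readers unfamiliar with the framework, (ε,δ)−unlearning is often formalized by analogy with differential privacy: the output of the unlearning procedure should be statistically close to a model retrained from scratch without the forget set. The statement below is one common form from the literature. The notation (A, U, D, D_f, S) is chosen here for illustration, and the paper's exact definition may differ.

```latex
% One common formalization (an assumption; not necessarily the paper's definition).
% A: randomized learning algorithm, U: unlearning mechanism,
% D: training set, D_f \subseteq D: forget set, S: any measurable set of models.
\begin{align*}
\Pr\big[\,U(A(D), D, D_f) \in S\,\big] &\le e^{\varepsilon}\,\Pr\big[\,A(D \setminus D_f) \in S\,\big] + \delta,\\
\Pr\big[\,A(D \setminus D_f) \in S\,\big] &\le e^{\varepsilon}\,\Pr\big[\,U(A(D), D, D_f) \in S\,\big] + \delta.
\end{align*}
```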

Read the paper

IEEE International Conference on Multimedia and Expo (ICME 2026), July 2026, Bangkok

T-MOR: Learning Motion-Aware Skeleton Representations for Human Action Recognition
Di Yang, Mahmoud Ali, Quan Kong, Gianpiero Francesca, François Brémond (3IA Chairholder)

Abstract: Vision-language models such as CLIP have recently achieved strong performance on a wide range of visual understanding tasks. However, most existing models rely primarily on appearance-level supervision from images or videos, and do not explicitly model human motion, which is essential for fine-grained and human-centric action recognition tasks, as actions are defined by temporally structured and physically grounded body movements. To address this problem, we propose Transferable skeleton MOtion Representation (T-MOR), a motion-aware framework that learns transferable action representations from skeleton sequences with the aid of video and language supervision during training. T-MOR adopts a multi-modal contrastive learning scheme that aligns skeleton motion with visual and textual representations, while performing inference using only lightweight skeleton inputs. To support large-scale pre-training, we construct PoseCap-1M, a new dataset that contains over one million synchronized video, skeleton, and text triplets covering diverse human activities. We evaluate T-MOR on a range of human-centric action recognition benchmarks, including action classification and frame-wise temporal detection. Experimental results show that T-MOR consistently improves performance across multiple datasets, such as Toyota Smarthome, Penn Action, UAV-Human, TSU, and Charades. In addition, T-MOR demonstrates strong generalization ability in few-shot and zero-shot settings, highlighting the effectiveness of motion-centric and embodied representations for transferable action understanding.
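
To make the idea of aligning skeleton motion with textual representations more concrete, here is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss between two sets of paired embeddings. It is a generic illustration rather than the T-MOR training objective: the function names, the temperature value, and the toy embeddings are all assumptions.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(anchors, targets, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to targets[i] among all targets."""
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarities to every target, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(anchor, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_z - logits[i]
    return total / len(a)


def symmetric_contrastive_loss(skeleton_emb, text_emb, temperature=0.07):
    """Average of the skeleton-to-text and text-to-skeleton InfoNCE terms."""
    return 0.5 * (
        info_nce(skeleton_emb, text_emb, temperature)
        + info_nce(text_emb, skeleton_emb, temperature)
    )


# Toy example: two skeleton embeddings and their (hypothetical) caption embeddings.
skeleton = [[1.0, 0.0], [0.0, 1.0]]
captions_aligned = [[1.0, 0.0], [0.0, 1.0]]
captions_swapped = [[0.0, 1.0], [1.0, 0.0]]
print(symmetric_contrastive_loss(skeleton, captions_aligned))
print(symmetric_contrastive_loss(skeleton, captions_swapped))
```

The loss is close to zero when each skeleton embedding matches its own caption and large when the pairs are swapped, which is the behavior a contrastive alignment objective is meant to reward.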

Read the paper

The 32nd International Conference on Principles and Practice of Constraint Programming (CP 2026), July 2026, Lisbon

The Distance Constraint on Sequence Variables
Margaux Schmied (3IA Ph.D. student), Augustin Delecluse, Jean-Charles Régin (3IA Chairholder), Pierre Schaus

Abstract: Sequence variables provide a compact framework for modeling routing and scheduling problems in constraint programming, by constructing solutions through successive insertions of nodes into a partial sequence.
We study the propagation of a distance constraint on these variables and propose new admissible lower bounds on the cost of any feasible extension. These bounds, derived from relaxations of the insertion process, make it possible to detect and eliminate unfeasible insertions.
Experiments on the traveling salesman problem with time windows and prize collecting show that these filtering rules significantly reduce the search space compared to existing methods.
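
As a rough illustration of what filtering insertions under a distance constraint means, the sketch below applies the simplest possible check: an insertion is kept only if the current sequence length plus the detour it causes stays within the distance limit. This is not the paper's propagator or its lower bounds. The function names are hypothetical, and the check is only sound as a filter if distances satisfy the triangle inequality, so that later insertions cannot shorten the sequence.

```python
def sequence_length(seq, dist):
    """Total distance travelled along the partial sequence."""
    return sum(dist[a][b] for a, b in zip(seq, seq[1:]))


def detour(dist, i, k, j):
    """Extra distance incurred by visiting k between consecutive nodes i and j."""
    return dist[i][k] + dist[k][j] - dist[i][j]


def feasible_insertions(seq, node, dist, max_distance):
    """Positions in seq where node can be inserted without exceeding max_distance.

    Position p means the node is placed between seq[p - 1] and seq[p].
    """
    base = sequence_length(seq, dist)
    return [
        pos
        for pos, (i, j) in enumerate(zip(seq, seq[1:]), start=1)
        if base + detour(dist, i, node, j) <= max_distance
    ]


# Toy instance: four nodes on a line at coordinates 0, 10, 2 and 20.
coords = [0, 10, 2, 20]
dist = [[abs(a - b) for b in coords] for a in coords]
print(feasible_insertions([0, 1], 2, dist, max_distance=10))
print(feasible_insertions([0, 1], 3, dist, max_distance=15))
```

In the toy instance, node 2 lies on the way from node 0 to node 1 and can be inserted at no extra cost, whereas node 3 would add a detour that exceeds the limit and is filtered out.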

Read the paper

The 17th International Conference on Information Processing in Computer-Assisted Interventions (IPCAI 2026), July 2026, Nagoya

TBDM: Temporal Boundary Distillation Module for Surgical Gesture Segmentation
Ezem Sura Ekmekci (3IA Ph.D. student), Sébatien Frey, Snehashis Majhi, Khodor Hamadi, Hervé Delingette (3IA Chairholder), Wen Wei, Matthieu Durand, Pierre Berthet-Rayne, François Brémond (3IA Chairholder), Nicholas Ayache (3IA Chairholder)

This paper will be published in the International Journal of Computer Assisted Radiology and Surgery.

Abstract: Purpose: Achieving fine-grained understanding of surgical gestures remains a fundamental challenge in computer vision, due to the subtle and temporally overlapping nature of surgical motions. Gesture boundaries, where transitions between surgical actions occur, present challenges for precise temporal localization. We propose a temporal boundary analysis framework that improves overall surgical gesture segmentation by explicitly modeling transitions between actions. While most existing methods rely on both RGB and kinematic data, our approach operates on RGB-only video, without requiring additional annotations or computational overhead at inference.
Methods: We introduce a Temporal Boundary Distillation Module (TBDM) that leverages privileged information during training to learn boundary-aware features. TBDM employs cross-attention between class-present and class-absent temporal regions derived from ground-truth annotations, explicitly encoding transition information. A lightweight projection layer learns boundary-aware features through knowledge distillation from TBDM, supervised by classification and distillation loss (MSE). At inference, only the trained projection layer is required, resulting in no additional computational cost.
Results: We evaluated TBDM on CholecT50 and RARP-45 surgical datasets. TBDM consistently improved baseline models across all metrics, achieving up to +8.5 edit score improvement on CholecT50. On RARP-45, our approach achieved state-of-the-art edit score (81.4) and F1@50 (77.9), demonstrating effectiveness across different architectures and datasets.

Read the paper

IEEE International Conference on Robotics and Automation (ICRA 2026), June 2026, Vienna

DenVisCoM: Dense Vision Correspondence Mamba for Efficient and Real-time Optical Flow and Stereo Estimation
Tushar Anand, Maheswar Bora, Antitza Dantcheva (3IA Chairholder), Abhijit Das

Abstract: In this work, we propose a novel Mamba block, DenVisCoM, as well as a novel hybrid architecture specifically tailored for accurate, real-time estimation of optical flow and disparity. Given that such multi-view geometry and motion tasks are fundamentally related, we propose a unified architecture to tackle them jointly. Specifically, the proposed hybrid architecture is based on DenVisCoM and a Transformer-based attention block that efficiently addresses real-time inference, memory footprint, and accuracy at the same time for joint estimation of motion and 3D dense perception tasks. We extensively analyze the benchmark trade-off of accuracy and real-time processing on a large number of datasets. Our experimental results and related analysis suggest that our proposed model can accurately estimate optical flow and disparity in real time. All models and associated code are available here.

Read the paper

Learning-Based Fusion for Robust Multi-Spectral Visual Servoing
Enrico Fiasché, Siddharth Singh Savner, Ezio Malis (3IA Chairholder), Philippe Martinet

Abstract: Multispectral sensors, which measure multiple wavelength bands beyond the standard red, green, and blue channels, capture richer information than conventional RGB cameras. Such enriched data is especially valuable in visual servoing, where robot control critically depends on image content. However, leveraging multiple spectral bands (typically around a dozen) directly within real-time visual servoing constitutes a significant challenge. The only prior work tackled this problem using a Pixel Selection strategy based on image gradients. This paper introduces a learning-based framework to enhance Multi-Spectral Visual Servoing (MSVS) by fusing data from multispectral cameras into a single, robust representation for control. An autoencoder is employed to compress multispectral inputs into a noise-attenuated 2D image, which is then used within a standard rule-based Direct Visual Servoing (DVS) scheme. Comparison experiments both with simulated data and with a real robot in complex and unstructured environments show that the proposed learning-based fusion maintains stable convergence and improves positioning accuracy under noisy conditions while preserving computational efficiency.

Read the paper

ICRA Workshop on "Geometry in the Age of Data Driven Robotics"

Introducing Sylvester Forms to Robotics: Efficient Closed-Form Pose Estimation
Jana Vráblíková, Ezio Malis (3IA Chairholder), Laurent Busé

Abstract: Pose estimation from 3D-to-3D correspondences is fundamental in robotics and computer vision, with strong relevance to real-time perception and localization. It is commonly formulated as a nonlinear optimization problem that can be reduced to a polynomial system and solved in closed form. In this paper, we introduce a new class of resultant-based polynomial solvers that exploits Sylvester forms to reduce elimination complexity. By integrating Sylvester forms into a hidden-variable formulation, we derive closed-form solvers operating in lower degrees, producing smaller elimination matrices and lower computational cost. Experiments on the KITTI dataset show that the proposed solvers are accurate and faster than state-of-the-art closed-form methods. Beyond the proposed solver, our results highlight a broader point that is particularly relevant for geometric robotics: geometric methods and data-driven methods need not be opposed. While the solver itself is derived from exact algebraic structure, its numerical performance depends on implementation choices such as the order of monomials that induces the block decomposition of the elimination matrix. Since we currently do not have a principled method for selecting the ordering that gives the best numerical conditioning, this work suggests a hybrid direction in which offline learning optimizes such choices while preserving the solver's exact geometric structure.

Read the paper

Lie Group Error Coordinates for Symmetry-Aware Reinforcement Learning applied to Quadrotor Low-Level Control
Andrea Pagnini, Ezio Malis (3IA Chairholder)

Abstract: As data-driven methods become prevalent in robotics, a key question remains whether classical geometric structures are still relevant or whether they can be learned from data. We argue that geometry is not an alternative to learning, but a design tool that shapes what must be learned. In this paper, we show that encoding the right symmetry in the observation of a RL agent reduces the effective complexity of the control problem at the representation level, prior to any architectural choice. We demonstrate this principle on quadrotor low-level control, expressing tracking errors as Lie group quantities in the desired body frame. We show that this coordinate choice improves sample efficiency and enables zero-shot generalization to unseen trajectories, suggesting that the right choice of error coordinates can effectively improve learning without relying on architectural changes.
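
The idea of expressing tracking errors in the desired body frame can be illustrated with a small sketch. It assumes one common convention, a position error R_d^T (p - p_d) and a rotation error R_d^T R, which may differ from the coordinates used in the paper. The helper names are hypothetical. The point of the example is the symmetry: applying the same rigid motion to both the actual and the desired pose leaves the error unchanged.

```python
import math


def transpose(m):
    """Transpose of a 3x3 matrix stored as nested lists."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def matvec(m, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def rot_z(angle):
    """Rotation by `angle` radians about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def tracking_error(R, p, R_d, p_d):
    """Pose error expressed in the desired body frame (assumed convention)."""
    R_d_T = transpose(R_d)
    e_p = matvec(R_d_T, [a - b for a, b in zip(p, p_d)])  # position error
    R_e = matmul(R_d_T, R)                                # rotation error
    return e_p, R_e


# Actual and desired poses, then the same poses moved by a common rigid motion (Q, t).
R, p = rot_z(0.3), [1.0, 2.0, 3.0]
R_d, p_d = rot_z(0.1), [0.5, 2.0, 2.5]
Q, t = rot_z(1.2), [4.0, -1.0, 0.5]
moved_p = [a + b for a, b in zip(matvec(Q, p), t)]
moved_p_d = [a + b for a, b in zip(matvec(Q, p_d), t)]

print(tracking_error(R, p, R_d, p_d)[0])
print(tracking_error(matmul(Q, R), moved_p, matmul(Q, R_d), moved_p_d)[0])
```

Both calls print the same position error up to floating-point rounding, so an agent observing these coordinates sees identical inputs for situations that differ only by a global change of frame.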

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026), June 2026, Denver (USA) - Findings track

THEval. Evaluation Framework for Talking Head Video Generation
Nabyl Quignon, Baptiste Chopin, Yaohui Wang, Antitza Dantcheva (3IA Chairholder)

Abstract: Video generation has achieved remarkable progress, with generated videos increasingly resembling real ones. However, the rapid advance in generation has outpaced the development of adequate evaluation metrics. Currently, the assessment of talking head generation primarily relies on a limited set of metrics that evaluate general video quality and lip synchronization, and on user studies. Motivated by this, we propose a new evaluation framework comprising 8 metrics related to three dimensions: (i) quality, (ii) naturalness, and (iii) synchronization. In selecting the metrics, we place emphasis on efficiency, as well as alignment with human preferences. Based on this consideration, we focus on analyzing the fine-grained dynamics of the head, mouth, and eyebrows, as well as face quality. Our extensive experiments on 85,000 videos generated by 17 state-of-the-art models suggest that while many algorithms excel in lip synchronization, they face challenges with generating expressiveness and artifact-free details. These videos were generated based on a novel real dataset that we have curated in order to mitigate bias in training data. Our proposed benchmark framework is aimed at evaluating the improvement of generative methods. Original code, dataset and leaderboards will be publicly released and regularly updated with new methods, in order to reflect progress in the field.

Read the paper

20th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2026), May 2026, Kyoto

SKEYES: Open Your Eyes to See More: Dual Perspective Contrastive Learning for Skeleton-Based Action Understanding
Mahmoud Ali, Di Yang, Quan Kong, Gianpiero Francesca, François Brémond (3IA Chairholder)

Abstract: Self-supervised learning has emerged as a powerful approach for skeleton-based action recognition, with contrastive methods driving recent progress. However, most existing approaches use a single encoder to jointly model spatial and temporal features, which can blur their distinct semantics and hinder fine-grained motion understanding. To tackle this, we propose SKEYES, a dual-perspective self-supervised framework with two parallel encoders that separately capture spatial and temporal dynamics. This design avoids early feature fusion and preserves the unique characteristics of pose and motion. We further introduce a dual contrastive learning objective that aligns both intra-view and cross-view features, which can promote complementary learning across feature types. To ensure efficiency, only the main encoder is used during inference. Extensive experiments on six benchmark datasets covering both laboratory settings (NTU RGB+D 60/120, PKU-MMD) and real-world environments (Toyota SmartHome, Penn Action, Posetics) demonstrate that SKEYES achieves state-of-the-art performance when transferring for action recognition and action detection tasks, with strong generalization even under low-label conditions.

Read the paper

25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026), May 2026, Paphos

Fast and Robust Information Spreading in the Noisy PULL Model: Extended Abstract
Niccolò d'Archivio, Amos Korman, Emanuele Natale (3IA Chairholder), Robin Vacus

Abstract: Efficient information spreading in stochastic multi-agent systems is a core challenge when communication is noisy, bandwidth-limited, and agents lack global coordination. Yet biological systems—such as ant colonies and fish schools—routinely overcome these constraints: a small number of informed individuals can reliably guide large, uncoordinated populations using minimal, noisy signals. Motivated by these observations, we investigate how reliable information dissemination can be achieved in bio-inspired stochastic settings with limited communication and no global control.
We analyze the noisy PULL(h) model, covering a general setting that spans from rumor spreading to majority consensus: a subset of source agents hold initial preferences, and the goal is to converge to the majority preference. Agents passively observe noisy messages from h randomly sampled peers per round. Prior work shows that convergence requires Ω(n/h) rounds even under favorable conditions. We ask: how far can one push simplicity (no synchronization and minimal message size) without compromising convergence speed?
We present a quasi self-stabilizing protocol using only 2-bit messages that converges from arbitrary initial states despite severe noise and asynchrony. It achieves optimal convergence time O((n/h) log n) with high probability, and O(log n) time in the fully connected case h = n. A key subroutine is an even simpler 1-bit protocol assuming simultaneous start, based on a natural two-phase “listen-then-amplify” mechanism reminiscent of biological strategies.
Together, our results connect biologically inspired heuristics with provable guarantees for robust, efficient information dissemination in highly unreliable and uncoordinated systems.
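
The communication model described above can be sketched in a few lines. The code below simulates one round of observations only, under assumptions that go beyond the abstract: messages are single bits, peers are sampled uniformly with replacement, and each observed bit is flipped independently with probability `noise`. It does not implement the paper's protocol, and the function name is hypothetical.

```python
import random


def noisy_pull_round(opinions, h, noise, rng):
    """Return, for each agent, the h noisy bits it observes in one round.

    opinions: list of 0/1 messages currently displayed by the agents.
    h:        number of peers each agent samples (uniformly, with replacement).
    noise:    probability that an observed bit is flipped (assumed noise model).
    rng:      a random.Random instance, so that runs are reproducible.
    """
    n = len(opinions)
    observations = []
    for _ in range(n):
        sampled = [opinions[rng.randrange(n)] for _ in range(h)]
        # XOR with a Bernoulli(noise) bit flips the message with probability `noise`.
        observations.append([bit ^ int(rng.random() < noise) for bit in sampled])
    return observations


rng = random.Random(0)
opinions = [1] * 8 + [0] * 2  # hypothetical population: 8 agents display 1, 2 display 0
for agent_view in noisy_pull_round(opinions, h=3, noise=0.2, rng=rng)[:3]:
    print(agent_view)
```

Agents only ever see such noisy samples and never learn which peer they came from, which is why protocols in this setting must aggregate observations over time before acting on them.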

Read the paper

23rd European Semantic Web Conference (ESWC 2026), May 2026, Dubrovnik

Link Prediction or Perdition: the Seeds of Instability in Knowledge Graph Embeddings
Guillaume Méroué (3IA Ph.D. student), Fabien Gandon (3IA Chairholder), Pierre Monnin

Best paper award

Abstract: Knowledge graph embedding models (KGEMs) constitute the main link prediction approach for completing knowledge graphs. Standard evaluation protocols emphasize rank-based metrics such as MRR or Hits@K, but usually overlook the influence of random seeds on result stability. Moreover, these metrics conceal potential instabilities in individual predictions and in the organization of embedding spaces. In this work, we conduct a systematic stability analysis of multiple KGEMs across several datasets. We find that high-performance models actually produce divergent predictions at the triple level and highly variable embedding spaces. By isolating stochastic factors (i.e., initialization, triple ordering, negative sampling, dropout, hardware), we show that each independently induces instability of comparable magnitude. Furthermore, for a given model, hyperparameter configurations with better MRR are not guaranteed to be more stable. Moreover, voting, albeit a known remediation mechanism, only provides a limited enhancement of stability. These findings highlight critical limitations of current benchmarking protocols, and raise concerns about the reliability of KGEMs for knowledge graph completion.
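
One simple way to picture prediction-level instability is to compare the top-k candidates that the same model returns for one query when trained with different seeds. The sketch below uses the mean pairwise Jaccard overlap of top-k sets. This measure, the function names, and the toy scores are illustrative assumptions, not the stability metrics used in the paper.

```python
def top_k(scores, k):
    """Set of the k highest-scoring candidates in a {candidate: score} mapping."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def jaccard(a, b):
    """Jaccard similarity of two sets (defined as 1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def prediction_stability(runs, k):
    """Mean pairwise Jaccard overlap of top-k predictions across training runs."""
    if len(runs) < 2:
        raise ValueError("at least two runs are needed to measure stability")
    tops = [top_k(run, k) for run in runs]
    pairs = [(i, j) for i in range(len(tops)) for j in range(i + 1, len(tops))]
    return sum(jaccard(tops[i], tops[j]) for i, j in pairs) / len(pairs)


# Hypothetical scores for the query (France, capital, ?) from two seeds.
seed_a = {"Paris": 0.90, "Lyon": 0.80, "Nice": 0.10}
seed_b = {"Paris": 0.20, "Lyon": 0.85, "Nice": 0.90}
print(prediction_stability([seed_a, seed_b], k=2))
```

Two runs can reach similar aggregate rank metrics while sharing only part of their top-k answers for individual queries, which is the kind of divergence an aggregate score hides.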

Read the paper

Journal of Artificial Intelligence Research, 2026

Assessing the Minimal Dialectical Quality in Argumentation: A Neuro-Symbolic Approach Integrating Argument Mining, Quality Assessment, and Probabilistic Reasoning
Victor Hugo Nascimento Rocha, Fabio Gagliardi Cozman, Serena Villata (3IA Scientific Director and Chairholder)

Transactions on Machine Learning Research, May 2026

Think2SQL: Blueprinting Reward Density and Advantage Scaling for Effective Text-to-SQL Reasoning
Simone Papicchio, Simone Rossi, Luca Cagliero, Paolo Papotti (3IA Chairholder)

Abstract: While Large Language Models (LLMs) have advanced the state-of-the-art in Text-to-SQL, robust reasoning in complex, multi-table environments remains a bottleneck for parameter-efficient models. This paper presents a systematic empirical study on injecting reasoning capabilities into Text-to-SQL through the lens of Reinforcement Learning with Verifiable Rewards (RLVR) for the Qwen3 model family. We uncover a critical interplay between reward density, advantage scaling, and model capacity. Our analysis yields four primary insights. First, we propose a novel execution-guided dense reward function that significantly outperforms binary signals and existing state-of-the-art rewards by providing granular feedback at the instance level. Second, we analyze the mechanics of advantage calculation, demonstrating that while large models thrive on sparse signals with aggressive advantage scaling, smaller models require dense rewards and conservative scaling to improve Text-to-SQL performance. Third, we evaluate the impact of cold start, showing that distillation does not always benefit RLVR performance, and supervised fine-tuned models are prone to distributional mimicry. Fourth, we map the Pareto frontier of training efficiency, providing insights for optimizing Text-to-SQL reasoning under computational constraints. Our findings culminate in the Think2SQL family: our 4B-parameter model demonstrates reasoning capabilities competitive with state-of-the-art models such as o3. We release our models, datasets, and code to create a blueprint for RLVR optimization in Text-to-SQL.
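
To illustrate the difference between a binary and a dense execution-based reward, the sketch below scores a predicted query by the F1 overlap between its result rows and those of a reference query, using SQLite from the Python standard library. This is a hypothetical stand-in, not the reward function proposed in the paper. The table, the queries, and all function names are invented for the example.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Execute a query and return its rows, or None if it fails to execute."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None


def dense_execution_reward(conn, predicted_sql, gold_sql):
    """Row-level F1 between predicted and gold results (0.0 if the prediction fails)."""
    gold = run_query(conn, gold_sql)
    if gold is None:
        raise ValueError("the gold query must be executable")
    pred = run_query(conn, predicted_sql)
    if pred is None:
        return 0.0
    pred_rows, gold_rows = Counter(pred), Counter(gold)
    if not pred_rows and not gold_rows:
        return 1.0
    overlap = sum((pred_rows & gold_rows).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_rows.values())
    recall = overlap / sum(gold_rows.values())
    return 2 * precision * recall / (precision + recall)


def make_demo_db():
    """Small in-memory database used only for this illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE papers (title TEXT, year INTEGER)")
    conn.executemany(
        "INSERT INTO papers VALUES (?, ?)", [("A", 2025), ("B", 2026), ("C", 2026)]
    )
    return conn


conn = make_demo_db()
gold = "SELECT title FROM papers WHERE year = 2026"
print(dense_execution_reward(conn, gold, gold))
print(dense_execution_reward(conn, "SELECT title FROM papers", gold))
print(dense_execution_reward(conn, "SELEC title FROM papers", gold))
```

A binary reward would give the second query the same score as the third, whereas the graded score separates a nearly correct query from one that does not run at all.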

Read the paper