Principled Design for Trustworthy AI - Interpretability, Robustness, and Safety across Modalities
ICLR 2026 Workshop
📍 International Conference on Learning Representations (ICLR 2026)
📅 Date: Monday, April 27, 2026 · 🌍 Location: Rio de Janeiro, Brazil
Overview
Modern AI systems, particularly large language models, vision-language models, and deep vision networks, are increasingly deployed in high-stakes settings such as healthcare, autonomous driving, and legal decision-making. Yet their lack of transparency, fragility under distribution shift between training and test environments, and representation misalignment on emerging tasks and data modalities raise serious concerns about their trustworthiness.
This workshop focuses on building trustworthy AI systems by principled design: models that are interpretable, robust, and aligned across the full lifecycle, from training and evaluation to inference-time behavior and deployment. We aim to unify efforts across modalities (language, vision, audio, and time series) and across technical areas of trustworthiness, spanning interpretability, robustness, uncertainty, and safety.
Call for Papers
We invite submissions on topics including (but not limited to):
- Interpretable and Intervenable Models
  - concept bottlenecks and modular architectures; mechanistic interpretability and concept-based reasoning; interpretability for control and real-time intervention
- Inference-Time Safety and Monitoring
  - reasoning trace auditing in LLMs and VLMs; inference-time safeguards and safety mechanisms; chain-of-thought consistency and hallucination detection; real-time monitoring and failure intervention mechanisms
- Multimodal Trust Challenges
  - grounding failures and cross-modal misalignment; safety in vision-language and deep vision systems; cross-modal alignment and robust multimodal reasoning; trust and uncertainty in video, audio, and time-series models
- Robustness and Threat Models
  - adversarial attacks and defenses; robustness to distributional, conceptual, and cascading shifts; formal verification methods and safety guarantees; robustness under streaming, online, or low-resource conditions
- Trust Evaluation and Responsible Deployment
  - human-AI trust calibration; confidence estimation and uncertainty quantification; metrics for interpretability, alignment, and robustness; transparent and accountable deployment pipelines; safety alignment
- Safety and Trustworthiness in LLM Agents
  - safety and failures in planning and action execution; emergent behaviors in multi-agent interactions; intervention and control in agent loops; alignment of long-horizon goals with user intent; auditing and debugging LLM agents in real-world deployment
Reviewing is double-blind, and accepted papers are non-archival. Accepted papers will be presented as posters and/or short talks.
Submission Instructions
- Format: (1) Short paper track: max 4 pages, excluding references; (2) Long paper track: max 9 pages, excluding references. Please use the LaTeX style files (ICLR conference style) provided here; a minimal skeleton is sketched after these instructions. Submissions may include an unlimited appendix within the same PDF, as long as the main text stays within the stated page limit.
- Submission: OpenReview link
- Submission deadline: Feb 2, 2026 (AoE)
- Guideline: Submissions must be original and must not have been accepted at another archival venue by our submission deadline. Submissions that violate this policy will be desk-rejected.
Note that for OpenReview submissions, new profiles created without an institutional email go through a moderation process that can take up to two weeks; new profiles created with an institutional email are activated automatically.
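For convenience, a minimal LaTeX skeleton for a submission is sketched below. It is a sketch under assumptions: the package name `iclr2026_conference` and the `\iclrfinalcopy` camera-ready switch follow the naming pattern of past ICLR style releases, so please confirm both against the official style files linked above.

```latex
\documentclass{article}
% Assumed package names, following past ICLR style releases --
% verify against the official ICLR 2026 style files linked above.
\usepackage{iclr2026_conference}
\usepackage{times}
\usepackage{graphicx}

% \iclrfinalcopy  % uncomment for the de-anonymized camera-ready version

\title{Your Paper Title}
\author{Anonymous Authors}

\begin{document}
\maketitle

\begin{abstract}
One-paragraph abstract.
\end{abstract}

\section{Introduction}
Main text: at most 4 pages (short track) or 9 pages (long track),
excluding references.

\appendix
\section{Appendix}
The appendix may be of any length, within the same PDF.

\end{document}
```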
Workshop Schedule
| Time | Session |
|------|---------|
| 9:00 – 9:10 AM | Opening Remarks |
| 9:10 – 9:30 AM | Spotlight Talks Session 1 |
| 9:30 – 10:00 AM | Keynote 1: Mihaela van der Schaar (U Cambridge), "Stop Forecasting! Start Understanding Time Series Dynamics and Causality!" |
| 10:00 – 11:00 AM | ☕ Coffee Break & Networking + Poster Session 1 |
| 11:00 – 11:30 AM | Keynote 2: Fernanda Viegas (Harvard), "How AI Chatbots See Us: Making Interpretability User-Facing" |
| 11:30 AM – 12:00 PM | Keynote 3: Hamed Hassani (UPenn), "Robust Policy Optimization to Prevent Catastrophic Forgetting" |
| 12:00 – 1:30 PM | 🍽️ Lunch Break |
| 1:40 – 2:00 PM | Spotlight Talks Session 2 |
| 2:00 – 2:30 PM | Keynote 4: Nanyun (Violet) Peng (UCLA), TBA |
| 2:30 – 3:00 PM | ☕ Coffee Break |
| 3:00 – 3:30 PM | Keynote 5: Yan Liu (USC), "Actionable Interpretability through the Lenses of Feature Interaction" |
| 3:30 – 4:30 PM | Poster Session 2 + Networking |
| 4:30 – 4:35 PM | Closing Remarks |
Important Dates
| Event | Date |
|-------|------|
| Submission deadline | Feb 2, 2026 |
| Notification to authors | Feb 28, 2026 |
| Camera-ready deadline | Mar 6, 2026 |
| Workshop date | April 27, 2026 |
(All deadlines are AoE.)
Invited Speakers
- Yan Liu, Full Professor, USC
- Mihaela van der Schaar, Full Professor, U Cambridge
- Nanyun (Violet) Peng, Associate Professor, UCLA
- Hamed Hassani, Associate Professor, UPenn
- Martin Wattenberg, Professor, Harvard
- Fernanda Viegas, Professor, Harvard
Organizers
- Lily Weng, UC San Diego
- Nghia Hoang, Washington State U
- Tengfei Ma, Stony Brook U
- Jake Snell, Princeton
- Francesco Croce, Aalto U
- Chandan Singh, Microsoft Research
- Subarna Tripathi, Intel
- Lam Nguyen, IBM Research
Accepted Papers (Spotlight Talks)
- 1. Hide and Find: A Distributed Adversarial Attack on Federated Graph Learning
- 2. LoRA Users Beware: A Few Spurious Tokens Can Manipulate Your Finetuned Model
- 3. Geometry-Aware Crossover for Effective and Efficient Evolutionary Attacks
- 4. Expert-guided Clinical Text Augmentation via Query-Based Model Collaboration
- 5. Patching LLMs Like Software: A Lightweight Method for Improving Safety Policies in Large Language Models
- 6. DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders
- 7. GLEAN: Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification
- 8. Uncertainty Drives Social Bias Changes in Quantized Large Language Models
- 9. A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
- 10. Prototype-Based Selective Prediction for Multimodal Instruction Models
- 11. Human-Guided Harm Recovery for Computer Use Agents
- 12. Black-box Optimization of LLM Outputs by Asking for Directions
- 13. Understanding Adversarial Transfer Across Modalities: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed
- 14. Exact Certification of Neural Networks and Partition Aggregation Ensembles against Label Poisoning
- 15. When RAG Hurts: Diagnosing and Mitigating Attention Distraction in Retrieval-Augmented LVLMs
- 16. FedGraph: Defending Federated Large Language Model Fine-Tuning Against Backdoor Attacks via Graph-Based Aggregation
- 17. BarrierSteer: LLM Safety via Learning Barrier Steering
- 18. Investigating Data Interventions for Subgroup Fairness: An ICU Case Study
- 19. Fairness Failure Modes of Multimodal LLMs
- 20. Test-Time Training Undermines Existing Safety Guardrails
Accepted Papers (Posters)
- 1. Learning Minimal Contexts: How Chain-of-Thought Induces Out-of-Distribution Generalization
- 2. Federated Agent Reinforcement Learning
- 3. Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency
- 4. When One Modality Rules Them All: Backdoor Modality Collapse in Multimodal Diffusion Models
- 5. Improving Semantic Uncertainty Quantification in Question Answering via Token-Level Temperature Scaling
- 6. Post-hoc Stochastic Concept Bottleneck Models
- 7. Theory of Minimal Weight Perturbations in Deep Networks and its Applications for Low-Rank Activated Backdoor Attacks
- 8. ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts
- 9. SureFED: Robust Federated Learning via Uncertainty-Aware Inward and Outward Inspection
- 10. Robust Object Detection via Kronecker Tensor Decomposition: Theory, Algorithms, and Applications
- 11. Fault-Tolerant Preference Alignment via Multi-Agent Verification
- 12. The Semantic Imprinting Hypothesis: How Semantic Watermarks Survive Prompt-based Editing
- 13. Backdoor Attacks on Decentralised Post-Training
- 14. SALVE: Sparse Autoencoder-Latent Vector Editing for Mechanistic Control of Neural Networks
- 15. Lightweight and Interpretable Transformer via Mixed Graph Algorithm Unrolling for Traffic Forecast
- 16. Representational de-collapse: Interactions between supervised finetuning and in-context learning in language models
- 17. SafetyPairs: Isolating Safety Critical Image Features With Counterfactual Image Generation
- 18. Sycophantic Anchors: Localizing and Quantifying User Agreement in Reasoning Models
- 19. Simple LLM Baselines are Competitive for Model Diffing
- 20. Tightening Optimality Gap with Confidence through Conformal Prediction
- 21. No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes
- 22. Bootstrapping-based Regularisation for Reducing Individual Prediction Instability in Clinical Risk Prediction Models
- 23. Benchmarking AI Control Protocols for Safety in Medical Question-Answering Tasks
- 24. Visual Disentangled Diffusion Autoencoders: Scalable Counterfactual Generation for Foundation Models
- 25. Disentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment
- 26. Position: Beyond Reasoning Zombies — AI Reasoning Requires Process Validity
- 27. Same Question, Different Lies: Cross-Context Consistency (C³) for Black-Box Sandbagging Detection
- 28. AdaptNC: Adaptive Nonconformity Scores for Uncertainty-Aware Autonomous Systems in Dynamic Environments
- 29. Sparse Circuits of Vision Language Alignment
- 30. Tuning Just Enough: Lightweight Backdoor Attacks on Multi-Encoder Diffusion Models
- 31. Training with Honeypots: Reshaping How LLMs Fail
- 32. GuardReasoner-Omni: A Reasoning-based Multi-modal Guardrail for Text, Image, and Video
- 33. Expert Selections In MoE Models Reveal (Almost) As Much As Text
- 34. TrustLDM: Benchmarking Trustworthiness in Language Diffusion Model
- 35. Moral Preferences of LLMs Under Directed Contextual Influence
- 36. Always Keep Your Promises: A Model-Agnostic Attribution Algorithm for Neural Networks
- 37. Mitigating Reward Hacking with RL Training Interventions
- 38. Verbosity Tradeoffs and the Impact of Scale on the Faithfulness of LLM Self-Explanations
- 39. Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors
- 40. Google’s LLM Watermarking System is Vulnerable to Layer Inflation Attack
- 41. MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Knowledge Poisoning Attacks
- 42. Frontier Models Can Take Actions at Low Probabilities
- 43. Diff Mining: Logit Differences Reveal Finetuning Objectives
- 44. Disentangling goal and framing for detecting LLM jailbreaks
- 45. MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
- 46. Model Organisms for Generalization Resistance Under Distribution Shift
- 47. On the Effects of Adversarial Perturbations on Distribution Robustness
- 48. Learn to be Unlearned: Optimizing Language Models for Unlearning via Clustered Gradient Routing
- 49. Selective Disclosure: Controlling Information Leakage in DocVQA Explanations
- 50. GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory
- 51. SafeGuide: Adaptive Inference-Time Safety Control for Diffusion Models
- 52. Enabling Preference-driven Unlearning in Few-step Distilled Text-to-Image Diffusion Models
- 53. Causal Analysis of Representation Drift for Robust Deployment
- 54. No One Monitor Fits All: Oversight Strategies for Frontier Agents
- 55. Robust Feature Attribution via Integrated Sensitivity Gradients
- 56. Byzantine Machine Learning: MultiKrum and an Optimal Notion of Robustness
- 57. Collaborative Threshold Watermarking
- 58. Inference-Time Backdoors via Hidden Instructions in LLM Chat Templates
- 59. A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior
- 60. When Bias Pretends to Be Truth: How Spurious Correlations Undermine Hallucination Detection in LLMs
- 61. Dual-Objective Reinforcement Learning with novel Hamilton-Jacobi-Bellman formulations
- 62. Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics
- 63. RAPO: Risk-Aware Preference Optimization for Generalizable Safe Reasoning
- 64. Auditing Cascading Risks in Multi-Agent Systems via Semantic–Geometric Co-evolution
- 65. The Alignment Waltz: Jointly Training Agents to Collaborate for Safety
- 66. Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models
- 67. Scalable Bayesian Monte Carlo: fast uncertainty estimation beyond deep ensembles
- 68. Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation
- 69. Closing the Distribution Gap in Adversarial Training for LLMs
- 70. Unifying Perspectives on Learning Biases: A Data-Centric Intervention for Holistic Fairness, Robustness, and Generalization
- 71. Instruction Following by Principled Attention Boosting of Large Language Models
- 72. Watermarking Discrete Diffusion Language Models
- 73. Explainability Is Not a Feature: A Position on Trustworthy AI
- 74. Digging Deeper: Learning Multi-Level Concept Hierarchies
- 75. Enhancing Deep Neural Network Reliability with Refinement and Calibration
- 76. AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents
- 77. Endogenous Resistance to Activation Steering in Language Models
- 78. Stress-Testing Alignment Audits with Prompt-Level Strategic Deception
- 79. How does information access affect LLM monitors’ ability to detect sabotage?
- 80. Nonparametric Variational Differential Privacy via Embedding Parameter Clipping
- 81. Robust AI Evaluation through Maximal Lotteries
- 82. Leveraging RAG for Training-Free Alignment of LLMs
- 83. Hierarchical Retrieval at Scale: Bridging Transparency and Efficiency
- 84. Memorization Dynamics in Knowledge Distillation for Language Models
- 85. Stability-Aware Prompt Optimization for Clinical Data Abstraction
- 86. Explaining Grokking in Transformers through the Lens of Inductive Bias
- 87. Forgetting is Competition: Rethinking Unlearning as Representation Interference in Diffusion Models
- 88. AutoBaxBuilder: Bootstrapping Code Security Benchmarking
- 89. Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms
- 90. Deception in Dialogue: Evaluating and Mitigating Deceptive Behavior in Large Language Models
- 91. Understanding Empirical Unlearning with Combinatorial Interpretability
- 92. Knowing When to Quit: A Principled Framework for Dynamic Abstention in LLM Reasoning
- 93. Paranoid Monitors: How Long Context Breaks LLM Agent Supervision
- 94. Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features
- 95. Enhancing Trust in Large Language Models via Uncertainty-Calibrated Fine-tuning
- 96. Agentic Uncertainty Reveals Agentic Overconfidence
- 97. Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates
- 98. Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models
- 99. From Data to Behavior: Predicting Unintended Model Behaviors Before Training
- 100. Do LLMs Act Like Rational Agents? Measuring Belief Coherence in Probabilistic Decision Making
- 101. Inference-Time Safety for Code LLMs via Retrieval-Augmented Revision
- 102. Evolving Safety Landscape of Multi-modal Large Language Models: A Survey of Emerging Threats and Safeguards
- 103. Delta-Crosscoder: Robust Crosscoder in Narrow Fine-Tuning Regimes
- 104. The Realignment Problem: When Right becomes Wrong in LLMs
- 105. Beyond Static Truthfulness Benchmarks: Two Truths and One Lie for Multi-Agent Deception and Detection
- 106. VisualScratchpad: Inference-time Visual Concepts Analysis in Vision Language Models
- 107. Beyond Idealized Patients: Evaluating LLMs under Challenging Patient Behaviors in Medical Consultations
- 108. Offline Reinforcement Learning of High-Quality Behaviors Under Robust Style Alignment
- 109. Towards Statistical Verification for Trustworthy AI
- 110. Efficient Refusal Ablation in LLM through Optimal Transport
- 111. Secure Code Generation via Online Reinforcement Learning with Vulnerability Reward Model
- 112. BackFed: A Standardized and Efficient Benchmark Framework for Backdoor Attacks in Federated Learning
- 113. RouterInterp: Superposed Specialisation in MoE Routing
- 114. Few-Shot Adversarial Low-Rank Fine-Tuning of Vision-Language Models
- 115. Calibrated Predictive Lower Bounds on Time-to-Unsafe-Sampling in LLMs
- 116. Monitoring Emergent Reward Hacking during Generation via Internal Activations
- 117. Mitigating Legibility Tax with Decoupled Prover-Verifier Games
- 118. Query Circuits: Explaining How Language Models Answer User Prompts
- 119. RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models
- 120. Attention Sinks in Diffusion Language Models
- 121. Efficient Test-Time Scaling of Multi-Step Reasoning by Probing Internal States of Large Language Models
- 122. The Rogue Scalpel: Activation Steering Compromises LLM Safety
- 123. OmniPatch: A Universal Adversarial Patch for ViT-CNN Cross-Architecture Transfer in Semantic Segmentation
- 124. BUDDY: Blending Training and Deployment Data with Weighted Expert Ensembles for Post-hoc LLM Calibration
📧 Lily Weng (lweng@ucsd.edu), Nghia Hoang (trongnghia.hoang@wsu.edu)