Accepted Papers



Spotlight Papers

1.
Hide and Find: A Distributed Adversarial Attack on Federated Graph Learning
Jinshan Liu ⋅ Ken Li ⋅ Jiazhe Wei ⋅ Bin Shi ⋅ Bo Dong
2.
LoRA Users Beware: A Few Spurious Tokens Can Manipulate Your Finetuned Model
Praney Goyal ⋅ Marcel Mateos Salles ⋅ Pradyut Sekhsaria ⋅ Hai Huang ⋅ Randall Balestriero
3.
Geometry-Aware Crossover for Effective and Efficient Evolutionary Attacks
Hyo Seo Kim ⋅ Gang Luo ⋅ Can Chen ⋅ Binghui Wang ⋅ Yue Duan ⋅ Ren Wang
4.
Expert-guided Clinical Text Augmentation via Query-Based Model Collaboration
Dongkyu Cho ⋅ Miao Zhang ⋅ Gregory D. Lyng ⋅ Rumi Chunara
5.
Patching LLMs Like Software: A Lightweight Method for Improving Safety Policies in Large Language Models
Huzaifa Arif ⋅ Pin-Yu Chen ⋅ Keerthiram Murugesan ⋅ Alex Gittens ⋅ Payel Das ⋅ Ching-Yun Ko
6.
DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders
Xu Wang ⋅ Bingqing Jiang ⋅ Yu Wan ⋅ Baosong Yang ⋅ Lingpeng Kong ⋅ Difan Zou
7.
GLEAN: Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification
Yichi Zhang ⋅ Nabeel Seedat ⋅ Yinpeng Dong ⋅ Peng Cui ⋅ Jun Zhu ⋅ Mihaela van der Schaar
8.
Uncertainty Drives Social Bias Changes in Quantized Large Language Models
Stanley Bryan Hua ⋅ Sanae Lotfi ⋅ Irene Chen
9.
A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
Usman Anwar ⋅ Julianna Piskorz ⋅ David Baek ⋅ David Demitri Africa ⋅ Jim Weatherall ⋅ Max Tegmark ⋅ Christian Schroeder de Witt ⋅ Mihaela van der Schaar ⋅ David Krueger
10.
Prototype-Based Selective Prediction for Multimodal Instruction Models
Eduardo Soares ⋅ Emilio Vital Brazil ⋅ Plamen Angelov ⋅ Victor Shirasuna ⋅ Renato Cerqueira
11.
Human-Guided Harm Recovery for Computer Use Agents
Christy Li ⋅ Sky CH-Wang ⋅ Andi Peng ⋅ Andreea Bobu
12.
Black-box Optimization of LLM Outputs by Asking for Directions
Jie Zhang ⋅ Meng Ding ⋅ Yang Liu ⋅ Jue Hong ⋅ Florian Tramer
14.
Exact Certification of Neural Networks and Partition Aggregation Ensembles against Label Poisoning
Ajinkya Mohgaonkar ⋅ Lukas Gosch ⋅ Mahalakshmi Sabanayagam ⋅ Debarghya Ghoshdastidar ⋅ Stephan Günnemann
15.
When RAG Hurts: Diagnosing and Mitigating Attention Distraction in Retrieval-Augmented LVLMs
Beidi Zhao ⋅ Wenlong Deng ⋅ Xinting Liao ⋅ Yushu Li ⋅ Nazim Shaikh ⋅ Yao Nie ⋅ Xiaoxiao Li
16.
FedGraph: Defending Federated Large Language Model Fine-Tuning Against Backdoor Attacks via Graph-Based Aggregation
Xi Chen ⋅ Chunyi Zhou ⋅ Rui Zeng ⋅ Xiaogang Xu ⋅ Zhe Liu ⋅ Shouling Ji
17.
BarrierSteer: LLM Safety via Learning Barrier Steering
Thanh Q. Tran ⋅ Arun Verma ⋅ Kiwan Wong ⋅ Bryan Kian Hsiang Low ⋅ Daniela Rus ⋅ Wei Xiao
19.
Fairness Failure Modes of Multimodal LLMs
Canyu Chen ⋅ Anglin Cai ⋅ Joan Nwatu ⋅ Jianshu Zhang ⋅ Yale Li ⋅ Han Liu ⋅ Jessica Hullman ⋅ Rada Mihalcea ⋅ Kathleen McKeown ⋅ Manling Li
20.
Test-Time Training Undermines Existing Safety Guardrails
Simone Antonelli ⋅ Mohammad Sadegh Akhondzadeh ⋅ Aleksandar Bojchevski

Poster Papers

2.
Federated Agent Reinforcement Learning
Canyu Chen ⋅ Kangyu Zhu ⋅ Zhaorun Chen ⋅ Zhanhui Zhou ⋅ Shizhe Diao ⋅ Yiping Lu ⋅ Tian Li ⋅ Manling Li ⋅ Dawn Song
3.
Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency
Haoming Xu ⋅ Ningyuan Zhao ⋅ Yunzhi Yao ⋅ Weihong Xu ⋅ Hongru Wang ⋅ Xinle Deng ⋅ Shumin Deng ⋅ J Pan ⋅ Huajun Chen ⋅ Ningyu Zhang
4.
When One Modality Rules Them All: Backdoor Modality Collapse in Multimodal Diffusion Models
Qitong Wang ⋅ Haoran Dai ⋅ Haotian Zhang ⋅ Christopher Rasmussen ⋅ Binghui Wang
5.
Improving Semantic Uncertainty Quantification in Question Answering via Token-Level Temperature Scaling
Tom A. Lamb ⋅ Desi R Ivanova ⋅ Philip Torr ⋅ Tim G. J. Rudner
6.
Post-hoc Stochastic Concept Bottleneck Models
Wiktor Hoffmann ⋅ Sonia Laguna ⋅ Moritz Vandenhirtz ⋅ Emanuele Palumbo ⋅ Julia E Vogt
8.
ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts
Trapoom Ukarapol ⋅ Nut Chukamphaeng ⋅ Kunat Pipatanakul ⋅ Pakhapoom Sarapat
9.
SureFED: Robust Federated Learning via Uncertainty-Aware Inward and Outward Inspection
Nasimeh Heydaribeni ⋅ Ruisi Zhang ⋅ Tara Javidi ⋅ Cristina Nita-Rotaru ⋅ Farinaz Koushanfar
10.
Robust Object Detection via Kronecker Tensor Decomposition: Theory, Algorithms, and Applications
Salman Ahmadi-Asl ⋅ Roman Garaev ⋅ Hamidreza Behjoo ⋅ Asad Khattak ⋅ Manuel Mazzara
11.
Fault-Tolerant Preference Alignment via Multi-Agent Verification
Elias Hossain ⋅ Maryam Rahimimovassagh ⋅ Subash Neupane ⋅ Mohammad Basher ⋅ Ivan Garibay ⋅ Niloofar Yousefi
13.
Backdoor Attacks on Decentralised Post-Training
Oguzhan Ersoy ⋅ Nikolay Blagoev ⋅ Jona Lintelo ⋅ Stefanos Koffas ⋅ Marina Krček ⋅ Stjepan Picek
15.
Lightweight and Interpretable Transformer via Mixed Graph Algorithm Unrolling for Traffic Forecast
Ji Qi ⋅ Mingxiao Liu ⋅ Viet Thuc ⋅ Yuzhe Li ⋅ Zhuoshi Pan ⋅ Gene Cheung ⋅ H. Vicky Zhao
17.
SafetyPairs: Isolating Safety Critical Image Features With Counterfactual Image Generation
Alec Helbling ⋅ Shruti Palaskar ⋅ Kundan Krishna ⋅ Polo Chau ⋅ Leon Gatys ⋅ Joseph Cheng
19.
Simple LLM Baselines are Competitive for Model Diffing
Elias Kempf ⋅ Simon Schrodi ⋅ Bartosz Cywiński ⋅ Thomas Brox ⋅ Neel Nanda ⋅ Arthur Conmy
20.
Tightening Optimality Gap with Confidence through Conformal Prediction
Miao Li ⋅ Michael Klamkin ⋅ Russell Bent ⋅ Pascal Van Hentenryck
21.
No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes
Iván Moreno ⋅ Arnau Padres Masdemont ⋅ Anton Hawthorne ⋅ David Demitri Africa ⋅ Lorenzo Pacchiardi
23.
Benchmarking AI Control Protocols for Safety in Medical Question-Answering Tasks
Guido Freire ⋅ Agustín Martínez-Suñé ⋅ Viviana Cotik
25.
Disentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment
Kundan Krishna ⋅ Joseph Cheng ⋅ Charles Maalouf ⋅ Leon Gatys
27.
Same Question, Different Lies: Cross-Context Consistency (C³) for Black-Box Sandbagging Detection
Lin Yulong ⋅ Pablo Bernabeu-Perez ⋅ Benjamin Arnav ⋅ Lennie Wells ⋅ Mary Phuong
30.
Tuning Just Enough: Lightweight Backdoor Attacks on Multi-Encoder Diffusion Models
Ziyuan Chen ⋅ Yujin Jeong ⋅ Tobias Braun ⋅ Anna Rohrbach
31.
Training with Honeypots: Reshaping How LLMs Fail
Samuel Simko ⋅ Punya Syon Pandey ⋅ Zhijing Jin ⋅ Bernhard Schölkopf
32.
GuardReasoner-Omni: A Reasoning-based Multi-modal Guardrail for Text, Image, and Video
Zhenhao Zhu ⋅ Yue Liu ⋅ Yanpei Guo ⋅ Wenjie Qu ⋅ Cancan Chen ⋅ Yufei He ⋅ Yibo Li ⋅ Yulin Chen ⋅ Tianyi Wu ⋅ Huiying Xu ⋅ Xinzhong Zhu ⋅ Jiaheng Zhang
34.
TrustLDM: Benchmarking Trustworthiness in Language Diffusion Model
Yichuan Mo ⋅ Yukun Jiang ⋅ Yanbo Shi ⋅ Mingjie Li ⋅ Michael Backes ⋅ Yang Zhang ⋅ Yisen Wang
35.
Moral Preferences of LLMs Under Directed Contextual Influence
Phil Blandfort ⋅ Tushar Karayil ⋅ Urja Pawar ⋅ Robert Graham ⋅ Alex McKenzie ⋅ Dmitrii Krasheninnikov
36.
Always Keep Your Promises: A Model-Agnostic Attribution Algorithm for Neural Networks
Kevin Lee ⋅ Duncan Smith-Halverson ⋅ Pablo Millan Arias
37.
Mitigating Reward Hacking with RL Training Interventions
Aria Wong ⋅ Josh Engels ⋅ Neel Nanda
38.
Verbosity Tradeoffs and the Impact of Scale on the Faithfulness of LLM Self-Explanations
Noah Y Siegel ⋅ Nicolas Heess ⋅ Maria Perez-Ortiz ⋅ Oana-Maria Camburu
39.
Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors
Max McGuinness ⋅ Alex Serrano ⋅ Luke Bailey ⋅ Scott Emmons
41.
MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Knowledge Poisoning Attacks
Hyeonjeong Ha ⋅ Qiusi Zhan ⋅ Jeonghwan Kim ⋅ Dimitrios Bralios ⋅ Saikrishna Sanniboina ⋅ Nanyun (Violet) Peng ⋅ Kai-Wei Chang ⋅ Daniel Kang ⋅ Heng Ji
42.
Frontier Models Can Take Actions at Low Probabilities
Alex Serrano ⋅ Wen Xing ⋅ David Lindner ⋅ Erik Jenner
43.
Diff Mining: Logit Differences Reveal Finetuning Objectives
Greg Kocher ⋅ Robert West ⋅ Clément Dumas ⋅ Julian Minder
44.
Disentangling goal and framing for detecting LLM jailbreaks
Amirhossein Farzam ⋅ Majid Behbahani ⋅ Mani Malek ⋅ Yuriy Nevmyvaka ⋅ Guillermo Sapiro
45.
MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
Chun Kan ⋅ Tommy Tran ⋅ Vedant Yadav ⋅ Ava Cai ⋅ Ruizhe Li ⋅ Maheep Chaudhary
46.
Model Organisms for Generalization Resistance Under Distribution Shift
Jou Barzdukas ⋅ Jack Peck ⋅ Julian Schulz ⋅ Paulius Rauba ⋅ Lennie Wells
47.
On the Effects of Adversarial Perturbations on Distribution Robustness
Yipei Wang ⋅ Zhaoying Pan ⋅ Xiaoqian Wang
48.
Learn to be Unlearned: Optimizing Language Models for Unlearning via Clustered Gradient Routing
Vincent Hanke ⋅ Jing Xu ⋅ Martin Pawelczyk ⋅ Michael Backes ⋅ Adam Dziedzic ⋅ Franziska Boenisch
49.
Selective Disclosure: Controlling Information Leakage in DocVQA Explanations
Kangsoo Jung ⋅ Mohamed Ali Souibgui ⋅ Changkyu Choi ⋅ Catuscia Palamidessi
50.
GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory
Pepijn Cobben ⋅ Xuanqiang Angelo Huang ⋅ Thao Pham ⋅ Isabel Dahlgren ⋅ Terry Zhang ⋅ Zhijing Jin
51.
SafeGuide: Adaptive Inference-Time Safety Control for Diffusion Models
Tong Zhou ⋅ Juyang Bai ⋅ Xiaolin Xu ⋅ Shaolei Ren
52.
Enabling Preference-driven Unlearning in Few-step Distilled Text-to-Image Diffusion Models
Gaurav Patel ⋅ Jun Fang ⋅ Greg Ver Steeg ⋅ Qiang Qiu ⋅ Sravan Sripada
54.
No One Monitor Fits All: Oversight Strategies for Frontier Agents
Neil Kale ⋅ Shashwat Saxena ⋅ Ziqian Zhong ⋅ Chen Wu ⋅ Aditi Raghunathan
55.
Robust Feature Attribution via Integrated Sensitivity Gradients
Rukmangadh Sai Myana ⋅ Sumit Jha ⋅ Yanzhao Wu
56.
Byzantine Machine Learning: MultiKrum and an Optimal Notion of Robustness
Gilles Bareilles ⋅ Wassim Bouaziz ⋅ Julien Fageot ⋅ El-Mahdi El-Mhamdi
57.
Collaborative Threshold Watermarking
Tameem Bakr ⋅ Anish Ambreth ⋅ Nils Lukas
58.
Inference-Time Backdoors via Hidden Instructions in LLM Chat Templates
Ariel Fogel ⋅ Omer Hofman ⋅ Eilon Cohen ⋅ Roman Vainshtein
59.
A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior
Harry Mayne ⋅ Justin Kang ⋅ Dewi Gould ⋅ Kannan Ramchandran ⋅ Adam Mahdi ⋅ Noah Y Siegel
60.
When Bias Pretends to Be Truth: How Spurious Correlations Undermine Hallucination Detection in LLMs
Shaowen Wang ⋅ Yiqi Dong ⋅ Ruinian Chang ⋅ Tansheng Zhu ⋅ Yuebo Sun ⋅ Kaifeng Lyu ⋅ Jian Li
61.
Dual-Objective Reinforcement Learning with novel Hamilton-Jacobi-Bellman formulations
William Sharpless ⋅ Dylan Hirsch ⋅ Sander Tonkens ⋅ Nikhil Shinde ⋅ Sylvia Herbert
62.
Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics
Ziwen Xu ⋅ Chenyan Wu ⋅ Hengyu Sun ⋅ Haiwen Hong ⋅ Mengru Wang ⋅ Yunzhi Yao ⋅ Longtao Huang ⋅ Hui Xue' ⋅ Shumin Deng ⋅ Zhixuan Chu ⋅ Huajun Chen ⋅ Ningyu Zhang
63.
RAPO: Risk-Aware Preference Optimization for Generalizable Safe Reasoning
Zeming Wei ⋅ Qiaosheng Zhang ⋅ Xia Hu ⋅ Xingcheng Xu
64.
Auditing Cascading Risks in Multi-Agent Systems via Semantic–Geometric Co-evolution
Zixun Luo ⋅ Yuhang Fan ⋅ Hengyu Lin ⋅ Youzhi Zhang
65.
The Alignment Waltz: Jointly Training Agents to Collaborate for Safety
Jingyu (Jack) Zhang ⋅ Haozhu Wang ⋅ Eric Michael Smith ⋅ Sid Wang ⋅ Amr Sharaf ⋅ Mahesh Pasupuleti ⋅ Ben Van Durme ⋅ Daniel Khashabi ⋅ Jason E Weston ⋅ Hongyuan Zhan
66.
Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models
Anmol Goel ⋅ Cornelius Emde ⋅ Sangdoo Yun ⋅ Seong Joon Oh ⋅ Martin Gubri
67.
Scalable Bayesian Monte Carlo: fast uncertainty estimation beyond deep ensembles
Xinzhu Liang ⋅ Joseph Lukens ⋅ Sanjaya Lohani ⋅ Thomas Searles ⋅ Brian Kirby ⋅ Xin Qiu ⋅ Kody Law
68.
Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation
Giovanni De Muri ⋅ Mark Vero ⋅ Robin Staab ⋅ Martin Vechev
69.
Closing the Distribution Gap in Adversarial Training for LLMs
Chengzhi (Martin) Hu ⋅ Jonas Dornbusch ⋅ David Lüdke ⋅ Stephan Günnemann ⋅ Leo Schwinn
71.
Instruction Following by Principled Attention Boosting of Large Language Models
Vitoria Guardieiro ⋅ Avishree Khare ⋅ Adam Stein ⋅ Eric Wong
72.
Watermarking Discrete Diffusion Language Models
Avi Bagchi ⋅ Akhil Bhimaraju ⋅ Moulik Choraria ⋅ Daniel Alabi ⋅ Lav R Varshney
73.
Explainability Is Not a Feature: A Position on Trustworthy AI
Gabriel Banaggia ⋅ Eduardo Soares ⋅ Renato Cerqueira ⋅ Emilio Vital Brazil ⋅ Simone Diniz Junqueira Barbosa
74.
Digging Deeper: Learning Multi-Level Concept Hierarchies
Oscar Hill ⋅ Mateo Espinosa Zarlenga ⋅ Mateja Jamnik
75.
Enhancing Deep Neural Network Reliability with Refinement and Calibration
Ramya Hebbalaguppe ⋅ K.N Ajay Shastry ⋅ Soumya Suvra Ghosal ⋅ Chetan Arora
76.
AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents
Emma Gouné ⋅ Akshat Naik ⋅ Patrick Quinn ⋅ Guillermo Bosch ⋅ Francisco Zabala ⋅ Jason Brown ⋅ Edward Young
77.
Endogenous Resistance to Activation Steering in Language Models
Alex McKenzie ⋅ Keenan Pepper ⋅ Stijn Servaes ⋅ Martin Leitgab ⋅ Murat Cubuktepe ⋅ Michael Vaiana ⋅ Diogo de Lucena ⋅ Judd Rosenblatt ⋅ Michael Graziano
78.
Stress-Testing Alignment Audits with Prompt-Level Strategic Deception
Oliver Daniels ⋅ Benjamin M Marlin ⋅ Perusha Moodley ⋅ David Lindner
79.
How does information access affect LLM monitors' ability to detect sabotage?
Rauno Arike ⋅ Raja Moreno ⋅ Rohan Subramani ⋅ Shubhorup Biswas ⋅ Francis Ward
81.
Robust AI Evaluation through Maximal Lotteries
Hadi Khalaf ⋅ Serena Wang ⋅ Daniel Halpern ⋅ Itai Shapira ⋅ Flavio Calmon ⋅ Ariel Procaccia
84.
Memorization Dynamics in Knowledge Distillation for Language Models
Jaydeep Borkar ⋅ Karan Chadha ⋅ Niloofar Mireshghallah ⋅ Yuchen Zhang ⋅ Irina-Elena Veliche ⋅ David Smith ⋅ Zheng Xu ⋅ Diego Garcia-Olano
85.
Stability-Aware Prompt Optimization for Clinical Data Abstraction
Arinbjörn Kolbeinsson ⋅ Daniel Timbie ⋅ Sajjan Narsinghani ⋅ Sanjay Hariharan
86.
Explaining Grokking in Transformers through the Lens of Inductive Bias
Jaisidh Singh ⋅ Diganta Misra ⋅ Antonio Orvieto
87.
Forgetting is Competition: Rethinking Unlearning as Representation Interference in Diffusion Models
Ashutosh Ranjan ⋅ Vivek Srivastava ⋅ Shirish Karande ⋅ Murari Mandal
88.
AutoBaxBuilder: Bootstrapping Code Security Benchmarking
Tobias von Arx ⋅ Niels Mündler ⋅ Mark Vero ⋅ Maximilian Baader ⋅ Martin Vechev
89.
Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms
Linh Le ⋅ David Williams-King ⋅ Mohamed Merzouk ⋅ Aton Kamanda ⋅ Adam Oberman
90.
Deception in Dialogue: Evaluating and Mitigating Deceptive Behavior in Large Language Models
Marwa Abdulhai ⋅ Ryan Cheng ⋅ Aryansh Shrivastava ⋅ Natasha Jaques ⋅ Yarin Gal ⋅ Sergey Levine
91.
Understanding Empirical Unlearning with Combinatorial Interpretability
Shingo Kodama ⋅ Niv Cohen ⋅ Micah Adler ⋅ Nir Shavit
92.
Knowing When to Quit: A Principled Framework for Dynamic Abstention in LLM Reasoning
Hen Davidov ⋅ Nachshon Cohen ⋅ Oren Kalinsky ⋅ Yaron Fairstein ⋅ Guy Kushilevitz ⋅ Ram Yazdi ⋅ Patrick Rebeschini
93.
Paranoid Monitors: How Long Context Breaks LLM Agent Supervision
Alicia Yang ⋅ Aashiq Muhamed ⋅ Mona Diab ⋅ Virginia Smith
95.
Enhancing Trust in Large Language Models via Uncertainty-Calibrated Fine-tuning
Ranganath Krishnan ⋅ Piyush Khanna ⋅ Omesh Tickoo
96.
Agentic Uncertainty Reveals Agentic Overconfidence
Jean Kaddour ⋅ Srijan Patel ⋅ Gbetondji Dovonon ⋅ Leo Richter ⋅ Pasquale Minervini ⋅ Matt Kusner
97.
Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates
Nikhil Prakash ⋅ Donghao Ren ⋅ Dominik Moritz ⋅ Yannick Assogba
98.
Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models
Mingyeong Kim ⋅ Jungwon Choi ⋅ Chaeyun Jang ⋅ Juho Lee
99.
From Data to Behavior: Predicting Unintended Model Behaviors Before Training
Mengru Wang ⋅ Zhenqian Xu ⋅ Junfeng Fang ⋅ Yunzhi Yao ⋅ Shumin Deng ⋅ Huajun Chen ⋅ Ningyu Zhang
100.
Do LLMs Act Like Rational Agents? Measuring Belief Coherence in Probabilistic Decision Making
Khurram Yamin ⋅ Jingjing Tang ⋅ Santiago Cortes-Gomez ⋅ Amit Sharma ⋅ Eric Horvitz ⋅ Bryan Wilder
102.
Evolving Safety Landscape of Multi-modal Large Language Models: A Survey of Emerging Threats and Safeguards
Xi Li ⋅ Shu Zhao ⋅ Xiaohan Zou ⋅ Fei Zhao ⋅ Fuxiao Liu ⋅ Yusen Zhang ⋅ Cheng Han ⋅ Yushun Dong ⋅ Jiaqi Wang
103.
Delta-Crosscoder: Robust Crosscoder in Narrow Fine-Tuning Regimes
Aly Kassem ⋅ Thomas Jiralerspong ⋅ Negar Rostamzadeh ⋅ Golnoosh Farnadi
104.
The Realignment Problem: When Right becomes Wrong in LLMs
Aakash Sharma ⋅ Debdeep Sanyal ⋅ Manodeep Ray ⋅ Vivek Srivastava ⋅ Shirish Karande ⋅ Murari Mandal
106.
VisualScratchpad: Inference-time Visual Concepts Analysis in Vision Language Models
Hyesu Lim ⋅ Jinho Choi ⋅ Taekyung Kim ⋅ Byeongho Heo ⋅ Jaegul Choo ⋅ Dongyoon Han
107.
Beyond Idealized Patients: Evaluating LLMs under Challenging Patient Behaviors in Medical Consultations
Yahan Li ⋅ Xinyi Jie ⋅ Wanjia Ruan ⋅ Xubei Zhang ⋅ Huaijie Zhu ⋅ Yicheng Gao ⋅ Ruishan Liu
108.
Offline Reinforcement Learning of High-Quality Behaviors Under Robust Style Alignment
Mathieu Petitbois ⋅ Rémy Portelas ⋅ Sylvain Lamprier
109.
Towards Statistical Verification for Trustworthy AI
Blossom Metevier ⋅ Max Springer ⋅ Bohdan Turbal ⋅ Aleksandra Korolova
111.
Secure Code Generation via Online Reinforcement Learning with Vulnerability Reward Model
Tianyi Wu ⋅ Mingzhe Du ⋅ Yue Liu ⋅ Chengran Yang ⋅ Terry Yue Zhuo ⋅ Jiaheng Zhang ⋅ See-Kiong Ng
113.
RouterInterp: Superposed Specialisation in MoE Routing
Ilya Lasy ⋅ Yinuo Cai ⋅ Kola Ayonrinde
114.
Few-Shot Adversarial Low-Rank Fine-Tuning of Vision-Language Models
Sajjad Ghiasvand ⋅ Haniyeh Ehsani Oskouie ⋅ Mahnoosh Alizadeh ⋅ Ramtin Pedarsani
115.
Calibrated Predictive Lower Bounds on Time-to-Unsafe-Sampling in LLMs
Hen Davidov ⋅ Shai Feldman ⋅ Gilad Freidkin ⋅ Yaniv Romano
116.
Monitoring Emergent Reward Hacking During Generation via Internal Activations
Patrick Wilhelm ⋅ Thorsten Wittkopp ⋅ Odej Kao
121.
Efficient Test-Time Scaling of Multi-Step Reasoning by Probing Internal States of Large Language Models
Jingwei Ni ⋅ Ekaterina Fadeeva ⋅ Tianyi Wu ⋅ Mubashara Akhtar ⋅ Jiaheng Zhang ⋅ Elliott Ash ⋅ Markus Leippold ⋅ Timothy Baldwin ⋅ See-Kiong Ng ⋅ Artem Shelmanov ⋅ Mrinmaya Sachan
122.
The Rogue Scalpel: Activation Steering Compromises LLM Safety
Anton Korznikov ⋅ Andrey Galichin ⋅ Alexey Dontsov ⋅ Oleg Rogov ⋅ Ivan Oseledets ⋅ Elena Tutubalina
123.
OmniPatch: A Universal Adversarial Patch for ViT-CNN Cross-Architecture Transfer in Semantic Segmentation
Aarush Aggarwal ⋅ Akshat Tomar ⋅ Amritanshu Tiwari ⋅ Sargam Goyal
124.
BUDDY: Blending Training and Deployment Data with Weighted Expert Ensembles for Post-hoc LLM Calibration
Aishwarya Mandyam ⋅ Wenhui Lu ⋅ Wing Hung Wong ⋅ John Duchi ⋅ Barbara Engelhardt