IJCAI-ECAI 2026 Accepted Papers · Special Track on Human-Centred AI
Presentation format
Every accepted paper is presented in two formats: an oral talk — which must be delivered in person in Bremen by one of the authors — and a poster during a dedicated poster session.
#HC13
Beyond “Made with AI”: Visualizing Provenance Density to Mitigate the Transparency Penalty
As generative AI makes polished prose cheap to produce, users can no longer rely on fluency as a proxy for truth. We call this failure mode the Fluency Trap: users trust fluent hallucinations while also discounting accurate content once it is disclosed as AI-generated. Binary "Made with AI" labels respond with authorship disclosure, but they do not show what supports a claim. We propose Provenance Density, an evidence-visualization interface that shows the density of verified claims in a text. In a user study with 81 participants, an idealized Provenance Density interface produced a large discernment gap between truth and fabrication (+4.15 points, d=1.82), whereas participants given no signal showed no detectable discrimination. A technical audit with 200 samples shows that retrieval density alone is insufficient; unexpectedly, the Consistency Veto carries most of the discriminative signal on dynamic queries. As AI-generated content becomes indistinguishable from human writing, effective transparency must move from authorship disclosure toward evidence visualization.
Human-Centred AI: Humans and AI · Human-Centred AI: AI Ethics, Trust, Fairness · Human-Centred AI: Natural Language Processing
#HC36
Self-Refine Learning in LLM Multi-Agent Systems for Legal Norm Cognition and Compliance
As large language models (LLMs) increasingly serve as autonomous agents in social simulations, ensuring their ability to understand and comply with legal norms is essential. Yet current LLM agents frequently exhibit reward hacking (RH) behaviors, optimizing metrics at the expense of norm adherence, which undermines simulation fidelity and limits deployment. We introduce a TBC-TBA self-refine learning multi-agent framework that enables dynamic normative adaptation through iterative multi-agent feedback. This framework integrates Think-Before-Chat (social feedback processing) and Think-Before-Act (norm-guided decision making) phases, allowing agents to progressively refine their normative understanding via structured interaction cycles. Across five mainstream LLMs and 100 legal scenarios, we find that while LLMs partially recognize legal norms, they systematically exhibit RH behaviors, with illegal action rates (IAR) of 14.29–37.11%. Comparison with human cognition reveals alignment in moral reasoning but sharp divergence in risk perception and probability distortion. To address these deficits, we introduce four methods to improve LLMs' normative compliance. The Dynamic Norm Learning Mechanism (DNLM) serves as the core, using a psychologically grounded identify–infer–implement process that reduces IAR by 15.78% on average and delivers the most significant improvement. We also introduce Deep MaxPain (DMP) for consequence-based deterrence, Norm Analysis Chain-of-Thought (NA-CoT) for structured reasoning, and Few-shot Norm Learning (FNL) for case-based acquisition, all of which enhance compliance. Our findings show that LLM agents can better follow legal norms when equipped with structured self-refine learning and psychologically informed mechanisms. This work improves social alignment in multi-agent systems and opens avenues for future research on scalable, norm-compliant autonomous agents. The code and data are publicly available on GitHub.
Human-Centred AI: Agent-based and Multi-agent Systems · Human-Centred AI: Multidisciplinary Topics and Applications · Human-Centred AI: AI Ethics, Trust, Fairness · Human-Centred AI: Natural Language Processing
#HC39
Uncovering Discriminatory Behavior in LLM-Based CV Screening Systems with Composite Statistical Metamorphic Relations
Artificial intelligence (AI) applications are prone to reproducing existing patterns of marginalization, bias, and discrimination. The adoption of automated decision-making tools, particularly those based on large language models (LLMs), necessitates the detection and analysis of their discriminatory behavior to ensure compliance with anti-discrimination regulations. This work advances metamorphic testing for bias detection, aiming to (1) uncover discriminatory behavior in AI tools and (2) standardize the metamorphic test creation approach, rooting it in legal and academic research on discrimination. To this end, we introduce a metamorphic test creation workflow and evaluate it on the LLM-based CV screening use case. Our method yields a metamorphic relation (MR) catalogue of 8 atomic and 6 composite MRs, legally justified by Article 21(1) of the Charter of Fundamental Rights of the European Union and sociologically motivated by intersectionality theory. The resulting metamorphic test suite uncovers discriminatory failures for all 6 evaluated LLMs, totaling 1,272 defects across 7,800 test cases. Both the new atomic MRs and the composite MRs detect additional discriminatory defects for all LLMs. Composite MRs consistently uncovered more defects than their atomic counterparts, with failure rates increasing on average by 9% for Llama-3.1 8B and 25% for Llama-3.2 3B.
Human-Centred AI: AI Ethics, Trust, Fairness · Human-Centred AI: Machine Learning · Human-Centred AI: Multidisciplinary Topics and Applications · Human-Centred AI: Humans and AI
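Illustration for #HC39: the abstract does not spell out how atomic and composite metamorphic relations are evaluated, so the following is only a minimal sketch of the general idea, not the authors' workflow. It assumes a deterministic screener for simplicity (the paper's relations are statistical), and every name in it (`CV`, `make_relation`, `failure_rate`, `biased_screener`) is hypothetical. A relation holds when changing protected attributes leaves the screening decision unchanged; an atomic relation changes one attribute, a composite relation changes several at once.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List


@dataclass(frozen=True)
class CV:
    """Hypothetical, heavily simplified CV record."""
    gender: str
    nationality: str
    years_experience: int


Screener = Callable[[CV], bool]
Relation = Callable[[Screener, CV], bool]


def make_relation(changes: Dict[str, str]) -> Relation:
    """Build a relation that holds when the decision is invariant to `changes`.

    One changed attribute gives an atomic relation; several give a composite one.
    """
    def relation(screen: Screener, cv: CV) -> bool:
        follow_up = replace(cv, **changes)  # same CV, protected attributes swapped
        return screen(cv) == screen(follow_up)
    return relation


def failure_rate(screen: Screener, relation: Relation, cvs: List[CV]) -> float:
    """Fraction of source CVs for which the relation is violated."""
    failures = sum(not relation(screen, cv) for cv in cvs)
    return failures / len(cvs)


def biased_screener(cv: CV) -> bool:
    """Toy stand-in for an LLM screener with a purely intersectional bias."""
    penalised = cv.gender == "female" and cv.nationality == "non-EU"
    return cv.years_experience >= 3 and not penalised


cvs = [CV("male", "EU", 5), CV("male", "EU", 1)]
atomic_gender = make_relation({"gender": "female"})
atomic_nationality = make_relation({"nationality": "non-EU"})
composite = make_relation({"gender": "female", "nationality": "non-EU"})

for label, rel in [("gender", atomic_gender),
                   ("nationality", atomic_nationality),
                   ("gender+nationality", composite)]:
    print(label, failure_rate(biased_screener, rel, cvs))
```

In this toy setup the bias only appears when both attributes change together, which is the kind of defect a composite relation can expose and an atomic relation can miss.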
#HC50
Psychological Benefits and Costs of Diversifying Algorithmic Recourse
Algorithmic recourse provides counterfactual action plans that help people overturn unfavorable AI decisions. While diverse recourse sets may improve transparency and motivation, they may also impose cognitive load and negative emotions by increasing counterfactual reasoning demands. To examine this trade-off, we conducted a between-subjects controlled experiment (N=750) that manipulated recourse-set diversity while controlling the number of options, and evaluated its effects on psychological benefits and costs. Results show that diversification enhances psychological benefits (e.g., willingness to act) for small sets without incurring additional psychological costs, whereas for large sets, it makes cognitive load more salient. These findings suggest that naively diversifying recourse can burden decision subjects, underscoring the need for new diversification methods that incorporate human cognition and psychology to mitigate such costs.
Human-Centred AI: Humans and AI · Human-Centred AI: AI Ethics, Trust, Fairness
#HC77
Test-Time User Alignment via Bayesian Population Guidance in Subjective Tasks
In subjective tasks, different individuals can have different correct answers for the same input: the ground truth is not fixed but rather determined by each user's personal perspective. Standard foundation models suppress this individual variation by producing population-averaged predictions; conversely, few-shot in-context learning (ICL) struggles to capture user-specific preferences from limited demonstrations. To address this limitation, we propose Population-Guided Bayesian Calibration (PGBC), a test-time personalization method that applies Bayesian inference to foundation model outputs for user-specific alignment. Unlike ICL methods that rely on the model to implicitly infer preferences from examples, PGBC explicitly quantifies each individual's deviation from the population distribution and incorporates this into prediction with minimal feedback (K < 10). Experiments on three subjective benchmark datasets across four foundation models demonstrate that PGBC significantly outperforms zero-shot baselines with a large effect size (Cohen's d = 1.21), and the improvement over ICL approaches is even more pronounced (d = 1.40). Furthermore, user-level distribution analysis reveals that PGBC aligns predictions with individual preferences when zero-shot baselines exhibit substantial bias, while ICL methods increase distributional mismatch. Our results show that PGBC enables practical deployment in cold-start scenarios where minimal user feedback must yield maximal personalization.
Human-Centred AI: Uncertainty in AI · Human-Centred AI: Machine Learning
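Illustration for #HC77: the abstract does not give PGBC's update rule, so the sketch below is not the authors' method. It shows one textbook way to estimate a user's deviation from a population-level prediction from a handful of feedback items, using a conjugate normal-normal model. The function names, the choice of a Gaussian prior centred at zero, and the variance parameters are all assumptions made for illustration.

```python
from typing import List, Tuple


def posterior_offset(residuals: List[float],
                     prior_var: float,
                     noise_var: float) -> Tuple[float, float]:
    """Posterior mean and variance of a user's offset from the model output.

    Assumed model: offset ~ Normal(0, prior_var), and each residual
    (user label minus model prediction) ~ Normal(offset, noise_var).
    """
    k = len(residuals)
    precision = 1.0 / prior_var + k / noise_var  # prior precision + data precision
    mean = (sum(residuals) / noise_var) / precision
    return mean, 1.0 / precision


def calibrate(model_prediction: float, residuals: List[float],
              prior_var: float = 1.0, noise_var: float = 1.0) -> float:
    """Shift a population-level prediction by the estimated user offset."""
    offset, _ = posterior_offset(residuals, prior_var, noise_var)
    return model_prediction + offset


# Four feedback items in which the user rated one point above the model.
print(calibrate(3.0, [1.0, 1.0, 1.0, 1.0]))
```

With no feedback the estimate falls back to the population-level prediction, and each additional residual pulls it toward the user's own average deviation, which is the qualitative behaviour one would want in a cold-start setting.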
#HC78
Responsibility in Multi-Agent Sequential Decision-Making: Comparing Human Judgments to Formal Models of Causal Attribution
With the growing adoption of artificial intelligence in high-stakes decision-making domains, identifying the causes of outcomes, particularly failures, and determining who is responsible has become a critical concern. In this work, we investigate how well formal definitions of responsibility attribution, grounded in the framework of actual causality, align with human judgments of responsibility. To this end, we conduct a large-scale survey to elicit human judgments of responsibility in multi-agent sequential decision-making scenarios, using a modified version of the card game Goofspiel. We evaluate different responsibility attribution methods, assessing their alignment with human judgments about responsibility, and identify factors that significantly shape responsibility judgments. While no single responsibility attribution method consistently aligns with human responses, our findings highlight key factors that influence human responsibility judgments, including agent-specific biases and the amount of information available to agents during decision-making.
#HC79
Studying the Effects of Robot Distraction on School Shooter Behavior Using Virtual Reality
We examine a particular adversarial human-robot interaction in which mobile robots influence the behavior of people role-playing as a school shooter. Using a high-fidelity virtual reality environment, we conducted a controlled study with 150 participants. Two autonomous robots predicted participant movement and positioned themselves to interfere and distract. We manipulated the robots' approach strategy, either moving directly into the participant's path (aggressive) or maintaining distance (passive), along with the level of distraction, ranging from no additional cues (low), to siren and lights (medium), to siren, lights, and smoke to impair visibility (high). An aggressive, high-distraction robot configuration reduced the number of victims by 46.6% relative to a no-robot control. These results show that robot-based distraction can meaningfully alter human behavior and outcomes in adversarial settings, while also raising important ethical questions about the use of such systems in school environments.
Human-Centred AI: Humans and AI · Human-Centred AI: Robotics · Human-Centred AI: Machine Learning
#HC84
DiverValue-Bench: A Benchmark and Fine-Tuning Framework for Aligning Large Language Models with Diverse Human Values
The alignment of large language models (LLMs) with human values is critical for their safe and effective deployment across diverse user populations. However, existing benchmarks often neglect cultural and demographic diversity, leading to limited understanding of how value alignment generalizes globally. In this work, we introduce DiverValue-Bench, a benchmark that systematically evaluates LLMs’ alignment with multi-dimensional human value preferences across 74 countries/regions. DiverValue-Bench contains 23,763 high-quality instances annotated with fine-grained value labels, personalized questions, and rich demographic metadata, providing broad demographic and geographic coverage for population-aware value-alignment evaluation. Using DiverValue-Bench, we conduct an in-depth analysis of several representative LLMs, revealing substantial disparities in alignment performance across geographic and demographic lines. We further demonstrate that lightweight fine-tuning methods, such as Low-Rank Adaptation (LoRA) and Direct Preference Optimization (DPO), can significantly enhance value alignment in both in-domain and out-of-domain settings. Our findings underscore the necessity for population-aware alignment evaluation and provide actionable insights for building culturally adaptive and value-sensitive LLMs. DiverValue-Bench serves as a practical foundation for future research on global alignment, personalized value modeling, and equitable AI development.
#HC136
Human-Centric Behavior-Aware Adaptive Off-Policy Selection
In many human-centric environments, such as education and healthcare, the unobservability of underlying human states has been recognized as a key obstacle to understanding individual needs, thus hindering our ability to provide personalized decision-making policies. Several reinforcement learning (RL)-related approaches have been used to facilitate sequential decision-making in these settings, including off-policy selection (OPS), which aids in safely evaluating and selecting optimal policies offline. However, existing OPS algorithms are unsuitable when both the state is unobserved and the setting requires a personalized policy. To address this challenge, we propose a behavior-aware adaptive policy selection framework (HBO) that first captures potentially unique characteristics of the state from human behaviors, and then estimates when and how to intervene with less uncertainty in a timely manner, with bounded error. HBO is evaluated on two real-world human-centric applications, intelligent tutoring and sepsis treatment, where it significantly enhanced participants' long-term course outcomes and survival rates. Broadly, our work enables improved policy personalization in high-stakes domains where extensive evaluation is not possible.
#HC146
Probing Cultural Awareness in LLMs: A Case Study of Cross-Culture Aesthetic Stylistics
Large Language Models (LLMs) are increasingly deployed in diverse cultural contexts, yet their ability to master aesthetic stylistics, i.e., the strategic use of language to evoke cultural resonance, remains underexplored. We curate C4Styli, a benchmark of highly stylized translated movie titles and advertising slogans from Hong Kong and the Chinese Mainland, to evaluate LLMs through the lens of behavioral recognition and productive competence. Extensive evaluations show that LLMs differ from humans in stylistic recognition, and this recognition ability varies across text domains. In addition, stylistic recognition and generation performance in LLMs are not consistently aligned. To further examine whether LLMs genuinely capture stylistic information in stylistic recognition, we conduct structural ablation with logistic regression probes. We find that, in the Hong Kong setting, stylistic recognition in LLMs relies primarily on surface-level linguistic information rather than stylistic structure. This suggests limited sensitivity to Hong Kong-specific stylistic structure. Our code and data are available at https://github.com/wangjs9/C4STYLI.
Human-Centred AI: Humans and AI
