
Trace-of-Thought Prompting: Investigating Prompt-Based Knowledge Distillation Through Question Decomposition

Knowledge distillation allows smaller neural networks to emulate the performance of larger teacher models with reduced computational demands. Traditional methods for Large Language Models (LLMs) often necessitate extensive fine-tuning, which limits …

Confidence Under the Hood: An Investigation into the Confidence-Probability Alignment in Large Language Models

As the use of Large Language Models (LLMs) becomes more widespread, understanding their self-assessed confidence in generated responses becomes increasingly important, as it is integral to the reliability of their outputs. We …

Picturing Ambiguity: A Visual Twist on the Winograd Schema Challenge

Large Language Models (LLMs) have demonstrated remarkable success in tasks like the Winograd Schema Challenge (WSC), showcasing advanced textual common-sense reasoning. However, applying this reasoning to multimodal domains, where understanding text …

Subtle Biases Need Subtler Measures: Dual Metrics for Evaluating Representative and Affinity Bias in Large Language Models

Research on Large Language Models (LLMs) has often neglected subtle biases that, although less apparent, can significantly influence the models' outputs toward particular social narratives. This study addresses two such biases within LLMs: …

EvoGrad: A Dynamic Take on the Winograd Schema Challenge with Human Adversaries

While Large Language Models (LLMs) excel at the Winograd Schema Challenge (WSC), a coreference resolution task testing common-sense reasoning through pronoun disambiguation, they struggle with instances that feature minor alterations or rewording. To …

WSC+: Enhancing the Winograd Schema Challenge Using Tree-of-Experts

The Winograd Schema Challenge (WSC) serves as a prominent benchmark for evaluating machine understanding. While Large Language Models (LLMs) excel at answering WSC questions, their ability to generate such questions remains less explored. In this …

Debiasing should be good and bad: Measuring the consistency of debiasing techniques in language models

Debiasing methods that seek to mitigate the tendency of Language Models (LMs) to occasionally output toxic or inappropriate text have recently gained traction. In this paper, we propose a standardized protocol which distinguishes methods that yield …

The Turing Quest: Can Transformers Make Good NPCs?

In this paper, we study the viability of deploying language models for non-playable character (NPC) scripts by introducing a novel pipeline for the automatic construction of Transformer-based, believable NPC scripts for a …

ADEPT: An Adjective-Dependent Plausibility Task

A false contract is more likely to be rejected than a contract is, yet a false key is less likely than a key to open doors. While correctly interpreting and assessing the effects of such adjective-noun pairs (e.g., false key) on the plausibility of …

An analysis of dataset overlap on Winograd-style tasks

The Winograd Schema Challenge (WSC) and variants inspired by it have become important benchmarks for common-sense reasoning (CSR). Model performance on the WSC has quickly progressed from chance-level to near-human using neural language models …