Debiasing should be good and bad: Measuring the consistency of debiasing techniques in language models

Debiasing methods that seek to mitigate the tendency of Language Models (LMs) to occasionally output toxic or inappropriate text have recently gained traction. In this paper, we propose a standardized protocol which distinguishes methods that yield …

The Turing Quest: Can Transformers Make Good NPCs?

In this paper, we study the viability of the deployment of language models towards non-playable character (NPC) scripts, by introducing a novel pipeline for the automatic construction of NPC scripts using Transformer-based believable scripts for a …

ADEPT: An Adjective-Dependent Plausibility Task

A false contract is more likely to be rejected than a contract is, yet a false key is less likely than a key to open doors. While correctly interpreting and assessing the effects of such adjective-noun pairs (e.g., false key) on the plausibility of …

An analysis of dataset overlap on Winograd-style tasks

The Winograd Schema Challenge (WSC) and variants inspired by it have become important benchmarks for common-sense reasoning (CSR). Model performance on the WSC has quickly progressed from chance-level to near-human using neural language models …

A generalized knowledge hunting framework for the Winograd Schema Challenge

We introduce an automatic system that performs well on two common-sense reasoning tasks, the Winograd Schema Challenge (WSC) and the Choice of Plausible Alternatives (COPA). Problem instances from these tasks require diverse, complex forms of …

A knowledge hunting framework for common sense reasoning

We introduce an automatic system that achieves state-of-the-art results on the Winograd Schema Challenge (WSC), a common sense reasoning task that requires diverse, complex forms of inference and knowledge. Our method uses a knowledge hunting module …

How reasonable are common-sense reasoning tasks: A case-study on the Winograd schema challenge and SWAG

Recent studies have significantly improved the state-of-the-art on common-sense reasoning (CSR) benchmarks like the Winograd Schema Challenge (WSC) and SWAG. The question we ask in this paper is whether improved performance on these benchmarks …

The KnowRef coreference corpus: Removing gender and number cues for difficult pronominal anaphora resolution

We introduce a new benchmark for coreference resolution and NLI, KnowRef, that targets common-sense understanding and world knowledge. Previous coreference resolution tasks can largely be solved by exploiting the number and gender of the antecedents, …

The efficacy of single- and dual-hormone artificial pancreas systems at regulating glucose levels during continuous and interval exercise in type 1 diabetes

The artificial pancreas (AP) in its 2 versions, single-hormone AP (insulin only; SAP) and dual-hormone AP (insulin and glucagon; DAP), is a promising modality for the treatment of type 1 diabetes (T1D). We conducted an open-label, randomized, …