A central challenge in data curation is that humans, while capable of writing correct text, often write repetitively and with little diversity. A model trained on the resulting dataset is prone to overfitting, which may lead to poor performance on out-of-domain data.

In this work, the authors propose a joint human-machine scheme for data collection, combining the generative strengths of machine learning models with the evaluative strengths of humans. Using DataMaps, they identify the examples most helpful for training (for example, the ambiguous ones) and use GPT-3 to generate similar new examples, prompting it with a handful of existing premise-hypothesis pairs of a given label (entailment/neutral/contradiction) in a paradigm akin to few-shot in-context learning. Humans then review and complete the annotation of the generated examples, either by adjusting the text (for example, improving fluency) or by amending labels.
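As a minimal sketch of the generation step, assuming the paper's exact prompt template is not reproduced here, the few-shot prompt could be assembled roughly as follows (the seed pairs are invented for illustration):

```python
# Hypothetical seed pairs in the style of ambiguous NLI examples; the real
# pipeline would draw these from the data map's "ambiguous" region of MNLI.
SEED_PAIRS = [
    ("The report was finished just before the deadline.",
     "The report was finished well ahead of schedule."),
    ("Some of the visitors stayed for the entire lecture.",
     "All of the visitors stayed for the entire lecture."),
]

def build_prompt(seed_pairs):
    """Format seed premise-hypothesis pairs as an in-context prompt and
    leave the final example open for the model to complete."""
    blocks = [
        f"Example {i}\nPremise: {p}\nHypothesis: {h}"
        for i, (p, h) in enumerate(seed_pairs, start=1)
    ]
    # Open-ended final slot: the model is expected to continue the pattern
    # with a new premise-hypothesis pair of the same flavor.
    blocks.append(f"Example {len(seed_pairs) + 1}\nPremise:")
    return "\n\n".join(blocks)

if __name__ == "__main__":
    # The resulting string would be sent to a text-completion model
    # (GPT-3 in the paper) and the continuation parsed into a new pair.
    print(build_prompt(SEED_PAIRS))
```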

WaNLI yields improvements on multiple out-of-domain datasets, including an 11% improvement on HANS and 9% on Adversarial NLI. Notably, the proposed technique is not tied to NLI, but can be applied to any classification task. Further, when finetuning on MNLI plus an auxiliary dataset, WaNLI performs best as the auxiliary dataset. Even finetuning solely on WaNLI gives the second-best performance, behind only the combination of WaNLI with an auxiliary adversarial NLI dataset.
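For concreteness, a sketch of the "MNLI plus auxiliary dataset" setup using the HuggingFace datasets library might look as follows; the WaNLI hub ID and its column names are assumptions and may need adjusting to the actual release:

```python
from datasets import Value, concatenate_datasets, load_dataset

LABEL_TO_ID = {"entailment": 0, "neutral": 1, "contradiction": 2}

mnli = load_dataset("multi_nli", split="train")
mnli = mnli.select_columns(["premise", "hypothesis", "label"])
mnli = mnli.cast_column("label", Value("int64"))  # drop ClassLabel so features match

wanli = load_dataset("alisawuffles/WANLI", split="train")  # assumed hub ID
wanli = wanli.map(lambda ex: {"label": LABEL_TO_ID[ex["gold"]]})  # assumed label column
wanli = wanli.select_columns(["premise", "hypothesis", "label"])
wanli = wanli.cast_column("label", Value("int64"))

# The concatenated set would then be passed to a standard finetuning loop.
combined = concatenate_datasets([mnli, wanli])
print(f"{len(mnli)} MNLI + {len(wanli)} WaNLI = {len(combined)} training examples")
```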

Finally, the authors analyze artifacts present in WaNLI. They show that a model that sees only one of the premise or hypothesis (so that the label cannot genuinely be inferred) reaches approximately 42% accuracy, while the same experiment on MNLI yields 50%. Further, the semantic similarity between premise and hypothesis is computed for each of the three labels (entailment, neutral, and contradiction) on both MNLI and WaNLI. In MNLI, the premise is most similar to entailment hypotheses, followed by neutral and then contradiction. In WaNLI, the representations lie much closer together with more overlap, making the labels harder to distinguish (especially neutral versus contradiction). Lastly, using DataMaps, the authors show that most MNLI examples are easy to learn, whereas WaNLI's distribution is more spread out.
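A minimal sketch of the premise-hypothesis similarity analysis, with a sentence-transformers encoder standing in for whichever representation the authors actually use:

```python
# Computes the mean premise-hypothesis cosine similarity per gold label.
from collections import defaultdict
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice

def mean_similarity_by_label(examples):
    """examples: iterable of (premise, hypothesis, label) triples."""
    sims = defaultdict(list)
    for premise, hypothesis, label in examples:
        p_emb, h_emb = encoder.encode([premise, hypothesis])
        sims[label].append(util.cos_sim(p_emb, h_emb).item())
    return {label: sum(vals) / len(vals) for label, vals in sims.items()}
```

Run on MNLI, one would expect the entailment > neutral > contradiction ordering; on WaNLI the three per-label means should sit much closer together.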

In conclusion, the authors not only propose an interesting and useful data collection process, but also contribute an NLI dataset that (seemingly) yields more robust models than the current datasets.


Alisa Liu et al. WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation. 2022. arXiv: 2201.05955 [cs.CL].