The problem of model sensitivity in Natural Language Processing (NLP) and how to overcome it.

Transformer models have been shown to be highly sensitive to noisy real-world data. How bad is the problem and what can we do to fix it?

An interesting paper from The Institute for Artificial Intelligence, Medical University of Vienna, Austria studied the robustness of Neural Language Models to input perturbations in NLP.

The paper notes that high-performance neural language models have obtained state-of-the-art results on a wide range of Natural Language Processing (NLP) tasks; however, results on common benchmark datasets often do not reflect model reliability and robustness when applied to noisy, real-world data.

The study conducted comprehensive experiments on different NLP tasks. They investigated the ability of high-performance language models such as BERT, XLNet, RoBERTa, and ELMo in handling different types of input perturbations.
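To make the kinds of perturbations studied concrete, here is a minimal sketch (not from the paper) of three common input deviations: a character-swap typo, a dropped word, and re-ordered adjacent words. The function names and the example sentence are our own illustrations.

```python
import random

def add_typo(text: str, seed: int = 0) -> str:
    """Swap two adjacent letters inside the text (a common typo model)."""
    rng = random.Random(seed)
    chars = list(text)
    # Only swap positions where both characters are letters.
    candidates = [i for i in range(len(chars) - 1)
                  if chars[i].isalpha() and chars[i + 1].isalpha()]
    i = rng.choice(candidates)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def drop_word(text: str, seed: int = 0) -> str:
    """Remove one randomly chosen word."""
    rng = random.Random(seed)
    words = text.split()
    words.pop(rng.randrange(len(words)))
    return " ".join(words)

def swap_adjacent_words(text: str, seed: int = 0) -> str:
    """Re-order two adjacent words."""
    rng = random.Random(seed)
    words = text.split()
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

sentence = "The invoice was issued in Los Angeles last week"
print(add_typo(sentence))
print(drop_word(sentence))
print(swap_adjacent_words(sentence))
```

Perturbations like these leave the meaning obvious to a human reader, which is what makes the resulting model errors so concerning.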

The results from the paper below suggest that language models are sensitive to input deviations and their performance can decrease even when small changes are introduced.

[Table screenshot from the paper]

Key Takeaway

Extremely minor changes to a trained model's text input cause a large (>0.1) reduction in F1 score across all studied tasks.

Key quotes from the paper:

"Even a well-trained, high-performance deep language model can be sensitive to negligible changes in the input that cause the model to make erroneous decisions"


"It may be too simplistic to only rely on accuracy scores obtained on benchmark datasets when evaluating the robustness of NLP systems"

Deeper Insights Findings:

  • Transformer models are very sensitive to perturbations.
  • Small changes (typos, missing/additional words, re-ordering) can cause different results.
  • A typo (misspelling Los Angeles) actually improves the predictions on one of our tests (see image below).
  • “Blah Blah Ltd” is extracted as the vendor name versus “Blah Ltd” in the second test (see image below).
[Table screenshots from our tests]

How do we fix it?

From the paper:

  • Use NLP-Perturbation [github] in tandem with Checklist [github] and other tools to test the sensitivity of models to perturbations.
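Whatever tooling you use to generate perturbations, the core robustness check is the same: run the model on clean and perturbed versions of each input and measure how often the prediction flips. Below is a minimal sketch of that loop; `model_predict` is a hypothetical stand-in for your own classifier, and the toy keyword rule exists purely so the example runs.

```python
def model_predict(text: str) -> str:
    # Toy stand-in for a real model: a keyword rule, for illustration only.
    return "invoice" if "invoice" in text.lower() else "other"

def disagreement_rate(texts, perturb):
    """Fraction of inputs whose prediction changes after perturbation."""
    changed = sum(model_predict(t) != model_predict(perturb(t)) for t in texts)
    return changed / len(texts)

texts = ["Invoice from Blah Blah Ltd", "Meeting notes for Tuesday"]

# Example perturbation: drop the first word of each input.
def drop_first_word(text: str) -> str:
    return " ".join(text.split()[1:])

print(disagreement_rate(texts, drop_first_word))
```

A high disagreement rate on perturbations that preserve meaning is a red flag, even when benchmark accuracy looks strong.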

From Deeper Insights:

Further Reading:

Deeper Insights recommends this paper's section “2. Related Work” for a domain overview.

Our closing comment:

As Deep Learning and transformer models become more widespread, so too do the challenges and pitfalls. Traditional methods of Data Science and Machine Learning no longer suffice; training and running a model are just a small part of building and maintaining a productive, robust AI solution. Domain knowledge and subject matter expertise are imperative to any viable long-term solution.

Deeper Insights is the leading Data Science, AI and Machine Learning company helping organisations across industries unlock the transformative power of AI.

Find out more about our services or email us at

Author: Matt Kidd, Data Scientist, Deeper Insights

Let us solve your impossible problem

Speak to one of our industry specialists about how Artificial Intelligence can help solve your impossible problem.