nlpaug: Enhancing NLP Model Performance with Data Augmentation

Khuyen Tran

NLP models often overfit and generalize poorly when trained on limited data. Expanding a dataset with new, annotated real-world data is costly and slow.

nlpaug is a Python library that offers a range of NLP data augmentation techniques for artificially expanding existing datasets. Training on the augmented data can help models generalize better and perform more robustly on unseen data.

Let’s explore some of the augmentation techniques provided by nlpaug:

1. Character-level Augmentation

Keyboard Augmentation

This technique simulates typos by replacing characters with ones that sit close to them on the keyboard.

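import nlpaug.augmenter.char as nac

text = "The quick brown fox jumps over the lazy dog."

# Replace some characters with keys that sit nearby on the keyboard
aug = nac.KeyboardAug()
# In recent nlpaug versions augment() returns a list of strings, hence the [0] below
augmented_text = aug.augment(text)

print("Original:", text)
print("Augmented:", augmented_text[0])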

Output:

Original: The quick brown fox jumps over the lazy dog.
Augmented: The quUci b%oan fox j tJps over the lazy dog.
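
To make the idea of keyboard distance concrete, here is a minimal standard-library sketch. It is not nlpaug's implementation: the `NEIGHBORS` map, the `keyboard_typo` function, and the `char_prob` parameter are all hypothetical names chosen for illustration, and the map covers only a handful of keys.

```python
import random

# Hypothetical, partial QWERTY neighbor map for illustration only;
# nlpaug ships its own, more complete keyboard model.
NEIGHBORS = {
    "q": "wa", "u": "yij", "i": "uok", "c": "xvd", "k": "jlm",
    "b": "vgn", "r": "etf", "o": "ipl", "w": "qes", "n": "bhm",
}

def keyboard_typo(text, char_prob=0.3, seed=None):
    """Replace some characters with a neighboring key; length stays the same."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        neighbors = NEIGHBORS.get(ch.lower())
        # Only characters present in the map can be replaced
        if neighbors and rng.random() < char_prob:
            out.append(rng.choice(neighbors))
        else:
            out.append(ch)
    return "".join(out)

print(keyboard_typo("The quick brown fox jumps over the lazy dog.", seed=0))
```

Because each substitution comes from a key next to the original one, the noise resembles real typing slips rather than arbitrary character swaps.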

Random Character Insertion

This method randomly inserts characters into the text.

aug = nac.RandomCharAug(action="insert")
augmented_text = aug.augment(text)

print("Original:", text)
print("Augmented:", augmented_text[0])

Output:

Original: The quick brown fox jumps over the lazy dog.
Augmented: The quick Hbr2own fox jumps Govner the slahzy dog.
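
The same idea can be sketched without the library to show what "random insertion" does to each word. This is an illustrative approximation under assumed names (`insert_random_chars`, `word_prob`), not how nlpaug selects words or characters internally.

```python
import random
import string

def insert_random_chars(text, word_prob=0.3, seed=None):
    """Insert one random letter or digit into a random subset of words."""
    rng = random.Random(seed)
    pool = string.ascii_letters + string.digits
    words = []
    for word in text.split(" "):
        if len(word) > 1 and rng.random() < word_prob:
            # Pick a position after the first character so the word's start is kept
            pos = rng.randrange(1, len(word))
            word = word[:pos] + rng.choice(pool) + word[pos:]
        words.append(word)
    return " ".join(words)

print(insert_random_chars("The quick brown fox jumps over the lazy dog.", seed=0))
```

Every original character survives in order, so the augmented sentence stays recognizable while still looking noisy to a tokenizer.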

2. Word-level Augmentation

Spelling Augmentation

This technique replaces words with common misspellings.

import nlpaug.augmenter.word as naw

aug = naw.SpellingAug()
augmented_texts = aug.augment(text, n=3)

print("Original:", text)
print("Augmented:")
for aug_text in augmented_texts:
    print(aug_text)

Output:

Original: The quick brown fox jumps over the lazy dog.
Augmented:
Then quikly brown fox jumps over the lazy dig.
Th quikly brown fox jumps over the lazy doy.
The quick brouwn fox jumps over the lizy doga.

Contextual Word Embeddings

This advanced method uses pre-trained language models to substitute words based on context.

# Uses a pre-trained BERT model loaded through the transformers library
aug = naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action="substitute")
augmented_text = aug.augment(text)

print("Original:", text)
print("Augmented:", augmented_text[0])

Output:

Original: The quick brown fox jumps over the lazy dog.
Augmented: the wild brown fox was over the big dog.

Synonym Substitution

This technique replaces words with their synonyms using WordNet.

# Requires the NLTK WordNet corpus, e.g. nltk.download('wordnet')
aug = naw.SynonymAug(aug_src='wordnet')
augmented_text = aug.augment(text)

print("Original:", text)
print("Augmented:", augmented_text[0])

Output:

Original: The quick brown fox jumps over the lazy dog.
Augmented: The ready brown slyboots jumps over the indolent dog.
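
As a rough picture of what synonym substitution involves, the sketch below swaps words using a tiny hand-written thesaurus in place of WordNet. The `SYNONYMS` table, the `replace_synonyms` function, and the `prob` parameter are assumptions made for this example and are not part of nlpaug.

```python
import random

# Hypothetical hand-written thesaurus standing in for WordNet
SYNONYMS = {
    "quick": ["fast", "speedy", "rapid"],
    "lazy": ["idle", "indolent", "sluggish"],
    "jumps": ["leaps", "hops"],
}

def replace_synonyms(text, prob=0.5, seed=None):
    """Swap known words for a random synonym, keeping trailing punctuation."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        core = word.rstrip(".,!?")   # the word without trailing punctuation
        tail = word[len(core):]      # the punctuation that was stripped
        options = SYNONYMS.get(core.lower())
        if options and rng.random() < prob:
            out.append(rng.choice(options) + tail)
        else:
            out.append(word)
    return " ".join(out)

print(replace_synonyms("The quick brown fox jumps over the lazy dog.", seed=0))
```

Unlike the character-level methods, this keeps every token a valid word, which is why synonym substitution tends to preserve the meaning and label of the original sentence.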

Word Splitting

This method randomly splits words into two tokens.

aug = naw.SplitAug()
augmented_text = aug.augment(text)

print("Original:", text)
print("Augmented:", augmented_text[0])

Output:

Original: The quick brown fox jumps over the lazy dog.
Augmented: The qui ck br own fox jumps o ver the lazy dog.
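
Word splitting is simple enough to sketch in a few lines of plain Python. The function below is an illustration under assumed names and defaults (`split_words`, `word_prob`, `min_len`), not nlpaug's own logic.

```python
import random

def split_words(text, word_prob=0.3, min_len=4, seed=None):
    """Split a random subset of sufficiently long words into two tokens."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        if len(word) >= min_len and rng.random() < word_prob:
            # Cut somewhere inside the word so both halves are non-empty
            cut = rng.randrange(1, len(word))
            out.extend([word[:cut], word[cut:]])
        else:
            out.append(word)
    return " ".join(out)

print(split_words("The quick brown fox jumps over the lazy dog.", seed=0))
```

Removing the spaces from the result gives back the original characters, so the only thing that changes is where the token boundaries fall.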

Applying these augmentation techniques lets you expand your training data without collecting and labeling new examples, which can help NLP models become more robust and generalize better.
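
One way to put this into practice is to generate a few augmented copies of each labeled example and reuse the original label. The sketch below assumes a hypothetical `expand_dataset` helper and uses a simple stand-in augmenter (`swap_adjacent_chars`) so it runs with the standard library alone; the sample sentences and labels are made up for the example.

```python
import random

def swap_adjacent_chars(text, rng):
    """Stand-in augmenter: swap one random pair of adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def expand_dataset(examples, augment, copies=2, seed=0):
    """Return the original (text, label) pairs plus `copies` augmented variants of each."""
    rng = random.Random(seed)
    expanded = list(examples)  # keep the untouched originals first
    for text, label in examples:
        for _ in range(copies):
            # The augmented text inherits the label of its source example
            expanded.append((augment(text, rng), label))
    return expanded

train = [("great movie, loved it", "pos"), ("boring and far too long", "neg")]
bigger = expand_dataset(train, swap_adjacent_chars, copies=2)
print(len(train), "->", len(bigger))  # prints: 2 -> 6
```

In a real project the `augment` argument could wrap any of the nlpaug augmenters shown earlier. Augmentation is normally applied to the training split only, so that validation and test data still reflect unmodified text.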

Link to nlpaug.
