nlpaug: Enhancing NLP Model Performance with Data Augmentation

NLP models often overfit and generalize poorly with limited data. Expanding datasets with new, annotated real-world data is costly and slow.

nlpaug is a Python library that offers a range of NLP data augmentation techniques for artificially expanding existing datasets. Training on the augmented data helps models generalize better and stay robust on unseen inputs.

Let’s explore some of the augmentation techniques provided by nlpaug:

1. Character-level Augmentation

Keyboard Augmentation

This technique simulates typos by replacing characters with ones that sit close to them on the keyboard.

import nlpaug.augmenter.char as nac

text = "The quick brown fox jumps over the lazy dog."
aug = nac.KeyboardAug()
# augment() returns a list of augmented strings, hence the [0] below
augmented_text = aug.augment(text)

print("Original:", text)
print("Augmented:", augmented_text[0])

Output:

Original: The quick brown fox jumps over the lazy dog.
Augmented: The quUci b%oan fox j tJps over the lazy dog.
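To make the idea of keyboard distance concrete, here is a minimal pure-Python sketch of neighbor-key substitution. It is not nlpaug's implementation: the `KEY_NEIGHBORS` map, the `keyboard_typos` function, and the `char_prob` parameter are all hypothetical names made up for this illustration, and the adjacency map covers only lowercase QWERTY letters.

```python
import random

# Partial QWERTY adjacency map, written by hand for illustration only.
# This is an assumption of the sketch, not the keyboard model nlpaug uses.
KEY_NEIGHBORS = {
    "q": "wa", "w": "qes", "e": "wrd", "r": "etf", "t": "ryg",
    "y": "tuh", "u": "yij", "i": "uok", "o": "ipl", "p": "o",
    "a": "qsz", "s": "awdx", "d": "sefc", "f": "drgv", "g": "fthb",
    "h": "gyjn", "j": "hukm", "k": "jil", "l": "ko",
    "z": "ax", "x": "zsc", "c": "xdv", "v": "cfb", "b": "vgn",
    "n": "bhm", "m": "nj",
}


def keyboard_typos(text, char_prob=0.1, seed=None):
    """Replace each letter with an adjacent key with probability char_prob."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        neighbors = KEY_NEIGHBORS.get(ch.lower())
        if neighbors and rng.random() < char_prob:
            new = rng.choice(neighbors)
            # Preserve the case of the character being replaced.
            out.append(new.upper() if ch.isupper() else new)
        else:
            # Spaces, digits, and punctuation pass through unchanged.
            out.append(ch)
    return "".join(out)


print(keyboard_typos("The quick brown fox jumps over the lazy dog.", char_prob=0.2, seed=0))
```

Raising `char_prob` produces noisier text, so in practice it is worth keeping it low enough that a human could still read the sentence.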

Random Character Insertion

This method randomly inserts characters into the text.

aug = nac.RandomCharAug(action="insert")
augmented_text = aug.augment(text)

print("Original:", text)
print("Augmented:", augmented_text[0])

Output:

Original: The quick brown fox jumps over the lazy dog.
Augmented: The quick Hbr2own fox jumps Govner the slahzy dog.

2. Word-level Augmentation

Spelling Augmentation

This technique introduces common spelling mistakes.

import nlpaug.augmenter.word as naw

aug = naw.SpellingAug()
# n=3 requests three augmented variants of the same sentence
augmented_texts = aug.augment(text, n=3)

print("Original:", text)
print("Augmented:")
for aug_text in augmented_texts:
    print(aug_text)

Output:

Original: The quick brown fox jumps over the lazy dog.
Augmented:
Then quikly brown fox jumps over the lazy dig.
Th quikly brown fox jumps over the lazy doy.
The quick brouwn fox jumps over the lizy doga.

Contextual Word Embeddings

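This advanced method uses a pre-trained language model (here, bert-base-uncased) to substitute words with alternatives that fit the surrounding context. Note that the output below is lowercased, which is expected with an uncased model.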

aug = naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action="substitute")
augmented_text = aug.augment(text)

print("Original:", text)
print("Augmented:", augmented_text[0])

Output:

Original: The quick brown fox jumps over the lazy dog.
Augmented: the wild brown fox was over the big dog.

Synonym Substitution

This technique replaces words with their synonyms using WordNet.

aug = naw.SynonymAug(aug_src='wordnet')
augmented_text = aug.augment(text)

print("Original:", text)
print("Augmented:", augmented_text[0])

Output:

Original: The quick brown fox jumps over the lazy dog.
Augmented: The ready brown slyboots jumps over the indolent dog.
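The mechanics of synonym substitution can be sketched without WordNet. The example below is a rough illustration only: the `SYNONYMS` table is a tiny hand-written stand-in for a real thesaurus, and `replace_synonyms` and `word_prob` are hypothetical names, not part of nlpaug.

```python
import random

# Tiny hand-written synonym table, purely illustrative.
# SynonymAug in the article looks words up in WordNet instead.
SYNONYMS = {
    "quick": ["fast", "speedy", "rapid"],
    "lazy": ["idle", "indolent", "sluggish"],
    "jumps": ["leaps", "hops"],
}


def replace_synonyms(text, word_prob=0.5, seed=None):
    """Swap words found in SYNONYMS for a random synonym with probability word_prob."""
    rng = random.Random(seed)
    out = []
    for token in text.split():
        # Separate trailing punctuation so "lazy," would still match "lazy".
        core = token.rstrip(".,!?;:")
        tail = token[len(core):]
        options = SYNONYMS.get(core.lower())
        if options and rng.random() < word_prob:
            out.append(rng.choice(options) + tail)
        else:
            out.append(token)
    return " ".join(out)


print(replace_synonyms("The quick brown fox jumps over the lazy dog.", word_prob=1.0, seed=0))
```

A lookup table like this ignores word sense, which is also why thesaurus-based augmentation can produce odd substitutions such as "slyboots" for "fox" in the output above.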

Word Splitting

This method randomly splits words into two tokens.

aug = naw.SplitAug()
augmented_text = aug.augment(text)

print("Original:", text)
print("Augmented:", augmented_text[0])

Output:

Original: The quick brown fox jumps over the lazy dog.
Augmented: The qui ck br own fox jumps o ver the lazy dog.
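For intuition, word splitting can be approximated in a few lines of plain Python. This is a sketch under simple assumptions (whitespace tokenization, a uniform random split point), and `split_words`, `word_prob`, and `min_len` are hypothetical names chosen for the example rather than nlpaug parameters.

```python
import random


def split_words(text, word_prob=0.3, min_len=4, seed=None):
    """Split whitespace-separated tokens in two at a random position."""
    rng = random.Random(seed)
    out = []
    for token in text.split():
        # Only split tokens long enough to leave two readable pieces.
        if len(token) >= min_len and rng.random() < word_prob:
            cut = rng.randint(1, len(token) - 1)  # both pieces stay non-empty
            out.append(token[:cut] + " " + token[cut:])
        else:
            out.append(token)
    return " ".join(out)


print(split_words("The quick brown fox jumps over the lazy dog.", word_prob=0.5, seed=0))
```

Because tokens are split on whitespace only, trailing punctuation counts as part of the word here; a real tokenizer would handle that more carefully.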

By applying these augmentation techniques, you can expand your training data without collecting and annotating new examples, which can lead to more robust and generalizable NLP models.
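In practice, expanding a labeled dataset means generating augmented copies of each text while keeping its label. The sketch below shows one way this could look; `expand_dataset` and `swap_case` are hypothetical helpers written for this example, and `swap_case` is only a stand-in augmenter so that the code runs without nlpaug installed.

```python
def expand_dataset(examples, augment_fn, copies=2):
    """Return the original (text, label) pairs plus `copies` augmented variants of each."""
    expanded = []
    for text, label in examples:
        expanded.append((text, label))  # always keep the original example
        for _ in range(copies):
            # The augmented text inherits the label of its source example.
            expanded.append((augment_fn(text), label))
    return expanded


# Stand-in augmenter (an assumption of this sketch, not an nlpaug function).
def swap_case(text):
    return text.swapcase()


train = [("great movie", "pos"), ("terrible plot", "neg")]
augmented = expand_dataset(train, swap_case, copies=2)
print(len(augmented))  # 6: each of the 2 originals plus 2 augmented copies
```

With nlpaug, `augment_fn` could be something like `lambda t: aug.augment(t)[0]`, following the calls shown earlier. Augmentation is normally applied to the training split only, so that validation and test sets still reflect real data.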

Link to nlpaug.

