NLP models often overfit and generalize poorly with limited data. Expanding datasets with new, annotated real-world data is costly and slow.
nlpaug offers diverse NLP data augmentation techniques, which artificially expand existing datasets. This helps models generalize better and perform robustly on unseen data.
Let’s explore some of the augmentation techniques provided by nlpaug:
1. Character-level Augmentation
Keyboard Augmentation
This technique simulates typos by substituting characters based on keyboard distance.
import nlpaug.augmenter.char as nac
text = "The quick brown fox jumps over the lazy dog."
aug = nac.KeyboardAug()
augmented_text = aug.augment(text)
print("Original:", text)
print("Augmented:", augmented_text[0])
Output:
Original: The quick brown fox jumps over the lazy dog.
Augmented: The quUci b%oan fox j tJps over the lazy dog.
Random Character Insertion
This method randomly inserts characters into the text.
aug = nac.RandomCharAug(action="insert")
augmented_text = aug.augment(text)
print("Original:", text)
print("Augmented:", augmented_text[0])
Output:
Original: The quick brown fox jumps over the lazy dog.
Augmented: The quick Hbr2own fox jumps Govner the slahzy dog.
2. Word-level Augmentation
Spelling Augmentation
This technique introduces common spelling mistakes.
import nlpaug.augmenter.word as naw
aug = naw.SpellingAug()
augmented_texts = aug.augment(text, n=3)
print("Original:", text)
print("Augmented:")
for aug_text in augmented_texts:
print(aug_text)
Output:
Original: The quick brown fox jumps over the lazy dog.
Augmented:
Then quikly brown fox jumps over the lazy dig.
Th quikly brown fox jumps over the lazy doy.
The quick brouwn fox jumps over the lizy doga.
Contextual Word Embeddings
This advanced method uses pre-trained language models to substitute words based on context.
aug = naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action="substitute")
augmented_text = aug.augment(text)
print("Original:", text)
print("Augmented:", augmented_text[0])
Output:
Original: The quick brown fox jumps over the lazy dog.
Augmented: the wild brown fox was over the big dog.
Synonym Substitution
This technique replaces words with their synonyms using WordNet.
aug = naw.SynonymAug(aug_src='wordnet')
augmented_text = aug.augment(text)
print("Original:", text)
print("Augmented:", augmented_text[0])
Output:
Original: The quick brown fox jumps over the lazy dog.
Augmented: The ready brown slyboots jumps over the indolent dog.
Word Splitting
This method randomly splits words into two tokens.
aug = naw.SplitAug()
augmented_text = aug.augment(text)
print("Original:", text)
print("Augmented:", augmented_text[0])
Output:
Original: The quick brown fox jumps over the lazy dog.
Augmented: The qui ck br own fox jumps o ver the lazy dog.
By applying these augmentation techniques, you can significantly expand your training data, leading to more robust and generalizable NLP models.