Generic selectors
Exact matches only
Search in title
Search in content
Post Type Selectors
Filter by Categories
About Article
Analyze Data
Archive
Best Practices
Better Outputs
Blog
Code Optimization
Code Quality
Command Line
Daily tips
Dashboard
Data Analysis & Manipulation
Data Engineer
Data Visualization
DataFrame
Delta Lake
DevOps
DuckDB
Environment Management
Feature Engineer
Git
Jupyter Notebook
LLM
LLM
Machine Learning
Machine Learning
Machine Learning & AI
Manage Data
MLOps
Natural Language Processing
NumPy
Pandas
Polars
PySpark
Python Tips
Python Utilities
Python Utilities
Scrape Data
SQL
Testing
Time Series
Tools
Visualization
Visualization & Reporting
Workflow & Automation
Workflow Automation

nlpaug: Enhancing NLP Model Performance with Data Augmentation

Table of Contents

nlpaug: Enhancing NLP Model Performance with Data Augmentation

NLP models often overfit and generalize poorly with limited data. Expanding datasets with new, annotated real-world data is costly and slow.

nlpaug offers diverse NLP data augmentation techniques, which artificially expand existing datasets. This helps models generalize better and perform robustly on unseen data.

Let’s explore some of the augmentation techniques provided by nlpaug:

1. Character-level Augmentation

Keyboard Augmentation

This technique simulates typos by substituting characters based on keyboard distance.

import nlpaug.augmenter.char as nac

text = "The quick brown fox jumps over the lazy dog."
aug = nac.KeyboardAug()
augmented_text = aug.augment(text)

print("Original:", text)
print("Augmented:", augmented_text[0])

Output:

Original: The quick brown fox jumps over the lazy dog.
Augmented: The quUci b%oan fox j tJps over the lazy dog.

Random Character Insertion

This method randomly inserts characters into the text.

aug = nac.RandomCharAug(action="insert")
augmented_text = aug.augment(text)

print("Original:", text)
print("Augmented:", augmented_text[0])

Output:

Original: The quick brown fox jumps over the lazy dog.
Augmented: The quick Hbr2own fox jumps Govner the slahzy dog.

2. Word-level Augmentation

Spelling Augmentation

This technique introduces common spelling mistakes.

import nlpaug.augmenter.word as naw

aug = naw.SpellingAug()
augmented_texts = aug.augment(text, n=3)

print("Original:", text)
print("Augmented:")
for aug_text in augmented_texts:
    print(aug_text)

Output:

Original: The quick brown fox jumps over the lazy dog.
Augmented:
Then quikly brown fox jumps over the lazy dig.
Th quikly brown fox jumps over the lazy doy.
The quick brouwn fox jumps over the lizy doga.

Contextual Word Embeddings

This advanced method uses pre-trained language models to substitute words based on context.

aug = naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action="substitute")
augmented_text = aug.augment(text)

print("Original:", text)
print("Augmented:", augmented_text[0])

Output:

Original: The quick brown fox jumps over the lazy dog.
Augmented: the wild brown fox was over the big dog.

Synonym Substitution

This technique replaces words with their synonyms using WordNet.

aug = naw.SynonymAug(aug_src='wordnet')
augmented_text = aug.augment(text)

print("Original:", text)
print("Augmented:", augmented_text[0])

Output:

Original: The quick brown fox jumps over the lazy dog.
Augmented: The ready brown slyboots jumps over the indolent dog.

Word Splitting

This method randomly splits words into two tokens.

aug = naw.SplitAug()
augmented_text = aug.augment(text)

print("Original:", text)
print("Augmented:", augmented_text[0])

Output:

Original: The quick brown fox jumps over the lazy dog.
Augmented: The qui ck br own fox jumps o ver the lazy dog.

By applying these augmentation techniques, you can significantly expand your training data, leading to more robust and generalizable NLP models.

Link to nlpaug.

Leave a Comment

Your email address will not be published. Required fields are marked *

0
    0
    Your Cart
    Your cart is empty
    Scroll to Top

    Work with Khuyen Tran

    Work with Khuyen Tran