
Behave: Write Readable ML Tests with Behavior-Driven Development

Table of Contents

Motivation
What is behave?
Invariance Testing
Directional Testing
Minimum Functionality Testing
Behave’s Trade-offs
Conclusion

Motivation
Imagine you create an ML model to predict customer sentiment based on reviews. Upon deploying it, you realize that the model incorrectly labels certain positive reviews as negative when they’re rephrased using negative words.

This is just one example of how an extremely accurate ML model can fail without proper testing. Thus, testing your model for accuracy and reliability is crucial before deployment.
But how do you test your ML model? One straightforward approach is to write a unit test:
from textblob import TextBlob

def test_sentiment_the_same_after_paraphrasing():
    sent = "The hotel room was great! It was spacious, clean and had a nice view of the city."
    sent_paraphrased = "The hotel room wasn't bad. It wasn't cramped, dirty, and had a decent view of the city."

    sentiment_original = TextBlob(sent).sentiment.polarity
    sentiment_paraphrased = TextBlob(sent_paraphrased).sentiment.polarity

    both_positive = (sentiment_original > 0) and (sentiment_paraphrased > 0)
    both_negative = (sentiment_original < 0) and (sentiment_paraphrased < 0)
    assert both_positive or both_negative

This approach works but can be challenging for non-technical or business participants to understand. Wouldn’t it be nice if you could incorporate project objectives and goals into your tests, expressed in natural language?
Feature: Sentiment Analysis
  As a data scientist
  I want to ensure that my model is invariant to paraphrasing
  So that my model can produce consistent results.

  Scenario: Paraphrased text
    Given a text
    When the text is paraphrased
    Then both text should have the same sentiment

This is where behave comes in handy.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!


What is behave?
behave is a Python framework for behavior-driven development (BDD). BDD is a software development methodology that:

Emphasizes collaboration between stakeholders (such as business analysts, developers, and testers)
Enables users to define requirements and specifications for a software application

Since behave provides a common language and format for expressing requirements and specifications, it can be ideal for defining and validating the behavior of machine learning models.
To install behave, type:
pip install behave

Let’s use behave to perform various tests on machine learning models.

📚 For comprehensive unit testing strategies and best practices, check out Production-Ready Data Science.

Invariance Testing
Invariance testing checks whether an ML model produces consistent results under different conditions.
An example of invariance testing involves verifying if a model is invariant to paraphrasing. An ideal model should maintain consistent sentiment scores even when a positive review is rephrased using negative words like “wasn’t bad” instead of “was good.”

Feature File
To use behave for invariance testing, create a directory called features. Under that directory, create a file called invariant_test_sentiment.feature.
└── features/
    └── invariant_test_sentiment.feature

Within the invariant_test_sentiment.feature file, we will specify the project requirements:
Feature: Sentiment Analysis
  As a data scientist
  I want to ensure that my model is invariant to paraphrasing
  So that my model can produce consistent results.

  Scenario: Paraphrased text
    Given a text
    When the text is paraphrased
    Then both text should have the same sentiment

The “Given,” “When,” and “Then” parts of this file present the actual steps that will be executed by behave during the test.
The Feature section serves as living documentation to provide context but does not trigger test execution.
Python Step Implementation
To implement the steps used in the scenarios with Python, start by creating the features/steps directory and a file called invariant_test_sentiment.py within it:
└── features/
    ├── invariant_test_sentiment.feature
    └── steps/
        └── invariant_test_sentiment.py

The invariant_test_sentiment.py file contains the following code, which tests whether the sentiment produced by the TextBlob model is consistent between the original text and its paraphrased version.
from behave import given, then, when
from textblob import TextBlob

@given("a text")
def step_given_positive_sentiment(context):
    context.sent = "The hotel room was great! It was spacious, clean and had a nice view of the city."

@when("the text is paraphrased")
def step_when_paraphrased(context):
    context.sent_paraphrased = "The hotel room wasn't bad. It wasn't cramped, dirty, and had a decent view of the city."

@then("both text should have the same sentiment")
def step_then_sentiment_analysis(context):
    # Get the sentiment of each sentence
    sentiment_original = TextBlob(context.sent).sentiment.polarity
    sentiment_paraphrased = TextBlob(context.sent_paraphrased).sentiment.polarity

    # Print the sentiments
    print(f"Sentiment of the original text: {sentiment_original:.2f}")
    print(f"Sentiment of the paraphrased sentence: {sentiment_paraphrased:.2f}")

    # Assert that both sentences have the same sentiment
    both_positive = (sentiment_original > 0) and (sentiment_paraphrased > 0)
    both_negative = (sentiment_original < 0) and (sentiment_paraphrased < 0)
    assert both_positive or both_negative

Explanation of the code above:

The steps are identified using decorators matching the feature’s predicate: given, when, and then.
The decorator accepts a string containing the rest of the phrase in the matching scenario step.
The context variable allows you to share values between steps.
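Because step implementations are plain Python functions, you can see how the context object works without running behave at all. The sketch below is a hypothetical stand-in (a simple namespace, not behave's actual runner or Context class) that mimics how behave passes the same context to each step in scenario order:

```python
from types import SimpleNamespace

# Hypothetical stand-in for behave's context: a plain namespace object
# that accumulates attributes as each step runs.
context = SimpleNamespace()

def given_a_text(context):
    context.sent = "The hotel room was great!"

def when_paraphrased(context):
    context.sent_paraphrased = "The hotel room wasn't bad."

# behave would call the steps in scenario order, passing the same context
given_a_text(context)
when_paraphrased(context)
print(context.sent, "->", context.sent_paraphrased)
```

This is why a value set in a "Given" step is available later in the "Then" step: every step function receives the same mutable object.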

Run the Test
To run the invariant_test_sentiment.feature test, type the following command:
behave features/invariant_test_sentiment.feature

Output:
Feature: Sentiment Analysis # features/invariant_test_sentiment.feature:1
  As a data scientist
  I want to ensure that my model is invariant to paraphrasing
  So that my model can produce consistent results.
  Scenario: Paraphrased text
    Given a text
    When the text is paraphrased
    Then both text should have the same sentiment
      Traceback (most recent call last):
        assert both_positive or both_negative
      AssertionError

      Captured stdout:
      Sentiment of the original text: 0.66
      Sentiment of the paraphrased sentence: -0.38

Failing scenarios:
  features/invariant_test_sentiment.feature:6  Paraphrased text

0 features passed, 1 failed, 0 skipped
0 scenarios passed, 1 failed, 0 skipped
2 steps passed, 1 failed, 0 skipped, 0 undefined

The output shows that the first two steps passed and the last step failed, indicating that the model is affected by paraphrasing.
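To exercise more than one paraphrase pair without duplicating scenarios, behave also supports Scenario Outline with an Examples table, which runs the scenario once per row. This is a sketch: the step texts and example sentences below are hypothetical and would need matching parameterized step implementations.

```gherkin
Scenario Outline: Paraphrased text
  Given the text "<text>"
  When it is paraphrased as "<paraphrase>"
  Then both texts should have the same sentiment

  Examples:
    | text                     | paraphrase                    |
    | The hotel room was great | The hotel room wasn't bad     |
    | The food was delicious   | The food wasn't disappointing |
```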
Directional Testing
Directional testing is a statistical method used to assess whether the impact of an independent variable on a dependent variable is in a particular direction, either positive or negative.
An example of directional testing is to check whether the presence of a specific word has a positive or negative effect on the sentiment score of a given text.

To use behave for directional testing, we will create two files: directional_test_sentiment.feature and directional_test_sentiment.py.
└── features/
    ├── directional_test_sentiment.feature
    └── steps/
        └── directional_test_sentiment.py

Feature File
The code in directional_test_sentiment.feature specifies the requirements of the project as follows:
Feature: Sentiment Analysis with Specific Word
  As a data scientist
  I want to ensure that the presence of a specific word
  has a positive or negative effect on the sentiment score of a text

  Scenario: Sentiment analysis with specific word
    Given a sentence
    And the same sentence with the addition of the word 'awesome'
    When I input the new sentence into the model
    Then the sentiment score should increase

Notice the "And" keyword in the second step. Because the preceding step starts with "Given," behave treats the "And" step as another "Given" step.
Python Step Implementation
The code in directional_test_sentiment.py implements a test scenario, which checks whether the presence of the word "awesome" positively affects the sentiment score generated by the TextBlob model.
from behave import given, then, when
from textblob import TextBlob

@given("a sentence")
def step_given_positive_word(context):
    context.sent = "I love this product"

@given("the same sentence with the addition of the word '{word}'")
def step_given_a_positive_word(context, word):
    context.new_sent = f"I love this {word} product"

@when("I input the new sentence into the model")
def step_when_use_model(context):
    context.sentiment_score = TextBlob(context.sent).sentiment.polarity
    context.adjusted_score = TextBlob(context.new_sent).sentiment.polarity

@then("the sentiment score should increase")
def step_then_positive(context):
    assert context.adjusted_score > context.sentiment_score

The second step uses the parameter syntax {word}. When the .feature file is run, the value specified for {word} in the scenario is automatically passed to the corresponding step function.
This means that if the scenario states that the same sentence should include the word “awesome,” behave will automatically replace {word} with “awesome.”

This parameterization is useful when you want to try different values for {word} by editing only the .feature file, without touching the .py file.
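Under the hood, behave turns the step text into a pattern with named parameters (it actually uses the parse library for this). The stdlib regex below is only a simplified sketch of the same idea, not behave's real matcher:

```python
import re

# Simplified sketch of behave's {word} parameter matching:
# the placeholder behaves like a named capture group.
step_pattern = r"the same sentence with the addition of the word '(?P<word>\w+)'"
step_text = "the same sentence with the addition of the word 'awesome'"

match = re.match(step_pattern, step_text)
print(match.group("word"))  # -> awesome
```

The captured value is what behave passes to the step function as the `word` argument.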

Run the Test
behave features/directional_test_sentiment.feature

Output:
Feature: Sentiment Analysis with Specific Word
  As a data scientist
  I want to ensure that the presence of a specific word
  has a positive or negative effect on the sentiment score of a text
  Scenario: Sentiment analysis with specific word
    Given a sentence
    And the same sentence with the addition of the word 'awesome'
    When I input the new sentence into the model
    Then the sentiment score should increase

1 feature passed, 0 failed, 0 skipped
1 scenario passed, 0 failed, 0 skipped
4 steps passed, 0 failed, 0 skipped, 0 undefined

Since all the steps passed, we can infer that the sentiment score increases due to the new word’s presence.
Minimum Functionality Testing
Minimum functionality testing verifies that the system or product meets its minimum requirements and is functional for its intended use.
One example of minimum functionality testing is to check whether the model can handle different types of inputs, such as numerical, categorical, or textual data. To test with diverse inputs, generate test data using Faker for more comprehensive validation.
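Before reaching for Faker, you can already build a small pool of typed inputs with the standard library. The snippet below is an illustrative stand-in (the variable names and the seed are my own choices, not part of the original tests):

```python
import random
import string

# Stdlib stand-in for Faker: generate a few inputs of different
# types to feed into minimum functionality tests.
random.seed(0)  # make the generated test data reproducible

test_inputs = [
    random.randint(0, 100),                                # numerical (int)
    round(random.uniform(0.0, 10.0), 2),                   # numerical (float)
    "".join(random.choices(string.ascii_lowercase, k=8)),  # textual
    [random.randint(0, 9) for _ in range(3)],              # list
]

for value in test_inputs:
    print(type(value).__name__, value)
```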

To use minimum functionality testing for input validation, create two files: minimum_func_test_input.feature and minimum_func_test_input.py.
└── features/
    ├── minimum_func_test_input.feature
    └── steps/
        └── minimum_func_test_input.py

Feature File
The code in minimum_func_test_input.feature specifies the project requirements as follows:
Feature: Test my_ml_model

  Scenario: Test integer input
    Given I have an integer input of 42
    When I run the model
    Then the output should be an array of one number

  Scenario: Test float input
    Given I have a float input of 3.14
    When I run the model
    Then the output should be an array of one number

  Scenario: Test list input
    Given I have a list input of [1, 2, 3]
    When I run the model
    Then the output should be an array of three numbers

Python Step Implementation
The code in minimum_func_test_input.py implements the requirements, checking if the output generated by predict for a specific input type meets the expectations.
from behave import given, then, when

import numpy as np
from ast import literal_eval
from sklearn.linear_model import LinearRegression
from typing import Union

def predict(input_data: Union[int, float, str, list]):
    """Train a toy model and predict on the input data."""

    # Reshape the input data
    if isinstance(input_data, (int, float, list)):
        input_array = np.array(input_data).reshape(-1, 1)
    else:
        raise ValueError("Input type not supported")

    # Create a linear regression model
    model = LinearRegression()

    # Train the model on a sample dataset
    X = np.array([[1], [2], [3], [4], [5]])
    y = np.array([2, 4, 6, 8, 10])
    model.fit(X, y)

    # Predict the output using the input array
    return model.predict(input_array)

@given("I have an integer input of {input_value}")
def step_given_integer_input(context, input_value):
    context.input_value = int(input_value)

@given("I have a float input of {input_value}")
def step_given_float_input(context, input_value):
    context.input_value = float(input_value)

@given("I have a list input of {input_value}")
def step_given_list_input(context, input_value):
    # literal_eval is a safer choice than eval for parsing the list literal
    context.input_value = literal_eval(input_value)

@when("I run the model")
def step_when_run_model(context):
    context.output = predict(context.input_value)

@then("the output should be an array of one number")
def step_then_check_single_output(context):
    assert isinstance(context.output, np.ndarray)
    assert all(isinstance(x, (int, float)) for x in context.output)
    assert len(context.output) == 1

@then("the output should be an array of three numbers")
def step_then_check_triple_output(context):
    assert isinstance(context.output, np.ndarray)
    assert all(isinstance(x, (int, float)) for x in context.output)
    assert len(context.output) == 3
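predict raises a ValueError for unsupported input types, so a scenario could also assert the error path by capturing the exception in the "when" step and checking it in the "then" step. The sketch below is self-contained: it uses a stub that mirrors predict's type check (an assumption) and plain functions in place of behave's runner.

```python
from types import SimpleNamespace

def predict_stub(input_data):
    # Stub mirroring the type check in predict() above (assumption)
    if not isinstance(input_data, (int, float, list)):
        raise ValueError("Input type not supported")
    return [0.0]

context = SimpleNamespace()

# "When I run the model with a string input": capture the error on context
try:
    context.output = predict_stub("not a number")
    context.error = None
except ValueError as exc:
    context.error = exc

# "Then the model should reject the input"
assert isinstance(context.error, ValueError)
print(context.error)
```

Storing the exception on context keeps the "when" step from crashing the scenario, so the "then" step can make the assertion explicit.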

Run the Test
behave features/minimum_func_test_input.feature

Output:
Feature: Test my_ml_model

  Scenario: Test integer input
    Given I have an integer input of 42
    When I run the model
    Then the output should be an array of one number

  Scenario: Test float input
    Given I have a float input of 3.14
    When I run the model
    Then the output should be an array of one number

  Scenario: Test list input
    Given I have a list input of [1, 2, 3]
    When I run the model
    Then the output should be an array of three numbers

1 feature passed, 0 failed, 0 skipped
3 scenarios passed, 0 failed, 0 skipped
9 steps passed, 0 failed, 0 skipped, 0 undefined

Since all the steps passed, we can conclude that the model outputs match our expectations.
Behave’s Trade-offs
This section will outline some drawbacks of using behave compared to pytest, and explain why it may still be worth considering the tool.
Learning Curve
Using Behavior-Driven Development (BDD) in behave may result in a steeper learning curve than the more traditional testing approach used by pytest.

Counter argument: The focus on collaboration in BDD can lead to better alignment between business requirements and software development, resulting in a more efficient development process overall.

Slower Performance
behave tests can be slower than pytest tests because behave must parse the feature files and map them to step definitions before running the tests.

Counter argument: behave’s focus on well-defined steps can lead to tests that are easier to understand and modify, reducing the overall effort required for test maintenance.

Less Flexibility
behave is more rigid in its syntax, while pytest allows more flexibility in defining tests and fixtures.

Counter argument: behave’s rigid structure can help ensure consistency and readability across tests, making them easier to understand and maintain over time.

Conclusion
You’ve learned how to use behave to write readable tests for a data science project.
Key takeaways:
How behave works:

Feature files serve as living documentation: They communicate test intent in natural language while driving actual test execution
Step decorators bridge features and code: @given, @when, and @then decorators map feature file steps to Python test implementations

Three essential test types:

Invariance testing: Ensures your model produces consistent results when inputs are paraphrased or slightly modified
Directional testing: Validates that specific changes have the expected positive or negative impact on predictions
Minimum functionality testing: Verifies your model handles different input types correctly

Despite trade-offs like a steeper learning curve and slower performance compared to pytest, behave excels where it matters most for ML testing: making model behavior transparent and testable by both technical and non-technical team members.
Related Tutorials

Configuration Management: Hydra for Python Configuration for managing test specifications with YAML-like syntax

📚 Want to go deeper? Learning new techniques is the easy part. Knowing how to structure, test, and deploy them is what separates side projects from real work. My book shows you how to build data science projects that actually make it to production. Get the book →



Faker: Generate Realistic Test Data in Python with One Line of Code

Table of Contents

Motivation
Basics of Faker
Location-Specific Data Generation
Create Text
Create Profile Data
Create Random Python Datatypes
Conclusion

Motivation
Let’s say you want to create data of certain types (bool, float, text, int) with specific characteristics (names, addresses, colors, emails, phone numbers, locations) to test a Python library or a specific implementation. But finding that exact kind of data takes time. You wonder: is there a quick way to create your own data?
What if there is a package that enables you to create fake data in one line of code such as this:
fake.profile()

{
'address': '076 Steven Trace\nJillville, ND 12393',
'birthdate': datetime.date(1981, 11, 19),
'blood_group': 'O-',
'company': 'Johnson-Rodriguez',
'current_location': (Decimal('61.969848'), Decimal('121.407164')),
'job': 'Patent examiner',
'mail': 'ohicks@hotmail.com',
'name': 'Katie Romero',
'residence': '271 Smith Wells\nMichaelport, MN 40933',
'sex': 'F',
'ssn': '281-84-3963',
'username': 'eparker',
'website': ['https://www.gonzalez.com/', 'https://rogers-scott.com/']
}

This can be done with Faker, a Python package that generates fake data for you, ranging from a specific data type to specific characteristics of that data, and the origin or language of the data. Let’s discover how we can use Faker to create fake data.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!


Basics of Faker
Start with installing the package:
pip install Faker

Import Faker:
from faker import Faker

fake = Faker()

Some basic methods of Faker:
print(fake.color_name())
print(fake.name())
print(fake.address())
print(fake.job())
print(fake.date_of_birth(minimum_age=30))
print(fake.city())

Tan
Kristin Buck
715 Peter Views
Abigailport, ME 57602
Systems analyst
1946-03-07
Evanmouth

Let’s say you are the author of a fiction book who wants to create a character but finds it difficult and time-consuming to come up with a realistic name and background. You can write:
name = fake.name()
color = fake.color_name()
city = fake.city()
job = fake.job()

print(f'Her name is {name}. She lives in {city}. Her favorite color is {color}. She works as a {job}')

Her name is Debra Armstrong. She lives in Beanview. Her favorite color is GreenYellow. She works as a Lawyer

With Faker, you can generate a persuasive example instantly!
Location-Specific Data Generation
Luckily, we can also specify the locale of the data we want to fake. Maybe the character you want to create is from Italy, and you also want to create instances of her friends. Since you are from the US, it is difficult to come up with details relevant to that locale. This is easily handled by passing a locale to the Faker constructor:
fake = Faker('it_IT')

for _ in range(10):
    print(fake.name())

Angelica Donarelli-Marangoni
Rosaria Castiglione
Federica Iacovelli
Puccio Armellini
Dina Donini-Alboni
Dott. Carolina Marrone
Olga Nosiglia
Graziella Russo
Paulina Galiazzo
Dott. Riccardo Padovano

Or create information from multiple locales:
fake = Faker(['ja_JP','zh_CN','es_ES','en_US','fr_FR'])

for _ in range(10):
    print(fake.city())

齐齐哈尔市
Blakefort
North Joeborough
玉兰市
Saint Suzanne-les-Bains
Melilla
調布市
富津市
Maillot-sur-Mer
East Jamesshire

If you are from one of these countries, I hope you recognize the locations. If you are curious about the other locales you can specify, check out the doc here.
Create Text
Create Random Text
We can create random text with:
fake = Faker('en_US')
print(fake.text())

Gas threat perhaps minute energy thus. Relate group science car discussion budget art.
Let visit reach senior. Story once list almost. Enough major everyone.

Try with the Vietnamese language:
fake = Faker('vi_VN')
print(fake.text())

Như không cho số vậy tại đến. Hơn các thay. Khi từ cũng không rất là.
Gần được cho có nơi như vẫn cho. Nơi đi về giống.
Mà cũng từ nhưng lớn. Từng của nếu khi như nhưng.

None of this random text makes sense, but it is a quick way to create placeholder text for testing.
Create Text from Selected Words
Or we can also create text from a list of words:
fake = Faker()
my_information = ['dog','swimming', '21', 'slow', 'girl', 'coffee', 'flower','pink']

print(fake.sentence(ext_word_list=my_information))
print(fake.sentence(ext_word_list=my_information))

Coffee pink coffee.
Dog pink 21 pink.

Create Profile Data
We can quickly create a profile with:
fake = Faker()
fake.profile()

{'job': 'Nurse, adult',
'company': 'Johnson, Moore and Glover',
'ssn': '762-56-8929',
'residence': '742 Shane Groves\nLake Jasminefort, GU 12583',
'current_location': (Decimal('-77.3842165'), Decimal('7.407430')),
'blood_group': 'B-',
'website': ['https://brooks.com/'],
'username': 'brownamanda',
'name': 'Carolyn Navarro',
'sex': 'F',
'address': '505 Lewis Grove Apt. 588\nHowardville, ID 68181',
'mail': 'larry00@hotmail.com',
'birthdate': datetime.date(1946, 6, 13)}

As we can see, most of the relevant information about a person is created with ease, including mail, ssn, username, and website.
What is even more useful is that we can create a DataFrame of 100 profiles from different locales:
import pandas as pd

fake = Faker(['it_IT','ja_JP', 'zh_CN', 'de_DE','en_US'])
profiles = [fake.profile() for i in range(100)]

pd.DataFrame(profiles).head()

|   | job | company | ssn | residence | current_location | blood_group | website | username | name | sex | address | mail | birthdate |
|---|-----|---------|-----|-----------|------------------|-------------|---------|----------|------|-----|---------|------|-----------|
| 0 | Physiological scientist | Sobrero-Mazzanti Group | CLGTNO59H42A473Z | Incrocio Cabrini, 14 Appartamento 59\n74100, L… | (-88.2637715, 149.968584) | AB+ | [http://federici-endrizzi.it/, http://www.paru…] | giuliagreco | Dott. Liliana Serraglio | F | Vicolo Milo, 0\n64020, Ripattoni (TE) | giolittiflavio@gmail.com | 1998-10-10 |
| 1 | 花火師 | 阿部運輸株式会社 | 701-41-9799 | 和歌山県印旛郡本埜村鳥越20丁目23番18号 | (79.245074, 109.117174) | O+ | [https://suzuki.com/, http://ishikawa.jp/] | lyamamoto | 斉藤 明美 | F | 東京都江戸川区神明内40丁目12番20号 | akemiyamada@yahoo.com | 1916-12-09 |
| 2 | 小説家 | 小林食品株式会社 | 103-28-5057 | 島根県富津市細野7丁目16番1号 | (-84.3304275, 38.093874) | A+ | [https://tanaka.jp/, http://www.fujita.net/, h…] | minoru62 | 渡辺 英樹 | M | 青森県川崎市川崎区長畑22丁目27番12号 | minoru35@yahoo.com | 2008-02-17 |
| 3 | ゲームクリエイター | 佐藤水産有限会社 | 123-85-7967 | 宮城県調布市隼町3丁目22番12号 アーバン台東327 | (-49.3689775, -134.762867) | AB- | [http://www.sato.org/, http://kato.net/, http:…] | ayamamoto | 鈴木 洋介 | M | 栃木県川崎市中原区虎ノ門30丁目27番20号 | yuta56@hotmail.com | 1917-01-25 |
| 4 | 薬剤師 | 合同会社高橋建設 | 891-98-2169 | 山梨県山武郡横芝光町轟4丁目22番10号 コート天神島159 | (-62.1493985, -105.171377) | B+ | [http://yamashita.jp/, http://www.shimizu.com/] | yosukekimura | 田中 真綾 | F | 山口県府中市下吉羽6丁目20番2号 | hayashiyuki@yahoo.com | 2001-08-09 |

Create Random Python Datatypes
If we just care about the data type, without caring so much about the content, we can easily generate random values of a given type:
Boolean:
print(fake.pybool())

False

A list of up to 5 elements with mixed data types:
print(fake.pylist(nb_elements=5, variable_nb_elements=True))

['juan28@example.org', 8515, 6618, 'UexWQJkGrJFGBAVfHgUt']

A decimal with 5 digits before and 6 digits after the decimal point:
print(fake.pydecimal(left_digits=5, right_digits=6, positive=False, min_value=None, max_value=None))

-26114.564612

You can find more about other Python datatypes that you can create here.
Conclusion
I hope you find Faker a helpful tool for creating data efficiently. It may or may not fit what you are working on right now, but it is good to know there is a tool that generates data for specific needs, such as testing, with ease.
Feel free to check out more information about Faker here.

📚 Want to go deeper? Learning new techniques is the easy part. Knowing how to structure, test, and deploy them is what separates side projects from real work. My book shows you how to build data science projects that actually make it to production. Get the book →


Faker: Generate Realistic Test Data in Python with One Line of Code Read More »

Simulate External Services in Testing with Mock Objects

Testing code that relies on external services, like a database, can be difficult since the behaviors of these services can change. 

A mock object can control the behavior of a real object in a testing environment by simulating responses from external services.

For example, a mock object can stand in for an API call inside a get_data function, letting you test both the success and failure paths without a live service.
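Here is a minimal, dependency-free sketch of that idea. The client object and get_data function are hypothetical, and a Mock is injected directly rather than patching a real library, to keep the example self-contained:

```python
from unittest.mock import Mock

def get_data(client):
    """Fetch from an external service; return None if it is unreachable."""
    try:
        return client.get("http://localhost:5432").json()
    except ConnectionError:
        return None

# Simulate a successful response: the mock returns whatever we configure
ok_client = Mock()
ok_client.get.return_value.json.return_value = {"data": "test"}
assert get_data(ok_client) == {"data": "test"}

# Simulate a failing service without making any real network call
bad_client = Mock()
bad_client.get.side_effect = ConnectionError
assert get_data(bad_client) is None

# The mock also records how it was called
ok_client.get.assert_called_once_with("http://localhost:5432")
```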

Simulate External Services in Testing with Mock Objects Read More »

Pandera: Data Validation Made Simple for Python DataFrames

Poor data quality can lead to incorrect conclusions and bad model performance. Thus, it is important to check data for consistency and reliability before using it.

pandera makes it easy to perform data validation on dataframe-like objects. If the dataframe does not pass validation checks, pandera provides useful error messages.

Pandera: Data Validation Made Simple for Python DataFrames Read More »

Exploring Test Case Strategies: Individual Functions and Pytest Parameterize

To test the same function with multiple test cases, you can do either of the following:

Separate test functions:

This approach involves creating individual test functions for each test case.

def test_add_positive():
    assert add(2, 3) == 5

def test_add_negative():
    assert add(-2, -3) == -5

def test_add_mixed():
    assert add(-2, 3) == 1

def test_add_zero():
    assert add(0, 5) == 4

Output:

pytest_parametrize_example.py ...F                                       [100%]

=================================== FAILURES ===================================
________________________________ test_add_zero _________________________________

    def test_add_zero():
>       assert add(0, 5) == 4
E       assert 5 == 4
E        +  where 5 = add(0, 5)

pytest_parametrize_example.py:14: AssertionError
=========================== short test summary info ============================
FAILED pytest_parametrize_example.py::test_add_zero - assert 5 == 4
========================= 1 failed, 3 passed in 0.05s ==========================

Pros:

Each test case is clearly isolated and easy to understand at a glance.

Cons:

Code duplication – the test structure is repeated for each case.

Adding new test cases requires writing a new function each time.

Changes to the test structure must be made in multiple places.

Use pytest parametrize:

This approach uses pytest’s parametrize decorator to run the same test function with different inputs.

import pytest

def add(num1, num2):
    return num1 + num2

@pytest.mark.parametrize(
    "a, b, expected",
    [(2, 3, 5), (-2, -3, -5), (-2, 3, 1), (0, 5, 4)],
    ids=["positive numbers", "negative numbers", "mixed signs", "zero and positive"],
)
def test_add(a, b, expected):
    assert add(a, b) == expected

Output:

pytest_parametrize_example.py ...F                                       [100%]

=================================== FAILURES ===================================
_________________________ test_add[zero and positive] __________________________

a = 0, b = 5, expected = 4

    @pytest.mark.parametrize(
        "a, b, expected",
        [(2, 3, 5), (-2, -3, -5), (-2, 3, 1), (0, 5, 4)],
        ids=["positive numbers", "negative numbers", "mixed signs", "zero and positive"],
    )
    def test_add(a, b, expected):
>       assert add(a, b) == expected
E       assert 5 == 4
E        +  where 5 = add(0, 5)

pytest_parametrize_example.py:14: AssertionError
=========================== short test summary info ============================
FAILED pytest_parametrize_example.py::test_add[zero and positive] - assert 5 == 4
========================= 1 failed, 3 passed in 0.06s ==========================

Pros:

Easy to add new test cases by adding to the parameter list.

Changes to test structure only need to be made in one place.

Cons:

The purpose of each test case might be less immediately clear, especially for complex tests.

Choosing between these methods depends on your project’s needs. Use individual functions when clarity is crucial. Use parametrize when dealing with numerous similar cases.

Exploring Test Case Strategies: Individual Functions and Pytest Parameterize Read More »

pytest-mock vs unittest.mock: Simplifying Mocking in Python Tests

Traditional mocking with unittest.mock often requires repetitive setup and teardown code, which can make test code harder to read and maintain.

pytest-mock addresses this issue by leveraging pytest’s fixture system, simplifying the mocking process and reducing boilerplate code.

Consider the following example that demonstrates the difference between unittest.mock and pytest-mock.

Using unittest.mock:

%%writefile test_rm_file.py
from unittest.mock import patch
import os

def rm_file(filename):
    os.remove(filename)

def test_with_unittest_mock():
    with patch("os.remove") as mock_remove:
        rm_file("file")
        mock_remove.assert_called_once_with("file")

Using pytest-mock:

%%writefile test_rm_file.py
import os

def rm_file(filename):
    os.remove(filename)

def test_unix_fs(mocker):
    mocker.patch("os.remove")
    rm_file("file")
    os.remove.assert_called_once_with("file")

Key differences:

Setup: pytest-mock uses the mocker fixture, automatically provided by pytest, eliminating the need to import patching utilities.

Patching: With pytest-mock, you simply call mocker.patch('os.remove'), whereas unittest.mock requires a context manager or decorator.

Cleanup: pytest-mock automatically undoes mocking after the test, while unittest.mock relies on the context manager for cleanup.

Accessing mocks: pytest-mock allows direct access to the patched function (e.g., os.remove.assert_called_once_with()), while unittest.mock requires accessing the mock through a variable (e.g., mock_remove.assert_called_once_with()).

Link to pytest-mock.

pytest-mock vs unittest.mock: Simplifying Mocking in Python Tests Read More »

Mocking External Dependencies: Achieving Reliable Test Results

Testing code that relies on external services, like a database, can be difficult since the behaviors of these services can change.

A mock object can control the behavior of a real object in a testing environment by simulating responses from external services.

Here are two common use cases with examples:

Mocking Time-Dependent Functions

When testing functions that depend on the current time or date, you can mock the time to ensure consistent results.

Example: Testing a function that returns data for the last week

from datetime import datetime, timedelta

def get_data_for_last_week():
    end_date = datetime.now().date()
    start_date = end_date - timedelta(days=7)
    return {
        "start_date": start_date.strftime("%Y-%m-%d"),
        "end_date": end_date.strftime("%Y-%m-%d"),
    }

Now, let’s create a test for this function using mock:

from datetime import datetime
from unittest.mock import patch

from main import get_data_for_last_week

@patch("main.datetime")
def test_get_data_for_last_week(mock_datetime):
    # Set a fixed date for the test
    mock_datetime.now.return_value = datetime(2024, 8, 5)

    # Call the function
    result = get_data_for_last_week()

    # Assert the results
    assert result["start_date"] == "2024-07-29"
    assert result["end_date"] == "2024-08-05"

    # Verify that datetime.now() was called
    mock_datetime.now.assert_called_once()

This test mocks the datetime.now() method to return a fixed date, allowing for predictable and consistent test results.

Mocking API calls

When testing code that makes external API calls, mocking helps avoid actual network requests during testing.

Example: Testing a function that makes an API call

import requests
from requests.exceptions import ConnectionError

def get_data():
    """Make an API call to Postgres"""
    try:
        response = requests.get("http://localhost:5432")
        return response.json()
    except ConnectionError:
        return None

from unittest.mock import patch
from requests.exceptions import ConnectionError
from main import get_data

@patch("main.requests.get")
def test_get_data_fails(mock_get):
    """Test the get_data function when the API call fails"""
    # Define what happens when the function is called
    mock_get.side_effect = ConnectionError
    assert get_data() is None

@patch("main.requests.get")
def test_get_data_succeeds(mock_get):
    """Test the get_data function when the API call succeeds"""
    # Define the return value of the function
    mock_get.return_value.json.return_value = {"data": "test"}
    assert get_data() == {"data": "test"}

These tests mock the requests.get() function to simulate both successful and failed API calls, allowing us to test our function’s behavior in different scenarios without making actual network requests.

By using mocks in these ways, we can create more reliable and controlled unit tests for our data projects, ensuring that our code behaves correctly under various conditions.

Mocking External Dependencies: Achieving Reliable Test Results Read More »

Work with Khuyen Tran