Motivation
For data scientists who work in a team with various roles, writing clean code is essential because:
- Clean code enhances readability, making it easier for team members to understand and contribute to the codebase.
- Clean code improves maintainability, simplifying tasks such as debugging, modifying, and extending the existing code.
To achieve maintainability, your Python function should:
- Be small
- Do one task
- Have no duplication
- Have one level of abstraction
- Have a descriptive name
- Have fewer than four arguments
Inspired by Robert C. Martin’s book Clean Code: A Handbook of Agile Software Craftsmanship, this article aims to guide data scientists in writing better Python functions.
The source code for this article can be found here.
Get Started
Let’s start by taking a look at the function get_data below.
import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path

import gdown


def get_data(
    url: str,
    zip_path: str,
    raw_train_path: str,
    raw_test_path: str,
    processed_train_path: str,
    processed_test_path: str,
):
    # Download data from Google Drive
    gdown.download(url, zip_path, quiet=False)

    # Unzip data
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(".")

    # Extract texts from files in the train directory
    t_train = []
    for file_path in Path(raw_train_path).glob("*.xml"):
        list_train_doc_1 = [r.text for r in ET.parse(file_path).getroot()[0]]
        train_doc_1 = " ".join(t for t in list_train_doc_1)
        t_train.append(train_doc_1)
    t_train_docs = " ".join(t_train)

    # Extract texts from files in the test directory
    t_test = []
    for file_path in Path(raw_test_path).glob("*.xml"):
        list_test_doc_1 = [r.text for r in ET.parse(file_path).getroot()[0]]
        test_doc_1 = " ".join(t for t in list_test_doc_1)
        t_test.append(test_doc_1)
    t_test_docs = " ".join(t_test)

    # Write processed data to a train file
    with open(processed_train_path, "w") as f:
        f.write(t_train_docs)

    # Write processed data to a test file
    with open(processed_test_path, "w") as f:
        f.write(t_test_docs)


if __name__ == "__main__":
    get_data(
        url="https://drive.google.com/uc?id=1jI1cmxqnwsmC-vbl8dNY6b4aNBtBbKy3",
        zip_path="Twitter.zip",
        raw_train_path="Data/train/en",
        raw_test_path="Data/test/en",
        processed_train_path="Data/train/en.txt",
        processed_test_path="Data/test/en.txt",
    )
Even though there are many comments in this function, it is difficult to understand what this function does because:
- The function is long.
- The function tries to do multiple tasks.
- The code within the function is at different levels of abstraction.
- The function has many arguments.
- There are multiple code duplications.
- The function lacks a descriptive name.
We will refactor this code by using the six practices mentioned at the beginning of the article.
Small
A function should be kept small to enhance its readability. Ideally, a function should not exceed 20 lines of code. Additionally, a function’s indentation level should not exceed one or two.
import zipfile

import gdown


def get_raw_data(url: str, zip_path: str) -> None:
    gdown.download(url, zip_path, quiet=False)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(".")
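The 20-line guideline can be checked mechanically. The following is a minimal sketch, assuming Python 3.9 or later and using only the standard-library ast module; the helper names function_lengths and functions_over_limit are hypothetical and not part of the article's source code.

```python
import ast


def function_lengths(source: str) -> dict[str, int]:
    """Map each function name in `source` to the number of lines it spans."""
    lengths = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # lineno is the line of the `def`; end_lineno is the last body line.
            lengths[node.name] = node.end_lineno - node.lineno + 1
    return lengths


def functions_over_limit(source: str, max_lines: int = 20) -> list[str]:
    """Return the names of functions longer than `max_lines`, sorted by name."""
    lengths = function_lengths(source)
    return sorted(name for name, length in lengths.items() if length > max_lines)
```

Running functions_over_limit on the text of a module gives a quick list of candidates to split. It is only a rough signal: a short function can still do too much.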
Do One Task
Functions should have a singular focus and perform a single task. The function get_data tries to complete multiple tasks, including retrieving data from Google Drive, performing text extraction, and saving the extracted texts.
Thus, this function should be split into several smaller functions, as shown below:
def main(
    url: str,
    zip_path: str,
    raw_train_path: str,
    raw_test_path: str,
    processed_train_path: str,
    processed_test_path: str,
) -> None:
    get_raw_data(url, zip_path)
    t_train, t_test = get_train_test_docs(raw_train_path, raw_test_path)
    save_train_test_docs(processed_train_path, processed_test_path, t_train, t_test)
Each of these functions should have a single purpose:
def get_raw_data(url: str, zip_path: str) -> None:
    gdown.download(url, zip_path, quiet=False)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(".")
The function get_raw_data performs only one action, which is to get raw data.
Duplication
We should avoid duplication because:
- Duplicated code diminishes code readability.
- Duplicated code makes code modification more complicated. If changes are required, they need to be made in multiple locations, increasing the likelihood of errors.
The following code contains duplication, with the code used to retrieve training and test data being nearly identical.
import xml.etree.ElementTree as ET
from pathlib import Path

# Extract texts from files in the train directory
t_train = []
for file_path in Path(raw_train_path).glob("*.xml"):
    list_train_doc_1 = [r.text for r in ET.parse(file_path).getroot()[0]]
    train_doc_1 = " ".join(t for t in list_train_doc_1)
    t_train.append(train_doc_1)
t_train_docs = " ".join(t_train)

# Extract texts from files in the test directory
t_test = []
for file_path in Path(raw_test_path).glob("*.xml"):
    list_test_doc_1 = [r.text for r in ET.parse(file_path).getroot()[0]]
    test_doc_1 = " ".join(t for t in list_test_doc_1)
    t_test.append(test_doc_1)
t_test_docs = " ".join(t_test)
We can eliminate duplication by consolidating the duplicated code into a single function called extract_texts_from_multiple_files that extracts texts from multiple files in a specified location.
def extract_texts_from_multiple_files(folder_path) -> str:
    all_docs = []
    for file_path in Path(folder_path).glob("*.xml"):
        list_of_text_in_one_file = [r.text for r in ET.parse(file_path).getroot()[0]]
        text_in_one_file = " ".join(list_of_text_in_one_file)
        all_docs.append(text_in_one_file)
    return " ".join(all_docs)
Now you can use this function to extract texts from various locations without code duplication.
t_train = extract_texts_from_multiple_files(raw_train_path)
t_test = extract_texts_from_multiple_files(raw_test_path)
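The main function shown earlier calls get_train_test_docs and save_train_test_docs, but the article never shows their bodies. A minimal sketch of what they might look like, assuming they are thin wrappers around the extraction function above and a small save_data helper. Both bodies are hypothetical; only the names and call signatures come from the article.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_texts_from_multiple_files(folder_path: str) -> str:
    all_docs = []
    for file_path in Path(folder_path).glob("*.xml"):
        list_of_text_in_one_file = [r.text for r in ET.parse(file_path).getroot()[0]]
        all_docs.append(" ".join(list_of_text_in_one_file))
    return " ".join(all_docs)


def get_train_test_docs(raw_train_path: str, raw_test_path: str) -> tuple[str, str]:
    # Hypothetical body: the same extraction applied to each directory.
    t_train = extract_texts_from_multiple_files(raw_train_path)
    t_test = extract_texts_from_multiple_files(raw_test_path)
    return t_train, t_test


def save_data(processed_path: str, processed_data: str) -> None:
    with open(processed_path, "w") as f:
        f.write(processed_data)


def save_train_test_docs(
    processed_train_path: str,
    processed_test_path: str,
    t_train: str,
    t_test: str,
) -> None:
    # Hypothetical body: one save_data call per output file, no duplicated logic.
    save_data(processed_train_path, t_train)
    save_data(processed_test_path, t_test)
```

Each helper still does one thing, and the train and test cases differ only in the paths passed in.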
One Level of Abstraction
The level of abstraction describes how much detail of a system is exposed. A high level gives a more generalized view of the system, while a low level deals with its more specific aspects.
It is a good practice to keep the same level of abstraction within a code segment to make the code easier to understand.
To demonstrate this, consider the following function:
def extract_texts_from_multiple_files(folder_path) -> str:
    all_docs = []
    for file_path in Path(folder_path).glob("*.xml"):
        list_of_text_in_one_file = [r.text for r in ET.parse(file_path).getroot()[0]]
        text_in_one_file = " ".join(list_of_text_in_one_file)
        all_docs.append(text_in_one_file)
    return " ".join(all_docs)
The function itself is at a higher level, but the code within the for loop involves lower-level operations related to XML parsing, text extraction, and string manipulation.
To address this mix of abstraction levels, we can encapsulate the lower-level operations in the extract_texts_from_each_file function:
def extract_texts_from_multiple_files(folder_path: str) -> str:
    all_docs = []
    for file_path in Path(folder_path).glob("*.xml"):
        text_in_one_file = extract_texts_from_each_file(file_path)
        all_docs.append(text_in_one_file)
    return " ".join(all_docs)


def extract_texts_from_each_file(file_path: Path) -> str:
    list_of_text_in_one_file = [r.text for r in ET.parse(file_path).getroot()[0]]
    return " ".join(list_of_text_in_one_file)
This introduces a higher level of abstraction for the text extraction process, making the code more readable.
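To see what the low-level expression getroot()[0] does, it helps to run the extraction on a tiny file. The sketch below assumes each XML file has a root element whose first child holds one element per text. The article does not show the dataset's real schema, so the tag names author, documents, and document are illustrative assumptions, as is the demo helper.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumed layout; the tag names are illustrative, not taken from the dataset.
SAMPLE_XML = """<author lang="en">
  <documents>
    <document>first tweet</document>
    <document>second tweet</document>
  </documents>
</author>"""


def extract_texts_from_each_file(file_path: Path) -> str:
    # getroot() returns <author>; index 0 selects its first child, <documents>.
    first_child = ET.parse(file_path).getroot()[0]
    # Iterating over an element yields its direct children, here each <document>.
    return " ".join(element.text for element in first_child)


def demo() -> str:
    """Write the sample to a temporary file and extract its texts."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        sample_path = Path(tmp_dir) / "sample.xml"
        sample_path.write_text(SAMPLE_XML)
        return extract_texts_from_each_file(sample_path)


if __name__ == "__main__":
    print(demo())  # first tweet second tweet
```

Because these details live in one small function, a reader of extract_texts_from_multiple_files never has to think about them.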
Descriptive Names
A function’s name should be descriptive enough for users to understand its purpose without reading the code. It’s better to have longer, descriptive names than vague ones. For example, naming a function get_texts is not as clear as naming it extract_texts_from_multiple_files.
However, if a function’s name becomes too long, like retrieve_data_extract_text_and_save_data, it’s a sign that the function may be doing too much and should be split into smaller functions.
Have Fewer than Four Arguments
As the number of function arguments increases, it becomes harder to keep track of their order, purpose, and relationships. This makes it difficult for developers to understand and use the function.
def main(
    url: str,
    zip_path: str,
    raw_train_path: str,
    raw_test_path: str,
    processed_train_path: str,
    processed_test_path: str,
) -> None:
    get_raw_data(url, zip_path)
    t_train, t_test = get_train_test_docs(raw_train_path, raw_test_path)
    save_train_test_docs(processed_train_path, processed_test_path, t_train, t_test)
To improve the code readability, you can encapsulate multiple related arguments within a single data structure with a dataclass or a Pydantic model.
from pydantic import BaseModel


class RawLocation(BaseModel):
    url: str
    zip_path: str
    path_train: str
    path_test: str


class ProcessedLocation(BaseModel):
    path_train: str
    path_test: str


def main(raw_location: RawLocation, processed_location: ProcessedLocation) -> None:
    get_raw_data(raw_location)
    t_train, t_test = get_train_test_docs(raw_location)
    save_train_test_docs(processed_location, t_train, t_test)
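The same grouping also works with the standard library's dataclasses, the other option mentioned above. The following is a minimal sketch rather than the article's implementation: the class and field names mirror the Pydantic models, while the body of save_train_test_docs is a hypothetical illustration of how a helper might read paths from the grouped argument.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class RawLocation:
    url: str
    zip_path: str
    path_train: str
    path_test: str


@dataclass(frozen=True)
class ProcessedLocation:
    path_train: str
    path_test: str


def save_train_test_docs(
    processed_location: ProcessedLocation, t_train: str, t_test: str
) -> None:
    # Hypothetical body: the paths come from the grouped argument by name,
    # so callers cannot mix up their order.
    Path(processed_location.path_train).write_text(t_train)
    Path(processed_location.path_test).write_text(t_test)
```

Unlike a Pydantic model, a dataclass does not validate field types at runtime, so choose Pydantic when the values come from untrusted input such as a config file. With frozen=True, assigning to a field after construction raises an error, which keeps the locations from changing midway through the pipeline.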
How do I write a function like this?
You don’t need to remember all of these best practices when writing Python functions. An excellent indicator of a Python function’s quality is its testability. If a function can be easily tested, it indicates that the function is modular, performs a single task, and has no code duplication.
def save_data(processed_path: str, processed_data: str) -> None:
    with open(processed_path, "w") as f:
        f.write(processed_data)


def test_save_data(tmp_path):
    processed_path = tmp_path / "processed_data.txt"
    processed_data = "Sample processed data"
    save_data(processed_path, processed_data)
    assert processed_path.exists()
    assert processed_path.read_text() == processed_data
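The refactored extraction function can be tested just as easily. The following is a sketch of such a test, assuming pytest's built-in tmp_path fixture and the same illustrative XML layout (root element, one container child, one element per text), which is an assumption and not the dataset's documented schema.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_texts_from_each_file(file_path: Path) -> str:
    list_of_text_in_one_file = [r.text for r in ET.parse(file_path).getroot()[0]]
    return " ".join(list_of_text_in_one_file)


def test_extract_texts_from_each_file(tmp_path: Path) -> None:
    # Arrange: write a small XML file in the assumed layout.
    file_path = tmp_path / "sample.xml"
    file_path.write_text(
        "<author><documents>"
        "<document>hello</document><document>world</document>"
        "</documents></author>"
    )
    # Act and assert: the texts are joined with single spaces.
    assert extract_texts_from_each_file(file_path) == "hello world"
```

Writing this test against the original get_data would require a download from Google Drive and a zip archive on disk, which is a concrete sign that the original function did too much.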
Conclusion
Congratulations! You have just learned six best practices for writing readable and testable functions. When your code is simple to understand, your colleagues are more likely to reuse it for downstream tasks.
References
Martin, R. C. (2009). Clean code: A handbook of agile software craftsmanship. Upper Saddle River: Prentice Hall.