Table of Contents
- Motivation
- Introduction to Python-Magic
- Data Setup
- Accurate File Type Detection
- Practical Applications
- Conclusion
Python-Magic: Reliable File Type Detection Beyond Extensions
Motivation
File extensions can be misleading or missing entirely. Data processing workflows often receive files from various sources with incorrect extensions, renamed files, or files without extensions altogether.
The traditional approach trusts the file extension, which is easy to change and may be absent:
import os

# Traditional approach - unreliable
def get_file_type_by_extension(filename):
    _, ext = os.path.splitext(filename)
    return ext.lower()

# Examples of problematic files
files = ["document.txt", "data", "image.jpg.exe"]
for file in files:
    ext = get_file_type_by_extension(file)
    print(f"{file}: {ext if ext else 'No extension'}")
document.txt: .txt
data: No extension
image.jpg.exe: .exe
This approach fails when:
- Files lack extensions
- Extensions are incorrect or misleading
- Malicious files masquerade as safe file types
Introduction to Python-Magic
Python-magic provides reliable file type detection by analyzing file headers rather than relying on extensions. It interfaces with libmagic, the same library used by the Unix file command.
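To make "analyzing file headers" concrete, here is a minimal pure-Python sketch of signature matching. It is not how libmagic is implemented, since libmagic consults a much larger rule database. The `SIGNATURES` table and the `sniff_mime` and `sniff_file` helpers are hypothetical names used only for illustration, and the table covers just a handful of well-known formats.

```python
# Illustrative only: a tiny signature table, far smaller than libmagic's database.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"PK\x03\x04", "application/zip"),
]


def sniff_mime(data: bytes) -> str:
    """Guess a MIME type from the leading bytes of `data`."""
    for signature, mime in SIGNATURES:
        if data.startswith(signature):
            return mime
    # No known signature matched: fall back to the generic binary type
    return "application/octet-stream"


def sniff_file(path: str, num_bytes: int = 16) -> str:
    """Read only the first few bytes of a file and guess its MIME type."""
    with open(path, "rb") as f:
        return sniff_mime(f.read(num_bytes))


print(sniff_mime(b"%PDF-1.7 ..."))  # application/pdf
print(sniff_mime(b"hello"))         # application/octet-stream
```

Real detectors apply stricter and far more numerous rules than this toy table, so their answers can differ from it for the same bytes.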
Install python-magic and the required system library:
pip install python-magic
# Install libmagic system library
# macOS: brew install libmagic
# Ubuntu/Debian: sudo apt-get install libmagic1
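A quick way to confirm that both pieces are in place is to attempt the import. The sketch below assumes that a missing package or a missing libmagic library surfaces as an `ImportError`, which is how python-magic typically reports it; `libmagic_available` is a hypothetical helper name.

```python
def libmagic_available() -> bool:
    """Return True if python-magic and the libmagic library can both be loaded."""
    try:
        import magic  # noqa: F401  (imported only to test availability)
    except ImportError as exc:
        # Assumption: a missing libmagic also shows up as ImportError
        print(f"python-magic is not usable: {exc}")
        return False
    return True


if libmagic_available():
    print("python-magic is ready")
```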
Data Setup
To demonstrate python-magic’s capabilities, we’ll create test files with misleading extensions that highlight the limitations of extension-based file type detection:
import os
import magic

# Make sure the output directory exists before writing the test files
os.makedirs("data", exist_ok=True)

# Create standard test files
with open("data/sample.txt", "w") as f:
    f.write("This is a sample text file for demonstration purposes.")
with open("data/sample.py", "w") as f:
    f.write("import pandas as pd; df = pd.DataFrame({'a': [1, 2, 3]}); print(df)")

# Create test files with misleading extensions
with open("data/fake_image.txt", "wb") as f:
    # Write only the 8-byte PNG signature (no image data follows)
    f.write(b'\x89PNG\r\n\x1a\n')
with open("data/real_text.jpg", "w") as f:
    f.write("This is actually a text file")
These test files will demonstrate how python-magic detects actual file types regardless of their extensions.
Accurate File Type Detection
Python-magic examines file headers to determine actual file types:
# Detect actual file types
files = [
    "data/sample.txt",
    "data/sample.py",
    "data/fake_image.txt",
    "data/real_text.jpg",
]
for file in files:
    file_type = magic.from_file(file)
    mime_type = magic.from_file(file, mime=True)
    print(f"{file}:")
    print(f" Type: {file_type}")
    print(f" MIME: {mime_type}")
    print()
data/sample.txt:
 Type: ASCII text, with no line terminators
 MIME: text/plain

data/sample.py:
 Type: ASCII text, with no line terminators
 MIME: text/plain

data/fake_image.txt:
 Type: data
 MIME: application/octet-stream

data/real_text.jpg:
 Type: ASCII text, with no line terminators
 MIME: text/plain
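Python-magic looks past the extensions:
- fake_image.txt is reported as binary data (application/octet-stream), not text, despite the .txt extension. The file holds only the 8-byte PNG signature, which was not enough for libmagic to label it a PNG image in this run.
- real_text.jpg is reported as plain text despite the .jpg extension.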
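The comparison made by eye in the results above can be automated by checking the type implied by the file name against the type detected from content. This is a rough sketch: `extension_mismatch` is a hypothetical helper, the name-based guess comes from the standard library's `mimetypes.guess_type`, and the content-based detector is passed in as a function so the logic does not depend on how detection is done.

```python
import mimetypes
from typing import Callable, Optional


def extension_mismatch(path: str, detect: Callable[[str], str]) -> Optional[str]:
    """Return a warning if the extension-implied type differs from the detected type."""
    expected, _ = mimetypes.guess_type(path)  # based on the file name only
    if expected is None:
        return None  # unknown or missing extension: nothing to compare against
    actual = detect(path)  # based on file content
    if expected != actual:
        return f"{path}: extension implies {expected}, content looks like {actual}"
    return None


# With python-magic installed, the detector could be:
#   detect = lambda p: magic.from_file(p, mime=True)
#   for path in files:
#       warning = extension_mismatch(path, detect)
#       if warning:
#           print(warning)
```

Strict equality is a blunt test. For example, `mimetypes` typically maps .py to text/x-python while the output above reports text/plain for sample.py, so a real pipeline would probably want a table of acceptable pairs rather than exact matches.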
Practical Applications
With python-magic, we can build a validation function that verifies uploaded files match expected formats and rejects files with misleading extensions.
Before creating a validation function, let’s set up a sample CSV file for our validation example:
# Create sample CSV file for validation example
import pandas as pd

sample_data = pd.DataFrame({
    'ticket_id': [1, 2, 3, 4, 5],
    'customer_type': ['premium', 'basic', 'premium', 'basic', 'premium'],
    'issue_category': ['billing', 'technical', 'account', 'billing', 'technical'],
    'resolution_time': [24, 48, 12, 36, 18]
})
sample_data.to_csv('data/customer_support_eval.csv', index=False)
Next, create a file validation function for data processing pipelines:
def validate_uploaded_file(filepath, expected_types):
    """Validate file type matches expectations"""
    try:
        actual_mime = magic.from_file(filepath, mime=True)
        if actual_mime in expected_types:
            return True, f"Valid {actual_mime} file"
        else:
            return False, f"Expected {expected_types}, got {actual_mime}"
    except Exception as e:
        return False, f"Error reading file: {e}"

# Example usage for data analysis workflow
csv_file = "data/customer_support_eval.csv"
result, message = validate_uploaded_file(csv_file, ["text/csv", "text/plain"])
print(f"CSV validation: {message}")

# Check for potentially dangerous files
suspicious_file = "data/fake_image.txt"
result, message = validate_uploaded_file(suspicious_file, ["text/plain"])
print(f"Text validation: {message}")
CSV validation: Valid text/csv file
Text validation: Expected ['text/plain'], got application/octet-stream
The validation correctly accepts the CSV file while rejecting the fake image file despite its .txt extension.
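Uploads often arrive as in-memory streams rather than files on disk. The sketch below shows one way to apply the same check to a stream using python-magic's `magic.from_buffer`. The helper name `detect_stream_mime`, the injectable `detect` parameter, and the 2048-byte default are assumptions made for illustration; the injectable detector exists only so the helper can be exercised without libmagic installed.

```python
from typing import BinaryIO, Callable, Optional


def detect_stream_mime(
    stream: BinaryIO,
    detect: Optional[Callable[[bytes], str]] = None,
    sniff_bytes: int = 2048,
) -> str:
    """Detect the MIME type of a binary stream from its first bytes, then rewind."""
    if detect is None:
        import magic  # imported lazily so a stub detector can be used instead

        def detect(data: bytes) -> str:
            return magic.from_buffer(data, mime=True)

    start = stream.tell()
    head = stream.read(sniff_bytes)  # only the leading bytes are inspected
    stream.seek(start)               # leave the stream where the caller had it
    return detect(head)


# Usage with python-magic installed (file name taken from the example above):
#   with open("data/customer_support_eval.csv", "rb") as f:
#       print(detect_stream_mime(f))
```

Reading only a prefix keeps the check cheap for large uploads, at the cost of missing anything that is only recognizable further into the file.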
For robust logging in production file validation workflows, see our article Loguru: Simple as Print, Powerful as Logging.
Conclusion
Python-magic provides reliable file type detection by examining file content rather than trusting extensions. Checking content first helps catch disguised or mislabeled files before they reach downstream code, so data processing workflows can reject or reroute them early.
For managing file validation configurations across different environments, see our article Hydra for Python Configuration: Build Modular and Maintainable Pipelines.