Python-Magic: Reliable File Type Detection Beyond Extensions

Table of Contents

Motivation
Introduction to Python-Magic
Data Setup
Accurate File Type Detection
Practical Applications
Conclusion

Motivation
File extensions can be misleading or missing entirely. Data processing workflows often receive files from various sources with incorrect extensions, renamed files, or files without extensions altogether.
Traditional approaches rely on file extensions, which can be easily manipulated or missing:
import os

# Traditional approach: unreliable
def get_file_type_by_extension(filename):
    _, ext = os.path.splitext(filename)
    return ext.lower()

# Examples of problematic files
files = ["document.txt", "data", "image.jpg.exe"]
for file in files:
    ext = get_file_type_by_extension(file)
    print(f"{file}: {ext if ext else 'No extension'}")

document.txt: .txt
data: No extension
image.jpg.exe: .exe

This approach fails when:

Files lack extensions
Extensions are incorrect or misleading
Malicious files masquerade as safe file types

Introduction to Python-Magic
Python-magic provides reliable file type detection by analyzing file headers rather than relying on extensions. It interfaces with libmagic, the same library used by the Unix file command.
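To make "analyzing file headers" concrete, here is a minimal standard-library sketch of the idea: read the first bytes of a file and compare them against known signatures. The SIGNATURES table and the sniff_signature helper are hypothetical names used only for illustration. libmagic's real database is far larger and inspects more than fixed prefixes, so treat this as a picture of the concept rather than a description of its implementation.

```python
import os
import tempfile

# Hypothetical, deliberately tiny signature table (illustration only).
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP archive",
    b"\xff\xd8\xff": "JPEG image",
}

def sniff_signature(path, max_bytes=16):
    """Return a label for the first matching signature, or None."""
    with open(path, "rb") as f:
        head = f.read(max_bytes)
    for signature, label in SIGNATURES.items():
        if head.startswith(signature):
            return label
    return None

# Usage: a file named .txt whose content starts with the PDF marker
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "report.txt")
    with open(path, "wb") as f:
        f.write(b"%PDF-1.7\n")
    detected = sniff_signature(path)
    print(detected)
```

The extension never enters the decision, which is the property python-magic relies on as well.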
Install python-magic and the required system library:
pip install python-magic

# Install libmagic system library
# macOS: brew install libmagic
# Ubuntu/Debian: sudo apt-get install libmagic1

Data Setup
To demonstrate python-magic’s capabilities, we’ll create test files with misleading extensions that highlight the limitations of extension-based file type detection:
import os

import magic

# Make sure the output directory exists before writing test files
os.makedirs("data", exist_ok=True)

# Create standard test files
with open("data/sample.txt", "w") as f:
    f.write("This is a sample text file for demonstration purposes.")

with open("data/sample.py", "w") as f:
    f.write("import pandas as pd; df = pd.DataFrame({'a': [1, 2, 3]}); print(df)")

# Create test files with misleading extensions
with open("data/fake_image.txt", "wb") as f:
    # Write only the 8-byte PNG signature (not a complete image)
    f.write(b'\x89PNG\r\n\x1a\n')

with open("data/real_text.jpg", "w") as f:
    f.write("This is actually a text file")

These test files will demonstrate how python-magic detects actual file types regardless of their extensions.
Accurate File Type Detection
Python-magic examines file headers to determine actual file types:
# Detect actual file types
files = [
    "data/sample.txt",
    "data/sample.py",
    "data/fake_image.txt",
    "data/real_text.jpg",
]

for file in files:
    file_type = magic.from_file(file)
    mime_type = magic.from_file(file, mime=True)
    print(f"{file}:")
    print(f" Type: {file_type}")
    print(f" MIME: {mime_type}")
    print()

data/sample.txt:
Type: ASCII text, with no line terminators
MIME: text/plain

data/sample.py:
Type: ASCII text, with no line terminators
MIME: text/plain

data/fake_image.txt:
Type: data
MIME: application/octet-stream

data/real_text.jpg:
Type: ASCII text, with no line terminators
MIME: text/plain

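In this output, python-magic looks past the extensions:

fake_image.txt is reported as binary data (application/octet-stream) rather than text, despite the .txt extension. The file holds only the 8-byte PNG signature, not a complete image, which is most likely why the libmagic version used here does not label it as a PNG.
real_text.jpg is identified as plain text despite the .jpg extension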
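The output above reports fake_image.txt as generic data because the test file stops after the PNG signature. As a hedged sketch, the code below builds a complete 1x1 grayscale PNG with the standard library so there is a real image to detect. The helper names png_chunk and minimal_png are hypothetical, and the detection step is skipped when python-magic cannot be imported. We would expect a libmagic build with the standard image rules to report image/png for these bytes, but the exact description depends on the installed version.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_chunk(chunk_type, data):
    """Build one PNG chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)

def minimal_png():
    """Return the bytes of a valid 1x1 8-bit grayscale PNG."""
    # IHDR: width, height, bit depth, color type, compression, filter, interlace
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    # One scanline: filter byte 0 followed by a single black pixel
    idat = zlib.compress(b"\x00\x00")
    return (
        PNG_SIGNATURE
        + png_chunk(b"IHDR", ihdr)
        + png_chunk(b"IDAT", idat)
        + png_chunk(b"IEND", b"")
    )

png_bytes = minimal_png()

try:
    import magic  # requires python-magic and the libmagic system library
    print(magic.from_buffer(png_bytes, mime=True))
except ImportError:
    print("python-magic is not available; skipping detection")
```

magic.from_buffer is the in-memory counterpart of magic.from_file, which is convenient when the bytes come from an upload rather than from disk.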


Practical Applications
With python-magic, we can build a validation function that verifies uploaded files match expected formats and rejects files with misleading extensions.
Before creating a validation function, let’s set up a sample CSV file for our validation example:
# Create sample CSV file for validation example
import pandas as pd

sample_data = pd.DataFrame({
'ticket_id': [1, 2, 3, 4, 5],
'customer_type': ['premium', 'basic', 'premium', 'basic', 'premium'],
'issue_category': ['billing', 'technical', 'account', 'billing', 'technical'],
'resolution_time': [24, 48, 12, 36, 18]
})

sample_data.to_csv('data/customer_support_eval.csv', index=False)

Next, create a file validation function for data processing pipelines:
def validate_uploaded_file(filepath, expected_types):
    """Validate file type matches expectations"""
    try:
        actual_mime = magic.from_file(filepath, mime=True)

        if actual_mime in expected_types:
            return True, f"Valid {actual_mime} file"
        else:
            return False, f"Expected {expected_types}, got {actual_mime}"
    except Exception as e:
        return False, f"Error reading file: {e}"

# Example usage for data analysis workflow
csv_file = "data/customer_support_eval.csv"
result, message = validate_uploaded_file(csv_file, ["text/csv", "text/plain"])
print(f"CSV validation: {message}")

# Check for potentially dangerous files
suspicious_file = "data/fake_image.txt"
result, message = validate_uploaded_file(suspicious_file, ["text/plain"])
print(f"Text validation: {message}")

CSV validation: Valid text/csv file
Text validation: Expected ['text/plain'], got application/octet-stream

The validation correctly accepts the CSV file while rejecting the fake image file despite its .txt extension.
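The validation function compares the detected MIME type with an allow-list but never looks at the filename. If we also want to flag files whose extension disagrees with their content, one possible sketch uses the standard library's mimetypes module. The helper extension_matches_content is a hypothetical name, and detected_mime is assumed to come from magic.from_file(path, mime=True); it is passed in as a plain string here so the example runs without libmagic.

```python
import mimetypes

def extension_matches_content(filename, detected_mime):
    """Return True if the extension's usual MIME type equals detected_mime.

    Files with no extension, or an unknown one, return False so the
    caller can treat them as needing review.
    """
    guessed_mime, _ = mimetypes.guess_type(filename)
    return guessed_mime is not None and guessed_mime == detected_mime

# detected_mime would normally come from magic.from_file(path, mime=True)
print(extension_matches_content("real_text.jpg", "text/plain"))  # False
print(extension_matches_content("notes.txt", "text/plain"))      # True
print(extension_matches_content("data", "text/plain"))           # False
```

Strict equality is a deliberate simplification. Some libmagic versions report a CSV file as text/plain, so a real pipeline may need a small table of acceptable pairs instead of an exact match.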

For robust logging in production file validation workflows, see our Loguru: Simple as Print, Powerful as Logging.

Conclusion
Python-magic provides reliable file type detection by examining file content rather than trusting extensions. This approach reduces the risk of disguised files slipping through and helps data processing workflows handle each file according to what it actually contains.

For managing file validation configurations across different environments, see our Hydra for Python Configuration: Build Modular and Maintainable Pipelines.
