Python-Magic: Reliable File Type Detection Beyond Extensions

July 23, 2025

Khuyen Tran

Motivation
Introduction to Python-Magic
Data Setup
Accurate File Type Detection
Practical Applications
Conclusion

Motivation

File extensions can be misleading or missing entirely. Data processing workflows often receive files from various sources with incorrect extensions, renamed files, or files without extensions altogether.

Traditional approaches rely on file extensions, which can be easily manipulated or missing:

import os

# Traditional approach - unreliable
def get_file_type_by_extension(filename):
    _, ext = os.path.splitext(filename)
    return ext.lower()

# Examples of problematic files
files = ["document.txt", "data", "image.jpg.exe"]
for file in files:
    ext = get_file_type_by_extension(file)
    print(f"{file}: {ext if ext else 'No extension'}")

document.txt: .txt
data: No extension
image.jpg.exe: .exe

This approach fails when:

Files lack extensions
Extensions are incorrect or misleading
Malicious files masquerade as safe file types

Introduction to Python-Magic

Python-magic provides reliable file type detection by analyzing file headers rather than relying on extensions. It interfaces with libmagic, the same library used by the Unix file command.
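
To make the idea of header-based detection concrete, here is a minimal sketch of matching a file's leading bytes against known signatures ("magic numbers"). The names `SIGNATURES` and `sniff_mime` are hypothetical and not part of python-magic; libmagic's real database is far larger and also uses offsets, masks, and nested tests.

```python
# Illustrative signature table (a tiny subset of well-known magic numbers).
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
]


def sniff_mime(header: bytes) -> str:
    """Return a MIME type guess based on the leading bytes of a file."""
    for signature, mime in SIGNATURES:
        if header.startswith(signature):
            return mime
    # Fall back to the generic binary type when nothing matches
    return "application/octet-stream"


print(sniff_mime(b"%PDF-1.7\n"))         # application/pdf
print(sniff_mime(b"\xff\xd8\xff\xe0"))   # image/jpeg
print(sniff_mime(b"hello"))              # application/octet-stream
```

A prefix table like this only illustrates the principle. In practice, python-magic delegates the matching to libmagic, which covers many more formats and edge cases.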

Install python-magic and the required system library:

pip install python-magic

# Install libmagic system library
# macOS: brew install libmagic
# Ubuntu/Debian: sudo apt-get install libmagic1

Data Setup

To demonstrate python-magic’s capabilities, we’ll create test files with misleading extensions that highlight the limitations of extension-based file type detection:

import os

import magic

# Make sure the output directory exists before writing test files
os.makedirs("data", exist_ok=True)

# Create standard test files
with open("data/sample.txt", "w") as f:
    f.write("This is a sample text file for demonstration purposes.")

with open("data/sample.py", "w") as f:
    f.write("import pandas as pd; df = pd.DataFrame({'a': [1, 2, 3]}); print(df)")

# Create test files with misleading extensions
with open("data/fake_image.txt", "wb") as f:
    # Write only the 8-byte PNG signature (no IHDR chunk or image data)
    f.write(b'\x89PNG\r\n\x1a\n')

with open("data/real_text.jpg", "w") as f:
    f.write("This is actually a text file")

These test files will demonstrate how python-magic detects actual file types regardless of their extensions.

Accurate File Type Detection

Python-magic examines file headers to determine actual file types:

# Detect actual file types
files = [
    "data/sample.txt",
    "data/sample.py",
    "data/fake_image.txt",
    "data/real_text.jpg",
]

for file in files:
    file_type = magic.from_file(file)
    mime_type = magic.from_file(file, mime=True)
    print(f"{file}:")
    print(f"  Type: {file_type}")
    print(f"  MIME: {mime_type}")
    print()

data/sample.txt:
  Type: ASCII text, with no line terminators
  MIME: text/plain

data/sample.py:
  Type: ASCII text, with no line terminators
  MIME: text/plain

data/fake_image.txt:
  Type: data
  MIME: application/octet-stream

data/real_text.jpg:
  Type: ASCII text, with no line terminators
  MIME: text/plain

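The output shows that python-magic ignores the extensions and reports what each file actually contains:

fake_image.txt as binary data (application/octet-stream) rather than text, despite the .txt extension. The file holds only the 8-byte PNG signature, which was not enough for libmagic to label it a PNG image in this run.
real_text.jpg as plain text despite the .jpg extension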
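
In the output above, the file containing only the 8-byte PNG signature was reported as generic data (application/octet-stream). A likely reason is that the installed libmagic expects more of the PNG structure than the bare signature. The sketch below uses only the standard library to assemble a complete 1x1 PNG that could serve as a more realistic test file. The helper names `png_chunk` and `minimal_png` are hypothetical, and how libmagic describes the result depends on the installed version.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Encode one PNG chunk: length, type, data, CRC-32 of type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def minimal_png() -> bytes:
    """Build a valid 1x1, 8-bit grayscale PNG containing one black pixel."""
    # IHDR fields: width, height, bit depth, color type,
    # compression method, filter method, interlace method
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    # One scanline: filter-type byte 0 followed by a single pixel value 0
    idat = zlib.compress(b"\x00\x00")
    return (
        PNG_SIGNATURE
        + png_chunk(b"IHDR", ihdr)
        + png_chunk(b"IDAT", idat)
        + png_chunk(b"IEND", b"")
    )


png_bytes = minimal_png()
print(f"Built a PNG of {len(png_bytes)} bytes")
```

Writing `png_bytes` to data/fake_image.txt in binary mode and rerunning the detection loop is one way to check how your libmagic build describes a complete PNG stored under a .txt name.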

Practical Applications

With python-magic, we can build a validation function that verifies uploaded files match expected formats and rejects files with misleading extensions.

Before creating a validation function, let’s set up a sample CSV file for our validation example:

# Create sample CSV file for validation example
import pandas as pd

sample_data = pd.DataFrame({
    'ticket_id': [1, 2, 3, 4, 5],
    'customer_type': ['premium', 'basic', 'premium', 'basic', 'premium'],
    'issue_category': ['billing', 'technical', 'account', 'billing', 'technical'],
    'resolution_time': [24, 48, 12, 36, 18]
})

sample_data.to_csv('data/customer_support_eval.csv', index=False)

Next, create a file validation function for data processing pipelines:

def validate_uploaded_file(filepath, expected_types):
    """Validate file type matches expectations"""
    try:
        actual_mime = magic.from_file(filepath, mime=True)

        if actual_mime in expected_types:
            return True, f"Valid {actual_mime} file"
        else:
            return False, f"Expected {expected_types}, got {actual_mime}"
    except Exception as e:
        return False, f"Error reading file: {e}"


# Example usage for data analysis workflow
csv_file = "data/customer_support_eval.csv"
result, message = validate_uploaded_file(csv_file, ["text/csv", "text/plain"])
print(f"CSV validation: {message}")

# Check for potentially dangerous files
suspicious_file = "data/fake_image.txt"
result, message = validate_uploaded_file(suspicious_file, ["text/plain"])
print(f"Text validation: {message}")

CSV validation: Valid text/csv file
Text validation: Expected ['text/plain'], got application/octet-stream

The validation correctly accepts the CSV file while rejecting the fake image file despite its .txt extension.
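
A complementary check is to compare the type an extension claims with the type detected from content. The sketch below uses the standard library's `mimetypes` module for the extension side and takes the detected MIME type as a plain argument, so it runs without libmagic. The function name `extension_matches_content` is hypothetical, and the detected values are copied from the detection output shown earlier.

```python
import mimetypes


def extension_matches_content(filename: str, detected_mime: str) -> bool:
    """Return True when the extension's implied MIME type equals the detected one.

    Files with no recognized extension return False, since there is
    no claimed type to confirm.
    """
    claimed_mime, _encoding = mimetypes.guess_type(filename)
    return claimed_mime is not None and claimed_mime == detected_mime


# detected_mime would normally come from magic.from_file(path, mime=True)
print(extension_matches_content("data/sample.txt", "text/plain"))      # True
print(extension_matches_content("data/real_text.jpg", "text/plain"))   # False
print(extension_matches_content("data/fake_image.txt", "application/octet-stream"))  # False
```

Strict equality is a conservative choice. It can flag harmless files when the two sources name a type differently, so a production pipeline may prefer a per-extension set of acceptable MIME types, as `validate_uploaded_file` does with `expected_types`.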

For robust logging in production file validation workflows, see our article Loguru: Simple as Print, Powerful as Logging.

Conclusion

Python-magic provides reliable file type detection by examining file content rather than trusting extensions. This reduces the risk of processing disguised or mislabeled files and helps data processing workflows handle each file according to what it actually contains.

For managing file validation configurations across different environments, see our article Hydra for Python Configuration: Build Modular and Maintainable Pipelines.
