File Operations in Python

This article explains file operations in Python. With safety, efficiency, and readability in mind, it walks through the topic step by step, from the basics to practical techniques.

File operations are a fundamental skill, needed everywhere from small scripts to large-scale applications.

Opening and closing files

First, let's look at examples of reading and writing text files. Using the with statement (context manager) ensures the file is properly closed.

The following code opens a text file, reads its contents, and processes them line by line.

# Open and read a text file safely using a context manager.
with open("example.txt", "r", encoding="utf-8") as f:
    # Iterate over the file lazily and process each line.
    for line in f:
        print(line.rstrip())
  • Explicitly specifying encoding="utf-8" helps reduce platform-dependent issues.
  • Even with large files, for line in f is memory efficient.

Writing (overwrite and append)

Pay attention to the mode when writing to files. w is for overwriting, a is for appending. Use the with statement when writing as well.

The following code shows basic examples of overwriting and appending.

# Write (overwrite) to a file.
with open("output.txt", "w", encoding="utf-8") as f:
    f.write("First line\n")
    f.write("Second line\n")

# Append to the same file.
with open("output.txt", "a", encoding="utf-8") as f:
    f.write("Appended line\n")
  • This code writes text to output.txt and then appends to the same file. "w" mode truncates and overwrites the file, while "a" mode adds new content to the end of the existing file.
  • If you need to flush, call flush(), but usually the content is flushed automatically when the context ends.
  • If multiple processes or threads may write at the same time, you need to consider exclusive control, such as file locking.

Reading and writing binary data

Images and compressed files are handled in binary mode (rb or wb). Unlike text mode, no encoding applies; in fact, passing the encoding argument in binary mode raises a ValueError.

The following code reads a binary file and copies it to another file in binary mode.

# Read and write binary files.
with open("image.png", "rb") as src:
    data = src.read()

with open("copy.png", "wb") as dst:
    dst.write(data)
  • When handling large binaries, avoid reading everything at once with read(); reading and writing in chunks is more memory efficient.

Example of handling large files in chunks

For huge files that don't fit into memory, read and write them in fixed-size chunks. Since the work is I/O-bound, tuning the chunk size can make a noticeable difference.

The following code copies a file in 64KB chunks. This operates quickly while keeping memory usage low.

# Copy a large file in chunks to avoid using too much memory.
CHUNK_SIZE = 64 * 1024  # 64 KB

with open("large_source.bin", "rb") as src, \
     open("large_dest.bin", "wb") as dst:
    while True:
        chunk = src.read(CHUNK_SIZE)
        if not chunk:
            break
        dst.write(chunk)
  • You can adjust the chunk size according to the characteristics of your network or disk. On SSDs, a slightly larger chunk size often works better.

Example using the buffering argument

By specifying the buffering argument in the open() function, you can control the buffer size used internally by Python. This allows you to further optimize input/output efficiency.

# Explicitly set the internal buffer size to 128 KB for faster I/O.
CHUNK_SIZE = 64 * 1024
BUFFER_SIZE = 128 * 1024  # 128 KB internal buffer

with open("large_source.bin", "rb", buffering=BUFFER_SIZE) as src, \
     open("large_dest.bin", "wb", buffering=BUFFER_SIZE) as dst:
    while True:
        chunk = src.read(CHUNK_SIZE)
        if not chunk:
            break
        dst.write(chunk)
  • Setting buffering to 0 disables buffering entirely (allowed only in binary mode); 1 enables line buffering (text mode only); larger values use a buffer of roughly that many bytes.
  • In general, the default value is sufficient because the OS efficiently handles caching, but adjusting this parameter can be effective for very large files or special devices.

Modern file operations using Pathlib

The standard pathlib module makes path handling more intuitive. It improves readability and safety compared to string paths.

from pathlib import Path

path = Path("data") / "notes.txt"

# Ensure parent directory exists.
path.parent.mkdir(parents=True, exist_ok=True)

# Write and read using pathlib.
path.write_text("Example content\n", encoding="utf-8")
content = path.read_text(encoding="utf-8")
print(content)
  • This code demonstrates creating a directory with pathlib, writing to a text file, and reading its contents. With a Path object, you can handle paths both intuitively and safely.
  • Path has convenient APIs such as iterdir() and glob(). You can write code without worrying about path separators between different operating systems.
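As a quick illustration of those APIs, the sketch below (directory and file names are made up for the example) lists directory entries with iterdir() and filters them by pattern with glob():

```python
from pathlib import Path

# Hypothetical directory and files, created just for this demonstration.
base = Path("demo_dir")
base.mkdir(exist_ok=True)
(base / "a.txt").write_text("A\n", encoding="utf-8")
(base / "b.log").write_text("B\n", encoding="utf-8")

# iterdir() yields every entry in the directory.
entries = sorted(p.name for p in base.iterdir())
print(entries)  # ['a.txt', 'b.log']

# glob() filters entries by pattern.
txt_files = sorted(p.name for p in base.glob("*.txt"))
print(txt_files)  # ['a.txt']
```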

Temporary files and directories (tempfile)

Temporary files can be created safely with tempfile. This avoids security race conditions and name collisions.

The following code shows an example of creating temporary data using a temporary file.

import tempfile

# Create a temporary file that is deleted on close.
with tempfile.NamedTemporaryFile(
    mode="w+",
    encoding="utf-8",
    delete=True
) as tmp:
    tmp.write("temporary data\n")
    tmp.seek(0)
    print(tmp.read())

# tmp is deleted here
  • This code creates a temporary file, writes and reads data through it, and, because delete=True is specified, removes the file automatically when the with block ends. tempfile.NamedTemporaryFile avoids name collisions and the race conditions of hand-rolled temporary paths.
  • On Windows, you may not be able to open the file from another handle immediately, so you can set delete=False and manage deletion yourself.
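Following that note, here is a minimal sketch of the delete=False pattern, where the file outlives the with block and is removed manually (the file content is just an example):

```python
import os
import tempfile

# With delete=False, the file survives the with block and is cleaned up manually.
with tempfile.NamedTemporaryFile(
    mode="w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write("kept until we remove it\n")
    tmp_name = tmp.name

# The file still exists here and can be reopened by name (also from other handles).
with open(tmp_name, "r", encoding="utf-8") as f:
    print(f.read())  # kept until we remove it

# Delete it ourselves when done.
os.remove(tmp_name)
```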

shutil: High-level operations for copying, moving, and deleting

Recursive copying, moving, and deleting of files and directories is easy with shutil.

import shutil

# Copy a file preserving metadata.
shutil.copy2("source.txt", "dest.txt")

# Move a file or directory.
shutil.move("old_location", "new_location")

# Remove an entire directory tree (use with caution).
shutil.rmtree("some_directory")
  • shutil.copy2 also copies metadata (such as the modification time). shutil.move falls back to copy-and-delete when a simple rename is not possible, for example when moving across file systems.
  • rmtree is a dangerous operation, so always confirm and back up before deleting.
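Putting the backup advice into practice, a minimal sketch might copy the tree aside with copytree() before calling rmtree() (the directory names here are hypothetical, created just for the demonstration):

```python
import shutil
from pathlib import Path

# Hypothetical directory, created just for this demonstration.
target = Path("some_directory")
target.mkdir(exist_ok=True)
(target / "keep.txt").write_text("important\n", encoding="utf-8")

# Copy the whole tree aside first, then remove the original.
backup = Path("some_directory_backup")
if backup.exists():
    shutil.rmtree(backup)
shutil.copytree(target, backup)
shutil.rmtree(target)

print(backup.exists(), target.exists())  # True False
```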

File metadata (os.stat) and permission handling

File size, modification time, and permissions can be read and modified with os and stat.

The following code obtains basic file information with os.stat and changes permissions with os.chmod.

import os
import stat
from datetime import datetime

st = os.stat("example.txt")
print("size:", st.st_size)
print("modified:", datetime.fromtimestamp(st.st_mtime))

# Make file read-only for owner.
os.chmod("example.txt", stat.S_IREAD)
  • Permissions behave differently on POSIX and Windows systems. If cross-platform compatibility is important, use high-level APIs or add conditional handling.
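For inspecting mode bits, stat.filemode() renders them as the familiar ls-style string; the short sketch below (using a throwaway demo file) shows it together with a direct bit test:

```python
import os
import stat

# Hypothetical file, created just for this demonstration.
with open("perm_demo.txt", "w", encoding="utf-8") as f:
    f.write("demo\n")

st = os.stat("perm_demo.txt")

# filemode() renders the mode bits as an ls-style string such as '-rw-r--r--'.
print(stat.filemode(st.st_mode))

# Individual bits can be tested with the stat constants.
print(bool(st.st_mode & stat.S_IRUSR))  # True (owner-readable)
```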

File locking (exclusive control) — Differences between Unix and Windows

Exclusive control is needed when multiple processes access the same file in parallel. UNIX uses fcntl, and Windows uses msvcrt.

The following code uses fcntl.flock on UNIX systems to acquire an exclusive lock when writing.

# Unix-style file locking example
import fcntl

with open("output.txt", "a+", encoding="utf-8") as f:
    fcntl.flock(f, fcntl.LOCK_EX)
    try:
        f.write("Safe write\n")
        f.flush()
    finally:
        fcntl.flock(f, fcntl.LOCK_UN)
  • This code acquires an exclusive lock using fcntl.flock on UNIX-like systems to safely write and prevent simultaneous writes to the file. Always release the lock after processing to allow access by other processes.
  • On Windows, use msvcrt.locking(). For higher-level usage, consider external libraries such as portalocker or filelock.
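A rough cross-platform sketch of that idea, with hypothetical helper names lock_file/unlock_file, might dispatch on the platform (the msvcrt branch is shown only as an outline; locking a single byte is a simplification):

```python
import sys

def lock_file(f):
    # Acquire an exclusive lock on an open file object.
    if sys.platform == "win32":
        import msvcrt
        # Lock one byte at the current file position (Windows).
        msvcrt.locking(f.fileno(), msvcrt.LK_LOCK, 1)
    else:
        import fcntl
        fcntl.flock(f, fcntl.LOCK_EX)

def unlock_file(f):
    # Release the lock taken by lock_file().
    if sys.platform == "win32":
        import msvcrt
        msvcrt.locking(f.fileno(), msvcrt.LK_UNLCK, 1)
    else:
        import fcntl
        fcntl.flock(f, fcntl.LOCK_UN)

with open("shared.txt", "a+", encoding="utf-8") as f:
    lock_file(f)
    try:
        f.write("Cross-platform safe write\n")
        f.flush()
    finally:
        unlock_file(f)
```

For production code, an external library such as portalocker or filelock handles these platform differences more robustly.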

Atomic file write patterns

To prevent file corruption during updates, write to a temporary file and, on success, swap it into place with os.replace. If a crash occurs mid-write, the original file is left intact.

import os
from pathlib import Path
import tempfile

def atomic_write(path: Path, data: str, encoding="utf-8"):
    # Write to a temp file in the same directory and atomically replace.
    dirpath = path.parent
    with tempfile.NamedTemporaryFile(
        mode="w",
        encoding=encoding,
        dir=str(dirpath),
        delete=False
    ) as tmp:
        tmp.write(data)
        tmp.flush()
        os.fsync(tmp.fileno())  # ensure the data reaches disk before replacing
        tempname = tmp.name

    # Atomic replacement (within the same file system)
    os.replace(tempname, str(path))

# Usage
atomic_write(Path("config.json"), '{"key":"value"}\n')

os.replace performs an atomic replacement within the same file system. Note that atomicity is not guaranteed across different mounts.

Fast access using mmap (for large-scale data)

For random access to large files, mmap can improve I/O performance. It operates on raw bytes, so it is used with files opened in binary mode.

The following code memory-maps a file and reads or writes specific byte ranges. Be careful when changing the file size.

import mmap

with open("data.bin", "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)  # map entire file
    try:
        print(mm[:20])  # first 20 bytes
        mm[0:4] = b"\x00\x01\x02\x03"  # modify bytes
    finally:
        mm.close()
  • This code maps a binary file to memory with mmap and performs direct byte-level read/write operations. Memory mapping enables fast random access to large datasets.
  • mmap is efficient, but incorrect use may lead to data consistency issues. Call flush() to synchronize as needed.
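One common use of mmap is fast searching: find() scans the mapped bytes without slicing the file manually. The sketch below builds a small demo file (names and contents are made up) with a marker at a known offset:

```python
import mmap

# Hypothetical file with a marker at a known offset, built for this demonstration.
with open("search_demo.bin", "wb") as f:
    f.write(b"header" + b"\x00" * 100 + b"MAGIC" + b"\x00" * 50)

with open("search_demo.bin", "rb") as f:
    # mmap objects also support the context manager protocol.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # find() scans the mapped bytes directly.
        pos = mm.find(b"MAGIC")
        print(pos)  # 106 (6 header bytes + 100 zero bytes)
```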

CSV / JSON / Pickle: Reading and writing by format

Specific data formats have dedicated modules. Use csv for CSV, json for JSON, and pickle for saving Python objects.

The following code gives basic examples of reading and writing CSV and JSON, and using Pickle. Since Pickle can execute arbitrary code, avoid loading data from untrusted sources.

import csv
import json
import pickle

# CSV write
with open("rows.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "age"])
    writer.writerow(["Alice", 30])

# JSON write
data = {"items": [1, 2, 3]}
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

# Pickle write (only for trusted environments)
obj = {"key": "value"}
with open("obj.pkl", "wb") as f:
    pickle.dump(obj, f)
  • Specifying newline="" for CSV is recommended to prevent extra blank lines on Windows. With ensure_ascii=False, JSON keeps UTF-8 characters readable.
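Reading the data back is just as simple; the sketch below (re-creating the same example files so it stands alone) uses csv.DictReader to get rows as dictionaries and json.load to restore Python objects:

```python
import csv
import json

# Re-create the example files from above, then read them back.
with open("rows.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "age"])
    writer.writerow(["Alice", 30])

with open("rows.csv", "r", encoding="utf-8", newline="") as f:
    rows = list(csv.DictReader(f))  # each row becomes a dict keyed by the header
print(rows)  # [{'name': 'Alice', 'age': '30'}]  (CSV values come back as strings)

with open("data.json", "w", encoding="utf-8") as f:
    json.dump({"items": [1, 2, 3]}, f, ensure_ascii=False, indent=2)

with open("data.json", "r", encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded["items"])  # [1, 2, 3]
```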

Direct reading and writing of compressed files (gzip / bz2 / zipfile)

Directly handling gzip and zip can save disk space. The standard library includes corresponding modules.

The following code is a simple example of reading and writing gzip files as text.

import gzip

# Write gzipped text.
with gzip.open("text.gz", "wt", encoding="utf-8") as f:
    f.write("Compressed content\n")

# Read gzipped text.
with gzip.open("text.gz", "rt", encoding="utf-8") as f:
    print(f.read())
  • There is a trade-off between compression ratio and speed depending on the compression level and format.
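To see that trade-off concretely, gzip.compress() accepts a compresslevel argument from 1 (fastest) to 9 (densest); the sketch below compares the two extremes on a deliberately repetitive payload:

```python
import gzip

# A deliberately repetitive payload so compression has something to work with.
payload = ("repetitive line of text\n" * 1000).encode("utf-8")

fast = gzip.compress(payload, compresslevel=1)   # fastest, lighter compression
dense = gzip.compress(payload, compresslevel=9)  # slowest, densest compression

print(len(payload), len(fast), len(dense))

# Both round-trip back to the original data.
assert gzip.decompress(fast) == payload
assert gzip.decompress(dense) == payload
```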

Security and vulnerability measures

Keep the following points in mind for security and vulnerability mitigation.

  • Do not use untrusted input directly in file names or paths.
  • Use Pickle only with trusted sources.
  • Minimize execution privileges, and give only the minimum permissions to processes handling files.
  • Create temporary files with tempfile rather than using predictable names in shared or world-writable directories.

If using user input in file paths, normalization and validation are required. For example, use Path.resolve() and check parent directories.

from pathlib import Path

def safe_path(base_dir: Path, user_path: str) -> Path:
    base = base_dir.resolve()
    candidate = (base_dir / user_path).resolve()
    if base not in candidate.parents and base != candidate:
        raise ValueError("Invalid path")
    return candidate

# Usage
base = Path("/srv/app/data")

# Raises an exception: attempted path traversal outside `base`
safe = safe_path(base, '../etc/passwd')
  • Be especially careful when using external input as file paths in web apps or public APIs.
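On Python 3.9 and later, the same check can be written more directly with Path.is_relative_to(); the sketch below defines a hypothetical variant safe_path_39:

```python
from pathlib import Path

def safe_path_39(base_dir: Path, user_path: str) -> Path:
    # Python 3.9+: is_relative_to() replaces the manual parents check.
    candidate = (base_dir / user_path).resolve()
    if not candidate.is_relative_to(base_dir.resolve()):
        raise ValueError("Invalid path")
    return candidate

base = Path("/srv/app/data")
print(safe_path_39(base, "reports/2024.txt"))
# safe_path_39(base, "../etc/passwd") would raise ValueError
```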

Summary of common patterns

  • Always use the with statement (automatic close).
  • Explicitly specify encoding for text data.
  • Read and write large files in chunks.
  • Introduce file locking for shared resources.
  • For critical updates, use the atomic pattern of 'write to a temporary file → os.replace'.
  • Always confirm and create backups before performing dangerous operations (such as deletion or overwriting).
  • Normalize and validate when using external input as file paths.

Summary

For file operations, it is important to use safe and reliable techniques such as the with statement, explicit encoding, and atomic writes. For large-scale processing or parallel access, introduce file locking and plan for conflicts to prevent data corruption. Balancing efficiency and safety is the key to reliable file operations.

You can follow along with this article using Visual Studio Code on our YouTube channel — please check it out.
