
Archiving and Compression

Archiving and compression are two different operations that are often combined. Archiving bundles multiple files and directories into a single file, preserving directory structure, permissions, timestamps, and ownership - but without reducing size. Compression reduces file size by encoding redundant data more efficiently, but operates on a single file. On Linux, these are usually separate steps (unlike zip, which does both at once): tar creates the archive, then a compression tool like gzip or xz shrinks it. The tar command can invoke compression tools in a single command for convenience.
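As a minimal sketch of the two approaches described above, the following creates the same gzip-compressed archive in two separate steps and then in one step. The scratch directory and the file names (demo, two-step.tar, one-step.tar.gz) are made up for illustration.

```shell
set -e
workdir=$(mktemp -d)                      # scratch directory for the demo
mkdir "$workdir/demo"
printf 'hello\n' > "$workdir/demo/a.txt"
printf 'world\n' > "$workdir/demo/b.txt"

# Step 1: archive only. This bundles the files but does not shrink them.
tar -cf "$workdir/two-step.tar" -C "$workdir" demo
# Step 2: compress the single archive file (gzip replaces it with two-step.tar.gz).
gzip "$workdir/two-step.tar"

# One step: tar invokes gzip itself.
tar -czf "$workdir/one-step.tar.gz" -C "$workdir" demo

# Both archives contain the same members.
two_step_list=$(tar -tzf "$workdir/two-step.tar.gz")
one_step_list=$(tar -tzf "$workdir/one-step.tar.gz")
printf '%s\n' "$two_step_list"
rm -rf "$workdir"
```

The -C option is used here only so the archive stores relative paths under demo/ rather than the temporary directory's full path.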


tar

tar (tape archive) creates, extracts, and lists archives.

Creating Archives

tar -cf archive.tar file1.txt file2.txt     # create archive
tar -cf archive.tar directory/              # archive a directory
tar -cf archive.tar *.log                   # archive matching files

Creating Compressed Archives

Add a compression flag:

tar -czf archive.tar.gz directory/          # gzip compression
tar -cjf archive.tar.bz2 directory/         # bzip2 compression
tar -cJf archive.tar.xz directory/          # xz compression

| Flag | Compression | Extension | Speed | Ratio |
| --- | --- | --- | --- | --- |
| -z | gzip | .tar.gz or .tgz | Fast | Good |
| -j | bzip2 | .tar.bz2 | Slow | Better |
| -J | xz | .tar.xz | Slowest | Best |

Extracting

tar -xf archive.tar                         # extract (auto-detects compression)
tar -xf archive.tar.gz                      # works the same
tar -xf archive.tar.xz                      # works the same
tar -xf archive.tar -C /target/directory/   # extract to specific directory

Modern tar auto-detects the compression format, so you don't need -z, -j, or -J when extracting.

Modern tar auto-detects compression when extracting

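You don't need to remember whether an archive uses gzip, bzip2, or xz. Just run tar -xf on the file (for example, tar -xf archive.tar.xz) and tar reads the magic bytes to determine the compression format automatically. The one exception is reading an archive from a pipe or standard input with GNU tar, where you still have to name the compressor. Otherwise the -z, -j, -J flags are only needed when creating archives, because tar needs to know which compressor to invoke.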
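A small sketch of the auto-detection, assuming GNU tar or bsdtar: the same tar -xf command extracts both a plain archive and a gzip-compressed one. The directory and file names are placeholders.

```shell
set -e
workdir=$(mktemp -d)
mkdir "$workdir/src"
printf 'sample\n' > "$workdir/src/note.txt"

tar -cf  "$workdir/plain.tar"   -C "$workdir" src    # uncompressed
tar -czf "$workdir/gzip.tar.gz" -C "$workdir" src    # gzip-compressed

# Identical extraction command for both; no -z is passed for the gzip archive.
for archive in plain.tar gzip.tar.gz; do
    dest="$workdir/out-$archive"
    mkdir "$dest"
    tar -xf "$workdir/$archive" -C "$dest"
done

plain_content=$(cat "$workdir/out-plain.tar/src/note.txt")
gzip_content=$(cat "$workdir/out-gzip.tar.gz/src/note.txt")
echo "plain: $plain_content, gzip: $gzip_content"
rm -rf "$workdir"
```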

Listing Contents

tar -tf archive.tar.gz                      # list files without extracting
tar -tvf archive.tar.gz                     # verbose listing (like ls -l)

Common Options

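| Option | Meaning |
| --- | --- |
| -c | Create archive |
| -x | Extract archive |
| -t | List contents |
| -f | Specify archive filename. -f takes the next argument as the filename, so put it last when combining flags (-czf, not -cfz) |
| -v | Verbose output |
| -C | Change to the given directory before performing the operation (when extracting, this is the target directory) |
| --exclude | Exclude files matching pattern |
| -p | Preserve permissions when extracting |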

Practical Examples

# Archive excluding certain patterns
tar -czf backup.tar.gz --exclude='*.log' --exclude='.git' project/

# Extract a single file from an archive
tar -xf archive.tar.gz path/to/specific/file.txt

# Create an archive with a date in the filename
tar -czf "backup_$(date +%Y%m%d).tar.gz" /var/www

# Append files to an existing (uncompressed) archive
tar -rf archive.tar newfile.txt

tar -rf only works on uncompressed archives

The -r (append) flag only works on plain .tar files, not on compressed archives (.tar.gz, .tar.xz), because tar cannot append to a compressed stream. Attempting tar -rf archive.tar.gz newfile does not add the file; GNU tar reports an error such as "Cannot update compressed archives" when a compression flag is given. To add files to a compressed archive, decompress it first, append, then recompress.
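The decompress, append, recompress sequence can be sketched as follows. The file names are hypothetical, and the example assumes a gzip-compressed archive.

```shell
set -e
workdir=$(mktemp -d)
printf 'first\n'  > "$workdir/first.txt"
printf 'second\n' > "$workdir/second.txt"
tar -czf "$workdir/archive.tar.gz" -C "$workdir" first.txt

gunzip "$workdir/archive.tar.gz"                          # leaves archive.tar
tar -rf "$workdir/archive.tar" -C "$workdir" second.txt   # append to the plain archive
gzip "$workdir/archive.tar"                               # back to archive.tar.gz

members=$(tar -tzf "$workdir/archive.tar.gz")
printf '%s\n' "$members"
rm -rf "$workdir"
```

For large archives this rewrites the whole file twice, so it is often simpler to recreate the archive from the source files.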

# Compare archive against filesystem
tar -df archive.tar
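The exclusion and dated-filename examples above can be combined into a small backup routine that also checks the result. This is a sketch only: the project layout is invented for the demo, and listing the archive with tar -t is used as a basic readability check, not a full integrity guarantee.

```shell
set -e
workdir=$(mktemp -d)
mkdir -p "$workdir/project/.git"
printf 'code\n'  > "$workdir/project/main.sh"
printf 'noise\n' > "$workdir/project/debug.log"
printf 'ref\n'   > "$workdir/project/.git/HEAD"

backup="$workdir/backup_$(date +%Y%m%d).tar.gz"
tar -czf "$backup" --exclude='*.log' --exclude='.git' -C "$workdir" project

# Listing fails with a non-zero exit status if the archive cannot be read.
backup_members=$(tar -tzf "$backup")
printf '%s\n' "$backup_members"
rm -rf "$workdir"
```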

gzip / gunzip

gzip compresses individual files. It replaces the original file with a .gz version.

gzip replaces the original file by default

Running gzip file.txt deletes the original and creates file.txt.gz. This catches many people off guard. Use gzip -k (keep) to preserve the original file, or gzip -c file.txt > file.txt.gz to write to stdout without modifying the original.

gzip file.txt                   # creates file.txt.gz, removes file.txt
gzip -k file.txt                # keep the original file
gzip -9 file.txt                # maximum compression (slower)
gzip -1 file.txt                # fastest compression (less compression)
gzip -d file.txt.gz             # decompress (same as gunzip)
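A quick way to confirm that the stdout form leaves the original in place, and that the compressed copy round-trips, is sketched below. The file name is a placeholder; gzip -c is used instead of -k because it also works with older gzip releases.

```shell
set -e
workdir=$(mktemp -d)
printf 'line one\nline two\n' > "$workdir/file.txt"

gzip -c "$workdir/file.txt" > "$workdir/file.txt.gz"   # original is not touched

# Decompress to stdout and compare byte-for-byte with the original.
if gzip -dc "$workdir/file.txt.gz" | cmp -s - "$workdir/file.txt"; then
    roundtrip=ok
else
    roundtrip=mismatch
fi
if [ -f "$workdir/file.txt" ]; then original=kept; else original=removed; fi
echo "roundtrip=$roundtrip original=$original"
rm -rf "$workdir"
```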

Default gzip level (-6) is usually optimal

Compression levels -1 through -9 trade speed for size, but the returns diminish sharply above -6. Going from -6 to -9 typically saves only a few percent more space while taking noticeably longer. Use -1 when speed matters (pipelines, real-time compression, slow machines) and -9 only when compressing once for many downloads (software releases).

Lower levels use less CPU time but produce larger files; higher levels spend more CPU searching for better ways to encode the data. Even across the whole range, the difference in file size between -1 and -9 is often modest (roughly 5-15% on typical files), which is why the default level (-6 for gzip) is usually the right choice.
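Since the actual savings depend on the data, it can be worth measuring the levels on a representative file. The sketch below uses generated, log-like sample data (an assumption for the demo; substitute a real file) and a hypothetical helper named compressed_size.

```shell
set -e
sample=$(mktemp)
# Made-up sample data: 200000 repetitive text lines.
seq 1 200000 | sed 's/^/request id=/' > "$sample"

# Print the gzip-compressed size in bytes of file $2 at level $1.
compressed_size() {
    gzip -c "-$1" "$2" | wc -c | tr -d ' '
}

original_size=$(wc -c < "$sample" | tr -d ' ')
size_fast=$(compressed_size 1 "$sample")
size_default=$(compressed_size 6 "$sample")
size_best=$(compressed_size 9 "$sample")

echo "original: $original_size bytes"
echo "gzip -1:  $size_fast bytes"
echo "gzip -6:  $size_default bytes"
echo "gzip -9:  $size_best bytes"
rm -f "$sample"
```

Prefixing each measurement with the shell's time keyword gives the matching speed figures.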

gunzip decompresses:

gunzip file.txt.gz              # creates file.txt, removes file.txt.gz

zcat prints the decompressed contents to standard output without writing a decompressed file to disk:

zcat file.txt.gz                # print contents to STDOUT
zcat file.txt.gz | grep "error" # search compressed file

Also available: zless, zgrep for working with gzip files directly.
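As an illustration, the following searches a compressed log without ever creating an uncompressed copy on disk. The log contents are invented, and gzip -dc is used as a portable equivalent of zcat.

```shell
set -e
workdir=$(mktemp -d)
printf 'info start\nerror disk full\ninfo done\nerror timeout\n' > "$workdir/app.log"
gzip "$workdir/app.log"                       # only app.log.gz remains

# Count matching lines directly in the compressed file.
error_count=$(zgrep -c 'error' "$workdir/app.log.gz")
# Same idea as a pipeline: decompress to stdout, then filter.
first_error=$(gzip -dc "$workdir/app.log.gz" | grep 'error' | head -n 1)

echo "errors: $error_count, first: $first_error"
rm -rf "$workdir"
```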


bzip2 / bunzip2

bzip2 compresses with a better ratio than gzip but is slower.

bzip2 file.txt                  # creates file.txt.bz2
bzip2 -k file.txt               # keep original
bzip2 -d file.txt.bz2           # decompress (same as bunzip2)

bunzip2 decompresses:

bunzip2 file.txt.bz2

bzcat reads compressed files:

bzcat file.txt.bz2

xz / unxz

xz provides the best compression ratio of the three but is the slowest.

xz file.txt                    # creates file.txt.xz
xz -k file.txt                 # keep original
xz -9 file.txt                 # maximum compression
xz -d file.txt.xz              # decompress (same as unxz)
xz -T 0 file.txt               # use all CPU cores (much faster)

Use xz -T 0 to use all CPU cores

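Older xz releases compress on a single CPU core by default, which makes them painfully slow on large files (releases from 5.6 onward default to multithreaded mode). The -T 0 flag enables multithreaded compression using all available cores, dramatically reducing compression time at the cost of more memory and a slightly larger output. For a 1GB file on an 8-core machine, compression can finish roughly 5-7x faster.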
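A sketch of multithreaded compression followed by an integrity check. It assumes xz 5.2 or later (the first series with -T support) and skips itself if xz is not installed; the sample data is generated for the demo.

```shell
set -e
if command -v xz >/dev/null 2>&1; then
    workdir=$(mktemp -d)
    seq 1 500000 > "$workdir/data.txt"

    xz -k -T 0 "$workdir/data.txt"     # all cores, keep data.txt
    xz -t "$workdir/data.txt.xz"       # integrity test, non-zero exit on damage

    # Confirm the compressed copy decompresses to the original bytes.
    if xz -dc "$workdir/data.txt.xz" | cmp -s - "$workdir/data.txt"; then
        xz_roundtrip=ok
    else
        xz_roundtrip=mismatch
    fi
    rm -rf "$workdir"
else
    xz_roundtrip=skipped               # xz not available on this machine
fi
echo "xz round trip: $xz_roundtrip"
```

On an input this small the threads have little to split up, so the speedup only shows on files large enough to be divided into several blocks.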

unxz decompresses:

unxz file.txt.xz

xzcat reads compressed files:

xzcat file.txt.xz

zip / unzip

zip creates archives compatible with Windows and macOS. It handles both archiving and compression in one step.

zip archive.zip file1.txt file2.txt        # create zip with files
zip -r archive.zip directory/              # recursive (include directories)
zip -e archive.zip sensitive.txt           # encrypt with password (legacy zip encryption, weak)
zip -u archive.zip newfile.txt             # add/update files in existing zip

unzip extracts:

unzip archive.zip                          # extract to current directory
unzip archive.zip -d /target/directory/    # extract to specific directory
unzip -l archive.zip                       # list contents
unzip -o archive.zip                       # overwrite without prompting

zip Limitations

zip doesn't reliably preserve Unix metadata

Whether permissions survive depends on the tools at both ends. Info-ZIP's zip and unzip on Linux record and restore the permission bits, but archives created on Windows or by some other tools carry no Unix mode, and some extractors ignore it. In those cases extracted files get default permissions based on your umask, so scripts that were executable before zipping may not be executable after unzipping. Ownership is not restored by default, and symlinks are stored as copies of their targets unless you pass zip -y. For Unix-to-Unix transfers where permissions, ownership, and symlinks matter, use tar instead.

Classic zip also has size limits. The original zip format has a 4GB limit for individual files and a 4GB limit for the total archive size. Modern implementations support zip64 extensions to overcome this, but not all unzip tools handle zip64 correctly.
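For comparison, a short sketch showing that a tar round trip keeps the executable bit on a script. The script name is hypothetical, and the mode is read from ls -l output to keep the check portable.

```shell
set -e
workdir=$(mktemp -d)
mkdir "$workdir/src" "$workdir/out"
printf '#!/bin/sh\necho hi\n' > "$workdir/src/run.sh"
chmod 755 "$workdir/src/run.sh"

tar -cf  "$workdir/scripts.tar" -C "$workdir/src" run.sh
tar -xpf "$workdir/scripts.tar" -C "$workdir/out"     # -p: restore stored permissions

# The first ten characters of ls -l are the file type and permission bits.
mode=$(ls -l "$workdir/out/run.sh" | cut -c1-10)
echo "mode after extraction: $mode"
rm -rf "$workdir"
```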


When to Use Which

Figure: compression format decision tree. Sharing with Windows leads to zip, maximum compression leads to tar.xz, speed priority leads to tar.gz, and the default is tar.gz.

| Format | Use When |
| --- | --- |
| .tar.gz | General purpose. Fast, good compression, universally supported on Linux. Default choice for most things. |
| .tar.bz2 | You need better compression and can wait longer. Less common now that xz exists. |
| .tar.xz | Maximum compression matters (distributing software, long-term storage). Common for source tarballs and used inside some distro package formats. |
| .zip | Sharing with Windows/macOS users, or when recipients might not have tar. |
| .gz (no tar) | Compressing a single file (like log rotation). |

Concrete scenarios:

  • Distributing software - .tar.xz is the standard. Users download once, the slow compression time is paid by the developer, and the small size saves bandwidth.
  • Log rotation - .gz (single file, no tar needed). logrotate uses gzip by default because the fast compression/decompression cycle matters when rotating logs on a busy server.
  • Backups - .tar.gz balances compression with speed. For large backup jobs, the time difference between gzip and xz can be hours.
  • Sharing with non-Linux users - .zip is universally supported on Windows and macOS without extra software.
  • Archiving for long-term storage - .tar.xz gives the best size reduction. If the data won't be accessed frequently, the slow compression is worth it.
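The scenarios above boil down to a small lookup, sketched here as a shell function. The function name suggest_format and its keywords (windows, size, speed, single-file) are invented for this example and are not part of any tool.

```shell
# Map a priority keyword to the archive extension suggested above.
suggest_format() {
    case "$1" in
        windows)     echo ".zip" ;;      # sharing with non-Linux users
        size)        echo ".tar.xz" ;;   # distribution, long-term storage
        speed)       echo ".tar.gz" ;;   # backups, large jobs
        single-file) echo ".gz" ;;       # log rotation
        *)           echo ".tar.gz" ;;   # general-purpose default
    esac
}

echo "backup$(suggest_format speed)"     # prints backup.tar.gz
```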

Compression Comparison

Approximate results for a typical 100MB text file (actual results vary with content):

| Format | Compressed Size | Compression Time | Decompression Time |
| --- | --- | --- | --- |
| gzip | ~30MB (~70% reduction) | ~2 seconds | ~1 second |
| bzip2 | ~25MB (~75% reduction) | ~8 seconds | ~4 seconds |
| xz | ~20MB (~80% reduction) | ~30 seconds | ~2 seconds |
| zip | ~30MB (~70% reduction) | ~2 seconds | ~1 second |

Notable: xz decompresses much faster than it compresses, making it a good choice when you compress once and decompress many times (like software distribution). Binary files and already-compressed data (images, video) will see much smaller reductions.
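Because the figures above are only approximate, a rough way to compare the tools on your own data is sketched below. The generated sample stands in for a real file, each tool runs at its default level, and any tool that is not installed is skipped.

```shell
set -e
sample=$(mktemp)
# Made-up log-like data; replace with a representative file of your own.
seq 1 300000 | sed 's|^|GET /index.html status=200 id=|' > "$sample"

original_size=$(wc -c < "$sample" | tr -d ' ')
echo "original: $original_size bytes"

for tool in gzip bzip2 xz; do
    if command -v "$tool" >/dev/null 2>&1; then
        size=$("$tool" -c "$sample" | wc -c | tr -d ' ')
        echo "$tool: $size bytes"
    else
        echo "$tool: not installed"
    fi
done

gzip_size=$(gzip -c "$sample" | wc -c | tr -d ' ')   # kept for a simple sanity check
rm -f "$sample"
```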

