Archiving and compressing
In this section we will be working with a ZIP file that you can download and unpack with
$ wget http://bit.ly/bashfile -O bfiles.zip
$ unzip bfiles.zip
(alternative download link https://autumnschool2022.westdri.ca/files/bfiles.zip)
Unlike an SSD or a hard drive on your laptop, the filesystem on an HPC cluster was designed to store large files,
ideally with parallel I/O. As a result, it handles large numbers of small I/O requests (reads or writes)
very poorly, sometimes bringing the I/O system to a halt. For this reason, we strongly recommend that users do
not store many thousands of small files; instead, you should pack them into a small number of large
archives. This is where the archiving tool tar comes in handy.
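To judge whether a directory is a candidate for archiving, it helps to count its files and check its total size. The following is a minimal sketch, not part of the original workshop material: the temporary directory and the file names are made up for illustration, and you would substitute your own path.

```shell
# Hypothetical sample directory (an assumption for this sketch);
# replace "$dir" with the directory you want to inspect.
dir=$(mktemp -d)
for i in 1 2 3 4 5; do
    echo "sample $i" > "$dir/file$i.txt"
done

# Count the regular files under the directory, including subdirectories.
nfiles=$(find "$dir" -type f | wc -l | tr -d ' ')
echo "number of files: $nfiles"

# Show the total size of the directory tree in human-readable units.
du -sh "$dir"
```

A directory with a very large file count and a small total size is the kind of data that is better kept as one archive.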
Working with tar and gzip/gunzip
Covered topics: tar and g(un)zip.
Let’s download some files in the ZIP format:
$ wget http://bit.ly/bashfile -O bfiles.zip
$ unzip bfiles.zip
$ rm bfiles.zip
$ ls
$ ls data-shell
ZIP is an archive and compression format that is common on Windows but not very popular in the Unix world.
Let’s archive the directory data-shell using Unix’s native tar command:
$ tar cvf bfiles.tar data-shell/
$ gzip bfiles.tar
You can also create a gzipped TAR file in one step:
$ rm bfiles.tar.gz
$ tar cvfz bfiles.tar.gz data-shell/
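Before deleting the original files, you may want to check what an archive contains. The sketch below lists the contents of a gzipped TAR file without extracting it, using the t (list) operation. It builds its own small archive so that it can run on its own; the names demo and demo.tar.gz are placeholders, not files from the workshop.

```shell
# Build a small hypothetical directory tree in a scratch location.
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p demo/sub
echo "alpha" > demo/a.txt
echo "beta" > demo/sub/b.txt

# Create a gzipped archive in one step (no v flag, so it runs quietly).
tar cfz demo.tar.gz demo/

# List the archive members without extracting anything.
listing=$(tar tfz demo.tar.gz)
echo "$listing"
```

Adding v (as in tar tvfz demo.tar.gz) prints a long listing with permissions, sizes and dates. For the archive created above, the equivalent command would be tar tvfz bfiles.tar.gz.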
Let’s remove the directory and the original ZIP file (if still there), and extract the directory from our new archive:
$ /bin/rm -rf data-shell/ bfiles.zip
$ tar xvfz bfiles.tar.gz
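You do not have to extract everything. As a hedged sketch, the commands below restore a single file from a gzipped archive into a separate directory, using the -C option to choose the destination and naming the member to extract. The archive, the directory restore and the file names are again placeholders made up for this example, and the dash-style options (-czf, -xzf) are equivalent to the bundled form used above.

```shell
# Build a small hypothetical archive in a scratch location.
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p demo/sub restore
echo "alpha" > demo/a.txt
echo "beta" > demo/sub/b.txt
tar -czf demo.tar.gz demo/

# Extract only one member, placing it under restore/ instead of
# the current directory. The member name must match the path as
# stored in the archive (check it with: tar -tzf demo.tar.gz).
tar -xzf demo.tar.gz -C restore demo/sub/b.txt

# Show what was restored.
ls -R restore
```

Note that tar still has to read through the archive sequentially to find the requested member, so this can be slow for very large archives.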
You can watch a video for this topic after the workshop.
Managing many files with Disk ARchiver (DAR)
tar is by far the most widely used archiving tool on UNIX-like systems. Since it was originally
designed for sequential write/read on magnetic tapes, it does not index data for random access to its
contents. A number of third-party tools can add indexing to tar. However, there is a modern alternative to
tar called DAR (which stands for Disk ARchiver) that has some nice features:
- each DAR archive includes an index for fast file list/restore,
- DAR supports full / differential / incremental backup,
- DAR has built-in compression on a file-by-file basis, which makes it more resilient against data corruption and avoids compressing already compressed files such as video,
- DAR supports strong encryption,
- DAR can detect corruption in both headers and saved data and recover with minimal data loss,
and so on. Learning DAR is not part of this course. If you want to know more about working with DAR in the future, please watch our DAR webinar or check our DAR documentation page.