.. _archiving:
Archiving
=========
Archiving simulation inputs, scripts and output data is a common need for computational physicists.
Here are some popular tools and workflows to make archiving easy.
.. _archiving-hpss:
HPC Systems: HPSS
-----------------
A very common tape filesystem is HPSS, e.g., on `NERSC `__ or `OLCF `__.
* What's in my archive file system? ``hsi ls``
* Already something in my archive location? ``hsi ls 2019/cool_campaign/`` as usual
* Let's create a neat **directory structure**:
* new directory on the archive: ``hsi mkdir 2021``
* create sub-dirs per campaign as usual: ``hsi mkdir 2021/reproduce_paper``
* **Create** an archive of a simulation: ``htar -cvf 2021/reproduce_paper/sim_042.tar /global/cfs/cdirs/m1234/ahuebl/reproduce_paper/sim_042``
* This *copies* all files over to the tape filesystem and stores them as a single ``.tar`` archive
* The first argument here will be the new archive ``.tar`` file on the archive file system, all following arguments (can be multiple, separated by a space) are locations to directories and files on the parallel file system.
* Don't be confused, these tools also create an index ``.tar.idx`` file along it; just leave that file be and don't interact with it
* **Change permissions** of your archive, so your team can read your files:
* Check the unix permissions via ``hsi ls -al 2021/`` and ``hsi ls -al 2021/reproduce_paper/``
* *Files* must be group (g) readable (r): ``hsi chmod g+r 2021/reproduce_paper/sim_042.tar``
* *Directories* must be group (g) readable (r) and group accessible (x): ``hsi chmod -R g+rx 2021``
* **Restore** things:
* ``mkdir here_we_restore``
* ``cd here_we_restore``
* ``htar -xvf 2021/reproduce_paper/sim_42.tar``
* this *copies* the ``.tar`` file back from tape to our parallel filesystem and extracts its content in the current directory
Argument meaning: ``-c`` create; ``-x`` extract; ``-v`` verbose; ``-f`` tar filename.
That's it, folks!
.. note::
Sometimes, for large dirs, ``htar`` takes a while.
You could then consider running it as part of a (single-node/single-cpu) job script.
.. _archiving-desktop:
Desktops/Laptops: Cloud Drives
------------------------------
Even for small simulation runs, it is worth to create data archives.
A good location for such an archive might be the cloud storage provided by one's institution.
Tools like `rclone `__ can help with this, e.g., to quickly sync a large amount of directories to a Google Drive.
.. _archiving-globus:
Asynchronous File Copies: Globus
--------------------------------
The scientific data service `Globus `__ allows to perform large-scale data copies, between HPC centers as well as local computers, with ease and a graphical user interface.
Copies can be kicked off asynchronously, often use dedicated internet backbones and are checked when transfers are complete.
Many HPC centers also add their archives as a storage endpoint and one can download a client program to add also one's desktop/laptop.
.. _archiving-open-data:
Scientific Data for Publications
--------------------------------
It is good practice to make computational results accessible, scrutinizable and ideally even reusable.
For data artifacts up to approximately 50 GB, consider using free services like `Zenodo `__ and `Figshare `__ to store supplementary materials of your publications.
For more information, see the open science movement, open data and open access.
.. note::
More information, guidance and templates will be posted here in the future.