Space management

Contents

Space usage
Working with sequence files
Leaving MIB

Space usage

Every user should bear in mind that space management is tricky: on the one hand it is hard to know how much space is needed, and on the other hand there is never enough. Storage is not infinite (for the time being). However, with good management it is possible to extend the life span of the available storage and make sure everyone can work. On our system we distinguish two types of storage: administrative storage and working storage.

Administrative storage

The seq and home folders are administrative partitions and access to them is limited. The seq partition is where all sequencing runs are saved. Only admins can modify the folders in this partition, which means that only admins can save sequencing runs there. Users can see and use all sequencing runs. In the home partition, users have access to their personal folders. These folders should not be used to store sequencing runs, output files from sequencing run analyses, or big files in general. Another administrative partition is tools. This folder is only meant to host bioinformatics tools: no data and no databases. Users can install tools via 'conda'. Please send a message to the admins to install a tool that is not available on conda.

Working storage

The working partitions are projects and work. These two folders are meant to store users' project files. To start working in the work partition, make a personal folder with the same name as your login name. By default, other users can see the files in that folder but cannot modify them; feel free to adjust the permissions of the folder. That personal folder should host your analysis data only. After running their analyses, all users are expected to clean their work space and remove any files or folders that are not needed for the upcoming analysis steps (intermediate files). After finishing all analyses, it is important to take the time to delete any file or folder that is not worth keeping for publications, and to request a project folder from bioinfo.mib@wur.nl to save the important ones. For those who want to keep intermediate output files until the publication is accepted, several options are available:

WUR YODA offers a significant amount of space for free.
Take a look at the SurfSara service fees.
You can buy an extra hard drive of 2 TB, 4 TB or more.
It is also possible to rent space from many providers (WUR, Amazon, etc.).

The work partition IS NOT backed up. It is meant as temporary storage. All users are asked to sort their files regularly, back up the relevant FILES and delete the rest.
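To decide what to sort first, it helps to know which files take the most space. The following is a minimal sketch, not an official tool: the function name `largest_files` is hypothetical, and the `/work/<login>` path in the comment is an assumption based on the folder layout described above.

```python
import os
import tempfile
from pathlib import Path


def largest_files(root, top=10):
    """Return the `top` largest regular files under `root` as (size_in_bytes, path) pairs."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                # Skip symbolic links so that linked data is not counted twice.
                if path.is_symlink() or not path.is_file():
                    continue
                sizes.append((path.stat().st_size, str(path)))
            except OSError:
                # Unreadable entries are ignored in this sketch.
                continue
    # Largest first; ties are broken by path so the order is stable.
    sizes.sort(key=lambda item: (-item[0], item[1]))
    return sizes[:top]


if __name__ == "__main__":
    # Demo on a throwaway folder; point it at your own folder instead,
    # for example /work/<login> (assumed layout).
    with tempfile.TemporaryDirectory() as demo:
        (Path(demo) / "small.txt").write_text("a" * 10)
        (Path(demo) / "big.txt").write_text("a" * 1000)
        for size, path in largest_files(demo, top=5):
            print(f"{size:>12d}  {path}")
```

The function only reports files; deciding what to back up or delete remains a manual step.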

The projects partition is meant to host post-publication files, that is, the files that should be kept for 10 years. These folders should be kept clean and include a README file that describes their content. Make sure to compress all text data files using either gzip or bzip2. gzip is faster and more suitable for non-redundant files; you can also use the multithreaded version, pigz, to speed up compression. bzip2 offers a better compression level but takes a bit more time; compression can be accelerated using the multithreaded version, pbzip2. Use bzip2 by default. DO NOT STORE RAW SEQUENCING FILES HERE!
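As an illustration of the compression advice above, here is a minimal sketch that compresses a text file with either method using Python's standard gzip and bz2 modules. The function name `compress_file` and the demo file name are hypothetical; on the server, the gzip, bzip2, pigz or pbzip2 command-line tools do the same job.

```python
import bz2
import gzip
import os
import shutil
import tempfile


def compress_file(src, method="bzip2"):
    """Compress `src` next to itself and return the path of the compressed copy.

    The original file is left in place; delete it only after checking the copy.
    """
    openers = {"gzip": (gzip.open, ".gz"), "bzip2": (bz2.open, ".bz2")}
    opener, suffix = openers[method]
    dst = src + suffix
    # Stream the data so that large files are never loaded fully into memory.
    with open(src, "rb") as fin, opener(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as demo:
        table = os.path.join(demo, "coverage.txt")  # hypothetical text result file
        with open(table, "w") as handle:
            handle.write("contig_1\t1500\t0.52\n" * 5000)
        for method in ("gzip", "bzip2"):
            packed = compress_file(table, method)
            print(method, os.path.getsize(table), "->", os.path.getsize(packed), "bytes")
```

bzip2 is the default here only because the text above recommends it; pass `method="gzip"` when speed matters more.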

Documentation

Documentation is a very important part of data management. It helps your supervisor and anyone interested in your data to know what you have already done, what is left to do, and where to find the data. Many platforms provide a wiki that allows users to document their work. Here are a few options:

gitlab, which is managed by FB-IT @WUR. There is a MIB repository that could also be used
jupyter notebook
redmine, ask bioinfo.mib@wur.nl for questions
galaxy with galaxy pages
And many external ones...

It is recommended to use internal tools for documentation, as they are easily accessible to all and are not subject to external rules. A good tip is to export your documentation as a PDF and place it at the root of the project folder. This is handy to ensure that the documentation is always next to the data. The downside is that it involves manual work.

Databases

Bioinformatics databases can be very large, so it is recommended to handle them with care. A database partition exists on the server; admins will store databases of common interest there. More specific databases are temporarily stored in the /work/database partition and REMOVED after use. After creating a database, its owner must give permissions to all users (except students). To do that, use the following command: chmod -R g+rwx PATH_TO_FOLDER. This will help others keep the database up to date.
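For those who create databases from a script, the effect of chmod -R g+rwx can be reproduced with Python's standard os and stat modules. This is a minimal sketch: the function name `add_group_rwx` is hypothetical, and it assumes that the users who need access belong to the group that owns the folder.

```python
import os
import stat
import tempfile

# Group read, write and execute bits (the "g+rwx" part of the chmod command).
GROUP_RWX = stat.S_IRGRP | stat.S_IWGRP | stat.S_IXGRP


def add_group_rwx(root):
    """Add group read/write/execute to `root` and everything below it."""

    def _add(path):
        mode = stat.S_IMODE(os.stat(path).st_mode)
        os.chmod(path, mode | GROUP_RWX)  # keep existing bits, add the group bits

    _add(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # Symbolic links are skipped so that their targets are left untouched.
            if not os.path.islink(path):
                _add(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as demo:
        db_file = os.path.join(demo, "reference.fasta")  # hypothetical database file
        with open(db_file, "w") as handle:
            handle.write(">seq1\nACGT\n")
        os.chmod(db_file, 0o600)
        add_group_rwx(demo)
        print(oct(stat.S_IMODE(os.stat(db_file).st_mode)))
```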

Working with sequence files

The point of this section is to raise awareness about the size of the datasets we usually process and to give some ideas of actions we could take to avoid unnecessarily filling up the storage. Sequencing files are diverse and their sizes vary depending on the technology used for their generation and the aim of the project.

Given that any data used for publication should respect the FAIR principles, analysis outputs are expected to be reproducible. As long as the previous statement is true, the only files you need to store and keep are the data that are cited in the publication and the raw data + metadata as well as a complete description of how to obtain those files (commands and parameters). Any intermediate file can be deleted but it is important to document how they were obtained.
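One possible way to keep the description of how a file was obtained next to the file itself is a small machine-readable record. The sketch below is only an illustration: the function name `write_run_record`, the `.run.json` suffix, the field names and the tool name `example_assembler` are all hypothetical, not a MIB convention.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_run_record(output_file, tool, version, command, inputs):
    """Write a small JSON record next to `output_file` describing how it was produced."""
    record = {
        "output": str(output_file),
        "tool": tool,
        "version": version,
        "command": command,
        "inputs": [str(path) for path in inputs],
        "created_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }
    record_path = Path(str(output_file) + ".run.json")
    record_path.write_text(json.dumps(record, indent=2) + "\n")
    return record_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as demo:
        contigs = Path(demo) / "contigs.fasta.gz"
        contigs.write_bytes(b"")  # placeholder standing in for a real output file
        record = write_run_record(
            contigs,
            tool="example_assembler",  # hypothetical tool name
            version="1.0",
            command="example_assembler -k 21 reads.fastq.gz",
            inputs=["reads.fastq.gz"],
        )
        print(record.read_text())
```

A record like this complements, but does not replace, the log files written by the tools themselves.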

Through the different phases of an analysis we often duplicate files or pieces of information contained in them. Many output files carry the same information as their input. For example, let's consider a subset of a common workflow for metagenomics data:

1. Raw reads [FASTQ]
2. Remove adapters and low-quality reads, reads with Ns or any other artifacts... [FASTQ]
3. Filter out short reads [FASTQ]
4. Assemble cleaned reads [FASTA]
5. Discard potential short and chimeric contigs [FASTA]
6. Map raw reads to contigs [BAM]
7. Generate statistics about assembly and mapping [TXT]
8. Remove unmapped reads (this step is optional and only used in this example for illustration) [BAM]
9. Identify hypothetical markers in contigs [FASTA + GFF3]
10. Bin the assembly [FASTA]
11. Discard incomplete and redundant bins [FASTA]

...

Observations

Steps 2 and 3 each contain a subset of the reads in the raw read files; these output files can also be easily reproduced if sufficient information is recorded.
Step 5 outputs a subset of step 4.
Step 8 outputs a subset of step 6.
In step 9, the FASTA file is mainly used for downstream analysis and the GFF3 file for backup.
Steps 9 and 10 output a subset of step 5.
Step 11 outputs a subset of step 10.

Conclusions

Much can be learned from the scheme above. Any step can be easily reproduced if we have the information on how each file was obtained. It usually does not take long to reproduce either: with sufficient computational resources, the entire workflow can be run in a day (maybe two for very complex and large samples).
It is a good habit to check and keep the log files generated by programs, as they usually contain all the metadata related to the run (software, parameters and version).
Most (maybe all) of the software used to process the datasets can use gzip- or bzip2-compressed files as input. All users are expected to use compressed files as input as often as possible. Users should also compress any text result file before backing it up, whatever the size of the file.
There is no need to back up the output of intermediate steps or runs, for example steps 2, 3, 5, 6 and 8 in the scheme above. The same applies to any step that can be easily repeated. !WARNING! It is important to document the metadata of all the steps.
We advise users to back up individual files and not the entire output of a run. A run usually creates a lot of temporary files, and the size of the final folder can be 10+ times the size of the files users are actually interested in. For example, most assembly software uses several kmer sizes as parameters and outputs one assembly per kmer, which drastically increases the size of the folder. Sometimes two assembly files are also generated (before and after scaffolding); these are usually the same. The best practice in this case is to save the log file, pick one of the assembly files and compress it; the rest can be deleted.
In some cases, a step is an exact subset of a previous step. For example, the bins generated in step 10 are an exact subset of the raw assembly generated in step 4. In this case and only in this case, it is possible to save only the names of the sequences in each bin. Using software such as seqtk, it is then easy to extract a FASTA file for each bin from the raw assembly file.
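The last point, saving only the sequence names of each bin, can be sketched as follows. This is a plain-Python illustration of the idea and not the seqtk implementation; the function names and file names are hypothetical. On the command line, `seqtk subseq assembly.fasta names.txt` performs the same extraction.

```python
import gzip
import os
import tempfile


def open_text(path, mode="rt"):
    """Open plain or gzip-compressed text files transparently."""
    return gzip.open(path, mode) if str(path).endswith(".gz") else open(path, mode)


def read_fasta(path):
    """Yield (name, sequence) pairs; the name is the header up to the first whitespace."""
    name, chunks = None, []
    with open_text(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                parts = line[1:].split()
                name, chunks = (parts[0] if parts else ""), []
            elif name is not None:
                chunks.append(line.strip())
    if name is not None:
        yield name, "".join(chunks)


def write_bin_names(bin_fasta, names_path):
    """Save only the sequence names of a bin, one per line, and return them."""
    names = [name for name, _ in read_fasta(bin_fasta)]
    with open(names_path, "w") as out:
        out.write("\n".join(names) + "\n")
    return names


def extract_bin(assembly_fasta, names_path, out_fasta):
    """Rebuild a bin FASTA from the raw assembly and a list of sequence names."""
    with open(names_path) as handle:
        wanted = {line.strip() for line in handle if line.strip()}
    written = 0
    with open_text(out_fasta, "wt") as out:
        for name, seq in read_fasta(assembly_fasta):
            if name in wanted:
                out.write(f">{name}\n{seq}\n")
                written += 1
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as demo:
        assembly = os.path.join(demo, "assembly.fasta")
        with open(assembly, "w") as handle:
            handle.write(">contig_1\nACGTACGT\n>contig_2\nGGGG\n>contig_3\nTTTT\n")
        names = os.path.join(demo, "bin_1.names.txt")
        with open(names, "w") as handle:
            handle.write("contig_1\ncontig_3\n")
        rebuilt = os.path.join(demo, "bin_1.fasta")
        print(extract_bin(assembly, names, rebuilt), "sequences written")
```

This only works when the sequence names in the bin are unchanged from the raw assembly, which is the exact-subset condition stated above.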

Which files are important to backup?

If we consider the workflow described above:

The raw files should always be kept untouched.
The first assembly file generated in step 4 should be backed up in your project folder. Of course, the assembly file should be compressed and the LOG file should be saved next to it. We keep these files because the assembly step can be time-consuming and its output is used in many downstream analyses.
The statistics files generated in step 7 should be saved as well: the steps needed to generate them can also be long, and the files usually do not take much space.
The GFF file in step 9 is good to back up.
And of course the documentation (could also be online).

What to do with the temporary files?

It can be handy to keep some temporary files until the paper is published. Because of the huge size of those files, the server is not a good place to host them. We recommend that users arrange temporary storage for these files, to keep as much server storage available as possible. Temporary storage can be an external hard drive or a cloud storage service that respects the WUR policies. For examples, see https://library.wur.nl/storagefinder/ and https://www.surf.nl/files/2022-12/surf-services-and-rates-2023_version-aug-2022.pdf.

Leaving MIB

Think about tidying up

The end of a contract is always a tricky period, as there are numerous things to do, and cleaning up your workspace on the server is yet another task that is expected of you. Before leaving, all important files (publication-related files) should be stored in a project folder under the projects partition. Your folder in the work partition should be left empty. Make sure all files are documented somewhere. Your home folder should also be emptied before you leave.

Student files

Students working on the server are not responsible for the files they create; their supervisors are. Supervisors are expected to assess the quality of their students' work and make sure important files are saved in the correct place in the correct way. Students' folders should be empty before they leave. Documentation of the student's work will be much appreciated by their supervisor.

Other IT related actions to take

In addition,

Your laptop must be returned to IT.
Docking stations and adapters should be handed to the admins (bioinfo.mib@wur.nl).
