Working with sequence files

The aim of this section is to raise awareness of the size of the datasets we usually process and to suggest actions that avoid unnecessarily filling up the storage. Sequencing files are diverse, and their sizes vary with the technology used to generate them and the aims of the project.

Given that any data used for publication should respect the FAIR principles, analysis outputs are expected to be reproducible. Under that assumption, the only files you need to store and keep are the data cited in the publication, the raw data plus metadata, and a complete description of how the other files were obtained (commands and parameters). Any intermediate file can be deleted, but it is important to document how it was obtained.
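A minimal provenance record can be produced in a few lines of shell. This is only a sketch: the file names are assumptions, and fastp merely stands in for whatever trimmer was actually used.

```shell
# Toy raw data file standing in for a real FASTQ (name is an assumption).
printf '@r1\nACGT\n+\nIIII\n' > sample1.fastq

# Checksum so the archived raw data can be verified later.
md5sum sample1.fastq > sample1.fastq.md5

# The exact command (and a version query) used to produce each derived file.
echo 'fastp --version; fastp -i sample1.fastq -o sample1.clean.fastq' > commands.txt

cat sample1.fastq.md5
```

Storing the checksum and command file next to the raw data is enough to regenerate any derived file later.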

Throughout the different phases of an analysis we often duplicate files or pieces of information contained in them: many output files carry the same information as their input. For example, let's consider a subset of a common workflow for metagenomics data,

  1. Raw reads [FASTQ]
  2. Remove adapters, bad-quality reads, reads with Ns and other artifacts [FASTQ]
  3. Filter out short reads [FASTQ]
  4. Assemble cleaned reads [FASTA]
  5. Discard potential short and chimeric contigs [FASTA]
  6. Map raw reads to contigs [BAM]
  7. Generate Statistics about assembly and mapping [TXT]
  8. Remove unmapped reads (this step is optional and only included here for illustration) [BAM]
  9. Identify hypothetical markers in contigs [FASTA + GFF3]
  10. Bin the assembly [FASTA]
  11. Discard incomplete and redundant bins [FASTA]

...
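As a toy illustration of how such an intermediate file is just a filtered subset of its input, step 3 can be sketched in a few lines of shell. The awk one-liner and the 10 bp threshold are stand-ins for a real length filter, not the actual tool used.

```shell
# Toy FASTQ with one long and one short read.
printf '@r1\nACGTACGTACGT\n+\nIIIIIIIIIIII\n@r2\nACG\n+\nIII\n' > clean.fastq

# Keep only reads whose sequence is at least 10 bp (threshold is an assumption).
awk 'NR%4==1 {h=$0} NR%4==2 {s=$0} NR%4==3 {p=$0} NR%4==0 {q=$0;
     if (length(s) >= 10) print h "\n" s "\n" p "\n" q}' clean.fastq > filtered.fastq

cat filtered.fastq
```

Since filtered.fastq contains nothing that clean.fastq does not, keeping the command and threshold is as good as keeping the file.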

Observations::

Steps 2 and 3 output subsets of the reads in the raw files; those outputs can easily be reproduced if sufficient information is recorded.
Step 5 outputs a subset of step 4.
Step 8 outputs a subset of step 6.

In step 9, the FASTA file is mainly used for downstream analysis and the GFF3 file for backup.

Steps 9 and 10 output subsets of step 5.
Step 11 outputs a subset of step 10.

Conclusions::

Much can be learned from the scheme above: any step can easily be reproduced as long as we know how each file was obtained. Reproducing a step usually does not take long either; given sufficient computational resources, the entire workflow can be run in a day (maybe two for very complex and large samples).
It is a good habit to check and keep the log files generated by programs, as they usually contain all the metadata of the run (software, version and parameters).
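When a program does not write a log itself, a few lines of shell can capture the same metadata. A minimal sketch, with gzip standing in for any pipeline tool:

```shell
# Record the run date and the tool version before the actual command.
echo "run date: $(date -u +%Y-%m-%dT%H:%M:%SZ)" > run.log
gzip --version | head -n 1 >> run.log            # gzip stands in for any pipeline tool
echo "command: gzip -k reads.fastq" >> run.log   # the exact command and parameters

cat run.log
```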

Most (maybe all) of the software used to process these datasets accepts gzip- or bzip2-compressed files as input. Users are expected to use compressed files as input whenever possible, and should also compress any text result file before backing it up, regardless of its size.
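A quick demonstration with a tiny stand-in file: the original is kept with `-k` only so the sizes can be compared, and the compressed copy is streamed back without writing a decompressed file to disk.

```shell
# Tiny FASTA stand-in for a large sequence file.
printf '>seq1\nACGTACGT\n>seq2\nGGGGCCCC\n' > demo.fasta

gzip -k demo.fasta                 # -k keeps the original so sizes can be compared
zcat demo.fasta.gz | head -n 2     # many tools read such streams or .gz files directly
ls -l demo.fasta demo.fasta.gz
```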

There is no need to back up output from intermediate steps/runs, for example steps 2, 3, 5, 6 and 8 in the scheme above; the same applies to any step that can easily be repeated. !WARNING! It is important to document the metadata of all the steps.

We advise users to back up individual files, not the entire output of a run. A run usually creates a lot of temporary files, and the final folder can be 10+ times larger than the files users are actually interested in. For example, most assembly software takes several k-mer sizes as parameters and outputs one assembly per k-mer, which drastically increases the size of the folder. Sometimes two assembly files are also generated (before and after scaffolding) that are usually nearly identical. The best practice in this case is to save the log file, pick one of the assembly files and compress it; the rest can be deleted.
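In shell, the clean-up amounts to a few commands. All file and folder names below are assumptions simulating a typical assembler output folder:

```shell
# Hypothetical assembler output folder; all names are assumptions.
mkdir -p assembly_run
printf '>contig1\nACGTACGT\n' > assembly_run/final.contigs.fa
touch assembly_run/k21.contigs.fa assembly_run/k33.contigs.fa   # per-kmer intermediates
echo 'k list: 21,33,55' > assembly_run/run.log

# Keep the log and one compressed assembly, delete the per-kmer intermediates.
gzip assembly_run/final.contigs.fa
rm assembly_run/k*.contigs.fa

ls assembly_run
```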

In some cases, a step is an exact subset of a previous step. For example, the bins generated in step 10 are exact subsets of the raw assembly generated in step 4. In this case, and only in this case, it is possible to save only the names of the sequences in each bin. Using software like seqtk, a FASTA file for each bin can then easily be extracted from the raw assembly file.
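Concretely, the extraction looks like this. The file names are assumptions; with seqtk installed the one-liner would be `seqtk subseq assembly.fa bin1.names > bin1.fa`, and the awk command below is a minimal portable equivalent for this simple case:

```shell
# Toy assembly and the names of the contigs in a hypothetical bin (names are assumptions).
printf '>c1\nACGT\n>c2\nGGCC\n>c3\nTTAA\n' > assembly.fa
printf 'c1\nc3\n' > bin1.names

# With seqtk: seqtk subseq assembly.fa bin1.names > bin1.fa
# Portable awk equivalent: keep only sequences whose header name is listed.
awk 'NR==FNR {want[$1]; next}
     /^>/ {name=substr($1,2); keep=(name in want)}
     keep' bin1.names assembly.fa > bin1.fa

cat bin1.fa
```

Storing only the small name lists plus the raw assembly replaces one FASTA file per bin.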

Which files are important to back up?

If we consider the workflow described above, the files worth backing up are the raw reads and their metadata (step 1), the assembly and mapping statistics (step 7), the GFF3 annotations (step 9), the final bins or their sequence-name lists (step 11), and the log files of every step.

What to do with the temporary files?

It can be handy to keep some temporary files until the paper is published, but because of their huge size the server is not a good place to host them. We recommend that users arrange temporary storage for these files, to keep as much server storage available as possible. Temporary storage can be an external hard drive or a cloud service that respects the WUR policies; see https://library.wur.nl/storagefinder/ and https://www.surf.nl/files/2022-12/surf-services-and-rates-2023_version-aug-2022.pdf for examples.