The space required differs between steps, but as a general rule you need 2 times the size of the largest input file per sample. Also account for multiple samples running simultaneously on multi-core machines.
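As a rough illustration of that rule of thumb, the sketch below estimates peak working space from the largest input file of each sample. The function name `estimate_work_space` and the worst-case reading (the largest samples all running at the same time) are assumptions made for this example, not part of bcbio-nextgen.

```python
def estimate_work_space(largest_input_sizes, concurrent_samples, factor=2):
    """Estimate peak working space, in the same units as the input sizes.

    largest_input_sizes: largest input file size for each sample.
    concurrent_samples: how many samples may run at the same time
        (hypothetical parameter; depends on cores and configuration).
    factor: multiplier per sample; 2 follows the rule of thumb above.
    """
    if concurrent_samples < 1:
        raise ValueError("concurrent_samples must be at least 1")
    # Worst case: the samples with the biggest inputs run together.
    biggest = sorted(largest_input_sizes, reverse=True)[:concurrent_samples]
    return factor * sum(biggest)


# Three samples with 10, 30 and 20 GB inputs, two running at once:
# 2 * (30 + 20) = 100 GB of working space.
print(estimate_work_space([10, 30, 20], concurrent_samples=2))
```

Treat the result as a lower bound for planning, since the space actually used varies between steps.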

Example genome configuration files are available and are automatically installed for natively supported genomes.

Create these by hand to support additional organisms or builds.

Reference these using the naming schemes described in the reference data repository.

For more information on the hg38 truth set preparation, see the work on validation on build 38 and on conversion of human build 37 truth sets to build 38.

For human, GRCh37 and hg19, we use the 1000 Genomes references provided in the GATK resource bundle.

You can use pre-existing data and reference indexes by pointing bcbio-nextgen at these resources.

This requires a conversion step that turns reads output by an Illumina sequencer as 3 files (read 1, read 2, and UMIs) into paired reads with UMIs in the fastq names.
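To make the conversion concrete, here is a minimal sketch of the idea: read the three FASTQ streams in lockstep and append each UMI sequence to the names of the matching read 1 and read 2 records. This is not the tool bcbio-nextgen uses. The function names and the `:UMI_` name suffix are assumptions chosen for illustration, so check the format your downstream tools expect.

```python
from itertools import islice


def read_fastq(handle):
    """Yield (name, sequence, quality) from an iterable of FASTQ lines."""
    lines = iter(handle)
    while True:
        block = list(islice(lines, 4))
        if len(block) < 4:
            return  # end of input (or a truncated final record)
        name, seq, _, qual = (line.rstrip("\n") for line in block)
        yield name[1:], seq, qual  # drop the leading "@"


def add_umi(name, umi, tag=":UMI_"):
    """Append the UMI to the read identifier, keeping any comment field.

    The ":UMI_" tag is an assumed naming convention for this example.
    """
    parts = name.split(None, 1)
    parts[0] = parts[0] + tag + umi
    return " ".join(parts)


def pair_with_umis(read1, read2, umis):
    """Yield (read1, read2) record pairs with the UMI in both names.

    Assumes the three inputs list reads in the same order; zip stops
    at the shortest input, so length mismatches are not reported.
    """
    for (n1, s1, q1), (n2, s2, q2), (_, umi, _) in zip(read1, read2, umis):
        yield (add_umi(n1, umi), s1, q1), (add_umi(n2, umi), s2, q2)
```

For real data you would open the three files (with `gzip.open(path, "rt")` if compressed), pass each handle through `read_fastq`, and write the tagged records back out as FASTQ.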

