No one asked, but here are some things that would make me love a genomic data center.
Sensible file naming: Require a specific ID format before submission. Ensure that the sequencing center and customers use consistent file naming conventions. Sequencing typically requires output like: SAMPLEID_MACHINE_RUNID. However, if a customer provides a sample ID like “ABC-001”, the sequencing center should NOT return “ABC_001_XYZ0001_001234”. Avoid discrepancies between the customer’s original ID and the sequencing center’s modified ID. If modifications are required, ensure that the customer makes them first on their side by submitting “ABC_001”.
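As a rough sketch of how this could be enforced at submission time, the Python below rejects IDs that would otherwise need rewriting and builds the output name from the ID exactly as submitted. The allowed character set (letters, digits, underscores) and the function names are my own assumptions for illustration, not any center's actual rules. It also assumes the machine and run fields never contain underscores.

```python
import re

# Assumed rule (hypothetical): letters, digits and underscores only.
SAMPLE_ID_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def validate_sample_id(sample_id: str) -> None:
    """Reject a bad ID at submission instead of silently rewriting it."""
    if not SAMPLE_ID_PATTERN.fullmatch(sample_id):
        raise ValueError(
            f"Sample ID {sample_id!r} is not allowed; "
            "please resubmit using only letters, digits and underscores."
        )


def output_name(sample_id: str, machine: str, run_id: str) -> str:
    """Build SAMPLEID_MACHINE_RUNID from the ID exactly as submitted."""
    validate_sample_id(sample_id)
    return f"{sample_id}_{machine}_{run_id}"


def sample_id_from_output(name: str) -> str:
    """Recover the customer's ID by stripping the last two fields."""
    sample_id, _machine, _run_id = name.rsplit("_", 2)
    return sample_id
```

With this, “ABC-001” is refused up front, while “ABC_001” round-trips unchanged through the output name, so the customer's ID and the center's ID never diverge.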
Protocol file: Provide a summary file that lists the number of samples, batch IDs, sample handling steps (e.g. library preparation kit used, sequencing lane, sequencer machine used), and any alignment or reference genome information. The customer should be responsible for recording what they ordered, but we must assume that they will lose it if it is not included in the data pack returned to them.
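A minimal sketch of what such a summary file might look like, written as JSON so it is both human-readable and machine-readable. The field names and the per-sample record layout are hypothetical; the point is only that everything listed above ends up in one file that ships with the data.

```python
import json


def build_protocol_summary(samples, library_prep_kit, sequencer, reference):
    """Collect the order details into one dictionary.

    `samples` is a list of dicts with the (assumed) keys
    "sample_id", "batch_id" and "lane".
    """
    return {
        "n_samples": len(samples),
        "batch_ids": sorted({s["batch_id"] for s in samples}),
        "library_prep_kit": library_prep_kit,
        "sequencer": sequencer,
        "reference_genome": reference,
        "samples": samples,
    }


def write_protocol_summary(summary, path):
    """Write the summary next to the data so it travels with the delivery."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(summary, handle, indent=2)
```

A delivery script could call `build_protocol_summary(...)` once per order and drop the result into the top-level directory of the data pack.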
Reference genome: If you map reads to GRCh37 or hg19, use hs37-1kg: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz. If you map to GRCh37 and believe decoy sequences improve variant calling, use hs37d5: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz. If you map reads to GRCh38 or hg38, use the following: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz.
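The reference choices above can be pinned down in a small lookup so the pipeline and the protocol file always agree on which FASTA was used. This is only a sketch: the key "GRCh38_no_alt", the function name, and the decision to treat "hg19" the same as "GRCh37" for the decoy option are my own assumptions. The URLs are the ones given above.

```python
# Reference FASTA locations, keyed by a short name.
REFERENCE_URLS = {
    "hs37-1kg": (
        "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/"
        "human_g1k_v37.fasta.gz"
    ),
    "hs37d5": (
        "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/"
        "phase2_reference_assembly_sequence/hs37d5.fa.gz"
    ),
    # Hypothetical short name for the GRCh38 no-alt analysis set.
    "GRCh38_no_alt": (
        "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/"
        "GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/"
        "GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz"
    ),
}


def choose_reference(build: str, decoy: bool = False) -> str:
    """Map a build label (and decoy preference) to a reference name."""
    label = build.lower()
    if label in ("grch37", "hg19"):
        return "hs37d5" if decoy else "hs37-1kg"
    if label in ("grch38", "hg38"):
        return "GRCh38_no_alt"
    raise ValueError(f"Unknown genome build: {build!r}")
```

Recording the returned name and its URL in the protocol file saves the customer from guessing which flavour of GRCh37 or GRCh38 they were given.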
Variant calling: If you provide variant calling services, for example following the GATK Best Practices, include a summary of the protocol used.
Producing a simple report will be valuable for the majority of customers. Avoid reporting overly complex information.