Genomic data formats

Genomic data formatsMikhail DozmorovVirginia Commonwealth University02-10-20211 / 48

The reference genome as a coordinate system

2 / 48

Binary and "plain text" files

"Plain text" ("flat") file formats

Information often structured into lines and columns
Human-readable, processable with simple command-line tools
Space is often saved through the means of archiving (e.g. zip) and human-readable information coding (e.g. flags)

Binary file formats

Not human-readable (not intended for reading)
Require special software for processing (programs intended for plain text processing will not work properly on them, e.g. wc, grep)
Efficient storage, (significant) reduction to file size when compared to a plain text counterpart (e.g. 75 % space saved)

3 / 48

Some common file formats

FASTA - Simple collections of named DNA/protein sequences (text)
FASTQ - Extension of FASTA format, contains additional quality information. Widely used for storing unaligned sequencing reads (text)
SAM/BAM - Alignments of sequencing reads to a reference genome (text/binary)
BED - Region-based genome annotation information (e.g. a list of genes and their genomic locations) (text)
GFF/GTF - gene-centric annotations (text)
(big)WIG - continuous signal representation (text, binary)
VCF - variant call format, to store information about genomic variants (text)

4 / 48

FASTA format

Origin in 1980’s FASTP/FASTA software packages for sequence similarity searching
Convenient format for storing/transferring simple sequence collections
Intended for both, nucleotide and protein sequences (FAST- A, "Fast-All")
Common output format of sequence databases – Input format for various tools

5 / 48

FASTA format

Each sequence entry consists of:

a single line with sequence metadata, starting with the "greater-than" symbol (">")
any number of lines representing the actual sequence

>SRR493818.1.1 HWUSI-EAS519_0001_FC61TCNAAXX:2:1:960:20843 length=54
NGTAAATCCCGTTACCTCTCTGAGCCTCGGTNNCCNNNNNTGTAAAAAGGNNNN
>SRR493818.2.1 HWUSI-EAS519_0001_FC61TCNAAXX:2:1:961:7550 length=54
NACACTACAATGTAAAAGCTTGGCCTAACTTNNTTNNNNNGGCTGTTATTNNNN

6 / 48

FASTA format

The nucleic acid codes that can be found in FASTA file:

A --> adenosine           M --> A C (amino)
C --> cytidine            S --> G C (strong)
G --> guanine             W --> A T (weak)
T --> thymidine           B --> G T C
U --> uridine             D --> G A T
R --> G A (purine)        H --> A C T
Y --> T C (pyrimidine)    V --> G C A
K --> G T (keto)          N --> A G C T (any)
                          -  gap of indeterminate length

http://zhanglab.ccmb.med.umich.edu/FASTA/

7 / 48

FASTA format

Human genome is there! http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/
Get chromosome Y sequence:

$ URL=http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chrY.fa.gz
$ wget $URL; mv chrY.fa.gz hg38.chrY.fa.gz
# or
$ curl $URL > hg38.chrY.fa.gz
# Ungzip
$ gzip -d hg38.chrY.fa.gz
# Peek into the file
$ head hg38.chrY.fa
$ tail hg38.chrY.fa
# How many lines
$ wc -l hg38.chrY.fa
# Look somewhere in the middle
$ head -n 10000 hg38.chrY.fa | tail

8 / 48

Orienting in FASTA file

Do not run:
```
$ # grep > hg38.chrY.fa
```
What this command will do?
```
$ grep ">" hg38.chrY.fa
```
What this command will do?
```
$ grep -v ">" hg38.chrY.fa
```

9 / 48

chrY summary statistics

How long is chrY?

$ grep -v ">" hg38.chrY.fa | grep -o "[ATCGatcg]" | wc -l
26415043

How many adenosines are there?

$ grep -v ">" hg38.chrY.fa | grep -o -i "A" | wc -l
7886192

10 / 48

Low-complexity regions

Approximately 45% of the human genome can currently be recognized as being derived from transposable elements
The majority are non-long terminal repeat (LTR) retrotransposons, such as LINE-1 (L1), Alu and SVA elements
Other low-complexity biases: CG- and AT-rich DNA
Represented by lower-case letters

GATCAAAGTGTCATACAGTAACAGCCCAGACAGACGATAGGTATGGCAGa
aaagaaaaaaactaaaaaaaaaaaaaaaaaaaaaaaTCGCATGGGAAGTT
TCCCCGCCTCCTCTTTGGCCATTCTGTGCCCGGAGATCAAAGTTCTCATT

11 / 48

The conversion of sequencing signal to a nucleotide sequence

Nearly all sequencing technologies produce sequencing reads in FASTQ format

Sequence ID
Sequence
Separator
Quality scores

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

https://en.wikipedia.org/wiki/FASTQ_format

12 / 48

Sequence ID

13 / 48

FASTQ quality scores (Phred scores)

PHRED Base quality (Q) – integer value derived from the estimated probability (P) of the corresponding base being determined wrong

$Q = - 10 * l o g_{10} (P_{e r r})$ (rounded to nearest integer)

The Ph in Phred comes from Phil Green, the inventor of the encoding
P might be computed in different ways, depending on the sequencing platform, but the meaning of Q remains the same

Peter J. A. Cock, Christopher J. Fields, Naohisa Goto, Michael L. Heuer, and Peter M. Rice, Nucleic Acids Res. 2010 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2847217/

http://www.gs.washington.edu/faculty/green.htm

14 / 48

FASTQ quality scores (Phred scores)

PHRED Base quality (Q) – integer value derived from the estimated probability (P) of the corresponding base being determined wrong

A higher quality score is better (>=30 is considered "good")

15 / 48

ASCII table

$A S C I I (c h a r) = Q + 33$ - Phred + 33 encoding scheme is currently used

http://www.asciicharstable.com/_site_media/ascii/ascii-chars-table-landscape.pdf

16 / 48

FASTQC - quality control

Java program, HTML output

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
FASTQC manual: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/
Video tutorial how to interpret, https://www.youtube.com/watch?v=bz93ReOv87Y

17 / 48

fastx toolkit: manipulating FASTQ and FASTA files

http://hannonlab.cshl.edu/fastx_toolkit/

18 / 48

fastx toolkit: manipulating FASTQ and FASTA files

fastx_quality_stats -h
fastx_quality_stats -Q 0 -i file.fastq -o fastx_report.txt
head fastx_report.txt
fastx_trimmer -h
fastx_trimmer -f 1 -l 100 -i file.fastq -o file_trimmed.fastq

19 / 48

The Sequence Alignment/Map format (SAM)

Intended for storing read alignments against reference sequences
Has a binary version with good software support (BAM format)

The SAM format consists of two sections:

Header section

Used to describe source of data, reference sequence, method of alignment, etc.

Alignment section

Used to describe the read, quality of the read, and nature alignment of the read to a region of the genome

SAM format specification https://samtools.github.io/hts-specs/SAMv1.pdf

20 / 48

BAM file format of aligned data

BAM is the binary version of a SAM file. Smaller, but not easily readable
Compressed using lossless BGZF format
Other BAM compression strategies are a subject of research. See ‘CRAM’ format for example

http://www.internationalgenome.org/faq/what-are-cram-files/

21 / 48

BAM file format of aligned data

BAM files are usually indexed. An index is stored alongside the BAM file with a ".bai" extension
Indexing aims to achieve fast retrieval of alignments overlapping a specified region without going through the whole alignments.
BAM must be sorted before indexing. Depending on the downstream tools, sort by
- Name
- Coordinate

22 / 48

SAM example

23 / 48

SAM/BAM header section

Describes source of data, genome, method of alignment, etc.
Each section begins with ‘@’ followed by a two-letter record type code. These are followed by two-letter tags and values

@HD The header line
VN: format version
SO: Sorting order of alignments
@SQ Reference sequence dictionary
SN: reference sequence name
LN: reference sequence length
SP: species
@RG Read group
ID: read group identifier
CN: name of sequencing center
SM: sample name
@PG Program
PN: program name
VN: program version

24 / 48

SAM/BAM alignment section

https://samtools.github.io/hts-specs/SAMv1.pdf

25 / 48

Using SAM flags to filter subsets of reads

12 bitwise flags describing the alignment
"Bitwise" – values contributing to the sum are powers of two (each can be represented by a single positive bit in a binary number)
These flags are stored as a binary string of length 11
Value of "1" indicates the flag is set. e.g. 00100000000
All combinations can be represented as a number from 1 to 2048 (i.e. $2^{11} - 1$ ). This number is used in the BAM/SAM file. You can specify "required" or "filter" flags in samtools view command using the -f and -F options, respectively

26 / 48

Using SAM flags to filter subsets of reads

Unambiguous – there is exactly one possible decomposition of each FLAG sum
Each value contributing to the sum represents one fulfilled condition

https://broadinstitute.github.io/picard/explain-flags.html

https://samtools.github.io/hts-specs/SAMv1.pdf

27 / 48

SAM – bitwise FLAG (example)

28 / 48

SAM – bitwise FLAG (example)

FLAG 83

Binary numbers: FLAG 1010011 is a sum of 1000000, 10000, 10 and 1
- Easy to break down, but binary numbers are long
In decimal numbers: FLAG 83 is a sum of 64, 16, 2 and 1
- 83 = 64 + 19 = 64 + (16 + 3) = 64 + 16 + (2 + 1)
What does FLAG 83 mean?

29 / 48

CIGAR string

The CIGAR string is a sequence of base lengths and associated "operations" that are used to indicate which bases align to the reference (either a match or mismatch), are deleted, inserted, represent introns, etc.
81M859N19M - Read as: A 100 bp read consists of: 81 bases of alignment to reference, 859 bases skipped (an intron), 19 bases of alignment

https://samtools.github.io/hts-specs/SAMv1.pdf

30 / 48

Tools to work with SAM/BAM files

samtools - view, sort, index, QC, stats on SAM/BAM files, and more
sambamba - view, sort, index, merge, stats, mark duplicates. fast laternative to samtools
picard - QC, validation, duplicates removal and many more utility tools

https://github.com/samtools/samtools

https://lomereiter.github.io/sambamba/index.html

https://broadinstitute.github.io/picard/

31 / 48

SAMTOOLS Commands: Manipulating SAM/BAM

  -- Indexing
     dict - create a sequence dictionary file
     faidx - index/extract FASTA
     index - index alignment
  -- Editing
     calmd - recalculate MD/NM tags and '=' bases
     fixmate - fix mate information
     reheader - replace BAM header
     rmdup - remove PCR duplicates
     targetcut - cut fosmid regions (for fosmid pool only)
     addreplacerg - adds or replaces RG tags
  -- Viewing
     flags - explain BAM flags
     tview - text alignment viewer
     view - SAM<->BAM<->CRAM conversion
     depad - convert padded BAM to unpadded BAM

32 / 48

SAMTOOLS Commands: Manipulating SAM/BAM

  -- File operations
     collate - shuffle and group alignments by name
     cat - concatenate BAMs
     merge - merge sorted alignments
     mpileup - multi-way pileup
     sort - sort alignment file
     split - splits a file by read group
     quickcheck - quickly check if SAM/BAM/CRAM file appears intact
     fastq - converts a BAM to a FASTQ
     fasta - converts a BAM to a FASTA
  -- Statistics
     bedcov - read depth per BED region
     depth - compute the depth
     flagstat - simple stats
     idxstats - BAM index stats
     phase - phase heterozygotes
     stats - generate stats (former bamcheck)

http://quinlanlab.org/tutorials/samtools/samtools.html

33 / 48

BED format

Text-based, tab-separated list of genomic regions
Each region is specified by a reference sequence and the start and end positions on it
Optionally, each region can have additional properties defined – E.g. strand, name, score, color
Intended for visualizing genomic annotations in IGV, UCSC Genome Browser (context of expression, regulation, variation, conservation, ...)

http://genome.ucsc.edu/FAQ/FAQformat.html#format1

34 / 48

BED format

3 mandatory columns (must be in correct order)

"chrom" – chromosome
"chromStart" – the first base of the region with respect to the chromosome (counting starts from 0)
"chromEnd" – the first base after the region with respect to the chromosome
[chromStart, chromEnd) allows easy region-length calculation
Optional fields: "name", "score", "strand", other annotation columns

chr1  115263684 115263685 rs10489525  0 +
chr12 97434219  97434220  rs6538761   0 +
chr14 102360744 102360745 rs7142002   0 +
chr16 84213683  84213684  rs4150167   0 -
chr2  206086170 206086171 rs4675502   0 +
chr20 14747470  14747471  rs4141463   0 +

35 / 48

BED format – optional fields

9 additional optional fields, their order is binding (unlike with SAM format). If n optional fields are included, they will be considered to be the first n optional fields from the format specification
All regions must have the same optional fields (unlike with SAM format)

Most important optional fields:

"name" – name of the region
"score" – score value between 0 and 1000 (read-count, transformed p-value, "quality", ...)
Can be interpreted as shades of grey during visualization
"strand" – either "+" or "-" (not "1"/"-1")
BED12 format specification available

https://genome.ucsc.edu/FAQ/FAQformat.html#format1.7

36 / 48

Tools to work with BED files

bedtools - universal tools for manipulating genomic regions
bedops - complementary to bedtools, providing additional functionality and speedup

https://bedtools.readthedocs.io/en/latest/

37 / 48

GFF/GTF file format

Generic feature format for storing genomic annotation data
GFF - Generic Feature Format - describes gene-centric elements
GTF - Gene Transfer Format - similar to GFF, but more restrictive

GFF specifications: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
GTF specifications: http://www.gencodegenes.org/gencodeformat.html
More about file formats, http://journals.plos.org/ploscompbiol/article/file?type=supplementary&id=info:doi/10.1371/journal.pcbi.1004393.s008

38 / 48

GFF/GTF file format

Tab delimited text file (with optional header lines):
- contig (chromosome)
- source
- type
- start
- end
- score
- strand
- phase
- attributes

wget ftp://ftp.ensembl.org/pub/release-89/gtf/homo_sapiens/Homo_sapiens.GRCh38.89.gtf.gz

GTF gene annotations: http://www.ensembl.org/info/data/ftp/index.html

39 / 48

WIG formatText format to represent continuous signal - "wiggle" format
variable step format - for data with irregular intervals variableStep  chrom=chrN
[span=windowSize]
chromStartA  dataValueA
chromStartB  dataValueB
... etc ...  ... etc ...
fixed step format - for data with regular intervalsfixedStep  chrom=chrN
start=position  step=stepInterval
[span=windowSize]
dataValue1
dataValue2
... etc ...
40 / 48

Tools to work with WIG format

Wiggler - converts aligner reads to continuous WIG signal
WiggleTools - Basic operations on the space of numerical functions defined on the genome. Works with multiple genomic file formats. Operates on single file or pairs of files

https://genome.ucsc.edu/goldenpath/help/wiggle.html

https://code.google.com/archive/p/align2rawsignal/

https://github.com/Ensembl/WiggleTools

41 / 48

VCF format

Variant Call Format (VCF) is a flexible and extendable format for variation data

42 / 48

VCF formatCHROMO: chromosome / contig
POS: the reference position with the 1st base having position 1The value in POS refers to the position of the first base in the string. For indels, the reference string must include the base before the event (and this must be reflected in POS)

ID: an id; rs number if dbSNP variant
REF: reference base
ALT: comma sepearated list of alternate non-ref alleles called on at least one of the samplesIf no alternate alleles then the missing value should be used “.”

QUAL: phred-scaled quality score of the assertion made in ALT (whether variant or non-variant)
FILTER: PASS if the position has passed all filters (defined in meta-data)
INFO: additional information
43 / 48

VCF Format

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3137218/pdf/btr330.pdf

44 / 48

Mutation Annotation Format (MAF)

Tab-delimited
In the context of human cancer, MAF files come in two types--protected and somatic
- TCGA Protected MAFs include all sequenced mutations for the given technique, and these are further classified as either somatic mutations or other mutations grouped under germline mutations
- TCGA Somatic MAFs retain mutations validated in a secondary assay to ensure it is not due to sequencing error and of those that remain unvalidated retain mutations within protein coding regions or splice sites. All other mutations are filtered out

https://wiki.nci.nih.gov/display/TCGA/Mutation+Annotation+Format+%28MAF%29+Specification+-+v2.4

https://www.biostars.org/p/69222/

https://github.com/mskcc/vcf2maf

45 / 48

Formats use different coordinate systems

BED: 0-based, half-open
GFF: 1-based, closed
SAM: 1-based, closed
BAM: 0-based, half-open
VCF: 1-based, closed

https://www.biostars.org/p/84686/

46 / 48

Formats use different chromosome naming conventions

The UCSC and Ensembl databases name their chromosomes differently.
By convention, UCSC chromosomes start with chr while Ensembl chromosome names do not.
UCSC will call the fifth chromosome chr5 and Ensembl will call it 5

47 / 48

A comparison of genomics data types

NGS technology
Total bases
Compressed bytes
Equivalent size
Core hours to analyse 100 samples
Comments


Single-cell RNA sequencing
725 million
300 MB
50 MP3 songs
20
>100,000 such samples in SRA, >50,000 from humans

Bulk RNA sequencing
4 billion
2 GB
2 CD-ROMs
100
>400,000 such samples in SRA, >100,000 from humans

Human reference genome (GRCh38)
3 billion
800 MB
1 CD-ROM
NA


Whole-exome sequencing
9.5 billion
4.5 GB
1 DVD movie
4,000
~1,300 human samples from 1000 Genomes Project alone

Whole-genome sequencing of human DNA
75 billion
25 GB
1 Blu-ray movie
30,000
~18,000 human samples with 30x coverage from the TOPMed project alone


48 / 48

↑, ←, Pg Up, k	Go to previous slide
↓, →, Pg Dn, Space, j	Go to next slide
Home	Go to first slide
End	Go to last slide
Number + Return	Go to specific slide
b / m / f	Toggle blackout / mirrored / fullscreen mode
c	Clone slideshow
p	Toggle presenter mode
t	Restart the presentation timer
?, h	Toggle this help

NGS technology	Total bases	Compressed bytes	Equivalent size	Core hours to analyse 100 samples	Comments
Single-cell RNA sequencing	725 million	300 MB	50 MP3 songs	20	>100,000 such samples in SRA, >50,000 from humans
Bulk RNA sequencing	4 billion	2 GB	2 CD-ROMs	100	>400,000 such samples in SRA, >100,000 from humans
Human reference genome (GRCh38)	3 billion	800 MB	1 CD-ROM	NA
Whole-exome sequencing	9.5 billion	4.5 GB	1 DVD movie	4,000	~1,300 human samples from 1000 Genomes Project alone
Whole-genome sequencing of human DNA	75 billion	25 GB	1 Blu-ray movie	30,000	~18,000 human samples with 30x coverage from the TOPMed project alone

Genomic data formats

Mikhail Dozmorov

Virginia Commonwealth University

02-10-2021

The reference genome as a coordinate system

Binary and "plain text" files

Some common file formats

FASTA format

FASTA format

FASTA format

FASTA format

Orienting in FASTA file

chrY summary statistics

Low-complexity regions

The conversion of sequencing signal to a nucleotide sequence

Sequence ID

FASTQ quality scores (Phred scores)

FASTQ quality scores (Phred scores)

ASCII table

FASTQC - quality control

fastx toolkit: manipulating FASTQ and FASTA files

fastx toolkit: manipulating FASTQ and FASTA files

The Sequence Alignment/Map format (SAM)

BAM file format of aligned data

BAM file format of aligned data

SAM example

SAM/BAM header section

SAM/BAM alignment section

Using SAM flags to filter subsets of reads

Using SAM flags to filter subsets of reads

SAM – bitwise FLAG (example)

SAM – bitwise FLAG (example)

CIGAR string

Tools to work with SAM/BAM files

SAMTOOLS Commands: Manipulating SAM/BAM

SAMTOOLS Commands: Manipulating SAM/BAM

BED format

BED format

BED format – optional fields

Tools to work with BED files

GFF/GTF file format

GFF/GTF file format

WIG format

Tools to work with WIG format

VCF format

VCF format

VCF Format

Mutation Annotation Format (MAF)

Formats use different coordinate systems

Formats use different chromosome naming conventions

A comparison of genomics data types

The reference genome as a coordinate system

Help