+ - 0:00:00
Notes for current slide
Notes for next slide

Genomic data formats

Mikhail Dozmorov

Virginia Commonwealth University

02-10-2021

1 / 48

The reference genome as a coordinate system

2 / 48

Binary and "plain text" files

"Plain text" ("flat") file formats

  • Information often structured into lines and columns
  • Human-readable, processable with simple command-line tools
  • Space is often saved through the means of archiving (e.g. zip) and human-readable information coding (e.g. flags)

Binary file formats

  • Not human-readable (not intended for reading)
  • Require special software for processing (programs intended for plain text processing will not work properly on them, e.g. wc, grep)
  • Efficient storage, (significant) reduction to file size when compared to a plain text counterpart (e.g. 75 % space saved)
3 / 48

Some common file formats

  • FASTA - Simple collections of named DNA/protein sequences (text)

  • FASTQ - Extension of FASTA format, contains additional quality information. Widely used for storing unaligned sequencing reads (text)

  • SAM/BAM - Alignments of sequencing reads to a reference genome (text/binary)

  • BED - Region-based genome annotation information (e.g. a list of genes and their genomic locations) (text)

  • GFF/GTF - gene-centric annotations (text)

  • (big)WIG - continuous signal representation (text, binary)

  • VCF - variant call format, to store information about genomic variants (text)

4 / 48

FASTA format

  • Origin in 1980’s FASTP/FASTA software packages for sequence similarity searching

  • Convenient format for storing/transferring simple sequence collections

  • Intended for both, nucleotide and protein sequences (FAST- A, "Fast-All")

  • Common output format of sequence databases – Input format for various tools

5 / 48

FASTA format

Each sequence entry consists of:

  • a single line with sequence metadata, starting with the "greater-than" symbol (">")

  • any number of lines representing the actual sequence

>SRR493818.1.1 HWUSI-EAS519_0001_FC61TCNAAXX:2:1:960:20843 length=54
NGTAAATCCCGTTACCTCTCTGAGCCTCGGTNNCCNNNNNTGTAAAAAGGNNNN
>SRR493818.2.1 HWUSI-EAS519_0001_FC61TCNAAXX:2:1:961:7550 length=54
NACACTACAATGTAAAAGCTTGGCCTAACTTNNTTNNNNNGGCTGTTATTNNNN
6 / 48

FASTA format

The nucleic acid codes that can be found in FASTA file:

A --> adenosine M --> A C (amino)
C --> cytidine S --> G C (strong)
G --> guanine W --> A T (weak)
T --> thymidine B --> G T C
U --> uridine D --> G A T
R --> G A (purine) H --> A C T
Y --> T C (pyrimidine) V --> G C A
K --> G T (keto) N --> A G C T (any)
- gap of indeterminate length

http://zhanglab.ccmb.med.umich.edu/FASTA/

7 / 48

FASTA format

$ URL=http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chrY.fa.gz
$ wget $URL; mv chrY.fa.gz hg38.chrY.fa.gz
# or
$ curl $URL > hg38.chrY.fa.gz
# Ungzip
$ gzip -d hg38.chrY.fa.gz
# Peek into the file
$ head hg38.chrY.fa
$ tail hg38.chrY.fa
# How many lines
$ wc -l hg38.chrY.fa
# Look somewhere in the middle
$ head -n 10000 hg38.chrY.fa | tail
8 / 48

Orienting in FASTA file

  • Do not run:

    $ # grep > hg38.chrY.fa
  • What this command will do?

    $ grep ">" hg38.chrY.fa
  • What this command will do?

    $ grep -v ">" hg38.chrY.fa
9 / 48

chrY summary statistics

  • How long is chrY?

    $ grep -v ">" hg38.chrY.fa | grep -o "[ATCGatcg]" | wc -l
    26415043
  • How many adenosines are there?

    $ grep -v ">" hg38.chrY.fa | grep -o -i "A" | wc -l
    7886192
10 / 48

Low-complexity regions

  • Approximately 45% of the human genome can currently be recognized as being derived from transposable elements
  • The majority are non-long terminal repeat (LTR) retrotransposons, such as LINE-1 (L1), Alu and SVA elements
  • Other low-complexity biases: CG- and AT-rich DNA
  • Represented by lower-case letters
GATCAAAGTGTCATACAGTAACAGCCCAGACAGACGATAGGTATGGCAGa
aaagaaaaaaactaaaaaaaaaaaaaaaaaaaaaaaTCGCATGGGAAGTT
TCCCCGCCTCCTCTTTGGCCATTCTGTGCCCGGAGATCAAAGTTCTCATT
11 / 48

The conversion of sequencing signal to a nucleotide sequence

Nearly all sequencing technologies produce sequencing reads in FASTQ format

Sequence ID
Sequence
Separator
Quality scores

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

https://en.wikipedia.org/wiki/FASTQ_format

12 / 48

Sequence ID

13 / 48

FASTQ quality scores (Phred scores)

PHRED Base quality (Q) – integer value derived from the estimated probability (P) of the corresponding base being determined wrong

Q=10log10(Perr) (rounded to nearest integer)

  • The Ph in Phred comes from Phil Green, the inventor of the encoding

  • P might be computed in different ways, depending on the sequencing platform, but the meaning of Q remains the same

Peter J. A. Cock, Christopher J. Fields, Naohisa Goto, Michael L. Heuer, and Peter M. Rice, Nucleic Acids Res. 2010 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2847217/

http://www.gs.washington.edu/faculty/green.htm

14 / 48

FASTQ quality scores (Phred scores)

PHRED Base quality (Q) – integer value derived from the estimated probability (P) of the corresponding base being determined wrong

  • A higher quality score is better (>=30 is considered "good")

15 / 48

ASCII table

ASCII(char)=Q+33 - Phred + 33 encoding scheme is currently used

http://www.asciicharstable.com/_site_media/ascii/ascii-chars-table-landscape.pdf

16 / 48

FASTQC - quality control

  • Java program, HTML output

17 / 48

fastx toolkit: manipulating FASTQ and FASTA files

http://hannonlab.cshl.edu/fastx_toolkit/

18 / 48

fastx toolkit: manipulating FASTQ and FASTA files

fastx_quality_stats -h
fastx_quality_stats -Q 0 -i file.fastq -o fastx_report.txt
head fastx_report.txt
fastx_trimmer -h
fastx_trimmer -f 1 -l 100 -i file.fastq -o file_trimmed.fastq
19 / 48

The Sequence Alignment/Map format (SAM)

  • Intended for storing read alignments against reference sequences
  • Has a binary version with good software support (BAM format)

The SAM format consists of two sections:

Header section

  • Used to describe source of data, reference sequence, method of alignment, etc.

Alignment section

  • Used to describe the read, quality of the read, and nature alignment of the read to a region of the genome

SAM format specification https://samtools.github.io/hts-specs/SAMv1.pdf

20 / 48

BAM file format of aligned data

  • BAM is the binary version of a SAM file. Smaller, but not easily readable

  • Compressed using lossless BGZF format

  • Other BAM compression strategies are a subject of research. See ‘CRAM’ format for example

http://www.internationalgenome.org/faq/what-are-cram-files/

21 / 48

BAM file format of aligned data

  • BAM files are usually indexed. An index is stored alongside the BAM file with a ".bai" extension

  • Indexing aims to achieve fast retrieval of alignments overlapping a specified region without going through the whole alignments.

  • BAM must be sorted before indexing. Depending on the downstream tools, sort by

    • Name
    • Coordinate
22 / 48

SAM example

23 / 48

SAM/BAM header section

  • Describes source of data, genome, method of alignment, etc.
  • Each section begins with ‘@’ followed by a two-letter record type code. These are followed by two-letter tags and values
@HD The header line
VN: format version
SO: Sorting order of alignments
@SQ Reference sequence dictionary
SN: reference sequence name
LN: reference sequence length
SP: species
@RG Read group
ID: read group identifier
CN: name of sequencing center
SM: sample name
@PG Program
PN: program name
VN: program version
24 / 48

SAM/BAM alignment section

https://samtools.github.io/hts-specs/SAMv1.pdf

25 / 48

Using SAM flags to filter subsets of reads

  • 12 bitwise flags describing the alignment

  • "Bitwise" – values contributing to the sum are powers of two (each can be represented by a single positive bit in a binary number)

  • These flags are stored as a binary string of length 11

  • Value of "1" indicates the flag is set. e.g. 00100000000

  • All combinations can be represented as a number from 1 to 2048 (i.e. 2111). This number is used in the BAM/SAM file. You can specify "required" or "filter" flags in samtools view command using the -f and -F options, respectively

26 / 48

Using SAM flags to filter subsets of reads

  • Unambiguous – there is exactly one possible decomposition of each FLAG sum
  • Each value contributing to the sum represents one fulfilled condition

27 / 48

SAM – bitwise FLAG (example)

28 / 48

SAM – bitwise FLAG (example)

FLAG 83

  • Binary numbers: FLAG 1010011 is a sum of 1000000, 10000, 10 and 1
    • Easy to break down, but binary numbers are long
  • In decimal numbers: FLAG 83 is a sum of 64, 16, 2 and 1
    • 83 = 64 + 19 = 64 + (16 + 3) = 64 + 16 + (2 + 1)
  • What does FLAG 83 mean?

29 / 48

CIGAR string

  • The CIGAR string is a sequence of base lengths and associated "operations" that are used to indicate which bases align to the reference (either a match or mismatch), are deleted, inserted, represent introns, etc.
  • 81M859N19M - Read as: A 100 bp read consists of: 81 bases of alignment to reference, 859 bases skipped (an intron), 19 bases of alignment

https://samtools.github.io/hts-specs/SAMv1.pdf

30 / 48

Tools to work with SAM/BAM files

  • samtools - view, sort, index, QC, stats on SAM/BAM files, and more

  • sambamba - view, sort, index, merge, stats, mark duplicates. fast laternative to samtools

  • picard - QC, validation, duplicates removal and many more utility tools

https://github.com/samtools/samtools

https://lomereiter.github.io/sambamba/index.html

https://broadinstitute.github.io/picard/

31 / 48

SAMTOOLS Commands: Manipulating SAM/BAM

-- Indexing
dict - create a sequence dictionary file
faidx - index/extract FASTA
index - index alignment
-- Editing
calmd - recalculate MD/NM tags and '=' bases
fixmate - fix mate information
reheader - replace BAM header
rmdup - remove PCR duplicates
targetcut - cut fosmid regions (for fosmid pool only)
addreplacerg - adds or replaces RG tags
-- Viewing
flags - explain BAM flags
tview - text alignment viewer
view - SAM<->BAM<->CRAM conversion
depad - convert padded BAM to unpadded BAM
32 / 48

SAMTOOLS Commands: Manipulating SAM/BAM

-- File operations
collate - shuffle and group alignments by name
cat - concatenate BAMs
merge - merge sorted alignments
mpileup - multi-way pileup
sort - sort alignment file
split - splits a file by read group
quickcheck - quickly check if SAM/BAM/CRAM file appears intact
fastq - converts a BAM to a FASTQ
fasta - converts a BAM to a FASTA
-- Statistics
bedcov - read depth per BED region
depth - compute the depth
flagstat - simple stats
idxstats - BAM index stats
phase - phase heterozygotes
stats - generate stats (former bamcheck)

http://quinlanlab.org/tutorials/samtools/samtools.html

33 / 48

BED format

  • Text-based, tab-separated list of genomic regions

  • Each region is specified by a reference sequence and the start and end positions on it

  • Optionally, each region can have additional properties defined – E.g. strand, name, score, color

  • Intended for visualizing genomic annotations in IGV, UCSC Genome Browser (context of expression, regulation, variation, conservation, ...)

http://genome.ucsc.edu/FAQ/FAQformat.html#format1

34 / 48

BED format

3 mandatory columns (must be in correct order)

  • "chrom" – chromosome
  • "chromStart" – the first base of the region with respect to the chromosome (counting starts from 0)
  • "chromEnd" – the first base after the region with respect to the chromosome
  • [chromStart, chromEnd) allows easy region-length calculation
  • Optional fields: "name", "score", "strand", other annotation columns
chr1 115263684 115263685 rs10489525 0 +
chr12 97434219 97434220 rs6538761 0 +
chr14 102360744 102360745 rs7142002 0 +
chr16 84213683 84213684 rs4150167 0 -
chr2 206086170 206086171 rs4675502 0 +
chr20 14747470 14747471 rs4141463 0 +
35 / 48

BED format – optional fields

  • 9 additional optional fields, their order is binding (unlike with SAM format). If n optional fields are included, they will be considered to be the first n optional fields from the format specification
  • All regions must have the same optional fields (unlike with SAM format)

Most important optional fields:

  • "name" – name of the region
  • "score" – score value between 0 and 1000 (read-count, transformed p-value, "quality", ...)
  • Can be interpreted as shades of grey during visualization
  • "strand" – either "+" or "-" (not "1"/"-1")
  • BED12 format specification available

https://genome.ucsc.edu/FAQ/FAQformat.html#format1.7

36 / 48

Tools to work with BED files

  • bedtools - universal tools for manipulating genomic regions

  • bedops - complementary to bedtools, providing additional functionality and speedup

https://bedtools.readthedocs.io/en/latest/

https://bedtools.readthedocs.io/en/latest/

37 / 48

GFF/GTF file format

  • Generic feature format for storing genomic annotation data

  • GFF - Generic Feature Format - describes gene-centric elements

  • GTF - Gene Transfer Format - similar to GFF, but more restrictive

38 / 48

GFF/GTF file format

  • Tab delimited text file (with optional header lines):
    • contig (chromosome)
    • source
    • type
    • start
    • end
    • score
    • strand
    • phase
    • attributes
wget ftp://ftp.ensembl.org/pub/release-89/gtf/homo_sapiens/Homo_sapiens.GRCh38.89.gtf.gz

GTF gene annotations: http://www.ensembl.org/info/data/ftp/index.html

39 / 48

WIG format

  • Text format to represent continuous signal - "wiggle" format
  • variable step format - for data with irregular intervals
    variableStep chrom=chrN
    [span=windowSize]
    chromStartA dataValueA
    chromStartB dataValueB
    ... etc ... ... etc ...
  • fixed step format - for data with regular intervals
    fixedStep chrom=chrN
    start=position step=stepInterval
    [span=windowSize]
    dataValue1
    dataValue2
    ... etc ...
40 / 48

Tools to work with WIG format

  • Wiggler - converts aligner reads to continuous WIG signal

  • WiggleTools - Basic operations on the space of numerical functions defined on the genome. Works with multiple genomic file formats. Operates on single file or pairs of files

https://genome.ucsc.edu/goldenpath/help/wiggle.html

https://code.google.com/archive/p/align2rawsignal/

https://github.com/Ensembl/WiggleTools

41 / 48

VCF format

  • Variant Call Format (VCF) is a flexible and extendable format for variation data

42 / 48

VCF format

  • CHROMO: chromosome / contig
  • POS: the reference position with the 1st base having position 1
    • The value in POS refers to the position of the first base in the string. For indels, the reference string must include the base before the event (and this must be reflected in POS)
  • ID: an id; rs number if dbSNP variant
  • REF: reference base
  • ALT: comma sepearated list of alternate non-ref alleles called on at least one of the samples
    • If no alternate alleles then the missing value should be used “.”
  • QUAL: phred-scaled quality score of the assertion made in ALT (whether variant or non-variant)
  • FILTER: PASS if the position has passed all filters (defined in meta-data)
  • INFO: additional information
43 / 48

Mutation Annotation Format (MAF)

  • Tab-delimited
  • In the context of human cancer, MAF files come in two types--protected and somatic
    • TCGA Protected MAFs include all sequenced mutations for the given technique, and these are further classified as either somatic mutations or other mutations grouped under germline mutations
    • TCGA Somatic MAFs retain mutations validated in a secondary assay to ensure it is not due to sequencing error and of those that remain unvalidated retain mutations within protein coding regions or splice sites. All other mutations are filtered out

https://wiki.nci.nih.gov/display/TCGA/Mutation+Annotation+Format+%28MAF%29+Specification+-+v2.4

https://www.biostars.org/p/69222/

https://github.com/mskcc/vcf2maf

45 / 48

Formats use different coordinate systems

  • BED: 0-based, half-open
  • GFF: 1-based, closed
  • SAM: 1-based, closed
  • BAM: 0-based, half-open
  • VCF: 1-based, closed

https://www.biostars.org/p/84686/

46 / 48

Formats use different chromosome naming conventions

  • The UCSC and Ensembl databases name their chromosomes differently.

  • By convention, UCSC chromosomes start with chr while Ensembl chromosome names do not.

  • UCSC will call the fifth chromosome chr5 and Ensembl will call it 5

47 / 48

A comparison of genomics data types

NGS technology Total bases Compressed bytes Equivalent size Core hours to analyse 100 samples Comments
Single-cell RNA sequencing 725 million 300 MB 50 MP3 songs 20 >100,000 such samples in SRA, >50,000 from humans
Bulk RNA sequencing 4 billion 2 GB 2 CD-ROMs 100 >400,000 such samples in SRA, >100,000 from humans
Human reference genome (GRCh38) 3 billion 800 MB 1 CD-ROM NA
Whole-exome sequencing 9.5 billion 4.5 GB 1 DVD movie 4,000 ~1,300 human samples from 1000 Genomes Project alone
Whole-genome sequencing of human DNA 75 billion 25 GB 1 Blu-ray movie 30,000 ~18,000 human samples with 30x coverage from the TOPMed project alone
48 / 48

The reference genome as a coordinate system

2 / 48
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow