"Plain text" ("flat") file formats
Binary file formats
wc
, grep
)FASTA - Simple collections of named DNA/protein sequences (text)
FASTQ - Extension of FASTA format, contains additional quality information. Widely used for storing unaligned sequencing reads (text)
SAM/BAM - Alignments of sequencing reads to a reference genome (text/binary)
BED - Region-based genome annotation information (e.g. a list of genes and their genomic locations) (text)
GFF/GTF - gene-centric annotations (text)
(big)WIG - continuous signal representation (text, binary)
VCF - variant call format, to store information about genomic variants (text)
Origin in 1980’s FASTP/FASTA software packages for sequence similarity searching
Convenient format for storing/transferring simple sequence collections
Intended for both, nucleotide and protein sequences (FAST- A, "Fast-All")
Common output format of sequence databases – Input format for various tools
Each sequence entry consists of:
a single line with sequence metadata, starting with the "greater-than" symbol (">")
any number of lines representing the actual sequence
>SRR493818.1.1 HWUSI-EAS519_0001_FC61TCNAAXX:2:1:960:20843 length=54NGTAAATCCCGTTACCTCTCTGAGCCTCGGTNNCCNNNNNTGTAAAAAGGNNNN>SRR493818.2.1 HWUSI-EAS519_0001_FC61TCNAAXX:2:1:961:7550 length=54NACACTACAATGTAAAAGCTTGGCCTAACTTNNTTNNNNNGGCTGTTATTNNNN
The nucleic acid codes that can be found in FASTA file:
A --> adenosine M --> A C (amino)C --> cytidine S --> G C (strong)G --> guanine W --> A T (weak)T --> thymidine B --> G T CU --> uridine D --> G A TR --> G A (purine) H --> A C TY --> T C (pyrimidine) V --> G C AK --> G T (keto) N --> A G C T (any) - gap of indeterminate length
http://zhanglab.ccmb.med.umich.edu/FASTA/
$ URL=http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chrY.fa.gz$ wget $URL; mv chrY.fa.gz hg38.chrY.fa.gz# or$ curl $URL > hg38.chrY.fa.gz# Ungzip$ gzip -d hg38.chrY.fa.gz# Peek into the file$ head hg38.chrY.fa$ tail hg38.chrY.fa# How many lines$ wc -l hg38.chrY.fa# Look somewhere in the middle$ head -n 10000 hg38.chrY.fa | tail
Do not run:
$ # grep > hg38.chrY.fa
What this command will do?
$ grep ">" hg38.chrY.fa
What this command will do?
$ grep -v ">" hg38.chrY.fa
How long is chrY?
$ grep -v ">" hg38.chrY.fa | grep -o "[ATCGatcg]" | wc -l26415043
How many adenosines are there?
$ grep -v ">" hg38.chrY.fa | grep -o -i "A" | wc -l7886192
GATCAAAGTGTCATACAGTAACAGCCCAGACAGACGATAGGTATGGCAGaaaagaaaaaaactaaaaaaaaaaaaaaaaaaaaaaaTCGCATGGGAAGTTTCCCCGCCTCCTCTTTGGCCATTCTGTGCCCGGAGATCAAAGTTCTCATT
Nearly all sequencing technologies produce sequencing reads in FASTQ format
Sequence ID
Sequence
Separator
Quality scores
@SEQ_IDGATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT+!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
https://en.wikipedia.org/wiki/FASTQ_format
PHRED Base quality (Q) – integer value derived from the estimated probability (P) of the corresponding base being determined wrong
Q=−10∗log10(Perr) (rounded to nearest integer)
The Ph in Phred comes from Phil Green, the inventor of the encoding
P might be computed in different ways, depending on the sequencing platform, but the meaning of Q remains the same
Peter J. A. Cock, Christopher J. Fields, Naohisa Goto, Michael L. Heuer, and Peter M. Rice, Nucleic Acids Res. 2010 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2847217/
http://www.gs.washington.edu/faculty/green.htm
PHRED Base quality (Q) – integer value derived from the estimated probability (P) of the corresponding base being determined wrong
ASCII(char)=Q+33 - Phred + 33 encoding scheme is currently used
http://www.asciicharstable.com/_site_media/ascii/ascii-chars-table-landscape.pdf
fastx_quality_stats -hfastx_quality_stats -Q 0 -i file.fastq -o fastx_report.txthead fastx_report.txtfastx_trimmer -hfastx_trimmer -f 1 -l 100 -i file.fastq -o file_trimmed.fastq
The SAM format consists of two sections:
Header section
Alignment section
SAM format specification https://samtools.github.io/hts-specs/SAMv1.pdf
BAM is the binary version of a SAM file. Smaller, but not easily readable
Compressed using lossless BGZF format
Other BAM compression strategies are a subject of research. See ‘CRAM’ format for example
http://www.internationalgenome.org/faq/what-are-cram-files/
BAM files are usually indexed. An index is stored alongside the BAM file with a ".bai" extension
Indexing aims to achieve fast retrieval of alignments overlapping a specified region without going through the whole alignments.
BAM must be sorted before indexing. Depending on the downstream tools, sort by
@HD The header lineVN: format versionSO: Sorting order of alignments@SQ Reference sequence dictionarySN: reference sequence nameLN: reference sequence lengthSP: species@RG Read groupID: read group identifierCN: name of sequencing centerSM: sample name@PG ProgramPN: program nameVN: program version
12 bitwise flags describing the alignment
"Bitwise" – values contributing to the sum are powers of two (each can be represented by a single positive bit in a binary number)
These flags are stored as a binary string of length 11
Value of "1" indicates the flag is set. e.g. 00100000000
All combinations can be represented as a number from 1 to 2048 (i.e. 211−1). This number is used in the BAM/SAM file. You can specify "required" or "filter" flags in samtools view
command using the -f
and -F
options, respectively
FLAG 83
https://samtools.github.io/hts-specs/SAMv1.pdf
samtools - view, sort, index, QC, stats on SAM/BAM files, and more
sambamba - view, sort, index, merge, stats, mark duplicates. fast laternative to samtools
picard - QC, validation, duplicates removal and many more utility tools
https://github.com/samtools/samtools
https://lomereiter.github.io/sambamba/index.html
https://broadinstitute.github.io/picard/
-- Indexing dict - create a sequence dictionary file faidx - index/extract FASTA index - index alignment -- Editing calmd - recalculate MD/NM tags and '=' bases fixmate - fix mate information reheader - replace BAM header rmdup - remove PCR duplicates targetcut - cut fosmid regions (for fosmid pool only) addreplacerg - adds or replaces RG tags -- Viewing flags - explain BAM flags tview - text alignment viewer view - SAM<->BAM<->CRAM conversion depad - convert padded BAM to unpadded BAM
-- File operations collate - shuffle and group alignments by name cat - concatenate BAMs merge - merge sorted alignments mpileup - multi-way pileup sort - sort alignment file split - splits a file by read group quickcheck - quickly check if SAM/BAM/CRAM file appears intact fastq - converts a BAM to a FASTQ fasta - converts a BAM to a FASTA -- Statistics bedcov - read depth per BED region depth - compute the depth flagstat - simple stats idxstats - BAM index stats phase - phase heterozygotes stats - generate stats (former bamcheck)
http://quinlanlab.org/tutorials/samtools/samtools.html
Text-based, tab-separated list of genomic regions
Each region is specified by a reference sequence and the start and end positions on it
Optionally, each region can have additional properties defined – E.g. strand, name, score, color
Intended for visualizing genomic annotations in IGV, UCSC Genome Browser (context of expression, regulation, variation, conservation, ...)
http://genome.ucsc.edu/FAQ/FAQformat.html#format1
3 mandatory columns (must be in correct order)
chr1 115263684 115263685 rs10489525 0 +chr12 97434219 97434220 rs6538761 0 +chr14 102360744 102360745 rs7142002 0 +chr16 84213683 84213684 rs4150167 0 -chr2 206086170 206086171 rs4675502 0 +chr20 14747470 14747471 rs4141463 0 +
Most important optional fields:
https://genome.ucsc.edu/FAQ/FAQformat.html#format1.7
bedtools - universal tools for manipulating genomic regions
bedops - complementary to bedtools
, providing additional functionality and speedup
https://bedtools.readthedocs.io/en/latest/
https://bedtools.readthedocs.io/en/latest/
Generic feature format for storing genomic annotation data
GFF - Generic Feature Format - describes gene-centric elements
GTF - Gene Transfer Format - similar to GFF, but more restrictive
GFF specifications: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
GTF specifications: http://www.gencodegenes.org/gencodeformat.html
More about file formats, http://journals.plos.org/ploscompbiol/article/file?type=supplementary&id=info:doi/10.1371/journal.pcbi.1004393.s008
wget ftp://ftp.ensembl.org/pub/release-89/gtf/homo_sapiens/Homo_sapiens.GRCh38.89.gtf.gz
GTF gene annotations: http://www.ensembl.org/info/data/ftp/index.html
variableStep chrom=chrN[span=windowSize]chromStartA dataValueAchromStartB dataValueB... etc ... ... etc ...
fixedStep chrom=chrNstart=position step=stepInterval[span=windowSize]dataValue1dataValue2... etc ...
Wiggler - converts aligner reads to continuous WIG signal
WiggleTools - Basic operations on the space of numerical functions defined on the genome. Works with multiple genomic file formats. Operates on single file or pairs of files
https://genome.ucsc.edu/goldenpath/help/wiggle.html
https://code.google.com/archive/p/align2rawsignal/
https://github.com/Ensembl/WiggleTools
https://wiki.nci.nih.gov/display/TCGA/Mutation+Annotation+Format+%28MAF%29+Specification+-+v2.4
https://www.biostars.org/p/69222/
https://github.com/mskcc/vcf2maf
https://www.biostars.org/p/84686/
The UCSC and Ensembl databases name their chromosomes differently.
By convention, UCSC chromosomes start with chr while Ensembl chromosome names do not.
UCSC will call the fifth chromosome chr5 and Ensembl will call it 5
NGS technology | Total bases | Compressed bytes | Equivalent size | Core hours to analyse 100 samples | Comments |
---|---|---|---|---|---|
Single-cell RNA sequencing | 725 million | 300 MB | 50 MP3 songs | 20 | >100,000 such samples in SRA, >50,000 from humans |
Bulk RNA sequencing | 4 billion | 2 GB | 2 CD-ROMs | 100 | >400,000 such samples in SRA, >100,000 from humans |
Human reference genome (GRCh38) | 3 billion | 800 MB | 1 CD-ROM | NA | |
Whole-exome sequencing | 9.5 billion | 4.5 GB | 1 DVD movie | 4,000 | ~1,300 human samples from 1000 Genomes Project alone |
Whole-genome sequencing of human DNA | 75 billion | 25 GB | 1 Blu-ray movie | 30,000 | ~18,000 human samples with 30x coverage from the TOPMed project alone |
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |