First steps with NGS data

class: center, middle, inverse, title-slide

# First steps with NGS data
## DUBii - Module 5
### Valentin Loux - Olivier Rué
### 2020-03-09

---

# Program

- Introduction (5 min)
- Get data from public resources (30 min)
- FASTQ format
- Quality control (45 min)
- Cleaning of reads (30 min)
- Mapping of reads (30 min)
- FASTA format
- SAM format

---
<img src="images/TP.png" class="handson">
# Preparation of your working directory

## Instruction

- Go in your home directory
- Create a directory called M5 (i.e. Module5) and move into it
- Create this directory structure:

```bash
cd ~/M5
tree .
.
├── CLEANING
├── FASTQ
├── MAPPING
└── QC

4 directories, 0 files
```

---
<img src="images/TP.png" class="handson">
## Correction

```bash
mkdir -p ~/M5/FASTQ
mkdir -p ~/M5/CLEANING
mkdir -p ~/M5/MAPPING
mkdir -p ~/M5/QC
cd ~/M5
```

---
class: heading-slide, middle, center
# The Data

---

# What is data

## Definition

- `Data` is <i>a symbolic representation of information</i>
- `Data` is stored in files whose format makes it easy to access and manipulate
- `Data` represents the knowledge at a given time.

## Properties

- The same information may be represented in different formats
- The content depends on technologies

<div class="alert comment">Understanding data formats, what information is encoded in each, and when it
is appropriate to use one format over another is an essential skill of a bioinformatician.</div>

---

# Genomics sequences resources

The International Nucleotide Sequence Database Collaboration (INSDC) is a long-standing foundational initiative that operates between DDBJ, EMBL-EBI and NCBI. INSDC covers the spectrum of data from raw reads, through alignments and assemblies, to functional annotation, enriched with contextual information relating to samples and experimental configurations.

<div class="figure" style="text-align: center">
<img src="images/public_resources.png" alt="INSDC resources" width="70%" />
<p class="caption">INSDC resources</p>
</div>

---
# International Nucleotide Sequence Database Collaboration

The member organizations of this collaboration are:
- NCBI: National Center for Biotechnology Information
- EMBL: European Molecular Biology Laboratory
- DDBJ: DNA Data Bank of Japan

The INSDC has set up rules on the types of data that will be mirrored. The most important
of these from a bioinformatician’s perspective are:
- GenBank (NCBI) / ENA (EMBL-EBI) contain all annotated and identified DNA sequence information
- SRA ([NCBI Sequence Read Archive](https://trace.ncbi.nlm.nih.gov/Traces/sra/)) / ENA ([European Nucleotide Archive](https://www.ebi.ac.uk/ena/browser/search)) contain measurements from high-throughput sequencing experiments (raw data)

Depositing raw sequencing data and processed (analyzed) data is, most of the time, a prerequisite for publication.

---
# Other sequence resources

## NAR Database Issue

Once a year the journal Nucleic Acids Research publishes its so-called “database issue”. Each
article in this issue provides an overview of a generic or specific
database, written by the maintainers of that resource.
- View the NAR: 2019 Database Issue.

<div class="figure" style="text-align: center">
<img src="images/NAR_db.png" alt="NAR 2019 database issue overview" width="50%" />
<p class="caption">NAR 2019 database issue overview</p>
</div>

---
class: heading-slide, middle, center
# Getting raw data

---
# Getting raw data

## Sequencing data

- Specialized tools or APIs are offered by the public repositories to easily fetch data locally
- ENA: enaBrowserTools (command line, python, R)
- NCBI: sra-toolkit (command line, python, R)

Generic command-line tools (e.g. wget) can also be used most of the time

---
<img src="images/TP.png" class="handson">
# Hands-on: Getting raw data

## Instruction

Get the raw short-read data (Illumina) associated with this article <a name=cite-Allue-Guardiae01052-18></a>([Allué-Guardia, Nyong, Koenig, Vargas, Bono, and Eppinger, 2019](https://mra.asm.org/content/8/2/e01052-18)).

- In the "Data availability" section, extract the accession for Illumina data : SRX4909245
- Explore [SRA](https://www.ncbi.nlm.nih.gov/sra/SRX4909245) and [ENA](https://www.ebi.ac.uk/ena/browser/view/SRX4909245)

Get the data by the method of your choice:
- use <code>wget</code> or <code>fasterq-dump</code> from <code>sra-tools</code>

Compress FASTQ files with <code>gzip</code>

---
<img src="images/TP.png" class="handson">
## Correction

- Direct download via the web browser
- Using wget :

```bash
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR808/003/SRR8082143/SRR8082143_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR808/003/SRR8082143/SRR8082143_2.fastq.gz
```
- Using sra-toolkit

```bash
module load  sra-tools
srun fasterq-dump -S -p SRR8082143 --outdir . --threads 1
```
- enaBrowserTools are also available

- Compress FASTQ files

```bash
gzip *.fastq
```

```bash
ls -ltrh ~/M5/FASTQ/
total 236M
-rw-rw-r-- 1 orue orue 127M  6 mars  12:32 SRR8082143_2.fastq.gz
-rw-rw-r-- 1 orue orue 109M  6 mars  12:32 SRR8082143_1.fastq.gz

```
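
A quick sanity check after downloading is to count the reads in each file: a FASTQ record usually spans four lines, so the number of reads is the line count divided by four. The sketch below assumes unwrapped, four-line records; `count_reads` and the toy file are illustrative names, not part of the course material.

```bash
# Count reads in a gzip-compressed FASTQ file.
# Assumption: every record spans exactly 4 lines (no wrapped sequences).
count_reads() {
  local lines
  lines=$(gzip -dc "$1" | wc -l)
  echo $(( lines / 4 ))
}

# Toy demonstration on a two-read file created in a temporary directory
tmpdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip > "$tmpdir/toy.fastq.gz"
count_reads "$tmpdir/toy.fastq.gz"
```

On the hands-on data, running `count_reads` on the `_1` and `_2` files should report the same number, since the reads are paired.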

---
# Sequencing  - Vocabulary

.pull-left[
**Read** :  piece of sequenced DNA

**DNA fragment** = 1 or more reads, depending on whether the sequencing is single-end or paired-end

**Insert** = Fragment size

**Depth** = `$N*L/G$`, where N = number of reads, L = read length, G = genome size

**Coverage** = % of genome covered
]
.pull-right[
<img src="images/se-pe.png" width="80%" style="display: block; margin: auto;" />

<div class="figure" style="text-align: center">
<img src="images/depth-breadth.png" alt="Single-End , Paired-End" width="80%" />
<p class="caption">Single-End , Paired-End</p>
</div>

]
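
As a worked example of the depth formula, the sketch below uses hypothetical numbers (2,000,000 reads of 150 bases on a 4.6 Mb genome); they are not taken from the hands-on dataset, and `depth` is an illustrative name.

```bash
# Depth = N * L / G, computed with awk for floating-point arithmetic.
# The values passed below are hypothetical, chosen only for illustration.
depth() {
  awk -v n="$1" -v l="$2" -v g="$3" 'BEGIN { printf "%.1f\n", n * l / g }'
}

depth 2000000 150 4600000   # number of reads, read length, genome size
```

With these numbers the expected depth is about 65X.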
---
class: heading-slide, middle, center
# FASTQ format

---

# FASTQ syntax

The FASTQ format is the de facto standard by which all sequencing instruments represent
data. It may be thought of as a variant of the FASTA format that associates a
quality measure with each sequence base:   **FASTA with QUALITIES**.

The FASTQ format consists of 4 sections:
1. A FASTA-like header, but instead of the <code>></code> symbol it uses the <code>@</code> symbol. This is followed
by an ID and more optional text, similar to the FASTA headers.
2. The second section contains the measured sequence (typically on a single line), but it
may be wrapped until the <code>+</code> sign starts the next section.
3. The third section is marked by the <code>+</code> sign and may be optionally followed by the same
sequence ID and header as the first section.
4. The last section encodes the quality values for the sequence in section 2, and must be of
the same length as section 2.

<i>Example</i>

```bash
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
```
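
The four-section layout can be checked mechanically. The sketch below assumes unwrapped records (exactly four lines each) and reports, for each record, the ID and the sequence length, flagging any record whose quality string has a different length. `check_fastq` is an illustrative name.

```bash
# Check 4-line FASTQ records read from standard input.
# Assumption: sequences are not wrapped (one record = 4 lines).
check_fastq() {
  awk '
    NR % 4 == 1 { id = substr($1, 2) }   # header line, drop the leading @
    NR % 4 == 2 { seq = $0 }             # sequence line
    NR % 4 == 0 {                        # quality line closes the record
      status = (length(seq) == length($0)) ? "OK" : "LENGTH_MISMATCH"
      printf "%s\t%d\t%s\n", id, length(seq), status
    }
  '
}

check_fastq <<'EOF'
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
EOF
```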

---

# FASTQ quality

The odd-looking characters in the 4th section are encoded numerical values.
In a nutshell, each character represents a numerical value: a so-called Phred score,
encoded as a single character.

```bash
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
|    |    |    |    |    |    |    |    |
0....5...10...15...20...25...30...35...40
|    |    |    |    |    |    |    |    |
worst................................best
```

The quality characters, as they appear in FASTQ files, are on the top line. The numbers in the middle of the scale,
from 0 to 40, are called Phred scores. They represent error probabilities via the formula:

`$Error = 10^{-P/10}$`

It is basically summarized as:

- P=0 means 1/1 (100% probability of error)
- P=10 means 1/10 (10% probability of error)
- P=20 means 1/100 (1% probability of error)
- P=30 means 1/1000 (0.1% probability of error)
- P=40 means 1/10000 (0.01% probability of error)
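
The conversion from a quality character to a Phred score and an error probability can be sketched in a few lines of shell, assuming the encoding of the scale above, where `!` (ASCII code 33) stands for a score of 0. The function names are illustrative.

```bash
# Phred score of one quality character, assuming an ASCII offset of 33
phred_of_char() {
  local code
  printf -v code '%d' "'$1"   # a leading quote makes printf return the character code
  echo $(( code - 33 ))
}

# Error probability for a Phred score P: 10^(-P/10)
error_prob() {
  awk -v p="$1" 'BEGIN { printf "%g\n", 10 ^ (-p / 10) }'
}

for c in '!' '5' 'I'; do
  q=$(phred_of_char "$c")
  echo "$c -> P=$q, error=$(error_prob "$q")"
done
```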

---

# FASTQ quality encoding specificities

There was a time when instrumentation makers could not decide at what
character to start the scale. The **current standard** shown above is the so-called Sanger (+33)
format where the ASCII codes are shifted by 33. There is the so-called +64 format that
starts close to where the other scale ends.

<div class="figure" style="text-align: center">
<img src="images/qualityscore.png" alt="FASTQ encoding values" width="80%" />
<p class="caption">FASTQ encoding values</p>
</div>
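
Because the two scales only partially overlap, the lowest character code seen in the quality strings gives a hint about the encoding. The following is a heuristic sketch, not a validated detector: codes below 59 can only occur with the +33 offset, while a quality string whose codes are all 64 or above is probably (but not certainly) +64. `guess_offset` is an illustrative name.

```bash
# Guess the quality offset (33 or 64) from one quality string.
# Heuristic only: high-quality +33 data can be misreported as 64.
guess_offset() {
  local qual=$1 min=127 code i
  for (( i = 0; i < ${#qual}; i++ )); do
    printf -v code '%d' "'${qual:i:1}"   # ASCII code of the i-th character
    if (( code < min )); then min=$code; fi
  done
  if (( min < 59 )); then
    echo 33
  elif (( min >= 64 )); then
    echo 64
  else
    echo unknown
  fi
}

guess_offset '!5I'    # contains characters that exist only in the +33 scale
guess_offset 'hhhh'   # only high character codes
```

In practice the whole file (or a large sample of it) should be scanned, since a single read may not contain any low-quality base.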

---

# FASTQ toolbox

## seqtk

Seqtk <a name=cite-li2012seqtk></a>([Li, 2012](#bib-li2012seqtk)) is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files which can also be optionally compressed by gzip.

```bash
module load seqtk
seqtk

Usage:   seqtk <command> <arguments>
Version: 1.3-r106

Command: seq       common transformation of FASTA/Q
         comp      get the nucleotide composition of FASTA/Q
         sample    subsample sequences
         subseq    extract subsequences from FASTA/Q
         fqchk     fastq QC (base/quality summary)
         mergepe   interleave two PE FASTA/Q files
         trimfq    trim FASTQ using the Phred algorithm
         hety      regional heterozygosity
         gc        identify high- or low-GC regions
         mutfa     point mutate FASTA at specified positions
         mergefa   merge two FASTA/Q files
         famask    apply a X-coded FASTA to a source FASTA
         dropse    drop unpaired from interleaved PE FASTA/Q
         rename    rename sequence names
         randbase  choose a random base from hets
         cutN      cut sequence at long N
         listhet   extract the position of each het
```

---

# FASTQ header information

Information is often encoded in the “free” text section of a FASTQ file.

<code>@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG</code> contains the following information:

- <code>EAS139</code>: the unique instrument name
- <code>136</code>: the run id
- <code>FC706VJ</code>: the flowcell id
- <code>2</code>: flowcell lane
- <code>2104</code>: tile number within the flowcell lane
- <code>15343</code>: ‘x’-coordinate of the cluster within the tile
- <code>197393</code>: ‘y’-coordinate of the cluster within the tile
- <code>1</code>: the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
- <code>Y</code>: Y if the read is filtered, N otherwise
- <code>18</code>: 0 when none of the control bits are on, otherwise it is an even number
- <code>ATCACG</code>: index sequence

This information is specific to a particular instrument/vendor and may change with different
versions or releases of that instrument.
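
As an illustration, the fields listed above can be split out with plain shell string handling. This sketch assumes the header follows exactly the layout of the example; headers from other instruments or software versions may need a different parser. `parse_illumina_header` is an illustrative name.

```bash
# Split an Illumina-style FASTQ header into named fields.
# Assumption: the header follows exactly the layout described above.
parse_illumina_header() {
  local header=${1#@}        # drop the leading @
  local id=${header%% *}     # text before the first space
  local desc=${header#* }    # text after the first space
  local instrument run flowcell lane tile x y member filtered control index
  IFS=: read -r instrument run flowcell lane tile x y <<< "$id"
  IFS=: read -r member filtered control index <<< "$desc"
  echo "instrument=$instrument run=$run flowcell=$flowcell lane=$lane tile=$tile"
  echo "x=$x y=$y member=$member filtered=$filtered control=$control index=$index"
}

parse_illumina_header '@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG'
```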

---
class: heading-slide, middle, center
# Quality control
---

## Why QC your reads?

**Try to answer some (not always) simple questions:**
--

- Do the generated sequences conform to the expected level of performance?
  - Size
  - Number of reads
  - Quality
- Is there a residual presence of adapters or indexes?
- Are there (un)expected technical biases?
- Are there (un)expected biological biases?

<div class="alert comment"><i class="fas  fa-exclamation-circle "></i> Quality control without context leads to misinterpretation</div>

---
# Quality control for FASTQ files

- FastQC <a name=cite-fastqc></a>([Andrews, 2010](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
  - QC for (Illumina) FastQ files
  - Command line fastqc or graphical interface
  - Complete HTML report to spot problems originating from the sequencer, library preparation or contamination
  - Summary graphs and tables to quickly assess your data

<div class="figure" style="text-align: center">
<img src="images/fastqc.png" alt="FastQC software" width="40%" />
<p class="caption">FastQC software</p>
</div>

---
<img src="images/TP.png" class="handson">
# Hands-on : Quality control

## Instruction

- Launch FastQC on the paired-end FastQ files of the sample you previously downloaded
- Inspect the results
  - Are the numbers consistent with the article?
  - Comment on the quality of the sequencing

---
<img src="images/TP.png" class="handson">
## Correction

```bash
cd ~/M5
module load fastqc
srun --cpus-per-task 8 fastqc FASTQ/SRR8082143_1.fastq.gz -o QC/ -t 8
srun --cpus-per-task 8 fastqc FASTQ/SRR8082143_2.fastq.gz -o QC/ -t 8
```

```bash
ls -ltrh ~/M5/QC
total 1,9M
-rw-rw-r-- 1 orue orue 321K  6 mars  13:23 SRR8082143_1_fastqc.zip
-rw-rw-r-- 1 orue orue 642K  6 mars  13:23 SRR8082143_1_fastqc.html
-rw-rw-r-- 1 orue orue 333K  6 mars  13:23 SRR8082143_2_fastqc.zip
-rw-rw-r-- 1 orue orue 642K  6 mars  13:23 SRR8082143_2_fastqc.html
```
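
Each FastQC archive contains a `summary.txt` file with one tab-separated line per analysis module (status, module name, file name). The sketch below tallies the statuses; the three sample lines are made up for illustration and are not the results obtained for SRR8082143. `tally_fastqc` is an illustrative name.

```bash
# Tally PASS / WARN / FAIL statuses from a FastQC summary.txt read on stdin
tally_fastqc() {
  awk -F '\t' '
    { n[$1]++ }
    END { printf "PASS=%d WARN=%d FAIL=%d\n", n["PASS"], n["WARN"], n["FAIL"] }
  '
}

# Made-up sample lines in the summary.txt layout
printf 'PASS\tBasic Statistics\tsample.fastq.gz\nWARN\tPer base sequence content\tsample.fastq.gz\nPASS\tAdapter Content\tsample.fastq.gz\n' | tally_fastqc
```

On the hands-on output, and assuming the usual archive layout, the summary can be streamed without unpacking the archive, for example with `unzip -p QC/SRR8082143_1_fastqc.zip SRR8082143_1_fastqc/summary.txt | tally_fastqc`.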

---
class: heading-slide, middle, center
# Reads cleaning
---

## Objectives

- Detect and remove sequencing adapters (still) present in the FastQ files
- Filter / trim reads according to quality (as plotted in FastQC)

## Tools

- Simple & fast : Sickle <a name=cite-sickle></a>([Joshi and Fass, 2011](#bib-sickle)) (quality), cutadapt <a name=cite-cutadapt></a>([Martin, 2011](#bib-cutadapt)) (adapter removal)
- Ultra-configurable : Trimmomatic 
- All in one & ultra-fast : fastp <a name=cite-fastp></a>([Zhou, Chen, Chen, and Gu, 2018](https://dx.doi.org/10.1093/bioinformatics/bty560))

<div class="figure" style="text-align: center">
<img src="images/fastp_wkwf.png" alt="fastp workflow" width="55%" />
<p class="caption">fastp workflow</p>
</div>

---
<img src="images/TP.png" class="handson">
#  Hands-on : reads cleaning with fastp

## Instruction

- Launch fastp on the paired-end FastQ files of the sample you previously downloaded
  - Detect and Remove the classical Illumina adapters
  - Filter reads with :
      - mean quality >= 20 on a sliding window of 4
      -  40% of the bases with a quality >= 15
      - length of the trimmed read >= 100

- Inspect the results
  - How many reads are filtered ?
  - Where does fastp store its reports? Is this configurable?

---
<img src="images/TP.png" class="handson">
## Correction

```bash
module load fastp
cd ~/M5
# Note: in fastp, -w sets the number of worker threads, whereas -t trims bases from the tail of read 1
srun --cpus-per-task 8 fastp \
 --in1 FASTQ/SRR8082143_1.fastq.gz \
 --in2 FASTQ/SRR8082143_2.fastq.gz \
 -l 100 \
 --out1 CLEANING/SRR8082143_1.cleaned_filtered.fastq.gz \
 --out2 CLEANING/SRR8082143_2.cleaned_filtered.fastq.gz \
 --unpaired1 CLEANING/SRR8082143_singles.fastq.gz \
 --unpaired2 CLEANING/SRR8082143_singles.fastq.gz \
 -w 8 \
 -j CLEANING/fastp.json \
 -h CLEANING/fastp.html
```

```bash
ls -ltrh ~/M5/CLEANING/
total 245M
-rw-rw-r-- 1 orue orue 113M  6 mars  12:59 SRR8082143_1.cleaned_filtered.fastq.gz
-rw-rw-r-- 1 orue orue 162K  6 mars  12:59 fastp.json
-rw-rw-r-- 1 orue orue 525K  6 mars  12:59 fastp.html
-rw-rw-r-- 1 orue orue 2,2M  6 mars  12:59 SRR8082143_singles.fastq.gz
-rw-rw-r-- 1 orue orue 130M  6 mars  12:59 SRR8082143_2.cleaned_filtered.fastq.gz
```

---

# One report to rule them all 
.pull-left[
MultiQC <a name=cite-multiqc></a>([Ewels, Magnusson, Lundin, and Käller, 2016](#bib-multiqc)) allows the aggregation of individual reports from FastQC, fastp, Trimmomatic, Cutadapt and many more
- 78 tools included
- Aggregates all analyses in one report :
  - by tool
  - in one graph aggregating samples
  ]
.pull-right[
<div class="figure" style="text-align: center">
<img src="images/multiqc-tools.png" alt="MultiQC tools" width="30%" />
<p class="caption">MultiQC tools</p>
</div>
]

<div class="figure" style="text-align: center">
<img src="images/multiqc-example.png" alt="MultiQC Report Example" width="35%" />
<p class="caption">MultiQC Report Example</p>
</div>

---
<img src="images/TP.png" class="handson">
# Hands-on: MultiQC

## Instruction

Run MultiQC to obtain a report with fastqc and fastp results

## Correction

```bash
cd ~/M5
module load multiqc
multiqc -d . -o CLEANING
```

```bash
ls -ltrh ~/M5/CLEANING/
total 248M
-rw-rw-r-- 1 orue orue 113M  6 mars  12:59 SRR8082143_1.cleaned_filtered.fastq.gz
-rw-rw-r-- 1 orue orue 162K  6 mars  12:59 fastp.json
-rw-rw-r-- 1 orue orue 525K  6 mars  12:59 fastp.html
-rw-rw-r-- 1 orue orue 2,2M  6 mars  12:59 SRR8082143_singles.fastq.gz
-rw-rw-r-- 1 orue orue 130M  6 mars  12:59 SRR8082143_2.cleaned_filtered.fastq.gz
-rw-rw-r-- 1 orue orue 1,2M  6 mars  13:28 multiqc_report.html
drwxrwxr-x 2 orue orue 2,0M  6 mars  13:28 multiqc_data
```

---
class: heading-slide, middle, center
# Mapping

---
# Mapping

- Mapping short reads to a reference genome means predicting the locus each read comes from.
- The result of a mapping is the list of the most probable regions, each with an associated probability.

<div class="alert comment"><i class="fas  fa-exclamation-circle "></i> But what is a reference?</div>

---
# Reference

A reference can be anything containing DNA sequence information:
- Complete genome
- Assembly
- Set of contigs
- Set of sequences
- Genes, non-coding RNA...

For mapping, references have to be stored in a <code>FASTA</code> file.

---
class: heading-slide, middle, center
# FASTA format

---
# Information inside

The FASTA format is used to represent sequence information. The format is very simple:
- A <code>></code> symbol on the FASTA header line indicates a fasta record start.
- A string of letters called the sequence id may follow the <code>></code> symbol.
- The header line may contain an arbitrary amount of text (including spaces) on the
same line.
- Subsequent lines contain the sequence.

<i>Example</i>

```bash
>foo
ATGCC
>bar other optional text could go here
CCGTA
>bidou
ACTGCAGT
TTCGN
>repeatmasker
ATGTGTcggggggATTTT
>prot2; my_favourite_prot
MTSRRSVKSGPREVPRDEYEDLYYTPSSGMASP
```
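
To make the record structure concrete, the sketch below lists each sequence ID with its total length, summing over wrapped lines. It takes the ID to be the text between `>` and the first whitespace, which is a common convention rather than a rule of the format. `fasta_lengths` is an illustrative name.

```bash
# Print the ID and total sequence length of each FASTA record read on stdin
fasta_lengths() {
  awk '
    /^>/ {
      if (id != "") print id "\t" len   # close the previous record
      id = substr($1, 2); len = 0       # ID = first word without the > symbol
      next
    }
    { len += length($0) }               # sequence lines may be wrapped
    END { if (id != "") print id "\t" len }
  '
}

fasta_lengths <<'EOF'
>foo
ATGCC
>bar other optional text could go here
CCGTA
>bidou
ACTGCAGT
TTCGN
EOF
```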

---

# FASTA syntax

The lack of a definition of the FASTA format and its apparent simplicity can be a source of
some of the most confounding errors in bioinformatics. Since the format appears so exceedingly
straightforward, software developers have been tacitly assuming that the properties
they are accustomed to are required by some standard - whereas no such thing exists.

## Common problems

- Some tools need 60 characters per line
- Some tools ignore anything following the first space in the header line
- Some tools are very restrictive on the alphabet used
- Some tools require uppercase letters

---

# FASTA formatting

## Good practices

The sequence lines should always wrap at the same width (with the exception of the
last line). Some tools will fail to operate correctly and may not even warn the users if
this condition is not satisfied. The following is technically a valid FASTA but it may
cause various subtle problems.

```bash
>foo
ATGCATGCATGCATGCATGC
ATGCATGCA
TGATGCATGCATGCATGCA
```

should be reformatted to

```bash
>foo
ATGCATGCATGCATGCATGC
ATGCATGCATGATGCATGCA
TGCATGCA
```

<i>This can easily be done with seqkit <a name=cite-shen2016seqkit></a>([Shen, Le, Li, and Hu, 2016](#bib-shen2016seqkit))</i>

```bash
seqkit seq -w 60 seqs.fa > seqs2.fa
```
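
Where seqkit is not installed, the same re-wrapping can be approximated with awk. This is a minimal sketch (`wrap_fasta` is an illustrative name) that joins the lines of each record and re-splits them at a fixed width; it holds one full sequence in memory at a time.

```bash
# Re-wrap FASTA sequences read on stdin to a fixed line width (default 60)
wrap_fasta() {
  local width=${1:-60}
  awk -v w="$width" '
    function emit(   i) {                 # print the buffered sequence in chunks of w
      for (i = 1; i <= length(seq); i += w) print substr(seq, i, w)
      seq = ""
    }
    /^>/ { emit(); print; next }          # new record: flush the previous one
    { seq = seq $0 }                      # accumulate sequence lines
    END { emit() }
  '
}

# The irregularly wrapped record from the slide above, re-wrapped at 20 columns
printf '>foo\nATGCATGCATGCATGCATGC\nATGCATGCA\nTGATGCATGCATGCATGCA\n' | wrap_fasta 20
```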

---

# FASTA Header

Some data repositories will format FASTA headers to include structured information. Tools
may operate differently when this information is present in the FASTA header. Below is a list of the
recognized FASTA header formats.

<div class="figure" style="text-align: center">
<img src="images/FASTA_headers.png" alt="FASTA header examples" width="50%" />
<p class="caption">FASTA header examples</p>
</div>

---
class: heading-slide, middle, center
# Alignment

---
# Alignment strategies

```bash
GAAGCTCTAGGATTACGATCTTGATCGCCGGGAAATTATGATCCTGACCTGAGTTTAAGGCATGGACCCATAA
                 ATCTTGATCGCCGAC----ATT              # GLOBAL
                 ATCTTGATCGCCGACATT                  # LOCAL, with soft clipping
```

## Global alignment

Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (This does not mean global alignments cannot start and/or end in gaps.) A general global alignment technique is the <code>Needleman–Wunsch algorithm</code>, which is based on dynamic programming.

## Local alignment

Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The <code>Smith–Waterman algorithm</code> is a general local alignment method based on the same dynamic programming scheme but with additional choices to start and end at any place.
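
The difference between the two schemes can be sketched with a small dynamic programming routine that only returns the best score (no traceback). The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, and `align_score` is an illustrative name; real aligners use more elaborate scoring and far more efficient implementations.

```bash
# Best alignment score between two sequences.
# usage: align_score global|local SEQ1 SEQ2
# Assumed scoring: match +1, mismatch -1, gap -1 (linear gap cost)
align_score() {
  awk -v mode="$1" -v a="$2" -v b="$3" 'BEGIN {
    match_s = 1; mismatch = -1; gap = -1
    n = length(a); m = length(b); best = 0
    # First row and column: free in local mode, gap penalties in global mode
    for (i = 0; i <= n; i++) H[i, 0] = (mode == "local") ? 0 : i * gap
    for (j = 0; j <= m; j++) H[0, j] = (mode == "local") ? 0 : j * gap
    for (i = 1; i <= n; i++) {
      for (j = 1; j <= m; j++) {
        s = (substr(a, i, 1) == substr(b, j, 1)) ? match_s : mismatch
        v = H[i-1, j-1] + s                              # (mis)match
        if (H[i-1, j] + gap > v) v = H[i-1, j] + gap     # gap in b
        if (H[i, j-1] + gap > v) v = H[i, j-1] + gap     # gap in a
        if (mode == "local" && v < 0) v = 0              # local: restart anywhere
        H[i, j] = v
        if (v > best) best = v
      }
    }
    if (mode == "local") print best; else print H[n, m]
  }'
}

ref=GAAGCTCTAGGATTACGATCTTGATCGCCGGGAAATTATGATCCTGACCTGAGTTTAAGGCATGGACCCATAA
query=ATCTTGATCGCCGACATT
echo "global: $(align_score global "$query" "$ref")"
echo "local:  $(align_score local "$query" "$ref")"
```

The global score is penalized for every reference base left unaligned, whereas the local score only reflects the best-matching region, which is why local (or semi-global) schemes suit short reads mapped onto a long reference.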

---
# Seed-and-extend especially adapted to NGS data

Seed-and-extend mappers are a class of read mappers that break down each read sequence into seeds (i.e., smaller segments) to find locations in the reference genome that closely match the read.

.pull-left[
1. First, the mapper obtains a read
2. Second, the mapper selects smaller DNA segments from the read to
serve as seeds
3. Third, the mapper indexes a data structure with each seed to
obtain a list of possible locations within the reference genome that could result in
a match
4. Fourth, for each possible location in the list, the mapper obtains the
corresponding DNA sequence from the reference genome
5. Fifth, the mapper aligns the read sequence to the reference sequence, using an expensive sequence
alignment (i.e., verification) algorithm to determine the similarity between the read
sequence and the reference sequence.
]
.pull-right[
<img src="images/seed_and_extend.png" width="90%" style="display: block; margin: auto;" />
]
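
Steps 2 to 4 can be illustrated with a toy example using the reference and read from the alignment strategies slide. In this sketch, a naive exact string search stands in for the index that real mappers build, and the seeds are simply consecutive, non-overlapping segments; `seed_hits` is an illustrative name.

```bash
# List exact seed hits of a read on a reference.
# usage: seed_hits REFERENCE READ SEED_LENGTH
# Output columns: seed, 1-based hit position, implied start of the read
seed_hits() {
  awk -v ref="$1" -v rd="$2" -v k="$3" 'BEGIN {
    for (off = 1; off + k - 1 <= length(rd); off += k) {
      seed = substr(rd, off, k)
      start = 1
      # Naive search for every occurrence of the seed in the reference
      while ((p = index(substr(ref, start), seed)) > 0) {
        pos = start + p - 1
        printf "%s\t%d\t%d\n", seed, pos, pos - off + 1
        start = pos + 1
      }
    }
  }'
}

ref=GAAGCTCTAGGATTACGATCTTGATCGCCGGGAAATTATGATCCTGACCTGAGTTTAAGGCATGGACCCATAA
query=ATCTTGATCGCCGACATT
seed_hits "$ref" "$query" 6
```

Seeds that agree on the implied start of the read point to a candidate location, which the mapper would then verify with a full alignment (step 5). A seed containing a mismatch or a gap produces no hit.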
---
# Mapping tools

- Short reads: BWA <a name=cite-bwa></a>([Li, 2013](#bib-bwa)) / Bowtie 2 <a name=cite-langmead2012fast></a>([Langmead and Salzberg, 2012](#bib-langmead2012fast))

---
<img src="images/TP.png" class="handson">
# Hands-on: mapping with bwa

## Instruction

- Map the reads to the reference genome <code>/shared/projects/dubii2020/data/module5/seance1/CP031214.1.fasta</code> with <code>bwa</code>

---
<img src="images/TP.png" class="handson">
## Correction

```bash
cd ~/M5
module load bwa
module load samtools
# The reference must be indexed once beforehand:
# srun bwa index sequence.fasta
srun --cpus-per-task=33 bwa mem -t 32 \
  /shared/projects/dubii2020/data/module5/seance1/CP031214.1.fasta \
  CLEANING/SRR8082143_1.cleaned_filtered.fastq.gz \
  CLEANING/SRR8082143_2.cleaned_filtered.fastq.gz \
  | samtools view -hbS - > MAPPING/SRR8082143.bam
```

```bash
ls -ltrh ~/M5/MAPPING/
total 249M
-rw-rw-r-- 1 orue orue 249M  6 mars  13:01 SRR8082143.bam
```

---
class: heading-slide, middle, center
# Sequence Alignment Format (SAM)

---
# SAM / BAM formats

The SAM/BAM formats are so-called Sequence Alignment Maps. These files typically represent
the results of aligning a FASTQ file to a reference FASTA file and describe the individual,
pairwise alignments that were found. Different algorithms may create different alignments
(and hence BAM files).

<img src="images/SAM_format.jpg" width="70%" style="display: block; margin: auto;" />
---
# SAM FLAG

[FLAGS](https://broadinstitute.github.io/picard/explain-flags.html) encode a lot of information about each alignment.
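
The FLAG field is a sum of powers of two, each bit standing for one property of the alignment. A minimal sketch of the decoding in plain shell is given below; the short names follow those commonly used by samtools, and `decode_flag` is an illustrative name.

```bash
# Decode a SAM FLAG value into the names of the bits that are set
decode_flag() {
  local flag=$1 i out=()
  # Bit i corresponds to the value 2^i: 1, 2, 4, 8, 16, ...
  local names=(PAIRED PROPER_PAIR UNMAP MUNMAP REVERSE MREVERSE
               READ1 READ2 SECONDARY QCFAIL DUP SUPPLEMENTARY)
  for i in "${!names[@]}"; do
    if (( flag & (1 << i) )); then
      out+=("${names[i]}")
    fi
  done
  local IFS=,
  echo "${out[*]}"
}

decode_flag 99    # 99 = 64 + 32 + 2 + 1
decode_flag 147   # 147 = 128 + 16 + 2 + 1
```

Flags 99 and 147 are a typical combination for the two mates of a properly paired read.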

---
# SAM CIGAR

---
# SAM toolbox

## Samtools & Picard tools

Samtools <a name=cite-samtools></a>([Li, Handsaker, Wysoker, Fennell, Ruan, Homer, Marth, Abecasis, and Durbin, 2009](#bib-samtools)) and Picard tools <a name=cite-picardtools></a>([Broad Institute, 2018](#bib-picardtools)) are Swiss Army knives for working with SAM/BAM files

- Visualize
- Filter
- Stats
- Index
- Merge
- ...

---
## Some examples with samtools

```bash
# Visualize BAM content in SAM format
samtools view -h MAPPING/SRR8082143.bam
# Sort BAM file
samtools sort MAPPING/SRR8082143.bam -o MAPPING/SRR8082143.sorted.bam
# Index sorted BAM file
samtools index MAPPING/SRR8082143.sorted.bam
# Get some statistics
samtools flagstat MAPPING/SRR8082143.sorted.bam
# Extract a specific region (SAM output on the terminal)
samtools view MAPPING/SRR8082143.sorted.bam CP031214.1:1-1000
# Extract a specific region into a new BAM file
samtools view -b MAPPING/SRR8082143.sorted.bam CP031214.1:1-1000 > MAPPING/SRR8082143.1-1000.bam
# ...
```

## Picard tools

```bash
module load picard
picard -h
```

---
# What about Long Reads ?

As overall quality and error models are different, algorithms and tools are also different for long reads.

The raw read format is also different.

- PacBio :
  - internal read correction
  - built-in software for QC / correction
  - QC : NanoPlot <a name=cite-101093bioinformaticsbty149></a>([De Coster, D’Hert, Schultz, Cruts, and Van Broeckhoven, 2018](https://doi.org/10.1093/bioinformatics/bty149))
  - Correction (hybrid) : LoRDEC <a name=cite-salmela2014lordec></a>([Salmela and Rivals, 2014](#bib-salmela2014lordec))
  - Alignment : minimap2 <a name=cite-li2018minimap2></a>([Li, 2018](#bib-li2018minimap2)), BLASR <a name=cite-chaisson2012mapping></a>([Chaisson and Tesler, 2012](#bib-chaisson2012mapping))
- Nanopore :
  - Pay attention to the basecaller / chemistry version!
  - QC : NanoPlot
  - Correction : Canu <a name=cite-koren2017canu></a>([Koren, Walenz, Berlin, Miller, Bergman, and Phillippy, 2017](#bib-koren2017canu)), MECAT <a name=cite-xiao2017mecat></a>([Xiao, Chen, Xie, Chen, Wang, Han, Luo, and Xie, 2017](#bib-xiao2017mecat))
  
---
# References

<a name=bib-Allue-Guardiae01052-18></a>[Allué-Guardia, A, E. C. Nyong,
S. S. K. Koenig, et al.](#cite-Allue-Guardiae01052-18) (2019). "Closed
Genome Sequence of Escherichia coli K-12 Group Strain C600". In:
_Microbiology Resource Announcements_ 8.2. Ed. by J. A. Maresca. DOI:
[10.1128/MRA.01052-18](https://doi.org/10.1128%2FMRA.01052-18). eprint:
https://mra.asm.org/content/8/2/e01052-18.full.pdf. URL:
[https://mra.asm.org/content/8/2/e01052-18](https://mra.asm.org/content/8/2/e01052-18).

<a name=bib-fastqc></a>[Andrews, S.](#cite-fastqc) (2010). _FastQC A
Quality Control tool for High Throughput Sequence Data_. URL:
[http://www.bioinformatics.babraham.ac.uk/projects/fastqc/](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

<a name=bib-picardtools></a>[Broad Institute](#cite-picardtools)
(2018). _Picard Tools_. <URL: http://broadinstitute.github.io/picard/>.

<a name=bib-chaisson2012mapping></a>[Chaisson, M. J. and G.
Tesler](#cite-chaisson2012mapping) (2012). "Mapping single molecule
sequencing reads using basic local alignment with successive refinement
(BLASR): application and theory". In: _BMC bioinformatics_ 13.1, p.
238.

<a name=bib-101093bioinformaticsbty149></a>[De Coster, W, S. D’Hert, D.
T. Schultz, et al.](#cite-101093bioinformaticsbty149) (2018).
"NanoPack: visualizing and processing long-read sequencing data". In:
_Bioinformatics_ 34.15, pp. 2666-2669. ISSN: 1367-4803. DOI:
[10.1093/bioinformatics/bty149](https://doi.org/10.1093%2Fbioinformatics%2Fbty149).
eprint:
https://academic.oup.com/bioinformatics/article-pdf/34/15/2666/25230836/bty149.pdf.
URL:
[https://doi.org/10.1093/bioinformatics/bty149](https://doi.org/10.1093/bioinformatics/bty149).

<a name=bib-multiqc></a>[Ewels, P, M. Magnusson, S. Lundin, et
al.](#cite-multiqc) (2016). "MultiQC: summarize analysis results for
multiple tools and samples in a single report". In: _Bioinformatics_
32.19, pp. 3047-3048.

---
# References
<a name=bib-sickle></a>[Joshi, N. and J. Fass](#cite-sickle) (2011).
_Sickle: a sliding-window, adaptive, quality-based trimming tool for
FastQ files_.

<a name=bib-koren2017canu></a>[Koren, S, B. P. Walenz, K. Berlin, et
al.](#cite-koren2017canu) (2017). "Canu: scalable and accurate
long-read assembly via adaptive k-mer weighting and repeat separation".
In: _Genome research_ 27.5, pp. 722-736.

<a name=bib-langmead2012fast></a>[Langmead, B. and S.
Salzberg](#cite-langmead2012fast) (2012). "Fast gapped-read alignment
with Bowtie 2". In: _Nature Methods_ 9.4, pp. 357-359.

<a name=bib-li2012seqtk></a>[Li, H.](#cite-li2012seqtk) (2012). _seqtk
Toolkit for processing sequences in FASTA/Q formats_.

<a name=bib-bwa></a>[Li, H.](#cite-bwa) (2013). "Aligning sequence
reads, clone sequences and assembly contigs with BWA-MEM". In: _arXiv
preprint arXiv:1303.3997_.

<a name=bib-li2018minimap2></a>[Li, H.](#cite-li2018minimap2) (2018).
"Minimap2: pairwise alignment for nucleotide sequences". In:
_Bioinformatics_ 34.18, pp. 3094-3100.

<a name=bib-samtools></a>[Li, H, B. Handsaker, A. Wysoker, et
al.](#cite-samtools) (2009). "The sequence alignment/map format and
SAMtools". In: _Bioinformatics_ 25.16, pp. 2078-2079.

<a name=bib-cutadapt></a>[Martin, M.](#cite-cutadapt) (2011). "Cutadapt
removes adapter sequences from high-throughput sequencing reads". In:
_EMBnet. journal_ 17.1, pp. 10-12.

---
# References
<a name=bib-salmela2014lordec></a>[Salmela, L. and E.
Rivals](#cite-salmela2014lordec) (2014). "LoRDEC: accurate and
efficient long read error correction". In: _Bioinformatics_ 30.24, pp.
3506-3514.

<a name=bib-shen2016seqkit></a>[Shen, W, S. Le, Y. Li, et
al.](#cite-shen2016seqkit) (2016). "SeqKit: a cross-platform and
ultrafast toolkit for FASTA/Q file manipulation". In: _PloS one_ 11.10.

<a name=bib-xiao2017mecat></a>[Xiao, C, Y. Chen, S. Xie, et
al.](#cite-xiao2017mecat) (2017). "MECAT: fast mapping, error
correction, and de novo assembly for single-molecule sequencing reads".
In: _Nature Methods_ 14.11, p. 1072.

<a name=bib-fastp></a>[Zhou, Y, Y. Chen, S. Chen, et al.](#cite-fastp)
(2018). "fastp: an ultra-fast all-in-one FASTQ preprocessor". In:
_Bioinformatics_ 34.17, pp. i884-i890. ISSN: 1367-4803. DOI:
[10.1093/bioinformatics/bty560](https://doi.org/10.1093%2Fbioinformatics%2Fbty560).
eprint:
http://academic.oup.com/bioinformatics/article-pdf/34/17/i884/25702346/bty560.pdf.
URL:
[https://dx.doi.org/10.1093/bioinformatics/bty560](https://dx.doi.org/10.1093/bioinformatics/bty560).