class: center, middle, inverse, title-slide

# First steps with NGS data
## DUBii - Module 5
### Valentin Loux - Olivier Rué
### 2020-03-09

---

# Program

- Introduction (5 min)
- Get data from public resources (30 min)
- FASTQ format
- Quality control (45 min)
- Cleaning of reads (30 min)
- Mapping of reads (30 min)
- FASTA format
- SAM format

<!-- Total : 2h 35 -->

---

<img src="images/TP.png" class="handson">

# Preparation of your working directory

## Instruction

- Go to your home directory
- Create a directory called M5 (i.e. Module5) and move into it
- Create this directory structure:

```bash
tree ~/M5
[orue@clust-slurm-client M5]$ tree .
.
├── CLEANING
├── FASTQ
├── MAPPING
└── QC

4 directories, 0 files
```

---

<img src="images/TP.png" class="handson">

## Correction

```bash
mkdir -p ~/M5/FASTQ
mkdir -p ~/M5/CLEANING
mkdir -p ~/M5/MAPPING
mkdir -p ~/M5/QC
cd ~/M5
```

---
class: heading-slide, middle, center

# The Data

---

# What is data

## Definition

- `Data` is <i>a symbolic representation of information</i>
- `Data` is stored in files whose format allows easy access and manipulation
- `Data` represents the knowledge at a given time.

## Properties

- The same information may be represented in different formats
- The content depends on technologies

<div class="alert comment">Understanding data formats, what information is encoded in each, and when it is appropriate to use one format over another is an essential skill of a bioinformatician.</div>

---

# Genomic sequence resources

The International Nucleotide Sequence Database Collaboration (INSDC) is a long-standing foundational initiative that operates between DDBJ, EMBL-EBI and NCBI. INSDC covers the spectrum of data from raw reads, through alignments and assemblies, to functional annotation, enriched with contextual information relating to samples and experimental configurations.
<div class="figure" style="text-align: center">
<img src="images/public_resources.png" alt="INSDC resources" width="70%" />
<p class="caption">INSDC resources</p>
</div>

---

# International Nucleotide Sequence Database Collaboration

The member organizations of this collaboration are:

- NCBI: National Center for Biotechnology Information
- EMBL-EBI: European Molecular Biology Laboratory, European Bioinformatics Institute
- DDBJ: DNA Data Bank of Japan

The INSDC has set up rules on the types of data that will be mirrored. The most important of these from a bioinformatician’s perspective are:

- GenBank / EBI ENA contain all annotated and identified DNA sequence information
- SRA [NCBI Sequence Read Archive](https://trace.ncbi.nlm.nih.gov/Traces/sra/) / ENA [European Nucleotide Archive](https://www.ebi.ac.uk/ena/browser/search): the read archives contain measurements from high-throughput sequencing experiments (raw data)

Depositing raw sequencing data and processed (analyzed) data is, most of the time, a prerequisite for publication.

---

# Other sequence resources

## NAR Database Issue

Once a year the journal Nucleic Acids Research publishes its so-called “database issue”. Each article in this issue gives an overview of a generic or specialized database, written by the maintainers of that resource.

- View the NAR: 2019 Database Issue.
<div class="figure" style="text-align: center">
<img src="images/NAR_db.png" alt="NAR 2019 database issue overview" width="50%" />
<p class="caption">NAR 2019 database issue overview</p>
</div>

---
class: heading-slide, middle, center

# Getting raw data

---

# Getting raw data

## Sequencing data

- Public repositories offer specialized tools or APIs to download data locally
    - ENA: enaBrowserTools (command line, python, R)
    - NCBI: sra-toolkit (command line, python, R)

Common command-line tools (wget) can also be used most of the time

---

<img src="images/TP.png" class="handson">

# Hands-on: Getting raw data

## Instruction

Get the raw short read data (Illumina) associated with this article <a name=cite-Allue-Guardiae01052-18></a>([Allué-Guardia, Nyong, Koenig, Vargas, Bono, and Eppinger, 2019](https://mra.asm.org/content/8/2/e01052-18)).

<img src="images/MRA.01052-18.png" width="70%" style="display: block; margin: auto;" />

- In the "Data availability" section, extract the accession for Illumina data: SRX4909245
- Explore [SRA](https://www.ncbi.nlm.nih.gov/sra/SRX4909245) and [ENA](https://www.ebi.ac.uk/ena/browser/view/SRX4909245)

Get the data by the method of your choice:

- use <code>wget</code> or <code>fasterq-dump</code> from <code>sra-tools</code>

Compress FASTQ files with <code>gzip</code>

---

<img src="images/TP.png" class="handson">

## Correction

- Direct download via the web browser
- Using wget:

```bash
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR808/003/SRR8082143/SRR8082143_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR808/003/SRR8082143/SRR8082143_2.fastq.gz
```

- Using sra-toolkit

```bash
module load sra-tools
srun fasterq-dump -S -p SRR8082143 --outdir . \
--threads 1
```

- enaBrowserTools is also available
- Compress FASTQ files

```bash
gzip *.fastq
```

```bash
ls -ltrh ~/M5/FASTQ/
total 236M
-rw-rw-r-- 1 orue orue 127M  6 mars  12:32 SRR8082143_2.fastq.gz
-rw-rw-r-- 1 orue orue 109M  6 mars  12:32 SRR8082143_1.fastq.gz
```

---

# Sequencing - Vocabulary

.pull-left[

**Read**: piece of sequenced DNA

**DNA fragment** = 1 or more reads, depending on whether the sequencing is single-end or paired-end

**Insert** = fragment size

**Depth** = `\(N*L/G\)`, where N = number of reads, L = read length, G = genome size

**Coverage** = % of genome covered

]

.pull-right[

<img src="images/se-pe.png" width="80%" style="display: block; margin: auto;" />

<img src="images/fragment-insert.png" width="80%" style="display: block; margin: auto;" />

<div class="figure" style="text-align: center">
<img src="images/depth-breadth.png" alt="Single-End , Paired-End" width="80%" />
<p class="caption">Single-End , Paired-End</p>
</div>

]

---
class: heading-slide, middle, center

# FASTQ format

---

# FASTQ syntax

The FASTQ format is the de facto standard by which all sequencing instruments represent data. It may be thought of as a variant of the FASTA format that associates a quality measure with each sequence base: **FASTA with QUALITIES**.

The FASTQ format consists of 4 sections:

1. A FASTA-like header, but instead of the <code>></code> symbol it uses the <code>@</code> symbol. This is followed by an ID and more optional text, similar to the FASTA headers.
2. The second section contains the measured sequence (typically on a single line), but it may be wrapped until the <code>+</code> sign starts the next section.
3. The third section is marked by the <code>+</code> sign and may be optionally followed by the same sequence id and header as the first section.
4. The last section encodes the quality values for the sequence in section 2, and must be of the same length as section 2.
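
The four-section layout can be checked with standard shell tools. The sketch below is only an illustration, assuming unwrapped records (exactly four lines per record); the function name `fastq_check` and the two toy reads are hypothetical and not part of the course data.

```shell
# fastq_check [FILE...]: count records and layout problems in a FASTQ
# stream (reads standard input when no file is given).
# Assumes unwrapped records, i.e. exactly four lines per record.
fastq_check() {
  awk '
    NR % 4 == 1 && substr($0, 1, 1) != "@" { problems++ }   # section 1: header
    NR % 4 == 2 { seqlen = length($0) }                      # section 2: sequence
    NR % 4 == 3 && substr($0, 1, 1) != "+" { problems++ }   # section 3: separator
    NR % 4 == 0 { records++; if (length($0) != seqlen) problems++ }  # section 4
    END { printf "records=%d problems=%d\n", records, problems }
  ' "$@"
}

# Two toy records; the second repeats its id after the + sign.
fastq_check <<'EOF'
@read1
GATTTGGG
+
IIIIIIII
@read2 optional text
ACGT
+read2
IIII
EOF
# prints: records=2 problems=0
```

On compressed files the same function can read from a pipe, for example `gzip -dc FASTQ/SRR8082143_1.fastq.gz | fastq_check`.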
<i>Example</i>

```bash
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
```

---

# FASTQ quality

The odd-looking characters in the 4th section are “encoded” numerical values. In a nutshell, each character represents a numerical value: a so-called Phred score, encoded as a single character.

```bash
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
|    |    |    |    |    |    |    |    |
0....5...10...15...20...25...30...35...40
|    |    |    |    |    |    |    |    |
worst................................best
```

The quality characters of the FASTQ file are on top. The numbers in the middle of the scale, from 0 to 40, are called Phred scores. They represent error probabilities via the formula:

`\(Error = 10^{-P/10}\)`

It is basically summarized as:

- P=0 means 1/1 (100% probability of error)
- P=10 means 1/10 (10% probability of error)
- P=20 means 1/100 (1% probability of error)
- P=30 means 1/1000 (0.1% probability of error)
- P=40 means 1/10000 (0.01% probability of error)

---

# FASTQ quality encoding specificities

There was a time when instrument makers could not decide at which character to start the scale. The **current standard** shown above is the so-called Sanger (+33) format, where the ASCII codes are shifted by 33. There is also the so-called +64 format, which starts close to where the other scale ends.

<div class="figure" style="text-align: center">
<img src="images/qualityscore.png" alt="FASTQ encoding values" width="80%" />
<p class="caption">FASTQ encoding values</p>
</div>

---

# FASTQ toolbox

## seqtk

Seqtk <a name=cite-li2012seqtk></a>([Li, 2012](#bib-li2012seqtk)) is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files, which can also be optionally compressed by gzip.
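
Before reaching for a dedicated toolkit, the Phred decoding described on the quality slides can be tried by hand. The sketch below is a minimal illustration using only `awk`; the function name `phred_decode` is hypothetical, and the default offset of 33 assumes the Sanger encoding.

```shell
# phred_decode QUALITY [OFFSET]: for each character of a quality string,
# print the character, its Phred score P and the error probability
# 10^(-P/10). OFFSET defaults to 33 (Sanger); use 64 for the old format.
phred_decode() {
  printf '%s\n' "$1" | awk -v offset="${2:-33}" '
    BEGIN {
      # awk has no ord() function, so build a character -> ASCII code table
      for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i
    }
    {
      for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        p = ord[c] - offset
        printf "%s %d %g\n", c, p, 10 ^ (-p / 10)
      }
    }'
}

phred_decode '!+5?I'
# prints:
# ! 0 1
# + 10 0.1
# 5 20 0.01
# ? 30 0.001
# I 40 0.0001
```

The five characters were picked to land on the scores 0, 10, 20, 30 and 40 listed earlier; a +64 quality string can be decoded with `phred_decode 'h' 64`.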
```bash
module load seqtk
seqtk

Usage:   seqtk <command> <arguments>
Version: 1.3-r106

Command: seq       common transformation of FASTA/Q
         comp      get the nucleotide composition of FASTA/Q
         sample    subsample sequences
         subseq    extract subsequences from FASTA/Q
         fqchk     fastq QC (base/quality summary)
         mergepe   interleave two PE FASTA/Q files
         trimfq    trim FASTQ using the Phred algorithm

         hety      regional heterozygosity
         gc        identify high- or low-GC regions
         mutfa     point mutate FASTA at specified positions
         mergefa   merge two FASTA/Q files
         famask    apply a X-coded FASTA to a source FASTA
         dropse    drop unpaired from interleaved PE FASTA/Q
         rename    rename sequence names
         randbase  choose a random base from hets
         cutN      cut sequence at long N
         listhet   extract the position of each het
```

---

# FASTQ header information

Information is often encoded in the “free” text section of a FASTQ file.

<code>@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG</code> contains the following information:

- <code>EAS139</code>: the unique instrument name
- <code>136</code>: the run id
- <code>FC706VJ</code>: the flowcell id
- <code>2</code>: flowcell lane
- <code>2104</code>: tile number within the flowcell lane
- <code>15343</code>: ‘x’-coordinate of the cluster within the tile
- <code>197393</code>: ‘y’-coordinate of the cluster within the tile
- <code>1</code>: the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
- <code>Y</code>: Y if the read is filtered, N otherwise
- <code>18</code>: 0 when none of the control bits are on, otherwise it is an even number
- <code>ATCACG</code>: index sequence

This information is specific to a particular instrument/vendor and may change with different versions or releases of that instrument.

---
class: heading-slide, middle, center

# Quality control

---

## Why QC your reads?

**Try to answer (not always) simple questions:**

--

- Do the generated sequences conform to the expected level of performance?
    - Size
    - Number of reads
    - Quality
- Residual presence of adapters or indexes?
- Are there (un)expected technical biases?
- Are there (un)expected biological biases?

<div class="alert comment">
Quality control without context leads to misinterpretation</div>

---

# Quality control for FASTQ files

- FastQC <a name=cite-fastqc></a>([Andrews, 2010](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
- QC for (Illumina) FastQ files
- Command line fastqc or graphical interface
- Complete HTML report to spot problems originating from the sequencer, library preparation or contamination
- Summary graphs and tables to quickly assess your data

<div class="figure" style="text-align: center">
<img src="images/fastqc.png" alt="FastQC software" width="40%" />
<p class="caption">FastQC software</p>
</div>

---

<img src="images/TP.png" class="handson">

# Hands-on: Quality control

## Instruction

- Launch FastQC on the paired-end FastQ files of the sample you previously downloaded
- Inspect the results
- Are the numbers consistent with the article?
- Comment on the quality of the sequencing

---

<img src="images/TP.png" class="handson">

## Correction

```bash
cd ~/M5
module load fastqc
srun --cpus-per-task 8 fastqc FASTQ/SRR8082143_1.fastq.gz -o QC/ -t 8
srun --cpus-per-task 8 fastqc FASTQ/SRR8082143_2.fastq.gz -o QC/ -t 8
```

```bash
ls -ltrh ~/M5/QC
total 1,9M
-rw-rw-r-- 1 orue orue 321K  6 mars  13:23 SRR8082143_1_fastqc.zip
-rw-rw-r-- 1 orue orue 642K  6 mars  13:23 SRR8082143_1_fastqc.html
-rw-rw-r-- 1 orue orue 333K  6 mars  13:23 SRR8082143_2_fastqc.zip
-rw-rw-r-- 1 orue orue 642K  6 mars  13:23 SRR8082143_2_fastqc.html
```

---
class: heading-slide, middle, center

# Reads cleaning

---

## Objectives

- Detect and remove sequencing adapters (still) present in the FastQ files
- Filter / trim reads according to quality (as plotted in FastQC)

## Tools

- Simple & fast: Sickle <a name=cite-sickle></a>([Joshi and Fass, 2011](#bib-sickle)) (quality), cutadapt <a name=cite-cutadapt></a>([Martin, 2011](#bib-cutadapt)) (adapter removal)
- Ultra-configurable: Trimmomatic
- All in one & ultra-fast: fastp <a name=cite-fastp></a>([Zhou, Chen, Chen, and Gu,
2018](https://dx.doi.org/10.1093/bioinformatics/bty560))

<div class="figure" style="text-align: center">
<img src="images/fastp_wkwf.png" alt="fastp workflow" width="55%" />
<p class="caption">fastp workflow</p>
</div>

---

<img src="images/TP.png" class="handson">

# Hands-on: reads cleaning with fastp

## Instruction

- Launch fastp on the paired-end FastQ files of the sample you previously downloaded
- Detect and remove the classical Illumina adapters
- Filter reads with:
    - mean quality >= 20 on a sliding window of 4
    - 40% of the bases with a quality >= 15
    - length of the trimmed read >= 100
- Inspect the results
- How many reads are filtered?
- Where does fastp store its reports? Is it configurable?

---

<img src="images/TP.png" class="handson">

## Correction

```bash
module load fastp
cd ~/M5
# Note: in fastp, -w sets the number of worker threads, while -t is
# --trim_tail1 (here it trims 8 bases from the 3' end of read 1).
srun --cpus-per-task 8 fastp \
  --in1 FASTQ/SRR8082143_1.fastq.gz \
  --in2 FASTQ/SRR8082143_2.fastq.gz \
  -l 100 \
  --out1 CLEANING/SRR8082143_1.cleaned_filtered.fastq.gz \
  --out2 CLEANING/SRR8082143_2.cleaned_filtered.fastq.gz \
  --unpaired1 CLEANING/SRR8082143_singles.fastq.gz \
  --unpaired2 CLEANING/SRR8082143_singles.fastq.gz \
  -w 1 \
  -j CLEANING/fastp.json \
  -h CLEANING/fastp.html \
  -t 8
```

```bash
ls -ltrh ~/M5/CLEANING/
total 245M
-rw-rw-r-- 1 orue orue 113M  6 mars  12:59 SRR8082143_1.cleaned_filtered.fastq.gz
-rw-rw-r-- 1 orue orue 162K  6 mars  12:59 fastp.json
-rw-rw-r-- 1 orue orue 525K  6 mars  12:59 fastp.html
-rw-rw-r-- 1 orue orue 2,2M  6 mars  12:59 SRR8082143_singles.fastq.gz
-rw-rw-r-- 1 orue orue 130M  6 mars  12:59 SRR8082143_2.cleaned_filtered.fastq.gz
```

---

# One report to rule them all

.pull-left[

MultiQC <a name=cite-multiqc></a>([Ewels, Magnusson, Lundin, and Käller, 2016](#bib-multiqc)) aggregates individual reports from FastQC, fastp, Trimmomatic, Cutadapt and many more

- 78 tools included
- Aggregate all analyses in one report:
    - by tool
    - in one graph aggregating samples

]

.pull-right[

<div class="figure" style="text-align: 
center">
<img src="images/multiqc-tools.png" alt="MultiQC tools" width="30%" />
<p class="caption">MultiQC tools</p>
</div>

]

<div class="figure" style="text-align: center">
<img src="images/multiqc-example.png" alt="MultiQC Report Example" width="35%" />
<p class="caption">MultiQC Report Example</p>
</div>

---

<img src="images/TP.png" class="handson">

# Hands-on: MultiQC

## Instruction

Run MultiQC to obtain a report with the FastQC and fastp results

--

## Correction

```bash
cd ~/M5
module load multiqc
multiqc -d . -o CLEANING
```

```bash
ls -ltrh ~/M5/CLEANING/
total 248M
-rw-rw-r-- 1 orue orue 113M  6 mars  12:59 SRR8082143_1.cleaned_filtered.fastq.gz
-rw-rw-r-- 1 orue orue 162K  6 mars  12:59 fastp.json
-rw-rw-r-- 1 orue orue 525K  6 mars  12:59 fastp.html
-rw-rw-r-- 1 orue orue 2,2M  6 mars  12:59 SRR8082143_singles.fastq.gz
-rw-rw-r-- 1 orue orue 130M  6 mars  12:59 SRR8082143_2.cleaned_filtered.fastq.gz
-rw-rw-r-- 1 orue orue 1,2M  6 mars  13:28 multiqc_report.html
drwxrwxr-x 2 orue orue 2,0M  6 mars  13:28 multiqc_data
```

---
class: heading-slide, middle, center

# Mapping

---

# Mapping

- Mapping short reads to a reference genome means predicting the locus each read comes from.
- The result of a mapping is the list of the most probable regions, each with an associated probability.

--

<div class="alert comment">
But what is a reference?</div>

---

# Reference

It can be anything containing DNA sequence information:

- Complete genome
- Assembly
- Set of contigs
- Set of sequences
- Genes, non-coding RNA...

For mapping, references have to be stored in a <code>FASTA</code> file.

---
class: heading-slide, middle, center

# FASTA format

---

# Information inside

The FASTA format is used to represent sequence information. The format is very simple:

- A <code>></code> symbol on the FASTA header line indicates the start of a FASTA record.
- A string of letters called the sequence id may follow the <code>></code> symbol.
- The header line may contain an arbitrary amount of text (including spaces) on the same line.
- Subsequent lines contain the sequence.

--

<i>Example</i>

```bash
>foo
ATGCC
>bar other optional text could go here
CCGTA
>bidou
ACTGCAGT
TTCGN
>repeatmasker
ATGTGTcggggggATTTT
>prot2; my_favourite_prot
MTSRRSVKSGPREVPRDEYEDLYYTPSSGMASP
```

---

# FASTA syntax

The lack of a formal definition of the FASTA format, together with its apparent simplicity, can be a source of some of the most confounding errors in bioinformatics. Since the format appears so exceedingly straightforward, software developers have been tacitly assuming that the properties they are accustomed to are required by some standard, whereas no such thing exists.

## Common problems

- Some tools need 60 characters per line
- Some tools ignore anything following the first space in the header line
- Some tools are very restrictive on the alphabet used
- Some tools require uppercase letters

---

# FASTA formatting

## Good practices

The sequence lines should always wrap at the same width (with the exception of the last line). Some tools will fail to operate correctly and may not even warn the users if this condition is not satisfied. The following is technically a valid FASTA but it may cause various subtle problems.
```bash
>foo
ATGCATGCATGCATGCATGC
ATGCATGCA
TGATGCATGCATGCATGCA
```

should be reformatted to

```bash
>foo
ATGCATGCATGCATGCATGC
ATGCATGCATGATGCATGCA
TGCATGCA
```

<i>This can easily be done with seqkit <a name=cite-shen2016seqkit></a>([Shen, Le, Li, and Hu, 2016](#bib-shen2016seqkit))</i>

```bash
seqkit seq -w 60 seqs.fa > seqs2.fa
```

---

# FASTA Header

Some data repositories will format FASTA headers to include structured information. Tools may operate differently when this information is present in the FASTA header. Below is a list of the recognized FASTA header formats.

<div class="figure" style="text-align: center">
<img src="images/FASTA_headers.png" alt="FASTA header examples" width="50%" />
<p class="caption">FASTA header examples</p>
</div>

---
class: heading-slide, middle, center

# Alignment

---

# Alignment strategies

```bash
GAAGCTCTAGGATTACGATCTTGATCGCCGGGAAATTATGATCCTGACCTGAGTTTAAGGCATGGACCCATAA
                 ATCTTGATCGCCGAC----ATT          # GLOBAL
                 ATCTTGATCGCCGACATT              # LOCAL, with soft clipping
```

## Global alignment

Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (This does not mean global alignments cannot start and/or end in gaps.) A general global alignment technique is the <code>Needleman–Wunsch algorithm</code>, which is based on dynamic programming.

## Local alignment

Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The <code>Smith–Waterman algorithm</code> is a general local alignment method based on the same dynamic programming scheme but with additional choices to start and end at any place.
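
To make the dynamic programming scheme concrete, here is a minimal sketch of the Smith–Waterman score computation written with `awk`. It is an illustration under stated assumptions and not how production mappers work: the function name `sw_score` is hypothetical, the scoring values (match +2, mismatch -1, gap -2, linear gap penalty) are arbitrary choices, and only the best score is returned, not the alignment itself.

```shell
# sw_score QUERY TARGET: best local alignment score (Smith-Waterman).
# Scoring values are illustrative assumptions: match=2, mismatch=-1,
# gap=-2 with a linear gap penalty.
sw_score() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    match_s = 2; mismatch_s = -1; gap_s = -2
    n = length(a); m = length(b); best = 0
    # H[i,j] = best score of a local alignment ending at a[i] and b[j].
    # Row 0 and column 0 are never set, and unset awk cells read as 0.
    for (i = 1; i <= n; i++) {
      for (j = 1; j <= m; j++) {
        s = (substr(a, i, 1) == substr(b, j, 1)) ? match_s : mismatch_s
        h = H[i-1, j-1] + s                                # (mis)match
        if (H[i-1, j] + gap_s > h) h = H[i-1, j] + gap_s   # gap in target
        if (H[i, j-1] + gap_s > h) h = H[i, j-1] + gap_s   # gap in query
        if (h < 0) h = 0       # local alignment: restart instead of going negative
        H[i, j] = h
        if (h > best) best = h
      }
    }
    print best
  }'
}

sw_score ACGT TTACGTTT
# prints: 8
```

The reset to zero is what lets the alignment start and end anywhere. A Needleman–Wunsch version would drop that reset, initialise the first row and column with cumulative gap penalties, and read the score from the last cell. Because the table has one cell per pair of positions, filling it for every read against a whole genome is too expensive, which motivates the seed-and-extend approach.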
---

# Seed-and-extend especially adapted to NGS data

Seed-and-extend mappers are a class of read mappers that break down each read sequence into seeds (i.e., smaller segments) to find locations in the reference genome that closely match the read.

.pull-left[

1. First, the mapper obtains a read
2. Second, the mapper selects smaller DNA segments from the read to serve as seeds
3. Third, the mapper indexes a data structure with each seed to obtain a list of possible locations within the reference genome that could result in a match
4. Fourth, for each possible location in the list, the mapper obtains the corresponding DNA sequence from the reference genome
5. Fifth, the mapper aligns the read sequence to the reference sequence, using an expensive sequence alignment (i.e., verification) algorithm to determine the similarity between the read sequence and the reference sequence.

]

.pull-right[

<img src="images/seed_and_extend.png" width="90%" style="display: block; margin: auto;" />

]

---

# Mapping tools

<img src="images/mapping_tools.png" width="70%" style="display: block; margin: auto;" />

- Short reads: BWA <a name=cite-bwa></a>([Li, 2013](#bib-bwa)) / Bowtie <a name=cite-langmead2012fast></a>([Langmead and Salzberg, 2012](#bib-langmead2012fast))

---

<img src="images/TP.png" class="handson">

# Hands-on: mapping with bwa

## Instruction

- Map the reads to the reference genome <code>/shared/projects/dubii2020/data/module5/seance1/CP031214.1.fasta</code> with <code>bwa</code>

---

<img src="images/TP.png" class="handson">

## Correction

```bash
cd ~/M5
module load bwa
# samtools is needed for the BAM conversion below
# (assumes a samtools module exists on the cluster)
module load samtools
# srun bwa index sequence.fasta
srun --cpus-per-task=33 bwa mem \
  /shared/projects/dubii2020/data/module5/seance1/CP031214.1.fasta \
  CLEANING/SRR8082143_1.cleaned_filtered.fastq.gz \
  CLEANING/SRR8082143_2.cleaned_filtered.fastq.gz \
  -t 32 \
  | \
  samtools view -hbS - > MAPPING/SRR8082143.bam
```

```bash
ls -ltrh ~/M5/MAPPING/
total 249M
-rw-rw-r-- 1 orue orue 249M  6 mars  13:01 SRR8082143.bam
```

---
class: 
heading-slide, middle, center

# Sequence Alignment/Map format (SAM)

---

# SAM / BAM formats

SAM stands for Sequence Alignment/Map; BAM is its compressed binary equivalent. These files typically represent the results of aligning a FASTQ file to a reference FASTA file and describe the individual, pairwise alignments that were found. Different algorithms may create different alignments (and hence different BAM files).

<img src="images/SAM_format.jpg" width="70%" style="display: block; margin: auto;" />

---

# SAM FLAG

[FLAGs](https://broadinstitute.github.io/picard/explain-flags.html) encode a lot of information.

<img src="images/sam_flag.png" width="70%" style="display: block; margin: auto;" />

---

# SAM CIGAR

<img src="images/SAM_example.png" width="70%" style="display: block; margin: auto;" />

---

# SAM toolbox

## Samtools & Picard tools

Samtools <a name=cite-samtools></a>([Li, Handsaker, Wysoker, Fennell, Ruan, Homer, Marth, Abecasis, and Durbin, 2009](#bib-samtools)) and Picard tools <a name=cite-picardtools></a>([Broad Institute, 2018](#bib-picardtools)) are Swiss Army knives for working with SAM/BAM files

- Visualize
- Filter
- Stats
- Index
- Merge
- ...

---

## Some examples with samtools

```bash
# Visualize BAM content in SAM format
samtools view -h MAPPING/SRR8082143.bam
# Sort BAM file
samtools sort MAPPING/SRR8082143.bam -o MAPPING/SRR8082143.sorted.bam
# Index sorted BAM file
samtools index MAPPING/SRR8082143.sorted.bam
# Get some statistics
samtools flagstat MAPPING/SRR8082143.sorted.bam
# Extract specific region
samtools view MAPPING/SRR8082143.sorted.bam CP031214.1:1-1000
# Extract specific region in a BAM file
samtools view -h MAPPING/SRR8082143.sorted.bam CP031214.1:1-1000 | samtools view -bS - > MAPPING/SRR8082143.1-1000.bam
# ...
```

## Picard tools

```bash
module load picard
picard -h
```

---

# What about Long Reads?

As the overall quality and error models are different, algorithms and tools are different for long reads.
The raw read format is also different.

- PacBio:
    - internal read correction
    - built-in software for QC / correction
    - QC: NanoPlot <a name=cite-101093bioinformaticsbty149></a>([De Coster, D’Hert, Schultz, Cruts, and Van Broeckhoven, 2018](https://doi.org/10.1093/bioinformatics/bty149))
    - Correction (hybrid): LoRDEC <a name=cite-salmela2014lordec></a>([Salmela and Rivals, 2014](#bib-salmela2014lordec))
    - Alignment: minimap2 <a name=cite-li2018minimap2></a>([Li, 2018](#bib-li2018minimap2)), BLASR <a name=cite-chaisson2012mapping></a>([Chaisson and Tesler, 2012](#bib-chaisson2012mapping))
- Nanopore:
    - Beware of the basecaller / chemistry version!
    - QC: NanoPlot
    - Correction: Canu <a name=cite-koren2017canu></a>([Koren, Walenz, Berlin, Miller, Bergman, and Phillippy, 2017](#bib-koren2017canu)), MECAT <a name=cite-xiao2017mecat></a>([Xiao, Chen, Xie, Chen, Wang, Han, Luo, and Xie, 2017](#bib-xiao2017mecat))

---

# References

<a name=bib-Allue-Guardiae01052-18></a>[Allué-Guardia, A, E. C. Nyong, S. S. K. Koenig, et al.](#cite-Allue-Guardiae01052-18) (2019). "Closed Genome Sequence of Escherichia coli K-12 Group Strain C600". In: _Microbiology Resource Announcements_ 8.2. Ed. by J. A. Maresca. DOI: [10.1128/MRA.01052-18](https://doi.org/10.1128%2FMRA.01052-18). eprint: https://mra.asm.org/content/8/2/e01052-18.full.pdf. URL: [https://mra.asm.org/content/8/2/e01052-18](https://mra.asm.org/content/8/2/e01052-18).

<a name=bib-fastqc></a>[Andrews, S.](#cite-fastqc) (2010). _FastQC A Quality Control tool for High Throughput Sequence Data_. URL: [http://www.bioinformatics.babraham.ac.uk/projects/fastqc/](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

<a name=bib-picardtools></a>[Broad Institute](#cite-picardtools) (2018). _Picard Tools_. URL: [http://broadinstitute.github.io/picard/](http://broadinstitute.github.io/picard/).

<a name=bib-chaisson2012mapping></a>[Chaisson, M. J. and G. Tesler](#cite-chaisson2012mapping) (2012).
"Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory". In: _BMC bioinformatics_ 13.1, p. 238.

<a name=bib-101093bioinformaticsbty149></a>[De Coster, W, S. D’Hert, D. T. Schultz, et al.](#cite-101093bioinformaticsbty149) (2018). "NanoPack: visualizing and processing long-read sequencing data". In: _Bioinformatics_ 34.15, pp. 2666-2669. ISSN: 1367-4803. DOI: [10.1093/bioinformatics/bty149](https://doi.org/10.1093%2Fbioinformatics%2Fbty149). eprint: https://academic.oup.com/bioinformatics/article-pdf/34/15/2666/25230836/bty149.pdf. URL: [https://doi.org/10.1093/bioinformatics/bty149](https://doi.org/10.1093/bioinformatics/bty149).

<a name=bib-multiqc></a>[Ewels, P, M. Magnusson, S. Lundin, et al.](#cite-multiqc) (2016). "MultiQC: summarize analysis results for multiple tools and samples in a single report". In: _Bioinformatics_ 32.19, pp. 3047-3048.

---

# References

<a name=bib-sickle></a>[Joshi, N. and J. Fass](#cite-sickle) (2011). _Sickle: a sliding-window, adaptive, quality-based trimming tool for FastQ files_.

<a name=bib-koren2017canu></a>[Koren, S, B. P. Walenz, K. Berlin, et al.](#cite-koren2017canu) (2017). "Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation". In: _Genome research_ 27.5, pp. 722-736.

<a name=bib-langmead2012fast></a>[Langmead, B. and S. Salzberg](#cite-langmead2012fast) (2012). "Fast gapped-read alignment with Bowtie 2". In: _Nature Methods_ 9.4, pp. 357-359.

<a name=bib-li2012seqtk></a>[Li, H.](#cite-li2012seqtk) (2012). _seqtk Toolkit for processing sequences in FASTA/Q formats_.

<a name=bib-bwa></a>[Li, H.](#cite-bwa) (2013). "Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM". In: _arXiv preprint arXiv:1303.3997_.

<a name=bib-li2018minimap2></a>[Li, H.](#cite-li2018minimap2) (2018). "Minimap2: pairwise alignment for nucleotide sequences". In: _Bioinformatics_ 34.18, pp.
3094-3100.

<a name=bib-samtools></a>[Li, H, B. Handsaker, A. Wysoker, et al.](#cite-samtools) (2009). "The sequence alignment/map format and SAMtools". In: _Bioinformatics_ 25.16, pp. 2078-2079.

<a name=bib-cutadapt></a>[Martin, M.](#cite-cutadapt) (2011). "Cutadapt removes adapter sequences from high-throughput sequencing reads". In: _EMBnet.journal_ 17.1, pp. 10-12.

---

# References

<a name=bib-salmela2014lordec></a>[Salmela, L. and E. Rivals](#cite-salmela2014lordec) (2014). "LoRDEC: accurate and efficient long read error correction". In: _Bioinformatics_ 30.24, pp. 3506-3514.

<a name=bib-shen2016seqkit></a>[Shen, W, S. Le, Y. Li, et al.](#cite-shen2016seqkit) (2016). "SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation". In: _PLoS ONE_ 11.10.

<a name=bib-xiao2017mecat></a>[Xiao, C, Y. Chen, S. Xie, et al.](#cite-xiao2017mecat) (2017). "MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads". In: _Nature Methods_ 14.11, p. 1072.

<a name=bib-fastp></a>[Zhou, Y, Y. Chen, S. Chen, et al.](#cite-fastp) (2018). "fastp: an ultra-fast all-in-one FASTQ preprocessor". In: _Bioinformatics_ 34.17, pp. i884-i890. ISSN: 1367-4803. DOI: [10.1093/bioinformatics/bty560](https://doi.org/10.1093%2Fbioinformatics%2Fbty560). eprint: http://academic.oup.com/bioinformatics/article-pdf/34/17/i884/25702346/bty560.pdf. URL: [https://dx.doi.org/10.1093/bioinformatics/bty560](https://dx.doi.org/10.1093/bioinformatics/bty560).