Pileup format

Pileup
Filename extensions: .msf, .pup, .pileup
Developed by: Tony Cox and Zemin Ning
Type of format: Bioinformatics
Extended from: Tab-separated values
Website: www.htslib.org/doc/samtools-mpileup.html

Pileup format is a text-based format for summarizing the base calls of aligned reads to a reference sequence. This format facilitates visual display of SNP/indel calling and alignment. It was first used by Tony Cox and Zemin Ning at the Wellcome Trust Sanger Institute, and became widely known through its implementation within the SAMtools software suite. [1]

Format

Example

Sequence  Position  Reference Base  Read Count  Read Results                 Quality
seq1      272       T               24          ,.$.....,,.,.,...,,,.,..^+.  <<<+;<<<<<<<<<<<=<;<;7<&
seq1      273       T               23          ,.....,,.,.,...,,,.,..A      <<<;<<<<<<<<<3<=<<<;<<+
seq1      274       T               23          ,.$....,,.,.,...,,,.,...     7<7;<;<<<<<<<<<=<;<;<<6
seq1      275       A               23          ,$....,,.,.,...,,,.,...^l.   <+;9*<<<<<<<<<=<<:;<<<<
seq1      276       G               22          ...T,,.,.,...,,,.,....       33;+<<7=7<<7<&<<1;<<6<
seq1      277       T               22          ....,,.,.,.C.,,,.,..G.       +7<;<<<<<<<&<=<<:;<<&<
seq1      278       G               23          ....,,.,.,...,,,.,....^k.    %38*<<;<7<<7<=<<<;<<<<<
seq1      279       C               23          A..T,,.,.,...,,,.,.....      75&<<<<<<<<<=<<<9<<:<<<

The columns

Each line consists of 5 (or optionally 6) tab-separated columns:

  1. Sequence identifier
  2. Position in sequence (starting from 1)
  3. Reference nucleotide at that position
  4. Number of aligned reads covering that position (depth of coverage)
  5. Bases at that position from aligned reads
  6. Phred quality of those bases, encoded as ASCII characters with an offset of 33, so that the ASCII value minus 33 gives the quality (optional)
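
The column layout above can be read with a few lines of Python. This is a minimal sketch rather than a reference parser: the names `PileupRecord` and `parse_pileup_line` are invented for this example, and it assumes the fields are separated by single tab characters and that the sixth column may be missing.

```python
from typing import NamedTuple, Optional


class PileupRecord(NamedTuple):
    """One pileup line split into its columns (hypothetical helper type)."""
    seq_id: str           # column 1: sequence identifier
    pos: int              # column 2: 1-based position
    ref_base: str         # column 3: reference nucleotide
    depth: int            # column 4: number of reads covering the position
    bases: str            # column 5: bases string
    quals: Optional[str]  # column 6: quality string, or None if absent


def parse_pileup_line(line: str) -> PileupRecord:
    """Split a tab-separated pileup line into a PileupRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 columns, got {len(fields)}")
    # The sixth column is optional, so fall back to None when it is missing.
    quals = fields[5] if len(fields) > 5 else None
    return PileupRecord(
        seq_id=fields[0],
        pos=int(fields[1]),
        ref_base=fields[2],
        depth=int(fields[3]),
        bases=fields[4],
        quals=quals,
    )


# Example: the line for position 276 from the table above, without qualities.
record = parse_pileup_line("seq1\t276\tG\t22\t...T,,.,.,...,,,.,....")
```

Keeping the bases and quality columns as raw strings leaves their interpretation to separate steps, since both use their own per-character encodings.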

Column 5: The bases string

  • . (dot) means a base that matched the reference on the forward strand
  • , (comma) means a base that matched the reference on the reverse strand
  • </> (less-/greater-than sign) denotes a reference skip. This occurs, for example, if a base in the reference genome is intronic and a read maps to two flanking exons. If quality scores are given in a sixth column, they refer to the quality of the read and not the specific base.
  • AGTCN (upper case) denotes a base that did not match the reference on the forward strand
  • agtcn (lower case) denotes a base that did not match the reference on the reverse strand
  • A sequence matching the regular expression \+[0-9]+[ACGTNacgtn]+ denotes an insertion of one or more bases starting from the next position. For example, +2AG means an insertion of AG on the forward strand
  • A sequence matching the regular expression \-[0-9]+[ACGTNacgtn]+ denotes a deletion of one or more bases starting from the next position. For example, -2ct means a deletion of CT on the reverse strand
  • ^ (caret) marks the start of a read segment; the ASCII value of the character following ^, minus 33, gives the mapping quality
  • $ (dollar) marks the end of a read segment
  • * (asterisk) is a placeholder for a deleted base in a multiple-base-pair deletion that was announced on a previous line by the \-[0-9]+[ACGTNacgtn]+ notation
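
One way to read the bases string is to walk it from left to right and classify each symbol. The sketch below is an illustration only, not the SAMtools implementation; the function name `summarize_bases` and the category labels are made up for this example. It assumes that the number after + or - is the exact count of inserted or deleted bases, so it consumes that many characters instead of relying on the regular expression alone, and it treats the character after ^ as a mapping quality rather than as a base.

```python
from collections import Counter


def summarize_bases(bases: str) -> Counter:
    """Count the kinds of symbols in a pileup bases string (column 5)."""
    counts: Counter = Counter()
    i, n = 0, len(bases)
    while i < n:
        c = bases[i]
        if c == "^":
            # Start of a read segment: skip the marker and the
            # mapping-quality character that follows it.
            i += 2
            continue
        if c == "$":
            # End of a read segment: a marker only, not a base.
            i += 1
            continue
        if c in "+-":
            # Indel: read the length, then exactly that many bases.
            j = i + 1
            while j < n and bases[j].isdigit():
                j += 1
            if j == i + 1:
                # No length given; count as unrecognized and move on.
                counts["other"] += 1
                i += 1
                continue
            length = int(bases[i + 1:j])
            counts["insertion" if c == "+" else "deletion"] += 1
            i = j + length
            continue
        if c == ".":
            counts["match_fwd"] += 1
        elif c == ",":
            counts["match_rev"] += 1
        elif c in "ACGTN":
            counts["mismatch_fwd"] += 1
        elif c in "acgtn":
            counts["mismatch_rev"] += 1
        elif c == "*":
            counts["deleted_placeholder"] += 1
        elif c in "<>":
            counts["ref_skip"] += 1
        else:
            counts["other"] += 1
        i += 1
    return counts


# Bases string of position 272 in the example table.
summary = summarize_bases(",.$.....,,.,.,...,,,.,..^+.")
```

For the bases string of position 272 this yields 15 forward-strand matches and 9 reverse-strand matches, 24 in total, which agrees with the read count in column 4. The + that follows ^ in that string is a mapping-quality character, which is why the markers are handled before the indel branch.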

Column 6: The base quality string

This is an optional column. If present, the ASCII value of each character minus 33 gives the Phred base quality of the corresponding base in column 5. This is the same encoding used for quality scores in the FASTQ format.
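
Decoding the quality string is a one-line operation per character. The following is a small sketch with invented function names (`decode_phred33`, `phred_to_error_probability`); the conversion to an error probability uses the standard Phred definition Q = -10 log10(P), which is not spelled out in the text above.

```python
from typing import List


def decode_phred33(quals: str) -> List[int]:
    """Convert a quality string to Phred scores (ASCII value minus 33)."""
    return [ord(ch) - 33 for ch in quals]


def phred_to_error_probability(q: int) -> float:
    """Standard Phred relation: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# '<' is ASCII 60, '&' is ASCII 38 and '+' is ASCII 43.
scores = decode_phred33("<&+")
```

With these characters the decoded scores are 27, 5 and 10; a score of 10 corresponds to an error probability of 0.1.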

File extension

There is no standard file extension for a Pileup file, but .msf (multiple sequence file), .pup[2] and .pileup[3][4] are used.

References

  1. ^ Li H.; Handsaker B.; Wysoker A.; Fennell T.; Ruan J.; Homer N.; Marth G.; Abecasis G.; Durbin R; 1000 Genome Project Data Processing Subgroup (2009). "The Sequence alignment/map (SAM) format and SAMtools". Bioinformatics. 25 (16): 2078–2079. doi:10.1093/bioinformatics/btp352. PMC 2723002. PMID 19505943.
  2. ^ Accelrys (1998-10-02). "QUANTA: Protein Design. 3. Reading and Writing Sequence Data Files". Université de Montréal. Retrieved 2020-03-27.
  3. ^ Glez-Peña, Daniel; Gómez-López, Gonzalo; Reboiro-Jato, Miguel; Fdez-Riverola, Florentino; Pisano, David G (2011-01-24). "PileLine: a toolbox to handle genome position information in next-generation sequencing studies". BMC Bioinformatics. 12: 31. doi:10.1186/1471-2105-12-31. ISSN 1471-2105. PMC 3037855. PMID 21261974.
  4. ^ Chisom, Halimat (2023-03-31). "File Formats Every Bioinformatician — Established or Upcoming — Must Know (and then some)". Medium. Retrieved 2023-11-11.