Canterbury corpus

The Canterbury corpus is a collection of files intended for use as a benchmark for testing lossless data compression algorithms. It was created in 1997 at the University of Canterbury, New Zealand and designed to replace the Calgary corpus. The files were selected based on their ability to provide representative performance results.

In its most commonly used form, the corpus consists of 11 files, selected as "average" documents from 11 classes of documents, totaling 2,810,784 bytes as follows.

Size (bytes)	File name	Description
152,089	alice29.txt	English text
125,179	asyoulik.txt	Shakespeare
24,603	cp.html	HTML source
11,150	fields.c	C source
3,721	grammar.lsp	LISP source
1,029,744	kennedy.xls	Excel spreadsheet
426,754	lcet10.txt	Technical writing
481,861	plrabn12.txt	Poetry (Paradise Lost)
513,216	ptt5	CCITT test set
38,240	sum	SPARC executable
4,227	xargs.1	GNU manual page

The University of Canterbury also offers the following corpora. Additional files may be added, so results should be only reported for individual files.

The Artificial Corpus, a set of files with highly "artificial" data designed to evoke pathological or worst-case behavior. Last updated 2000 (tar timestamp).
The Large Corpus, a set of large (megabyte-size) files. Contains an E. coli genome, a King James bible, and the CIA world fact book. Last updated 1997 (tar timestamp).
The Miscellaneous Corpus. Contains one million digits of pi. Last updated 2000 (tar timestamp).

References

Ian H. Witten; Alistair Moffat; Timothy C. Bell (1999). Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann. p. 92. ISBN 9781558605701.
Salomon, David (2007). Data Compression: The Complete Reference (Fourth ed.). Springer. p. 12. ISBN 9781846286032.
"The Canterbury Corpus: Descriptions". corpus.canterbury.ac.nz.

External links

Official website

v t e Standard test items
Pangram Reference implementation Sanity check Standard test image
Artificial intelligence	Chinese room Turing test
Television (test card)	SMPTE color bars EBU colour bars Indian-head test pattern EIA 1956 resolution chart BBC Test Card A, B, C, D, E, F, G, H, J, W, X ETP-1 Philips circle pattern (PM 5538, PM 5540, PM 5544, PM 5644) Snell & Wilcox SW2/SW4 Telefunken FuBK TVE test card UEIT
Computer languages	"Hello, World!" program Quine Trabb Pardo–Knuth algorithm Man or boy test Just another Perl hacker
Data compression	Calgary corpus Canterbury corpus Silesia corpus enwik8, enwik9
3D computer graphics	Cornell box Stanford bunny Stanford dragon Utah teapot List
Machine learning	ImageNet MNIST database List
Typography (filler text)	Etaoin shrdlu Hamburgevons Lorem ipsum The quick brown fox jumps over the lazy dog
Other	3DBenchy Acid 1 2 3 "Bad Apple!!" EICAR test file functions for optimization GTUBE Harvard sentences Lenna "The North Wind and the Sun" "Tom's Diner" SMPTE universal leader EURion constellation Shakedown Webdriver Torso 1951 USAF resolution test chart

Data compression methods

Lossless

Entropy type	Adaptive coding Arithmetic Asymmetric numeral systems Golomb Huffman Adaptive Canonical Modified Range Shannon Shannon–Fano Shannon–Fano–Elias Tunstall Unary Universal Exp-Golomb Fibonacci Gamma Levenshtein
Dictionary type	Byte pair encoding Lempel–Ziv 842 LZ4 LZJB LZO LZRW LZSS LZW LZWL Snappy
Other types	BWT CTW CM Delta Incremental DMC DPCM Grammar Re-Pair Sequitur LDCT MTF PAQ PPM RLE
Hybrid	LZ77 + Huffman Deflate LZX LZS LZ77 + ANS LZFSE LZ77 + Huffman + ANS Zstandard LZ77 + Huffman + context Brotli LZSS + Huffman LHA/LZH LZ77 + Range LZMA LZHAM RLE + BWT + MTF + Huffman bzip2

Lossy

Transform type	Discrete cosine transform DCT MDCT DST FFT Wavelet Daubechies DWT SPIHT
Predictive type	DPCM ADPCM LPC ACELP CELP LAR LSP WLPC Motion Compensation Estimation Vector Psychoacoustic

Audio

Concepts	Bit rate ABR CBR VBR Companding Convolution Dynamic range Latency Nyquist–Shannon theorem Sampling Silence compression Sound quality Speech coding Sub-band coding
Codec parts	A-law μ-law DPCM ADPCM DM FT FFT LPC ACELP CELP LAR LSP WLPC MDCT Psychoacoustic model

Image

Concepts	Chroma subsampling Coding tree unit Color space Compression artifact Image resolution Macroblock Pixel PSNR Quantization Standard test image Texture compression
Methods	Chain code DCT Deflate Fractal KLT LP RLE Wavelet Daubechies DWT EZW SPIHT

Video

Concepts	Bit rate ABR CBR VBR Display resolution Frame Frame rate Frame types Interlace Video characteristics Video quality
Codec parts	DCT DPCM Deblocking filter Lapped transform Motion Compensation Estimation Vector Wavelet Daubechies DWT

Theory

Community

Hutter Prize

People

Mark Adler

This computer science article is a stub. You can help Misplaced Pages by expanding it.

Categories:

Misplaced Pages

Contents

See also

References

External links