Audio data compression
Note: This article is about audio data compression, which reduces the data rate of digital audio signals. It should not be confused with audio level compression, which reduces the dynamic range of audio signals, or with companding, which uses both compression and complementary dynamic range expansion as a noise reduction technique.
Audio compression is a form of data compression designed to reduce the size of audio data files. Audio compression algorithms are typically referred to as audio codecs. As with other specific forms of data compression, there exist many "lossless" and "lossy" algorithms to achieve the compression effect.
Lossless compression
Compared with image compression, lossless compression algorithms are not nearly as widely used in audio compression. This is changing with the popularity of lossless formats such as FLAC, as people increasingly want to maintain a permanent archive of their audio files. The primary users of lossless compression are audio engineers, audiophiles and those consumers who want to preserve the full quality of their audio files and who disdain the quality loss from lossy compression techniques such as Vorbis and MP3.
Lossless compression of audio is difficult for two main reasons. First, the vast majority of sound recordings are natural sounds, recorded from the real world, and such data does not compress well. Similarly, photos compress less efficiently with lossless methods than computer-generated images do. Worse, even computer-generated sounds can contain very complicated waveforms that present a challenge to many compression algorithms. This is due to the nature of audio waveforms, which are generally difficult to simplify without a (necessarily lossy) conversion to frequency information, as performed by the human ear.
The second reason is that the values of audio samples change very quickly, so repeated strings of consecutive bytes rarely appear and generic data compression algorithms do not work well for audio. However, convolution with the filter [-1 1] (that is, taking the first difference) tends to slightly whiten (decorrelate, or flatten) the spectrum, thereby allowing traditional lossless compression at the encoder to do its job; integration at the decoder restores the original signal. More advanced codecs such as Shorten, FLAC and TTA use linear prediction to estimate the spectrum of the signal. At the encoder, the inverse of the estimator is used to whiten the signal by removing spectral peaks, while at the decoder the estimator is used to reconstruct the original signal.
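The differencing and prediction idea can be sketched in Python. This is only an illustration under assumptions: the names `PREDICTORS`, `predict`, `encode` and `decode` are hypothetical and not taken from any codec, and the predictors are fixed integer ones, whereas real codecs typically choose or adapt their coefficients per block. Order 1 is the first difference described above.

```python
import math

# Fixed integer predictor coefficients (assumed for illustration only).
# Order 1 predicts x[n] from x[n-1], so its residual is the first difference.
PREDICTORS = {
    0: [],        # no prediction: the residual equals the signal
    1: [1],       # x[n] ~ x[n-1]
    2: [2, -1],   # x[n] ~ 2*x[n-1] - x[n-2]
}

def predict(samples, n, coeffs):
    """Predict sample n from the samples before it (missing history counts as 0)."""
    return sum(c * samples[n - 1 - i]
               for i, c in enumerate(coeffs) if n - 1 - i >= 0)

def encode(samples, coeffs):
    """Encoder side: keep only the prediction error (the whitened signal)."""
    return [x - predict(samples, n, coeffs) for n, x in enumerate(samples)]

def decode(residual, coeffs):
    """Decoder side: rebuild each sample by adding the same prediction back."""
    samples = []
    for n, e in enumerate(residual):
        samples.append(e + predict(samples, n, coeffs))
    return samples

def mean_abs(values):
    return sum(abs(v) for v in values) / len(values)

# A 440 Hz tone sampled at 44.1 kHz, rounded to integer sample values.
tone = [round(10000 * math.sin(2 * math.pi * 440 * n / 44100)) for n in range(1000)]

for order, coeffs in PREDICTORS.items():
    residual = encode(tone, coeffs)
    assert decode(residual, coeffs) == tone  # integer arithmetic, so exactly lossless
    print(order, round(mean_abs(residual), 1))
```

For a smooth signal such as this tone, the residual values are much smaller on average than the samples themselves, which is what lets a subsequent entropy coder spend fewer bits on them.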
Lossless audio codecs have no quality issues, so their usability can be estimated by:
- Speed of compression and decompression
- Compression factor
- Software support
See a comparison at [1] (http://wiki.hydrogenaudio.org/index.php?title=Lossless_comparison) and [2] (http://members.home.nl/w.speek/comparison.htm) and a graph at [3] (http://web.inter.nl.net/users/hvdh/lossless/All.htm)
Lossy compression
As opposed to lossless compression, where information redundancy is reduced, most lossy compression reduces perceptual redundancy; sounds which are considered perceptually irrelevant are coded with decreased accuracy or not coded at all.
Coding methods
Transform domain methods
In order to determine what information in an audio signal is perceptually irrelevant, most lossy compression algorithms use transforms such as the modified discrete cosine transform (MDCT) to convert time domain sampled waveforms into a transform domain. Once transformed, typically into the frequency domain, component frequencies can be allocated bits according to how audible they are. Audibility of spectral components is determined by first calculating a masking threshold, below which it is estimated that sounds will be beyond the limits of human perception.
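The transform step alone can be illustrated with a direct, unoptimized MDCT in Python. This is a sketch under assumptions: the function names are hypothetical, a sine window is assumed, the block size is arbitrary, and no quantization or bit allocation is performed, so the round trip is lossless up to floating-point error. A real codec would quantize the coefficients between `mdct` and `imdct`.

```python
import math

def sine_window(block_len):
    """Sine window; its squared halves sum to 1, as overlap-add requires."""
    return [math.sin(math.pi * (n + 0.5) / block_len) for n in range(block_len)]

def mdct(block):
    """Map 2N time samples to N transform coefficients (direct O(N^2) form)."""
    half = len(block) // 2
    n0 = 0.5 + half / 2
    return [sum(block[n] * math.cos(math.pi / half * (n + n0) * (k + 0.5))
                for n in range(2 * half))
            for k in range(half)]

def imdct(coeffs):
    """Map N coefficients back to 2N time samples that still contain aliasing."""
    half = len(coeffs)
    n0 = 0.5 + half / 2
    return [(2.0 / half) * sum(coeffs[k] * math.cos(math.pi / half * (n + n0) * (k + 0.5))
                               for k in range(half))
            for n in range(2 * half)]

def analyze_synthesize(signal, half):
    """Window, transform, inverse transform, window again, then overlap-add."""
    window = sine_window(2 * half)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * half + 1, half):
        block = [signal[start + n] * window[n] for n in range(2 * half)]
        rebuilt = imdct(mdct(block))
        for n in range(2 * half):
            out[start + n] += rebuilt[n] * window[n]
    return out

HALF = 8  # assumed: 8 coefficients per block of 16 samples
signal = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(64)]
rebuilt = analyze_synthesize(signal, HALF)

# Only samples covered by two overlapping blocks are fully reconstructed.
max_error = max(abs(a - b) for a, b in zip(signal[HALF:-HALF], rebuilt[HALF:-HALF]))
print(max_error < 1e-9)
```

Each block overlaps its neighbours by half, and the aliasing introduced by one block is cancelled by the next during overlap-add, which is why the first and last half-blocks are excluded from the comparison.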
The masking threshold is calculated using the absolute threshold of hearing and the principles of simultaneous masking - the phenomenon wherein a signal is masked by another signal separated by frequency - and, in some cases, temporal masking - where a signal is masked by another signal separated by time. Equal-loudness contours may also be used to weight the perceptual importance of different components. Models of the human ear-brain combination incorporating such effects are often called psychoacoustic models.
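As an illustration of the first ingredient only, the absolute threshold of hearing is often approximated by a closed-form curve attributed to Terhardt. The sketch below assumes that approximation; the function names and the component levels are made up for the example. It ignores simultaneous and temporal masking entirely, so it is far from a complete psychoacoustic model.

```python
import math

def threshold_in_quiet_db(freq_hz):
    """Approximate absolute threshold of hearing in dB SPL (assumed Terhardt-style fit)."""
    f = freq_hz / 1000.0  # the formula works in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

def audible(components):
    """Keep the (frequency in Hz, level in dB SPL) pairs that exceed the threshold."""
    return [(f, level) for f, level in components
            if level > threshold_in_quiet_db(f)]

# Hypothetical spectral components: (frequency in Hz, level in dB SPL).
components = [(100, 15.0), (1000, 15.0), (3300, -2.0), (16000, 40.0)]

for f, level in components:
    print(f, level, round(threshold_in_quiet_db(f), 1))
print(audible(components))
```

The curve is lowest around 3 to 4 kHz, where hearing is most sensitive, and rises steeply at both ends of the spectrum, so a component at a given level can be audible at 1 kHz yet fall below the threshold at 100 Hz.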
Time domain methods
Other types of lossy compressors, such as the linear predictive coding (LPC) used with speech, are source-based coders. These coders use a model of the sound's generator (such as the human vocal tract with LPC) to whiten the audio signal (i.e., flatten its spectrum) prior to quantization. Linear prediction may also be thought of as a basic perceptual coding technique; reconstruction of an audio signal using a linear predictor shapes the coder's quantization noise into the spectrum of the target signal, partially masking it.
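The whitening step can be sketched with the autocorrelation method and the Levinson-Durbin recursion, one common way to estimate linear prediction coefficients. This is an assumption-laden illustration: the function names are hypothetical, the test signal is a synthetic second-order autoregressive process standing in for speech, and quantization of the residual is omitted.

```python
import random

def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of the whole signal."""
    return [sum(x[i] * x[i - lag] for i in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return coefficients a[0..order-1] with x[n] ~ sum(a[k] * x[n-1-k])."""
    a = [0.0] * (order + 1)   # a[1..order] are used; a[0] is a placeholder
    err = r[0]                # prediction error energy so far
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:], err

def prediction_residual(x, coeffs):
    """Whitened signal: each sample minus its linear prediction."""
    return [x[n] - sum(c * x[n - 1 - k] for k, c in enumerate(coeffs) if n - 1 - k >= 0)
            for n in range(len(x))]

def energy(values):
    return sum(v * v for v in values)

# Synthetic "source": white noise shaped by a known two-pole filter (assumed values).
rng = random.Random(0)
signal = []
for n in range(4000):
    prev1 = signal[-1] if n >= 1 else 0.0
    prev2 = signal[-2] if n >= 2 else 0.0
    signal.append(1.3 * prev1 - 0.6 * prev2 + rng.gauss(0.0, 1.0))

coeffs, _ = levinson_durbin(autocorrelation(signal, 2), 2)
residual = prediction_residual(signal, coeffs)
print([round(c, 2) for c in coeffs])
print(energy(residual) < energy(signal))
```

The estimated coefficients come out close to the 1.3 and -0.6 used to generate the signal, and the residual carries far less energy than the signal, which is the flattening effect described above.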
Applications
Due to the nature of lossy algorithms, audio quality suffers when a file is decompressed and recompressed (generational losses). This makes lossy-compressed files unsuitable for professional audio engineering applications, such as sound editing and multitrack recording. However, they are very popular with end users (particularly MP3), as a megabyte can store about a minute's worth of music at adequate quality.
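The "megabyte per minute" figure can be checked with simple arithmetic. The sketch below assumes a lossy bitrate of 128 kbit/s, which is not stated above but is a commonly used MP3 setting, and compares it with uncompressed CD-quality audio (44.1 kHz, 16 bits, two channels). The names are hypothetical.

```python
def bytes_per_minute(bits_per_second):
    """Storage needed for one minute of audio at a constant bitrate."""
    return bits_per_second * 60 // 8

CD_BITRATE = 44_100 * 16 * 2   # uncompressed stereo PCM, in bit/s
LOSSY_BITRATE = 128_000        # assumed lossy bitrate, in bit/s

print(bytes_per_minute(CD_BITRATE))      # 10584000 bytes, about 10.6 MB
print(bytes_per_minute(LOSSY_BITRATE))   # 960000 bytes, just under 1 MB
print(CD_BITRATE / LOSSY_BITRATE)        # 11.025, the compression factor
```

Under these assumptions a minute of music shrinks from roughly 10.6 MB to just under 1 MB, a compression factor of about 11.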
Usability
Because lossy formats are often used for the distribution of streaming audio or interactive applications (such as the coding of speech for digital transmission in cell phone networks), the inherent latency of the coding algorithm can be critical. In contrast to the speed of compression, which is proportional to the number of operations required by the algorithm, latency here refers to the number of samples which must be analyzed before a block of audio is processed. In the minimum case, latency is zero samples (e.g., if the coder/decoder simply reduces the number of bits used to quantize the signal). Time domain algorithms such as LPC also often have low latencies, hence their popularity in speech coding for telephony. In algorithms such as MP3, however, a large number of samples have to be analyzed in order to implement a psychoacoustic model in the frequency domain, and latency is on the order of 23 ms (46 ms for two-way communication). In general, latency must be 15 ms or lower for transparent interactivity.
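The relationship between buffered samples and latency is a one-line calculation. In the sketch below, the block size of 1024 samples and the 44.1 kHz sampling rate are assumptions chosen because they give roughly the 23 ms figure quoted above; the actual delay of a given codec also depends on its frame size, overlap and look-ahead. The function names are hypothetical.

```python
def latency_ms(samples, sample_rate_hz):
    """Time needed to collect the given number of samples before coding can start."""
    return 1000.0 * samples / sample_rate_hz

def max_samples_within(budget_ms, sample_rate_hz):
    """Largest whole number of samples that fits inside a latency budget."""
    return budget_ms * sample_rate_hz // 1000

one_way = latency_ms(1024, 44_100)
print(round(one_way, 1))               # 23.2 ms one way
print(round(2 * one_way, 1))           # 46.4 ms for two-way communication
print(max_samples_within(15, 44_100))  # 661 samples fit in a 15 ms budget
```

Under these assumptions, staying within a 15 ms budget at 44.1 kHz leaves room for only about 661 samples of buffering, noticeably fewer than the assumed 1024-sample block.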
Usability of lossy audio codecs is determined by:
- Perceived audio quality
- Compression factor
- Speed of compression and decompression
- Inherent latency of algorithm (critical for real-time streaming applications)
- Software support