Ogg Speex
An open-source, royalty-free speech codec designed for packet networks and VoIP.
Pro Audio Converter can convert audio files to and from Ogg Speex (.spx).
Speex is a codec, not a container: .spx files wrap Speex-compressed speech in an Ogg transport layer.
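The codec-in-container relationship can be checked directly from a file's first bytes. The following is a minimal sketch, not how Pro Audio Converter itself identifies files: it assumes the standard Ogg page layout (a 27-byte header beginning with "OggS", with the segment count in the last header byte) and that the first packet is a Speex header beginning with the 8-byte string "Speex   ". The function name is hypothetical, and the check neither verifies the page checksum nor looks past the first page.

```python
def looks_like_ogg_speex(data: bytes) -> bool:
    """Return True if data starts with an Ogg page whose first packet is a Speex header."""
    # An Ogg page starts with the capture pattern "OggS" and a 27-byte fixed header.
    if len(data) < 27 or data[:4] != b"OggS":
        return False
    # Byte 26 holds the number of entries in the segment table that follows the header.
    segment_count = data[26]
    payload_start = 27 + segment_count
    # The Speex header packet begins with "Speex" padded with spaces to 8 bytes.
    return data[payload_start:payload_start + 8] == b"Speex   "


# Hypothetical usage with the first bytes of a file:
# with open("recording.spx", "rb") as f:
#     print(looks_like_ogg_speex(f.read(512)))
```

A check like this distinguishes .spx files from other Ogg streams (Vorbis, Opus) that share the same container but carry a different codec header.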
The Speex codec was created to fill the need for a speech codec that is open-source and free from software patent royalties, both of which are essential conditions for use in open-source software. In essence, Speex is to speech what Vorbis is to general audio and music. Unlike many other speech codecs, Speex is not designed for mobile phones but rather for packet networks and Voice over IP (VoIP) applications. File-based compression is also supported.
The Speex codec is designed to be very flexible and to support a wide range of speech qualities and bitrates. Because it targets high-quality speech as well, Speex can encode wideband speech (16 kHz sampling rate) in addition to narrowband speech (telephone quality, 8 kHz sampling rate).
Speex has largely been superseded by Opus, which was standardized as RFC 6716 in 2012, and Xiph.org recommends Opus over Speex for new voice applications. Pro Audio Converter still supports Speex so you can read, edit, or convert legacy .spx files.
Encoding Options
Variable Bit-Rate (VBR) — VBR allows a codec to change its bitrate dynamically to adapt to the "difficulty" of the audio being encoded. For example, sounds like vowels and high-energy transients require a higher bitrate to achieve good quality, while fricatives (s and f sounds) can be coded adequately with fewer bits. VBR can therefore achieve a lower bitrate for the same quality, or better quality for a given bitrate. It has two drawbacks. First, because you specify only a quality level, there is no guarantee about the final average bitrate. Second, for some real-time applications like VoIP, what counts is the maximum bitrate, which must be low enough for the communication channel.
Average Bit-Rate (ABR) — ABR solves one of the problems of VBR by dynamically adjusting VBR quality to meet a specific target bitrate. Because quality and bitrate are adjusted in real-time (open-loop), the global quality will be slightly lower than that obtained by encoding in VBR with exactly the right quality setting to meet the target average bitrate.
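The idea of steering quality toward a target bitrate can be illustrated with a toy loop. This is only a sketch of the general technique and not Speex's actual rate control: the cost model (bits proportional to quality times a per-frame difficulty), the step size, the 20 ms frame duration, and all names are assumptions made for illustration.

```python
def abr_sketch(difficulties, target_bps, frame_ms=20, step=0.05):
    """Toy ABR loop: nudge a VBR quality value so the running average bitrate
    tracks target_bps. Returns (per-frame qualities, final average bitrate)."""
    quality = 5.0          # assumed starting quality on a 0..10 scale
    total_bits = 0.0
    average_bps = 0.0
    qualities = []
    for frame_index, difficulty in enumerate(difficulties, start=1):
        # Hypothetical cost model: harder frames and higher quality cost more bits.
        total_bits += 40.0 * quality * difficulty
        qualities.append(quality)
        average_bps = total_bits * 1000 / (frame_index * frame_ms)
        # Adjust quality for the frames that follow; earlier frames are never re-encoded.
        if average_bps > target_bps:
            quality = max(0.0, quality - step)
        else:
            quality = min(10.0, quality + step)
    return qualities, average_bps


qualities, average = abr_sketch([1.0] * 1000, target_bps=8000)
print(f"average bitrate: {average:.0f} bps, quality range: "
      f"{min(qualities):.2f} to {max(qualities):.2f}")
```

Because the loop only reacts to frames already encoded, quality wanders around the ideal value instead of sitting on it, which is the reason ABR gives slightly lower overall quality than VBR at a perfectly chosen quality setting.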
Bitrate — When encoding a speech signal, the bitrate is the number of bits per unit of time required to encode the speech. It is measured in bits per second (bps) or, more commonly, kilobits per second (kbps).
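As a worked example of what a bitrate means for file size, the sketch below multiplies bitrate by duration. The function name is hypothetical, and the result covers only the codec data: the Ogg container adds some overhead that is not modeled here.

```python
def estimated_size_bytes(bitrate_bps: float, duration_seconds: float) -> float:
    """Approximate size of the encoded speech data, ignoring container overhead."""
    if bitrate_bps < 0 or duration_seconds < 0:
        raise ValueError("bitrate and duration must be non-negative")
    # bits per second * seconds = bits; 8 bits per byte
    return bitrate_bps * duration_seconds / 8


# One minute of speech at 8 kbps is about 60,000 bytes of codec data.
print(estimated_size_bytes(8_000, 60))  # 60000.0
```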
Complexity (variable) — You can vary the complexity allowed for the encoder by controlling how the search is performed, using an integer ranging from 1 to 10. For normal use, the noise level at complexity 1 is between 1 and 2 dB higher than at complexity 10, but the CPU requirements for complexity 10 are about five times higher than for complexity 1. In practice, the best trade-off is between complexity 2 and 4, though higher settings are often useful when encoding non-speech sounds like DTMF tones.
Sampling Rate — The number of samples taken from a signal per second, expressed in hertz (Hz). For a sampling rate of Fs kHz, the highest frequency that can be represented is Fs/2 kHz (the Nyquist frequency). Speex is mainly designed for three sampling rates: 8 kHz, 16 kHz, and 32 kHz, referred to as narrowband, wideband, and ultra-wideband, respectively.
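The relationship between sampling rate and representable audio bandwidth is simple arithmetic, shown here for the three Speex rates named above. The dictionary and function names are illustrative only.

```python
# Sampling rates (Hz) and the band names used in this section.
SPEEX_BANDS = {
    8_000: "narrowband",
    16_000: "wideband",
    32_000: "ultra-wideband",
}


def nyquist_hz(sampling_rate_hz: int) -> float:
    """Highest frequency representable at the given sampling rate (Fs / 2)."""
    return sampling_rate_hz / 2


for rate, band in SPEEX_BANDS.items():
    print(f"{band}: {rate} Hz sampling, frequencies up to {nyquist_hz(rate):.0f} Hz")
```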
Frames Per Packet — Specifies the number of Speex frames per Ogg packet. 1 is the default.
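To see what this setting trades off, the sketch below computes how much audio each packet carries and how many packets are written per second. It assumes a 20 ms Speex frame, which is consistent with the 5 bits per frame and 250 bps figures given for DTX; the function names are hypothetical.

```python
FRAME_MS = 20  # assumed duration of one Speex frame, in milliseconds


def packet_duration_ms(frames_per_packet: int) -> int:
    """Amount of audio, in milliseconds, carried by one Ogg packet."""
    if frames_per_packet < 1:
        raise ValueError("frames_per_packet must be at least 1")
    return frames_per_packet * FRAME_MS


def packets_per_second(frames_per_packet: int) -> float:
    """Number of packets needed for one second of audio."""
    return 1000 / packet_duration_ms(frames_per_packet)


print(packet_duration_ms(1), packets_per_second(1))  # 20 50.0
print(packet_duration_ms(4), packets_per_second(4))  # 80 12.5
```

More frames per packet means fewer packets and less per-packet overhead, at the cost of each packet covering a longer stretch of audio.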
Denoise — The denoiser reduces the amount of background noise present in the input signal. Speech codecs (Speex included) tend to perform poorly on noisy input and can end up amplifying the noise. The denoiser greatly reduces this effect.
Automatic Gain Control (AGC) — AGC is a feature that deals with the fact that recording volume may vary by a large amount between different setups. The AGC provides a way to adjust a signal to a reference volume. This is useful for VoIP because it removes the need for manual adjustment of the microphone gain. A secondary advantage is that by setting the microphone gain to a conservative (low) level, it is easier to avoid clipping.
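The core idea of adjusting a signal to a reference volume can be sketched as a one-shot gain calculation. This is not the Speex preprocessor's algorithm, which adapts continuously over time; the RMS-based level measure, the reference level, the gain cap, and the function names are all assumptions chosen for illustration. Samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math


def rms(samples):
    """Root-mean-square level of a block of samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_to_reference(samples, reference_rms=0.1, max_gain=10.0):
    """Scale a block so its RMS level approaches reference_rms.

    The gain is capped at max_gain so near-silent input is not boosted
    without limit, and the output is clamped to [-1.0, 1.0] to avoid overflow.
    """
    level = rms(samples)
    if level == 0.0:
        return list(samples)
    gain = min(reference_rms / level, max_gain)
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


quiet = [0.01, -0.01] * 100
print(round(rms(normalize_to_reference(quiet)), 3))  # 0.1
```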
Voice Activity Detection (VAD) — When enabled, VAD detects whether the audio being encoded is speech or silence/background noise. VAD is always implicitly activated when encoding in VBR, so the option is only useful in non-VBR operation. In that case, Speex detects non-speech periods and encodes them with just enough bits to reproduce the background noise. This is called "comfort noise generation" (CNG).
Discontinuous Transmission (DTX) — DTX is an addition to VAD/VBR operation that allows the encoder to stop transmitting completely when the background noise is stationary. In file-based operation, the encoder cannot simply stop writing to the file, so it uses only 5 bits for each such frame (corresponding to 250 bps at 20 ms per frame).
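The 250 bps figure follows from the frame duration. The sketch below shows the arithmetic, assuming 20 ms frames (50 frames per second), which is the frame duration at which 5 bits per frame works out to 250 bps. The function name is hypothetical.

```python
def bitrate_from_frame_bits(bits_per_frame: int, frame_ms: int = 20) -> float:
    """Bitrate in bits per second for a fixed number of bits per frame."""
    if frame_ms <= 0:
        raise ValueError("frame_ms must be positive")
    # 1000 ms per second divided by the frame duration gives frames per second.
    return bits_per_frame * 1000 / frame_ms


print(bitrate_from_frame_bits(5))  # 250.0
```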