That the plain old telephone system will not endure unchanged over the next decade seems well accepted within the telecommunications industry. All of the major communications equipment manufacturers, including those whose primary business has been traditional telephony, have committed substantial resources to developing equipment for networks in which voice is carried as digital data, often compressed, along with nonvoice data over a common packet-switched infrastructure.
These networks are of various kinds, both wireline and wireless. Central to the thinking behind them is the assumption that, in the future, voice will constitute only a minor fraction of the total traffic to be carried. It will therefore be wisest, the reasoning goes, to optimize the networks for data communication and to fit voice in as well as possible.
That is easier said than done. While data networks are designed to provide large amounts of capacity whenever they can, they may delay data packets until such capacity is available. Such behavior is a poor match for voice traffic, which needs only modest capacity but can tolerate only short delays.
Before long, therefore, most of these data-oriented networks will be fitted with mechanisms for dedicating or reserving capacity for time- and loss-sensitive voice data. But, at present, few have such capability. As a result, voice conversations can be plagued by delays, echoes, and dropped fragments. These effects are aggravated by speech compression, which is implemented to conserve transmission capacity.
Natural speech is compressible because it has a lot of redundancy, so dropping a few packets of uncompressed speech may not affect the perceived quality very much. Compression, though, removes most of the redundancy, so every lost packet hurts. In any event, sending compressed speech over a data-oriented network with no special provisions for handling it generally degrades voice quality.
That loss of quality is one of the main barriers to widespread acceptance of voice-over-packet networks. Consumers, long trained by the public switched telephone network (PSTN) to expect what is known as toll-quality voice, will inevitably compare the new voice networks with the old one for quality. Manufacturers and network operators must therefore be able to make such comparisons as well.
But how? What exactly is voice quality, and how can it be quantified in an objective and reproducible manner? Subjective measures, like taking the mean of the opinions of a group of listeners, were an obvious starting point. Newer, more objective techniques, developed in Britain, the Netherlands and elsewhere, compare the received sound with the transmitted sound and score the differences. With some answers in hand, it is possible to pin down the factors that influence the perception of voice quality by the users of these networks, and the steps that can be taken to ensure that an acceptable level of quality is met.
Speech coding and compression
Both speech coding and compression have been used in the PSTN for over 20 years. With the exception of the local customer loop (normally analog, but increasingly digital), almost all voice is carried in digital format. In traditional phone networks, the standard method for converting analog voice signals to digital form is to sample them 8000 times per second and then encode each sample as an 8-bit binary word, as specified in detail in ITU-T standard G.711 (the ITU-T being the Telecommunication Standardization Sector of the Geneva-based International Telecommunication Union, a specialized agency of the United Nations). The result is the familiar 64-kb/s digital data stream known in telephony as a DS-0, the lowest rung in the digital signal hierarchy.
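The arithmetic behind the 64-kb/s figure, and the general idea of squeezing each sample into 8 bits, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not describe how G.711 maps samples to 8-bit words, so the sketch uses the idealized continuous mu-law curve with mu = 255 rather than the standard's actual piecewise-linear tables, and the function names and the 0-to-255 code layout are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 8000   # samples per second, as stated in the text
BITS_PER_SAMPLE = 8     # one 8-bit word per sample, as stated in the text
MU = 255                # assumption: mu-law constant; the text does not name it


def bit_rate(sample_rate_hz: int = SAMPLE_RATE_HZ,
             bits_per_sample: int = BITS_PER_SAMPLE) -> int:
    """Raw bit rate of one voice channel, in bits per second."""
    return sample_rate_hz * bits_per_sample


def mu_law_compress(x: float, mu: int = MU) -> float:
    """Idealized continuous mu-law curve; input and output lie in [-1, 1]."""
    x = max(-1.0, min(1.0, x))
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y: float, mu: int = MU) -> float:
    """Inverse of mu_law_compress."""
    y = max(-1.0, min(1.0, y))
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def encode_sample(x: float) -> int:
    """Map a sample in [-1, 1] to a code in 0..255.

    Hypothetical layout for illustration, not the G.711 bit format.
    """
    y = mu_law_compress(x)
    return min(255, int((y + 1.0) / 2.0 * 256))


def decode_sample(code: int) -> float:
    """Reconstruct an approximate sample from the midpoint of its code step."""
    y = (code + 0.5) / 256 * 2.0 - 1.0
    return mu_law_expand(y)


if __name__ == "__main__":
    print(bit_rate())                      # 8000 samples/s x 8 bits
    for x in (0.01, 0.5, -0.5):
        code = encode_sample(x)
        print(x, code, decode_sample(code))
```

The logarithmic curve spends more of the 256 available codes on quiet samples than on loud ones, which is why 8 bits per sample can be enough for speech.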
Most engineers associate speech compression with digital signal processing, but it has actually been practiced in the telephone network since the all-analog era. In the early days it took the form of low-pass filtering of the analog signal. Although the human auditory range is generally regarded as extending up to about 20 kHz, telephone networks band-limit voice signals to roughly the bottom 4 kHz. Doing so greatly simplifies the design (and lowers the cost) of low-noise amplifiers and network equalization circuitry, but not without compromising speech quality. Although most of the energy in human speech falls below 4 kHz, the small fraction at higher frequencies does affect intelligibility: just try to distinguish between the words "fine" and "sign" over the telephone.
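The 4-kHz band limit also fits the 8000-samples-per-second rate of digital telephony. The text does not spell this out, but it is a standard result of sampling theory that a sampled signal can represent only frequencies below half the sampling rate. The short Python sketch below, with hypothetical function and variable names, illustrates the point: a 5-kHz tone sampled 8000 times per second yields the same samples as a 3-kHz tone, so any energy above 4 kHz that was not filtered out would masquerade as in-band sound.

```python
import math

SAMPLE_RATE_HZ = 8000  # telephony sampling rate, as stated in the text


def sample_tone(freq_hz: float, n_samples: int = 16,
                sample_rate_hz: int = SAMPLE_RATE_HZ) -> list:
    """Return n_samples of a unit-amplitude cosine at freq_hz."""
    return [math.cos(2.0 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]


in_band = sample_tone(3000)      # below the 4-kHz limit
out_of_band = sample_tone(5000)  # above the 4-kHz limit

# Largest sample-by-sample difference between the two sampled tones.
max_diff = max(abs(a - b) for a, b in zip(in_band, out_of_band))

if __name__ == "__main__":
    print(max_diff)  # differs from zero only by floating-point rounding
```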