Mel scale

The mel scale (after the word melody)^[1] is a perceptual scale of pitches judged by listeners to be equal in distance from one another. The reference point between this scale and normal frequency measurement is defined by assigning a perceptual pitch of 1000 mels to a 1000 Hz tone, 40 dB above the listener's threshold. Above about 500 Hz, increasingly large intervals are judged by listeners to produce equal pitch increments.

Formula

A formula (O'Shaughnessy 1987) to convert f hertz into m mels is^[2] $m=2595\log _{10}\left(1+{\frac {f}{700}}\right).$

Mel-scale from 200 to 1500, in intervals of 50

History and other formulas

The formula from O'Shaughnessy's book can be expressed with different logarithmic bases: $m=2595\log _{10}\left(1+{\frac {f}{700}}\right)=1127\ln \left(1+{\frac {f}{700}}\right).$

The corresponding inverse expressions are $f=700\left(10^{\frac {m}{2595}}-1\right)=700\left(e^{\frac {m}{1127}}-1\right).$
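
The forward and inverse expressions above translate directly into code. The following is a minimal sketch; the function names `hz_to_mel` and `mel_to_hz` are illustrative choices, not taken from any particular library.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Convert frequency in hertz to mels (700 Hz break, base-10 form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m_mel: float) -> float:
    """Inverse of hz_to_mel: convert mels back to hertz."""
    return 700.0 * (10.0 ** (m_mel / 2595.0) - 1.0)


def hz_to_mel_ln(f_hz: float) -> float:
    """Same conversion written with the natural logarithm (constant 1127)."""
    return 1127.0 * math.log(1.0 + f_hz / 700.0)
```

Because 2595 and 1127 are both rounded constants, the base-10 and natural-log forms agree only approximately, and a 1000 Hz tone maps to a value very close to, but not exactly, 1000 mels.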

Curves and tables of psychophysical pitch scales have been published since Steinberg's 1937 curves,^[3] which were based on just-noticeable differences of pitch. More curves soon followed in papers by Fletcher and Munson (1937),^[4] Fletcher (1938),^[5] Stevens et al. (1937),^[1] and Stevens and Volkmann (1940),^[6] using a variety of experimental methods and analysis approaches.

In 1949 Koenig published an approximation based on separate linear and logarithmic segments, with a break at 1000 Hz.^[7]

Gunnar Fant proposed the currently popular linear/logarithmic formula in 1949, but with a 1000 Hz corner frequency.^[8]

An alternate expression of the formula, not depending on choice of logarithm base, is noted in Fant (1968):^[9]^[10] $m={\frac {1000}{\log 2}}\log \left(1+{\frac {f}{1000}}\right).$

In 1976, Makhoul and Cosell published the now-popular version with the 700 Hz corner frequency.^[11] As Ganchev et al. have observed, "The formulae [with 700], when compared to [Fant's with 1000], provide a closer approximation of the Mel scale for frequencies below 1000 Hz, at the price of higher inaccuracy for frequencies higher than 1000 Hz."^[12] Above 7 kHz, however, the situation is reversed, and the 700 Hz version again fits better.

Data by which some of these formulas are motivated are tabulated in Beranek (1949), as measured from the curves of Stevens and Volkmann:^[13]

Beranek 1949 mel scale data from Stevens and Volkmann 1940
Hz	20	160	394	670	1000	1420	1900	2450	3120	4000	5100	6600	9000	14000
mel	0	250	500	750	1000	1250	1500	1750	2000	2250	2500	2750	3000	3250
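
One way to examine how well the closed-form formulas follow these tabulated points is to compute the error of each formula over a frequency range. The sketch below does this for the 700 Hz formula and Fant's 1000 Hz formula; the helper `mean_abs_error` and the choice of mean absolute error as the measure are assumptions made for illustration, not a method taken from the cited sources.

```python
import math

# Beranek (1949) data as tabulated above.
BERANEK_HZ = [20, 160, 394, 670, 1000, 1420, 1900, 2450,
              3120, 4000, 5100, 6600, 9000, 14000]
BERANEK_MEL = [0, 250, 500, 750, 1000, 1250, 1500, 1750,
               2000, 2250, 2500, 2750, 3000, 3250]


def mel_700(f_hz: float) -> float:
    """Formula with the 700 Hz corner frequency."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_1000(f_hz: float) -> float:
    """Fant's formula with the 1000 Hz corner frequency."""
    return 1000.0 / math.log(2.0) * math.log(1.0 + f_hz / 1000.0)


def mean_abs_error(formula, lo_hz: float, hi_hz: float) -> float:
    """Mean absolute error in mels over table points with lo_hz <= f < hi_hz."""
    errors = [abs(formula(f) - m)
              for f, m in zip(BERANEK_HZ, BERANEK_MEL)
              if lo_hz <= f < hi_hz]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    for name, formula in (("700 Hz", mel_700), ("1000 Hz", mel_1000)):
        print(name,
              round(mean_abs_error(formula, 0, 1000), 1),
              round(mean_abs_error(formula, 7000, 20000), 1))
```

On these 14 points the 700 Hz form has the smaller mean error both below 1000 Hz and above 7 kHz, consistent with the remarks quoted earlier; the result depends on the error measure and on how the table is split into ranges.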

Lindsay & Norman (1977) give a formula with a break frequency of 625 Hz, which does not appear in their 1972 first edition:^[14] $m=2410\log _{10}(0.0016f+1).$

For direct comparison with other formulae, this is equivalent to $m=2410\log _{10}\left(1+{\frac {f}{625}}\right).$

Most mel-scale formulas are designed to give 1000 mels at 1000 Hz; with the rounded constants shown above, the result is within about 0.1 mel of 1000. Given that constraint, the break frequency (e.g. 700 Hz, 1000 Hz, or 625 Hz) is the only free parameter in the usual form of the formula. Some non-mel auditory-frequency-scale formulas use the same form but with much lower break frequency, not necessarily mapping to 1000 at 1000 Hz; for example the ERB-rate scale of Glasberg and Moore (1990) uses a break point of 228.8 Hz,^[15] and the cochlear frequency–place map of Greenwood (1990) uses 165.3 Hz.^[16]
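
To illustrate why the break frequency is the only free parameter: once a break frequency b is chosen, requiring 1000 mels at 1000 Hz fixes the multiplier as 1000 / log10(1 + 1000/b). The sketch below computes that multiplier; the names `mel_scale_constant` and `hz_to_mel_general` are hypothetical and used only for this example.

```python
import math


def mel_scale_constant(break_hz: float) -> float:
    """Base-10 multiplier that pins a 1000 Hz tone to 1000 mels."""
    return 1000.0 / math.log10(1.0 + 1000.0 / break_hz)


def hz_to_mel_general(f_hz: float, break_hz: float) -> float:
    """Mel value for an arbitrary break frequency, normalized at 1000 Hz."""
    return mel_scale_constant(break_hz) * math.log10(1.0 + f_hz / break_hz)


if __name__ == "__main__":
    # Break frequencies mentioned in the text.
    for b in (625.0, 700.0, 1000.0):
        print(b, round(mel_scale_constant(b), 2))
```

For break frequencies of 700 Hz and 625 Hz the computed multipliers round to the published constants 2595 and 2410, and for 1000 Hz the multiplier equals 1000 / log10(2), which is Fant's formula written in base 10.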

Other functional forms for the mel scale have been explored by Umesh et al.; they point out that the traditional formulas with a logarithmic region and a linear region do not fit the data from Stevens and Volkmann's curves as well as some other forms, based on the following data table of measurements that they made from those curves:^[17]

Umesh et al. 1999 mel scale data from Stevens and Volkmann 1940
Hz	40	161	200	404	693	867	1000	2022	3000	3393	4109	5526	6500	7743	12000
mel	43	257	300	514	771	928	1000	1542	2000	2142	2314	2600	2771	2914	3228

Slaney's MATLAB Auditory Toolbox agrees with Umesh et al. and uses the following two-piece fit, though notably not using the "1000 mels at 1000 Hz" convention:^[18] $m(f)={\begin{cases}{\dfrac {3f}{200}},&f<1000,\\15+27\log _{6.4}\left({\dfrac {f}{1000}}\right),&f\geq 1000.\end{cases}}$
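
The two-piece fit above can be written as a short function. This is a direct transcription of the formula as given here, not a copy of the Auditory Toolbox or librosa source; the name `hz_to_mel_slaney` is an illustrative choice.

```python
import math


def hz_to_mel_slaney(f_hz: float) -> float:
    """Two-piece fit: linear below 1000 Hz, logarithmic at and above it."""
    if f_hz < 1000.0:
        # Linear region: 3 mels per 200 Hz.
        return 3.0 * f_hz / 200.0
    # Logarithmic region: 27 mels for every factor of 6.4 in frequency.
    return 15.0 + 27.0 * math.log(f_hz / 1000.0) / math.log(6.4)
```

The two branches meet at 1000 Hz, where both give 15, so the mapping is continuous; a 1000 Hz tone therefore maps to 15 rather than 1000 in this convention.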

Applications

The first version of Google's Lyra codec uses log mel spectrograms as the feature-extraction step. The transmitted data is a vector-quantized form of the spectrogram, which is then synthesized back to speech by a neural network. Use of the mel scale is believed to weight the data in a way appropriate to human perception.^[19] MelGAN takes a similar approach.^[20]
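
As a rough illustration of how a mel spectrogram allocates its frequency bands, the sketch below places band edges at equal steps on the mel scale and converts them back to hertz. It is not the Lyra or MelGAN implementation; the 700 Hz formula, the function name `mel_band_edges`, and the band count and frequency range in the usage lines are all assumptions chosen for the example.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """700 Hz break-frequency formula (assumed here for illustration)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m_mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m_mel / 2595.0) - 1.0)


def mel_band_edges(n_bands: int, f_min_hz: float, f_max_hz: float) -> list:
    """Return n_bands + 2 edge frequencies in Hz, equally spaced in mels."""
    m_lo = hz_to_mel(f_min_hz)
    m_hi = hz_to_mel(f_max_hz)
    step = (m_hi - m_lo) / (n_bands + 1)
    return [mel_to_hz(m_lo + i * step) for i in range(n_bands + 2)]


if __name__ == "__main__":
    # Hypothetical settings: 10 bands covering 0 to 8000 Hz.
    for edge in mel_band_edges(10, 0.0, 8000.0):
        print(round(edge, 1))
```

The edges are evenly spaced in mels but grow progressively farther apart in hertz, so low frequencies are resolved more finely than high ones.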

Criticism

Stevens' student Donald D. Greenwood, who had worked on the mel scale experiments in 1956, considers the scale biased by experimental flaws. In 2009 he posted to a mailing list:^[21]

I would ask, why use the Mel scale now, since it appears to be biased? If anyone wants a Mel scale, they should do it over, controlling carefully for order bias and using plenty of subjects – more than in the past – and using both musicians and non-musicians to search for any differences in performance that may be governed by musician/non-musician differences or subject differences generally.

References

[1] Stevens, Stanley Smith; Volkmann, John; Newman, Edwin B. (1937). "A scale for the measurement of the psychological magnitude pitch". Journal of the Acoustical Society of America. 8 (3): 185–190. Bibcode:1937ASAJ....8..185S. doi:10.1121/1.1915893. Archived from the original on 2013-04-14.
[2] Douglas O'Shaughnessy (1987). Speech communication: human and machine. Addison-Wesley. p. 150. ISBN 978-0-201-16520-3.
[3] John C. Steinberg (1937). "Positions of stimulation in the cochlea by pure tones". Journal of the Acoustical Society of America. 8 (3): 176–180. Bibcode:1937ASAJ....8..176S. doi:10.1121/1.1915891.
[4] Harvey Fletcher; W. A. Munson (1937). "Relation Between Loudness and Masking". Journal of the Acoustical Society of America. 9 (1): 1–10. Bibcode:1937ASAJ....9....1F. doi:10.1121/1.1915904.
[5] Harvey Fletcher (1938). "Loudness, Masking and Their Relation to the Hearing Process and the Problem of Noise Measurement". Journal of the Acoustical Society of America. 9 (4): 275–293. Bibcode:1938ASAJ....9..275F. doi:10.1121/1.1915935.
[6] Stevens, S.; Volkmann, J. (1940). "The Relation of Pitch to Frequency: A Revised Scale". American Journal of Psychology. 53 (3): 329–353. doi:10.2307/1417526. JSTOR 1417526.
[7] W. Koenig (1949). "A new frequency scale for acoustic measurements". Bell Telephone Laboratory Record. 27: 299–301.
[8] Gunnar Fant (1949). "Analys av de svenska konsonantljuden : talets allmänna svängningsstruktur". LM Ericsson protokoll H/P 1064.
[9] Fant, Gunnar (1968). Analysis and synthesis of speech processes. In B. Malmberg (ed.), Manual of phonetics (pp. 173–177). Amsterdam: North-Holland.
[10] Jonathan Harrington; Steve Cassidy (1999). Techniques in speech acoustics. Springer. p. 18. ISBN 978-0-7923-5731-5.
[11] John Makhoul; Lynn Cosell (1976). "LPCW: An LPC vocoder with linear predictive spectral warping". ICASSP '76. IEEE International Conference on Acoustics, Speech, and Signal Processing. Vol. 1. IEEE. pp. 466–469. doi:10.1109/ICASSP.1976.1170013.
[12] T. Ganchev; N. Fakotakis; G. Kokkinakis (2005). "Comparative evaluation of various MFCC implementations on the speaker verification task". Proceedings of the SPECOM-2005. pp. 191–194. CiteSeerX 10.1.1.75.8303.
[13] Beranek, Leo L. (1949). Acoustic measurements. New York: McGraw-Hill.
[14] Lindsay, Peter H.; Norman, Donald A. (1977). Human information processing: An introduction to psychology (2nd ed.). New York: Academic Press.
[15] B. C. J. Moore; B. R. Glasberg (1983). "Suggested formulae for calculating auditory-filter bandwidths and excitation patterns". Journal of the Acoustical Society of America. 74: 750–753.
[16] Greenwood, D. D. (1990). "A cochlear frequency–position function for several species—29 years later". The Journal of the Acoustical Society of America. 87: 2592–2605.
[17] Umesh, S.; Cohen, L.; Nelson, D. (1999). "Fitting the mel scale". Proc. ICASSP 1999. pp. 217–220. doi:10.1109/ICASSP.1999.758101. ISBN 978-0-7803-5041-0.
[18] Slaney, M. (1998). Auditory Toolbox: A MATLAB Toolbox for Auditory Modeling Work. Technical Report, version 2, Interval Research Corporation. Translated to Python in librosa (librosa documentation).
[19] "Lyra: A New Very Low-Bitrate Codec for Speech Compression". ai.googleblog.com. 25 February 2021. See also: arXiv:2102.11906, arXiv:2102.09660.
[20] Kumar, Kundan; Kumar, Rithesh; de Boissiere, Thibault; Gestin, Lucas; Teoh, Wei Zhen; Sotelo, Jose; de Brebisson, Alexandre; Bengio, Yoshua; Courville, Aaron (8 December 2019). "MelGAN: generative adversarial networks for conditional waveform synthesis". Proceedings of the 33rd International Conference on Neural Information Processing Systems. Curran Associates Inc.: 14910–14921.
[21] "Archived copy". Archived from the original on 2013-02-08. Retrieved 2012-12-12.

External links

Volkmann, J; Stevens, SS; Newman, EB (1937). "A scale for the measurement of the psychological magnitude pitch". The Journal of the Acoustical Society of America. 8 (3): 208. Bibcode:1937ASAJ....8..208V. doi:10.1121/1.1901999.
Handbook for Acoustic Ecology
