AES Headphone Technology Conference Highlight Paper

Oh my, this paper sure throws a wrench into my mental monkey-works.

In it, Günther Theile opines about how to develop the standardized EQ and signal processing needed to deliver a tonally neutral headphone listening experience for audio professionals. There are some very interesting tidbits of information here...but the conclusion will be somewhat troubling for headphone audio purists.

Equalization of Studio Monitor Headphones
[Image: AESHeadphoneConference_Highlight_Photo_Imaging]

The frequency response of high-quality studio monitor headphones should provide the same sound colour neutrality as demanded for loudspeaker monitoring in listening rooms according to ITU-Rec. BS 1116. This is obtained by a probe measured frequency-independent diffuse-field transfer function in accordance with ITU-Rec. BS 708. Spectrum level based calibration requires a reference sound field that provides sufficient diffusity as well as a flat frequency response in order to avoid coloration. Headphone manufacturers are interested in an attractive sound designed in accordance with actual preferences of consumers. Alternative target responses have been designed to simulate what a listener hears from a high-quality multichannel loudspeaker system in a reference listening room (in-room equalization). It is shown that this intention can only be realized with binaural room synthesis implementation that ensures accurate binaural rendering of the spatial cues, ideally including head tracking and personalization methods. A corresponding suitable standard based on a neutral listening room is desirable, not least in view of multichannel sound headphone reproduction. The virtual 3D listening room would avoid inadequate in-head perception of suboptimal two-channel stereo downmix material. Instead, it would ensure the intended perception of the recording in terms of space and colour. However, alternative in-room based equalization target curves should be documented with measures according to ITU-Rec. BS 708 Annex 2 that offer clear information for the assessment of tone colour, as well as comparability of headphone frequency responses.

As many readers will know, I've recently spent some time measuring my head in Harman's killer listening room in an effort to come up with a target response curve specific to my dummy head. Well...sadly...this paper points out that such an effort might be in vain. Let's work our way through some of the interesting points of Theile's paper.

Imaging - Here's one I didn't know: it turns out that while headphone imaging is inside the head, it has been found to be more precise than that of speakers. In the plot above, speakers at 3 meters (normal room listening), speakers at 1 meter (near-field listening), and headphones were evaluated for the listener's ability to precisely locate a sound within the normal stereo image. Nothing more to say on this...just thought it was cool.

In-Head Localization with Headphones - This seems to me one of those things that's so obvious we never bother to think about it, but the answer is quite interesting. Theile asserts that each ear is hearing the sound properly: if you listen to only one ear-piece, you correctly hear the sound as at the ear, zero distance away. Then, when you engage both ears, you essentially create a phantom image between the two sources, just like with stereo speakers; except in this case, with the sources at your ears, the phantom image ends up in your head.
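The two-sources-at-the-ears explanation above is the same phantom-image mechanism a stereo pan pot exploits. Here's a minimal sketch of the constant-power pan law (my own illustration, not from Theile's paper) showing how trading level between two transducers places a phantom source between them:

```python
import math

def constant_power_pan(position):
    """Left/right gains placing a phantom image between two sources.

    position: -1.0 (hard left) through +1.0 (hard right).
    The squared gains always sum to 1, so total acoustic power
    (and thus perceived loudness) stays constant as the image moves.
    """
    theta = (position + 1.0) * math.pi / 4.0  # map -1..+1 to 0..pi/2
    return math.cos(theta), math.sin(theta)

# A centered sound sends equal level to both transducers; with
# speakers the image appears midway between them, but with
# earpieces that "midway" point is inside your head.
gain_l, gain_r = constant_power_pan(0.0)
```

With speakers the phantom source floats between the cabinets; with headphones the endpoints of that pan are your own ears, which is exactly why the image lands in-head.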

Tone Color - This is where things get rough. For a real sound source outside your head you get two types of information: spatial (what angle and distance to the sound) and tonal (the sound of the sound).

[Image: AESHeadphoneConference_Highlight_Photo_Gestalt]

The diagram above shows a model of what happens to the original sound as it becomes perceived by the listener. First, the sound impinges on the outer ear and becomes "spatially encoded"—it gains some reflections and colorations that make up the psychoacoustic cues you need to determine the location of the source. Then the signals from both ears go into the brain for spatial decoding to determine the direction and distance to the source.

Then a weird thing happens: in the "Gestalt determining stage" the brain determines the location of the source, then removes all the tonal, level, and timing information associated with that location. It then passes the now cleaned-up signal on to become the perceived auditory event. Said another way, the brain knows how to remove all effects of the pinna and body reflections so that you perceive the sound as it actually is, rather than the pinna-filtered signal that physically arrives at your eardrums. Amazing!

Sound-pressure Level Divergence (SLD) - Okay, if that was weird, this is weirder. It turns out that the mind perceives loudness differently depending on the nature of the sound field. Basically, for the same perceived loudness in free-field, diffuse field, and headphone listening, different levels will be measured in the ear canal.

[Image: AESHeadphoneConference_Highlight_Photo_SLD]

I'm sure we can all relate to the idea that we tend to play headphones louder than speakers for the same perceived listening level. The plot above shows that the difference in level is about 4 dB, and also shows that the SLD varies with frequency.

What that means for me and my dummy head, trying to make a target response curve from speaker measurements, is that whatever curve I get, I now need to adjust it by the SLD plot above (maybe?) to make it perceptually correct for headphone listening.
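To make that adjustment concrete, here's a toy sketch of what applying an SLD correction to a speaker-derived target curve might look like. Every number here is a hypothetical placeholder; the real values would have to be read off Theile's SLD plot, and even the sign of the correction depends on your measurement convention:

```python
import numpy as np

# Hypothetical SLD values in dB -- placeholders only; the real
# numbers would have to be read off the plot in Theile's paper.
sld_freqs_hz = np.array([100.0, 500.0, 1000.0, 4000.0, 10000.0])
sld_db       = np.array([  3.0,   3.5,    4.0,    4.5,    4.0])

def apply_sld(freqs_hz, speaker_target_db):
    """Shift a speaker-derived target curve by the interpolated SLD.

    The correction is added here, reflecting the observation that
    the ear canal measures a higher level on headphones for the
    same perceived loudness; flip the sign if your measurement
    convention runs the other way.
    """
    correction_db = np.interp(freqs_hz, sld_freqs_hz, sld_db)
    return speaker_target_db + correction_db

freqs = np.array([100.0, 1000.0, 10000.0])
headphone_target_db = apply_sld(freqs, np.zeros_like(freqs))
```

The point is only that the correction is frequency-dependent, so a single broadband level offset won't do it.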

(Next time I'm in L.A. I'm going to buy Sean Olive a really nice dinner and pick his brain like an opal miner.)

Theile's Conclusions - Because of all these perceptual problems with headphones, it is exceedingly difficult to make headphones sound like speakers. Just making sure the signal in the ear canal is the same in both speaker and headphone listening will not ensure that the listener has the same experience in both cases. Theile sums it up this way:

Simple loudness based calibration does not replicate the original complex outer ear transfer functions given in the reference configuration.

As shown in the previous sections, this is evident also for spectrum level based calibration targeting to simulate what a listener hears from a high-quality multichannel loudspeaker system in a reference listening room (so-called in-room equalization [19]). The result is in-head locatedness, which means that certain ear signal spectrum characteristics generated from individual loudspeakers in the room do not contribute to the spatial perception process but instead degrade colour neutrality.

His conclusion is that the only way to make very accurate studio monitor headphones is to first tune the headphone to the diffuse field response—as it delivers the least linear distortion in the transducer/ear interface and will be able to most accurately play an incoming signal for the ear. Then, using digital signal processing (DSP), create a fake room using binaural room impulse response information for a high acoustic quality listening room. Then create virtual speakers in that room to play the sound. Then add a head tracker and a bunch of HRTF data so that you can move your head normally and hear the cues change—because your brain won't be reliably fooled if you don't. Research shows that if you do all these things, only then can you properly perceive tonal neutrality on headphones.
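The core operation in that DSP chain is convolving each channel with a binaural room impulse response (BRIR) measured for a virtual speaker position in the synthesized room. Here's a numpy-only toy of that one step; all names are my own, and a real system would use measured BRIRs, low-latency partitioned convolution, and would swap BRIR sets as the head tracker reports rotation:

```python
import numpy as np

def render_virtual_speaker(mono, brir_left, brir_right):
    """Place a mono channel at a virtual loudspeaker position by
    convolving it with that position's binaural room impulse
    response (one IR per ear). Returns (left_ear, right_ear)."""
    return np.convolve(mono, brir_left), np.convolve(mono, brir_right)

def render_stereo(left_ch, right_ch, brirs):
    """Render two virtual speakers and mix them at the ears.
    brirs maps 'L'/'R' to a (left-ear IR, right-ear IR) pair."""
    ll, lr = render_virtual_speaker(left_ch, *brirs['L'])
    rl, rr = render_virtual_speaker(right_ch, *brirs['R'])
    n = max(len(ll), len(rl))
    out = np.zeros((2, n))
    out[0, :len(ll)] += ll  # left ear hears both virtual speakers...
    out[0, :len(rl)] += rl
    out[1, :len(lr)] += lr  # ...and so does the right ear
    out[1, :len(rr)] += rr
    return out
```

Note that each ear receives sound from both virtual speakers, just as it would in a real room; that crosstalk is part of what's missing in plain headphone playback.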

My Conclusions - What this means to me is that manufacturers will be developing high-end headphones for audio pros that have diffuse field equalization so that they can be hooked up to electronics that will do all the DSP for virtualization of the audio. And, sadly, diffuse field equalization sounds pretty bad without all the DSP.

There will be a push from manufacturers of high-end headphone gear to make headphones that sound tonally neutral by adding DSP, because it's almost impossible to do it with a passive headphone due to these perceptual problems.

Oy vey. I have a fair bit of skepticism that that much signal processing will ever deliver the type of resolution and nuance an audiophile desires...tonally neutral or not.

COMMENTS
Synnove's picture

Isn't this basically what the Smyth Realiser does already? EQs headphones, simulates room/speakers via custom HRTFs, has headtracking...

If the legacy of their A8, and the impressions and improvements of the A16* are any indication, you need not have such skepticism as expressed in your last paragraph ; ).

*I find it very odd that the A16 hasn't had any coverage on this website for some reason, especially given the rather amazing developments it includes.

ADU's picture

I think we briefly touched on some of these difficulties in the other recent discussions on the Harman curve and new target curves. We've pretty much always known that a simple room or compensation curve is not going to do the whole job. But it's still a step in the right direction imo.

Creating a DSP system that can take into account most of the sound characteristics of speakers in a room will be a difficult and complex (but not necessarily impossible) task. How, for example, could you take into account the differences in individual ear physiology, and how that might affect the "encoding" and subsequent processing of the sound, without some kind of system for modeling each individual user's ears in the DSP? (Yikes!)

Also, a virtualized headphone system like the one described in the article above would only make sense for most users if it was more convenient and cost-effective than using actual speakers in a room. If you shop wisely, you can buy a reasonably flat set of speakers and a subwoofer for less than $500.

I'll leave it there for now. :) But I'm sure I'll have more thoughts on this subject, and will be interested to hear WarrenT and others' takes on this.

wktenbrook's picture

I think it's still very useful to compare dummy head/torso eardrum reference point response from a reference loudspeaker/room versus headphones, but there are many obvious caveats. Lack of whole body response to deep bass, inside the head image, and inability to use natural small changes in head position to both localize and characterize sound all contribute to the 'artificiality,' or at least the unique character, of headphone listening. But despite these differences, we still love headphone listening (and speakers too).

Whatever we can practically do at reasonable expense to match the characteristics of reference speakers and headphones is a step in the right direction. The dummy head measurements are a milestone on the path, but this paper warns us not to get overconfident that the problem is solved by frequency response measurements alone; still, understanding the FR measurements is a necessary beginning.

jgazal's picture

Dr. Choueiri says:
1.
"There are a number of methods for generating 3D soundfields from loudspeakers. The three most promising are 1) Ambisonics, 2) Wave Field Synthesis and 3) Binaural Audio through Two Loudspeakers (BA2L)."
https://www.princeton.edu/3D3A/BACCH_intro.html
2.
"Pure Stereo shines at reproducing binaural recordings through two loudspeakers and gives an uncannily accurate 3D reproduction that is far more stable and realistic than that obtained by playing binaural recordings through headphones 17.
17 This is because binaural playback through headphones or earphones is very prone to head internalization of sound (which means that the sound is perceived to be inside the head) and requires, in order to avoid this problem, an excellent match between the geometric features of the head of the listener and those of the dummy head with which the recording was made (this problem has been recently surmounted by the Smyth headphones technology http://www.smyth-research.com/). Pure Stereo does not suffer from this problem as the sound is played back though loudspeakers far from the listener’s ears."
https://www.princeton.edu/3D3A/Publications/Pure_Stereo.pdf
3.
"Not only does Pure Stereo provide a shocking improvement to the spatial realism of sound reproduction, but the same digital filter used in Pure Stereo also corrects, in both the frequency and time domains, most non-idealities in the playback chain (including loudspeaker coloration and resonances, listening room modes, spatial comb filtering, balance differences between channels, etc...) so that the frequency and impulse responses at the listener’s ears are as close to ideal as possible for a given listening room and hi-fi system."
Tyll, have you compared binaural recordings played back with Dr. Choueiri filters and with a Smyth Realiser using a PRIR set for crosstalk cancellation? I would like to hear your opinion.

kais's picture

It seems every approach I've ever heard of was to simulate loudspeaker experience on headphones.
Loudspeakers of the better kind try to present the "natural" sound of recorded instruments, but usually recordings, by themselves, do not try to be the real thing, but something like nice "photographs".
Usual microphone setups for recording simply do not catch spatial information the way our ears do.
So loudspeakers will never sound like the real thing.
So we get:
Recording (unnatural) -> loudspeaker reproduction (unnatural) -> loudspeaker emulation (imperfect) -> headphone reproduction.

Headphones, on the other hand, combined with proper technologies like dummy head recording, could have the capability to approach the real thing much closer than any loudspeaker arrangement ever could.
The question would be: how can the stationary dummy head's HRTF be transformed into the individual listener's head-tracked HRTF?

Serious's picture

"It turns out that the mind perceives loudness differently depending on the nature of the sound field. Basically, for the same perceived loudness in free-field, diffuse field, and headphone listening, different levels will be measured in the ear canal."
Yup. The FR differences seem about right to me, but I'm not too sure about the ~2 dB hump between 1-2kHz. I wouldn't worry too much about the smaller differences below 1kHz, as this depends a lot on the specific speaker setup.
The main takeaway to me here seems to be that headphones need a different measured FR at the ear drum than speakers to subjectively sound the same, something that I also saw in my own ear canal opening measurements.
I wrote about it on SBAF, I think this was one of the first posts:
http://www.superbestaudiofriends.org/index.php?threads/what-is-neutral-f...

With the ear canal entrance measurements, the difference will most likely not look the same. I find that flat works well as a target for headphones, while speakers will have the typical 3-4kHz centered ear gain (about 10 dB). A small dip in the speaker response may be a good idea here.

Overall I personally wouldn't worry too much about measuring speakers and headphones and instead would try to tweak or EQ headphones to sound similar to a good speaker setup and then measure the response. There will never be perfect accuracy and I think all the methods here are flawed. I feel the dummy head is a powerful tool, it just needs a better compensation curve than ID.
These plots (http://www.innerfidelity.com/content/first-test-estimated-harman-target-...) already looked much better to me than the ID compensation from 500Hz on, but IMO the 1kHz region should be a little more forward (1-2 dB) and the 3kHz region a little more laid-back here (also about 2 dB, I guess). I hear phones such as the SR009 and even the Elear as forward around 1-2kHz relative to 500Hz.

The IEM measurements probably also need a very different compensation. I hear TWFK equipped IEMs as generally relatively neutral, much more so than the compensated plots would suggest and I don't like the DF target. For the IEMs the uncompensated plots are actually much closer to how I hear things than compensated right now.

johnjen's picture

It seems to me that the cross correlation issue (more precise localization) is also influenced by the near field effect due to loss of signal resolution as the driver to ear distance increases.
Which implies that speakers simply can't deliver the same degree of tightly coupled acoustic energy that HP's can.

IOW besides the fact of the added distortion due to the increase in displacement of the driver(s) themselves in speakers, headphones are at a distinct advantage since they are 'right next to' our ears and 'over there' in the room.

These 2 interrelated aspects (distance from signal source and distortion byproducts due to greater diaphragm displacement) seem like they would explain most of why HP's are not just more resolving but can allow for increased localization of the perceived source of any 'Voice'.

Simply put the acoustic energy we hear is more 'accurate' WRT the original signal where these acoustical cues originate.

JJ

ultrabike's picture

Thanks for bringing this stuff to light Tyll!

JMB's picture

This is a very interesting article. First, our senses adapt, and normally we always hear with our outer ear (except in the rare case it has been lost), and that is the baseline for our auditory perception; but different headphone types eliminate the function of the outer ear to different degrees (IEM, on-ear, and circumaural). Should not the target curve vary according to the amount the outer ear is contributing? There are very few headphones which leave an open space around the ear, like the AKG K1000 or Float QA. As a student I owned something similar from MB, looking like a 1st-generation Float but with dynamic drivers. They did not have much of an in-your-head sound (but tended to fall off my head). Why are there not more "headphones" like these, which are almost between standard headphones and nearfield monitors?

As a measurement reference, I think a single sound source (which can be a speaker, to make it reproducible) in a room would be better than any setup of speakers, which themselves only try to simulate sound (in 3D) in a room.

ADU's picture
Quote:

First, our senses adapt, and normally we always hear with our outer ear (except in the rare case it has been lost), and that is the baseline for our auditory perception; but different headphone types eliminate the function of the outer ear to different degrees (IEM, on-ear, and circumaural). Should not the target curve vary according to the amount the outer ear is contributing? There are very few headphones which leave an open space around the ear, like the AKG K1000 or Float QA. As a student I owned something similar from MB, looking like a 1st-generation Float but with dynamic drivers. They did not have much of an in-your-head sound (but tended to fall off my head). Why are there not more "headphones" like these, which are almost between standard headphones and nearfield monitors?

True circumaural headphones should have some contribution from the outer ear. Several of the Full-sized Open and Full-sized Closed headphones on the IF Wall-of-Fame would probably fall into that category. And the Senn HD 380 Pro would also qualify imo...

http://www.innerfidelity.com/content/innerfidelitys-wall-fame

A lot of so-called "over-the-ear" headphones will partially cover the helix and lobe though, which can get rather painful after a while if they clamp too tightly. (The AT M50x is a good example imo.)

The mechanics of how sound reaches your ears is completely different in a pair of headphones than in a room though. So the role that the outer ear plays in each case is also different, and not necessarily analogous. That's one reason headphones are generally measured at the eardrum... to take any inner and outer-ear related effects into account.

Here are a couple of images Tyll often likes to use to illustrate the contributions from different parts of the ear btw...

http://cdn.innerfidelity.com/images/Headphone101_InterpretingFrequencyRe...

http://cdn.innerfidelity.com/images/Headphone101_InterpretingFrequencyRe...

Not sure if these are free or diffuse-field measurements, but my guess would be the latter.

Quote:

As a measurement reference I think a single sound source (which can be a speaker to make it reproducible) in a room would be better than any setup of speakers which only themselves try to simulate sound(in 3d) in a room.

I partially agree with this. The position, distance, and angle of the speakers all have an effect on the final sound that you hear though. So I think a standard two-speaker arrangement (with monophonic test signals?) might be the better way to go, since that's the way most music is mastered.

I'm less certain how to approach multi-channel sources.
