AES Headphone Technology Conference: Head Related Transfer Function

The head-related transfer function (HRTF) is a measure of how sound is changed by the direction from which it arrives, and by the shape of your body and ears, as it travels to the eardrum. It is traditionally measured by clamping a subject's head in place with microphones inserted at the entrance of the ear canals, then moving a speaker playing test signals to dozens, if not hundreds, of positions around the head, recording the response at each position. The HRTF data set produced is quite large, and the procedure is obviously time consuming and expensive.

Fast Continuous Acquisition of HRTF for Human Subjects with Unconstrained Random Head Movements in Azimuth and Elevation
[Image: rapid HRTF acquisition measurement setup]

The head-related transfer function (HRTF), which is unique for every individual, is essential to realizing an immersive listening experience over headphones. Conventionally, HRTFs are measured using a discrete stop-and-go method for multiple loudspeaker positions, which is a tedious and time-consuming process, especially for human subjects. Recently, continuous HRTF acquisition methods have been proposed to improve acquisition efficiency. However, these methods still require constrained or limited movements of subjects and can only be used in a controlled environment. In this paper, we present a novel fast and continuous HRTF acquisition system that incorporates a head tracker to allow unconstrained head movements in azimuth and elevation. An improved adaptive filtering approach, combining the conventional progressive normalized least mean square (NLMS) algorithm with the previously proposed activation-based NLMS, is proposed to extract HRTFs on the fly from such binaural measurements with random head movements in both azimuth and elevation. Experimental results demonstrate that the proposed approach significantly enhances the performance of conventional progressive NLMS for short-duration measurements and further validates the accuracy of the proposed HRTF acquisition method.

In order to make augmented reality audio on headphones believable, individualized HRTF measurements must be created and used in the headphones' signal processing. Typically, HRTF measurements are taken in an anechoic chamber with dozens of speakers around the room, or speakers swept by mechanical arms, or with rotating stools for the subject to sit upon. Obviously, you're not going to install these systems in every Walmart across the country so millions of kids can enjoy an immersive game of Pokemon Go.

Something much simpler needs to be developed for HRTF acquisition for the masses in the future.

The present paper proposes a system where the subject has microphones inserted into their ears and a head tracker mounted atop their head. They stand in front of a large coordinate screen with a speaker mounted directly in the middle. (See image at top of page.) A test signal is played, and the subject is asked to simply move their head around in a random or possibly spiraling series of circles so that it points, in turn, at all the various positions on the screen.

While this is going on, a computer records the head-related impulse response (the HRIR is the time-domain equivalent of an HRTF) and the simultaneous head position from the tracker. Over a relatively short period of time—about 40 seconds—a full HRTF for the angular area covered can be produced.

The bulk of the paper is a discussion of the statistical methods for deriving an efficient adaptive algorithm that produces an ever more accurate estimate of the response for each angle measured as the subject spends more and more time sweeping their head over the intended range. Their results showed excellent performance, superior in some ways to other methods. The plots at the top of this section show that scanning the head progressively over the screen (top plots) is actually inferior to scanning in a random rotational pattern.
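For readers who want a feel for what an NLMS adaptive filter actually does here, below is a minimal sketch of the general technique in Python. To be clear, this is not the authors' algorithm; the tap count, step size, and demo signals are my own assumptions. The idea is that the filter continually nudges its impulse-response estimate so that the test signal convolved with the estimate matches what the ear microphones actually record.

```python
import numpy as np

def nlms_estimate(x, d, num_taps=256, mu=0.5, eps=1e-8):
    """Estimate an impulse response h such that d is approximately x convolved
    with h, using the normalized least mean square (NLMS) update."""
    h = np.zeros(num_taps)         # running HRIR estimate
    x_buf = np.zeros(num_taps)     # most recent excitation samples, newest first
    for n in range(len(x)):
        x_buf = np.roll(x_buf, 1)  # shift the delay line
        x_buf[0] = x[n]
        e = d[n] - h @ x_buf       # error between mic sample and prediction
        # Normalized step: scaling by instantaneous input power keeps the
        # update stable regardless of the excitation level.
        h += (mu / (x_buf @ x_buf + eps)) * e * x_buf
    return h

# Toy demo: recover a known decaying 256-tap filter from noise excitation.
rng = np.random.default_rng(0)
true_h = rng.standard_normal(256) * np.exp(-np.arange(256) / 40.0)
x = rng.standard_normal(48000)        # one second of noise at 48 kHz
d = np.convolve(x, true_h)[:len(x)]   # simulated ear-mic recording
h_hat = nlms_estimate(x, d)
hrtf_hat = np.fft.rfft(h_hat)         # the HRTF is the frequency-domain view of the HRIR
print("relative error:", np.linalg.norm(h_hat - true_h) / np.linalg.norm(true_h))
```

The paper's real problem is harder than this toy: the effective filter keeps changing as the head moves, which is why the authors combine the progressive and activation-based NLMS variants mentioned in the abstract with head-tracker data.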

The take-away point? Pretty much any audiologist, or even eyeglass store, would have the skills and means to put a speaker on a wall, draw a big box around it, and have you stand on an "X" on the floor as you sweep your head around for a minute. Here's a pretty good model for something that could allow everyone to get their HRTF measured for $19.99 at Walmart. Pretty cool!

But not so fast...in a casual conversation mentioning this paper, a couple of engineers at the table suggested the way HRTFs for the masses are most likely to happen is by using a smartphone to create a 3D model of your head and ears, and then having a computer create an HRTF set from personalized anthropometric models.

Wait what!? 3D models from your cellphone?

Yup, check this out: Microsoft has a tasty little chunk of software that can create fairly good 3D models from a smartphone. If this method works for creating personalized HRTFs, then you don't even have to get out of your robe and slippers to go to Glasses 'R Us. You just sit on your sofa and sweep your cellphone around your head. Here's the full research paper.

You might just shake your head at all this, but you'd only be taking the first steps along the way to getting your HRTF read. :)

The Effect of Elevation on ITD Symmetry
[Image: ITD asymmetry measurement plots]

In binaural simulations, Head-Related Impulse Responses are used to recreate a 3D auditory display through headphones. Public repositories of individually measured HRIRs are widely used in industry and research. However, head-related anthropometric asymmetries, among measured subjects, are a likely cause of measured asymmetries in Interaural Time Delay cues (ITDs), which may lead to imprecise sound localization. As part of a larger study on HRIR personalization, this paper expands, to the elevation dimension, the investigation of ITD asymmetry in public databases of measured HRIRs. In a previous exploratory study, concerning the horizontal plane only, a region of sensitivity, where the ITD asymmetry was observed to be significantly more prominent, was identified in datasets of individually measured HRIRs approximately between the azimuth range of θ = ±90° to ±130°. For this paper, two publicly available databases of individual HRIRs were selected and analyzed in search of an elevation effect on ITD symmetry. Results found that an increase or decrease in elevation angle φ, away from the horizontal plane, affects the asymmetry curve by reducing the gap between average and peak ITD asymmetry values within the mentioned region in a roughly linear trend. This finding points to the fact that, within the examined datasets, the statistical presence of ITD asymmetries is gradually less severe, although still present, as the elevation angle moves away from the horizontal plane.

Your ears are different from mine of course, but your left ear may also be different from your right ear. Engineers creating artificial HRTFs would like to think that 30 degrees off to the left is just a mirror image of 30 degrees off to the right. But previous research looking at HRTF repositories has shown that asymmetries do exist, especially in the range of 90 to 130 degrees off axis. Unfortunately, that research was only done in the horizontal plane.

The present paper looks for asymmetries caused by anthropometric differences between the left and right ears in the 90-130 degree azimuth region as elevation changes. The paper finds that asymmetries decrease going both up and down in elevation away from the horizontal plane.
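To make the measurement concrete: for a perfectly symmetric head, the ITD at azimuth -θ is the exact negative of the ITD at +θ, so the sum of the two should be zero. Here's a small sketch of that check, assuming a simple cross-correlation ITD estimate (the function names and sign convention are mine, not the paper's):

```python
import numpy as np

def itd_from_hrirs(hrir_left, hrir_right, fs):
    """Crude ITD estimate: lag of the peak of the interaural cross-correlation,
    in seconds. The sign convention depends on the database."""
    corr = np.correlate(hrir_left, hrir_right, mode="full")
    lag = np.argmax(np.abs(corr)) - (len(hrir_right) - 1)
    return lag / fs

def itd_asymmetry(itd_plus_theta, itd_minus_theta):
    """A symmetric head gives ITD(-theta) == -ITD(+theta); any deviation
    of this sum from zero quantifies left/right asymmetry."""
    return abs(itd_plus_theta + itd_minus_theta)
```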

Okay, engineers in the future may need to be particularly aware of HRTF asymmetries in the horizontal plane in the region somewhat behind each ear. Another way of looking at it: if you want people playing first-person shooter games to be able to accurately discriminate the direction of enemy footfalls approaching from their 4 or 8 o'clock, you may have to have good personalized HRTF measurements in that region. Cool.

And then the question and answer period came up. One of the audience members offered an interesting remark. He said it's become understood that the ITD asymmetry errors in the 90-130 degree region may actually occur because, from that angle, there are a number of different sound paths around the head to the far ear that become relatively equal in distance (around the back of the head; around the front; and around under the chin). He posited that the asymmetries observed in this paper may not be pointing at ear morphology being particularly different left to right at these angles, but rather that multi-path sound arrival time effects may be increasing measurement uncertainty. The fact that the asymmetry sensitivity reduces as elevation is raised or lowered may mostly point to multi-path measurement errors being reduced as one path becomes shorter and more dominant and definitive in the measurement.

In other words: there is a region in the horizontal plane, from roughly straight out from your ears and swept back about 45 degrees, that may be problematic for good localization because the far ear is so completely shadowed by the head that sound reaches it from around all sides.

Your takeaway? Five years from now when all this stuff is available and you're playing games in virtual 3D, always approach your enemy from his 4 or 8 o'clock. Looks like the data is going to be shitty there.

Interaural Distances in Existing HRIR Repositories
[Image: interaural time difference data]

With the recent development of low-cost and efficient methods for generating individualized Interaural Time Differences (ITDs), this paper investigates the distribution of interaural distances among certain populations in order to provide a framework for improving the performance of individualized binaural audio systems across a wider range of head morphologies. Interaural distances are extracted from the publicly available LISTEN and CIPIC spatial audio databases in order to generate distributions across subjects, and from the MARL-NYU database in order to investigate measurement stability across testing sessions. The interaural distance is shown to be a means to measure the magnitude of an individual's set of ITDs. Furthermore, the constraints introduced on the precision of measured ITDs by limited sampling rates across all three datasets are explored, and the authors motivate the use of higher sampling rates in the development of spatial audio databases.

One of the really nice things about this paper is I could actually understand most of it. It's written in English, not equations.

Individually measured HRTFs are best for virtual audio synthesis, but generic HRTFs are better than none at all. The effort of this paper was pointed at a particular opportunity for making a generic HRTF one step better using a very simple method. Facial recognition software can be put in a smartphone app that lets you take a selfie and derive the distance between your ears. Armed with this information, you can make a first pass at improving generic HRTFs with more accurate interaural time differences—the most important cue for lateral localization.

Well...you have to figure out whether the acoustically measured ITDs and the physically measured interaural distances actually correspond reliably and accurately, and that's what this research is about.

The present paper uses the LISTEN and CIPIC spatial audio HRTF databases, which include physically measured interaural distances for each subject along with the HRTF data, and compares the distributions of acoustically measured interaural time delays and physically measured interaural distances between the ears. The authors also compare actual ITD measurements with those derived from interaural distances using a spherical head model.

The result is satisfyingly sensible. The two databases did show a different distribution of ITDs, but when the authors looked at the actual interaural distances of subjects in the test, they found a similar skew in the average measured size. They also found that the spherical head model does a good job of tracking ITD differences with differences in interaural distance across the databases, but it overestimated by a reliably stable 3cm.
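The spherical head model mentioned here is usually computed with the Woodworth formula, which treats the head as a rigid sphere and adds the straight-line and around-the-sphere path delays. A quick sketch, assuming a far-field source in the horizontal plane (the formula is standard; the specific numbers are just a worked example):

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at roughly room temperature

def woodworth_itd(interaural_distance_m, azimuth_deg):
    """Woodworth spherical-head ITD for a far-field source in the horizontal
    plane; azimuth measured from straight ahead, valid for 0-90 degrees."""
    a = interaural_distance_m / 2.0  # sphere radius = half the ear spacing
    theta = np.radians(azimuth_deg)
    return (a / SPEED_OF_SOUND) * (theta + np.sin(theta))

# A 15 cm interaural distance with the source 90 degrees to the side
# gives roughly 560 microseconds:
print(woodworth_itd(0.15, 90) * 1e6, "us")
```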

The upshot of all this is, even though you might not be able to get personalized HRTFs in the near future, there does exist a fairly simple way of tweaking the most important variable: interaural time difference.

You open up the sound settings on your smartphone; take a selfie; and it makes a tweak to the HRTF in the immersive audio chip to improve your immersion.

Can One “Hear” The Shape Of A Person: Anthropometry Estimation Via Head-Related Transfer Functions
[Image: anthropometry estimation from HRTFs]

Individualized head-related transfer functions (HRTFs) are closely related to the anthropometry (measurements of torso, head, and pinna) of listeners. This relation not only allows individualized HRTFs to be derived from anthropometric measurements, but can also be viewed as a means to derive the anthropometry of the listener from his/her measured HRTFs (bypassing direct anthropometric measurements). In this study, we propose to estimate a person's anthropometry using the linear representation obtained from the individualized HRTF features of the person and an HRTF feature database with a number of subjects. Five different HRTF features as well as their best combination are considered in the training stage. Although our experiments showed that the performance of these methods varies in general, the best combination method yields considerable accuracy for the estimation of most anthropometric features. The proposed idea also provides further insights on the complex relation between anthropometry and HRTFs. Our experiments revealed that the anthropometric features that are not well estimated could be removed from the HRTF individualization process without causing significant performance degradation.

A moment ago, we talked about how an anthropometric measure—the distance between your ears—was directly related to and could be used to predict the HRTF. Well, the converse is also true: the measured HRTF is directly related to and can be used to predict the distance between your ears...among other things.

The present paper sets out to use HRTFs to predict various body measurements. It searched for a total of 37 anthropometric features: 17 head and torso measures, and 10 pinna measurements for each ear. To make a long story filled with a lot of statistical detail very short, it worked pretty darn well.

A number of physical body measurements were derived with fairly good accuracy. Other measures were somewhat indistinct in the results. When the researchers threw out the poorly performing measurements and reconstructed HRTFs from the anthropometric data to see if they still matched well, they found their accuracy increased by ignoring the seemingly unreliable data.
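The abstract's "linear representation" broadly suggests the following recipe: express the new subject's HRTF features as a weighted combination of the database subjects' features, then apply those same weights to the database subjects' body measurements. Here's a generic sketch under that reading; the ridge regularization and variable names are my assumptions, not the authors' exact pipeline:

```python
import numpy as np

def estimate_anthropometry(hrtf_feature, db_hrtf_features, db_anthropometry, lam=1e-3):
    """hrtf_feature     : (d,)   HRTF feature vector of the new subject
       db_hrtf_features : (n, d) one row of HRTF features per database subject
       db_anthropometry : (n, m) one row of body measurements per subject
       Returns an (m,) estimate of the new subject's measurements."""
    A = db_hrtf_features.T  # (d, n): database subjects as columns
    # Ridge-regularized least squares for the combination weights w.
    w = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ hrtf_feature)
    # Apply the same weights to the subjects' anthropometric measurements.
    return db_anthropometry.T @ w
```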

Alrighty then, these guys can look at your HRTF and tell whether you're a fat guy with a big head and small ears, or a skinny pin-head with dumbo ears. Um...wouldn't it be easier to just get a tape measure and a scale? Is this an answer waiting for a question? Just because you can, doesn't mean you should, right? After this presentation I was left with the feeling, "So what?"

After re-reading the paper, I find myself very grateful for this novel effort. I'll point out this sentence:

Moreover, this method could provide us with further insights into the complex relations between anthropometry and HRTFs (e.g., which anthropometric features are more important), which in turn facilitates HRTF individualization.

In other words, by looking at the problem from this angle, it may be possible to identify physical measurements that don't have to be made in order to synthesize an HRTF. You may have noticed that all the papers today have something to do with reducing data, computation, and model complexity. Everyone knows HRTFs have to work well in the future, but they also know the problem is so complex that we may not be able to portably carry around enough computing power to do the job. There seems to be as much or more research around simplifying the problem as there is about figuring out the problem itself.

A Fairytale Scenario Using Technology Pointed to by Papers on this Page
So, you just bought your new iPhone 13. Somewhere in the start-up sequence, the headphone set-up begins. It first asks you to put the stock earpods into your ears. The phone then asks you to take a selfie straight on. It calculates your interaural distance from the picture—made more accurate by some markings visible on the earpods. It then instructs you to make a selfie movie by moving the phone around in front of you in random large circles, first with your left hand and then with your right, to cover both sides in front of you while always looking straight forward. While you're doing this, facial recognition software is calculating the angles to your head, aided by the marks on the earpods. The phone is also putting out MLS chirps when it sees an angle of interest; mics on the outside of the earpods record the chirps and deconvolve an HRTF estimate for the angles covered. It might also ask you to hold the smartphone directly to the side and slightly back to take a few extra measurements at that pesky angle. It may then ask you to bring the phone about a foot from your ear and wave it around while always pointing at the ear. It will then create a 3D model of each ear. Then it tells you it's done.
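That deconvolution step is standard measurement practice: if the recording is (approximately) the test signal convolved with the ear's impulse response, the impulse response can be recovered by regularized division in the frequency domain. A minimal sketch (MLS measurements are traditionally processed by circular cross-correlation instead, but the version below conveys the idea; the regularization constant is an assumption):

```python
import numpy as np

def deconvolve_ir(recorded, stimulus, eps=1e-6):
    """Estimate the impulse response h from recorded ~= stimulus * h
    by Wiener-style regularized division in the frequency domain."""
    n = len(recorded) + len(stimulus) - 1
    R = np.fft.rfft(recorded, n)
    S = np.fft.rfft(stimulus, n)
    # eps keeps bins where the stimulus has little energy from blowing up
    H = R * np.conj(S) / (np.abs(S) ** 2 + eps)
    return np.fft.irfft(H, n)
```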

In seconds, all the data is sent to some computer in the cloud that knows how to take all these estimates and produce a very good estimate of your personal HRTFs, which it then pokes back into your smartphone. From then on, you'll be able to clearly hear where the Pokemons are giggling as they hide in the hedgerow.

COMMENTS
jgazal

Thank you for shedding some light on current research.

castleofargh

it's a nice confidence boost for what's to come. now we know that someone does care about getting realistic audio in headphones at some point. and apparently for the average consumer, not just for S&M people wearing VR helmets.

having fooled around with 3D scanning some years ago with professional tools, it needed a human slave to make sense of some mesh artifacts most of the time. did the software evolve so much that we can remove human post processing and do it from a cellphone? if it did, I'm all for it.

I always thought we would end up in little soundproof boxes at the end of a mall (between the bathroom for Apache helicopters and the garbage area), with mics in our ears and speakers at different points on the walls. and then to finish the job we would get a scan of the ear canal with the Lantos in-ear scanner (the stuff with the inflating balloon). and only then we would all use our cellphones to get the right profile into our "better than BT" wireless headphones that would have a gyroscope or a little dash cam that could be used to calculate the movements of the head and get us 3D audio like a boss.

Jim Tavegia

When we listen live to anything, concerts or conversations, we don't hold our heads still; at least I don't. With headphones, the position of the sound source IS in the same place all the time.

I do think that the Harman work on frequency response is most important, and I agree with Bob Katz that it is mostly in the EQ. And after reading Art Dudley's writing about the poor sound quality coming out of expensive phones for music listening these days, even with the new AQ Dragonflys, it appears that those who use their phones for a music source are missing much. Or maybe, like one writer almost said, "could be used to calculate the movements of the head and get us 3D like a" Bose 901.

Is this science really headed in the right direction? I'm not so sure. I do think that the folks at Audeze are.

LytleSound

Years ago I listened to a sound source that was set up so that, while listening through earphones, when the listener turned his or her head, the location of the sound source stayed put. So, a complete system would allow the listener to move his or her head about, and the imaging of the sound field would shift as well instead of staying fixed in front. Ideally, the listener could turn around so that the sound source would appear to come from behind.

The newer 64-bit, octa-core processors used in some new Android smartphones could support virtual sound fields built upon HRTFs obtained for the listener. All that would be needed is source material that could be presented through a set of headphones or IEMs (so we'll wait a few more years).
