AES Headphone Technology Conference: Spatial Audio

You're no doubt aware of the latest Pokemon Go craze. Kids walk around with smartphone in hand searching for these little cartoon characters that have been virtually placed in the real physical world. It's hot, hot, HOT!!!

Well, when the kids are crazy about something, the MBAs take notice. If you can enhance Pokemon Go by making it a more "real" experience, the kids will buy your stuff. As a result, companies from Sennheiser to Microsoft to Nokia to Apple are all in a race to make headphones deliver sound that seems to come from real sources outside your head.

Researchers are looking for ways to make headphones that produce an augmented acoustic reality for the listener. There are two main aspects to accomplishing this:

  1. To develop the technology to make sound on headphones appear to come from specific directions and distances outside the head.
  2. To create headphones that, when put on your head, remain acoustically transparent to natural, real-world outside sounds. When you put on the headphones, you must be able to hear the outside world and continue to be able to localize those sounds normally.

Both requirements are extremely difficult! The following papers are all related in one way or another to this task. Remember: this is not an overview but rather a smattering of research in this field, each paper focusing on one small aspect of achieving the task at hand.

I'll be highlighting three papers here; in each case the paper's title will link you to the Audio Engineering Society's E-Library entry for that paper. For full access to the library you must be an AES member and have paid the additional fees to access the papers in the library. If you are not an AES member, papers may be downloaded one at a time for a fee. It's a fairly expensive proposition but worth it if you have a strong interest in such things.

I'll be including the abstract for each paper, and then summarizing the gist here. For most folks, downloading the original paper won't net a lot of additional information. In most cases these papers are dense with specialized technical information and complex equations—heck, I can't understand a good chunk of the detail in most papers. But!!! The conclusions often contain tantalizing tidbits of information. Hope you enjoy the nuggets!

Numerical evaluation of binaural synthesis from rigid spherical microphone array recordings
[Photo: rigid spherical microphone array (RSMA)]

Binaural systems seek to convey a high-definition listening experience by re-creating the sound pressure at both of the listener's ears. The use of a rigid spherical microphone array (RSMA) allows the capture of sound pressure fields for binaural presentation to multiple listeners. The aim of this paper is to objectively address the question of the required resolution for capturing an individual space. We numerically evaluated how binaural synthesis from RSMA recordings is affected when using different numbers of microphones. Evaluations were based on a human head model. Accurate synthesis of spectral information was possible up to a maximum frequency determined by the number of microphones. Nevertheless, we found that the overall synthesis accuracy could not be indefinitely improved by simply adding more microphones. The limit to the number of microphones beyond which the overall synthesis accuracy did not increase was higher for the interaural spectral information than for the monaural one.

In order to record the sound of an acoustic space—like a dance club or concert hall—and retain all the directional information so it can be played back in augmented reality headphones, a rigid spherical microphone array (RSMA) can be used. Many hundreds of microphones are distributed evenly over the surface of a sphere. As sound passes over the sphere, each microphone hears it at a slightly different time depending on the direction from which it comes. Using some fairly complex digital methods, the sound can not only be recorded but also encoded with directional information.
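To get a rough feel for why the microphone count matters, here's a minimal Python sketch of the usual rule of thumb relating mic count, sphere size, and usable bandwidth. This is my own back-of-the-envelope, not anything from the paper; the 10cm radius and the speed of sound are just assumed example values.

```python
import numpy as np

# Rule of thumb: N mics resolve spherical harmonics up to order L,
# where (L + 1)^2 <= N, and a rigid sphere of radius r stays accurate
# up to roughly kr = L, i.e. f_max ~= L * c / (2 * pi * r).

def max_usable_frequency(num_mics, radius_m, c=343.0):
    """Approximate spatial-aliasing limit for a spherical mic array."""
    order = int(np.sqrt(num_mics)) - 1  # highest fully resolvable order
    return order * c / (2 * np.pi * radius_m)

# Example: a 10 cm radius sphere with the mic counts discussed below
for n in (362, 642, 1002):
    print(f"{n} mics -> ~{max_usable_frequency(n, 0.10) / 1000:.1f} kHz")
```

By this crude estimate, roughly a thousand mics on a 10cm sphere gets you into the neighborhood of 16kHz, which is at least consistent with the microphone counts the paper lands on.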

In this paper, researchers modeled the sound field from RSMAs having various numbers of microphones onto a head model, essentially allowing them to turn the data from the RSMA into an HRTF set for the modeled head. They were then able to compare the modeled results with HRTF data sets from existing libraries, checking how the resolution and accuracy of the RSMA-modeled head with varying numbers of microphones stacked up against typical HRTF data.
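As a concrete (and heavily simplified) illustration of what such a comparison could look like, here's a toy error metric of my own construction, not the paper's actual measure: the per-frequency magnitude difference in dB between a synthesized head-related impulse response (HRIR) and a reference one.

```python
import numpy as np

# Toy comparison metric: magnitude error in dB per frequency bin
# between a synthesized HRIR and a reference HRIR.
# 0 dB everywhere means a perfect spectral match.

def spectral_error_db(hrir_synth, hrir_ref, n_fft=512):
    h_s = np.abs(np.fft.rfft(hrir_synth, n_fft)) + 1e-12  # avoid log(0)
    h_r = np.abs(np.fft.rfft(hrir_ref, n_fft)) + 1e-12
    return 20.0 * np.log10(h_s / h_r)
```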

Their goal was to establish the number of microphones needed in order to have enough resolution and accuracy to produce recordings that will permit out-of-head localization. Conversely, they also looked at the number of virtual loudspeakers (discrete sources of sound around the listener) that it would take to synthesize a believable experience. In other words, when sound is played back on the headphones, how many discrete virtual loudspeakers around the head does it take to fool you?
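The playback side of that question looks, in skeleton form, something like the sketch below. This is the generic virtual-loudspeaker rendering idea, not the authors' code; render_binaural and its arguments are names I made up for illustration.

```python
from scipy.signal import fftconvolve

# Generic virtual-loudspeaker rendering: each speaker feed is convolved
# with the HRIR pair for its direction, then the results are summed per ear.

def render_binaural(speaker_feeds, hrirs_left, hrirs_right):
    """speaker_feeds: equal-length mono signals, one per virtual speaker.
    hrirs_left / hrirs_right: matching equal-length HRIRs per direction."""
    left = sum(fftconvolve(s, h) for s, h in zip(speaker_feeds, hrirs_left))
    right = sum(fftconvolve(s, h) for s, h in zip(speaker_feeds, hrirs_right))
    return left, right
```

With 1002 virtual speakers, that's 2004 long convolutions running continuously in real time; keep that in mind for the rest of this article.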

The point of this entire exercise was to find the useful limits of these numbers. Too few mics or virtual speakers and the listener begins to hear localization errors and the presentation becomes unbelievable. Too many mics or virtual speakers and the computational load to virtualize the audio becomes untenable.

[Plots: Figures 3 and 4 from the paper, showing synthesis error vs. frequency for varying microphone and virtual loudspeaker counts]

In the plots above, the amount of error is on the vertical axis, and the audio frequency of the sound is on the horizontal axis. Each line plotted is for a different number of microphones (fig. 3) or virtual loudspeakers (fig. 4). The leftmost plot shows errors for monaural information (sound heard at one ear only); the center plot shows errors for interaural level differences between the two ears; and the rightmost plot shows errors for interaural group delay (a type of interaural time difference).

Figure 3 shows that errors are reduced with increasing numbers of microphones until you reach 1002 mics, but increases beyond that net no improvement in accuracy. The left plot of Figure 4 shows that increasing the number of virtual speakers beyond 362 did not improve monaural errors. The center plot shows that you need 642 virtual speakers to reach the point of diminishing returns for interaural level difference errors. And the right plot shows you need 1002 virtual loudspeakers to reach the needed error limit for interaural group delay.

Holy multi-channel audio nightmare, Batman! What's essentially brought to light here is that both recording audio for headphone virtualization and the playback schemes needed to synthesize a believable immersive presentation require on the order of 1002 discrete microphones or virtual loudspeakers. And here I thought 11.2 surround had a lot of channels!

Adaptive Equalization of Natural Augmented Reality Headset Using Non-Stationary Virtual Signals
[Diagram: adaptive equalization signal processing architecture]

A natural integration of virtual sound sources with the real environment soundscape using a natural augmented reality (NAR) headset is discussed in this paper. These NAR headsets consist of dual sensing microphones at each earcup and employ adaptive filtering techniques to achieve natural listening in augmented reality applications. We propose an adaptive equalization of the open-back NAR headsets using non-stationary virtual signals to compensate for the individualized headphone transfer function (HPTF) and acoustic coupling to seamlessly mix virtual sound with the environmental sound. Training of the NAR headsets is carried out using fast-converging normalized filtered-x least mean square algorithms to respond to changing sound variation. Significant changes in HPTF can be detected online, and fast HPTF estimation using a normalized least mean square algorithm is employed to update the secondary path estimates.

Let's talk about the elephant in the room for a moment. The ability to properly localize sounds comes from cues generated by the reflections off your ears and other anatomy. Your ears are different from mine...and different enough that if you wore my ears you'd have to retrain your brain to get your aural localization back. And so the elephant: Somehow, somewhere along the line, something has got to measure your ears...or at least measure how your ears differ from a standardized norm.

To make matters worse, every time you put on your headphones or shake your head or put on your glasses, the acoustics of the headphones are going to change...which, of course, changes the response and screws up their ability to fool you.

This paper is impressive in its detail and methods, and I couldn't possibly go into it here in any detail (not that I could), but the gist of it is interesting in the context of how much complexity engineers are being asked to deal with. Companies know there's big money involved in figuring this stuff out, and they're apparently willing to finance the solution to extremely difficult problems to make it work.

This paper talks about the electro-acoustic systems and adaptive equalization signal processing architecture needed to make a headphone that: is transparent to outside sound even as seal and fit change; is able to detect characteristics of your personal HRTF to make sound more believable; and is able to continuously monitor changes in headphone transfer function (HPTF) due to fit and modify the EQ to adapt. Basically, these are headphones that have an outside microphone to monitor the acoustic environment, an internal microphone positioned very near the entrance of your ear canal that captures what you, with your particular ears, actually hear, and a whole bunch of signal processing.

It accomplishes its goals through a two-step calibration process. The first step is to listen to white noise from external speakers while the headphones go through a basic calibration. The second step is an ongoing adaptive EQ that works by comparing virtual sounds with the response at the ear canal mic. The diagram above is basically a math map for the signal processing needed to calibrate, and then constantly update, the headphone transfer function for a particular wearer under varying fit conditions. Each block represents some sort of complex computation...that's a lot of math!
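The abstract name-checks normalized (filtered-x) least mean square adaptation, so here's a bare-bones sketch of plain NLMS to show the flavor of what's running in such a headset. This is a generic textbook version with parameter names of my own choosing; the paper's actual system is a filtered-x variant with online secondary-path estimation, which is considerably more involved.

```python
import numpy as np

# Plain normalized LMS: adapt FIR filter taps so the filtered reference
# signal tracks the desired (measured) signal, sample by sample.

def nlms(reference, desired, num_taps=256, mu=0.5, eps=1e-8):
    w = np.zeros(num_taps)           # adaptive filter coefficients
    x = np.zeros(num_taps)           # delay line of recent reference samples
    err = np.zeros(len(reference))
    for n in range(len(reference)):
        x = np.roll(x, 1)            # shift delay line, newest sample first
        x[0] = reference[n]
        y = w @ x                    # current filter output
        err[n] = desired[n] - y
        # normalizing by input power keeps the step size well behaved
        w += mu * err[n] * x / (x @ x + eps)
    return w, err
```

Roughly speaking, in a headset like this the "desired" signal is what the ear-canal mic actually measures, and the filter converges on whatever correction makes the reproduced sound match it, updating with every single sample.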

I had a conversation with another attendee from a major chip making firm. I remarked on the stunning amount of computation needed to pull off some of these augmented reality schemes. I mean, where's all that computing going to be done? In the smartphone? In the headset? A bit of both? He said that's all he thinks about all day long, every day.

It's worth noting that this was not the only paper on this subject; a similar one was presented with experimental results from a similarly designed headphone prototype. It's titled:

Adaptive Equalization of Acoustic Transparency in an Augmented-Reality Headset

Perceptual Evaluation of Synthetic Late Binaural Reverberation Based on a Parametric Model
[Photo: binaural reverberation synthesis]

Auralizing rooms with data-based dynamic binaural synthesis is an established approach in virtual acoustics. Generally, measured binaural room impulse responses (BRIRs) are used to create a virtual acoustic environment (VAE) over headphones. Depending on the application, it is desirable to reduce the amount of data by decreasing the resolution of the BRIRs. For this reason, a scalable parametric model for the synthesis of the binaural late reverberation part was developed and is presented. The model reduces the reverberation tail to three features only. Based on these features, BRIRs with synthetic reverberation are generated and compared to the corresponding measured impulse responses. The synthesis is evaluated perceptually in two listening experiments and differences between several settings of the algorithm as well as the performance for various rooms are examined. The results show only small perceptual differences between original and synthesis even with datasets heavily decreased in size.

We all know you need this personalized HRTF processing to believe a sound is coming from outside your head, but listening to sounds in an anechoic chamber sucks. So engineers need to not only model your HRTF but also add a listening environment for it to sound natural.

A listening environment—like a living room or concert hall—is measured, usually with a dummy head, to derive the binaural room impulse response (BRIR). This impulse response can then be used, with digital signal processing, to add that room's acoustics to the virtual sounds created. The problem with these BRIRs is that they may need to be quite long in order to capture the full room reverberance, and, of course, the longer the BRIR, the more computational power is needed to synthesize the artificial room with it.
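In code, the data-based approach is conceptually just a pair of convolutions, something like the sketch below (my own illustration; the function name is made up). The catch is the length: a 2-second BRIR at 48kHz is 96,000 taps per ear.

```python
from scipy.signal import fftconvolve

# Data-based auralization in a nutshell: convolve the dry (anechoic)
# source signal with the measured left/right BRIRs. Cost grows with
# BRIR length, which is why shortening the tail matters.

def auralize(dry, brir_left, brir_right):
    return fftconvolve(dry, brir_left), fftconvolve(dry, brir_right)
```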

In this paper, researchers propose methods to analyze the initial BRIR, derive an artificial reverberant tail that can be synthesized from white noise, EQ, and a modulation envelope, and then replace the initial BRIR with a simplified short one containing only a few variables for the long reverberant decay synthesis. They also test the quality of a few of these simplified synthetic BRIRs against the initial full-resolution BRIR and come to the conclusion that one particular configuration was best.
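To make the idea tangible, here's a toy version of the parametric approach, very much my own simplification and not the authors' three-feature model: a late tail built from decorrelated noise shaped by an exponential decay set by a single reverberation-time parameter.

```python
import numpy as np

# Toy parametric late reverb: noise with an exponential decay envelope.

def synth_late_tail(duration_s, rt60_s, fs=48000, seed=0):
    rng = np.random.default_rng(seed)
    n = int(duration_s * fs)
    t = np.arange(n) / fs
    envelope = 10.0 ** (-3.0 * t / rt60_s)  # down 60 dB at t = rt60
    # independent noise per ear gives the low interaural correlation
    # expected of a diffuse reverberant tail
    return (rng.standard_normal(n) * envelope,
            rng.standard_normal(n) * envelope)
```

Instead of storing tens of thousands of BRIR samples, you store a couple of numbers and regenerate the tail at playback; the paper's model adds frequency-dependent EQ and a modulation envelope on top of this basic idea.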

The take-away point for readers here is that creating augmented reality systems is so complex that a significant amount of research time and energy is spent studying ways to simplify computations. The other take-away is that if you thought MP3 compression was an abomination, you're not going to want to open the hood and take a good hard look at how augmented reality audio is going to work. Frankly, I think a good 320kbps MP3 sounds pretty darn good given how much data is being thrown out.

Next up: Head Related Transfer Function Papers.

COMMENTS
tony:

Sound recording started around the 1880s; only a hundred years earlier, lightning rods were the first consumer application for Electricity control. Geez, about a hundred years after Sound recording came Stereo for Consumers.

Just another few years and we have beautiful music reproduction for personal use and 5 Billion Cellphones in use.

Recently, the Focal engineers designed a transducer with greater resolution than our human ears.

From my 'vintage' perspective, I'd thought we were done! What's left to do? Plenty, is the answer, and the clever young engineers are discovering applications which may end up being indispensable to daily life: an appliance to let people hear like Bats!

I used to sell VPI turntables and Koetsu Phono cartridges; I never had the slightest clue the Audio Industry would've gotten this far. Perhaps we're only at the beginning of an endless series of advancements.

Winter CES in Vegas may be someplace I'll be visiting from now on!

Tony in Michigan

Tyll Hertsens:
I think the problem is especially difficult because making it work is a HUGE convergence problem. Even if/when they figure all the pieces out, they still have to have the entire industry settle into all the various standards that will be developed in order for the technology to work front to back. That's the big problem....and Apple doesn't have it. They already have a converged ecosystem of gear and software. Maybe think about the Apple conference next year. Though it may be too early there as well. More likely to see Beats headphones with Lightning cables on the next go 'round.
klausosk:

My bet is the applications will turn out a little different from today's plot. Nevertheless, exciting times ahead!

I wonder how useful audio technology can be in a helmet for a fighter jet pilot, tank driver, or soldier, to get positional information about the outside world via audio without cluttering the visual side.

I think for consumers, especially young ones, it will just isolate people even more, locked into the immersive world of a screen and audio effects.
Any thoughts on possible bad impacts?

I have an old question: where do you draw the line for how much to spend (time and money) on audio reproduction versus going to the actual concert???

tony:

I have Season tickets to our DSO; however, I hardly know anyone going to any Popular Events. ( other than Sporting Events, of course )

People ( including me ) already use IEMs regularly; I hate it when my wife makes me take them out to say something. Traveling, I see the "little white wires" everywhere, all age groups.

The LG Bluetooth necklace thing is Brilliant ( don't leave home without it).

We're now in for a wild ride, from here on out to infinity, exciting times indeed.

Money wise, a couple thousand $ US buys a Superb Dopamine Activation Music system, both portable and home based = "Eargasms" aplenty.

Live music is getting quite rare, outside of NY City.

Tony in Michigan

tony:

I can imagine folks like Phonak offering a hearing system for blind people, allowing them to have Bat-like hearing.

The Entire Audio Industry is Anarchy now, I can't see any sort of Standards in force. Everyone needs a "Moated-Monopoly" to justify investment but as the dust settles we may have an FCC decision to standardize on something useful. I may even live to see it!

Govt. mandates required GM to put a Back-up Camera & Dash Screen in our $10,000 Chevy Spark ( our cheapest car ).

US Govt. research investments gave us the Internet and plenty of other 'everyday' technologies but the Tea Party, Libertarians and Isolationist Elephants want to cut these types of things. ( I'm working to keep 'progressive' as our middle name )

It's exciting (for me) to see Industry probing Audio's 'Outer-Space', if there's something out there, they'll discover it. ( maybe even Aliens and Flying Saucers :>)

Creatures of the Earth seem to have only two ears yet insects seem to have thousand-eye Eyes.

It's one hell-of-an exciting time to be a bright young engineer.

Tony in Michigan

ps. There are Trillions of Dollars out there looking for investment opportunities; everyone is sitting on a huge pile of cash. It's a serious problem, people are looking for places to invest.

jeffporter:

I think they give a great insight into how design committees at GM must work. My guess is you were instrumental in the creation of the Pontiac Aztek. Thanks for always making me laugh with your self-important, substanceless interjections. Cheers!

Johan B:

One could make this rather complicated. Here is me thinking that the hair in my ears has an influence as well. Or for that matter the pressure behind my ear drum.

Tyll Hertsens:
We'll get to that...
Darin Fong Audio:

Not that we have all the answers, but we've been working on this stuff for a while now. Maybe what we have is getting closer and closer.
You can hear what we've done with our online demo: http://fongaudio.com/demo

-Darin