AES Headphone Technology Conference: Spatial Audio
You're no doubt aware of the latest Pokemon Go craze. Kids walk around with smartphone in hand searching for these little cartoon characters that have been virtually placed in the real physical world. It's hot, hot, HOT!!!
Well, when the kids are crazy about something, the MBAs take notice. If you can enhance Pokemon Go by making it a more "real" experience, the kids will buy your stuff. As a result, companies from Sennheiser to Microsoft to Nokia to Apple are all in a race to make headphones produce sound that seems to come from outside your head, just like real sound.
Researchers are chasing the ability to make headphones that produce an augmented acoustic reality for the listener. There are two main aspects to accomplishing this:
- To develop the technology to make sound on headphones appear to come from specific directions and distances outside the head.
- To create headphones that, when put on your head, remain acoustically transparent to natural, real-world outside sounds. When you put on the headphones, you must be able to hear the outside world and continue to be able to localize those sounds normally.
Both requirements are extremely difficult! The following papers are all related in one way or another to this task. Remember: this is not an overview but rather a smattering of research in this field, each paper focusing on one small aspect of the task at hand.
I'll be highlighting three papers here; in each case the paper's title will link you to the Audio Engineering Society's E-Library entry for that paper. For full access to the library you must be an AES member and have paid the additional fees to access the papers in the library. If you are not an AES member, papers may be downloaded one at a time for a per-paper fee. It's a fairly expensive proposition, but worth it if you have a strong interest in such things.
I'll be including the abstract for each paper, and then summarizing the gist here. For most folks, downloading the original paper won't net a lot of additional information. In most cases these papers are dense with specialized technical information and complex equations. Heck, I can't understand a good chunk of the detail in most papers. But!!! The conclusions often contain tantalizing tid-bits of information. Hope you enjoy the nuggets!
Binaural systems seek to convey a high-definition listening experience by re-creating the sound pressure at both of the listener's ears. The use of a rigid spherical microphone array (RSMA) allows the capture of sound pressure fields for binaural presentation to multiple listeners. The aim of this paper is to objectively address the question on the required resolution for capturing an individual space. We numerically evaluated how binaural synthesis from RSMA recordings is affected when using different numbers of microphones. Evaluations were based on a human head model. Accurate synthesis of spectral information was possible up to a maximum frequency determined by the number of microphones. Nevertheless, we found that the overall synthesis accuracy could not be indefinitely improved by simply adding more microphones. The limit to the number of microphones beyond which the overall synthesis accuracy did not increase was higher for the interaural spectral information than for the monaural one.
In order to record the sound of an acoustic space (like a dance club or concert hall) and retain all the directional information so it can be played back in augmented reality headphones, a rigid spherical microphone array (RSMA) can be used. Many hundreds of microphones are distributed evenly over the surface of a sphere. As sound passes over the sphere, each microphone hears the sound at slightly different times depending on the direction from which it comes. Using very complex digital methods, sound can not only be recorded but also encoded with directional information.
In this paper, researchers modeled the sound field from RSMAs having various numbers of microphones onto a head model, essentially allowing them to turn the data from the RSMA into an HRTF set for the modeled head. They were then able to compare the modeled results with HRTF data sets from existing libraries to compare the resolution and accuracy of the RSMA modeled head with varying numbers of microphones to that of typical HRTF data.
Their goal was to establish the number of microphones needed in order to have enough resolution and accuracy to produce recordings that will permit out-of-head localization. Similarly, but conversely, they also looked at the number of virtual loudspeakers (discrete sources of sound around the listener) that it would take to synthesize a believable experience. In other words, when sound is played back on the headphones, how many discrete virtual loudspeakers around the head does it take to fool you?
The point of this entire exercise was to find the useful limits of these numbers. Too few mics or virtual speakers and the listener begins to hear localization errors, and the presentation becomes unbelievable. Too many mics or virtual speakers and the computational load to virtualize the audio becomes untenable.
In the plots above, the amount of error is on the vertical axis, and the audio frequency of the sound is on the horizontal axis. Each line plotted is for a different number of microphones (fig. 3) or virtual loudspeakers (fig. 4). The leftmost plot shows errors for monaural information (sound heard at one ear only); the center plot shows errors for interaural level differences between the two ears; and the rightmost plot shows errors for interaural group delay (a type of interaural time difference).
Figure 3 shows that errors are reduced with increasing numbers of microphones until you reach 1002 mics, but increases beyond that number net no improvement in accuracy. The left plot of Figure 4 shows that increasing the number of virtual speakers beyond 362 did not improve monaural errors. The center plot shows that you need 642 virtual speakers to reach the point of diminishing returns for interaural level difference errors. And the right plot shows you need 1002 virtual loudspeakers to reach the needed error limit for interaural group delay.
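Those particular counts aren't arbitrary: 362, 642, and 1002 are the vertex counts of subdivided icosahedral sphere grids (10n² + 2 points for n = 6, 8, 10), and each count caps the spherical-harmonic order the array can resolve. Here's a rough back-of-the-envelope sketch, assuming the common rules of thumb (L + 1)² ≤ N for the resolvable order and kr ≤ L for the spatial-aliasing limit, plus a 5 cm array radius that I picked for illustration (not a number from the paper):

```python
import math

def max_sh_order(n_mics: int) -> int:
    """Highest spherical-harmonic order L resolvable by n_mics
    evenly spaced microphones, using the rule (L + 1)^2 <= n_mics."""
    return math.isqrt(n_mics) - 1

def alias_frequency(order: int, radius_m: float, c: float = 343.0) -> float:
    """Spatial-aliasing limit from kr <= L, i.e. f = L * c / (2 * pi * r)."""
    return order * c / (2 * math.pi * radius_m)

# Mic / virtual-speaker counts from the paper's figures:
for n in (362, 642, 1002):
    L = max_sh_order(n)
    print(f"{n} mics -> order {L}, alias limit ~{alias_frequency(L, 0.05):.0f} Hz")
```

The point isn't the exact frequencies (those depend on the assumed radius) but the trend: each jump in mic count buys a higher resolvable order, and therefore a higher frequency below which the synthesis stays clean.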
Holy multi-channel audio nightmare, Batman! What is essentially brought to light here is that both the recording of audio for headphone virtualization and the playback schemes needed to synthesize a believable immersive presentation require 1002 discrete microphones or virtual loudspeakers. And here I thought 11.2 surround had a lot of channels!
A natural integration of virtual sound sources with the real environment soundscape using a natural augmented reality (NAR) headset is discussed in this paper. These NAR headsets consist of dual sensing microphones at each earcup and employ adaptive filtering technique to achieve natural listening in augmented reality applications. We propose an adaptive equalization of the open-back NAR headsets using non-stationary virtual signals to compensate for individualized headphones transfer function (HPTF) and acoustic coupling to seamlessly mix virtual sound with the environmental sound. Training of the NAR headsets are carried out using fast-converging normalized filtered-x least mean square algorithms to respond to changing sound variation. Significant changes in HPTF can be detected online and fast HPTF estimation using normalized least mean square algorithm is employed to update the secondary path estimates.
Let's talk about the elephant in the room for a moment. The ability to properly localize sounds comes from cues generated by the reflections off your ears and other anatomy. Your ears are different than mine...and different enough that if you wore my ears you'd have to retrain your brain to get your aural localization back. And so the elephant: Somehow, somewhere along the line, something has got to measure your ears...or at least measure how your ears differ from a standardized norm.
To make matters worse, every time you put on your headphones or shake your head or put on your glasses, the acoustics of the headphones are going to change...which, of course, changes the response and screws up their ability to fool you.
This paper is impressive in its detail and methods, and I couldn't possibly go into it here in any detail (not that I could), but the gist of it is interesting in the context of how much complexity engineers are being asked to deal with. Companies know there's big money involved in figuring this stuff out, and they're apparently willing to finance the solution to extremely difficult problems to make it work.
This paper talks about the electro-acoustic systems and adaptive equalization signal processing architecture needed to make a headphone that: is transparent to outside sound even as seal and fit change; is able to detect characteristics of your personal HRTF to make sound more believable; and is able to continuously monitor changes in the headphone transfer function (HPTF) due to fit and modify the EQ to adapt. Basically, these are headphones that have an outside microphone to monitor the acoustic environment, an internal microphone positioned very near the entrance of your ear canal that mimics what you and your particular ears hear, and a whole bunch of signal processing.
It accomplishes its goals through a two-step calibration process. The first step is to listen to white noise from external speakers while the headphones go through a basic calibration. The second step is an on-going adaptive EQ driven by comparing virtual sounds with what the ear canal mic actually picks up. The diagram above is basically a math map for the signal processing needed to calibrate, and then constantly update, the headphone transfer function for a particular wearer under varying fit conditions. Each block represents some sort of complex computation...that's a lot of math!
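The core adaptive trick the abstract names, (normalized) least-mean-square filtering, can be sketched in a few lines. Below, a toy 4-tap filter stands in for the real headphone transfer function, and NLMS learns it by comparing what was played against what an "ear-canal mic" measured. All names and parameter values here are my own illustration, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def nlms_identify(x, d, n_taps=8, mu=0.5, eps=1e-8):
    """Estimate an unknown FIR response (e.g. a headphone transfer
    function) with the normalized LMS algorithm.
    x: excitation signal; d: signal measured at the ear-canal mic."""
    w = np.zeros(n_taps)       # current filter estimate
    buf = np.zeros(n_taps)     # most recent input samples, newest first
    for xn, dn in zip(x, d):
        buf = np.roll(buf, 1)
        buf[0] = xn
        e = dn - w @ buf                        # estimation error
        w += mu * e * buf / (buf @ buf + eps)   # normalized update
    return w

# Toy "true" HPTF: a short decaying FIR -- purely illustrative.
h_true = np.array([1.0, 0.5, -0.3, 0.1])
x = rng.standard_normal(20000)          # white-noise training signal
d = np.convolve(x, h_true)[:len(x)]     # "mic" picks up the filtered noise
w = nlms_identify(x, d)
print(np.round(w[:4], 3))               # converges toward h_true
```

In the real headset this identification has to run continuously, because every re-seating of the earcups changes the very transfer function being estimated.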
I had a conversation with another attendee from a major chip making firm. I remarked at the stunning amount of computation needed to pull off some of these augmented reality schemes. I mean, where's all that computing going to be done? In the smartphone? In the headset? A bit of both? He said that's all he thinks about all day long, every day.
It's worth noting that this was not the only paper on this subject; a similar one was presented with experimental results from a similarly designed headphone prototype. It's titled:
Auralizing rooms with data-based dynamic binaural synthesis is an established approach in virtual acoustics. Generally measured binaural room impulse responses (BRIRs) are used to create a virtual acoustic environment (VAE) over headphones. Depending on the application, it is desirable to reduce the amount of data by decreasing the resolution of the BRIRs. For this reason a scalable parametric model for the synthesis of the binaural late reverberation part was developed and is presented. The model reduces the reverberation tail to three features only. Based on these features, BRIRs with synthetic reverberation are generated and compared to the corresponding measured impulse responses. The synthesis is evaluated perceptually in two listening experiments and differences between several settings of the algorithm as well as the performance for various rooms are examined. The results show only small perceptual differences between original and synthesis even with datasets heavily decreased in size.
We all know you need this personalized HRTF processing to believe a sound is coming from outside your head, but listening to sounds in an anechoic chamber sucks. So engineers need to not only model your HRTF, they need to add a listening environment for it to sound natural.
A listening environment (like a living room or concert hall) is measured, usually with a dummy head, to derive the binaural room impulse response (BRIR). This impulse response can then be used to synthesize that listening room with digital signal processing to add that environment to the virtual sounds created. The problem with these BRIRs is that they may need to be quite long in order to capture the full room reverberance and, of course, the longer the BRIR, the more computational power needed to synthesize the artificial room with it.
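Applying a BRIR is conceptually simple: convolve the dry (anechoic) virtual sound with one measured impulse response per ear. A minimal sketch, using made-up toy BRIRs (a direct sound plus one reflection per ear) rather than real measurements:

```python
import numpy as np

def binauralize(dry, brir_left, brir_right):
    """Render a dry mono signal into a virtual room by convolving it
    with the left- and right-ear binaural room impulse responses."""
    left = np.convolve(dry, brir_left)
    right = np.convolve(dry, brir_right)
    return np.stack([left, right])

# Toy BRIRs at 48 kHz: direct sound plus one later reflection per ear.
brir_l = np.zeros(480); brir_l[0] = 1.0;  brir_l[200] = 0.4
brir_r = np.zeros(480); brir_r[12] = 0.9; brir_r[230] = 0.35  # ~0.25 ms interaural delay
dry = np.random.default_rng(1).standard_normal(4800)
out = binauralize(dry, brir_l, brir_r)
print(out.shape)   # (2, 5279)
```

Real measured BRIRs can run to a second or more (tens of thousands of taps per ear), which is exactly the computational burden this paper is trying to shrink.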
In this paper, researchers propose methods to analyze the initial BRIR, derive an artificial reverberant tail that can be synthesized with white noise, EQ, and a modulation envelope, and then replace the initial BRIR with a simplified short one containing only a few variables for the long reverberant decay synthesis. They also test the quality of a few of these simplified synthetic BRIRs vs. the initial full-resolution BRIR and come to the conclusion that one particular configuration was best.
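The flavor of that tail replacement can be sketched as shaped noise: white noise under an exponential envelope whose slope hits -60 dB at a target RT60 (the reverberation time). The paper's actual model is more sophisticated and is evaluated against measured BRIRs; this single-parameter version is just my illustration of the idea:

```python
import numpy as np

def synth_tail(rt60_s, duration_s, fs=48000, seed=0):
    """Synthesize a late-reverberation tail as white noise shaped by an
    exponential decay envelope matched to a target RT60 (the time for
    the reverberation to fall by 60 dB)."""
    rng = np.random.default_rng(seed)
    n = int(duration_s * fs)
    t = np.arange(n) / fs
    # -60 dB at t = rt60_s  =>  amplitude envelope = 10 ** (-3 * t / rt60)
    env = 10.0 ** (-3.0 * t / rt60_s)
    return rng.standard_normal(n) * env

tail = synth_tail(rt60_s=1.2, duration_s=2.0)
# Energy in the second half is far below the first half, as a
# decaying reverb tail should be.
```

Storing one number (the RT60) instead of tens of thousands of tail samples is the whole appeal: the decay gets regenerated cheaply at playback time.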
The take-away point for readers here is that creating augmented reality systems is so complex that a significant amount of research time and energy is spent studying ways to simplify the computations. The other take-away is that if you thought MP3 compression was an abomination, you're not going to want to open the hood and take a good hard look at how augmented reality audio is going to work. Frankly, I think a good 320 kbps MP3 sounds pretty darn good given how much data is being thrown out.
Next up: Head Related Transfer Function Papers.