Expert Tests InnerFidelity's Headphone Measurement Repeatability and Reproducibility

Editor's Note: Head-Fi member Macedonian Hero may be a headphone geek by night, but he's the Director of Engineering at a MIL/AERO electronics manufacturing firm by day. He's also a "Lean Six Sigma Black Belt" (don't know what he had to do to get that, but I bet there's a lot of math), and one of his specialities is characterizing the precision and accuracy of electronic systems. He volunteered to evaluate the precision of my headphone frequency response measurements.

Did some nail biting over this one, I tell ya.

When I joined Head-Fi several years back, one of the first things that drew my attention was the wonderful headphone measurements offered by Headroom, and now most recently by InnerFidelity (both thanks to Tyll). Being an engineer, I guess I'm naturally drawn to this stuff like a bee to honey. But one question was constantly in the back of my mind: how accurate and relevant are these measurements?

In my 15+ year career in electronics manufacturing, I've measured many things, but I've also learned that just because one measures something and applies a "number" to it, doesn't mean that the story ends there. Measurement systems have inaccuracies built into them and introduce variations as well. Please note that I am defining "measurement systems" to include the person(s) taking the measurements as well.

How Operators are a Source of Variance
Every September for Bring your Kids to Work Day I run a quick little experiment with the grade 9 students. I bring along a 30cm ruler, a 1m stick and a 25 foot tape measure. I break the students into groups of 3, give each group one of these measuring devices and ask them to measure a 7 foot table. The results are the same every year: each group measures a different length for the exact same 7 foot long table, with the 25 foot tape measure being the most accurate, then the 1m stick, and the 30cm ruler the worst.

The kids with the 30cm ruler argued that they had the most difficult job because they had to constantly move the ruler across the table and use their fingers/pencils/pens to mark each position as they moved along the table length. So I had each group change measuring devices. Even when the same device was used by 3 different groups, the results still showed the same variability in the lengths for each of the three measurement devices. The kids learned from this demonstration that the person(s) measuring were also a source of variability, not just the gage. In case you're interested, this is a very simplified gage R&R (Repeatability & Reproducibility).

We've seen from above that measurement systems rely on a few key factors:

  1. The suitability of the gage to perform the measurement.
  2. The suitability of the training of the person(s) taking the measurement.
Both of the above can and do introduce variations within the measurements themselves.
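
To make the two sources of variation concrete, here is a minimal sketch of the table experiment. The readings are made up for illustration, and the split below is a deliberately simplified stand-in for a full gage R&R study: "repeatability" is taken as the average spread within each group's repeated readings, and "reproducibility" as the spread between the group averages.

```python
from statistics import mean, variance

# Hypothetical readings (inches) of the same 7 foot (84 in) table.
# Each group measures the table three times with the same gage.
readings = {
    "group_a": [84.1, 83.9, 84.0],
    "group_b": [84.6, 84.4, 84.5],
    "group_c": [83.5, 83.7, 83.6],
}

def simple_gage_rr(data):
    """Return (repeatability_var, reproducibility_var, total_var).

    Simplified split, not a full ANOVA-based gage R&R:
    - repeatability: average within-group sample variance
    - reproducibility: sample variance of the group means
    """
    repeatability = mean(variance(vals) for vals in data.values())
    reproducibility = variance([mean(vals) for vals in data.values()])
    return repeatability, reproducibility, repeatability + reproducibility

repeat_var, reprod_var, total_var = simple_gage_rr(readings)
print(f"repeatability variance:   {repeat_var:.4f}")
print(f"reproducibility variance: {reprod_var:.4f}")
print(f"share due to operators:   {reprod_var / total_var:.0%}")
```

With these invented numbers, most of the total variation comes from differences between the groups rather than from the gage itself, which is the point the students discovered.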

Basic Statistics
So I've used the words "variation" and "variability" in my introduction. But what does this mean from a statistical point of view? Variability is everywhere: in manufacturing processes, in materials used, and even in subsequent measurements of these processes, materials and products. Variance represents the entire variability of a process or product. We quantify it through the calculation of a standard deviation, which is the square root of the variance and carries the same units as the measurement itself. That is to say that the standard deviation calculated from a set of measurements is an estimate of the true variability, and as the number of measurements increases, this estimate becomes more and more accurate. The standard deviation is also referred to as a "sigma" or the following symbol: "σ".

The other terms that I'm going to use are "mean" and "average". Both are the same thing. So the average length of the table from the same gage that the grade 9 students measured was simply calculated by the formula:

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

$\bar{X}$ represents the average/mean
Xi represents the individual measurements
n represents the number of measurements taken

I won't throw any more statistical parameters/equations at you than these two. These should be sufficient for the purposes of this article.
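
For readers who would rather see the arithmetic than the symbols, here is a small sketch of both calculations. The table lengths are invented for illustration, and the standard deviation uses the common n - 1 "sample" form, which is an assumption on my part since the article does not say which form was used.

```python
import math

def mean(xs):
    """Average: the sum of the measurements divided by how many there are."""
    return sum(xs) / len(xs)

def sample_std(xs):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

# Hypothetical table lengths in inches, all taken with the same gage.
lengths_in = [83.8, 84.1, 84.0, 84.3, 83.8]

avg = mean(lengths_in)
sigma = sample_std(lengths_in)
print(f"mean = {avg:.2f} in, sigma = {sigma:.3f} in")
print(f"+/- 2 sigma range: {avg - 2 * sigma:.2f} to {avg + 2 * sigma:.2f} in")
```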

Now back to the standard deviation (remember this is our estimate of the variability). Most things in nature follow what's called a "Normal Distribution" (you might have heard the term "Gaussian Distribution" also used; both are equivalent). A quick example would be people's heights. If we were to plot Number of Persons on the Y-Axis and Height Ranges on the X-Axis, we would end up with the following type of curve:


The "0" point in this graph would represent the average height (say 5'10" for the average man). Then you'll notice a +/- 1 σ; this means +/- 1 standard deviation. The area under the curve between those limits represents the percentage of the population whose heights fall within +/- 1 standard deviation of the average. In a Gaussian distribution, this percentage is roughly 68.3%. Two standard deviations represent roughly 95%. The term Six Sigma represents +/- 6 standard deviations and corresponds to roughly 3.4 defects out of a million (a figure that includes the conventional allowance of 1.5 σ for long-term process drift).

For the purposes of this study, I will use +/- 2 "sigmas" or "standard deviations." That is to say that if we measured the same pair of headphones 100 times, roughly 95 of the measurements would fall within this range.
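
The percentages quoted above can be checked with nothing more than the error function in Python's standard library. This is only a sketch for the curious: the helper names are mine, and the 3.4-per-million figure is reproduced here using the usual Six Sigma convention of a one-sided tail at 6 - 1.5 = 4.5 σ.

```python
import math

def coverage_within(k_sigma):
    """Fraction of a normal population within +/- k_sigma of the mean."""
    return math.erf(k_sigma / math.sqrt(2))

def one_sided_tail(z):
    """Fraction of a normal population more than z sigmas above the mean."""
    return 0.5 * math.erfc(z / math.sqrt(2))

within_1 = coverage_within(1)   # about 68.3%
within_2 = coverage_within(2)   # about 95.4%
# Six Sigma convention: 6 sigma design margin minus a 1.5 sigma long-term shift.
dpmo = one_sided_tail(6 - 1.5) * 1_000_000

print(f"within +/- 1 sigma: {within_1:.1%}")
print(f"within +/- 2 sigma: {within_2:.1%}")
print(f"defects per million at Six Sigma: {dpmo:.1f}")
```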


MacedonianHero's picture

See, Tyll, from your final "Editor's Note" I've learned yet more about the intricacies of how sound travels and resonates in the human ear. It was a fun endeavor.

Hopefully others will not just use measurements, but also begin to trust their ears too. I think both are needed to truly evaluate gear.


Baka1969's picture

Great job Peter. It was fun going through it with you.

bluemonkeyflyer's picture

Well done explanation of the vagaries of headphone measurements, Macedonian Hero. Many thanks!


ultrabike's picture

EDIT: removed all my (unnecessary) comments.

Cool stuff Tyll and Peter... I can say that I did not know how sensitive headphone measurements were to positioning prior to this.


Draygonn's picture

Well written, informative, and interesting. I finally found out what Six Sigma means!

MacedonianHero's picture

Thanks for the kind feedback.

The first page says it all about where 6 sigma came from. But there are many tools in the Six Sigma toolbox beyond that and Gage R&Rs.

Glad to illuminate the community on a subject I'm quite passionate about.


Maxvla's picture

Thanks Peter and Tyll for doing this. I had always suspected treble response above 10kHz on these types of graphs could not be blindly trusted. I'd love to see the results of the smoothing you mentioned, Tyll.

firev1's picture

that Tyll's measurement technique is being checked yet again. Such a test of not only the headphones but also the measuring equipment makes for an interesting read. Cool that Macedonian Hero is a Lean Six Sigma Black Belt (for those that don't know, that means he is great at quality control/management).

Jazz Casual's picture

and read Tyll's headphone measurements with interest. : )

Frank I's picture

Nice job Peter. Very well done and a very good read. I enjoyed it thoroughly.

schalliol's picture

Amazing info, and it's great to see the collaborative nature of getting to the bottom of this.

svyr's picture

What about IEMs, not full-size or on-ear headphones?

Shahrose's picture

Enjoyed the read. Nice job.

Amclaussen's picture

Recently I bought a set of Shure SRH-940 headphones that I found quite good overall for the price. Right after trying them at home, I found they were notably more sensitive to placement than my old Sennheisers, so I had to be careful not to jump to conclusions about their sound "signature" before finding the (then) elusive "sweet spot" placement on my ears. After three months of relaxed listening, I still enjoy them a lot, but now I'm careful to check they are "correctly" positioned and adjusted (headband and earcup rotations) so they "sound" their best. This has taught me that sometimes one cannot simply reach a valid opinion on a certain model, because it happens to require more detailed or careful listening. As the first comment (5:15 pm) says: "Begin to trust your ears too". Several days later, I visited another store and carefully auditioned them with a Lehmann Audio Black Cube Linear headphone amplifier, and found enough difference and improvement to decide to spend more than twice the 940's price... and I'm not as wealthy as I wish! Maybe the differences need to be analyzed and explained too, but I trusted my ears and continue to enjoy them more with the amplifier than with no amplification.

Now, the subject of placement (or more properly: insertion or coupling) of IEMs... I also own the Shure 535, and still cannot get the same precise sound every time due to their (in my ears) large variability. I am using silicone moulds made by a local hearing specialist because, for me at least, NONE of the supplied sleeves gave me a satisfactory sound signature, degree of isolation, bass seal or the necessary comfort. I find them quite difficult to "set and forget", and much more variable than my old Shure E-1, which were very different in these respects: those old ones were so small and light that I was able to insert them far enough that the yellow foam sleeves could properly support them, get them perfectly sealed and make them comfortable enough to really forget I was wearing them. (BTW, my best fit was with the earphone body upside down, that is, the LEFT one in the RIGHT ear and vice versa, with the cord over the ear). In contrast, I find the 535s too bulky, heavy and cumbersome, to the degree that I miss the performance of the older ones, even though they had a more limited frequency response. Can somebody throw some light on this subject as applied to IEM placement?

Mkubota1's picture

...and I'm continually impressed by Tyll's efforts and transparency. Keep it up!

kongmw's picture

And hats off to Tyll. While the analysis confirms that Tyll's measurement scheme is solid at bass and midrange levels, it also reveals the uncertainty up in the treble region. It takes a man to post such an honest review of his own system's possible shortcomings.

purrin's picture

It's important not to jump to conclusions on the precision of Tyll's measurements in the treble. They may in fact be better than what is presented here.

This is related to what Tyll mentioned in his Editor's Note: certain types of measurement phenomena, the extreme peaks and dips, are artifacts of the measurement system. There are two issues here which need to be considered when interpreting the results:

1) Whether the extreme peaks and dips are erroneous data that should be discarded for purposes of determining precision. It is not an uncommon practice for pollsters (or other data gatherers) to discard what is obviously nonsensical data. In my experience with measurements, the extreme dips are always very suspect. I could go more in-depth into why this occurs, but that would be another subject.

2) Whether minor frequency shifts of peaks and dips should unnecessarily "punish" the precision of the system because the evaluation method used is one-dimensional, i.e. only changes in amplitude per specific frequency, but not frequency shifts, are taken into account.

For example, say measurement #1 has a peak of 7dB at 10kHz. Then measurement #2 has a peak of 7dB at 10.5kHz instead. The shifting of frequencies is not uncommon because of placement, or even ambient temperature/pressure, or voice coil temperature differences.

So to make a very simplified illustration: would it then be fair to say the measurement system is 5dB off at 10kHz AND 5dB off at 10.5kHz; or would it be more fair to say that the measurement system moves the peak at 10kHz by roughly 1/14 of an octave?

Just some food for thought.

As Tyll mentioned, maybe the analysis should be run on the data 1/3 or 1/6 octave smoothed to mitigate the effects of the two issues mentioned above. We would get more meaningful results. I would certainly be interested in seeing precision of the measurement system when the FR data is smoothed.

Which actually leads to a good argument that FR graphs should have at least some level of smoothing when presented for wide public consumption.
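
The frequency-shift point can be illustrated with a few lines of code. This is only a sketch under stated assumptions: the 7dB peak is modelled as a Gaussian-shaped bump on a log-frequency axis, and its width (0.05 octave) is an arbitrary value picked for illustration, not something taken from Tyll's data.

```python
import math

def octaves_between(f1_hz, f2_hz):
    """Distance between two frequencies, in octaves."""
    return abs(math.log2(f2_hz / f1_hz))

def peak_db(f_hz, center_hz, height_db=7.0, width_oct=0.05):
    """Hypothetical peak: a Gaussian bump on a log-frequency axis."""
    x = math.log2(f_hz / center_hz) / width_oct
    return height_db * math.exp(-0.5 * x * x)

# The same 7dB peak, measured once at 10kHz and once at 10.5kHz.
shift_oct = octaves_between(10_000, 10_500)
diff_at_10k = peak_db(10_000, 10_000) - peak_db(10_000, 10_500)

print(f"peak moved by {shift_oct:.3f} octave")
print(f"amplitude difference at 10kHz: {diff_at_10k:.1f} dB")
```

Under these assumptions, a peak that keeps its shape and moves by a small fraction of an octave still shows up as a difference of several dB when the two curves are compared one frequency at a time.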

Tyll Hertsens's picture

Thanks Purrin, good observations and exactly my thoughts regarding smoothing once measurements start making it out to wider audiences.

MacedonianHero's picture

Smoothing is definitely worth trying out. Looking at the raw data; particularly the 2X Standard Deviation vs. frequency response (Regions 4 and 5), you can see that it's not just a "few peaks" causing it to rise, but rather a trend that is consistent across the frequency range in the treble region.

But then again, this isn't true "raw data" either as it's already smoothed somewhat as we averaged out the 5 headphone positions for each run.

That said, I'm keen on seeing the effects of different smoothing methodologies.

purrin's picture

Don't disagree with the trend of going up the band being less precise. This behavior is obvious from the get-go to those who have some experience taking the headphone measurements. We usually see the funkiest crap past 10kHz.

However, it's still sort of misleading to say the standard deviation in the treble region is 5dB, which is one heck of a lot, basically almost meaning unreliable. From a standard deviation vs. frequency graph POV, this statement is true. However, humans don't hear this way. And even a simple glance at the FR graphs for each measurement doesn't scream "all over the place, i.e. +/- 5dB." Again, we need to account for a second axis (allowing for minor frequency shifts in peaks).

Ideally the best way to measure precision with these graphs would be a 2D vector based system that identifies similar looking curves within a close enough threshold in 2D space, and then measures the deltas (both frequency and amplitude) of matching points of those curves among the FR plots.

Short of that, I'd like to see a re-crunching of the data with 1/6 and 1/3 octave smoothing using a rectangular function to reduce the influence of artifacts and take into account the frequency shifting phenomenon.
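
As a rough idea of what rectangular fractional-octave smoothing does, here is a minimal sketch. It is not the procedure used for the published graphs: the function name and test curve are invented, and it averages the dB values directly, which is a simplification (some tools average power instead).

```python
import math

def octave_smooth(freqs_hz, levels_db, fraction=3):
    """Rectangular 1/fraction-octave smoothing.

    Each output point is the plain average of all dB values whose
    frequency lies within +/- 1/(2*fraction) octave of that point.
    """
    half_width = 1.0 / (2 * fraction)
    smoothed = []
    for fc in freqs_hz:
        window = [lvl for f, lvl in zip(freqs_hz, levels_db)
                  if abs(math.log2(f / fc)) <= half_width]
        smoothed.append(sum(window) / len(window))
    return smoothed

# Hypothetical flat response from 1kHz to 16kHz at 1/24-octave spacing,
# with a single narrow 20dB dip of the kind often treated as an artifact.
freqs = [1000 * 2 ** (i / 24) for i in range(97)]
levels = [0.0] * len(freqs)
levels[80] = -20.0

third_oct = octave_smooth(freqs, levels, fraction=3)
sixth_oct = octave_smooth(freqs, levels, fraction=6)
print(f"raw dip: {levels[80]:.1f} dB")
print(f"1/6 octave smoothed: {sixth_oct[80]:.1f} dB")
print(f"1/3 octave smoothed: {third_oct[80]:.1f} dB")
```

The wider the window, the more a narrow dip is diluted by its neighbours, which is why smoothing would be expected to reduce the influence of isolated artifacts on a per-frequency standard deviation.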

ultrabike's picture

According to the source below, the smoothing should be 0.2 octave:

You guys may (may not) find these papers interesting (I know I do):

MacedonianHero's picture

After Tyll's natural smoothing (by taking the average of 5 different dummy head positions) it's more like an average 2X standard deviation of approximately 3.4dB from 8.4kHz and up. This is not bad IMO. But some newer to the hobby may look at a difference of that size between two headphone models and conclude that it is meaningful, when statistically it's not. It's the average of the 5 different headphone positions that Tyll publishes here, not the "raw data" so to speak. Then you raise a very good question: is that what the human hears? Can they hear that?

I agree that other smoothing exercises should be looked at as a means to see how this can be reduced further in the treble region. I am also wondering what can be done physically to the setup to do this at the outset. Any ideas?

Currawong's picture

Thanks MH for the analysis. It's great to see everyone working together on getting more useful data for people to use in what is a complex subject.

Tyll: For positioning consistency, have you thought of doing something like sticking a couple of small pen lasers on the walls of the chamber pointing towards the middle of the dummy ears so that you can align either side more precisely with the head, or are there marks on it already that you can use?

Tyll Hertsens's picture

There are marks on the head already. The real problem is that from headphone to headphone you really don't know where the center should be.

ultrabike's picture

If you make a measurement of the headphone and then take the headphone off and on again, placing it as much as possible in the same spot as it was before, do you still get significant variations in the measurements? Or are the variations due to the fact that you purposely measure in the 5 different locations?

If you were to take another set of 5 measurements and go through the usual process, how much variation would you get in the final measurements for the same can?

Tyll Hertsens's picture

I'm not sure if I get your question, but it sounds like you're asking me for exactly what's in the article.

ultrabike's picture

I guess I got kind of confused when you said "The real problem is that from headphone to headphone you really don't know where the center should be."

Based on your article and your reaction, the real problem is that you just can't get a consistent measurement at high frequencies EVEN if you knew "where the center should be."

Reticuli's picture

I think it's interesting the range that Tyll's measurements are most consistent in is the range used by I also wonder how much this relates to actual headphone listening and if perhaps we are mostly affected by this same range. That would mean we are most sensitive to just minor differences within the middle frequency spectrum: response, decay & distortion, transients, etc. It could also explain the occasional inferior and superior headphone listening moments with the exact same pair of headphones, associated equipment and source material, while not as important perhaps as a glass of wine, medications (some enhancing it, others deleterious), the amount of sound exposed to in recent days, or how much sleep we got the night before, still significant nonetheless.