Assessing the Critics: 2013 Report on the San Francisco World Spirits Competition

How stable and accurate are the scores from critics over time? The scores, and what we do with them, are a collision of the subjective with the scientific. We put our faith in the hands of these critical “experts,” watch them somehow come up with points and medals to rank spirits of all prices and types, and then hope to some kind of Bacchus-God that they’ll be good guides for our personal palates.

Meanwhile, distillers, bottlers, and blenders all submit their life’s work to these same critical “experts” to have it assessed, assayed, and analyzed while praying to their own Bacchus-God that they’ll ultimately receive some kind of award.

We field questions (and sometimes accusations) from all quarters: regular people question the critics’ methods, distillers question the critics’ authenticity, and the press often claims the whole business is pure marketing and/or one step removed from astrology.

What do we have here at Proof66? We have data. Some of these questions are not answerable but we can at least run some analyses and reveal how exclusive the higher scores are, how consistent the scores are over time, and how consistent the critics are among themselves. The results are fascinating and they’re revealed below.

The San Francisco World Spirits Competition

This is one of the biggest and most extensive critical competitions around, and the moment each year when its hundreds of medals are dropped on the world is one of the most exciting for us. It’s like a spiritual Olympics. The expert panels award ratings ranging from “not recommended” (that is, no medal given) through bronze, silver, and gold medals, up to a double-gold medal when the panel is unanimous in its decision.

Looking back at the last four years of scores, you can see a high degree of consistency in the ratios of the different medals. The only slight variation is a gradual shift of bronze medals toward silver. (Please note: “not recommended” results are not reported, so the percentages add to 100% across the medals actually awarded.) One can confidently assume that a double-gold medal places a spirit in roughly the top 15% of all spirits (usually well over 1,000) submitted in a given year, and a gold of any kind in roughly the top third.

[Chart: Distribution of Scores by Year for San Francisco]

There is also some variation in how medals are awarded across the different spirit categories (scores averaged across all four years). Clearly, it is tough to be a flavored vodka in these competitions, which stands to reason given the bizarre flavors crowding that market today, while more hallowed categories such as whiskey in general tend to fare much better. Scotch producers in particular are highly skilled, and the critics reward that: almost two-thirds of all scotch entrants earn some kind of gold and only a tiny fraction a bronze. Here are the detailed results, sorted by the percentage awarded double gold.

Spirit Type      | Double Gold | Gold | Silver | Bronze
Scotch           | 31%         | 32%  | 32%    | 5%
American Whiskey | 19%         | 28%  | 36%    | 18%
Irish Whiskey    | 18%         | 30%  | 42%    | 9%
Brandy           | 17%         | 29%  | 41%    | 12%
Rum              | 16%         | 16%  | 43%    | 26%
Gin              | 15%         | 15%  | 44%    | 26%
Tequila          | 12%         | 17%  | 42%    | 29%
Canadian Whisky  | 10%         | 13%  | 45%    | 33%
Liqueur          |  9%         | 14%  | 39%    | 38%
Vodka            |  8%         | 19%  | 51%    | 22%
Flavored Vodka   |  7%         | 10%  | 36%    | 47%
All Spirits      | 15%         | 20%  | 41%    | 25%
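
As a quick check on the “top 15% / top third” rule of thumb mentioned earlier, the cumulative shares can be read straight off the All Spirits row. Here is a minimal sketch in Python; the figures come from the table above, and the script is nothing more than running arithmetic on them:

```python
# Medal shares from the "All Spirits" row above. "Not recommended" entries are
# not reported, so these are shares of medaled spirits only; as a fraction of
# everything submitted, each tier is at least this exclusive.
all_spirits = {"double gold": 15, "gold": 20, "silver": 41, "bronze": 25}

running_total = 0
for medal, share in all_spirits.items():
    running_total += share
    print(f"{medal:>11}: roughly the top {running_total}% of medaled entries")

# double gold -> top ~15%; gold or better -> top ~35%, about a third.
# (The last line lands at 101% only because the published shares are rounded.)
```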

 

What about volatility versus consistency? It turns out that the trend of traditional spirits faring better overall tends to persist when measuring consistency as well. We looked back at any bottle that received two scores within the four-year span (there were 549 such bottles in our data set). Irish whiskey enjoyed the highest consistency, with 48% of its repeat entries (13 bottles) receiving the same medal, while liqueurs were the least consistent, with 39% (14 bottles) showing dramatically different scores. (“Perfectly consistent” means receiving the same medal as in a previous year, while “significant inconsistency” means a two-medal swing, say a bronze becoming a gold or a silver a double gold; a sketch of this classification appears after the table.) Here are the detailed results, sorted by consistency:

Spirit Type      | Perfectly Consistent | Significant Inconsistency
Irish Whiskey    | 48% (13 bottles)     | 19% (5 bottles)
Canadian Whisky  | 40% (6 bottles)      | 20% (3 bottles)
Gin              | 40% (12 bottles)     | 7% (2 bottles)
Scotch           | 38% (28 bottles)     | 21% (15 bottles)
Flavored Vodka   | 33% (3 bottles)      | 11% (1 bottle)
Tequila          | 32% (36 bottles)     | 27% (30 bottles)
American Whiskey | 32% (17 bottles)     | 26% (14 bottles)
Vodka            | 30% (14 bottles)     | 13% (6 bottles)
Liqueur          | 28% (10 bottles)     | 39% (14 bottles)
Brandy           | 25% (9 bottles)      | 19% (7 bottles)
Rum              | 24% (10 bottles)     | 29% (12 bottles)
All Spirits      | 33% (179 bottles)    | 22% (123 bottles)
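
To make the consistency definitions concrete, here is a minimal sketch of the classification described above. The medal ordering follows the competition’s tiers; the function itself and the “minor variation” label for one-step moves are our own illustrative shorthand, not part of the competition’s process:

```python
# Medal tiers in ascending order, per the San Francisco scheme described above.
MEDALS = ["bronze", "silver", "gold", "double gold"]

def classify_repeat(first: str, second: str) -> str:
    """Illustrative helper: classify two medals earned by one bottle in different years."""
    gap = abs(MEDALS.index(first) - MEDALS.index(second))
    if gap == 0:
        return "perfectly consistent"       # same medal both years
    if gap >= 2:
        return "significant inconsistency"  # e.g., a bronze becoming a gold
    return "minor variation"                # adjacent medals, one step apart

print(classify_repeat("silver", "silver"))        # perfectly consistent
print(classify_repeat("bronze", "gold"))          # significant inconsistency
print(classify_repeat("silver", "double gold"))   # significant inconsistency
print(classify_repeat("gold", "double gold"))     # minor variation
```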

 

This means that for any spirit, you have roughly a 1-in-3 chance of getting the same score again in a subsequent year, while about 1 in 5 times you will get a greatly different score. This is not necessarily because of critical error or inconsistency: distillers themselves could be offering variations in their products. In fact, some craft distillers are proud of, and brag about, their vintages and batch-to-batch variation. It stands to reason that scores would vary even with miraculously consistent judging. Meanwhile, mass-market products may strive for consistent flavor profiles, but this is difficult to achieve; in any event, producers often tweak their formulas in response to marketplace demand, the conditions and availability of ingredients, and perhaps even prior critical assessments. One must absolutely expect some inconsistency in scores; seeing the opposite would be a little unsettling, as if someone were judging a label instead of the product. As it is, getting consistently high scores in blind judging (see, for example, Yamazaki 18yr Whisky) is a tribute both to critical consistency and to sustained distilling excellence.

On the other hand, odd flavors and unusual characteristics, even in spirits without a tradition of defined attributes, can be perfectly consistent themselves yet show differently according to different critical sensibilities. A liqueur with a strong mint or licorice character might please one body of judges and horrify another; without a historical standard, subjectivity is more likely to rule and produce different results. This is often an issue in judging gin, where strong preferences for a traditional dry style versus a modern floral one may take hold. It is also likely one reason flavored vodka, and perhaps rum, show such high volatility in their scores, given the wide variation in sweet versus dry qualities.

We would also be remiss not to point out the small sample size: multiple submissions within the four-year span were limited to 549 total bottles, and the per-category samples noted above are smaller still.

What this all means to you!

What does this mean to the consumer? We will continue to say that critical scores in general are the best available indication of quality, and certainly far superior to price (though price is not a bad proxy), fanciness of bottle, and perhaps even the recommendation of your neighbor. More to the point, seeing consistent scores over time, really of any medal type, is highly desirable and suggests a very dependable spirit. Our own rating algorithm tries to reward consistency but still heavily favors outstanding results. This means that modest scores punctuated by one outstanding result can overpower a long string of very consistent, solid scores (something we may adjust in the coming year).
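
As a toy illustration of that last point, consider a rating that leans heavily on a spirit’s best result. The point values and the 0.7/0.3 weights below are invented for this example and are not the actual Proof66 formula; they simply show how one standout medal among modest scores can outrank a longer run of steady silvers:

```python
# Hypothetical point values and weights, invented for illustration only;
# this is NOT the Proof66 rating algorithm.
POINTS = {"bronze": 1, "silver": 2, "gold": 3, "double gold": 4}

def illustrative_rating(medals):
    scores = [POINTS[m] for m in medals]
    best = max(scores)
    average = sum(scores) / len(scores)
    return 0.7 * best + 0.3 * average  # heavily favor the single best result

# One double gold among bronzes edges out four perfectly consistent silvers.
print(illustrative_rating(["bronze", "bronze", "double gold", "bronze"]))  # 3.325
print(illustrative_rating(["silver", "silver", "silver", "silver"]))       # 2.0
```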

What does this mean to the distiller or producer? If you want to pick apart scores and find fault in the results, then you have some ammunition. There are examples of critics leaning one direction one year and another direction the next. If you are making a new style of spirit or a fanciful flavor, it will be harder to gain the highest accolades in a staunch, tradition-driven industry. Nevertheless, we continue to believe that entering the competition is one of the very best uses of your marketing dollars. You can spend money on pretty or handsome models in exotic locales to sell your spirit, or even hire a rap star to speak for it, but for a more modest sum you can get public, critical acclaim. As consumers become more information-savvy, these scores will become more meaningful, and a history of solid results will most definitely yield high returns. We have seen first-hand how critical accolades can drive distributor and retailer interest as well as customer demand. And there is still time to submit for the 2014 competition!

What does this mean for Proof66? Publicizing and highlighting the results of leading critical institutions will continue to be a passion for us. Analyses like these will, we hope, help maintain the integrity and quality of events like the San Francisco World Spirits Competition so that their findings remain relevant to the industry; they should also encourage more producers to submit more frequently.

Next month: assessing the Beverage Testing Institute!

By Neal MacDonald, Editor

[Disclosures and notes: we are an independent, limited liability company with no affiliation with the San Francisco World Spirits Competition or any other critical body; our opinions are our own. All scores noted here were compiled from the results made public by the San Francisco World Spirits Competition—while we believe our data are complete and accurate, any errors or omissions are unintentional and ours alone.]


2014-02-02
Published by Proof66.com