In the mid-2000s one of my favorite business books was James Surowiecki’s 2004 volume, The Wisdom of Crowds. Surowiecki argues that aggregating individual opinions into a collective judgment will often yield a better decision than one made by experts. For example, Surowiecki describes a fair in Plymouth, England, which was held in 1906. At the fair, people paid to guess what an ox would weigh after it was “slaughtered and dressed.” There were 787 responses. When these were averaged together, the result was 1,197 pounds, which was very close to the actual weight of 1,198 pounds. This is an impressive finding, and Surowiecki reviews similar evidence throughout the book.

Among the more interesting real-life examples was Surowiecki’s analysis of the Iowa Electronic Markets (IEM). The IEM is a bundle of nonprofit futures markets. Participants purchase and sell contracts. These contracts function as, for want of a better way to put it, bets on the future. (Actually, that’s a perfect way to put it.) The IEM is best known for predicting the results of political elections. To illustrate, consider a presidential campaign. You could purchase a contract that would pay you $1 if the Republicans are victorious but $0 if the Democrats capture the White House. If you believe that the Republicans are likely to win, then you would perhaps value a Republican victory contract at close to 100 cents. If you believe that the Democrats are likely to prevail, then this contract would be worth much less to you.

The share price fluctuates with the political fortunes of the two parties. When the Republicans pull ahead in the polls (or on other indicators of success), the value of their contracts rises. People are willing to pay closer to 100 cents if they believe they have a chance of earning a profit. Conversely, if the polls reverse and the Democrats gain ground, there is likely to be a “sell off,” as the investors unload their over-valued Republican shares. The supply of available Republican shares rises, but the demand drops. Hence, the price declines.
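The pricing logic can be made concrete with a minimal sketch. It assumes a risk-neutral trader and ignores fees and the time value of money, none of which are discussed above. The function names are hypothetical and are not part of the IEM itself.

```python
def fair_value_cents(win_probability: float, payout_cents: int = 100) -> float:
    """Expected payout of a winner-take-all contract for a risk-neutral trader."""
    if not 0.0 <= win_probability <= 1.0:
        raise ValueError("win_probability must be between 0 and 1")
    return win_probability * payout_cents


def implied_probability(price_cents: float, payout_cents: int = 100) -> float:
    """Read a market price back as a rough estimate of the crowd's odds."""
    return price_cents / payout_cents


# Hypothetical traders looking at the same Republican victory contract.
print(fair_value_cents(0.75))   # someone who thinks a Republican win is likely
print(fair_value_cents(0.25))   # someone who expects the Democrats to prevail
print(implied_probability(50))  # a 50-cent price reads as even odds
```

On this reading, a rising price is simply the crowd revising its estimate of the odds upward.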

There are two curious things about the IEM. For one, it captures the collective “wisdom of the crowd” through all of that buying and selling. For another, the crowd does its job well. Surowiecki tells us that the IEM may even outperform conventional political polls in predicting the winner. These are very impressive results.

For my part, I found Surowiecki’s thinking so compelling that I revised my course materials so that they made ample mention of his work. That lasted until the financial meltdown of 2007–2008, when my opinion of a crowd’s wisdom became a smidgen more skeptical. I suppose that I forgot to read the fine print, or at least I didn’t read it as closely as I should have. It’s this “fine print” that I want to talk about today.

Just after the big economic crash, I turned my attention to Charles R. Morris’ 2008 book, The Trillion Dollar Meltdown. (In 2009 Morris released an updated edition, which was titled The Two Trillion Dollar Meltdown, but I have only read the original. What’s a trillion here or there?) An excellent writer, Morris explains why the financial sector, and by extension the entire economy, crashed. The causes of the crisis were complex, but a substantial part of the problem had to do with financial instruments that were tied to real estate. Very briefly, originators made loans and sold them to large firms. These many mortgages were bundled into tranches and turned into bonds. The resulting bonds were then peddled to investors. The reasoning behind all of this was straightforward and similar to that of The Wisdom of Crowds. Essentially, the likelihood of default on any particular mortgage might be high, but aggregating a set of them together minimizes the risk. This was true, so the argument went, even for the lower tranches, which were often composed of subprime loans.

There’s a lot more to the story than this, but the important thing to keep in mind is that the math didn’t work the way we thought that it would (for a thorough but readable quantitative critique, see Nassim Nicholas Taleb’s The Black Swan). After a surge of defaults burst the housing bubble, those formerly promising bonds were transmogrified into “toxic assets.” Putting a lot of bad things together may yield a good thing, but then again, as Charles Morris and Nassim Taleb tell us, it may not.
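One way to see how the pooling argument can fail is a toy simulation. This is only a sketch under stated assumptions, not the models the banks actually used: every number is hypothetical, and the "shared shock" stands in for anything (such as a housing downturn) that makes many loans default at once. Both scenarios have the same average default rate of 11.5 percent.

```python
import random


def share_of_bad_pools(n_loans, p_default, shock_prob, p_default_in_shock,
                       loss_threshold, n_trials, seed=0):
    """Fraction of simulated pools whose default rate exceeds loss_threshold.

    With probability shock_prob, a market-wide shock raises every loan's
    default probability to p_default_in_shock at the same time.
    """
    rng = random.Random(seed)
    bad = 0
    for _ in range(n_trials):
        in_shock = rng.random() < shock_prob
        p = p_default_in_shock if in_shock else p_default
        defaults = sum(rng.random() < p for _ in range(n_loans))
        if defaults / n_loans > loss_threshold:
            bad += 1
    return bad / n_trials


# Independent loans: each defaults on its own with probability 0.115.
independent = share_of_bad_pools(500, 0.115, 0.0, 0.0, 0.20, 2000)
# Linked loans: usually 0.10, but a rare shared shock lifts all of them to 0.40.
linked = share_of_bad_pools(500, 0.10, 0.05, 0.40, 0.20, 2000)
print(independent, linked)
```

When defaults are independent, a pool that loses more than a fifth of its loans is vanishingly rare. When defaults move together, such pools turn up in roughly the share of trials in which the shock occurs, even though the average loan is no riskier.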

As The Wisdom of Crowds uses roughly the same logic, you can see why you need to read the book very cautiously, with careful attention to the particulars. It should be emphasized that James Surowiecki is not blind to these matters. In fact, he lists a number of caveats, four of which had special resonance with me: (1) judgments must be independent of one another, (2) the crowd must contain a diversity of opinions, (3) the decision process should be decentralized, allowing people to utilize their particular expertise and local knowledge, and (4) there must be a suitable means of aggregation (e.g., the sum, arithmetic mean, geometric mean, etc.). These cautionary notes make for an excellent start, though some remain skeptical (see, for example, Jaron Lanier’s book, You are Not a Gadget: A Manifesto, of which more will be said in a moment).
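The fourth caveat is easy to underestimate, because the choice of aggregation rule can change the crowd's answer. The sketch below uses made-up guesses for a jar that holds 1000 jellybeans, one of them wildly high. The median is included for comparison; it is an addition here, not one of the rules named above.

```python
import statistics

# Hypothetical guesses at a jar holding 1000 jellybeans; one is wildly high.
guesses = [850, 920, 980, 1010, 1150, 5000]

arithmetic = statistics.mean(guesses)
geometric = statistics.geometric_mean(guesses)
middle = statistics.median(guesses)

print("arithmetic mean:", round(arithmetic))
print("geometric mean:", round(geometric))
print("median:", middle)
```

With these numbers the single outlier pulls the arithmetic mean furthest from 1000, the geometric mean less so, and the median least of all.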

My concern here is a subtle one, so let’s be sure that we don’t miss it. It is not that The Wisdom of Crowds provides sound advice but for a few simple caveats. Rather, the problem is that these caveats may be alerting us to some rather problematic limitations with the method. That is, the method itself might be more narrowly applicable than one would hope, at least in many real-world situations. With this in mind, let’s see what Surowiecki’s four caveats imply about the basic process.

For one thing, each judgment can be said to have at least two components. The first part (hopefully!) is an accurate component; this is the part of the response that is essentially correct. The second part is the inaccurate component; this is the part of the response that is essentially mistaken. For example, suppose you and I are trying to guess how many jellybeans are in a jar. There are actually 1000 but you guess 1150. Your response is equal to the accurate component plus the inaccurate component. Substituting numbers, we can see that 1150 = 1000 + 150; your estimate is 150 jellybeans high. This analysis also works with negative numbers. If I guess that there are 850 jellybeans, then my response = 1000 + (-150).

There is another thing that we can learn from a close examination of Surowiecki’s caveats, and as we proceed this may also prove worrisome. Surowiecki is assuming that the accurate component is relatively stable, whereas the inaccurate component is relatively random. We know this because aggregation reduces the size of the error, while simultaneously maintaining the accurate part of our responses. If one, say, averages the responses together, then the accurate component is likely to remain, while the inaccurate component will be gradually reduced. You may guess high, while I may guess low, but once our responses are aggregated these errors wash out. We can view this as a signal detection problem. The “signal” corresponds to what I have called the accurate part of each response. The “noise” corresponds to the inaccurate part of each response. When the responses are aggregated, the noise is reduced, thereby rendering the signal more easily detectable.
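The washing-out effect can be sketched with the jellybean numbers. The first line averages the two guesses from the example. The rest is a simulation that assumes each guess equals the true count plus independent, bell-shaped noise with a standard deviation of 150. That noise model is an assumption made for illustration, and the function names are hypothetical.

```python
import random
import statistics

# Your guess of 1150 and my guess of 850 are each off by 150, in opposite directions.
two_person_average = statistics.mean([1150, 850])
print(two_person_average)


def crowd_estimate(true_count, noise_sd, crowd_size, rng):
    """Average of crowd_size guesses, each the true count plus random noise."""
    guesses = [true_count + rng.gauss(0, noise_sd) for _ in range(crowd_size)]
    return statistics.mean(guesses)


def typical_miss(true_count, noise_sd, crowd_size, n_crowds=500, seed=1):
    """Mean absolute error of the crowd estimate over many simulated crowds."""
    rng = random.Random(seed)
    misses = [abs(crowd_estimate(true_count, noise_sd, crowd_size, rng) - true_count)
              for _ in range(n_crowds)]
    return statistics.mean(misses)


# The typical miss shrinks as the crowd grows, because the noise averages out.
for size in (1, 10, 100):
    print(size, round(typical_miss(1000, 150, size), 1))
```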

From this logic, we can (sort of) re-interpret The Wisdom of Crowds in terms of something we behavioral scientists understand well: Classical Test Theory. The essential formula is X = T + e. X is the observed score, or in the present case an individual judgment. T is the true score, or what I have loosely labeled the accurate component. The term e is the error, which is assumed to be randomly distributed. When we aggregate the individual responses (e.g., X1 + X2 + … + Xn), the error terms tend to cancel one another. All is well and good, but only if we meet the assumptions of Classical Test Theory. With this understanding of the essential process, let’s revisit each of Surowiecki’s four caveats: independence, diversity, decentralization, and aggregation. When we pull back the curtain we will uncover some unsolved problems. For simplicity, I will link one aspect of Classical Test Theory to each of the four caveats.
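Before turning to the caveats, the averaging argument can be written out in symbols. This is a sketch under idealized assumptions that go beyond the text: all n judgments share the same T, and the errors are independent with mean zero and a common variance, written here as \sigma^2.

```latex
\bar{X} \;=\; \frac{1}{n}\sum_{i=1}^{n} X_i
        \;=\; T + \frac{1}{n}\sum_{i=1}^{n} e_i,
\qquad
\operatorname{Var}(\bar{X}) \;=\; \frac{\sigma^{2}}{n}.
```

The T term passes through the average untouched, while the spread of the averaged error shrinks as n grows. Everything that follows turns on what happens when those assumptions fail.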

Let’s begin with independence. The astute reader has already recognized that I have so far described Classical Test Theory somewhat casually. The true score, T, is not “the accurate component.” Rather, it is the part of the response that is stable or, more precisely, the average score you would expect to get if you administered a test an infinite number of times. Whether or not it is “true” in any other sense remains an open question. If many members of a crowd share a misjudgment, then this misjudgment can serve as a source of systematic (as opposed to random) error. Put differently, aggregation causes random error to drop out, but it does not reduce stable or so-called “systematic” error. As such, systematic error will be carried into the final decision (i.e., it will be part of the “true” score) as if it were factual information.
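A short simulation shows the difference between the two kinds of error. It is a sketch that assumes each judgment equals the truth, plus a bias shared by everyone, plus individual random noise. The bias of 200 and the other numbers are hypothetical.

```python
import random
import statistics


def average_judgment(truth, shared_bias, noise_sd, crowd_size, seed=2):
    """Mean of judgments built as truth + shared_bias + individual random error."""
    rng = random.Random(seed)
    judgments = [truth + shared_bias + rng.gauss(0, noise_sd)
                 for _ in range(crowd_size)]
    return statistics.mean(judgments)


# Same crowd size and the same random noise; only the shared bias differs.
unbiased = average_judgment(truth=1000, shared_bias=0, noise_sd=150, crowd_size=10_000)
biased = average_judgment(truth=1000, shared_bias=200, noise_sd=150, crowd_size=10_000)
print(round(unbiased), round(biased))
```

Even with ten thousand judgments, the biased crowd settles near 1200 rather than 1000. Averaging removes the noise and leaves the shared misjudgment fully intact.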

As Surowiecki reminds us, the methods advocated in The Wisdom of Crowds tend to be effective when each individual judgment is independent of others. Using the lens of Classical Test Theory, we can see why this is so — non-independence is a source of systematic error. That is, it tends to cause individual judgments to become more similar for reasons that have little to do with the nature of the problem. These correlated judgments create bias by systematically pushing responses in one direction or the other.
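The cost of non-independence can be quantified in a simple special case. The sketch below assumes every pair of errors has the same correlation, here called rho, in which case the variance of the crowd average is sd^2 * (1 + (n - 1) * rho) / n. The equal-correlation setup and the function name are assumptions made for illustration.

```python
import math


def sd_of_crowd_average(noise_sd, crowd_size, correlation):
    """Standard deviation of the mean of equally correlated errors.

    Uses Var(mean) = sd^2 * (1 + (n - 1) * rho) / n, which holds when every
    pair of errors shares the same correlation rho.
    """
    variance = noise_sd ** 2 * (1 + (crowd_size - 1) * correlation) / crowd_size
    return math.sqrt(variance)


# Rows: correlation between judgments. Columns: crowds of 1, 100, and 10,000.
for rho in (0.0, 0.1, 0.5):
    print(rho, [round(sd_of_crowd_average(150, n, rho), 1) for n in (1, 100, 10_000)])
```

With zero correlation the error of the average keeps shrinking as the crowd grows. With any positive correlation it levels off near noise_sd * sqrt(rho), so adding more people stops helping.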

Taking a cue from the social psychological literature, Surowiecki does a sound job of describing this problem in terms of social influence and group decision-making (see especially his Chapter 9). For example, in Irving Janis’ influential work on groupthink, decisions are compromised when team members place greater emphasis on group harmony and lesser emphasis on decision quality. Likewise, group polarization, the tendency of teams to make either extremely risky or extremely cautious decisions, occurs for two reasons. First, group discussion tends to produce unequal amounts of information favoring each of the alternative positions. Second, members seek to embody presumed group norms in their behavior. Notice that when groupthink or group polarization occurs, individual judgments are affected by something other than the “facts of the case.” These responses are being systematically influenced by group membership.

Surowiecki does a good job illustrating the pitfalls of social influence, and this matters because the problem is a serious one. Conformity pressure, a desire to please others, and even simple modeling can lead to poorer group decisions, at least with the sort of crowdsourcing suggested by The Wisdom of Crowds. There are many situations in which groups are employed to solve problems, such as quality teams and other workplace empowerment programs, where these pernicious social influence effects could be powerful. In such circumstances Surowiecki’s suggestions would have less applicability.

Surowiecki also reminds us to solicit opinions from a diverse group of respondents. If we accept a Classical Test Theory interpretation, this makes considerable sense. A diverse sample, especially a large one that collects independent responses, is more likely than a homogeneous sample to turn systematic error into random error. To understand why, consider a simple example. Suppose that one is trying to, say, develop a new product roll-out. Engineers, marketers, and sales personnel may each have a valid viewpoint. Even so, each group’s perspective could well be biased by its professional vantage point. Including all of these groups is likely to yield a sounder and more nuanced solution. This is all to the good, but we must now add a caveat to Surowiecki’s diversity caveat. It is randomness with respect to error that is essential here. Diversity is less useful if it does not mitigate systematic bias.

This is an important idea, since it implies that certain types of diversity will actually be harmful to the crowd’s decision. This is especially so when a part of the crowd has a consistent bias, in a particular direction, that is not empirically valid. To illustrate, suppose we were using The Wisdom of Crowds to determine how much evolution should be presented in a high school biology curriculum. Should we consult a small and homogeneous group of biologists from the local university? Or should we consult a more diverse crowd, which also includes members of conservative churches? As it happens, almost half of Americans doubt the veracity of evolution, and this skepticism is stronger among the deeply religious. As such, these diverse beliefs would likely add systematic error into judgments about evolutionary biology. That is, these religiously traditional individuals are not guessing and occasionally guessing wrong. Rather, their faith experience tilts their views on evolution in a more or less consistently negative direction. This is a type of systematic error.

Virtually all professional biologists accept the scientific consensus favoring evolution. In this example, it seems likely that the small and homogeneous crowd of experts will produce a sounder decision than would a larger and more diverse collective that includes non-experts. This is because it is insufficient for the sample to be diverse. Mistakes need to be random, so that they can be canceled by the (opposing) mistakes of other crowd members.
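The point that diversity helps only when biases offset can be put in a few lines of arithmetic. The sketch assumes that random error averages out entirely, so that only each subgroup's stable bias remains, weighted by that subgroup's share of the crowd. The shares and bias values are invented for illustration and are measured on an arbitrary scale.

```python
def expected_crowd_error(groups):
    """Expected error of a crowd average, given (share, average_bias) per group.

    Random error is assumed to average out, so only each group's stable bias
    remains, weighted by that group's share of the crowd.
    """
    total_share = sum(share for share, _ in groups)
    return sum(share * bias for share, bias in groups) / total_share


experts_only = [(1.0, 0.0)]                    # small, homogeneous, unbiased
offsetting_crowd = [(0.5, 20.0), (0.5, -20.0)]  # biases point in opposite directions
one_sided_crowd = [(0.6, 0.0), (0.4, -30.0)]    # one subgroup leans a single way

print(expected_crowd_error(experts_only))
print(expected_crowd_error(offsetting_crowd))
print(expected_crowd_error(one_sided_crowd))
```

The crowd with offsetting biases does as well as the experts. The crowd with a one-sided subgroup inherits that subgroup's bias in proportion to its share, no matter how large the crowd becomes.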

If we move away from Classical Test Theory and re-state the problem in sociological terms, we see what a simple and obvious point we are making. People often share beliefs that are not true. These beliefs could be a source of pernicious stereotyping (e.g., racism) but most of the time they are simply incorrect (e.g., low temperature rather than a virus causes colds). In either case, shared misconceptions undermine the quality of aggregated judgments because they are relatively stable across a number of individuals. For this reason, diversity will be useful but only if the individuals included have some actual knowledge to share and only if they are not consistently biased toward or against certain positions.

Surowiecki also argues that the approaches recommended in The Wisdom of Crowds will yield their best results when decision-making is decentralized. That is, each respondent should be able to apply her local knowledge and unique skills. From a Classical Test Theory perspective, this is sound advice. Notably, decentralized decision-making will help ensure independence of judgments (our first caveat), and it also does something else. As was true for diversity, decentralization brings in more perspectives, increasing the likelihood of finding the correct answer, while also turning systematic bias into random error. When decision-making is decentralized, people with different perspectives and unique pieces of the puzzle will all have the opportunity to combine their knowledge into a comprehensive solution to a problem.

There is an interesting implication in this thinking. The solutions are myriad, but the problem seems to be defined in a roughly common way across the decentralized respondents. That is, people seem to be agreeing as to what they are trying to accomplish and, by extension, what criteria constitute success. Along these lines, the scholar Jaron Lanier argues that The Wisdom of Crowds is maximally useful when the issue confronting the crowd is defined in advance and possesses a simple, often numerical, solution.

Stated differently, decentralization requires one to strike a careful balance between two extremes. On the one hand, crowd members must have enough local and/or specialized knowledge to have something unique to add to the solution. On the other hand, they need to share a common understanding of the problem they are collectively facing. If a set of crowd members views the problem in different terms (e.g., disagrees as to what criteria constitute success), then it will be difficult for the suggestions to be meaningfully combined. Success is an equilibrium. If the crowd becomes too homogeneous, then each participant has little to contribute beyond what is already added by the others. However, if the crowd becomes too heterogeneous, then its members may no longer conceptualize their task in the same terms. In essence, they will no longer be working on the same problem.

The fourth and final caveat concerns aggregation. There needs to be an appropriate way to combine the individual responses into an overall judgment. According to Classical Test Theory, this should be straightforward, assuming that the accurate part of the judgment is stable and the inaccurate part (the error) is randomly distributed. Right away, this analysis gives us some cause for worry. What if there is no accurate component contained within the individual judgments, stable or otherwise? It is possible that none of the participants possess any part of the correct answer.

If this were to occur, then there would not be a valid true score. Hence, aggregation would be unable to recover the best answer. To be sure, The Wisdom of Crowds would produce some answer, but it would probably not be an answer that we would feel comfortable trusting. Suppose we had to estimate the number of jellybeans, but we were not allowed to see the size of the jar. In this case, our guesses would likely be positive but would otherwise be close to random. There might not be an accurate component to our guesses, even though we would obtain an estimate greater than zero (e.g., the average of positive numbers, even random positive numbers, is positive). However, we would be fooling ourselves if we believed that the procedure was yielding valid information.
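The blind-jar scenario is easy to simulate. The sketch assumes, purely for illustration, that guessers who cannot see the jar pick whole numbers uniformly between 1 and 5000. The range and the function name are hypothetical.

```python
import random
import statistics


def blind_crowd_average(crowd_size=1000, low=1, high=5000, seed=3):
    """Average of guesses drawn without any knowledge of the jar.

    The true count is deliberately not a parameter: blind guessers have no
    way to use it, so it cannot influence the result.
    """
    rng = random.Random(seed)
    guesses = [rng.randint(low, high) for _ in range(crowd_size)]
    return statistics.mean(guesses)


estimate = blind_crowd_average()
# The same estimate comes back whatever the jar actually holds.
for true_count in (200, 1000, 4000):
    print(true_count, round(estimate), round(estimate - true_count))
```

The procedure returns a tidy, positive number near the middle of the guessing range, and it returns the same number whether the jar holds 200 jellybeans or 4000. The precision of the average says nothing about its validity.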

This is a matter of some seriousness when one is seeking to address a novel problem, for which an objective correct solution does not yet exist. When this occurs, the “accurate component” (if that is the best way to state the matter) may be very small. In the aforementioned book, You are Not a Gadget, Jaron Lanier argues that The Wisdom of Crowds may not provide the most effective decision techniques when the task requires innovativeness and creativity. Our present analysis suggests that Lanier has a point. If the correct answer or a part of the correct answer is not widely known, there isn’t a valid true score around which the crowd’s responses can converge. Surowiecki’s method of crowdsourcing will not work well.

The problem with Surowiecki’s The Wisdom of Crowds is not so much that it is wrong as that it has to be used with great care. It can work, but only for certain types of problems, certain types of crowds, and in certain situations. To re-state the caveats, Surowiecki suggests that the best decisions will be made when the participants provide individual judgments (independence) that are pulled from a broad sample (diversity of opinion) and that draw on idiosyncratic perspectives, special skills, and/or local knowledge (decentralization). These individual judgments are then combined in a way that reduces decision errors (aggregation). When these conditions are met, The Wisdom of Crowds promises effective solutions. However, these are tough standards, and they shouldn’t be taken lightly.