By Joseph L. Gastwirth, Weiwen Miao, and Efstathia Bura*

Under Rule 23(a)(2), a party seeking class certification must show there is a question of law or fact common to the class. In support of their claim of class-wide gender-based pay disparities, the plaintiffs in *Wal-Mart v. Dukes* submitted a regression analysis of salary data for *each region*. 131 S.Ct. 2541 (2011). Their regression accounted for seniority, weeks worked, job held, performance rating, the particular store where the employee worked, and gender. Females were found to be paid significantly less than males in almost all regions. The defendant’s expert analyzed the data for *each store*, separating out employees of grocery departments. The expert found that, even though females in about three-fourths of the stores were paid less than their male counterparts, the pay disparity was statistically significant in only 10% of the stores. The majority opinion (*id.* at 2555) indicates that information about disparities at the national and regional level cannot establish the existence of disparities at individual stores, as a regional disparity may be attributable to a small subset of stores. The dissent (*id.* slip op. at 6, n.5) states that plaintiffs’ regression showed there were disparities *within* stores. Unfortunately, neither opinion is statistically correct.

The *Dukes v. Wal-Mart* case appears to be a classic example of how statistics can be misused in analyzing data in class actions. In such cases, there is typically a large amount of data available. To establish a fact common to the class, the plaintiffs organize the data into one or more *large* subsets. This often enables them to show that a very small disparity provides a statistically significant result; in this case, that of pay disparity between the two sexes. On the other hand, the defendant, in a divide-and-conquer approach, breaks down the data into *small* subgroups where within each subgroup, sample sizes are small and it is very difficult, and in some instances mathematically impossible, to find a statistically significant disparity. In *Dukes v. Wal-Mart*, the plaintiffs based their conclusions on forty-one regression analyses of *regional* salary data whereas the defendant’s expert analyzed the pay data at the *store* or other subunit level and ran approximately 7500 regressions. In our paper we show that both approaches misuse statistics and can be equally misleading.

The question then becomes “what are the appropriate statistical analyses?” The answer depends on the power of a statistical test. The power of a statistical test measures its ability to detect a disparity when one exists, and it is a function of the sample sizes of the minority and majority groups and the magnitude of the disparity. The smaller any of these is, the lower the power. To study how power affected the reported results of the defendant’s regression analysis, we carried out simulations emulating the data of the case (the real data were confidential, so we used reported summaries from both parties in our simulation). According to the briefs, in about 7.5% of the 7500 regressions the defendant’s expert ran, women were statistically significantly underpaid (*p*-value < 0.05); in 2.5%, men were statistically significantly underpaid. In 53% of the regressions women were paid less than men, and in 37% men were paid less, but in neither case statistically significantly so. The defendant argued that these results are not consistent with systematic discrimination against the women. We simulated pay data so that females were consistently underpaid by about 2%, which is consistent with systematic discrimination, and obtained similar results to those reported by the defendant’s expert. Thus, we showed that the defendant’s store-by-store regression results were consistent with a general pattern of a small or modest pay disparity disadvantaging females. How can one explain this statistical result? The answer is that each of the regressions has very small power to detect a pay differential of about 2% when the sample size is 150, approximately the average number of employees per store. When a combination test was applied to the defendant’s expert’s summary of the *p*-values of the regressions collectively, we found a clearly significant disparity in pay for employees in the major Wal-Mart stores, grocery stores and Sam’s Clubs (*p*-values < .01).
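
The low-power argument can be illustrated with a small simulation. The sketch below is not the authors' simulation: it replaces each store-level regression with a simple two-sample comparison of log pay, and the group sizes (75 women, 75 men), the 0.15 standard deviation of log pay, and the number of simulated stores are hypothetical values chosen only for illustration. Only the 2% gap and the store size of about 150 come from the discussion above.

```python
import random
from statistics import NormalDist, mean, stdev

# All numeric defaults are illustrative assumptions, not figures from the case.


def simulate_store_z(n_women, n_men, gap, sd, rng):
    """z statistic for mean log pay of men minus women in one simulated store."""
    women = [rng.gauss(-gap, sd) for _ in range(n_women)]  # women underpaid by `gap`
    men = [rng.gauss(0.0, sd) for _ in range(n_men)]
    se = (stdev(women) ** 2 / n_women + stdev(men) ** 2 / n_men) ** 0.5
    return (mean(men) - mean(women)) / se


def outcome_shares(n_stores=2000, n_women=75, n_men=75, gap=0.02, sd=0.15, seed=1):
    """Share of simulated stores in each of the four outcome categories."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(0.975)  # two-sided 5% critical value
    counts = {
        "women_significantly_lower": 0,
        "women_lower_not_significant": 0,
        "men_lower_not_significant": 0,
        "men_significantly_lower": 0,
    }
    for _ in range(n_stores):
        z = simulate_store_z(n_women, n_men, gap, sd, rng)
        if z > crit:
            counts["women_significantly_lower"] += 1
        elif z > 0:
            counts["women_lower_not_significant"] += 1
        elif z >= -crit:
            counts["men_lower_not_significant"] += 1
        else:
            counts["men_significantly_lower"] += 1
    return {name: count / n_stores for name, count in counts.items()}


if __name__ == "__main__":
    for name, share in outcome_shares().items():
        print(f"{name}: {share:.3f}")
```

Under these assumed values every simulated store underpays women by the same amount, yet only a small minority of stores reach statistical significance. The exact shares depend heavily on the assumed spread of pay and should not be read as a reconstruction of the expert's figures.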

The numbers of employees in many stores or sub-units were too small to detect a meaningful pay differential and combination methods, similar to those used in the analysis of multi-center medical studies, can be applied. When we used them to re-analyze the defendant’s regressions, a statistically significant difference in pay disadvantaging females was found. It is important to emphasize that if the FDA adopted a criterion similar to the majority opinion and required a new drug to have a statistically significant effect in each center or hospital contributing patients to a larger clinical trial, many fewer beneficial new drugs would be approved.
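
To make the idea of a combination method concrete, here is a minimal sketch of one standard choice, Stouffer's inverse-normal method. The text above does not say which combination test the authors applied, so this is an assumed stand-in, and the input *p*-values in the usage example are invented. Each input is a one-sided *p*-value for the hypothesis that women are paid less in a given store, and the stores are treated as independent.

```python
from statistics import NormalDist


def stouffer_combined_p(one_sided_p_values):
    """Combine independent one-sided p-values with Stouffer's inverse-normal method.

    Each p-value must lie strictly between 0 and 1.
    """
    nd = NormalDist()
    # Convert each p-value to the z score it corresponds to.
    z_scores = [nd.inv_cdf(1.0 - p) for p in one_sided_p_values]
    # Under the null hypothesis the scaled sum is standard normal.
    z_combined = sum(z_scores) / len(z_scores) ** 0.5
    # Upper-tail probability, written as cdf(-z) for numerical accuracy.
    return nd.cdf(-z_combined)


if __name__ == "__main__":
    # Hypothetical input: 200 stores, none individually significant (p = 0.30 each).
    print(stouffer_combined_p([0.30] * 200))
```

In this invented example no single store comes close to significance, yet the combined *p*-value is far below .01, because many weak results pointing in the same direction add up to strong evidence.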

On the other hand, overly large samples come about by merging “naturally” dissimilar data. In this situation, one should identify the clusters of similar data, analyze them separately and then combine the results in a statistically meaningful manner. Because plaintiffs included a “store” variable in their equation, the pay differential attributable to each store has been removed from the estimate of the effect of gender in regional disparities. Thus, the majority’s view that the regional disparities could have been due to disparities in a few stores does not seem plausible. Furthermore, why would every region have a “few” problem stores with pay disparities large enough to create a region-wide disparity? While the plaintiffs’ analysis accounted for the effect of each store as well as seniority, etc., on regional pay disparities, this adjusted estimated *regional* pay differential is not the same as comparing employees *within* each store, as the dissent apparently thought. Indeed, one might question whether the role of seniority is the same in every store in each region. While this could be accounted for in a regression that included the appropriate “interaction” variables, plaintiffs did not submit one. Whether such a detailed analysis should be required for establishing “commonality” in a motion for class certification, however, is debatable unless the defendant submitted evidence that substantially different criteria were used in different departments, e.g., *Bennett v. Nucor Corp.*, 656 F.3d 802, 814–15 (8th Cir. 2011). Because both parties in *Wal-Mart* relied on readily available computerized records, information on other potentially important variables was not used. Some of these data, e.g., attendance records, could be obtained in further discovery and included in more detailed regression analyses.
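
The effect of including a “store” variable can be seen in a toy example. The sketch below uses invented pay figures for two hypothetical stores and contrasts a naive pooled comparison with the gender coefficient from a regression that has a dummy variable for each store. With gender as the only other regressor, that coefficient can be computed by subtracting store means from both pay and gender before regressing. This is a simplified illustration, not the plaintiffs' model, which also adjusted for seniority, job held and other factors.

```python
from collections import defaultdict
from statistics import mean


def naive_gender_gap(records):
    """Mean pay of women minus mean pay of men, ignoring stores."""
    women = [pay for _, is_female, pay in records if is_female == 1]
    men = [pay for _, is_female, pay in records if is_female == 0]
    return mean(women) - mean(men)


def store_adjusted_gender_gap(records):
    """Gender coefficient from OLS of pay on gender plus one dummy per store.

    Computed by demeaning pay and the gender indicator within each store.
    """
    by_store = defaultdict(list)
    for store, is_female, pay in records:
        by_store[store].append((is_female, pay))
    numerator = 0.0
    denominator = 0.0
    for rows in by_store.values():
        g_bar = mean(g for g, _ in rows)
        y_bar = mean(y for _, y in rows)
        for g, y in rows:
            numerator += (g - g_bar) * (y - y_bar)
            denominator += (g - g_bar) ** 2
    return numerator / denominator


# Hypothetical data: (store, is_female, hourly pay). Store B pays more overall,
# and women are concentrated in the lower-paying store A.
records = [
    ("A", 1, 9.8), ("A", 1, 9.8), ("A", 1, 9.8), ("A", 0, 10.0),
    ("B", 1, 11.8), ("B", 0, 12.0), ("B", 0, 12.0), ("B", 0, 12.0),
]

if __name__ == "__main__":
    print(naive_gender_gap(records))
    print(store_adjusted_gender_gap(records))
```

In this invented data set the naive comparison mixes the between-store pay difference into the gender gap, while the store-adjusted estimate recovers the 0.20 gap built into each store. It is still a single pooled estimate rather than a separate comparison within each store.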

In *Wal-Mart*, the plaintiffs also analyzed promotion data and showed that females were under-promoted in almost all regions for all four managerial positions. Of the four managerial positions, store manager had the *least* convincing statistical evidence of females being under-promoted, i.e., the smallest number of regions in which females were under-promoted. However, even for store manager, females were under-promoted in 34 out of 40 regions, and the probability of observing 34 or more out of 40 regions in which females are disadvantaged under a *fair* system is less than 1 in 100,000. Moreover, our re-analysis of the plaintiffs’ promotion data demonstrated that the data are consistent with an overall system in which the odds of a female being promoted were only 70% to 80% of those of a male. Unfortunately, plaintiffs relied only on the statistical significance of the difference in promotion rates and did not provide this summary odds ratio.
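
The 34-out-of-40 figure can be checked with a simple sign-test calculation. The sketch below assumes that, under a fair system, each region is independently as likely to favor women as men (probability 0.5), which may be a simplification of the authors' calculation.

```python
from math import comb


def binomial_upper_tail(successes, trials, p=0.5):
    """Probability of at least `successes` successes in `trials` independent trials."""
    return sum(
        comb(trials, k) * p**k * (1.0 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


if __name__ == "__main__":
    # Chance that 34 or more of 40 regions disadvantage women under a fair system.
    print(binomial_upper_tail(34, 40))
```

Under that assumption the tail probability is roughly 4 in a million, consistent with the “less than 1 in 100,000” figure stated above.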

Relatively few legal opinions appreciate the need to carefully assess criticisms of statistical studies. Almost no data sets are error-free and one can usually conjecture that some non-included variables may explain a disparity. Before accepting suggested explanations, such as that regional disparities could be due to a small set of stores (*id.* at 2555) or that most managers in a corporation with a policy forbidding discrimination would make sex-neutral decisions (*id.* p. 15), one should ask for data showing the stores causing each regional disparity or the mechanism used by the firm in enforcing its policy. Knowing the method Wal-Mart used to monitor its employment practices would aid in determining the appropriate units, e.g., district or region, for which the data should be analyzed.

*Professor Joseph L. Gastwirth is Professor of Statistics and Economics at George Washington University and the author of *Statistical Reasoning in Law and Public Policy*. Professor Weiwen Miao is Associate Professor of Mathematics at Haverford College and Professor Efstathia Bura is Associate Professor of Statistics at George Washington University. The authors have written many articles on the proper use and interpretation of statistical evidence in equal employment, toxic tort, and securities law. This note relies on their article, *Some Important Statistical Issues Courts Should Consider in their Assessment of Statistical Analyses Submitted in Class Certification Motions: Implications for* Dukes v. Wal-Mart, published in *Law, Probability and Risk* (Sept. 2011) 10, 225–64.