Appendix E — Stop & Think Answers — Chapter 8
Tip 8.1: Relying solely on p-values can lead to overinterpreting small, biologically meaningless differences as important simply because they’re statistically significant, especially with large sample sizes.
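To make the point concrete, here is a minimal sketch using simulated data. The group names, sample size, and the 0.05 standard-deviation shift are all made up for illustration and are not taken from the chapter's datasets. It compares the p-value from a two-sample t-test with Cohen's d as a simple effect size.

```python
import numpy as np
from scipy import stats

# Simulated, hypothetical data: two very large groups whose true means
# differ by only 0.05 standard deviations.
rng = np.random.default_rng(0)
n = 100_000
control = rng.normal(loc=0.00, scale=1.0, size=n)
treated = rng.normal(loc=0.05, scale=1.0, size=n)

# Two-sample t-test: with this many observations even a tiny shift is detected.
t_stat, p_value = stats.ttest_ind(treated, control)

# Cohen's d: difference in means divided by the pooled standard deviation
# (the simple average of variances is fine here because the groups are equal in size).
pooled_sd = np.sqrt((control.var(ddof=1) + treated.var(ddof=1)) / 2)
cohens_d = (treated.mean() - control.mean()) / pooled_sd

print(f"p-value   = {p_value:.2e}")
print(f"Cohen's d = {cohens_d:.3f}")
```

The p-value is far below 0.05, yet the effect size stays well under the 0.2 that is conventionally labeled "small", which is exactly the situation the answer warns about.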
Tip 8.2: You might choose a nonparametric test like Mann-Whitney when your data doesn’t follow a normal distribution or when you have small sample sizes and can’t verify distributional assumptions.
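As a rough sketch of what that looks like in practice, the snippet below runs a Mann-Whitney U test with SciPy on two small samples. The measurements are hypothetical values invented for this example, not data from the chapter.

```python
from scipy import stats

# Hypothetical small samples; with only five values per group there is
# no realistic way to check whether the data are normally distributed.
group_a = [1.1, 2.3, 0.7, 1.9, 1.4]
group_b = [3.8, 5.2, 4.1, 9.7, 6.3]

# The test works on ranks, so it does not assume normality.
result = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"U statistic = {result.statistic}")
print(f"p-value     = {result.pvalue:.4f}")
```

Because every value in group_a is smaller than every value in group_b, the U statistic for group_a is 0, the most extreme value possible for these sample sizes.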
Tip 8.3: A result can be statistically significant but have such a small effect size that it’s biologically meaningless, or a result can have a large biological effect but fail to reach statistical significance due to small sample size.
Tip 8.4: In drug studies, a medication might show a statistically significant reduction in some biomarker (p < 0.05), but the actual change might be so small (tiny effect size) that it doesn’t translate to any meaningful clinical improvement for patients.
Tip 8.5: This is definitely a matter of personal preference! But I kind of like the 2nd option. It’s a bit “fancy,” but it’s nice because it only uses data contained in the data frame itself and doesn’t require going through the keys in the REGIONS map.
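Tip 8.6: Regions whose confidence intervals don’t overlap have significantly different cancer death rates. From the plot, it appears the West has significantly lower death rates than several other regions, particularly the Southeast. Keep in mind that the converse doesn’t hold: intervals that overlap slightly can still correspond to a significant difference.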
Tip 8.7: An R-squared value close to zero indicates that the linear model explains almost none of the variation in the dependent variable, suggesting there’s no linear relationship between the predictor and response variables.
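One caveat worth seeing in code: a near-zero R-squared rules out a linear relationship, not every relationship. The sketch below uses synthetic data (an assumption for illustration only) where y depends exactly on x, but through a symmetric curve, so a straight-line fit explains essentially nothing.

```python
import numpy as np

# Synthetic data: y is fully determined by x, but the dependence is
# a parabola centered on zero, so there is no linear trend to pick up.
x = np.linspace(-3, 3, 101)
y = x ** 2

# Fit a straight line by least squares.
slope, intercept = np.polyfit(x, y, deg=1)

# R-squared = 1 - (residual sum of squares / total sum of squares).
residuals = y - (slope * x + intercept)
r_squared = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

print(f"slope = {slope:.3f}, R-squared = {r_squared:.3f}")
```

Both the slope and R-squared come out at essentially zero here, which is why plotting the data is a useful companion to reading the model summary.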
Tip 8.8: You can identify important predictors by examining their coefficients and p-values in the summary output. Predictors with p-values < 0.05 and larger absolute coefficient values (like X1 and X2 in our example) contribute more to the model. Coefficient magnitudes are only directly comparable when the predictors are measured on similar scales.
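When comparing coefficient sizes, it helps to remember that a coefficient depends on the units of its predictor. The following sketch uses simulated data and a small hypothetical helper, fit_ols, built on NumPy's least-squares solver (not the modeling function used in the chapter) to show that rescaling one predictor changes its coefficient without changing the underlying relationship.

```python
import numpy as np

# Simulated, hypothetical data: y depends on two predictors.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 * x1 + 0.5 * x2 + rng.normal(scale=0.5, size=n)


def fit_ols(predictors, response):
    """Return [intercept, coef_1, coef_2, ...] from ordinary least squares."""
    design = np.column_stack([np.ones(len(response)), predictors])
    coefs, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coefs


raw = fit_ols(np.column_stack([x1, x2]), y)

# Same data, but x1 is expressed in units 1000 times smaller
# (for example millimeters instead of meters).
rescaled = fit_ols(np.column_stack([x1 * 1000, x2]), y)

print("raw coefficients:     ", np.round(raw[1:], 4))
print("rescaled coefficients:", np.round(rescaled[1:], 4))
```

The coefficient for x1 shrinks by a factor of 1000 after rescaling while the coefficient for x2 is unchanged, so standardizing predictors first is one common way to make magnitudes comparable.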
Tip 8.9: Although we didn’t go over the biplot, it can be a useful tool for interpreting the principal components. That said, you can still examine the loadings returned by the PCA model to understand how each feature relates to PC 1 (the x-axis). Specifically, you can compute the angle each loading vector makes with the x-axis. Features with small angles (or angles close to 180°) align closely with PC 1. For example, you could use something like result.loadings.apply(lambda row: np.degrees(np.arctan2(row.iloc[1], row.iloc[0])), axis=1) to calculate these angles. In this dataset, petal length and petal width are most strongly associated with PC 1.
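The angle calculation can be sketched on a small stand-in table. The DataFrame below is hypothetical: its feature names and numbers are placeholders, not the iris loadings from the chapter, and it assumes one row per feature with the PC 1 and PC 2 loadings in columns.

```python
import numpy as np
import pandas as pd

# Hypothetical loadings table: rows are features, columns are components.
# These values are illustrative placeholders only.
loadings = pd.DataFrame(
    {"PC1": [0.9, 0.1, -0.8], "PC2": [0.1, 0.9, 0.2]},
    index=["feature_a", "feature_b", "feature_c"],
)

# Angle of each loading vector relative to the PC 1 axis, in degrees.
angles = np.degrees(np.arctan2(loadings["PC2"], loadings["PC1"]))

# Fold the angles so that 0 means "aligned with PC 1" in either direction
# and 90 means "perpendicular to PC 1".
deviation = np.minimum(angles.abs(), 180 - angles.abs())

print(deviation.sort_values())
```

Folding the angle this way treats vectors pointing along the negative x-axis the same as those pointing along the positive x-axis, which matches the remark about angles close to 180°.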
Tip 8.10: One potential explanation for why k-means performed less well for virginica is that it overlaps more with versicolor in the feature space, while setosa is more distinct. This suggests that virginica and versicolor share more similar morphological characteristics.