Appendix C — Stop & Think Answers — Chapter 7
Tip 7.1 response: Looking at a summary of your data has a ton of benefits. Summarizing your data can help you:
- Identify data import issues, like importing the wrong file
- Identify data quality issues, like missing values or outliers
- Become familiar with the data, including data types and ranges
The more familiar you are with your data, the better!
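As a minimal sketch of what such a summary might look like, the example below assumes you are working in Python with pandas. The CSV contents and column names (`title`, `year`, `checkouts`) are hypothetical and chosen only to show one missing value and one suspicious outlier.

```python
import io

import pandas as pd

# Hypothetical data: a tiny CSV with one missing value and one suspicious outlier.
csv_text = """title,year,checkouts
Book A,2019,120
Book B,2020,
Book C,2021,95
Book D,2021,98000
"""
df = pd.read_csv(io.StringIO(csv_text))

# Data types: a quick way to confirm the right file was loaded.
print(df.dtypes)

# Ranges: min, max, and quartiles for each numeric column.
summary = df.describe()
print(summary)

# Data quality: count missing values per column.
missing_counts = df.isna().sum()
print(missing_counts)
```

Here the maximum of `checkouts` is far larger than the other values, which flags a possible outlier, and the missing-value count shows one gap in the same column.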
Tip 7.3 response: Clean visualizations aren’t just important for presenting data to others; they are also important for presenting data to yourself! A messy graph may hide trends, overload the viewer, or be otherwise difficult to interpret, potentially leading to misinterpretations or missed patterns. Professional-looking visualizations can also be shared with colleagues through presentations and reports with minimal modifications, freeing up time later. Lastly, it’s never bad to get in the habit of producing clear and effective visualizations of your data!
Tip 7.6 response: Choosing a good color palette is critical for communicating information about your data. Improper color choice can easily lead to misinterpretations. In the heatmap example, we use a diverging color scheme to show the difference between positive and negative correlations, but the balance of those colors must be correct:
- If the center of the palette were at 0.2 rather than zero, it would visually suggest that values around 0.2 are “neutral” or “average,” rather than “positive”
- If the color intensity were asymmetric (brightest blue at -0.2, brightest orange at 1.0), the heatmap would be visually biased, making positive correlations appear weaker than negative ones of the same magnitude
In both cases, viewers would easily be thrown off by the color scheme, even with a legend available.
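The arithmetic behind the second case can be sketched in a few lines of plain Python. The function `color_intensity` is a hypothetical helper written for this illustration, not part of any plotting library; it assumes intensity scales linearly from zero up to the brightest color at each limit.

```python
def color_intensity(value, neg_limit, pos_limit):
    """Return (hue, intensity) for a diverging scale centered on zero.

    neg_limit is the value drawn as brightest blue (a negative number);
    pos_limit is the value drawn as brightest orange (a positive number).
    Intensity runs from 0.0 (neutral) to 1.0 (brightest).
    """
    if value < 0:
        return "blue", min(value / neg_limit, 1.0)
    return "orange", min(value / pos_limit, 1.0)


# Symmetric limits: equal magnitudes get equal intensity.
sym_neg = color_intensity(-0.2, neg_limit=-1.0, pos_limit=1.0)
sym_pos = color_intensity(0.2, neg_limit=-1.0, pos_limit=1.0)
print(sym_neg, sym_pos)    # ('blue', 0.2) ('orange', 0.2)

# Asymmetric limits: brightest blue at -0.2, brightest orange at 1.0.
asym_neg = color_intensity(-0.2, neg_limit=-0.2, pos_limit=1.0)
asym_pos = color_intensity(0.2, neg_limit=-0.2, pos_limit=1.0)
print(asym_neg, asym_pos)  # ('blue', 1.0) ('orange', 0.2)
```

With asymmetric limits, a correlation of -0.2 is drawn at full intensity while +0.2 is drawn at one fifth of it. If you plot with seaborn, its `heatmap` function accepts `vmin`, `vmax`, and `center` arguments that can be set symmetrically (for example -1, 1, and 0) to avoid this.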
Tip 7.2 response: Filtering columns by category can help manage cognitive load and distraction when working with large datasets, allowing you to focus on particular variables. It can make it easier to identify patterns, facilitate more targeted analysis, and help build intuition about how different aspects of the data relate to each other.
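One way this might look in pandas is sketched below. The dataset, the column names, and the `demo_`/`circ_` prefixes are all hypothetical; the sketch assumes your column names encode their category, which may not be true of your own data.

```python
import pandas as pd

# Hypothetical wide dataset; column names and prefixes are made up for illustration.
df = pd.DataFrame({
    "branch": ["North", "South"],
    "demo_population": [52000, 31000],
    "demo_median_age": [34.1, 41.7],
    "circ_print": [120000, 80000],
    "circ_digital": [45000, 39000],
})

# Keep the identifier plus every column in the "circ_" category.
circ_cols = [col for col in df.columns if col.startswith("circ_")]
circulation = df[["branch"] + circ_cols]
print(circulation)

# Alternatively, select columns by data type rather than by name.
numeric_only = df.select_dtypes(include="number")
print(numeric_only.columns.tolist())
```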
Tip 7.5 response: Correlation values can show potential relationships between variables and suggest further avenues for investigation. However, they have limitations: they only measure linear relationships, can be heavily influenced by outliers, and don’t indicate causation. They should be considered as starting points for deeper analysis rather than conclusive findings.
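Two of these limitations can be demonstrated with a short sketch using NumPy's `corrcoef`, which computes the Pearson correlation. The numbers are made up for illustration.

```python
import numpy as np

# A perfect but nonlinear relationship: y depends entirely on x,
# yet the linear (Pearson) correlation is essentially zero.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = x ** 2
r_nonlinear = np.corrcoef(x, y)[0, 1]
print(f"nonlinear relationship: r = {r_nonlinear:.3f}")

# Five points with no linear relationship...
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 1.0, 3.0, 1.0, 2.0])
r_without = np.corrcoef(a, b)[0, 1]
print(f"without outlier: r = {r_without:.3f}")

# ...and the same points plus a single extreme outlier at (50, 50).
r_with = np.corrcoef(np.append(a, 50.0), np.append(b, 50.0))[0, 1]
print(f"with outlier: r = {r_with:.3f}")
```

The first result is near zero even though `y` is completely determined by `x`, and the single added point pushes the second correlation from near zero to nearly one. Plotting the data alongside the correlation value is a simple way to catch both situations.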
Tip 7.4 response: While we are most used to linear scales, logarithmic scales are useful in many situations:
- When data spans several orders of magnitude, log scales let you visualize both small and large values effectively on the same plot
- When data follows exponential growth patterns, like in the early stages of an epidemic
- When looking for proportional or percentage changes rather than count changes, like in the case of this library example
In fact, visualizing data on a logarithmic scale can sometimes reveal trends that are not apparent from count data, as in this example using cancer data.
Logarithmic transformations also have some nifty statistical applications, like reducing skew in data.
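To see why log scales suit proportional changes, consider the small sketch below. The counts are hypothetical and double at every step; only Python's standard `math` module is used.

```python
import math

# Hypothetical counts that double each step and span several orders of magnitude.
counts = [10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120]

# On a linear scale, the step between neighbors keeps growing.
linear_steps = [b - a for a, b in zip(counts, counts[1:])]
print(linear_steps)

# On a log scale, every doubling is the same size step: log10(2), about 0.301.
log_counts = [math.log10(c) for c in counts]
log_steps = [b - a for a, b in zip(log_counts, log_counts[1:])]
print([round(step, 3) for step in log_steps])
```

On a linear axis the early values would be squeezed together near zero, while on a log axis the same data would fall on a straight, evenly spaced line.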
Tip 7.7 response: Heatmaps add an extra visual layer that is not present in tables (color), making strong correlations visually obvious. Clustering reveals groups of variables that behave similarly, and the dendrograms show how those variables are related to each other. Together, these components make patterns easier to spot than a table alone and add hierarchical structure that can’t be easily represented in a data table.
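The clustering step can be sketched with SciPy's hierarchical clustering tools. This is an illustration under stated assumptions rather than the method used for the chapter's figure: the correlation matrix is hypothetical, and both the distance measure (one minus the absolute correlation) and the average linkage method are choices made for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical correlation matrix: A and B move together, as do C and D.
labels = ["A", "B", "C", "D"]
corr = np.array([
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
])

# Turn correlations into distances: strongly correlated variables are "close".
distance = 1.0 - np.abs(corr)

# linkage expects a condensed distance vector; it builds the tree
# that a dendrogram would draw.
tree = linkage(squareform(distance, checks=False), method="average")

# Cut the tree into two groups and see which variables landed together.
groups = fcluster(tree, t=2, criterion="maxclust")
clusters = dict(zip(labels, groups))
print(clusters)
```

In this sketch, A and B end up in one group and C and D in the other, which is the kind of structure a clustered heatmap makes visible by reordering rows and columns.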
Tip 7.8 response: When merging datasets, it’s important to consider:
- Whether incomplete data is acceptable or problematic. If problematic, choose an inner join, which keeps only records found in both datasets.
- Whether either dataset is primary to your analysis. If not, choose an outer join. If so, choose a left or right join, depending on which dataset is primary.
Along the way, consider how representative the resulting dataset will be and whether you will be able to use it effectively for any planned analyses or visualizations.
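The join types above can be compared side by side with a short pandas sketch. The two datasets and their column names (`branch`, `visits`, `budget`) are hypothetical and overlap only partly on the key.

```python
import pandas as pd

# Hypothetical datasets that only partly overlap on the "branch" key.
visits = pd.DataFrame({"branch": ["North", "South", "East"], "visits": [500, 300, 200]})
budgets = pd.DataFrame({"branch": ["North", "South", "West"], "budget": [90, 60, 40]})

# Inner join: only branches present in both datasets, so no gaps are introduced.
inner = visits.merge(budgets, on="branch", how="inner")

# Outer join: every branch from either dataset; gaps are filled with NaN.
outer = visits.merge(budgets, on="branch", how="outer")

# Left join: every branch from the primary (left) dataset, matched where possible.
left = visits.merge(budgets, on="branch", how="left")

for name, result in [("inner", inner), ("outer", outer), ("left", left)]:
    print(name, len(result), "rows,", int(result.isna().sum().sum()), "missing values")
```

Comparing row counts and missing-value counts across the join types is a quick check on how representative the merged dataset will be.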