FlowingData has a great tutorial on making bubble charts in R. Bubble charts are like x-y scatterplots with an additional value mapped to the size of each dot (or “bubble”).
The tutorial produces a clean and professional-looking plot. When working in R, however, there are often many ways to do a single task. My preferred tool for this task is ggplot2.
With the same dataset FlowingData used in their example, I used ggplot2 code to create the bubble chart:
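    # updated for ggplot2 0.9.1
    # (scale_area() was deprecated in later ggplot2 releases in favour of scale_size_area())
    library(ggplot2)

    # crime rates by state, tab-separated, from FlowingData
    crime <- read.csv("http://datasets.flowingdata.com/crimeRatesByState2005.tsv",
                      header = TRUE, sep = "\t")

    ggplot(crime, aes(x = murder, y = burglary, size = population, label = state), guide = FALSE) +
      geom_point(colour = "white", fill = "red", shape = 21) +   # red bubbles, white outline
      scale_area(range = c(1, 25)) +                             # bubble area reflects population
      scale_x_continuous(name = "Murders per 1,000 population", limits = c(0, 12)) +
      scale_y_continuous(name = "Burglaries per 1,000 population", limits = c(0, 1250)) +
      geom_text(size = 4) +                                      # fixed-size state labels
      theme_bw()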
One of my favorite things about ggplot2 is the flexible and consistent framework:
- scale_area() automatically scales the bubbles so that differences in the data are reflected in their areas (rather than their radii). However, I still made an arbitrary decision by specifying the range of minimum and maximum sizes.
- the aesthetics defined in the main ggplot() command are applied to all of the layers, unless overridden. (If I hadn’t specified size=4 in geom_text, the text would instead use size=population. Similarly, I set the fill and outline colours in geom_point to produce red bubbles, without affecting the text color.)
- theme_bw() replaces the default grey background with a white one, which I often prefer.
- using shape 21 (instead of the default circle) allows me to set the outline of the shape to white and fill the circle in red.
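To make the area-versus-radius point concrete, here is a minimal base-R sketch. The values are made up for illustration and are not taken from the crime dataset. It compares how much bubble area you get when the radius is proportional to the square root of the value (area scaling) versus proportional to the value itself (radius scaling).

```r
# Hypothetical values to show as bubbles (illustration only, not the crime data)
value <- c(1, 4, 9, 16)

# Area scaling: radius grows with the square root of the value,
# so the drawn area is proportional to the value
radius_area <- sqrt(value / pi)
area_from_area_scaling <- pi * radius_area^2

# Radius scaling: radius grows linearly with the value,
# so the drawn area grows with the square of the value
radius_linear <- value
area_from_radius_scaling <- pi * radius_linear^2

# Areas relative to the smallest bubble
area_ratio_area   <- area_from_area_scaling / area_from_area_scaling[1]
area_ratio_radius <- area_from_radius_scaling / area_from_radius_scaling[1]

print(area_ratio_area)    # 1 4 9 16
print(area_ratio_radius)  # 1 16 81 256
```

With radius scaling, a value 16 times larger is drawn with 256 times the area, which exaggerates the difference. That is the distortion that scaling by area avoids.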
This is what the ggplot2 code above produces:
Not a bad start, but it could benefit from carefully repositioning the labels. Also, I’m pretty sure these rates are per 100,000 (not 1,000); otherwise the average person in NC would be a victim of burglary 1.2 times per year. (The FD post has since been updated.)
I also started the y-axis at 0. I’m not sure what these relationships are supposed to mean, but I think it helps emphasize that there is roughly a 4-fold difference in state burglary rates, and that low-crime areas still have some crime.
In the comments of the FlowingData post, someone posted code that handles much of the text repositioning in R. It is always especially satisfying when there is a way to do all the manipulation through code, without directly “touching” the data.
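The sanity check on the units is simple arithmetic. Here is a small sketch, assuming a burglary rate of roughly 1,200 for NC (a round number implied by the "1.2 times per year" figure, not an exact value from the dataset):

```r
# Approximate NC burglary rate as plotted (assumed round number, not the exact data value)
nc_rate <- 1200

# If the rate were per 1,000 people: burglaries per person per year
per_person_if_per_1000 <- nc_rate / 1000

# If the rate is per 100,000 people: burglaries per person per year
per_person_if_per_100k <- nc_rate / 100000

print(per_person_if_per_1000)  # 1.2
print(per_person_if_per_100k)  # 0.012
```

A rate of about 1.2 burglaries per person per year is implausible, while roughly one burglary per 83 people per year is not, which is why per 100,000 looks like the right unit.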
Here is another example of a bubble plot from our recent paper:
Edits: Reference for the above figure: Maenner MJ, Durkin MS. Trends in the prevalence of autism on the basis of special education data. Pediatrics. 2010;126(5):e1018–e1025.
In the time since I posted this, ggplot2 has been updated, and the original code now produces an error. The code has been updated to work in ggplot2 0.9.1.
The last figure was made 3+ years ago. I’ve updated the code to produce essentially the same plot with ggplot2 0.9.3.1:
    library(ggplot2)
    library(MASS)  # provides rlm(), used by stat_smooth(method = "rlm")

    # asd_data is the district-level dataset behind the figure (not included in this post)
    ggplot(asd_data, aes(x = prev2003, y = asd_diff, weight = denom2003,
                         colour = octile, size = denom2003)) +
      geom_point(alpha = 0.8, guide = "none") +
      scale_size_area(breaks = c(250, 500, 1000, 10000, 50000),
                      name = "2002 District\nElementary School\nPopulation",
                      max_size = 20) +
      # weighted robust linear fit with a 95% confidence band
      stat_smooth(method = "rlm", size = 0.5, colour = "black", alpha = 0.4, level = 0.95) +
      scale_colour_brewer(palette = "Spectral", type = "qual",
                          name = "2002 Autism\nPrevalence Octile") +
      coord_equal(ratio = 1/2) +
      # draw legend keys fully opaque even though the points are semi-transparent
      guides(colour = guide_legend(override.aes = list(alpha = 1))) +
      ggtitle("Figure 4. Change in Autism Prevalence between 2002 and 2008 vs Baseline (2002) Prevalence,\n Wisconsin Elementary School Districts (with weighted linear best-fit line and 95% confidence band)") +
      scale_x_continuous("2002 Autism Prevalence (per 1,000)") +
      scale_y_continuous("Change in Autism Prevalence (per 1,000) between 2002 and 2008")