Variance of a sum
November 13, 2025 at 12:07 PM by Dr. Drang
Earlier this week, John D. Cook wrote a post about minimizing the variance of a sum of random variables. The sum he looked at was this:
$$Z = wX + (1 - w)Y$$

where $X$ and $Y$ are independent random variables, and $w$ is a deterministic value. The proportion of $Z$ that comes from $X$ is $w$ and the proportion that comes from $Y$ is $1 - w$. The goal is to choose $w$ to minimize the variance of $Z$. As Cook says, this is weighting the sum to minimize its variance.
The result he gets is

$$w = \frac{\sigma_Y^2}{\sigma_X^2 + \sigma_Y^2}$$

and one of the consequences of this is that if $X$ and $Y$ have equal variances, the $w$ that minimizes the variance of $Z$ is $1/2$.
You might think that if the variances are equal, it shouldn’t matter what proportions you use for the two random variables, but it does. That’s due in no small part to the independence of $X$ and $Y$, which is part of the problem’s setup.
A natural question to ask, then, is what happens if $X$ and $Y$ aren’t independent. That’s what we’ll look into here.
First, a little review. The variance of a random variable, $X$, is defined as

$$\operatorname{Var}(X) = \int_{-\infty}^{\infty} (x - \mu_X)^2\, f_X(x)\, dx$$

where $\mu_X$ is the mean value of $X$ and $f_X(x)$ is its probability density function (PDF). The most familiar PDF is the bell-shaped curve of the normal distribution.

The mean value is defined like this:

$$\mu_X = \int_{-\infty}^{\infty} x\, f_X(x)\, dx$$

People often like to work with the standard deviation, $\sigma_X$, instead of the variance. The relationship is

$$\sigma_X = \sqrt{\operatorname{Var}(X)}$$
Now let’s consider two random variables, $X$ and $Y$. They have a joint PDF, $f_{XY}(x, y)$. The covariance of the two is defined like this:

$$\operatorname{Cov}(X, Y) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y)\, f_{XY}(x, y)\, dx\, dy$$

It’s common to express the covariance in terms of the standard deviations and the correlation coefficient, $\rho$:

$$\operatorname{Cov}(X, Y) = \rho\, \sigma_X \sigma_Y$$
If we were going to deal with more random variables, I’d explicitly include the variables as subscripts to $\rho$, but there’s no need to in the two-variable situation.
The correlation coefficient is a pure number and is always in this range:

$$-1 \le \rho \le 1$$
A positive value of $\rho$ means that the two variables tend to be above or below their respective mean values at the same time. A negative value of $\rho$ means that when one variable is above its mean, the other tends to be below its mean, and vice versa.
If $X$ and $Y$ are independent, their joint PDF can be expressed as the product of two individual PDFs:

$$f_{XY}(x, y) = f_X(x)\, f_Y(y)$$

which means

$$\operatorname{Cov}(X, Y) = 0$$

because of the definition of the mean given above. Cook took advantage of this in his analysis to simplify his equations. We won’t be doing that.
Going back to our definition of $Z$,

$$Z = wX + (1 - w)Y$$

the variance of $Z$ is

$$\operatorname{Var}(Z) = w^2 \sigma_X^2 + (1 - w)^2 \sigma_Y^2 + 2w(1 - w)\,\rho\,\sigma_X \sigma_Y$$

To get the value of $w$ that minimizes the variance, we take the derivative with respect to $w$ and set that equal to zero. This leads to

$$w = \frac{\sigma_Y^2 - \rho\,\sigma_X \sigma_Y}{\sigma_X^2 + \sigma_Y^2 - 2\rho\,\sigma_X \sigma_Y}$$
This reduces to Cook’s equation when $\rho = 0$, which is what we’d expect.
At this value of $w$, the variance of the sum is

$$\operatorname{Var}(Z) = \frac{\sigma_X^2\, \sigma_Y^2\, (1 - \rho^2)}{\sigma_X^2 + \sigma_Y^2 - 2\rho\,\sigma_X \sigma_Y}$$
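If you’d rather let a computer grind through that algebra, here’s a quick SymPy sketch (not from Cook’s post or the derivation above, just a check of it) that reproduces both the minimizing $w$ and the minimum variance:

python:
import sympy as sp

# Symbols: weight w, correlation rho (may be negative), positive standard deviations.
w, rho = sp.symbols('w rho', real=True)
sx, sy = sp.symbols('sigma_X sigma_Y', positive=True)

# Variance of Z = w*X + (1 - w)*Y, including the covariance term.
var_Z = w**2*sx**2 + (1 - w)**2*sy**2 + 2*w*(1 - w)*rho*sx*sy

# Set dVar/dw = 0 and solve for w; should match the expression above.
w_opt = sp.solve(sp.diff(var_Z, w), w)[0]
print(sp.simplify(w_opt))

# Substitute back in to get the minimum variance.
print(sp.simplify(var_Z.subs(w, w_opt)))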
Considering now the situation where $\sigma_X = \sigma_Y = \sigma$, the value of $w$ that minimizes the variance is

$$w = \frac{\sigma^2 - \rho\,\sigma^2}{2\sigma^2 - 2\rho\,\sigma^2} = \frac{1}{2}$$

which is the same result as before. In other words, when the variances of $X$ and $Y$ are equal, the variance of their sum is minimized by having equal amounts of both, regardless of their correlation. I don’t know about you, but I wasn’t expecting that.
Just because the minimizing value of $w$ doesn’t depend on the correlation coefficient, that doesn’t mean the variance itself doesn’t. The minimum variance of $Z$ when $\sigma_X = \sigma_Y = \sigma$ is

$$\operatorname{Var}(Z) = \frac{1 + \rho}{2}\,\sigma^2$$
A pretty simple result and one that I did expect. When $X$ and $Y$ are positively correlated, their extremes tend to reinforce each other and the variance of $Z$ goes up. When $X$ and $Y$ are negatively correlated, their extremes tend to balance out, and $Z$ stays closer to its mean value.
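And here’s a quick Monte Carlo check of that last formula (my own sketch, with arbitrarily chosen numbers), using correlated normal variables with equal standard deviations:

python:
import numpy as np

rng = np.random.default_rng(1)
sigma, rho, n = 2.0, -0.6, 1_000_000

# Draw correlated normals with equal variances.
cov = sigma**2 * np.array([[1, rho], [rho, 1]])
X, Y = rng.multivariate_normal([0, 0], cov, size=n).T

# Equal weights minimize the variance of the sum.
Z = 0.5*X + 0.5*Y
print(Z.var())                    # simulated variance of Z
print((1 + rho)/2 * sigma**2)     # theoretical value: 0.8 for these numbers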
Snow and memory
November 9, 2025 at 5:01 PM by Dr. Drang
Two things came together over the past few days. First was my longstanding belief that winters here in the Chicago area were snowier when I was a kid than they are now. I’m not sure why that popped back into my head, but it did. Second was my concern that I’ve been losing my Pandas skills since I retired and stopped having data analysis work projects. With both of these things floating around in my head at the same time, I decided to put together a small project to address them: an analysis of Chicago snowfall data over several decades.
The first problem was getting the data. You can download lots of historical weather data from NOAA, but I’ve never found its website easy to search. I’m sure the terminology NOAA uses makes sense to regular weather analysts, but as a dilettante, I’ve never felt comfortable with it. Still, I was able to find this page, which presented a clickable map with local forecast offices across the US.

Clicking on northern Illinois brought me to another page that let me choose the data to download. They call it NOWData,1 and I selected monthly summarized snowfall data for the Chicago area.

After clicking Go, this overlay appeared:

This is in nearly perfect form for the analysis I want to do. Each row represents a 12-month period centered on New Year’s Day, so the total at the end of the row is that entire winter’s snowfall.
I selected all the rows of the table (apart from some summary rows at the bottom) and used a combination of BBEdit and Numbers to clean things up and convert it to a CSV file. Here are the first several rows:
Season,Jul,Aug,Sep,Oct,Nov,Dec,Jan,Feb,Mar,Apr,May,Jun,Total
1883,,,,,,,,,,,,0.0,
1884,0.0,0.0,0.0,0.0,0.5,8.8,20.2,19.0,3.6,1.9,0.0,0.0,54.0
1885,0.0,0.0,0.0,0.0,0.7,14.6,26.7,6.0,1.9,1.0,0.0,0.0,50.9
1886,0.0,0.0,0.0,0.0,2.6,9.8,17.7,4.2,6.2,0.0,0.0,0.0,40.5
1887,0.0,0.0,0.0,0.0,2.5,9.9,11.9,2.2,3.5,2.0,0.1,0.0,32.1
1888,0.0,0.0,0.0,0.0,0.5,3.2,6.0,7.9,5.1,0.0,0.0,0.0,22.7
1889,0.0,0.0,0.0,0.0,1.3,0.0,2.7,8.3,9.4,0.0,0.0,0.0,21.7
1890,0.0,0.0,0.0,0.0,0.0,7.1,3.5,1.1,7.7,2.2,0.0,0.0,21.6
1891,0.0,0.0,0.0,0.0,6.8,5.5,15.3,2.8,3.1,0.0,0.0,0.0,33.5
1892,0.0,0.0,0.0,0.0,0.8,2.1,15.2,11.8,1.0,0.6,0.0,0.0,31.5
1893,0.0,0.0,0.0,0.0,7.5,12.1,6.5,12.9,5.4,0.0,0.0,0.0,44.4
1894,0.0,0.0,0.0,0.0,2.5,10.1,15.4,14.0,5.2,0.0,0.0,0.0,47.2
1895,0.0,0.0,0.0,0.0,14.5,3.4,2.0,27.8,5.9,0.0,0.0,0.0,53.6
1896,0.0,0.0,0.0,0.0,4.2,1.3,,,,,,,
As you can see, I turned all the Ms into explicit missing values and all the Ts into zeros. I also deleted the “second half” of the first column, identifying each season by the calendar year in which it starts.
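If I’d wanted to stay in Python for the cleanup, something like this would probably have worked (a sketch only; I actually used BBEdit and Numbers, and the raw file name and season format here are assumptions about what the NOWData export looks like):

python:
import pandas as pd

# Read everything as strings so the M and T codes survive.
df = pd.read_csv('nowdata_raw.csv', dtype=str)   # hypothetical name for the raw export

# Turn M (missing) into real missing values and T (trace) into zero,
# then convert the month and Total columns to numbers.
cols = df.columns[1:]
df[cols] = (df[cols]
            .replace({'M': pd.NA, 'T': '0.0'})
            .apply(pd.to_numeric, errors='coerce'))

# Keep only the calendar year at the start of each season, e.g. '1884-1885' -> 1884.
df['Season'] = df['Season'].str.slice(0, 4).astype(int)

df.to_csv('area.csv', index=False)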
Although the table goes all the way back to 1883, I wasn’t interested in the early figures. Part of that came from what I saw as questionable data—trace amounts of snow in the Julys of 1902–1904?—and part was just disinterest in snowfall more than about a generation before I was born. So I decided to review just the data since 1932. Not an important year in weather, as far as I know, but a significant one in American history.
Here’s a plot of the raw data and a curve fit through it via local regression analysis. So yes, my youth was spent with more snowfall than the generation before or after. And the last three years have been nearly snowless.

You might question drawing a curve through data with this much year-to-year variation, but even without the blue line, I think it’s clear that the 60s and 70s had both the highest highs and the highest lows.
Here’s the Python code that produced the graph:
python:
1: #!/usr/bin/env python3
2:
3: import pandas as pd
4: import numpy as np
5: import matplotlib.pyplot as plt
6: from matplotlib.ticker import MultipleLocator, AutoMinorLocator, FixedLocator
7: import statsmodels.api as sm
8:
9: # Read in the data, putting only the Season and Total columns in the dataframe.
10: df = pd.read_csv('area.csv', usecols=[0, 13])
11:
12: # Retain rows since 1932 and when the Total isn't missing.
13: df = df[(df.Season >= 1932) & df.Total.notna()]
14:
15: # Smooth the yearly snowfall with local regression.
16: smooth = sm.nonparametric.lowess(df.Total, df.Season, frac=.25, it=0)
17:
18: # Create the plot with a given size in inches.
19: fig, ax = plt.subplots(figsize=(6, 6))
20:
21: # Plot the raw and smoothed values.
22: ax.plot(df.Season, df.Total, '--', color='#bbbbbb', lw=1)
23: ax.plot(df.Season, df.Total, '.', color='black', ms=4)
24: ax.plot(smooth[:, 0], smooth[:, 1], '-', color='blue', lw=2)
25:
26: # Set the limits
27: plt.xlim(xmin=1930, xmax=2025)
28: plt.ylim(ymin=0, ymax=100)
29:
30: # Set the major and minor ticks and add a grid.
31: ax.xaxis.set_major_locator(MultipleLocator(10))
32: ax.xaxis.set_minor_locator(AutoMinorLocator(2))
33: ax.yaxis.set_major_locator(MultipleLocator(10))
34: ax.yaxis.set_minor_locator(AutoMinorLocator(2))
35: ax.grid(linewidth=.5, axis='x', which='major', color='#dddddd', linestyle='-')
36: ax.grid(linewidth=.5, axis='y', which='major', color='#dddddd', linestyle='-')
37:
38: # Title and notes.
39: plt.title('Chicago area snowfall (inches)')
40: plt.text(1952, 86.5, 'NWS NOWData', ha='center', va='center',
41: fontsize='small', backgroundcolor='white')
42: plt.text(1952, 83, 'www.weather.gov/wrh/climate?wfo=lot', ha='center',
43: va='center', fontsize='x-small', backgroundcolor='white')
44: plt.text(2015, 3.5, 'Calendar year at start of winter season', ha='right',
45: va='center', fontsize='x-small', backgroundcolor='white')
46:
47: # Make the border and tick marks 0.5 points wide.
48: [ i.set_linewidth(0.5) for i in ax.spines.values() ]
49: ax.tick_params(which='both', width=.5)
50:
51: # Save as PNG.
52: plt.savefig('20251108-Chicago area snowfall.png', format='png', dpi=200, bbox_inches='tight')
There’s really not much Pandas in this script. The only interesting bits are in Lines 10 and 13. The usecols parameter in Line 10 limits the dataframe to just the two fields we care about in the CSV. Then the filter in Line 13 gets rid of both the years before 1932, as discussed above, and a few other years in which one or more months of missing data prevent a legitimate snowfall total.
Eliminating those missing value rows was necessary to run the lowess function in Line 16. I chose the frac and it parameters based on an “eye test” of the resulting curve. I’d try to come up with a better justification if this were more than just a blog post.
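For what it’s worth, one way to run that eye test is to overlay curves for a few values of frac and see which one smooths the noise without flattening the real bumps. Something like this sketch, reusing the df from the script above:

python:
import matplotlib.pyplot as plt
import statsmodels.api as sm

fig, ax = plt.subplots(figsize=(6, 6))
ax.plot(df.Season, df.Total, '.', color='black', ms=4)

# Overlay lowess curves for a few candidate frac values.
for frac in (0.15, 0.25, 0.5):
    smooth = sm.nonparametric.lowess(df.Total, df.Season, frac=frac, it=0)
    ax.plot(smooth[:, 0], smooth[:, 1], lw=1.5, label=f'frac = {frac}')

ax.legend()
plt.show()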
In addition to keeping my skills up in Pandas, I’d like to learn more about Polars and R. Since Polars’s main claim to fame is being faster than Pandas, it doesn’t make much sense to use it on a data set with only about 150 records. So I wrote a little R script to produce this less-than-satisfying plot:

The bulk of the plot is fine, but I didn’t stick with it long enough to get the details looking the way I like. The tick marks and major grid lines are too thick, and I’d rather not have the minor grid lines at all. I don’t understand why R created them without any associated ticks. In any event, here’s my amateurish R code:
1: #!/usr/bin/env Rscript
2:
3: library(tidyverse)
4:
5: # Import the data, including only the Season and Total columns.
6: df = read_csv('area.csv', col_select=c(1, 14), show_col_types=FALSE)
7:
8: # Retain rows since 1932 and when the Total isn't missing.
9: df <- filter(df, Season >= 1932, !is.na(Total))
10:
11: # Plot the raw and smoothed data using the bw theme.
12: p1 <- ggplot(data=df, mapping=aes(x=Season, y=Total)) +
13: theme_bw() +
14: geom_line(color='#bbbbbb', linetype='dashed', linewidth=.25) +
15: geom_point(size=.5) +
16: geom_smooth(method='loess', se=FALSE, span=.4, linewidth=.75, color='blue')
17:
18: # Set the scale and tick spacing.
19: p2 <- p1 +
20: scale_x_continuous(name=NULL, limits=c(1930, 2025), expand=0,
21: breaks=seq(1930, 2020, 10)) +
22: scale_y_continuous(name=NULL, limits=c(0, 100), expand=0,
23: breaks=seq(0, 100, 10))
24:
25: # Set the title.
26: p3 <- p2 +
27: ggtitle('Chicago area snowfall (inches)') +
28: theme(plot.title=element_text(size=10, hjust=.5))
29:
30: # Save it as a PNG file.
31: ggsave('20251109-Chicago snowfall ggplot.png', p3,
32: width=1200, height=1200, units='px')
I do like R’s layering approach of adding features to a plot step by step. And it’s cool that I can add a smoothing line without explicitly creating a smoothed data series first. I’m not sure of the internals of R’s loess algorithm, other than that it’s clearly not the same as the lowess function I used in the Python script. I had to use different parameters to get a curve that looked similar.
One thing I found absolutely infuriating was the need to add the expand=0 parameter to Lines 20 and 22. To me, when I set the limits on an axis, the plotting software should not extend that axis by an extra 5% or so by default. It should just do what I tell it to do and nothing more.
Anyway, I found the R approach to plotting interesting enough to give it another try the next time a problem like this comes up. I really should crack open Kieran Healy’s book, which Amazon says I bought nearly seven years ago. I have more time now.
-
The NOW part is one of those annoying acronyms of an acronym: NOAA Online Weather. ↩
The wait for new Siri continues
November 4, 2025 at 4:25 PM by Dr. Drang
Yesterday, I drove to Champaign-Urbana for the Illinois men’s home basketball opener. Because I’m going to the CSO concert tonight, I stayed here overnight. When I started my car this morning, CarPlay connected to my phone and offered to give me directions home. Snort.
It’s common for CarPlay to do this when I’m away from home, and under most circumstances it’s a reasonable suggestion. A couple of years ago, when my wife was having day-long chemo infusions at the University of Chicago Medical Center, it was nice to have our rush hour route automatically plotted from Hyde Park home to Naperville. But I’m obviously not going home today until after the concert.
Should Siri know this? Well, the concert ticket is in my Wallet, the event (with location) is in my Calendar, and the email with a link to the ticket is archived somewhere in Mail. Last year’s Apple Intelligence ads would lead anyone—anyone who didn’t keep track of Siri’s actual capabilities, that is—to think that suggestions given by an Apple device would be precise and personalized. But even people who don’t follow Apple closely know that last year’s ads were lies.
We’re now hoping that Mark Gurman is right and a bespoke version of Google Gemini will be the savior of personalized Siri, giving us an assistant that’s aware of our calendar, contacts, and so on, but doesn’t transmit all our data to the don’t be evil empire in Mountain View.
Or maybe we shouldn’t bother. If Apple hadn’t made promises it couldn’t keep, today’s suggestion to drive home wouldn’t have struck me as funny. I’d be happy when suggestions like that are right and would dismiss them without a second thought when they’re wrong. It’s the hope that kills you.
The problem with dollars
November 1, 2025 at 9:08 AM by Dr. Drang
I stopped charting Apple’s quarterly results back in 2019 and don’t intend to return to it, but after seeing the recent posts at Six Colors and TidBITS, I thought I’d try out a new graph.
I got out of the Apple charting business shortly after Apple stopped reporting unit sales with the Q1 2019 figures.1 I care more about how many products Apple is putting into people’s hands than about how much money is passing from those hands into Apple’s pocket. I mentioned this in my penultimate quarterly charting post, and I also pointed out a problem with charting dollar sales instead of unit sales:
One unaddressed problem with [Apple’s revenue figures] is that they don’t account for inflation—something I didn’t have to worry about when I was plotting unit sales. Apple doesn’t account for inflation either, of course, but that doesn’t mean I shouldn’t. If I’m going to keep doing this, I’ll have to decide on an inflation index and a basis year.
Of course, I didn’t decide on how to handle inflation. I chickened out and stopped doing quarterly posts after just one more set of charts. And Apple is happy to ignore inflation, because doing so allows it to report most quarters as “the best ever” in one form or another.
But let’s adjust Apple’s numbers for inflation and see what happens. For simplicity, I’m going to use the CPI-U Index, a common measure of inflation. It looks like this from 2018 to the present:

I downloaded the CPI-U monthly values from the Bureau of Labor Statistics and adjusted Apple’s revenue figures to their equivalent in Q1 2019 dollars. Here’s how that looks compared to the unadjusted revenue:

After adjustment, the climb after the big work-from-home jump isn’t as impressive, although the non-Q1 quarters of fiscal 2025 are quite good. The just-reported Q4 revenue truly is the best Q4 ever, even after adjustment.
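The adjustment itself is nothing fancy: each quarter’s nominal revenue gets multiplied by the ratio of the basis-quarter CPI-U to that quarter’s CPI-U. Here’s a sketch of how it could be done in Pandas (the file and column names are placeholders, not the actual files I used):

python:
import pandas as pd

# Placeholder file and column names; the real BLS download and revenue table differ.
cpi = pd.read_csv('cpi-u-quarterly.csv', index_col='Quarter')    # average CPI-U per fiscal quarter
rev = pd.read_csv('apple-revenue.csv', index_col='Quarter')      # nominal revenue per fiscal quarter

# Express every quarter's revenue in Q1 2019 dollars.
base = cpi.loc['2019-Q1', 'CPI']
rev['Adjusted'] = rev['Revenue'] * base / cpi['CPI']

print(rev.tail())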
You might say that CPI-U isn’t the right inflation measure to use. Also—as Jason Snell pointed out when I showed him this graph—Apple is a worldwide company, and there are different inflation rates in every country. But some adjustment should be made, especially if you want to graph results over several years and some of those years include the COVID inflation period. I’ve chosen the CPI-U because Apple reports its earnings in US dollars, and that’s the inflation metric most Americans are used to seeing.
This isn’t to say that it’s wrong to report the unadjusted numbers, just that accounting for inflation gives some added perspective. Does this mean I’m suggesting additional work for others that I myself don’t intend to do? Yes.
-
In case you’ve forgotten, Apple’s fiscal year ends on the last Saturday of September, so Q1 2019 covers roughly October through December of 2018. ↩