Thanksgivings and Thursdays
November 27, 2025 at 2:58 PM by Dr. Drang
Thanksgiving is not the last Thursday of November. I mean, it is this year, and it usually is, but that’s not how it’s determined. Since 1942, Thanksgiving—the US one, not the Canadian one—has been set by federal law to be the fourth Thursday in November. The law was passed in December of 1941, when you would have thought Congress and the White House had more important things to think about.
How often is Thanksgiving not the last Thursday of November? One way of thinking about it is to consider the day of the week of November 1st. If it falls on a Wednesday or Thursday, there will be five Thursdays that month. Another way is to consider the date of Thanksgiving. If it’s on the 22nd or 23rd, there will be a fifth Thursday after it. Let’s look at both.
Here’s a little Python code to list the Novembers since 1942 that have had five Thursdays using the first method:
python:
from datetime import date
for y in range(1942, 2026):
    nov1 = date(y, 11, 1)
    if nov1.weekday() == 2 or nov1.weekday() == 3:
        print(y, end=', ')
The result, after deleting the final comma and reformatting it to show six years per line, is
1944, 1945, 1950, 1951, 1956, 1961,
1962, 1967, 1972, 1973, 1978, 1979,
1984, 1989, 1990, 1995, 2000, 2001,
2006, 2007, 2012, 2017, 2018, 2023
which is 24 Novembers over the past 84 years. The gaps between five-Thursday Novembers are either one or five years.
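The gap pattern is easy to check directly. Here’s a small sketch that rebuilds the list with the same weekday test and then takes the differences between consecutive years. The names five_thursday_years and gaps are mine, introduced just for this illustration.

```python
from datetime import date

# Years from 1942 through 2025 in which November 1 is a Wednesday (2)
# or a Thursday (3), which gives the month five Thursdays.
five_thursday_years = [y for y in range(1942, 2026)
                       if date(y, 11, 1).weekday() in (2, 3)]

# Differences between consecutive five-Thursday years.
gaps = [b - a for a, b in zip(five_thursday_years, five_thursday_years[1:])]

print(len(five_thursday_years))
print(sorted(set(gaps)))
```

The set of distinct gaps should contain only the two values mentioned above.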
Doing it again with the second method, we first define a utility function that returns the date of Thanksgiving for a given year and then check the day of the month for every year since 1942:
python:
def thanksgiving(y):
    day = (3 - date(y, 11, 1).weekday()) % 7 + 22
    return date(y, 11, day)
for y in range(1942, 2026):
    if thanksgiving(y).day < 24:
        print(y, end=', ')
The output is the same as before.
I should mention that thanksgiving takes advantage of Python’s sign convention for the modulo (%) operator. The result has the same sign as the divisor (the number after the %), regardless of the sign of the dividend (the number before the %). So
(3 - date(y, 11, 1).weekday()) % 7 + 1
returns the day of the month corresponding to the first Thursday of the month. Adding 21 to that gives us the fourth Thursday.
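To make the sign convention concrete, here’s a sketch that pulls the first-Thursday expression into its own function and checks it against every year in the range. The name first_thursday is one I’m introducing for illustration; it isn’t part of the code above.

```python
from datetime import date

def first_thursday(y):
    '''Day of the month of the first Thursday in November of year y.'''
    # weekday() is 0 for Monday through 6 for Sunday, so Thursday is 3.
    # When November 1 falls after Thursday, 3 - weekday() is negative,
    # but Python's % still returns a value from 0 through 6.
    return (3 - date(y, 11, 1).weekday()) % 7 + 1

# A negative dividend still yields a non-negative remainder.
print((3 - 5) % 7)

for y in range(1942, 2026):
    d = first_thursday(y)
    # The first Thursday must land in the first week and be a Thursday.
    assert 1 <= d <= 7
    assert date(y, 11, d).weekday() == 3

print(first_thursday(2025) + 21)
```

In 2025, November 1 is a Saturday (weekday 5), so the expression works out to (3 - 5) % 7 + 1, which is 6, and the fourth Thursday is the 27th.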
Getting 24 five-Thursday Novembers in 84 years is two-sevenths, which seems like the expected result for how often November starts on a Wednesday or Thursday. But will that be the result in the longer term?
The Gregorian calendar repeats every 400 years,[1] so to get the fraction of Novembers with five Thursdays over the long haul, we have to check over that range:
python:
count = 0
for y in range(1942, 2342):
    nov1 = date(y, 11, 1)
    if nov1.weekday() == 2 or nov1.weekday() == 3:
        count += 1
print(count/400)
The result is 0.285, which is just a hair less than 2/7 (0.285714). There’s no way for the answer to be exactly 2/7 because 400 isn’t divisible by 7, but this is awfully close.
Happy Thanksgiving!
-
[1] The calendar cycle repeats every 28 years except when that period includes a century year that isn’t divisible by 400.
NASA fact sheets—missing and found
November 25, 2025 at 9:50 PM by Dr. Drang
While I was putting together my recent post on Maya calendar calculations, I tried to get some information on the Moon from a NASA website. Specifically, a page with the prosaic name Moon Fact Sheet at the URL
https://nssdc.gsfc.nasa.gov/planetary/factsheet/moonfact.html
The gsfc part of the domain refers to the Goddard Space Flight Center, and the nssdc part refers to the NASA Space Science Data Coordinated Archive (formerly the National Space Science Data Center, which explains the missing a). It’s one of several similar pages with nicely summarized information on planets and other bodies in the solar system.
Did I say it is one of several such pages? I meant it was one of several such pages. They’re all gone and apparently have been for a few months now. Why? Well, my first thought was some sort of DOGE-inspired budget cut that saved the US taxpayer the untold billions of dollars necessary to host maybe as much as a half-megabyte of static HTML.
Or maybe NASA itself was to blame. Maybe it decided to replace the dry but wonderfully useful tables of numbers with this page and similar, which have lots of nice images but very few numbers. Or maybe the various Fact Sheets have been moved where neither Google nor Kagi can find them.
Whatever the answer, I was sad for the loss. But only momentarily. Because the Internet Archive still had the Moon Fact Sheet and all the other Planetary Fact Sheets. Hooray!

I’ve bookmarked the planetary sheet, as it has links to all the others. For what it’s worth, the last archived versions seem to date from August of this year.
As an early Christmas present to myself, and to the rest of the internet, I made a donation.
Web logs with Pandas
November 24, 2025 at 9:31 AM by Dr. Drang
I don’t care much about monitoring the traffic here, but analyzing web logs seemed like a good way to practice using Pandas. I used some old data analysis friends, like groupby, and made some new ones, like the vectorized str functions. I also reacquainted myself with rsync, which I used to run quite often in my Linux days.
Let’s start by laying out the problem. My web server is running Apache 2, and it generates access log files in the combined format. I wanted my script, called top-pages, to download and parse the necessary log files from the server and report on the top pages accessed either yesterday or today. Today, running top-pages with no options returns this:
                           Sunday, November 23, 2025
                                   Page path  Visits
 1                    2015/01/apple-leverage     651
 2   2025/11/some-maya-calendar-calculations     315
 3               2025/11/casting-about-again     222
 4              2025/11/charting-malpractice     185
 5                 2025/11/variance-of-a-sum     179
 6   2025/11/the-wait-for-new-siri-continues     178
 7                   2025/11/snow-and-memory     123
 8          2025/11/the-problem-with-dollars     114
 9                  2025/10/other-wavy-paths     109
10          2025/10/vectors-and-weathervanes     105
                                   All pages  10,050
These are the ten individual blog posts with the most hits yesterday and the total number of hits on all individual posts. As you can see, top-pages doesn’t report on hits to the home page, my RSS feeds, image files, or anything other than blog posts.
To see the options top-pages can handle, here’s the output from running top-pages -h:
usage: top-pages [-h] [-t | -y] [-n [N]]
Table of the top pages from ANIAT weblogs.
options:
  -h, --help  show this help message and exit
  -t          today's top pages
  -y          yesterday's top pages (default)
  -n [N]      number of pages in table (default: 10)
So I can change the length of the table with the -n option, and I can have it show today’s top pages (so far) with the -t option. The -t and -y options are mutually exclusive. If I try to run top-pages -t -y, it will return
usage: top-pages [-h] [-t | -y] [-n [N]]
top-pages: error: argument -y: not allowed with argument -t
Because top-pages runs on my MacBook Pro, it has to download the necessary log files from the server. This is not efficient, but right now I’m more interested in writing top-pages than in running it. If I suddenly get the urge to run it on a regular basis, I’ll move it to the server and set up a cron job there to run it every day and email me the results.
The access log files are in the /var/log/apache2 directory on the server. At present, that directory contains these files:
access.log
access.log.1
access.log.2.gz
access.log.3.gz
[more access logs]
error.log.1
error.log.2.gz
error.log.3.gz
[more error logs]
access.log is the current access log file. It’s being continually updated as the site gets visited. Every morning at 6:25 am (in the server’s time zone), the current access log is archived into access.log.1 and a new version of access.log is started. At the same time, what used to be access.log.1 gets gzip’d into access.log.2.gz, and the older gzip’d access logs get their numbers incremented.
The upshot of this is that the only access logs top-pages needs to determine the hits from yesterday and today are access.log, access.log.1, and access.log.2.gz. We’ll see this when we get into the source code.
Which I guess it’s time to do:
python:
1: #!/usr/bin/env python3
2:
3: import pandas as pd
4: from datetime import datetime, timedelta
5: from zoneinfo import ZoneInfo
6: import argparse
7: import subprocess
8: import os.path
9:
10: # Functions for later use.
11: def day_params(whichDay):
12:     '''Return the title, start, and end parameters for whichDay.
13:
14:     The whichDay argument can be either "today" or "yesterday."
15:     Any value other than "today" is treated as "yesterday."
16:     '''
17:
18:     if whichDay == 'today':
19:         theDay = datetime.now(tz=ZoneInfo('America/Chicago'))
20:     else:
21:         theDay = datetime.now(tz=ZoneInfo('America/Chicago')) - timedelta(days=1)
22:     dTitle = f'{theDay:%A, %B %-e, %Y}'
23:     dStart = theDay.replace(hour=0, minute=0, second=0, microsecond=0)
24:     dEnd = theDay.replace(hour=23, minute=59, second=59, microsecond=999_999)
25:     return dTitle, dStart, dEnd
26:
27: def date_slug(request, width):
28:     '''Return a series of yyyy/mm/slug strings from the request series.
29:
30:     The strings are width characters long, truncated or left-padded,
31:     as necessary.
32:     '''
33:
34:     # The page URL is between two space characters.
35:     dslug = request.str.replace(r'^[^ ]+ ([^ ]+) .+', r'\1', regex=True)
36:     # Strip everything but the yyyy/mm/slug.
37:     dslug = dslug.str.replace(r'(https?://(www\.)?leancrew\.com)?/all-this/',
38:                               r'', regex=True)
39:     dslug = dslug.str.replace(r'(^\d\d\d\d/\d\d/[^/]+).*$', r'\1', regex=True)
40:     # Shorten if necessary.
41:     long = dslug[dslug.str.len()>width]
42:     dslug[long.index] = dslug[long.index].str.slice_replace(start=width-3,
43:                                                             repl='...')
44:     return dslug.str.pad(width)
45:
46: def read_apache_access(f):
47:     '''Return a dataframe from an Apache access log file.
48:
49:     The log is in combined format and the returned dataframe has columns for
50:     Request, Status, and Datetime. The Datetime is determined via the Timestamp
51:     and Zone columns, which are not part of the returned dataframe.
52:     '''
53:
54:     # CSV reading code adapted from https://stackoverflow.com/questions/58584444
55:     headers = 'Timestamp Zone Request Status'.split()
56:     df = pd.read_csv(f, sep=' ', escapechar='\\', quotechar='"',
57:                      usecols=[3, 4, 5, 6], names=headers)
58:
59:     # Turn the Timestamp and Zone into a single Datetime column.
60:     df['Datetime'] = pd.to_datetime(df.Timestamp.str.slice(start=1) +
61:                                     df.Zone.str.slice(stop=-1),
62:                                     format='%d/%b/%Y:%H:%M:%S%z',
63:                                     utc=True)
64:
65:     # Don't include the Timestamp and Zone columns in the returned dataframe.
66:     return df.drop(labels=['Timestamp','Zone'], axis='columns')
67:
68:
69: ########## Main program ##########
70:
71: # Handle the arguments.
72: desc = 'Table of the top pages from ANIAT weblogs.'
73: parser = argparse.ArgumentParser(description=desc)
74: group = parser.add_mutually_exclusive_group()
75: group.add_argument('-t',
76:                    help="today's top pages",
77:                    default=False,
78:                    action='store_true')
79: group.add_argument('-y',
80:                    help="yesterday's top pages (default)",
81:                    default=True,
82:                    action='store_true')
83: parser.add_argument('-n',
84:                     help="number of pages in table (default: 10)",
85:                     nargs='?',
86:                     default=10,
87:                     const=10,
88:                     type=int)
89: args = parser.parse_args()
90:
91: # Set the printing parameters.
92: iwidth = len(str(args.n)) # index
93: pwidth = 40 # page path
94: vwidth = 7 # visits
95: fwidth = iwidth + 2 + pwidth + 1 + vwidth # full table
96: twidth = iwidth + 2 + pwidth # "All pages" position
97:
98: # Set the day's title, start, and end parameters.
99: if args.t:
100: dTitle, dStart, dEnd = day_params('today')
101: else:
102: dTitle, dStart, dEnd = day_params('yesterday')
103:
104: # Set the local log file parameters.
105: logdir = os.path.join(os.environ['HOME'],
106:                       'Library/Mobile Documents/com~apple~CloudDocs/personal/weblogs/')
107: af = [os.path.join(logdir, 'access.log.2'),
108:       os.path.join(logdir, 'access.log.1'),
109:       os.path.join(logdir, 'access.log')]
110:
111: # Update the log files.
112: rsync = f"rsync -az -e 'ssh -p 1234'\
113:           --include=access.log.2.gz\
114:           --include=access.log.1\
115:           --include=access.log\
116:           --exclude=*\
117:           user@server.com:/var/log/apache2/ '{logdir}'"
118: subprocess.run(rsync, shell=True)
119:
120: # Gunzip the oldest log file if necessary. Keep the gzipped file.
121: gzfile = af[0] + '.gz'
122: if (not os.path.exists(af[0])) or\
123:    (os.path.getmtime(af[0]) < os.path.getmtime(gzfile)):
124:     subprocess.run(f"gunzip -fk '{gzfile}'", shell=True)
125:
126: # Read the log files into a dataframe.
127: df = pd.concat((read_apache_access(f) for f in af), ignore_index=True)
128:
129: # Limit to one day's successes using my timezone.
130: df = df[(df.Datetime >= dStart) &
131:         (df.Datetime <= dEnd) &
132:         (df.Request.str.slice(stop=4) == 'GET ') &
133:         (df.Status == 200)]
134:
135: # Limit to just the requests for individual pages.
136: df = df[df.Request.str.contains(r'all-this/\d\d\d\d/\d\d', regex=True)]
137:
138: # Get the yyyy/mm/slug from Requests and add it as a new column.
139: dslug = date_slug(df.Request, pwidth)
140: df.insert(0, 'Page path', dslug)
141:
142: # Group by Page path, count the entries, and sort in descending order.
143: # All of the columns have the same count, so I chose to include just the
144: # Status column and renamed it Visits for presentation.
145: # Limit it to just the first args.n rows.
146: top = df.groupby('Page path')\
147:         .count()\
148:         .sort_values('Status', ascending=False)\
149:         ['Status'].reset_index(name='Visits')[:args.n]
150:
151: # Make the index column one-based, not zero-based, and left-padded.
152: top.index = [ f'{i+1:{iwidth}}' for i in range(len(top)) ]
153:
154: # Print out the date, the top pages and the count of all pages.
155: print(f'{dTitle:>{fwidth}s}')
156: print(top.to_string(formatters=[lambda p: f'{p:s}',
157:                                 lambda v: f'{v:{vwidth},d}']))
158: print(f'{'All pages':>{twidth}} {len(df):{vwidth},d}')
This is distinctly longer than most of the scripts I post here, so the explanation will be longer, too. Apologies in advance. I’m going to start the description with the main body of the program—we’ll deal with the functions as they get called.
The first section, Lines 72–89, uses the argparse library to define and process the command-line arguments. The -t and -y options are defined as members of a mutually exclusive group so they give us the behavior shown above. The -n option is separate, with a default value of 10. No one who watched as much David Letterman as I did would set the default to anything other than a Top Ten list.
The next few sections set global parameters that the balance of the program uses.
The printing parameters (Lines 92–96) are the widths and positions of the various columns in the table. For reasons I don’t quite understand, Pandas likes to put two spaces after the index column but just one space between the others. That’s why you see the 2 and the 1 in the fwidth and twidth definitions.
The date and time parameters (Lines 99–102) control the title of the table and the datetime limits used to filter the entries that get counted. This section calls the day_params function that’s defined in Lines 11–25. The key feature of day_params is that it bases its definitions of “today” and “yesterday” on my home timezone here in the Chicago area. This makes the dStart and dEnd values timezone-aware. As we’ll see later, we can compare them with the datetime values from the access logs, which are set to UTC, and get correct results.
The local log file parameters (Lines 105–109) set the directory and full file paths to the access log files after they’ve been downloaded to my MacBook Pro. As you probably know, your iCloud Drive folder is this directory:
~/Library/Mobile Documents/com~apple~CloudDocs
I keep the access log files in a personal/weblogs subdirectory in iCloud Drive.
With the parameters defined, it’s time to start analyzin’. The first thing we need to do is make sure the local copies of the log files are up to date. That’s done through the longish rsync command that’s defined on Lines 112–117 and executed on Line 118. The port, user, and server you see in these lines are, of course, fictional, but the rest of the command is exactly as I use it. Basically, it downloads the access.log, access.log.1, and access.log.2.gz files if the copies on the server differ from the ones on my Mac.
The next section, Lines 121–124, uncompresses the access.log.2.gz if necessary. The decision is made based on the presence of an (uncompressed) access.log.2 file and its modification date. The gunzip command is run if either
- there is no uncompressed file, or
- the uncompressed file is older than the compressed file.
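The two conditions in that list can be pulled out into a small predicate. This is just a sketch of the same test; needs_gunzip is a hypothetical name that doesn’t appear in top-pages, and it assumes the compressed file exists.

```python
import os

def needs_gunzip(plain, gz):
    '''Return True if gz should be uncompressed to produce plain.'''
    # There is no uncompressed copy yet.
    if not os.path.exists(plain):
        return True
    # The uncompressed copy is older than the compressed one.
    return os.path.getmtime(plain) < os.path.getmtime(gz)
```

Checking for existence first matters, because getmtime raises an error when given a path that isn’t there. The or in the script short-circuits in the same way.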
The sharp-eyed among you will note that I could save some downloading here by noting whether top-pages is being asked to return a table for yesterday or today. I decided to ignore this optimization because I wanted to focus on Pandas. And if this script ever gets moved to the server, none of this rsync stuff will be necessary.
With these preliminaries over, it’s time to move into the Pandas stuff. The first step is to read the log files into a dataframe. This is done on Line 127, which uses the read_apache_access function defined on Lines 46–66. To help understand read_apache_access, we should take a look at an example log file entry:
34.96.45.20 - - [23/Nov/2025:06:56:36 -0600] "GET /all-this/2025/11/some-maya-calendar-calculations/ HTTP/1.1" 200 8111 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36"
Sorry about the very long line, but that’s how they look in the access log file.
The fields are separated by space characters. If a space character is part of a field, the field is surrounded by double quotation marks. Except, unfortunately, for the date/time field. For some reason, that field is surrounded by square brackets instead of double quotes, which makes the parsing in read_apache_access a little tricky.
The solution is in the read_csv call in Lines 56–57. By setting the sep, escapechar, and quotechar parameters as shown, we get all the fields the way we want except the date/time, which gets parsed as two fields, one with the timestamp (and a leading bracket) and the other with the timezone offset from UTC (and a trailing bracket). For the example above, the Timestamp field is
[23/Nov/2025:06:56:36
and the Zone field is
-0600]
Lines 60–63 then combine these two fields—after stripping the brackets—with the to_datetime function and add a new Datetime field to the dataframe. The utc=True parameter in the call to to_datetime does exactly what you think: it makes the resulting datetime value timezone-aware and set to UTC. read_apache_access then returns the dataframe with three fields: Request, Status, and Datetime. In the example, Request is
GET /all-this/2025/11/some-maya-calendar-calculations/ HTTP/1.1
Status is 200, and Datetime (in UTC) is 12:56:36 on 2025-11-23.
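To see the field splitting without downloading any logs, here’s a sketch that runs the example entry through Python’s standard csv module with the same separator, quote, and escape characters. The csv module is a stand-in I’m using for illustration; the script itself uses Pandas, and I’m assuming the two split this particular line the same way.

```python
import csv
import io
from datetime import datetime, timezone

line = ('34.96.45.20 - - [23/Nov/2025:06:56:36 -0600] '
        '"GET /all-this/2025/11/some-maya-calendar-calculations/ HTTP/1.1" '
        '200 8111 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) '
        'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 '
        'Safari/537.36"')

# Same separator, quote, and escape characters as the read_csv call.
reader = csv.reader(io.StringIO(line), delimiter=' ', quotechar='"',
                    escapechar='\\')
fields = next(reader)

# Fields 3 through 6 are the ones top-pages keeps.
timestamp, zone, request, status = fields[3:7]

# Strip the brackets, join the pieces, and convert to UTC.
local = datetime.strptime(timestamp[1:] + zone[:-1], '%d/%b/%Y:%H:%M:%S%z')
utc = local.astimezone(timezone.utc)

print(timestamp, zone)
print(request)
print(status, utc.isoformat())
```

One difference from the Pandas version: csv returns every field as a string, so status here is '200' rather than the integer 200.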
Line 127 uses a generator expression to loop through the three access log files, creating a dataframe for each and putting them together with concat.
The next section, Lines 130–133, filters the dataframe to include only successful GETs within the specified day. Line 136 further filters the dataframe to just individual posts. It does this by searching for the all-this/yyyy/mm pattern that all the posts at ANIAT have.
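The day filter works because timezone-aware datetimes compare by the instant they represent, not by their clock readings. Here’s a sketch with plain datetime objects. The fixed UTC-6 offset is my simplification for a late November date (the script uses ZoneInfo, which also handles daylight saving time), and the hit times are made up.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-6 offset as a stand-in for America/Chicago in late November.
# This is an assumption for the example, not what the script does.
central = timezone(timedelta(hours=-6))

dStart = datetime(2025, 11, 23, 0, 0, 0, tzinfo=central)
dEnd = dStart.replace(hour=23, minute=59, second=59, microsecond=999_999)

# Hypothetical log times in UTC, as read_apache_access returns them.
hits = [
    datetime(2025, 11, 23, 5, 59, 59, tzinfo=timezone.utc),   # 11:59:59 PM on the 22nd, Central
    datetime(2025, 11, 23, 12, 56, 36, tzinfo=timezone.utc),  # 6:56:36 AM on the 23rd
    datetime(2025, 11, 24, 5, 59, 59, tzinfo=timezone.utc),   # 11:59:59 PM on the 23rd
    datetime(2025, 11, 24, 6, 0, 0, tzinfo=timezone.utc),     # midnight on the 24th
]

# True for hits that fall within the Central-time day.
in_day = [dStart <= h <= dEnd for h in hits]
print(in_day)
```

Only the middle two hits fall on the 23rd in Central time, even though three of the four carry a UTC date of the 23rd or are within a second of it.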
The next section, Lines 139–140, adds a new field, Page path, by normalizing the page URLs in the Request field. It does this through a call to the date_slug function, which is defined on Lines 27–44. date_slug goes through a series of regex replacements to strip out all the inessential parts, leaving just the yyyy/mm/slug part of the URL. It also pads or shortens the string to width characters.
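The regex steps are easier to follow on a single string. This sketch applies the same three patterns with the standard re module instead of the vectorized str functions. date_slug_one is a hypothetical name for this one-string version, and the truncation and padding are my plain-Python equivalents of slice_replace and pad.

```python
import re

def date_slug_one(request, width):
    '''Return the yyyy/mm/slug part of one request, width characters long.'''
    # The page URL is between two space characters.
    s = re.sub(r'^[^ ]+ ([^ ]+) .+', r'\1', request)
    # Strip everything but the yyyy/mm/slug.
    s = re.sub(r'(https?://(www\.)?leancrew\.com)?/all-this/', '', s)
    s = re.sub(r'(^\d\d\d\d/\d\d/[^/]+).*$', r'\1', s)
    # Shorten if necessary, then pad on the left.
    if len(s) > width:
        s = s[:width - 3] + '...'
    return s.rjust(width)

req = 'GET /all-this/2025/11/some-maya-calendar-calculations/ HTTP/1.1'
print(repr(date_slug_one(req, 40)))
print(repr(date_slug_one(req, 20)))
```

With a width of 40, the 39-character path gets one space of padding on the left. With a width of 20, it’s cut to 17 characters and finished off with three dots.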
We’re getting near the end now. The next section, Lines 146–149, is a groupby call that counts the entries, sorts them by count, and puts the top ten (or whatever you gave to the -n option) in a new dataframe called top. It also calls reset_index to turn the counts into a dataframe with a fresh numeric index. Without that call, top would be a series indexed by the Page path strings.
Because I wasn’t a computer science major, I don’t think a list of the top ten items should be numbered from zero to nine. Line 152 adds 1 to all the index numbers and right-aligns them.
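If the groupby chain seems opaque, the same count-sort-truncate-renumber sequence can be sketched with the standard library. Counter is a stand-in here, not what top-pages uses, and the page paths are a made-up sample.

```python
from collections import Counter

# Hypothetical page paths, one per successful request.
paths = ['2015/01/apple-leverage',
         '2025/11/variance-of-a-sum',
         '2015/01/apple-leverage',
         '2025/11/snow-and-memory',
         '2015/01/apple-leverage',
         '2025/11/variance-of-a-sum']

n = 2
# most_common sorts by count in descending order and keeps the first n.
top = Counter(paths).most_common(n)

# Number the rows from one, not zero.
for rank, (path, visits) in enumerate(top, start=1):
    print(f'{rank:2}  {path}  {visits:,d}')
print(f'All pages  {len(paths):,d}')
```

As in the script, the final count covers every page, not just the ones that made the table.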
Finally, Lines 155–158 print out the table that we saw back near the top of the post. The columns are formatted using f-string rules.
Although there was a lot of non-Pandas stuff in this script, I think it was a good exercise. The filtering and groupby parts matched up with a lot of the Pandas work I’ve done in the past, but the vectorized regex calls certainly didn’t. Most of my previous analysis has been with dataframes filled with either numbers or boolean values, so this was a nice stretch for me. It wouldn’t be hard to rejigger top-pages to count hits on my RSS feed or list the top referring sites. If I ever decide to care about such things.
Some Maya calendar calculations
November 21, 2025 at 10:15 PM by Dr. Drang
Longtime readers of this blog know I’m a sucker for calendrical calculations. So I couldn’t stop myself from doing some after reading this article by Jennifer Ouellette at Ars Technica.
The article covers a recent paper on the Dresden Codex, one of the few surviving documents from the Maya civilization. The paragraph that caught my eye was this one:
[The paper’s authors] concluded that the codex’s eclipse tables evolved from a more general table of successive lunar months. The length of a 405-month lunar cycle (11,960 days) aligned much better with a 260-day calendar (46 × 260 = 11,960) than with solar or lunar eclipse cycles. This suggests that the Maya daykeepers figured out that 405 new moons almost always came out equivalent to 46 260-day periods, knowledge the Maya used to accurately predict the dates of full and new moons over 405 successive lunar dates.
Many calendrical calculations involve cycles that are close integer approximations of orbital phenomena whose periods are definitely not integers. The Metonic cycle is a good example: there are almost exactly 235 synodic months in 19 years, so the dates of new moons (and all the other phases) this year match the dates of new moons back in 2006. “Match” has to be given some slack here; they match within 12 hours or so.
Anyway, the Maya used a few calendars, one of them being the 260-day divinatory calendar mentioned in the quote above. The integral approximation in this case is that 46 of these 260-day cycles match pretty well with 405 synodic months.
How well? The average synodic month is 29.53059 days long (see the Wikipedia article), so 405 of them add up to 11,959.9 days, which is a pretty good match. It’s off by just a couple of hours over a cycle of nearly 33 years.
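The arithmetic is short enough to check in a few lines. This is a sketch using only the figures quoted above; the 365.25-day year is my assumption for turning days into a rough number of years.

```python
SYNODIC_MONTH = 29.53059   # mean synodic month in days, as quoted above

lunar_days = 405 * SYNODIC_MONTH   # length of 405 mean synodic months
maya_days = 46 * 260               # 46 rounds of the 260-day calendar
diff_hours = (maya_days - lunar_days) * 24
years = maya_days / 365.25         # assumed year length, for a rough figure

print(f'{lunar_days:,.1f} days vs {maya_days:,d} days')
print(f'difference: {diff_hours:.1f} hours over {years:.1f} years')
```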
I wanted to see how well this works for specific cycles, not just on average. So I used Mathematica’s MoonPhaseDate function to calculate the dates of new moons over about five centuries, from 1502 through 2025. This gave me 6,480 (405 × 16) synodic months, and I could work out the lengths of all the 405-month cycles within that period. I’ll show all the Mathematica code at the end of the post.
First, the shortest synodic month in this period was 29.2719 days and the longest was 29.8326 days, a range of about 13½ hours. The mean was 29.5306 days, which matched the value from Wikipedia and its sources.
I then calculated the lengths of all the 405-month periods in this range: from the first new moon in the list to the 406th, from the second to the 407th, and so on. The results were a minimum of 11,959.5 days, a maximum of 11,960.3 days, and a mean of 11,959.9 days (consistent with the value calculated above).
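The sliding-window idea is simple to sketch outside Mathematica. This is a Python illustration, not the notebook code, and the input is hypothetical: new moons spaced evenly at the mean synodic month, so every cycle comes out the same length. Real new moon times vary, which is where the spread between the minimum and maximum comes from.

```python
def cycle_lengths(times, n=405):
    '''Lengths of every n-month span in a list of new moon times (in days).'''
    # Pair each new moon with the one n months later.
    return [later - earlier for earlier, later in zip(times, times[n:])]

# Hypothetical input: 6,481 evenly spaced new moons, which bound 6,480 months.
times = [29.53059 * i for i in range(6481)]
lengths = cycle_lengths(times)

print(len(lengths))
print(min(lengths), max(lengths))
```

A list of 6,481 new moons yields 6,076 overlapping 405-month cycles.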
These are good results but not perfect, and the Maya knew that. They made adjustments to their calendar tables, based on observations, and thereby maintained the accuracy they wanted.
Now that I’ve done this, I feel like going through the paper to look for more fun calculations.
The calculations summarized above were done in two Mathematica notebooks. The first used the MoonPhaseDate command repeatedly to build a list of all the new moons from 1502 through 2025 and save them to a file. Here it is:
Because MoonPhaseDate returns the next new moon after the given date, the For loop builds the list by getting the day after the last new moon, calculating the next new moon after that, and appending it to the end of the list. It’s a command that takes little time to write but a lot of time to execute—over a minute on my MacBook Pro with an M4 Pro chip. That’s why the last command in the notebook saves the newmoons list to a file. The notebook that does all the manipulations of the list could run quickly by just reading that list in from the saved file.
I don’t know any of the details of how MoonPhaseDate works or how accurate it is, but I assume it’s less accurate for dates further away from today. That’s why the period over which I had it calculate the new moons was well after the peak of the Maya civilization.
The second notebook starts by reading in the newmoons.wl file—a plain text file with a single Mathematica command that’s nearly 2 MB in size—and then goes on to calculate the lengths of each synodic month and each 405-month cycle. The statistics come from the Min, Max, and Mean functions.