Monthly Temperature Histories

Posted on April 10, 2015


The visualization above is based on the Daily Global Weather Measurements data set, originally collected by the National Climatic Data Center and available as a public data set on Amazon Web Services (AWS). (Note that, per the note at the bottom of the link, this data set can only be used within the United States.) On AWS, the data is available as a collection of text files on a 20 GB EBS snapshot.

To aid in exploring and analyzing this data, the scripts described here configure a MySQL server on EC2 and load the data into tables. The scripts use the boto and paramiko libraries in Python. Apart from one manual ssh action required in the middle of the process (as described in the notebook), they run through on their own. The result is a database with three tables, indexed for fast searches on station ID, year, month, day and temperature. The database uses 34 GB of disk space.

Example queries of this database (via the MySQLdb library in Python) are shown in this IPython notebook, including the queries used to extract and summarize the data in the D3 visualization shown above. There are more than 9000 stations available in the database; only a small subset of 259 is shown here (roughly one per country or territory).
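
For readers who want a feel for the kind of summary query involved, here is a minimal sketch of a monthly-mean aggregation for a single station. It is not the query from the notebook. The table name (observations), the column names (station_id, year, month, day, temp), the index name and the sample rows are all hypothetical, and it uses Python's built-in sqlite3 module as a stand-in for the MySQL server so that it runs anywhere without setup.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with a HYPOTHETICAL schema.

    The real database has three tables; their names and columns are not
    reproduced here. This single table is an assumption for illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE observations ("
        "station_id TEXT, year INTEGER, month INTEGER, day INTEGER, temp REAL)"
    )
    # Composite index so lookups by station and date avoid a full table scan.
    conn.execute(
        "CREATE INDEX idx_station_date "
        "ON observations (station_id, year, month, day)"
    )
    # Made-up sample rows, not real measurements.
    rows = [
        ("DEMO01", 2014, 1, 1, 30.0),
        ("DEMO01", 2014, 1, 2, 34.0),
        ("DEMO01", 2014, 2, 1, 40.0),
        ("DEMO02", 2014, 1, 1, 70.0),
    ]
    conn.executemany("INSERT INTO observations VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def monthly_mean_temps(conn, station_id):
    """Return (year, month, mean_temp, n_days) tuples for one station."""
    query = """
        SELECT year, month, AVG(temp) AS mean_temp, COUNT(*) AS n_days
        FROM observations
        WHERE station_id = ?
        GROUP BY year, month
        ORDER BY year, month
    """
    return conn.execute(query, (station_id,)).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for row in monthly_mean_temps(conn, "DEMO01"):
        print(row)
```

The same SQL should carry over to MySQL largely unchanged. With MySQLdb, the main difference is that query parameters use %s placeholders rather than the ? style that sqlite3 uses.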