German Temperature Data

Temperatures

Personally, I thrive in cooler weather. In fact, any temperature above 20°C is summer for me, and anything above 25°C is hot. Most people don’t understand me, but I prefer cold over warm. Fact: if it’s cold, you can put on another piece of clothing until you’re feeling warm again. If, however, it’s warm… Well, there’s only a limited amount of clothes you can take off until you’re accused of crimes of moral turpitude. Plus, if it’s above 25°C, even going full monty won’t get you to cool off.

I recently had an discussion if this year’s March was warmer than the average. Generally speaking, with global average temperatures steadily increasing since the 1970s, I think it’s safe to say that for a given random month, you’re better off stating that it was warmer than the average compared to colder. Which is exactly what I did. Turns out, I was wrong… But I learnt how to change the colour of a ggplot() line mid-way between two points, based on an arbitrary cutoff value!

Data

As a data scientist, I not only blurt out statements, but I am also in the (judging the current political discussions, unique) opportunity to gather facts to check my ejaculations (teehee, he said “ejaculate”). The official source for German weather data is the Deutscher Wetterdienst, the DWD (translated as the German Meteorological Service). Since it runs the national climate archive, it also provides data on temperatures, amongst many other data like precipitation and historical vine ripeness data (now, there’s another blogpost just waiting to be written, I’d say!).

We’re only interested in the monthly regional air temperature averages. Let’s grab the monthly data for all German regions from the DWD’s FTP:

f.get.dwd.data <- function() {
  # load libraries needed in function
  library(readr)
  library(tidyr)
  
  # generate empty list to fill with data in the loop
  raw.dta <- vector("list", length = 12)
  
  for(i in seq_len(12)){
    # grab all 12 months of data from DWD CDC
    url <- paste0(
      "ftp://ftp-cdc.dwd.de/pub/CDC/",
      "regional_averages_DE/monthly/air_temperature_mean/",
      "regional_averages_tm_", sprintf("%02i", i), ".txt"
    )
    
    # check max length once
    if(i == 1){
      lngth <- length(read_lines(url))
    }

    f.get.length <- function(lngth) {
      # if i < current month, return Januar length
      # else, length is one month less than Januar legnth
      if(i < as.integer(format(Sys.Date(), format = "%m"))){
        # -1 for skip, -1 for last line removed
        return(lngth-2)
      } else {
        return(lngth-3)
      }
    }
    
    # read data with readr, since this automatically parses the urls
    txt.dta <- read_lines(url, skip = 1, n_max = f.get.length(lngth))
    
    # add data to list
    # do NOT use readr, since this somehow completely messes up the data
    raw.dta[[i]] <- read.csv2(text = txt.dta, stringsAsFactors = FALSE, dec = ".") %>% 
      # last row is fubar'ed, remove
      select(-X)
  }
  
  # bind all into one data frame
  dta <- bind_rows(raw.dta) %>% 
    # clean data set
    gather(region, temperature, -Jahr, -Monat) %>% 
    # rename year and month to English
    select(year = Jahr, month = Monat, everything()) %>% 
    arrange(region, year, month)
  
  return(dta)
}

Pulling the 12 files (one for each month) is pretty straightforward with a for loop. I found read_csv() from library(readr) to choke on non-locale decimals and negative numbers. A lesson learnt: I spent half a day trying to figure out what went wrong with my download script. To “fix” the problem, I read the file as a text file with read_lines() (since readr functions nicely default to downloading files if it’s a web resource), and then feeding this text into standard R’s read.csv() function, specifying the non-standard decimal. I skip the first line (since it’s not data, but an explanation), and the last row (since it’s garbage). Skipping the last row needs some checking of the length of the file, done in f.get.length().

Once you’ve set up the function, it’s easy to just grab the data and save it for later use:

if(file.exists("dwd.Rdata")){
  load("dwd.Rdata")
} else {
  dta <- f.get.dwd.data()
  save(dta, file = "dwd.Rdata")
}

All in all, relatively easy to get the data (modulo the readr problem).

Graphing

Right, let’s have a look at the data. I live in North Rhine-Westphalia currently, so let’s just look at that data, for March only and generate a rolling 10-year average.

nrw <- dta %>%
  filter(region == "Nordrhein.Westfalen") %>% 
  filter(month == 3) %>% 
  mutate(
    avg.10.years = rollmeanr(x = temperature,
                             k = 10,
                             fill = NA),
    check = temperature >= avg.10.years
  )

nrw %>% ggplot(aes(x=year, y=temperature))+
  geom_line(aes(group = region, color = temperature > avg.10.years))+
  geom_line(aes(x=year, y=avg.10.years))

DWD temperature data, messy

The usual Brownian motion around an average. The average which has been increasing since around the 1970s, matching the world-wise NASA data.

First problem, and the most painful one: I was wrong. March 2016 was in fact slightly cooler than the 10-year average.

Second problem: grouping the data by a cutoff point (the 10-year average temperature) works, but since it’s point-wise data, ggplot colours the line continuously from one dot to the next, even when it crosses the average temperature line and should change colour exactly at that point.

Is there a way for ggplot to handle this? With some quick googleing, I found the answer on stackoverflow: not directly, but you can generate a new “high resolution” dataset which would allow ggplot to set the cutoff point exactly at the intersection between the two lines.

I can claim no ownership on the solution, the credit goes fully to the stackoverflow user jbaums. I took his solution and “updated” it to newer R (dplyr and pipes, specifically; the answer was written four years ago, at which time my solution setup wasn’t available).

#new data for testing
tt <- nrw
# resolution scale (0.01 = factor 100)
yres <- 0.01

tt %<>% 
  mutate(
    # rename to use in function later
    x = year,
    y = temperature,
    z = avg.10.years, 
    # generate point coordinates for line end
    x2 = lead(x),
    y2 = lead(y),
    z2 = lead(z)
  ) %>% 
  # remove last row from group
  filter(!row_number() == n()) %>% 
  # remove years in which z is NA
  # (chokes the summarising later)
  filter(!x %in% .$x[is.na(z)]) %>% 
  # get high-resolution versions of the inter-point data
  rowwise() %>% 
  do(
    # for each row, generate dataframe with the low-res data,
    # which gets automatically filled up to the high-res data rows
    data.frame(
      year = .$year,
      month = .$month,
      region = .$region,
      temperature = .$temperature,
      x = .$x,
      x2 = .$x2,
      # make a sequence (linear line) of the interpoint data
      # we will use this data length as "basis" for x.hr and z.hr, below
      y.hr = seq(.$y, .$y2, yres*sign(.$y2-.$y)),
      avg.10.years = .$z,
      z2 = .$z2
    ) %>% 
      mutate(
        # make a sequence (linear line) of the remaining interpoint data
        # take the rownumber of data needed from y.hr, above
        x.hr = seq(min(x), min(x2), length.out = nrow(.)),
        z.hr = seq(min(z), min(z2), length.out = nrow(.))
      )
  )

Let’s go through the code step by step (which is made easy since it’s piped code).

  1. start a new dataset for testing, and set the high-resolution scale factor (0.01, or factor 100 in this case)
  2. generate point coordinates for the line end point by using lead()
  3. remove last row from each group
  4. remove any x values for which z have NAs. These result from the rolling average, and will choke the seq() calls later
  5. for each row, we now have the start point and end point of each “section”: (x1, x2), (x2, x3), …
    • we now want to generate a “high resolution” version of these point, essentially filling 100 data points linearly between (x1, x2), and (x2, x3), and so on
    • to do this, we use rowwise() %>% do()
    • each row pulls the start points and the end points of each data section, and then using these as start and end points for a sequence of length yres, the high-resolution scale factor
    • all of this is done as a data.frame(), and by the magic of rowwise-do() the result is a single, large dataframe with everything inside!

With the high-res data now set up, we can graph the chart in a much nicer way:

ggplot()+
  # high-res version of the temperature data
  geom_line(data = tt, aes(x=x.hr, y=y.hr, group = region, color = y.hr > z.hr))+
  # add rolling average as line
  geom_line(data = tt, aes(x=year, y=avg.10.years, color = "10y rolling avg"))+
  guides(
    colour = guide_legend("temperature is larger than \n10-year rolling avg")
  )+
  # change line colours -- black for the line, red if above, green if below
  scale_color_manual(values=c("#000000", "#A5CF35", "#C44B4B"))+
  labs(title = "March temperatures in North Rhine-Westphalia",
       x = "", y = "temperature (March monthly average)")
# save everything, using golden ratio
ggsave(file="temperature.png", width = 30, height = 30/((1+sqrt(5))/2), units = "cm")

DWD temperature data, nice colour shifts

Ooh, it’s so pretty! ☺

Conclusion

One can clearly see the underlying trend of temperature increases. I believe I’ll have to find a job in Hammerfest in the coming years…

What could have been answered with a calculator and 2 minutes on the website turned out to be two day project to pull and clean data, fix and report bugs, and learn even more about ggplot(). Time well spent, I’d say!

Code and data for this analysis is availabe on github, as always.

comments powered by Disqus