We get a lot of questions along the lines of "how do I do this using degree days?" It's very common for the answers to involve *linear regression analysis*.

There are many text books and online resources that explain what linear regression analysis is... But the theory can get a little heavy going... So we wrote this short article to explain just the basics of regression analysis using energy consumption figures and degree days.

**If you want to get straight to analyzing your energy data**, you could just use the regression tool on our website. It does everything described in this article and more, and it is easier than doing it all in Excel. Just visit the Degree Days.net web tool, select "Regression" as the data type, and follow the instructions from there.

Or read on for the basic theory and some Excel-based examples:

- What are degree days?
- Getting the raw data
- Correlating energy usage with degree days (with no day normalization)
- Improved correlation method for irregular data (with unweighted day normalization)
- The best method of all - weighted day normalization
- Multiple regression for when both heating and cooling are metered together

Before diving into regression analysis, it helps to understand exactly what a degree day is... This article provides a good introduction.

To do linear regression analysis, you need to correlate energy-consumption data with degree-day data:

You might have detailed interval data from a smart meter but more likely you'll have weekly or monthly records of energy consumption that you've collected yourself, or energy bills from a utility or energy supplier.

Most buildings follow a weekly routine, which means that ** weekly energy-consumption data is typically a good option for regression analysis**. Although the occupancy of the building and the heating patterns might vary throughout the week, the patterns are usually fairly consistent from one week to the next.

Monthly data is usually OK too, but it's rarely as good as weekly data, because the days of the week don't line up with calendar months (e.g. one month might have 5 weekends, the next might have 4). Unless your building is heated and used in the same way on weekends as it is on weekdays, these calendar mismatches will cause inaccuracies in your calculations.

If you have detailed interval energy consumption data (typically readings taken automatically every 60 minutes or less), you can use our Energy Lens software to turn it into daily, weekly, or monthly kWh data.

You might not have much choice about what energy-consumption data you use for your correlation. But you should try to get data for at least a few periods of measured energy consumption. If you've got daily, weekly, or monthly data, try to cover at least one full heating or cooling season. If you've got annual data, try to cover at least a few years of consumption.

If you're using meter readings provided by your utility or energy supplier, make sure not to use any estimated meter readings. Estimated readings are no use at all for this analysis!

If you're investigating heating energy consumption you'll want heating degree days; if you're investigating cooling energy consumption you'll want cooling degree days.

You will need one degree-day figure for each period of measured energy consumption. If your periods of measured energy consumption are irregular, you'll need to get daily degree days and sum them together to make a total for each period.

Our Degree Days.net tool enables you to access data in a variety of timescales, including daily data (which you can sum into figures matching any period spanning multiple days).

Above we explained how to *get* the two sets of data (energy consumption and degree days). Next you need to *correlate* these two sets of data...

The method that follows is ideal if your periods of measured energy consumption are all the same length (like weeks, which are all 7 days long)... But there's a slightly more complicated method that we'll introduce shortly that works slightly better for monthly data and that is usually necessary for regression analysis of irregular periods of consumption.

Our explanation of the improved method is based on this one, so please do go through this method first.

You might want to start by making 3 columns in Excel (or whatever spreadsheet software package you have):

- Start of period (you don't need this for the correlation, but it'll probably be useful for keeping track of the data).
- Degree days (either HDD or CDD, depending on whether you're investigating heating or cooling energy consumption).
- kWh (or BTU, or litres of oil, or whatever units your records of energy consumption have).

For an example, see the screenshot to the right.

You can then use the second and third columns of data to plot an X-Y scatter chart of HDD (or CDD) against energy consumption.

Once you've made the basic scatter chart, there are a few important extras that you'll probably want: a trend line, an equation, and an R^{2} value. To get these using Excel:

- Right-click one of the data points and select "Add Trendline...".
- For the "Type" select "Linear" (we're doing
*linear*regression analysis). - Check the boxes to "Display equation on chart" and "Display R-squared value on chart". (In Excel 2003 and below you'll need to click the "Options" tab to find these checkboxes.)

You should end up with something like the chart below:

- The "
*y*" corresponds to the kWh (or BTU, or whatever your units of energy consumption are). - The "
*x*" corresponds to the degree days (HDD or CDD, whichever you used). - The figure that multiplies the
*x*(3.317 in the example chart above) represents the**gradient**of the trend line. - The constant at the end (53.505 in the example chart above) is the
**intercept**. It represents the point at which the trend line crosses the y axis. In theory this should represent the "*baseload energy consumption*" (see here for more on this). Generally you would expect this to be zero or positive (more on this later).

**Most importantly** the equation enables you to estimate kWh from degree days. By plugging a known HDD or CDD figure into the equation you can calculate the *predicted* energy consumption for the period that the HDD/CDD covered. You can then compare the *predicted* energy consumption with the *actual* energy consumption for that period. You would typically do this to see whether the energy efficiency has got better or worse than it was in the period that you did the original regression analysis for.

The R^{2} value is basically a measure of how good the correlation is. The closer the R^{2} value is to 1, the better the correlation. A good correlation between degree days and energy consumption indicates that the methodology is sound (the main pitfalls in degree-day analysis have been avoided or corrected for), and that the heating/cooling system is working well (the "control" of the system is good). In other words, the higher the R^{2}, the better.

Generally speaking, an R^{2} of 0.75 indicates an reasonable correlation between energy consumption and degree days. 0.9 or above is very good. An R^{2} much below 0.7 or so is likely an indication that the heating control is either very poor, or that the analysis methodology needs to be improved (e.g. wrong base temperature, irregular building occupancy that hasn't been corrected for, heating/cooling metered together with other energy consumption that varies considerably throughout the year).

The example chart above shows a pretty good correlation. In this instance you don't need the R^{2} value to see that - it's clear from looking at the chart... But R^{2} is useful for assessing the strength of a correlation objectively.

As explained in this article, the base temperature of the degree days makes a big difference to how well they correlate with the energy consumption of any particular building.

The optimal base temperature varies from building to building. It's difficult to estimate the correct base temperature accurately for any particular building using logic alone, so it can be helpful to make a rough estimate and then try correlating kWh with degree days calculated to various base temperatures around that point. R^{2} gives a way to compare the strength of the different correlations.

In theory, the base temperature that produces the highest R^{2} should be the optimal base temperature of the building. However, it doesn't always work out quite so perfectly, because of the other factors that make degree-day-based analysis less than perfect. Nonetheless, testing various base temperatures can give you a useful indication. It shouldn't replace your intuition of what the base temperature *should* be, approximately, but it can help you to decide on what exact number to use. Generally speaking, the better your correlations (the higher your R^{2} values), the more faith you can reasonably place in the numbers.

The 3 relevant Excel functions are:

**SLOPE**: this gives you the gradient of the trend line.**INTERCEPT**: this gives you the intercept of the trend line - in this case it should represent the baseload energy consumption.**RSQ**: this gives you the R^{2}value.

What's great about these functions is that you can quickly apply them to multiple correlations, using degree days with a range of base temperatures. Its a question of copying functions across a spreadsheet rather than creating lots of individual charts.

The screenshot below shows a spreadsheet containing one set of energy-consumption data, multiple sets of degree days (all with different base temperatures), and gradient, intercept, and R^{2} values for each energy/degree-days correlation:

The above spreadsheet can be a little overwhelming on first glance, but it's clearer when broken down into steps:

- Make a rough estimate of the building's base temperature (you can find some basic information and guidance here).
- Select your estimated base temperature on Degree Days.net, check the "Include base temperatures nearby" box, and generate and download the degree days. You'll notice that the data has been calculated to a range of base temperatures either side of your estimate. It'll look a lot like the spreadsheet above, but without the blue cells, which are the ones we added in ourselves (of course you don't need to make them blue unless you want to).
- Add your energy-consumption figures to the right of the last column of degree days. (The energy-consumption periods need to match the periods that you generated the degree days for.)
- Under the last column of degree days, use the SLOPE formula, selecting your energy-consumption values as the
*known_y's*(these are the values in column "O" in the screenshot above). - After selecting the
*known_y's*, hit*F4*to make Excel insert $ symbols in front of the row and column references. The $ symbols "fix" the referenced cells so that you can copy the formula without the referenced cells changing. (Strictly speaking you only need the $ symbols in front of the column references, to fix the column on the energy data, but in this instance it doesn't hurt to have them in front of the row references as well.) - Next, for the
*known_x's*part of the formula, select the degree days from the last column of degree-day data (column "N" in the screenshot above). You*don't*want any $ symbols here. - You should have a formula something like the one you can see at the top of the image above.
- Now, take the cell that you entered the formula into, and copy and paste it across the row, filling the cells under each column of degree-day data.
- Do similarly in the two rows underneath, using the INTERCEPT and RSQ functions instead of SLOPE. You need to specify the
*known_y's*and*known_x's*in the same way for all three functions. - You should now be able to see the gradient, intercept (baseload), and R
^{2}value for each base temperature's correlation. You can use these values to help you decide upon the base temperature to use for the building.

If you're new to using $ symbols and functions/formulas in Excel, it's likely that you found some of the steps above a little confusing. It's well worth taking the time to learn how those features of Excel work - they make it possible to do all sorts of things in seconds that would take minutes or hours otherwise.

The intercept (baseload) should be roughly zero or positive. For base temperatures that are too high, you may see a negative intercept (so you can take that as an indication that you should be looking towards lower base temperatures).

Things get more messy when the heating or cooling consumption that you are analyzing is metered together with other energy uses. If those other energy uses are significant, you should expect a significant baseload energy consumption (positive intercept). If you have a good idea of how much energy those other energy uses consume, you can compare your predicted figure with the intercept values to look for the base temperatures with intercepts that fit your expectations.

Do bear in mind that "baseload" is a fuzzy concept, and it often varies throughout the year (meaning it's not really a "baseload" at all). Analysis is less precise when heating and cooling aren't metered separately from each other and from everything else. Do what you can with the figures you have, but don't be surprised if the numbers don't line up as neatly as you hope.

Also take a look at the R^{2} values. In theory, the base temperature of the building should be the base temperature with the highest R^{2} value. That's in theory though... In reality the various inaccuracies in degree-day-based analysis tend to muddy things up, and can cause misleading figures. Use the R^{2} values as an indication rather than an absolute, especially if your correlations aren't strong (i.e. low R^{2} values across the board).

If the numbers indicate that the optimal base temperature might be higher or lower than the range that you have tested, you should probably download more degree days with higher/lower base temperatures so that you can include them in your analysis.

Correlating energy usage with degree days, as described above, works well when all the energy-consumption records cover identical periods of time. It's ideal for linear regression analysis of daily or weekly data.

However, the above method doesn't work properly for irregular periods of consumption, like those gathered from records of oil deliveries...

The problem is with the baseload energy consumption. The method above assumes that the baseload is a constant number, but this assumption only makes sense if the periods of consumption are all the same length.

When records of energy consumption cover periods of various lengths, the baseload energy consumption depends on the lengths in question. If the baseload is 20 kWh for a 1 week period, it will be 40 kWh for a 2 week period, and 60 kWh for a 3 week period.

Baseload energy consumption can't be expressed as a constant unless the length of the period is also a constant.

(If the above statement doesn't make sense, you might be confusing kWh with kW... Many people do! If you're in any doubt, take a look at our article on kW and kWh - it explains both units in detail.)

Because different months have different lengths, using a constant figure for baseload kWh causes slight inaccuracies in correlations of monthly energy consumption with monthly degree days. The more irregular your consumption records, the greater these inaccuracies become.

Fortunately there's another approach that works just as well for irregular data as it does for regular data:

Instead of correlating energy consumption with degree days, **correlate energy consumption per day with degree days per day**.

To explain this, let's consider the example data that we used previously... Previously we simply correlated the kWh with the HDD, but the improved method involves a correlation of the kWh *per day* with the HDD *per day*... Here's how we might arrange this data in a spreadsheet:

Let's explain each column in turn:

- The start date of each period, as before. We've added on an extra date at the bottom because it makes it easier for us to calculate the number of days in the last period (see the explanation for column D).
- The HDD of each period, as before.
- The kWh of each period, as before.
- The number of days in each period. This is easy to calculate in Excel... To calculate the number of days in the period in row 2, take the start date of the period in row 3, and subtract the start date of the period in row 2. The screenshot above shows the formula at the top so you can see what's going on. Once you've calculated the first value, you can copy the formula right down the column to calculate the number of days in each period.
- The average HDD-per-day for each period. Calculate this by dividing the HDD by the number of days.
- The average kWh-per-day for each period. Calculate this by dividing the kWh by the number of days.

Once we have the figures we can create a scatter chart, just like before, except with HDD-per-day and kWh-per-day instead of HDD and kWh:

Like before, we can also add a trendline and the equation of that trendline (see the chart above).

- The "
*y*" corresponds to the average kWh-per-day (or average BTU per day, or average whatever-your-units-of-energy-consumption-are per day). As before, this includes both weather-dependent and non-weather-dependent (baseload) energy consumption. - The "
*x*" corresponds to the average degree days per day. - The figure that multiplies the
*x*(3.3333 in the example chart above) represents the gradient of the trend line. - The constant at the end (1.6813 in the example chart above) represents the point at which the trend line crosses the y axis. In theory this should now be equivalent to the
*average baseload energy consumption per day*.

This equation is very similar to the one described earlier, and you can apply it similarly. Just remember that x, y, and the constant, are *per-day* figures. Once you've calculated the energy consumption per day from an HDD-per-day figure, you can of course multiply it by the number of days in the period to work out the predicted kWh over the whole of the period.

We have already explained the process for investigating the effect of base temperature on the regression analysis (see here). We used kWh figures and HDD figures above, but it's easy to apply the same approach using kWh-per-day figures and HDD-per-day figures. We need to insert a few additional steps between steps 3 and 4 above:

- 3b) Calculate the number of days in each of our recorded periods (each month in our example).
- 3c) Translate the HDD figures into HDD-per-day figures.
- 3d) Translate the kWh figures into kWh-per-day figures.
- 3e) Follow the remaining steps described above, but using the calculated HDD-per-day figures and kWh-per-day figures instead of the original HDD and kWh figures.

The screenshot below shows one way in which you could organize the data. It might be a little neater to put the HHD/day and kWh/day figures to the right of the original figures, but putting them below makes it easier to fit them all in a screenshot:

In Excel, the trick to calculating HHD/day figures across all periods and base temperatures is to:

- Calculate the HHD/day for the first period and the lowest base temperature, using carefully-placed $ references so that the formula will copy well across multiple columns (base temperatures) and rows (periods).
- Copy that cell across all base temperatures and periods.

In the example above, cell B22 contained the formula "= B8 / $P8". This meant that the "Days" column P was fixed in the formula, but everything else in the formula was relative. So copying the formula across the base temperatures and down the periods worked as desired.

Provided you understood the previous examples, you should hopefully find this slightly-modified method pretty straightforward. But please let us know if anything is unclear. We appreciate that these instructions might be a little intimidating to anyone unfamiliar with Excel formulas and so on, but we're trying to make them as accessible as possible!

The unweighted-day-normalization method described above is better than using no day normalization at all, but it is still not ideal if the periods have different lengths. The trouble is that all the periods of data going into the regression have the same influence on the resulting regression formula. Whereas ideally the longer periods would have a greater influence on the regression formula than the shorter periods.

Weighted day normalization is a method that weights the periods according to their length (in days) so that the longer periods have more influence on the regression calculation than the shorter periods. If all your periods are exactly the same length (e.g. weekly data with periods that are all 7 days long) this makes no difference at all and the results would be the same as with unweighted day normalization. But weighted day normalization improves the regression when the periods have different lengths. The greater the variation, the greater the improvement of using this method.

This is the method recommended by ASHRAE and others, and it is worth using it if you can. Unfortunately there is no easy way to do it in Excel, but it is possible to do it by duplicating the kWh/day and HDD/day values in each period according to the number of days in the period (so you would turn a 31-day period into 31 duplicated kWh/day and HDD/day values), and then running a regression on all the duplicated values across all periods (you can use the SLOPE, INTERCEPT, and RSQ functions for this, as before).

There is, however, a much easier way, as **our online regression tool can do weighted day normalization automatically**. Just visit the Degree Days.net web tool, select "Regression" as the data type, and follow the instructions from there. Weighted day normalization is the default day-normalization option as it works well whether the periods are all the same length or not.

The methods described above work for buildings with heating (for which you would use heating degree days), and for buildings with cooling (for which you would use cooling degree days), but they won't work for buildings with both heating and cooling, unless the heating and cooling are metered separately.

When your energy-consumption data contains both heating energy consumption and cooling energy consumption, the best solution is usually *multiple regression*. This will give you a formula like:

y = a*HDD + b*CDD + c*days Where: y is the energy usage over the period in question; HDD is the heating degree days over the period in question; CDD is the cooling degree days over the period in question; days is the length (in days) of the period in question; a, b, and c, are regression coefficients (the regression tool can calculate these for you)

It is possible to do multiple regression in Excel, using the *Regression* option provided by the *Analysis ToolPak*. The trouble is that you have to do this one regression at a time through the point-and-click UI - there is no way to do it with formulas - so it's not really practical to test different base-temperature combinations to find the optimal base temperatures. (Note we say plural "base temperatures" because usually the HDD base temperature and the CDD base temperature are different, with the CDD base temperature typically being higher.)

Fortunately **our online regression tool does multiple regression automatically**, and it also tests thousands of different HDD/CDD base-temperature combinations to find the ones with the best statistical fit. Just visit the Degree Days.net web tool, select "Regression" as the data type, and follow the instructions from there.