Degree Days

Weather Data for Energy Saving

Degree Days.net Baseline Regression Tool

Regression is at the core of most analysis of heating/cooling energy consumption. A baseline regression describes energy consumption over a chosen baseline period, and is typically used to compare later energy consumption against baseline levels (e.g. to track ongoing performance or prove savings from changes made after the baseline period).

The regression tool tests thousands of regressions against your energy-usage data to help you choose the best regression (with the best HDD/CDD base temperatures) to represent baseline energy consumption.

  1. You choose a weather station and copy/paste in your energy-usage data from a spreadsheet.
  2. Degree Days.net generates HDD and CDD in a wide range of base temperatures and uses them to test thousands of regressions against your energy-usage data.
  3. You download a spreadsheet of the regressions that give the best statistical fit, together with a range of statistics to help you assess their quality.
  4. You choose the best regression (usually the first listed) and use it as the baseline for future analysis.
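The idea behind steps 2 and 4 can be pictured with a small sketch. This is not the regression tool's actual algorithm: it assumes a simplified HDD calculation from daily mean temperatures, equal-length weekly periods (so a plain E = b + h*HDD fit is enough), and a ranking by R-squared alone. All function names and figures are hypothetical.

```python
def hdd(daily_mean_temps, base):
    """Simplified heating degree days: sum of (base - temp) over days colder than base."""
    return sum(max(0.0, base - t) for t in daily_mean_temps)

def fit_line(xs, ys):
    """Ordinary least squares for y = b + h*x; returns (b, h, r_squared)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    h = sxy / sxx
    b = mean_y - h * mean_x
    ss_res = sum((y - (b + h * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return b, h, 1 - ss_res / ss_tot

def rank_base_temperatures(periods, energy, bases):
    """Fit one regression per candidate base temperature; best R-squared first."""
    results = []
    for base in bases:
        xs = [hdd(temps, base) for temps in periods]
        if len(set(xs)) < 2:
            continue  # no variation in HDD, so nothing to regress against
        b, h, r2 = fit_line(xs, energy)
        results.append((r2, base, b, h))
    return sorted(results, reverse=True)

# Hypothetical weekly data: daily mean temperatures (°C) and metered energy (kWh).
periods = [
    [2, 4, 6, 8, 10, 12, 14],
    [10, 12, 14, 16, 18, 20, 22],
    [14, 15, 16, 17, 18, 19, 20],
    [0, 5, 10, 15, 20, 25, 30],
    [12, 13, 14, 15, 16, 17, 18],
]
energy = [168, 88, 72, 130, 82]

ranked = rank_base_temperatures(periods, energy, range(10, 21))
best_r2, best_base, best_b, best_h = ranked[0]
print(f"best base: {best_base}°C, E = {best_b:.1f} + {best_h:.2f}*HDD, R² = {best_r2:.4f}")
```

The sample energy figures were built from a 15°C base, so that base comes out on top. Real data is noisier, which is why the tool returns a shortlist and extra statistics rather than a single answer.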

To use the regression tool: go to the Degree Days.net web tool and select "Regression" as the "Data type". Or continue reading this page to find out more about the regression tool and how best to use it.

Why use the regression tool instead of Excel?

A simple regression of energy consumption against heating degree days or cooling degree days is easy enough to do in Excel. But the Degree Days.net regression tool offers a lot more:

Copy/pasting in your energy data

The regression tool takes your energy-usage data and runs regressions against it. You will presumably have your energy data in a spreadsheet – you just have to check (and maybe modify) the format and then copy/paste the relevant data into the regression tool.

Step by step instructions for copy/pasting in your energy data

  1. Select the relevant data in your spreadsheet (see the 2 allowed formats below), and hit Ctrl-c to copy it into the clipboard (Command-c on Mac).
  2. Go back to the regression tool, click your mouse in the box, and hit Ctrl-v to paste (Command-v on Mac).
  3. The regression tool should show you a table of your data. Check it over to see that it interpreted everything correctly. If it didn't, edit the data in your spreadsheet and try copy/pasting again.

Don't worry if your spreadsheet contains a lot of other analysis too – you can select/copy/paste only the cells that the regression tool needs.

Getting your spreadsheet data into the right format

Your energy data can have one of two formats. The examples below show the same data specified in each of the 2 formats:

The first-day-only format is a common format for spreadsheets of energy-usage data. The specified dates must be the first day of each period of measured energy usage. The last day of each period is assumed to be the day before the first day of the next. You may need extra rows with dates to specify any gaps in the data and to help the regression tool figure out when the last period ends:

First-day-only format

If your data is regular daily, weekly, monthly (each month starting on the same day), or yearly data, you shouldn't need the final row as the regression tool should figure out the end of the last period automatically. But it's not a bad idea to include it anyway, for clarity.
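As a rough illustration of how first-day-only data maps to periods, the sketch below derives each last day as the day before the next row's first day. It assumes, for simplicity, that a row with a date but no usage figure (None here) marks a gap or the end of the data, and that the final row is always present. The function name and row layout are hypothetical, not the regression tool's own.

```python
from datetime import date, timedelta

def periods_from_first_days(rows):
    """rows: list of (first_day, usage or None), in date order.

    Returns a list of (first_day, last_day, usage) for the rows that have usage.
    """
    periods = []
    for (first, usage), (next_first, _) in zip(rows, rows[1:]):
        if usage is None:
            continue  # a date-only row marks a gap, not a measured period
        periods.append((first, next_first - timedelta(days=1), usage))
    return periods

rows = [
    (date(2024, 1, 1), 100),
    (date(2024, 2, 1), 90),
    (date(2024, 3, 1), None),  # final row: only there to close the February period
]
for first, last, usage in periods_from_first_days(rows):
    print(first.isoformat(), last.isoformat(), usage)
```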

The first-and-last-day format can be a good one to use if you have gaps in your data as you can typically add an extra column (column 2 in the example below) and insert a few dates without affecting other parts of your spreadsheet:

First-and-last-day format

With the first-and-last-day format you can specify the last day of every period (in column 2), but for the normal case (the last day of one period being the day before the first day of the next) you can just leave it blank.
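The blank-means-default rule can be sketched as follows. This is an illustration only: the row layout and function name are hypothetical, and the sketch assumes the final period always has an explicit last day.

```python
from datetime import date, timedelta

def fill_last_days(rows):
    """rows: list of (first_day, last_day or None, usage), in date order.

    A blank (None) last day becomes the day before the next row's first day.
    """
    filled = []
    for i, (first, last, usage) in enumerate(rows):
        if last is None:
            if i + 1 == len(rows):
                raise ValueError("the final period needs an explicit last day")
            last = rows[i + 1][0] - timedelta(days=1)
        filled.append((first, last, usage))
    return filled

rows = [
    (date(2024, 1, 1), None, 100),              # normal case: ends 2024-01-31
    (date(2024, 2, 1), date(2024, 2, 14), 40),  # explicit end, then a gap
    (date(2024, 3, 1), date(2024, 3, 31), 95),
]
filled = fill_last_days(rows)
```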

Date formats

Date formats are a source of much confusion for computer systems. Something like 10/11/12 is highly ambiguous as it can be interpreted as mm/dd/yy, or dd/mm/yy, or even yy/mm/dd.
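The ambiguity is easy to reproduce. The short sketch below parses the same text with three format strings using Python's standard library. It only illustrates the problem and says nothing about how the regression tool itself parses dates.

```python
from datetime import datetime

def interpretations(text):
    """Parse the same text as mm/dd/yy, dd/mm/yy, and yy/mm/dd."""
    formats = {"mm/dd/yy": "%m/%d/%y", "dd/mm/yy": "%d/%m/%y", "yy/mm/dd": "%y/%m/%d"}
    return {name: datetime.strptime(text, fmt).date().isoformat()
            for name, fmt in formats.items()}

print(interpretations("10/11/12"))
# {'mm/dd/yy': '2012-10-11', 'dd/mm/yy': '2012-11-10', 'yy/mm/dd': '2010-11-12'}
```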

We like the ISO date format yyyy-mm-dd because it is totally unambiguous. But we've programmed the regression tool to do its best to make sense of a variety of other formats as well. People from all over the world use Degree Days.net and we don't want to force them to change their spreadsheets any more than necessary before copy/pasting their data in.

So we suggest you try copy/pasting your data as it is. Our system will tell you if it can't make sense of the data. If it misinterprets your dates, you should be able to see that from the table it displays immediately after you paste your data into the box.

If it's not working correctly, try changing the format of all your dates to yyyy-mm-dd. This is easy to do in Excel: select all the date cells, right-click, select "Format Cells...", then "Custom", type yyyy-mm-dd in the "Type:" box, and click "OK". If your original date format was a common one that you would expect to work automatically, please email us so we can see if there's a way we can improve the system.

Any units are fine for the energy-usage data

Energy-usage data would typically be in kWh, but other units like Btu, therms, litres, or gallons, are fine too. The regression tool does not know or care what the units are – it just processes the usage figures as numbers and gives you regressions that use whatever units you used in your copy/pasted input data.

Extra predictors (for Degree Days.net Pro customers)

With Degree Days.net Pro you can optionally include up to two extra predictors, like occupancy or production figures, in your regression analysis. Include them in additional columns to the right of your energy-usage data.

Including extra predictors

You will need to give your extra-predictor data column titles in a specific format that lets the regression tool know what they are called and how to process them. Titles should include a name and a code, like "production c+" in the example screenshot above. The code can be any of the following:

Cumulative in this case means that the figures increase with time and so would typically be larger for longer periods (like a month) than for shorter periods (like a day). For example: total staff hours worked, total widgets produced.

Average in this case means that the figures are averaged (or normalized) in such a way that the length of the period does not affect them. For example: average number of rooms occupied (in a hotel), average widgets produced per hour.

A positive correlation means that larger extra-predictor figures are expected to lead to greater energy usage.

A negative correlation means that larger extra-predictor figures are expected to lead to lower energy usage.

A positive or negative correlation indicates that you don't know whether to expect the extra predictor to increase or decrease energy usage. It's usually best to avoid this option and instead work out exactly what effect you expect from any extra predictor you include.
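To make the cumulative/average distinction concrete, here is a minimal sketch with made-up production figures. The function name is hypothetical.

```python
def average_per_day(cumulative_total, days):
    """Turn a cumulative figure for a period into a per-day average."""
    return cumulative_total / days

# Two months with the same production rate but different lengths.
january = {"days": 31, "widgets_total": 3100}
february = {"days": 28, "widgets_total": 2800}

jan_avg = average_per_day(january["widgets_total"], january["days"])
feb_avg = average_per_day(february["widgets_total"], february["days"])
print(jan_avg, feb_avg)  # 100.0 100.0
```

The cumulative totals differ only because January is longer, while the averages are identical. That is why the regression tool needs to know which kind of figure a column holds.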

A word of caution about extra predictors

An extra predictor should only be included if it is an independent variable (i.e. not influenced significantly by another extra predictor or by degree days). R-squared will always increase (or at least never decrease) with the addition of any extra predictor (even a completely irrelevant one), but this is a quirk of statistics rather than an indication of a better model.
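The R-squared point can be checked with a small pure-Python least-squares sketch. The energy and HDD figures are invented, the "extra predictor" is just a string of unrelated digits, and the model uses a simple intercept form (E = b + h*HDD) rather than anything specific to the regression tool.

```python
def solve(a, b):
    """Solve the linear system a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def r_squared(columns, y):
    """R-squared of a least-squares fit of y on the given columns plus an intercept."""
    xs = [[1.0] * len(y)] + columns
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in xs] for ci in xs]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in xs]
    coef = solve(xtx, xty)
    fitted = [sum(k * col[i] for k, col in zip(coef, xs)) for i in range(len(y))]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical data for 12 equal-length periods.
hdd = [310, 280, 220, 150, 80, 30, 10, 15, 60, 140, 230, 300]
energy = [4120, 3790, 3150, 2480, 1710, 1260, 1050, 1130, 1540, 2330, 3240, 3980]
irrelevant = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]  # unrelated digits

r2_hdd_only = r_squared([hdd], energy)
r2_with_extra = r_squared([hdd, irrelevant], energy)
print(r2_hdd_only, r2_with_extra)
```

The second figure cannot come out lower than the first, even though the added column has nothing to do with energy use.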

In some cases it may be better to use extra-predictor data to split your energy data into multiple sets, then run each set separately through the regression tool. For example, rather than using occupancy data as an extra predictor, you may be able to use it to split your energy data into occupied and unoccupied sets, getting a separate regression model for each. This way the regression tool will have the opportunity to use different base temperatures for each.

That said, there are certainly instances where it makes perfect sense to use extra predictors in a regression model, so don't be afraid to try them out! The regression tool will always test regressions both with and without any extra-predictor data you provide, so the stats can help you decide whether to use them or not.

Day normalization

If in doubt, just choose "Weighted", as it works well in all cases. Or read on for more information.

Day normalization is important for dealing with energy-usage data that has periods of different length (such as monthly data). The regression equations below show the key difference between regressions that are day normalized and regressions that aren't:

With day normalization (weighted or unweighted):

E = b*days + h*HDD 
E = b*days + c*CDD
E = b*days + h*HDD + c*CDD

Without day normalization:

E = b + h*HDD
E = b + c*CDD
E = b + h*HDD + c*CDD

Where:

E is the energy usage over the period in question;
days is the length (in days) of the period in question;
HDD is the heating degree days over the period in question;
CDD is the cooling degree days over the period in question;
b, h, and c are regression coefficients (the regression tool calculates these).

Day-normalized regression equations require you to plug in the length (in days) of the period that you want to calculate baseline-predicted energy consumption for. In contrast, regressions that aren't day normalized only work for periods of the same length as the periods in the original baseline data – that period length is effectively built into the baseload coefficient (b) already.
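Plugging numbers into a day-normalized equation looks like this. The coefficients and meter reading are made up for illustration, and the function name is hypothetical.

```python
def baseline_energy(days, hdd=0.0, cdd=0.0, b=0.0, h=0.0, c=0.0):
    """Baseline-predicted energy: E = b*days + h*HDD + c*CDD."""
    return b * days + h * hdd + c * cdd

# Assumed baseline coefficients: 50 kWh per day of baseload, 2 kWh per HDD.
predicted = baseline_energy(days=30, hdd=200, b=50, h=2)  # 50*30 + 2*200 = 1900
actual = 1700  # hypothetical metered usage for the same 30-day period
print(predicted, predicted - actual)  # 1900 200
```

A positive difference between the baseline prediction and the metered figure suggests a saving relative to baseline levels.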

If your baseline data (what the regressions are calculated from) has periods that are all the same length (e.g. daily or weekly data), day normalization is not important. With such data, regression with weighted and unweighted day normalization will give the same results, and regression with no day normalization will give a baseload coefficient (b) that is simply the period length (in days) multiplied by the baseload coefficient given by day-normalized regression. The other coefficients (h and c) will be the same as those given by day-normalized regression.
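For weekly data, that relationship between the two baseload coefficients can be sketched as below. The coefficients are assumed for illustration.

```python
def predict_day_normalized(days, hdd, b_per_day, h):
    """E = b*days + h*HDD"""
    return b_per_day * days + h * hdd

def predict_not_normalized(hdd, b_per_period, h):
    """E = b + h*HDD, valid only for periods of the baseline length"""
    return b_per_period + h * hdd

PERIOD_DAYS = 7                        # every baseline period is one week
b_per_day, h = 50.0, 2.0               # assumed day-normalized coefficients
b_per_week = PERIOD_DAYS * b_per_day   # baseload coefficient without day normalization

weekly_hdd = [0, 35, 80, 120]
normalized = [predict_day_normalized(PERIOD_DAYS, x, b_per_day, h) for x in weekly_hdd]
plain = [predict_not_normalized(x, b_per_week, h) for x in weekly_hdd]
print(normalized == plain)  # True
```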

However, day normalization is important for regressions from data with periods of different length, as it will improve the accuracy of the coefficients. The more variation there is in the period lengths, the more difference there will be in the calculated coefficients.

In summary:

Note that monthly data typically has periods of different length (as calendar months can be 28, 29, 30, or 31 days in length), so it's definitely best to use day normalization (and preferably weighted day normalization) when running regressions against it. For consistency of your post-regression calculation processes we recommend using day normalization (and preferably weighted day normalization) for all the data you work with, whether it has different-length periods or not.

Specifying base temperatures to include in the results

The "Include in results" option lets you specify a heating base temperature and a cooling base temperature for which regressions will be included in the results along with the auto-selected ones.

As the regression tool tests thousands of base-temperature combinations automatically, your chosen base temperatures will probably be tested whether you specify them or not. But by specifying them here you can see how their regressions compare with those in the auto-selected shortlist. If the statistics are close, you might want to use them instead of the auto-selected ones.

If you don't have a good idea of the base temperature(s) you want, you can just leave them on the default values to see how much better it is to choose optimal base temperatures than to stick with historically-prescribed defaults like 15.5°C or 65°F.

Interpreting the results

After testing thousands of regressions against your energy-usage data, the Degree Days.net regression tool returns a spreadsheet with details of the regression(s) that gave the best statistical fit (the "shortlist"), and any others that were notable (e.g. for data that looks like it was from a heating-only building the regression tool will typically also return the best CDD-only regression, even though it's unlikely to make it into the shortlist).

The spreadsheet output contains the following columns of data:

Watch out for negative coefficients!

A negative coefficient (on the baseload, HDD, or CDD) is usually an indication that the regression is not a good one. A regression with one or more negative coefficients can often look good in other respects (i.e. good statistics), but it is unlikely to be justifiable in real-world terms, so is typically best ignored.

For informational purposes the regression tool will return regressions with negative coefficients if they fit better than any other regressions with the same equation (e.g. E = b*days + h*HDD, E = b*days + c*CDD, or E = b*days + h*HDD + c*CDD), but it will always list them below any regressions with only non-negative coefficients.
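If you process the downloaded results yourself, a check along these lines can flag suspect regressions. It is a sketch only: the dictionary layout is hypothetical and does not reflect the tool's spreadsheet columns or its internal ranking.

```python
def has_negative_coefficient(regression):
    """True if the baseload (b), HDD (h), or CDD (c) coefficient is below zero."""
    return any(regression.get(key, 0.0) < 0 for key in ("b", "h", "c"))

def order_results(regressions):
    """Stable sort: regressions with only non-negative coefficients come first."""
    return sorted(regressions, key=has_negative_coefficient)

regressions = [
    {"name": "HDD+CDD", "b": 42.0, "h": 1.9, "c": -0.4},
    {"name": "HDD only", "b": 40.0, "h": 2.1},
    {"name": "CDD only", "b": 95.0, "c": 0.2},
]
ordered = order_results(regressions)
print([r["name"] for r in ordered])  # ['HDD only', 'CDD only', 'HDD+CDD']
```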

Choosing the best regression

The regression tool has a sophisticated process for comparing and ranking the thousands of regressions that it tests against each set of energy-usage data. The shortlist regressions are the ones it considers to be likely candidates, and the first-listed regression is the one it thinks best. But the regression tool knows nothing about the building that your energy-usage data came from. And, although we are always looking to improve the regression tool's algorithms, it is based on statistics and probabilities so it will never be possible for it to be correct 100% of the time.

If a building has no cooling then you're unlikely to want a regression involving CDD, even if the first-listed regression involves CDD. For a building with no heating you're unlikely to want a regression involving HDD. (Although if the numbers for such surprise regressions look much better than they do for the others, it may be worth questioning your assumptions about the building and the equipment that your metered energy is feeding.)

An experienced energy professional will often have a rough idea of the likely base temperature(s) of a building, the likely baseload energy consumption (expressed in the baseload coefficient of the regression equation), and the likely split of energy usage between baseload, heating, and cooling, over the baseline period of energy-usage data provided. This knowledge can help further in choosing the best regression.

If you have good knowledge of the building, use it!

  • Favour regressions with the predictors you expect (e.g. HDD only, CDD only, HDD and CDD together).
  • Favour regressions with base temperatures that you can justify in terms of the building and its operation (our article on estimating base temperatures should help).
  • Look at the regression coefficients and the HDD total and CDD total figures to see how much energy usage each regression attributes to heating, cooling, and baseload (non-weather-dependent consumption accounted for by the baseload coefficient), over the baseline period your energy-usage data covers. Favour regressions with usage breakdowns that fit with your expectations.
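The usage breakdown in the last point is simple arithmetic on a day-normalized regression. A minimal sketch, with made-up coefficients and totals and a hypothetical function name:

```python
def usage_breakdown(total_days, hdd_total, cdd_total, b, h=0.0, c=0.0):
    """Share of baseline-period energy attributed to baseload, heating, and cooling."""
    parts = {
        "baseload": b * total_days,
        "heating": h * hdd_total,
        "cooling": c * cdd_total,
    }
    total = sum(parts.values())
    return {name: value / total for name, value in parts.items()}

# Assumed one-year baseline: 365 days, 2000 HDD and 300 CDD in total.
shares = usage_breakdown(total_days=365, hdd_total=2000, cdd_total=300, b=40, h=3, c=5)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

With these figures the regression attributes 14,600 units to baseload, 6,000 to heating, and 1,500 to cooling. If that split looks implausible for the building, it is a reason to prefer a different regression.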

Though do check the statistics before choosing any regression that isn't in the shortlist. Here are some tips on comparing regressions based on the statistics:

Bear in mind that the regression tool already aims to use the statistics as best it can when choosing and ranking the regressions it returns from the thousands it tests against each set of energy-usage data. It will always be possible to improve the algorithms, but it should be doing a pretty good job. However, a statistics-only approach can only go so far, and, with knowledge of the building, you will often find that regression 2 or below will be a better choice than the one the regression tool put first.

Help us improve the regression tool

Please email us at info@degreedays.net with any feedback about the regression tool. We'd love to hear what you like about it, what you don't like about it, and what else it could do to make it more useful to you.

About sending us data...

For quite a while after we launched the regression tool in beta in October 2015 we were particularly keen for people to send us real-world energy-usage data with which we could test and improve the regression tool's algorithms. We have now received a lot of data to work with (thank you!) and getting more is no longer a priority for us. But if you do have an interesting data set that you would like to discuss, or that highlights a good or bad aspect of the regression tool, please do send it along, together with the following information:

  1. The fuel that the energy data represents (e.g. gas or electricity).
  2. The location of the building that it came from (so we can choose one or more weather stations to test the data against).
  3. Whether the building has heating, or cooling, or both.
  4. What other fuels supply the building.
  5. Which fuel(s) supply which components of the heating/cooling system (e.g. gas heating, electric cooling).
  6. What temperature the building is heated/cooled to (these are often different).
  7. Whether it is heated/cooled 24/7 or intermittently (e.g. for office hours only).
  8. Whether it is well insulated.
  9. Whether it has any significant internal heat gains (e.g. equipment that generates a lot of heat).
  10. Whether it has any significant refrigeration or freezer loads.
  11. Anything else that you think is likely to be relevant.

Sorry about the long list. We need to know about the building and its usage to figure out whether or not the regression tool is giving useful results for any given set of data.

Thank you!

© 2008–2024 BizEE Software