Regression is at the core of most analysis of heating/cooling energy consumption. A baseline regression describes energy consumption over a chosen baseline period, and is typically used to compare later energy consumption against baseline levels (e.g. to track ongoing performance or prove savings from changes made after the baseline period).

The regression tool tests thousands of regressions against your energy-usage data to help you choose the best regression (with the best HDD/CDD base temperatures) to represent baseline energy consumption.

- You choose a weather station and copy/paste in your energy-usage data from a spreadsheet.
- Degree Days.net generates HDD and CDD in a wide range of base temperatures and uses them to test thousands of regressions against your energy-usage data.
- You download a spreadsheet of the regressions that give the best statistical fit, together with a range of statistics to help you assess their quality.
- You choose the best regression (usually the first listed) and use it as the baseline for future analysis.

To use the regression tool, go to the Degree Days.net web tool and select "Regression" as the "Data type". Or continue reading this page to find out more about the regression tool and how best to use it.

- Why use the regression tool instead of Excel?
- Copy/pasting in your energy-usage data
- Day normalization
- Specifying base temperatures to include in the results
- Interpreting the results
- Help us improve the regression tool

A simple regression of energy consumption against heating degree days or cooling degree days is easy enough to do in Excel. But the Degree Days.net regression tool offers a lot more:

- It does
**weighted regression, recommended by ASHRAE**and others for improved accuracy, but difficult to do in Excel. - It does
**multiple regression against HDD and CDD together**as well as single regression against HDD and CDD individually. This is important for buildings with both heating and cooling. - It
**tests thousands of regressions with different HDD and CDD base temperatures**to find the ones that give the best statistical fit. For years we've recommended a similar process for single regression (either HDD or CDD alone), but we do not know of a realistic way to do this for multiple regression (HDD and CDD together) in Excel. - It automatically generates
**powerful extra statistics**that can't all be easily calculated in Excel. For example, it generates the*CVRMSE*statistic that ASHRAE recommends, and, as well as*R-squared*, it generates the more sophisticated*cross-validated R-squared*which is better for assessing the predictive power of an energy model.

The regression tool takes your energy-usage data and runs regressions against it. You will presumably have your energy data in a spreadsheet - you just have to check (and maybe modify) the format and then copy/paste the relevant data into the regression tool.

- Select the relevant data in your spreadsheet (see the 2 allowed formats below), and hit
*Ctrl-c*to copy it into the clipboard (*Command-c*on Mac). - Go back to the regression tool, click your mouse in the box, and hit
*Ctrl-v*to paste (*Command-v*on Mac). - The regression tool should show you a table of your data... Check it over to see that it interpreted everything correctly. If it didn't, edit the data in your spreadsheet and try copy/pasting again.

Don't worry if your spreadsheet contains a lot of other analysis too - you can select/copy/paste only the cells that the regression tool needs.

Your energy data can have one of two formats. The examples below show the same data specified in each of the 2 formats:

The **first-day-only format** is a common format for spreadsheets of energy-usage data. The specified dates must be the first day of each period of measured energy usage. The last day of each period is assumed to be the day before the first day of the next. You may need extra rows with dates to specify any gaps in the data and to help the regression tool figure out when the last period ends:

If your data is regular daily, weekly, monthly (each month starting on the same day), or yearly data, you shouldn't need the final row as the regression tool should figure out the end of the last period automatically. But it's not a bad idea to include it anyway, for clarity.

The **first-and-last-day format** can be a good one to use if you have gaps in your data as you can typically add an extra column (column 2 in the example below) and insert a few dates without affecting other parts of your spreadsheet:

With the *first-and-last-day format* you can specify the last day of every period (in column 2), but for the normal case (the last day of one period being the day before the first day of the next) you can just leave it blank.

Date formats are a source of much confusion for computer systems. Something like 10/11/12 is highly ambiguous as it can be interpreted as *mm/dd/yy*, or *dd/mm/yy*, or even *yy/mm/dd*.

We like the ISO date format *yyyy-mm-dd* because it is totally unambiguous. But we've programmed the regression tool to do its best to make sense of a variety of other formats as well. People from all over the world use Degree Days.net and we don't want to force them to change their spreadsheets any more than necessary before copy/pasting their data in.

So we suggest you try copy/pasting your data as it is. Our system will say if it can't make sense of it, and, if it interprets your dates wrong, you should be able to see from the table it displays immediately after you paste your data into the box.

If it's not working correctly, try changing the format of all your dates to *yyyy-mm-dd*. This is easy to do in Excel: select all the date cells, right-click, select "Format Cells...", then "Custom", type *yyyy-mm-dd* in the "Type:" box, and click "OK". If your original date format was a common one that you would expect to work automatically, please email us so we can see if there's a way we can improve the system.

If in doubt, just choose "Weighted", as it works well in all cases. Or read on for more information.

Day normalization is important for dealing with energy-usage data that has periods of different length (such as monthly data). The formulas below show the key difference between regressions that are day normalized and those that aren't:

With day normalization (weighted or unweighted): y = a*HDD + c*days y = b*CDD + c*days y = a*HDD + b*CDD + c*days Without day normalization: y = a*HDD + c y = b*CDD + c y = a*HDD + b*CDD + c Where: y is the energy usage over the period in question; HDD is the heating degree days over the period in question; CDD is the cooling degree days over the period in question; days is the length (in days) of the period in question; a, b, and c, are regression coefficients (the regression tool calculates these)

Day-normalized regression formulas require you to plug in the length (in days) of the period that you want to calculate baseline-predicted energy consumption for. In contrast, regressions that aren't day normalized only work for periods of the same length as the periods in the original baseline data - that period length is effectively built into the constant coefficient (*c*) already.

If your baseline data (what the regressions are calculated from) has periods that are all the same length (e.g. daily or weekly data), day normalization is not important. With such data, regression with weighted and unweighted day normalization will give the same results, and regression with no day normalization will give a constant coefficient (*c*) that is simply the period length (in days) multiplied by the constant coefficient given by day-normalized regression. The other coefficients (*a* and *b*) will be the same as those given by day-normalized regression.

However, day normalization is important for regressions from data with periods of different length, as it will improve the accuracy of the coefficients. The more variation there is in the period lengths, the more difference there will be in the calculated coefficients.

In summary:

**Weighted**day normalization is the recommended option. It gives the best results for data with periods of different length, and the weighting makes no difference if the periods are all the same length. Longer periods effectively have a greater influence on the regression coefficients than shorter periods. The only real downside of weighted day-normalization is that it is very difficult to replicate in Excel. But, with our regression tool to run the regressions for you, that shouldn't be a problem. You don't need to actually run the regressions in Excel to be able to use the regression formulas that the regression tool gives you in Excel.**Unweighted**day normalization - for data with periods of different length this is typically better than using no day-normalization at all, but it's not as good as the weighted option. It is, however, easy to replicate in Excel by regressing energy usage per day against degree days per day.**None**- no day normalization at all. This works fine for data with periods that are all the same length, but not so well for data with periods of different length. We've included it as an option because, despite its inaccuracies, it is widely used, and it is easy to replicate in Excel (just regress energy usage against degree days).

Note that monthly data typically has periods of different length (as calendar months can be 28, 29, 30, or 31 days in length), so it's definitely best to use day normalization (and preferably weighted day normalization) when running regressions against it. For consistency of your post-regression calculation processes we recommend using day normalization (and preferably weighted day normalization) for all the data you work with, whether it has different-length periods or not.

The **"Include in results"** option lets you specify a heating base temperature and a cooling base temperature for which regressions will be included in the results along with the auto-selected ones.

As the regression tool tests thousands of base-temperature combinations automatically, your chosen base temperatures will probably be tested whether you specify them or not. But by specifying them here you can see how their regressions compare with those in the auto-selected shortlist. If the statistics are close, you might want to use them instead of the auto-selected ones.

If you don't have a good idea of the base temperature(s) you want, you can just leave them on the default values to see how much better it is to choose optimal base temperatures than to stick with historically-prescribed defaults like 15.5°C or 65°F.

After testing thousands of regressions against your energy-usage data, the Degree Days.net regression tool returns a spreadsheet with details of the regression(s) that gave the best statistical fit (the **"shortlist"**), and any others that were notable (e.g. for data that looks like it was from a heating-only building the regression tool will typically also return the best CDD-only regression, even though it's unlikely to make it into the *shortlist*).

The spreadsheet output contains the following columns of data:

**Regression**- the ranking of each regression. The system aims to list the regressions in best-to-worst order, and it labels the likely candidates with*"shortlist"*. Typically you'll want to use the first-listed regression from the*shortlist*. But, with knowledge of the building, you may decide that another regression is a better choice. For example, if you know that the building has heating but no cooling, you may want to choose a regression with HDD but no CDD, even if it's not the first listed, and occasionally even if it's not in the*shortlist*. Though do watch out for regressions with negative coefficients (see below).**Formula**- the regression formula into which you would plug the relevant coefficients from the other columns. This enables you to quickly see which parameters are relevant for each regression listed, and it highlights the way in which you can use each regression to calculate baseline-predicted energy usage over any period for which you have degree days in the base temperature(s) specified by the regression.**Station ID**- the ID of the weather station that was used to generate the degree days that your energy-usage data was regressed against.**Readings**- (a.k.a.*sample size*or*sample count*) the number of energy-usage-data readings that were used for the regression. If this is less than the number of readings you copy/pasted in it's because the weather station you selected did not have enough recorded temperature data to generate degree days matching all of your energy-usage periods.**First day**- the first day of the first energy-usage period that was used for the regression. You can check this against the first day of your original energy-usage data - if they're not the same you might want to try again with another weather station, as presumably the weather station you selected did not have enough recorded temperature data to generate degree days to match all your energy-usage data.**Last day**- the last day of the last energy-usage period that was used for the regression.**Total days**- the number of days covered by the energy-usage data that was used for the regression. If there were no gaps in the usage data this will simply be the number of days between the*first day*and the*last day*.**Gap days**- the number of days between the*first day*and the*last day*for which there was no energy-usage data. Degree days from Degree Days.net are always continuous (no gaps), so, if this number is greater than zero, it will be because of gaps in your original energy-usage data rather than gaps in the degree days generated to match it.**DD % est**- the "% estimated" figure associated with the degree days that were used for the regression - a number between 0 and 100 indicating the extent to which the degree days were based on estimated temperature data used to fill in missing or erroneous temperature readings from the weather station you selected.**HDD base (C or F)**- the heating base temperature of the regression. To use the regression formula you'd want to download HDD in this base temperature. This will be blank if the regression did not include HDD.**CDD base (C or F)**- the cooling base temperature of the regression. To use the regression formula you'd want to download CDD in this base temperature. This will be blank if the regression did not include CDD.**HDD total**- the total heating degree days (in the*HDD base*temperature) over the period covered by the energy-usage data used for the regression (excluding gaps). This will be blank if the regression did not include HDD.**CDD total**- the total cooling degree days (in the*CDD base*temperature) over the period covered by the energy-usage data used for the regression (excluding gaps). This will be blank if the regression did not include CDD.**HDD coef (a)**- the HDD coefficient in the regression formula (e.g. the*a*in*y = a*HDD + c*days*). Watch out for negative values (see below).**CDD coef (b)**- the CDD coefficient in the regression formula (e.g. the*b*in*y = b*CDD + c*days*). Watch out for negative values (see below).**Constant (c)**- the constant coefficient in the regression formula (e.g. the*c*in*y = a*HDD + b*CDD + c*days*). Watch out for negative values (see below). For day-normalized regressions (which we recommend) you'll want to multiply this by the number of days covered by the period you're considering (the period that you want the baseline-regression-predicted energy consumption for and the period that your HDD/CDD values cover).**R2**- the*R-squared*value for the regression. This is the same*R-squared*that you will be familiar with if you have ran regressions in Excel (or added a linear trend line to a scatter plot). Higher values are better, but be careful:*R-squared*favours more complicated regression models, so it's not great for comparing different types of energy regressions (like comparing an HDD+CDD regression with an HDD-only or CDD-only regression). For more on this see the notes for*adjusted R-squared*below.**R2 adjusted**(*adjusted R-squared*) - like*R-squared*, but adjusted downwards to account for the number of predictors (like e.g. HDD and CDD) in the regression formula. To some extent this corrects for the fact that*R-squared*will always increase if you add more predictors into a regression, so it is better than*R-squared*if you're comparing, say, a regression of energy against HDD alone with a regression of energy against both HDD and CDD. It's a standard statistical measure (not specific to energy or degree days) that you will find plenty more information about online if you search. However,*cross-validated R-squared*is typically a more useful statistic for baseline energy regressions.**R2 cross-validated**(*cross-validated R-squared*or*predicted R-squared*) - like*R-squared*but calculated in a more sophisticated way. Whilst*R-squared*only gives an indication of how good the regression is at*explaining*the sample usage data it was generated from,*cross-validated R-squared*gives an indication of how good the regression is likely to be at*predicting*energy consumption (according to baseline levels) in the future. It's calculated using a process called leave-one-out cross-validation. It's a standard statistical measure (rather than being specific to the energy field), and you can find more information about it online if you search for "cross-validated R-squared" or "predicted R-squared". It's derived from the*PRESS*statistic which you can also look up. But you don't need to understand*cross-validated R-squared*fully to know that it is essentially a more-sophisticated*R-squared*that is in many ways better for choosing regressions and assessing the quality of a baseline energy model. Like*R-squared*it has a value of 1 or less (1 being the best value), and it is typically greater than zero, though, unlike*R-squared*, it can be negative for really bad regressions. Also, unlike*R-squared*, it doesn't have the problem of consistently favouring more complicated regression models.**S**- the*standard error*of the regression. This is a common statistic (not energy-specific) that you can look up online. It will always be zero or greater, and lower values are better. You can think of it as a ± value on the y values calculated by the regression, though, like*R-squared*,*adjusted R-squared*, and*CVRMSE*, it is only an indication of the regression's ability to*explain*the sample energy-usage data that it was generated from - it's not like*cross-validated R-squared*which indicates the ability of the regression to*predict*energy consumption from HDD/CDD values that didn't feed into the original model (i.e. from outside the baseline period).**CVRMSE**(*Coefficient of Variation of the Root Mean Square Error*) - a statistic recommended by ASHRAE (e.g. in their often-referenced Guideline 14) to assess the quality of a baseline regression. Lower values are better. For calculation of savings compared to a baseline model ASHRAE likes to see a baseline regression with*CVRMSE*below 0.25 if there is at least 12 months' worth of post-baseline consumption data to be compared with the baseline-predicted consumption, or below 0.2 if there is less than 12 months' worth. You will often see*CVRMSE*expressed as a percentage (e.g. 0.2 expressed as 20%) - to turn the spreadsheet figures into percentages just multiply them by 100. Note that*CVRMSE*does not penalize regressions with more predictors, nor does it involve any cross-validation, so we don't recommend using it over*cross-validated R-squared*to choose between the regressions returned by the regression tool. It's a useful statistic, but that's not what it's built for.**HDD coef S**- the*standard error*of the HDD coefficient. This is a measure of the precision of the regression model's estimate of the HDD coefficient value. It's a common statistic (not energy specific) that you can look up online.**HDD coef P**- the*p-value*of the HDD coefficient. It can have values between 0 (best) and 1 (worst), and you can use it as an indication of whether there is likely to be a meaningful relationship between HDD and energy consumption. People often look for*p-values*of 0.05 or less, but really it is a sliding scale.*p-value*is a common statistic (not energy specific) that you can look up online.**CDD coef S**- the*standard error*of the CDD coefficient. See the explanation of*HDD coef S*above.**CDD coef P**- the*p-value*of the CDD coefficient. See the explanation of*HDD coef P*above.**Constant S**- the*standard error*of the constant coefficient. See the explanation of*HDD coef S*above.**Constant P**- the*p-value*of the constant coefficient. See the explanation of*HDD coef P*above, but note that the*p-value*for the constant coefficient is usually high if the constant coefficient is small, so it's often not much use as an indication of the quality of a regression.

A negative coefficient (on HDD, CDD, or the constant) is usually an indication that the regression is not a good one. A regression with one or more negative coefficients can often look good in other respects (i.e. good statistics), but it is unlikely to be justifiable in real-world terms, so is typically best ignored.

For informational purposes the regression tool will return regressions with negative coefficients if they fit better than any other regressions with the same formula (e.g. *y = a*HDD + c*days*, *y = b*CDD + c*days*, or *y = a*HDD + b*CDD + c*days*), but it will always list them below any regressions with only non-negative coefficients.

The regression tool has a sophisticated process for comparing and ranking the thousands of regressions that it tests against each set of energy-usage data. The *shortlist* regressions are the ones it considers to be likely candidates, and the first-listed regression is the one it thinks best. But the regression tool knows nothing about the building that your energy-usage data came from. And, although we are always looking to improve the regression tool's algorithms, it is based on statistics and probabilities so it will never be possible for it to be correct 100% of the time.

If a building has no cooling then you're unlikely to want a regression involving CDD, even if the first-listed regression involves CDD. For a building with no heating you're unlikely to want a regression involving HDD. (Although if the numbers for such surprise regressions look much better than they do for the others it may be worth you questioning your assumptions about the building and the equipment that your metered energy is feeding.)

An experienced energy professional will often have a rough idea of the likely base temperature(s) of a building, the likely baseload energy consumption (expressed in the constant coefficient of the regression formula), and the likely split of energy usage between heating, cooling, and baseload, over the baseline period of energy-usage data provided. This knowledge can help further in choosing the best regression.

If you have good knowledge of the building, use it!

- Favour regressions with the predictors you expect (e.g. HDD only, CDD only, HDD and CDD together).
- Favour regressions with base temperatures that you can justify in terms of the building and its operation.
- Look at the regression coefficients and the
*HDD total*and*CDD total*figures to see how much energy usage each regression attributes to heating, cooling, and baseload (non-weather dependent consumption accounted for by the constant), over the baseline period your energy-usage data covers. Favour regressions with usage breakdowns that fit with your expectations.

Though do **check the statistics before choosing any regression that isn't in the shortlist**. Here are some tips on comparing regressions based on the statistics:

- Watch out for negative coefficients (see above). A regression with one or more negative coefficients (including a negative constant) is unlikely to be a good one.
It gives an indication of the power of a regression to predict future energy consumption according to baseline levels (the main purpose of baseline regressions), whilst*Cross-validated R-squared*is a particularly useful statistic for comparing regressions that were generated from the same energy-usage data.*R-squared*,*adjusted R-squared*, the*standard error*(*S*), and*CVRMSE*are only measures of how well a regression explains the energy-usage data that it was generated from. Predictive power versus explanatory power is an often-discussed topic in statistics, and the two often go hand in hand, but the key things to remember are that prediction is typically more important for energy regressions, and that*cross-validated R-squared*is a useful measure of a regression's predictive power (and one that is highly intuitive thanks to its similarity to the familiar*R-squared*).- The
*p-values*of the HDD and CDD regression coefficients (*HDD coef P*and*CDD coef P*) can be useful. A high*p-value*is an indication that the predictor might not belong. For example, if a regression with HDD and CDD has a high*p-value*for the CDD coefficient, it is an indication that perhaps an HDD-only regression would be better. The*p-value*of the constant coefficient isn't typically so useful, however, because it's usually high when the constant is small (which it often rightfully is).

Bear in mind that the regression tool already aims to use the statistics as best it can when choosing and ranking the regressions it returns from the thousands that it tests against each set of energy-usage data. It will always be possible to improve the algorithms, but it should be doing a pretty good job. However, a statistics-only approach can only go so far, and, with knowledge of the building, you will often find that regression 2 or below will be a better choice than the one the regression tool put first.

We always like constructive feedback, but it's particularly important to us at the moment while the regression tool is in beta. We'd love to hear what you think of it: what you like about it, what you don't like about it, and what else it could do to make it more useful to you.

**Please email us at info@degreedays.net
with any feedback about the regression tool**. We'll be most appreciative, and as a way of saying thanks we'll gladly set you up with temporary free access to Degree Days.net Pro so you can use the regression tool with larger data sets or data going further back in time.

For a year after we launched the regression tool in beta in October 2015 we were particularly keen for people to send us real-world energy-usage data with which we could test and improve the regression-tool's algorithms. We have now received a lot of data to work with (thank you!) and getting more is no longer such a priority for us. But if you do have an interesting data set that you would like to discuss, or that highlights a good or bad aspect of the regression tool, please do send it along, together with the following information:

- The fuel that the energy data represents (e.g. gas or electricity).
- The location of the building that it came from (so we can choose one or more weather stations to test the data against).
- Whether the building has heating, or cooling, or both.
- What other fuels supply the building.
- Which fuel(s) supply which components of the heating/cooling system (e.g. gas heating, electric cooling).
- What temperature the building is heated/cooled to (these are often different).
- Whether it is heated/cooled 24/7 or intermittently (e.g. for office hours only).
- Whether it is well insulated.
- Whether it has any significant internal heat gains (e.g. equipment that generates a lot of heat).
- Whether it has any significant refrigeration or freezer loads.
- Anything else that you think is likely to be relevant.

Sorry about the long list, it's just that we need to know about the building and its usage to figure out whether or not the regression tool is giving useful results for any given set of data.

Thank you!