Thursday, January 19, 2017

SQL-Server Set-Based Moving Averages without Cursors, Self-Joins, or Subselects

Microsoft SQL-Server Query / Procedure: Set-Based Moving Averages (without Cursors, self-joins, or sub-selects) — Example 1, with comparison to APPLY and OVER

In this blog, I provide the source code for a Microsoft SQL-Server Transact-SQL (T-SQL) query example (easily adapted to a stored procedure) that demonstrates an innovative way of efficiently calculating moving-averages, with variable window-widths / queue-depths, without using cursors or self-joins or any of the usual techniques. This is purely a set-based method, and is very efficient — the only solution that I have found to be more efficient is the native implementation of windowing (OVER clause) functionality in Microsoft SQL-Server 2012, but that has limitations since the native approach does not support using variables to define the number of data-points (i.e., the "window width") to include in the moving-average value like my custom solution is capable of.

In addition to presenting the source code for an efficient SQL moving averages implementation, I examine and compare it to alternatives that make use of APPLY (CROSS APPLY / OUTER APPLY) as well as the new OVER / PARTITION BY features of SQL-Server 2012. Keep in mind that the CROSS APPLY and OUTER APPLY are NOT ANSI Standard SQL (it is a Microsoft extension to Transact SQL), and additionally it is no where near as efficient as my custom and fast moving-average SQL function (or the very fast SQL2012 windowing-functionality). My code will work on older versions of SQL-Server (SQL2005, SQL2008, etc) as well as newer versions.

I have included performance comparisons and execution-speed results for each of the techniques demonstrated here. Three distinct query techniques (that produce the same results) appear below.

NOTE: See my next blog for a related example that extends upon this and shows the source code for performing SQL-Server T-SQL Set-Based Moving-Averages with resets at breaks for another variation on this where I also included the ability to perform break-level resets (to the moving averages; i.e., restart the moving-average calculation at a break-level).

DISCUSSION

This is one way of implementing a solution to a rather common problem — calculating moving average — using any version of Microsoft T-SQL that lacks full ANSI SQL2003 window functions (i.e., any SQL-Server version prior to SQL 2012). If you are not familiar with windowing functions in SQL, perhaps it is because of the current limitations that Microsoft Transact-SQL imposes on Aggregate window functions using the OVER clause. The OVER clause determines the partitioning and ordering of the rowset before the associated window function is applied (and, OVER is relevant to Ranking window functions and Aggregate window functions). This is what SQL-Server 2005, 2008, and 2008r2 provides for with the OVER clause:

Ranking Window Functions
< OVER_CLAUSE > :: =
    OVER ( [ PARTITION BY value_expression, ... [ n ] ] )


Aggregate Window Functions
< OVER_CLAUSE > :: =
    OVER ( [ PARTITION BY value_expression, ... [ n ] ] )


Now, compare SQL-Server's limited Window Functions implementation to the SQL2003 standard for Window Functions, where the SQL2003 standard specifies the following rather powerful syntax for windowing functions:
FUNCTION_NAME(expr) OVER {window_name|(window_specification)}

window_specification ::= [window_name][partitioning][ordering][framing]

partitioning ::= PARTITION BY value [, value...] [COLLATE collation_name]

ordering ::= ORDER [SIBLINGS] BY rule [, rule...]

rule ::= {value|position|alias} [ASC|DESC] [NULLS {FIRST|LAST}]

framing ::= {ROWS|RANGE} {start|between} [exclusion]

start ::= {UNBOUNDED PRECEDING|unsigned-integer PRECEDING|CURRENT ROW}

between ::= BETWEEN bound AND bound
     bound ::= {start|UNBOUNDED FOLLOWING|unsigned-integer FOLLOWING}

exclusion ::= {EXCLUDE CURRENT ROW|EXCLUDE GROUP
              |EXCLUDE TIES|EXCLUDE NO OTHERS}


The window "framing" options were lacking in SQL-Server (prior to SQL-Server version 2012). So, I put together this example (and others on this blog) to demonstrate how to accomplish the same result using purely SET BASED SQL methods for achieving framing in SQL-Server without T-SQL framing being supported as a language feature yet. I did not have to use self-joins or cursors to pull this off, and the technique should extend readily if you desire.

This example used the Microsoft AdventureWorks sample database tables and values from SQL-Server 2012 release.

SQL-Server Procedure / Query Source Code

--********************************************************************************
--This source code is Copyright (c) 2007-2017
--     Author: Mike Eberhart
--
--I hereby release this code under the terms of the MIT License (for freeware).
--
--Permission is hereby granted, free of charge, to any person obtaining a copy
--of this software and associated documentation files (the "Software"), to deal
--in the Software without restriction, including without limitation the rights
--to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
--copies of the Software, and to permit persons to whom the Software is
--furnished to do so, subject to the following conditions:
--
--The above copyright notice and this permission notice shall be included in
--all copies or substantial portions of the Software.
--
--THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
--IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
--FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
--AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
--LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
--OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
--THE SOFTWARE.
--********************************************************************************
--**********************************************************************************************
-- BEGIN: Mike's SET-BASED Moving-Average Technique
-- Demonstrate how, without any self-JOIN operation, without any CURSOR, without subselects, 
-- we can still calculate a moving average, including the ability to reset that moving average
-- at particular break-values, and easily alter the number of values included in average.
--
-- Queue-Depth (i.e., "moving average values-count") size changes have nearly no impact on
-- execution speed, meaning there is essentially no performance penalty for larger moving-avg
-- "windows" than small ones -- so, a 30-day moving average calculates at nearly same speed 
-- as a 120-day moving average due to efficient algorithm.
--
-- Excellent for financial-type applications, like moving-average stock-price calculations.
-- This example even "resets" the moving-average at a break-value, and can easily be extended
-- to do so at multi-column breaks, etc (see SET-BASED RUNNING TOTALS EXAMPLE 2 for technique).
--**********************************************************************************************
DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS; -- Clears the data cache
DBCC FREEPROCCACHE    WITH NO_INFOMSGS; -- Clears the procedure cache

SET NOCOUNT ON;
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

DECLARE @starttime DATETIME;
SELECT  @starttime = GetDate();

SET NOCOUNT ON;

--==============================================================================================
--Our work-table, where we can ensure the ORDER with which we later access the data, being
--by default, the order of the PRIMARY KEY declared here in UniqueID column.
--==============================================================================================
DECLARE @Results TABLE
(
  UniqueID                 INT IDENTITY NOT NULL PRIMARY KEY,
  AverageResetBreakColumn1 INT,
  OrderDate                DATETIME,
  ValueBeingAveraged       MONEY,
  TotalSaleMovingAvg       MONEY
);

--==============================================================================================
--Insert all values to be totaled, into our work table in the REQUIRED ORDER by which
--we will be calculating moving-average for.  In this example, we will look at moving
--average of Order SubTotals by OrderDate within each Territory.
--==============================================================================================
INSERT INTO @Results(
  AverageResetBreakColumn1,
  OrderDate,
  ValueBeingAveraged)
SELECT
  Customers.TerritoryID, 
  Header.OrderDate, 
  SUM(ISNULL(Header.SubTotal, 0)) --Handle NULL values
FROM 
  Sales.SalesOrderHeader      AS Header
  JOIN Sales.SalesOrderDetail AS Detail
    ON (Detail.SalesOrderID = Header.SalesOrderID)
  JOIN Sales.Customer         AS Customers
    ON (Customers.CustomerID  = Header.CustomerID)
GROUP BY
  Customers.TerritoryID,
  Header.OrderDate
ORDER BY
  Customers.TerritoryID,
  Header.OrderDate
;

--==============================================================================================
--Whether we call it "moving window width", or "Queue Depth", or whatever, this indicates how
--many elements are to be included in our moving average.  
--E.g., a common moving average in finance situations could be a 30 day moving average, you
--would set "depth" to 30.  For this example, keep queue small for easy validation of calcs.
--==============================================================================================
DECLARE @QueueDepth  INT;
SET     @QueueDepth  = 5;

--==============================================================================================
--Space we'll use to store each value in our queue. 
--In this example, allow for up to 9 leading digits, the decimal, and 4 trailing digits each.
--==============================================================================================
DECLARE  @SingleValueCharLength  INT;
SET      @SingleValueCharLength  = 14;

--==============================================================================================
--Variable to accumulate our delimited string of values-per-break-level used to calc moving avg.
--This is, essentially, our values-queue.  Initialize it, so always predictable fixed length.
--New values (i.e., current-row value) are prepended to this String, with oldest (in queue)
--value appearing in the rightmost position of string... i.e., add new value to front of string
--and old values fall off the "end" (right side).
--
--NOTE: SET SIZE of this var to @QueueDepth * @SingleValueCharLength, OR leave as 8000 or MAX,
--keeping in mind that MAX will add slight performance penalty.
--==============================================================================================
DECLARE @MovingSubtotalValuesString VARCHAR(MAX);
SET     @MovingSubtotalValuesString = REPLICATE('0', @SingleValueCharLength * @QueueDepth);

--==============================================================================================
--Our break-level-value variables (Data-Types match those of the columns we are comparing to)
--Initialize these to some values that will NOT exist in each break-column's actual data.
--==============================================================================================
DECLARE @AverageResetBreakVal  INT;
SET     @AverageResetBreakVal  = -999;

--We will need to track the moving (or "windowed") subtotal during the update process
DECLARE  @MovingSubtotal       MONEY;
SET      @MovingSubtotal       = 0;

DECLARE @RowNumberThisGroup    INT;
SET     @RowNumberThisGroup    = 1;

--==============================================================================================
-- ALGORITHM EXPLANATION:
--    See SET-BASED RUNNING SUBTOTALS and CONCATENATION examples for background info, and 
--    the remainder of specific moving-average logic is described (intra-code comments) below.
--==============================================================================================
UPDATE
    @Results
SET
  --Keep track of what row# within a break-grouping we are on
  @RowNumberThisGroup =
    CASE 
      WHEN @AverageResetBreakVal = AverageResetBreakColumn1 
        THEN @RowNumberThisGroup + 1
      ELSE 1 
    END,
  --If at break, reset moving-subtotal (first value in group is current value); otherwise we
  --add the most recent value (current row value) to be included in the subtotal, and then
  --subtract the last value falling outside of queue-depth -- this is the secret to getting 
  --the "moving subtotal window" to work, and the entire reason we need the values-queue!
  @MovingSubtotal =
    CASE 
      WHEN @AverageResetBreakVal = AverageResetBreakColumn1 
        THEN @MovingSubtotal + ValueBeingAveraged  --ADD NEW VALUE
          - CONVERT(MONEY, RIGHT(@MovingSubtotalValuesString, @SingleValueCharLength))
      ELSE ValueBeingAveraged
    END,
  --If at break, reset moving-values-list-string to contain current row value, with the rest
  --of our "queue" (this string) holding just zero/empty values.
  --Otherwise, we will be adding new value to front of the queue-string and dropping the 
  --last value from the end of the queue.
    @MovingSubtotalValuesString  = 
    CASE 
      WHEN @AverageResetBreakVal = AverageResetBreakColumn1 
        THEN
          LEFT(CONVERT(CHAR(14), ValueBeingAveraged, 2), @SingleValueCharLength) +
          LEFT(@MovingSubtotalValuesString,  @SingleValueCharLength * (@QueueDepth -1))
      ELSE
          CONVERT(CHAR(14), ValueBeingAveraged, 2) + 
          REPLICATE('0', @SingleValueCharLength * (@QueueDepth -1))
      END ,
  --If at break, reset moving-avg (first value in group is current value); otherwise if
  --less than our queue-depth into a group, moving average is our moving-subtotal divided by
  --how many rows into group we are; otherwise, simply divide moving subtotal by queue-depth
  TotalSaleMovingAvg  = 
    CASE 
      WHEN (@AverageResetBreakVal = AverageResetBreakColumn1) 
      AND  (@RowNumberThisGroup <= @QueueDepth)
        THEN @MovingSubtotal / @RowNumberThisGroup
      WHEN (@AverageResetBreakVal = AverageResetBreakColumn1) 
        THEN @MovingSubtotal / @QueueDepth
      ELSE ValueBeingAveraged  
    END,
  --And finally, keep track of whether we hit a new break-value.
    @AverageResetBreakVal= AverageResetBreakColumn1;


--==============================================================================================
--Output the results, showing all rows to demonstrate the accumulation...
--==============================================================================================
SELECT 
  UniqueID, 
  AverageResetBreakColumn1, 
  CONVERT(VARCHAR(10), OrderDate, 111) AS 'OrderDate', 
  ValueBeingAveraged, 
  TotalSaleMovingAvg 
FROM
  @results;

--With a Queue-Depth of 2 (FOR EASY VISUAL VALIDATION OF CALCULATIONS)...
--UniqueID    AverageResetBreakColumn1 OrderDate  ValueBeingAveraged    TotalSaleMovingAvg
------------- ------------------------ ---------- --------------------- ---------------------
--1           1                        2005/07/01 999484.4736           999484.4736
--2           1                        2005/07/04 3578.27               501531.3718
--3           1                        2005/07/06 3578.27               3578.27
--4           1                        2005/07/07 699.0982              2138.6841
--5           1                        2005/07/09 3578.27               2138.6841
--6           1                        2005/07/10 3578.27               3578.27
--7           1                        2005/07/12 3578.27               3578.27
--...
--5037        9                        2008/07/25 1003.05               1067.27
--5038        9                        2008/07/26 557.79                780.42
--5039        9                        2008/07/27 545.30                551.545
--5040        9                        2008/07/28 672.90                609.10
--5041        9                        2008/07/29 149.01                410.955
--5042        9                        2008/07/30 1069.68               609.345
--5043        9                        2008/07/31 1731.23               1400.455
--5044        10                       2008/07/03 699.0982              699.0982
--5045        10                       2005/07/05 3578.27               2138.6841
--5046        10                       2005/07/06 3578.27               3578.27
--5047        10                       2005/07/07 3578.27               3578.27
--5048        10                       2005/07/09 3399.99               3489.13
--5049        10                       2005/07/11 3578.27               3489.13
--5050        10                       2005/07/12 3578.27               3578.27
--...


--With a Queue-Depth of 5 (Most easily Verified with something like a Spreadsheet)...
--UniqueID    AverageResetBreakColumn1 OrderDate  ValueBeingAveraged    TotalSaleMovingAvg
------------- ------------------------ ---------- --------------------- ---------------------
--1           1                        2005/07/01 999484.4736           999484.4736
--2           1                        2005/07/04 3578.27               501531.3718
--3           1                        2005/07/06 3578.27               335547.0045
--4           1                        2005/07/07 699.0982              251835.0279
--5           1                        2005/07/09 3578.27               202183.6763
--6           1                        2005/07/10 3578.27               3002.4356
--7           1                        2005/07/12 3578.27               3002.4356
--...
--5037        9                        2008/07/25 1003.05               1063.094
--5038        9                        2008/07/26 557.79                1051.072
--5039        9                        2008/07/27 545.30                1093.24
--5040        9                        2008/07/28 672.90                782.106
--5041        9                        2008/07/29 149.01                585.61
--5042        9                        2008/07/30 1069.68               598.936
--5043        9                        2008/07/31 1731.23               833.624
--5044        10                       2008/07/03 699.0982              699.0982
--5045        10                       2005/07/05 3578.27               2138.6841
--5046        10                       2005/07/06 3578.27               2618.546
--5047        10                       2005/07/07 3578.27               2858.477
--5048        10                       2005/07/09 3399.99               2966.7796
--5049        10                       2005/07/11 3578.27               3542.614
--5050        10                       2005/07/12 3578.27               3542.614
--...

SELECT DateDiff(ms, @starttime, GetDate()); --Display elapsed Milliseconds 
--**********************************************************************************************
-- END: SET-BASED Moving-Average Technique using Mike's approach: 
--
-- 275ms average (queue-depth makes essentially no difference in speed) : FAST!
--**********************************************************************************************
    


--**********************************************************************************************
--**********************************************************************************************
-- BEGIN: alternative SET-BASED Moving-Average Technique using "APPLY" functionality.
--
-- NOTE: (CROSS / OUTER) APPLY is NOT an ANSI Standard SQL feature.
--**********************************************************************************************
--**********************************************************************************************
DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS; -- Clears the data cache
DBCC FREEPROCCACHE    WITH NO_INFOMSGS; -- Clears the procedure cache

SET NOCOUNT ON;
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

DECLARE @starttime DATETIME;
SELECT  @starttime = GetDate();

DECLARE @QueueDepth  INT;
SET     @QueueDepth  = 5;

WITH topNValues AS
(
    SELECT * FROM (SELECT
      Customers.TerritoryID, 
      Header.OrderDate
    FROM 
      Sales.SalesOrderHeader      AS Header
      JOIN Sales.SalesOrderDetail AS Detail
        ON (Detail.SalesOrderID = Header.SalesOrderID)
      JOIN Sales.Customer         AS Customers
        ON (Customers.CustomerID  = Header.CustomerID)
    GROUP BY
        Customers.TerritoryID,
        OrderDate
) AS t
  OUTER APPLY
    (
      SELECT
        TOP(@QueueDepth) topN.ValueBeingAveraged
        FROM (
            SELECT
              Customers.TerritoryID, 
              Header.OrderDate, 
              SUM(ISNULL(Header.SubTotal, 0)) AS ValueBeingAveraged  --Handle NULL values
            FROM 
              Sales.SalesOrderHeader      AS Header
              JOIN Sales.SalesOrderDetail AS Detail
                    ON (Detail.SalesOrderID = Header.SalesOrderID)
              JOIN Sales.Customer         AS Customers
                    ON (Customers.CustomerID  = Header.CustomerID)
            GROUP BY
              Customers.TerritoryID,
              OrderDate
            ) AS topN
      WHERE 
            t.TerritoryID =  topN.TerritoryID
        AND t.OrderDate   >= topN.OrderDate
      ORDER BY
        topN.OrderDate DESC
    ) topNapply
)
SELECT
  TerritoryID, 
  OrderDate,
  AVG(ValueBeingAveraged) AS 'TotalSaleMovingAvg'
FROM 
  topNValues
GROUP BY
  TerritoryID,
  OrderDate
ORDER BY
  TerritoryID,
  OrderDate;


SELECT DateDiff(ms, @starttime, GetDate()); --Display elapsed Milliseconds 
--**********************************************************************************************
-- END: OUTER APPLY / CROSS APPLY BASED Moving-Average Technique, which produces same result 
-- but is MUCH, MUCH SLOWER compared to technique demonstrated first above!)
--
-- 19100ms average (queue-depth of 2)
-- 20500ms average (queue-depth of 20) 
-- i.e., this technique SLOWS as queue-depth increases, and OUTER / CROSS makes little difference
-- (in fact, CROSS APPLY was slightly slower than OUTER APPLY).
--**********************************************************************************************



--**********************************************************************************************
--**********************************************************************************************
-- BEGIN: alternative SET-BASED Moving-Average Technique using "OVER" (windowing) functionality
-- introduced in SQL-Server 2012.
--**********************************************************************************************
--**********************************************************************************************
DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS; -- Clears the data cache
DBCC FREEPROCCACHE    WITH NO_INFOMSGS; -- Clears the procedure cache

SET NOCOUNT ON;
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

DECLARE @starttime DATETIME;
SELECT  @starttime = GetDate();

--The OVER (windowing) function within SELECT cannot handle *variables* defining the window-width apparently?!
--Thus, instead of using these variables, we must hard-code the value into the "BETWEEN" in upcoming SELECT statement.
--DECLARE @QueueDepth  INT;
--SET     @QueueDepth  = 5;

SELECT
    TerritoryID, 
    OrderDate, 
    AVG(ValueBeingAveraged) 
OVER
    (
    PARTITION BY TerritoryID
    ORDER BY OrderDate
    ROWS BETWEEN 19 PRECEDING AND CURRENT ROW --NOTE: Queue-Depth of 5 is represented as "BETWEEN 4 PRECEDING AND CURRENT ROW" 
    )
FROM
    (SELECT
        Customers.TerritoryID, 
        Header.OrderDate, 
        SUM(ISNULL(Header.SubTotal, 0)) AS ValueBeingAveraged  --Handle NULL values
    FROM 
        Sales.SalesOrderHeader      AS Header
        JOIN Sales.SalesOrderDetail AS Detail
        ON (Detail.SalesOrderID = Header.SalesOrderID)
        JOIN Sales.Customer         AS Customers
        ON (Customers.CustomerID  = Header.CustomerID)
    GROUP BY
        Customers.TerritoryID,
        OrderDate
    ) AS topN


SELECT DateDiff(ms, @starttime, GetDate()); --Display elapsed Milliseconds 
--**********************************************************************************************
-- END: "OVER" (WINDOWING FUNCTION) BASED Moving-Average Technique, which produces same result 
-- but requires SQL-Server 2012+ and also imposes a potential issue due to the fact that the 
-- "window-width" (aka, queue-depth) for our moving-average must be HARD-CODED INTEGER VALUE!
--
-- 170ms average (queue-depth makes essentially no difference in speed) : FAST!
-- Being "native" functionality in SQL-Server 2012, it is not surprising this approach produces
-- the SPEEDIEST EXECUTION TIME, and IF Microsoft implements variable-support in the window-width
-- specifiers, this approach will win out in almost all situations. 
--
-- But, for a non-native and/or pre-SQL2012 solution, our original approach I laid out (above)
-- is remarkably close to this performance and DOES support variable-window-width / queue-depth. 
--**********************************************************************************************


Continue to read this Software Development and Technology Blog for computer programming, software development, and technology Techniques, How-To's, Fixes, Reviews, and News — focused on Dart Language, SQL Server, Nvidia CUDA, VMware, TypeScript, SVG, other technology tips and how-to's, and my varied political and economic opinions.

No comments: