使用
system-versioned temporal table(sql Server 2016中的新增功能)时,使用此功能处理大型关系数据仓库中的缓慢变化维度时,查询创作和性能影响是什么?
例如,假设我有一个带有Postal Code列的100,000行Customer维度和一个带有CustomerID外键列的数十亿行Sales事实表.并假设我想查询“按客户邮政编码计算的2014年总销售额”.简化的DDL就像这样(为清楚起见,省略了许多列):
CREATE TABLE Customer ( CustomerID int identity (1,1) NOT NULL PRIMARY KEY CLUSTERED,PostalCode varchar(50) NOT NULL,SysStartTime datetime2 GENERATED ALWAYS AS ROW START NOT NULL,SysEndTime datetime2 GENERATED ALWAYS AS ROW END NOT NULL,PERIOD FOR SYSTEM_TIME (SysStartTime,SysEndTime) ) WITH (SYSTEM_VERSIONING = ON); CREATE TABLE Sale ( SaleId int identity(1,SaleDateTime datetime2 NOT NULL,CustomerId int NOT NULL FOREIGN KEY REFERENCES Customer(CustomerID),SaleAmount decimal(10,2) NOT NULL );
有趣的是,客户可能在一年内搬家,因此同一客户可能有不同的邮政编码.甚至远程可能客户搬走然后搬回去,这意味着同一个客户可能会有相同邮政编码的多个历史记录!无论客户的邮政编码如何随时间变化,我对“邮政编码销售”的查询都应该能够计算出正确的结果.
我理解如何使用时态表来单独查询客户维度(例如SELECT * FROM Customer FOR SYSTEM_TIME FROM’2014-1-1’to’2015-1-1′)但我不确定如何最准确和有效加入事实表.
这是我应该如何查询它?
SELECT c.PostalCode,sum(s.SaleAmount) SaleAmount FROM Customer c FOR SYSTEM_TIME FROM '2014-1-1' TO '2015-1-1' JOIN Sale s ON s.CustomerId = c.CustomerId WHERE s.SaleDateTime >= '2014-1-1' AND s.SaleDateTime < '2015-1-1' AND c.SysStartTime >= s.SaleDateTime AND c.SysEndTime < s.SaleDateTime GROUP BY c.PostalCode
解决方法
我认为,在您的情况下,需要一个派生表来隔离每个客户的查询数量的邮政编码突变:
SELECT c.postalcode,sum(s.SaleAmount) SaleAmount,count(postcode_mutations.customerid) as CntCustomerChangedPostCode FROM dbo.Sale s JOIN dbo.Customer c on s.customerid = c.customerid LEFT JOIN ( SELECT CustomerID FROM [dbo].[Customer] FOR SYSTEM_TIME FROM '20140101' TO '20150101' GROUP BY CustomerID HAVING COUNT(DISTINCT PostalCode) > 1 ) postcode_mutations on s.customerid = postcode_mutations.customerid WHERE s.SaleDateTime >= '2014-1-1' AND s.SaleDateTime < '2015-1-1' GROUP BY c.PostalCode
upd:由于查询应该为DWH / Analytics方案提供服务,因此列存储索引是一个要检查的选项.我还为made some benchmarks previously换了1000万行表.