所以我们有
this database填充了一堆字符串,在这种情况下发布标题.
我想做的是:
>将字符串拆分为单词
>计算字符串中出现的字数
>给我前50个单词
>在data.se查询中没有此超时
我尝试使用this SO question中适用于data.se的信息,如下所示:
select word,count(*) from ( select (case when instr(substr(p.Title,nums.n+1),' ') then substr(p.Title,nums.n+1) else substr(p.Title,nums.n+1,instr(substr(p.Title,' ') - 1) end) as word from (select ' '||Title as string from Posts p )Posts cross join (select 1 as n union all select 2 union all select 10 ) nums where substr(p.Title,nums.n,1) = ' ' and substr(p.Title,1) <> ' ' ) w group by word order by count(*) desc
不幸的是,这给了我一些错误:
‘substr’ is not a recognized built-in function name. Incorrect Syntax
near ‘|’. Incorrect Syntax near ‘nums’.
解决方法
正如Blogbeard所说,您提供的查询不适用于sql Server.这是计算最常用单词的一种方法.它基于一个函数
DelimitedSplitN4K,由Jeff Moden编写,并由sql Server Central社区的成员进行了改进.
WITH E1(N) AS ( SELECT 1 FROM (VALUES (1),(1),(1) ) t(N) ),E2(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),E4(N) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b) SELECT TOP 50 x.Item,COUNT(*) FROM Posts p CROSS APPLY ( SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),Item = LTRIM(RTRIM(SUBSTRING(p.Title,l.N1,l.L1))) FROM ( SELECT s.N1,L1 = ISNULL(NULLIF(CHARINDEX(' ',p.Title,s.N1),0)-s.N1,4000) FROM( SELECT 1 UNION ALL SELECT t.N+1 FROM( SELECT TOP (ISNULL(DATALENGTH(p.Title)/2,0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4 ) t(N) WHERE SUBSTRING(p.Title,t.N,1) = ' ' ) s(N1) ) l(N1,L1) ) x WHERE x.item <> '' GROUP BY x.Item ORDER BY COUNT(*) DESC
由于不允许创建函数,我已经这样写了.如果您有兴趣,这是函数定义:
CREATE FUNCTION [dbo].[DelimitedSplitN4K]( @pString NVARCHAR(4000),@pDelimiter NCHAR(1) ) RETURNS TABLE WITH SCHEMABINDING AS RETURN WITH E1(N) AS ( SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 ),E2(N) AS (SELECT 1 FROM E1 a,E1 b),E4(N) AS (SELECT 1 FROM E2 a,E2 b),cteTally(N) AS( SELECT TOP (ISNULL(DATALENGTH(@pString)/2,0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4 ),cteStart(N1) AS ( SELECT 1 UNION ALL SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,1) = @pDelimiter ),cteLen(N1,L1) AS( SELECT s.N1,ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,4000) FROM cteStart s ) SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),Item = SUBSTRING(@pString,l.L1) FROM cteLen l ;
以下是您将如何使用它:
SELECT TOP 50 x.Item,COUNT(*) FROM Posts p CROSS APPLY dbo.DelimitedSplitN4K(p.Title,' ') x WHERE LTRIM(RTRIM(x.Item)) <> '' GROUP BY x.Item ORDER BY COUNT(*) DESC
结果:
Item -------- ------- to 3812411 in 3331522 a 2543636 How 1770915 the 1534298 with 1341632 of 1297468 and 1166664 on 970554 from 964449 for 886007 not 835979 is 704724 using 703007 I 633838 - 632441 an 548450 when 449169 file 409717 how 358745 data 335271 do 323854 can 310298 get 305922 or 266317 error 263563 use 258408 value 254392 it 251254 my 238902 function 235832 by 231025 Android 228308 as 216654 array 209157 working 207445 does 207274 Is 205613 multiple 203336 that 197826 Why 196979 into 196591 after 192056 string 189053 PHP 187018 one 182360 class 179965 if 179590 text 174878 table 169393