从SQL中的一列字符串中获取最常用的单词

所以我们有 this database填充了一堆字符串,在这种情况下发布标题.

我想做的是:

>将字符串拆分为单词
>计算字符串中出现的字数
>给我前50个单词
>在data.se查询中没有此超时

我尝试使用this SO question中适用于data.se的信息,如下所示:

select word,count(*) from (
select (case when instr(substr(p.Title,nums.n+1),' ') then substr(p.Title,nums.n+1)
             else substr(p.Title,nums.n+1,instr(substr(p.Title,' ') - 1)
        end) as word
from (select ' '||Title as string
      from Posts p
     )Posts cross join
     (select 1 as n union all select 2 union all select 10
     ) nums
where substr(p.Title,nums.n,1) = ' ' and substr(p.Title,1) <> ' '
) w
group by word
order by count(*) desc

不幸的是,这给了我一些错误

‘substr’ is not a recognized built-in function name. Incorrect Syntax
near ‘|’. Incorrect Syntax near ‘nums’.

因此,给定sql中的一列字符串,每个字符串中包含可变数量的文本,如何获取最常用的X字列表?

解决方法

正如Blogbeard所说,您提供的查询不适用于sql Server.这是计算最常用单词的一种方法.它基于一个函数 DelimitedSplitN4K,由Jeff Moden编写,并由sql Server Central社区的成员进行了改进.

ONLINE DEMO

WITH E1(N) AS (
    SELECT 1 FROM (VALUES
        (1),(1),(1)
    ) t(N)
),E2(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),E4(N) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b)
SELECT TOP 50
    x.Item,COUNT(*)
FROM Posts p
CROSS APPLY (
    SELECT 
        ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),Item       = LTRIM(RTRIM(SUBSTRING(p.Title,l.N1,l.L1)))
        FROM (
            SELECT s.N1,L1 = ISNULL(NULLIF(CHARINDEX(' ',p.Title,s.N1),0)-s.N1,4000)
            FROM(
                SELECT 1 UNION ALL
                SELECT t.N+1 
                FROM(
                    SELECT TOP (ISNULL(DATALENGTH(p.Title)/2,0))
                        ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
                    FROM E4
                ) t(N)
                WHERE SUBSTRING(p.Title,t.N,1) = ' '
            ) s(N1)
        ) l(N1,L1)
) x
WHERE x.item <> ''
GROUP BY x.Item
ORDER BY COUNT(*) DESC

由于不允许创建函数,我已经这样写了.如果您有兴趣,这是函数定义:

CREATE FUNCTION [dbo].[DelimitedSplitN4K](
    @pString NVARCHAR(4000),@pDelimiter NCHAR(1)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN

WITH E1(N) AS (
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
),E2(N) AS (SELECT 1 FROM E1 a,E1 b),E4(N) AS (SELECT 1 FROM E2 a,E2 b),cteTally(N) AS(
    SELECT TOP (ISNULL(DATALENGTH(@pString)/2,0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),cteStart(N1) AS (
    SELECT 1 UNION ALL 
    SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,1) = @pDelimiter
),cteLen(N1,L1) AS(
    SELECT s.N1,ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,4000)
    FROM cteStart s
)
SELECT 
    ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),Item       = SUBSTRING(@pString,l.L1)
FROM cteLen l
;

以下是您将如何使用它:

SELECT TOP 50
    x.Item,COUNT(*)
FROM Posts p
CROSS APPLY dbo.DelimitedSplitN4K(p.Title,' ') x
WHERE LTRIM(RTRIM(x.Item)) <> ''
GROUP BY x.Item
ORDER BY COUNT(*) DESC

结果:

Item             
-------- ------- 
to       3812411 
in       3331522 
a        2543636 
How      1770915 
the      1534298 
with     1341632 
of       1297468 
and      1166664 
on       970554  
from     964449  
for      886007  
not      835979  
is       704724  
using    703007  
I        633838  
-        632441  
an       548450  
when     449169  
file     409717  
how      358745  
data     335271  
do       323854  
can      310298  
get      305922  
or       266317  
error    263563  
use      258408  
value    254392  
it       251254  
my       238902  
function 235832  
by       231025  
Android  228308  
as       216654  
array    209157  
working  207445  
does     207274  
Is       205613  
multiple 203336  
that     197826  
Why      196979  
into     196591  
after    192056  
string   189053  
PHP      187018  
one      182360  
class    179965  
if       179590  
text     174878  
table    169393

相关文章

(一)日志传送架构 (1.1)相关服务器 主服务器 :用于生产的服务器,上面运行这生产SQL Server数据库...
(一)事故背景 最近在SQL Server 2012生产数据库上配置完事物复制(发布订阅)后,生产数据库业务出现了...
(一)测试目的 目前公司使用的SQL SERVER 2012高可用环境为主备模式,其中主库可执行读写操作,备库既...
(一)背景个人在使用sql server时,用到了sql server的发布订阅来做主从同步,类似MySQL的异步复制。在...
UNION和OR谓词 找出 product 和 product2 中售价高于 500 的商品的基本信息. select * from product wh...
datawhale组队学习task03