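For example, say we find the text "General Motors" in an article. We have a set of data that contains articles and the correct entities mentioned in them. So, if we find "General Motors" in a new article, should it fall into the class of articles in the prior data that contained a known genuine mention of "General Motors", or into the class of articles that did not mention that entity?
(I'm not creating a class for every entity and trying to classify every new article into every possible class. I already have a heuristic method for finding plausible mentions of entity names, and I just want to verify the plausibility of the limited number of entity-name mentions per article that the method has already detected.)
Given that the number of potential classes and articles is quite large, and naive Bayes is relatively simple, I wanted to do the whole thing in SQL, but I'm having trouble with the scoring query...
Here's what I have so far: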
-- Per-entity word counts: in how many of the entity's articles each word appears
CREATE TABLE `each_entity_word` (
  `word` varchar(20) NOT NULL,
  `entity_id` int(10) unsigned NOT NULL,
  `word_count` mediumint(8) unsigned NOT NULL,
  PRIMARY KEY (`word`, `entity_id`)
);

-- Per-entity totals
CREATE TABLE `each_entity_sum` (
  `entity_id` int(10) unsigned NOT NULL DEFAULT '0',
  `word_count_sum` int(10) unsigned DEFAULT NULL,
  `doc_count` mediumint(8) unsigned NOT NULL,
  PRIMARY KEY (`entity_id`)
);

-- Word counts across all articles, regardless of entity
CREATE TABLE `total_entity_word` (
  `word` varchar(20) NOT NULL,
  `word_count` int(10) unsigned NOT NULL,
  PRIMARY KEY (`word`)
);

-- Single-row table holding the overall totals
CREATE TABLE `total_entity_sum` (
  `word_count_sum` bigint(20) unsigned NOT NULL,
  `doc_count` int(10) unsigned NOT NULL,
  `pkey` enum('singleton') NOT NULL DEFAULT 'singleton',
  PRIMARY KEY (`pkey`)
);
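Each article in the labeled data is split into distinct words. For each entity mentioned in the article, every word is added to each_entity_word (or its word_count is incremented if it already exists), and doc_count is incremented in each_entity_sum, both with respect to that entity_id. This is repeated for each entity known to be mentioned in the article.
For each article, regardless of the entities it contains, total_entity_word and total_entity_sum are incremented in the same way for each word.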
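The counting scheme described above can be sketched in Python. This is a minimal sketch, not the code that fills the tables: the function name `build_counts`, the whitespace tokenizer, and the lower-casing are assumptions made for illustration, and the `word_count_sum` columns are left out.

```python
from collections import Counter, defaultdict


def build_counts(labeled_articles):
    """Mirror the four tables as in-memory counters.

    labeled_articles: iterable of (text, set_of_entity_ids) pairs.
    Tokenisation (lower-case, split on whitespace) is an assumption.
    """
    each_entity_word = defaultdict(Counter)  # entity_id -> word -> number of docs
    each_entity_docs = Counter()             # entity_id -> doc_count
    total_entity_word = Counter()            # word -> number of docs
    total_docs = 0                           # doc_count in total_entity_sum

    for text, entity_ids in labeled_articles:
        words = set(text.lower().split())    # distinct words only
        total_entity_word.update(words)
        total_docs += 1
        for entity_id in entity_ids:
            each_entity_word[entity_id].update(words)
            each_entity_docs[entity_id] += 1

    return each_entity_word, each_entity_docs, total_entity_word, total_docs


# Hypothetical labeled data: entity 1 stands for "General Motors".
articles = [
    ("General Motors recalls cars", {1}),
    ("Motors and engines", set()),
    ("General Motors profits", {1}),
]
each_entity_word, each_entity_docs, total_entity_word, total_docs = build_counts(articles)
```

Because only distinct words are counted per article, every word_count is a document frequency, which is what lets the ratios below be read as probabilities.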
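> P(word | any document) should equal word_count in total_entity_word for that word, over doc_count in total_entity_sum.
> P(word | document mentions entity x) should equal word_count in each_entity_word for that word and entity_id x, over doc_count in each_entity_sum for entity_id x.
> P(word | document does not mention entity x) should equal (word_count in total_entity_word minus word_count in each_entity_word for that word and that entity) over (doc_count in total_entity_sum minus doc_count in each_entity_sum for that entity).
> P(document mentions entity x) should equal doc_count in each_entity_sum for that entity id, over doc_count in total_entity_sum.
> P(document does not mention entity x) should equal 1 minus (doc_count in each_entity_sum for x's entity id over doc_count in total_entity_sum).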
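As a quick sanity check of these five ratios, here is a small Python sketch with made-up counts. The function name `estimates` and all the numbers are hypothetical. The point is that the two conditional probabilities, weighted by the two priors, should add back up to P(word | any document).

```python
import math


def estimates(word_total, word_entity, docs_total, docs_entity):
    """Return the five ratios for one word and one entity x.

    word_total  : word_count in total_entity_word for the word
    word_entity : word_count in each_entity_word for the word and entity x
    docs_total  : doc_count in total_entity_sum
    docs_entity : doc_count in each_entity_sum for entity x
    """
    p_x = docs_entity / docs_total
    return {
        "p_word_any": word_total / docs_total,
        "p_word_given_x": word_entity / docs_entity,
        "p_word_given_not_x": (word_total - word_entity) / (docs_total - docs_entity),
        "p_x": p_x,
        "p_not_x": 1 - p_x,
    }


# Made-up counts: 100 articles, 25 mention x; the word is in 30 articles, 20 of them about x.
e = estimates(word_total=30, word_entity=20, docs_total=100, docs_entity=25)

# Law of total probability: P(w) = P(w|x) P(x) + P(w|not x) P(not x)
recombined = e["p_word_given_x"] * e["p_x"] + e["p_word_given_not_x"] * e["p_not_x"]
```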
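For a new article coming in, split it into words and just select where word in ('I', 'want', 'to', 'use'...) against either each_entity_word or total_entity_word. On the database platform I'm working with (MySQL), IN clauses are relatively well optimized.
Also, there is no product() aggregate function in SQL, so of course you can use sum(log(x)), or exp(sum(log(x))), to get the equivalent of product(x).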
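The log trick is easy to check outside the database. The Python sketch below uses made-up probabilities. It also shows a reason to stay in log space rather than applying exp() at the end: a product of many small probabilities underflows to zero in floating point, while the sum of logs stays finite. Note that log(x) requires x > 0, so a zero count needs smoothing before taking the log.

```python
import math

# Made-up per-word probabilities.
probs = [0.8, 0.5, 0.25]

direct_product = math.prod(probs)                        # plain product (Python 3.8+)
via_logs = math.exp(sum(math.log(p) for p in probs))     # exp(sum(log(x)))

# One hundred small factors: the direct product underflows, the log sum does not.
tiny = [1e-5] * 100
underflowed = math.prod(tiny)
log_sum = sum(math.log(p) for p in tiny)
```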
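So, if I get a new article, split it into distinct words, and put those words into a big IN() clause along with a potential entity id to test against, how can I get the naive Bayes probability that the article falls into that entity id's class, in SQL?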
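To make the target of the query concrete, here is a sketch in Python of one common way to compute that probability from the counts. It is an illustration under stated assumptions, not an established answer: the function name `posterior_mentions` and the sample counts are hypothetical, only words present in the article are scored (as with the IN() clause), add-one smoothing is an added assumption to avoid log(0), and both classes are assumed to contain at least one document.

```python
import math


def posterior_mentions(words, word_entity, word_total, docs_entity, docs_total):
    """Estimate P(document mentions entity x | words) with naive Bayes.

    word_entity : word -> word_count for entity x (each_entity_word)
    word_total  : word -> word_count over all articles (total_entity_word)
    docs_entity : doc_count for entity x (each_entity_sum)
    docs_total  : doc_count over all articles (total_entity_sum)
    """
    docs_other = docs_total - docs_entity

    # Log priors for the two classes.
    log_x = math.log(docs_entity / docs_total)
    log_not_x = math.log(docs_other / docs_total)

    for w in set(words):
        in_x = word_entity.get(w, 0)
        in_other = word_total.get(w, 0) - in_x
        # Add-one smoothing (assumption): +1 per count, +2 per class size.
        log_x += math.log((in_x + 1) / (docs_entity + 2))
        log_not_x += math.log((in_other + 1) / (docs_other + 2))

    # Normalise the two scores without overflowing exp().
    diff = log_x - log_not_x
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


# Made-up counts: 100 articles in total, 25 of which mention entity x.
word_entity = {"motors": 20, "general": 18}
word_total = {"motors": 30, "general": 25, "weather": 40}

p_hit = posterior_mentions(["general", "motors"], word_entity, word_total, 25, 100)
p_miss = posterior_mentions(["weather"], word_entity, word_total, 25, 100)
```

The two class scores are summed separately and only combined at the end, so the result is a ratio of probabilities rather than a ratio of logarithms.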
EDIT:
Attempt #1:
set @entity_id = 1;

-- In MySQL, assignment inside SELECT needs := (a plain = is a comparison)
select @entity_doc_count := doc_count from each_entity_sum where entity_id = @entity_id;

select @total_doc_count := doc_count from total_entity_sum;

-- Scoring expression as attempted: log prior plus the ratio of the two log sums
select
    exp(
        log(@entity_doc_count / @total_doc_count) +
        (
            sum(log((ifnull(ew.word_count, 0) + 1) / @entity_doc_count)) /
            sum(log(((aew.word_count + 1) - ifnull(ew.word_count, 0)) / (@total_doc_count - @entity_doc_count)))
        )
    ) as likelihood
from total_entity_word aew
left outer join each_entity_word ew
    on ew.word = aew.word and ew.entity_id = @entity_id
where aew.word in ('I', 'want', 'to', 'use'...);
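Solution
Alternatively, I'd recommend using an established stats package together with a connector to the database. This will make your application much more flexible if you want to switch from Naive Bayes to something more sophisticated: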
http://rpgsql.sourceforge.net/
bnd.pr> data(airquality)
bnd.pr> db.write.table(airquality, no.clobber = F)
bnd.pr> bind.proxy("airquality")
bnd.pr> summary(airquality)
Table name: airquality
Database: test
Host: localhost
Dimensions: 6 (columns) 153 (rows)

bnd.pr> print(airquality)
   Day Month Ozone Solar.R Temp
1    1     5    41     190   67
2    2     5    36     118   72
3    3     5    12     149   74
4    4     5    18     313   62
5    5     5    NA      NA   56
6    6     5    28      NA   66
7    7     5    23     299   65
8    8     5    19      99   59
9    9     5     8      19   61
10  10     5    NA     194   69
Continues for 143 more rows and 1 more cols...

bnd.pr> airquality[50:55,]
   Ozone Solar.R Wind Temp Month Day
50    12     120 11.5   73     6  19
51    13     137 10.3   76     6  20
52    NA     150  6.3   77     6  21
53    NA      59  1.7   76     6  22
54    NA      91  4.6   76     6  23
55    NA     250  6.3   76     6  24

bnd.pr> airquality[["Ozone"]]
  [1]  41  36  12  18  NA  28  23  19   8  NA   7  16  11  14  18  14  34   6
 [19]  30  11   1  11   4  32  NA  NA  NA  23  45 115  37  NA  NA  NA  NA  NA
 [37]  NA  29  NA  71  39  NA  NA  23  NA  NA  21  37  20  12  13  NA  NA  NA
 [55]  NA  NA  NA  NA  NA  NA  NA 135  49  32  NA  64  40  77  97  97  85  NA
 [73]  10  27  NA   7  48  35  61  79  63  16  NA  NA  80 108  20  52  82  50
 [91]  64  59  39   9  16  78  35  66 122  89 110  NA  NA  44  28  65  NA  22
[109]  59  23  31  44  21   9  NA  45 168  73  NA  76 118  84  85  96  78  73
[127]  91  47  32  20  23  21  24  44  21  28   9  13  46  18  13  24  16  13
[145]  23  36   7  14  30  NA  14  18  20
Then you'll want to install the e1071 package to do Naive Bayes. At the R prompt:
[ramanujan:~/base]$R

R version 2.7.2 (2008-08-25)
Copyright (C) 2008 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

~/.Rprofile loaded.
Welcome at Sun Apr 19 00:45:30 2009
> install.packages("e1071")
> install.packages("mlbench")
> library(e1071)
> ?naiveBayes
> example(naiveBayes)
More info: