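My first idea is to trim any leading sentences that contain only words from certain lists (e.g., stop words, plus words from the title, plus words from the SO corpus that correlate very weakly with tags, i.e., words that are equally likely to occur in any question regardless of its tags).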
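Solution
It sounds like you're interested in automatic text summarization. For a thorough overview of the problem, the issues involved, and the available algorithms, take a look at Das and Martins' paper A Survey on Automatic Text Summarization (2007).
Simple Algorithm
A simple but reasonably effective summarization algorithm is to select a limited number of sentences from the original text that contain the most frequent content words (i.e., the most frequent words, not counting stop list words).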
    Summarizer(originalText, maxSummarySize):
        // start with the raw freqs, e.g. [(10,'the'), (3,'language'), (8,'code')...]
        wordFrequences = getWordCounts(originalText)
        // filter, e.g. [(3,'language'), (8,'code')...]
        contentWordFrequences = filtStopWords(wordFrequences)
        // sort by freq & drop counts, e.g. ['code', 'language'...]
        contentWordsSortbyFreq = sortByFreqThenDropFreq(contentWordFrequences)

        // Split sentences
        sentences = getSentences(originalText)

        // Select up to maxSummarySize sentences
        setSummarySentences = {}
        foreach word in contentWordsSortbyFreq:
            firstMatchingSentence = search(sentences, word)
            setSummarySentences.add(firstMatchingSentence)
            if setSummarySentences.size() == maxSummarySize:
                break

        // construct summary out of selected sentences, preserving original ordering
        summary = ""
        foreach sentence in sentences:
            if sentence in setSummarySentences:
                summary = summary + " " + sentence

        return summary
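If you'd rather see the pseudocode above as something you can run, here is a minimal Python sketch of the same idea. The names `STOP_WORDS`, `split_sentences`, `tokenize`, and `summarize` are made up for illustration, the stop list is a tiny placeholder, and the regex-based sentence splitter is deliberately naive, so treat it as a starting point rather than a reference implementation.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption; real stop lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in",
              "and", "it", "for", "with", "this", "that"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase word tokens; digits and apostrophes stay inside words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def summarize(original_text, max_summary_size):
    # Count content words, i.e. everything that is not on the stop list.
    counts = Counter(w for w in tokenize(original_text) if w not in STOP_WORDS)
    # Most frequent first; the counts themselves are dropped.
    content_words = [w for w, _ in counts.most_common()]

    sentences = split_sentences(original_text)
    sentence_tokens = [set(tokenize(s)) for s in sentences]

    chosen = set()  # indices of the selected sentences
    for word in content_words:
        if len(chosen) >= max_summary_size:
            break
        # Pick the first sentence that contains the word as a whole token.
        for i, tokens in enumerate(sentence_tokens):
            if word in tokens:
                chosen.add(i)
                break

    # Rebuild the summary in the original sentence order.
    return " ".join(s for i, s in enumerate(sentences) if i in chosen)
```

Calling `summarize(text, 2)` returns at most two sentences, in the order they appear in `text`. Matching on whole tokens rather than substrings is a design choice of this sketch: it keeps a word like "code" from matching "decoder".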
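Some open-source packages that do summarization using this algorithm are:
Classifier4J (Java)
If you're using Java, you can use Classifier4J's SimpleSummarizer module.
Using the example found here, let's assume the original text is: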
Classifier4J is a java package for working with text. Classifier4J includes a summariser. A Summariser allows the summary of text. A Summariser is really cool. I don’t think there are any other java summarisers.
As shown in the following snippet, you can easily create a simple one-sentence summary:
    // Request a 1 sentence summary
    String summary = summariser.summarise(longOriginalText, 1);
Using the algorithm above, this will produce: Classifier4J includes a summariser.
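To see why that particular sentence gets picked, you can trace the word counts by hand. The Python sketch below does this for the sample text; the stop list and tokenizer here are assumptions for illustration and are not Classifier4J's actual ones, which may differ.

```python
import re
from collections import Counter

text = ("Classifier4J is a java package for working with text. "
        "Classifier4J includes a summariser. "
        "A Summariser allows the summary of text. "
        "A Summariser is really cool. "
        "I don't think there are any other java summarisers.")

# Small illustrative stop list (an assumption, not Classifier4J's own list).
stop_words = {"a", "is", "the", "of", "for", "with", "are", "i"}


def tokens(s):
    """Lowercase word tokens; digits and apostrophes stay inside words."""
    return re.findall(r"[a-z0-9']+", s.lower())


# Count content words and take the single most frequent one.
counts = Counter(w for w in tokens(text) if w not in stop_words)
top_word, top_count = counts.most_common(1)[0]

# Find the first sentence that contains that word as a whole token.
sentences = re.split(r"(?<=[.!?])\s+", text)
first_match = next(s for s in sentences if top_word in tokens(s))

print(top_word, top_count)  # summariser 3
print(first_match)          # Classifier4J includes a summariser.
```

"summariser" occurs three times (the plural "summarisers" is counted as a separate token), more than any other content word, and the second sentence is the first one to contain it.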
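NClassifier (C#)
If you're using C#, there is a port of Classifier4J to C# called NClassifier.
Tristan Havelick's Summarizer for NLTK (Python)
There is a work-in-progress Python port of Classifier4J's summarizer, built with Python's Natural Language Toolkit (NLTK), available here.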