Statistics – Given a document, select a relevant snippet

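When I ask a question here, the tooltips for the questions returned by the automatic search show the first bit of each question, but a decent percentage of them give no text that is any more useful for understanding the question than the title. Does anyone have an idea for a filter that trims the useless bits from a question?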

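My first idea is to trim any leading sentences that contain only words from some list (for instance, stop words, plus words from the title, plus words from the SO corpus that correlate only very weakly with tags, i.e. words that are equally likely to occur in any question regardless of its tags).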
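The trimming idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: the stop list is a tiny made-up sample, the sentence splitter is naive, the helper name trim_leading_filler is hypothetical, and the tag-correlation part of the word list is left out because it would need corpus statistics.

```python
import re

# Tiny illustrative stop list (an assumption); a real filter would use a larger one.
STOP_WORDS = frozenset("a an and are for how i in is it of the to with".split())

def tokenize(text):
    """Lowercase the text and extract word-like tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def trim_leading_filler(title, body, stop_words=STOP_WORDS):
    """Drop leading sentences made up only of stop words and title words."""
    ignorable = set(stop_words) | set(tokenize(title))
    sentences = split_sentences(body)
    for i, sentence in enumerate(sentences):
        # Keep everything from the first sentence that adds a new word.
        if any(word not in ignorable for word in tokenize(sentence)):
            return " ".join(sentences[i:])
    # Every sentence was filler: fall back to the untrimmed body.
    return body

print(trim_leading_filler(
    "How to trim useless text",
    "How to trim text? I want a filter for question previews. Thanks.",
))
# I want a filter for question previews. Thanks.
```

Falling back to the untrimmed body when nothing survives is a design choice made here so the tooltip is never empty.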

Solution

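It sounds like you're interested in automatic text summarization. For a good overview of the problem, the issues involved, and the available algorithms, take a look at Das and Martin's paper A Survey on Automatic Text Summarization (2007).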

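Simple algorithm

A simple but reasonably effective summarization algorithm is to select a limited number of sentences from the original text that contain the most frequent content words (i.e., the most frequent words once stop-list words are excluded).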

Summarizer(originalText, maxSummarySize):
   // start with the raw freqs, e.g. [(10,'the'), (3,'language'), (8,'code')...]
   wordFrequencies = getWordCounts(originalText)
   // filter out stop words, e.g. [(3,'language'), (8,'code')...]
   contentWordFrequencies = filterStopWords(wordFrequencies)
   // sort by freq (descending) & drop counts, e.g. ['code', 'language'...]
   contentWordsSortedByFreq = sortByFreqThenDropFreq(contentWordFrequencies)

   // Split sentences
   sentences = getSentences(originalText)

   // Select up to maxSummarySize sentences
   setSummarySentences = {}
   foreach word in contentWordsSortedByFreq:
      firstMatchingSentence = search(sentences, word)
      setSummarySentences.add(firstMatchingSentence)
      if setSummarySentences.size() == maxSummarySize:
         break

   // Construct the summary out of the selected sentences, preserving original ordering
   summary = ""
   foreach sentence in sentences:
      if sentence in setSummarySentences:
         summary = summary + " " + sentence

   return summary
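For readers who want something runnable, the pseudocode above might translate to Python roughly like this. It is a minimal sketch, not the code of any of the packages mentioned in this answer: the stop list is a small made-up sample, the regex-based sentence splitter and tokenizer are simplifying assumptions, and the names summarize, tokenize and split_sentences are chosen here for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption); real systems use a much larger one.
STOP_WORDS = frozenset("a an and are for i in is it of the to with".split())

def tokenize(text):
    """Lowercase the text and extract word-like tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def summarize(text, max_sentences):
    """Pick the first sentence containing each of the most frequent content words."""
    sentences = split_sentences(text)
    sentence_tokens = [set(tokenize(s)) for s in sentences]
    # Count content words only, i.e. everything that is not on the stop list.
    counts = Counter(w for w in tokenize(text) if w not in STOP_WORDS)

    selected = set()  # indices of the chosen sentences
    for word, _count in counts.most_common():
        if len(selected) >= max_sentences:
            break
        for index, tokens in enumerate(sentence_tokens):
            if word in tokens:
                selected.add(index)
                break

    # Emit the chosen sentences in their original order.
    return " ".join(sentences[i] for i in sorted(selected))

print(summarize("Code is fun. The code compiles. Tests are slow.", 1))
# Code is fun.
```

Storing sentence indices rather than sentence strings keeps the original ordering easy to restore and avoids confusing two identical sentences.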

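Some open source packages that do summarization using this algorithm are:

Classifier4J (Java)

If you're using Java, you can use Classifier4J's SimpleSummarizer module.

Using the example linked from the original answer, let's assume the original text is: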

Classifier4J is a java package for working with text. Classifier4J includes a summariser. A Summariser allows the summary of text. A Summariser is really cool. I don’t think there are any other java summarisers.

As the following snippet shows, you can easily create a simple one-sentence summary:

// Request a 1 sentence summary
String summary = summariser.summarise(longOriginalText,1);

Using the algorithm above, this will produce: Classifier4J includes a summariser.
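To see why that sentence is the one selected, it helps to count the content words in the sample text by hand. The Python sketch below does this under assumptions of its own: the stop list is a small made-up sample rather than Classifier4J's actual list, and the tokenizer is a plain regex, so it illustrates the reasoning rather than the library's internals.

```python
import re
from collections import Counter

text = ("Classifier4J is a java package for working with text. "
        "Classifier4J includes a summariser. "
        "A Summariser allows the summary of text. "
        "A Summariser is really cool. "
        "I don't think there are any other java summarisers.")

# Small illustrative stop list (an assumption, not Classifier4J's actual list).
stop_words = {"a", "is", "for", "with", "the", "of", "i", "there", "are", "any", "other"}

def tokenize(s):
    """Lowercase the text and extract word-like tokens."""
    return re.findall(r"[a-z0-9']+", s.lower())

sentences = re.split(r"(?<=\.)\s+", text)
counts = Counter(w for w in tokenize(text) if w not in stop_words)

# The most frequent content word decides which sentence is picked first.
top_word, top_count = counts.most_common(1)[0]
first_match = next(s for s in sentences if top_word in tokenize(s))

print(top_word, top_count)  # summariser 3
print(first_match)          # Classifier4J includes a summariser.
```

The word "summariser" occurs three times, more than any other content word, and the second sentence is the first one that contains it.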

NClassifier (C#)

If you're using C#, there is a port of Classifier4J to C# called NClassifier.

Tristan Havelick's Summarizer for NLTK (Python)

There is also a work-in-progress Python port of Classifier4J's summarizer, built on Python's Natural Language Toolkit (NLTK).

Original link: https://www.f2er.com/html/225129.html
