[Python] Word Frequency Counting

  • Goal: given an article, which words appear in it, and which words appear most often?

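    Word frequency in English text

    English text: word-frequency analysis of Hamlet

    Counting English word frequencies takes two steps:

    • Clean and normalize the text (lowercase it and replace punctuation with spaces)
    • Store the frequencies in a dictionary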

      Code

      # CalHamletV1.py
      def getText():
          # Read the whole file and lowercase it so "The" and "the" count as one word
          with open("hamlet.txt", "r") as f:
              txt = f.read()
          txt = txt.lower()
          for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
              txt = txt.replace(ch, " ")   # replace special characters with spaces
          return txt

      hamletTxt = getText()
      words = hamletTxt.split()
      counts = {}
      for word in words:
          counts[word] = counts.get(word, 0) + 1   # unseen words start at 0
      items = list(counts.items())
      items.sort(key=lambda x: x[1], reverse=True)   # sort by count, descending
      for i in range(10):
          word, count = items[i]
          print("{0:<10}{1:>5}".format(word, count))

      Output:

      the        1138
      and         965
      to          754
      of          669
      you         550
      i           542
      a           542
      my          514
      hamlet      462
      in          436
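The two steps above (clean the text, then count with a dictionary) can also be sketched with the standard library's `collections.Counter` and `str.translate`. This is a minimal alternative sketch, not the article's own code: the function name `count_words` and the sample sentence are made up for illustration, and `string.punctuation` also strips the apostrophe, which the character list in CalHamletV1.py does not.

```python
from collections import Counter
import string


def count_words(text, top_n=10):
    """Return the top_n (word, count) pairs, ignoring case and ASCII punctuation."""
    # Build a table that maps every ASCII punctuation character to a space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    # Counter is a dict subclass; most_common sorts by count, descending
    return Counter(words).most_common(top_n)


# Hypothetical sample text standing in for the contents of hamlet.txt
sample = "To be, or not to be: that is the question."
for word, count in count_words(sample, top_n=3):
    print("{0:<10}{1:>5}".format(word, count))
```

`Counter` replaces the `counts.get(word, 0) + 1` idiom and the manual sort, while `translate` makes a single pass over the text instead of one `replace` call per punctuation character.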

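      Word frequency in Chinese text

      Chinese text: analysis of the characters in Romance of the Three Kingdoms (《三国演义》)

      Counting Chinese word frequencies takes two steps:

      • Segment the Chinese text into words
      • Store the frequencies in a dictionary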
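        # CalThreeKingdomsV1.py
        import jieba

        with open("threekingdoms.txt", "r", encoding="utf-8") as f:
            txt = f.read()
        words = jieba.lcut(txt)   # segment the text into a list of words
        counts = {}
        for word in words:
            if len(word) == 1:    # skip single-character tokens
                continue
            else:
                counts[word] = counts.get(word, 0) + 1
        items = list(counts.items())
        items.sort(key=lambda x: x[1], reverse=True)   # sort by count, descending
        for i in range(15):
            word, count = items[i]
            print("{0:<10}{1:>5}".format(word, count))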

        Output:

        曹操      953
        孔明  836
        将军  772
        却说  656
        玄德  585
        关公  510
        丞相  491
        二人  469
        不可  440
        荆州  425
        玄德曰     390
        孔明曰     390
        不能  384
        如此  378
        张飞  358

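        The output clearly contains entries that are irrelevant (such as 将军 and 却说) or that duplicate one another (such as 玄德 and 玄德曰).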

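        Improved version

        This time, counting Chinese word frequencies takes three steps:

        • Segment the Chinese text into words
        • Store the frequencies in a dictionary
        • Extend the program to handle the problems found above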

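          We put the irrelevant entries in the excludes set so that they can be removed, and we merge different names for the same character under a single name.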

          # CalThreeKingdomsV2.py
          import jieba

          excludes = {"将军","却说","荆州","二人","不可","不能","如此"}
          with open("threekingdoms.txt", "r", encoding="utf-8") as f:
              txt = f.read()
          words = jieba.lcut(txt)
          counts = {}
          for word in words:
              # Skip single characters, then merge aliases of the same character
              if len(word) == 1:
                  continue
              elif word == "诸葛亮" or word == "孔明曰":
                  rword = "孔明"
              elif word == "关公" or word == "云长":
                  rword = "关羽"
              elif word == "玄德" or word == "玄德曰":
                  rword = "刘备"
              elif word == "孟德" or word == "丞相":
                  rword = "曹操"
              else:
                  rword = word
              counts[rword] = counts.get(rword, 0) + 1
          for word in excludes:
              counts.pop(word, None)   # drop excluded words; ignore any that never appeared
          items = list(counts.items())
          items.sort(key=lambda x: x[1], reverse=True)
          for i in range(15):   # 15 entries, as in V1 (assumed)
              word, count = items[i]
              print("{0:<10}{1:>5}".format(word, count))
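As the alias list grows, the chain of `elif` branches gets long. One possible restructuring, sketched here under assumptions, keeps the aliases in a dictionary and looks each word up once. The names `ALIASES`, `EXCLUDES`, and `count_characters` are hypothetical, and the `tokens` list is a hand-written stand-in for the output of `jieba.lcut`, so the sketch runs without jieba installed.

```python
from collections import Counter

# Alias table: each variant maps to one canonical name (pairs taken from V2 above)
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}
EXCLUDES = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}


def count_characters(words):
    """Count words, merging aliases and skipping excluded or single-character tokens."""
    counts = Counter()
    for word in words:
        if len(word) == 1:
            continue
        name = ALIASES.get(word, word)   # fall back to the word itself
        if name not in EXCLUDES:
            counts[name] += 1
    return counts


# Hand-written tokens standing in for jieba.lcut(txt)
tokens = ["玄德", "曰", "却说", "玄德曰", "孔明", "诸葛亮", "丞相", "曹操", "将军"]
print(count_characters(tokens).most_common())
```

Filtering excluded words while counting also avoids the `KeyError` that `del counts[word]` raises when an excluded word never occurs in the text.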

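          Word frequency for the postgraduate entrance exam English test

          Applying word-frequency counting to the English test of the postgraduate entrance exam lets us find the key words that appear most often.
          Text download: https://pan.baidu.com/s/1Q6uVy-wWBpQ0VHvNI_DQxA  Password: fw3r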

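          # CalHamletV1.py
          def getText():
              # Same cleanup as the Hamlet version (assumed)
              with open("86_17_1_2.txt", "r") as f:
                  txt = f.read()
              txt = txt.lower()
              for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
                  txt = txt.replace(ch, " ")   # replace special characters with spaces
              return txt

          pyTxt = getText()        # text with punctuation removed
          words = pyTxt.split()    # list of words
          counts = {}              # dictionary mapping word -> count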
          excludes = {"the","a","of","to","and","in","b","c","d","is",\
                      "was","are","have","were","had","that","for","it",\
                      "on","be","as","with","by","not","their","they",\
                      "from","more","but","or","you","at","has","we","an",\
                      "this","can","which","will","your","one","he","his","all","people","should","than","points","there","i","what","about","new","if","”",\
                      "its","been","part","so","who","would","answer","some","our","may","most","do","when","1","text","section","2","many","time","into",\
                      "10","no","other","up","following","【答案】","only","out","each","much","them","such","world","these","sheet","life","how","because","3","even",\
                      "work","directions","use","could","now","first","make","years","way","20","those","over","also","best","two","well","15","us","write","4","5","being","social","read","like","according","just","take","paragraph","any","english","good","after","own","year","must","american","less","her","between","then","children","before","very","human","long","while","often","my","too",\
                      "40","four","research","author","questions","still","last","business","education","need","information","public","says","passage","reading","through","women","she","health","example","help","get","different","him","mark","might","off","job","30","writing","choose","words","economic","become","science","society","without","made","high","students","few","better","since","6","rather","however","great","where","culture","come",\
                      "both","three","same","government","old","find","number","means","study","put","8","change","does","today","think","future","school","yet","man","things","far","line","7","13","50","used","states","down","12","14","16","end","11","making","9","another","young","system","important","letter","17","chinese","every","see","s","test","word","century","language","little",\
                      "give","said","25","state","problems","sentence","food","translation","given","child","18","longer","question","back","don’t","19","against","always","answers","know","having","among","instead","comprehension","large","35","want","likely","keep","family","go","why","41","home","law","place","look","day","men","22","26","45","it’s","others","companies","countries","once","money","24","though",\
                      "27","29","31","say","national","ii","23","based","found","28","32","past","living","university","scientific","–","36","38","working","around","data","right","21","jobs","33","34","possible","feel","process","effect","growth","probably","seems","fact","below","37","39","history","technology","never","sentences","47","true","scientists","power","thought","during","48","early","parents",\
                      "something","market","times","46","certain","whether","000","did","enough","problem","least","federal","age","idea","learn","common","political","pay","view","going","attention","happiness","moral","show","live","until","52","49","ago","percent","stress","43","44","42","meaning","51","e","iii","u","60","anything","53","55","cultural","nothing","short","100","water","car","56","58","【解析】","54","59","57","v","。","63","64","65","61","62","66","70","75","f","【考点分析】","67","here","68","71","72","69","73","74","选项a","ourselves","teachers","helps","参考范文","gdp","yourself","gone","150"}
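          for word in words:
              if word not in excludes:
                  counts[word] = counts.get(word, 0) + 1
          items = list(counts.items())
          items.sort(key=lambda x: x[1], reverse=True)
          for i in range(min(100, len(items))):   # first page of 100, matching the paging below (assumed)
              word, count = items[i]
              print("{0:<10}{1:>5}".format(word, count))

          x = len(counts)
          print(x)   # number of distinct words left after filtering

          r = 0

          # Each further page prints the next 100 words as quoted, comma-separated strings
          choice = input("Enter 1 to continue: ")

          while choice == "1":
              r += 100
              for i in range(r, min(r + 100, len(items))):   # stop at the end of the list
                  word, count = items[i]
                  print("\"{}\"".format(word), end=",")
              choice = input("Enter 1 to continue: ")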
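The paging loop above steps through the sorted list 100 entries at a time. A minimal sketch of the same idea using slicing is shown below; the helper name `pages` and the generated `items` list are assumptions for illustration only. Slices that run past the end of a list are simply shorter, so no index check is needed.

```python
def pages(items, size=100):
    """Yield consecutive slices of items, each at most size entries long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Stand-in data; in the script above this would be the sorted (word, count) list
items = [("word{}".format(i), 250 - i) for i in range(250)]

for page in pages(items, size=100):
    # Print each page as quoted, comma-separated words
    print(",".join('"{}"'.format(word) for word, _ in page))
```

In an interactive script, the `for` loop could call `input()` after each page and `break` when the user enters anything other than 1.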
Original article: https://www.f2er.com/python/997019.html
