I have a file like this:
This is a file with many words. Some of the words appear more than once. Some of the words only appear one time.
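I would like to generate a two-column list. The first column shows each word that appears, and the second column shows how often it appears, for example: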
this@1
is@1
a@1
file@1
with@1
many@1
words@3
some@2
of@2
the@2
only@1
appear@2
more@1
than@1
one@1
once@1
time@1
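>To make this simpler, I will remove all punctuation and change all text to lowercase before processing the list.
>Unless there is a simple way around it, "words" and "word" can count as two separate words.
So far, I have this: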
sed -i "s/ /\n/g" ./file1.txt # put all words on a new line
while read line
do
    count="$(grep -c $line file1.txt)"
    echo $line"@"$count >> file2.txt # add word and frequency to file
done < ./file1.txt
sort -u -d # remove duplicate lines
For some reason, this only shows "0" after each word.
Not sed and grep, but tr, sort, uniq, and awk:
% (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words. Some of the words appear more than once. Some of the words only appear one time.
EOF
a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1
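The output above still treats "words" and "words." as different entries, and "Some" keeps its capital letter, because the pipeline does not yet do the lowercasing and punctuation removal the question mentions. A minimal sketch of one way to add those steps is shown here. The function name `word_freq` is made up for illustration and is not part of the original answer; the sketch also assumes plain ASCII text.

```shell
# word_freq (hypothetical helper): read text on stdin, print word@count lines.
# usage: word_freq < file1.txt
word_freq() {
    tr '[:upper:]' '[:lower:]' |      # fold everything to lowercase
    tr -d '[:punct:]' |               # delete punctuation characters
    tr -s '[:space:]' '\n' |          # one word per line, squeezing repeated whitespace
    sort |                            # group identical words together
    uniq -c |                         # prefix each distinct word with its count
    awk 'NF == 2 {print $2"@"$1}'     # skip blank entries, print word@count
}
```

Because `tr -d '[:punct:]'` removes apostrophes and hyphens as well, a word like "don't" would become "dont". If that matters for your text, narrow the set of characters being deleted.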