I'm playing with study, a Perl feature that examines a string so that subsequent regular expression matches against it might be faster:
    while ( <> ) {
        study;
        $count++ if /PATTERN/;
        $count++ if /OTHER/;
        $count++ if /PATTERN2/;
    }
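What sorts of situations benefit greatly from this? You can pick out a few things from the docs:
> Patterns with constant strings
> Multiple patterns
> Shorter target strings might do better (less time spent studying)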
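I'm looking for concrete cases where I can not only demonstrate a big advantage, but can also tweak things slightly to wipe out that advantage. One of the warnings in the docs is that you should benchmark individual cases. I'd like to find some edge cases where a small difference in a string (or pattern) makes a big difference in performance.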
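A minimal sketch of the kind of per-case comparison I have in mind, using the standard Benchmark module. The target string, the pattern list, and the helper `count_matches` are all made up for illustration; swap in real data to probe a specific case. Also note that, as far as I know, study has been a no-op since Perl 5.16, so on a modern perl the two rows should differ only by noise.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: a random lowercase target string and a few
# constant-string patterns (both are assumptions, not from the docs).
my $target   = join '', map { ('a' .. 'z')[ rand 26 ] } 1 .. 2_000;
my @patterns = map { qr/$_/ } qw(perl study regex benchmark);

# Count how many patterns match a copy of the string,
# optionally calling study on that copy first.
sub count_matches {
    my ($string, $use_study) = @_;
    study $string if $use_study;
    my $count = 0;
    for my $re (@patterns) {
        $count++ if $string =~ $re;
    }
    return $count;
}

# Run each variant for at least one CPU second and print a comparison table.
# Code references (closures) are used so the lexical variables stay visible.
cmpthese(-1, {
    plain   => sub { count_matches($target, 0) },
    studied => sub { count_matches($target, 1) },
});
```

Changing the length of `$target` or the contents of `@patterns` and rerunning is one way to hunt for the edge cases described above.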
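If you haven't used study, please don't answer. I'd rather have correct answers than quick guesses. There's no urgency here, and this isn't holding up any work.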
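And, as a bonus, I've been playing with a benchmark tool that compares two NYTProf runs, which I'd rather use than the usual benchmarking tool. If I come up with a way to automate that, I'll share it too.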
Solution
Google turned up this lovely test scenario:
    #!/usr/bin/perl
    #
    # Exercise 7.8
    #
    # This is a more difficult exercise. The study function in Perl may speed up searches
    # for motifs in DNA or protein. Read the Perl documentation on this function. Its use
    # is simple: given some sequence data in a variable $sequence, type:
    #
    #     study $sequence;
    #
    # before doing the searches. Do you think study will speed up searches in DNA or
    # protein, based on what you've read about it in the documentation?
    #
    # For lots of extra credit! Now read the Perl documentation on the standard module
    # Benchmark. (Type perldoc Benchmark, or visit the Perl home page at
    # http://www.perl.com.) See if your guess is right by writing a program that benchmarks motif
    # searches of DNA and of protein, with and without study.
    #
    # Answer to Exercise 7.8

    use strict;
    use warnings;
    use Benchmark;

    my $dna = join ('',qw(
    agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
    tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
    gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
    tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
    cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
    cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
    gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgat
    cgggtgtgacaactgcaatgagtggttccatggggactgcatccggatca
    ctgagaagatggccaaggccatccgggagtggtactgtcgggagtgcaga
    gagaaagaccccaagctagagattcgctatcggcacaagaagtcacggga
    gcgggatggcaatgagcgggacagcagtgagccccgggatgagggtggag
    ggcgcaagaggcctgtccctgatccagacctgcagcgccgggcagggtca
    gggacaggggttggggccatgcttgctcggggctctgcttcgccccacaa
    atcctctccgcagcccttggtggccacacccagccagcatcaccagcagc
    agcagcagcagatcaaacggtcagcccgcatgtgtggtgagtgtgaggca
    tgtcggcgcactgaggactgtggtcactgtgatttctgtcgggacatgaa
    gaagttcgggggccccaacaagatccggcagaagtgccggctgcgccagt
    gccagctgcgggcccgggaatcgtacaagtacttcccttcctcgctctca
    ccagtgacgccctcagagtccctgccaaggccccgccggccactgcccac
    ccaacagcagccacagccatcacagaagttagggcgcatccgtgaagatg
    agggggcagtggcgtcatcaacagtcaaggagcctcctgaggctacagcc
    acacctgagccactctcagatgaggaccta
    ));

    my $protein = join('',qw(
    MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD
    SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQ
    GLQYALNVPISVKQEITFTDVSEQLMRDKKQIR
    ));

    my $count = 1000;

    print "DNA pattern matches without 'study' function:\n";
    timethis($count,
        ' for(my $i=1 ; $i < 10000; ++$i) {
            $dna =~ /aggtc/;
            $dna =~ /aatggccgt/;
            $dna =~ /gatcgatcagctagcat/;
            $dna =~ /gtatgaac/;
            $dna =~ /[ac][cg][gt][ta]/;
            $dna =~ /ccccccccc/;
        } '
    );

    print "\nDNA pattern matches with 'study' function:\n";
    timethis($count,
        ' study $dna;
        for(my $i=1 ; $i < 10000; ++$i) {
            $dna =~ /aggtc/;
            $dna =~ /aatggccgt/;
            $dna =~ /gatcgatcagctagcat/;
            $dna =~ /gtatgaac/;
            $dna =~ /[ac][cg][gt][ta]/;
            $dna =~ /ccccccccc/;
        } '
    );

    print "\nProtein pattern matches without 'study' function:\n";
    timethis($count,
        ' for(my $i=1 ; $i < 10000; ++$i) {
            $protein =~ /PH.EI/;
            $protein =~ /KFTEQGESMRLY/;
            $protein =~ /[YAL][NVP][ISV][KQE]/;
            $protein =~ /DKKQIR/;
            $protein =~ /[MD][VT][HQ][ER]/;
            $protein =~ /NVPISVKQEITFTDVSEQL/;
        } '
    );

    print "\nProtein pattern matches with 'study' function:\n";
    timethis($count,
        ' study $protein;
        for(my $i=1 ; $i < 10000; ++$i) {
            $protein =~ /PH.EI/;
            $protein =~ /KFTEQGESMRLY/;
            $protein =~ /[YAL][NVP][ISV][KQE]/;
            $protein =~ /DKKQIR/;
            $protein =~ /[MD][VT][HQ][ER]/;
            $protein =~ /NVPISVKQEITFTDVSEQL/;
        } '
    );
Note that even in the most favorable case (protein matching), the reported gain is only about 2%:
    # $ perl exer07.08
    # On my computer, this is the output I get: your results probably vary.
    # DNA pattern matches without 'study' function:
    # timethis 1000: 29 wallclock secs (29.25 usr + 0.00 sys = 29.25 cpu) @ 34.19/s (n=1000)
    #
    # DNA pattern matches with 'study' function:
    # timethis 1000: 30 wallclock secs (29.21 usr + 0.15 sys = 29.36 cpu) @ 34.06/s (n=1000)
    #
    # Protein pattern matches without 'study' function:
    # timethis 1000: 32 wallclock secs (29.47 usr + 0.04 sys = 29.51 cpu) @ 33.89/s (n=1000)
    #
    # Protein pattern matches with 'study' function:
    # timethis 1000: 30 wallclock secs (28.97 usr + 0.02 sys = 28.99 cpu) @ 34.49/s (n=1000)
    #
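One caveat worth checking before trusting these numbers: the script passes its code to timethis as strings. Benchmark compiles string code inside its own module, so file-scoped `my` variables such as `$dna` and `$protein` are, as far as I understand it, not visible there, and the names resolve to undefined package variables instead. That may be why all four rates come out nearly identical. The sketch below is a small check of that behavior, not part of the original exercise; the names `$seen_by_string` and `$seen_by_closure` are invented for the demonstration, and the sequence is just the first chunk of the DNA above.

```perl
use strict;
use warnings;
use Benchmark qw(timeit);

# A file-scoped lexical, declared the same way as in the exercise.
my $dna = 'agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg';

# Package variables used only to record what each style of code can see.
our $seen_by_string  = 'unset';
our $seen_by_closure = 'unset';

# String form: compiled inside Benchmark.pm, outside the scope of "my $dna".
timeit(1, '$main::seen_by_string = defined($dna) ? "defined" : "undefined"');

# Code-reference form: a closure compiled here, so it captures "my $dna".
timeit(1, sub { $seen_by_closure = defined($dna) ? "defined" : "undefined" });

print "string code sees \$dna as:  $seen_by_string\n";
print "closure code sees \$dna as: $seen_by_closure\n";
```

If the string version reports the variable as undefined on your perl, rerunning the exercise with code references (or with `our` variables) should give timings that actually exercise the sequences.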