【正则表达式】pyahocorasick介绍

前端之家收集整理的这篇文章主要介绍了【正则表达式】pyahocorasick介绍前端之家小编觉得挺不错的,现在分享给大家,也给大家做个参考。

pyahocorasick是个python模块,由两种数据结构实现:trie和Aho-Corasick自动机。

Trie是一个字符串索引的词典,检索相关项时时间和字符串长度成正比。

AC自动机能够在一次运行中找到给定集合所有字符串。AC自动机其实就是在Trie树上实现KMP,可以完成多模式串的匹配。
(推荐学习资料:http://www.jb51.cc/article/p-sglormit-cu.htmlhttp://www.cnblogs.com/kuangbin/p/3164106.html

作者

Wojciech Muła,wojciech_mula@poczta.onet.pl
官方地址:
https://pypi.python.org/pypi/pyahocorasick/

安装

python setup.py install

API

模块ahocorasick包含几个常量和Automaton类

Unicode和bytes

Automaton接受和返回的字符串既可以是unicode,也可以是bytes,取决于编译阶段设置。用模块成员unicode表示选择的类型。

需要注意的是:如果选择使用unicode,trie存储每个字母时用2或者4个bytes;如果选择bytes,每个字母只需要一个byte。

constants

  • unicode:
  • STORE_ANY,STORE_INTS,STORE_LENGTH
  • EMPTY,TRIE,AHOCORASICK
  • MATCH_EXACT_LENGTH,MATCH_AT_MOST_PREFIX,MATCH_AT_LEAST_PREFIX

Antomaton类

略。有兴趣了解请看官网介绍。

AutomatonSearchIter类

略。有兴趣了解请看官网介绍。

示例

import ahocorasick
A = ahocorasick.Automaton()

# 向trie树中添加单词
for index,word in enumerate("he her hers she".split()):
    A.add_word(word,(index,word))
# 用法分析add_word(word,[value]) => bool
# 根据Automaton构造函数的参数store设置,value这样考虑:
# 1. 如果store设置为STORE_LENGTH,不能传递value,默认保存len(word)
# 2. 如果store设置为STORE_INTS,value可选,但必须是int类型,默认是len(automaton)
# 3. 如果store设置为STORE_ANY,value必须写,可以是任意类型

# 测试单词是否在树中
if "he" in A:
    print True
else:
    print False
A.get("he")
# (0,'he')
A.get("cat","<not exists>")
# '<not exists>'
A.get("dog")
# KeyError

# 将trie树转化为Aho-Corasick自动
A.make_automaton()

# 找到所有匹配字符串
for item in A.iter("_hershe_"):
    print item
#(2,(0,'he'))
#(3,(1,'her'))
#(4,(2,'hers'))
#(6,(3,'she'))
#(6,'he'))

示例2

import ahocorasick
A = ahocorasick.Automaton()

# 添加单词
for index,word in enumerate("cat catastropha rat rate bat".split()):
    A.add_word(word,word))

# prefix
list(A.keys("cat"))
## ["cat","catastropha"]

list(A.keys("?at","?",ahocprasick.MATCH_EXACT_LENGTH))
## ['bat','cat','rat']

list(A.keys("?at?",ahocorasick.MATCH_AT_MOST_PREFIX))
## ["bat","cat","rat","rate"]

list(A.keys("?at?",ahocorasick.MATCH_AT_LEAST_PREFIX))
## ['rate']
## keys用法分析
## keys([prefix,[wildcard,[how]]]) => yield strings
## If prefix (a string) is given,then only words sharing this prefix are yielded.
## If wildcard (single character) is given,then prefix is treated as a simple pattern with selected wildcard. Optional parameter how controls which strings are matched:
## MATCH_EXACT_LENGTH [default]:Only strings with the same length as a pattern’s length are yielded. In other words,literally match a pattern.
## MATCH_AT_LEAST_PREFIX:Strings that have length greater or equal to a pattern’s length are yielded.
## MATCH_AT_MOST_PREFIX:Strings that have length less or equal to a pattern’s length are yielded.
原文链接:/regex/359172.html

猜你在找的正则表达式相关文章