I am writing an HTML parser that uses TagSoup to pass a well-formed structure to XmlSlurper.
Here is the general code:
    def htmlText = """
    <html>
    <body>
    <div id="divId" class="divclass">
    <h2>Heading 2</h2>
    <ol>
    <li><h3><a class="Box" href="#href1">href1 link text</a> <span>extra stuff</span></h3><address>Here is the address<span>Telephone number: <strong>telephone</strong></span></address></li>
    <li><h3><a class="Box" href="#href2">href2 link text</a> <span>extra stuff</span></h3><address>Here is another address<span>Another telephone: <strong>0845 1111111</strong></span></address></li>
    </ol>
    </div>
    </body>
    </html>
    """

    def html = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText(htmlText)

    html.'**'.grep { it.@class == 'divclass' }.ol.li.each { linkItem ->
        def link = linkItem.h3.a.@href
        def address = linkItem.address.text()
        println "$link: $address\n"
    }
I expected the each to hand me every 'li' in turn, so that I could retrieve the corresponding href and address details. Instead, I get this output:

    #href1#href2: Here is the addressTelephone number: telephoneHere is another addressAnother telephone: 0845 1111111
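I have checked various examples online, but they either deal with XML or are one-liners such as "retrieve all links from this file". It seems that the it.h3.a.@href expression is collecting every href in the text, even though I am passing it the parent 'li' node.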
Could you let me know:
1. Why I am getting the output shown
2. How I can retrieve the href/address pair for each 'li' item
Thanks.
Solution
Replace grep with find:
    html.'**'.find { it.@class == 'divclass' }.ol.li.each { linkItem ->
        def link = linkItem.h3.a.@href
        def address = linkItem.address.text()
        println "$link: $address\n"
    }
Then you will get:

    #href1: Here is the addressTelephone number: telephone

    #href2: Here is another addressAnother telephone: 0845 1111111
The difference is that grep returns an ArrayList, whereas find returns a single NodeChild:
    println html.'**'.grep { it.@class == 'divclass' }.getClass()
    println html.'**'.find { it.@class == 'divclass' }.getClass()
The result is:

    class java.util.ArrayList
    class groovy.util.slurpersupport.NodeChild
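Property access on a list is applied to each element and the results are collected into a new list, so with grep the expression .ol.li is still a one-element list whose single entry holds both 'li' nodes. The each therefore runs only once, and linkItem.h3.a.@href concatenates both hrefs. If you want to keep grep, you can nest a second each like this to make it work: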
    html.'**'.grep { it.@class == 'divclass' }.ol.li.each {
        it.each { linkItem ->
            def link = linkItem.h3.a.@href
            def address = linkItem.address.text()
            println "$link: $address\n"
        }
    }
Long story short: in your case, use find instead of grep.
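
If you want to see the difference in isolation, here is a minimal sketch. It is not your parser: it feeds a small, already well-formed snippet straight to XmlSlurper so that it runs without TagSoup, and the names xml, root, viaGrep, viaFind, liFromGrep and liFromFind are made up for the illustration. The import assumes Groovy 3 or later; on Groovy 2.x you should be able to drop it, since groovy.util.XmlSlurper is imported by default there.

```groovy
import groovy.xml.XmlSlurper  // assumption: Groovy 3+; not needed on Groovy 2.x

// Hypothetical, cut-down version of the markup from the question
def xml = '''
<root>
  <div class="divclass">
    <ol>
      <li><a href="#href1">one</a></li>
      <li><a href="#href2">two</a></li>
    </ol>
  </div>
</root>
'''

def root = new XmlSlurper().parseText(xml)

// grep collects every match into a List, even when there is only one match
def viaGrep = root.'**'.grep { it.@class == 'divclass' }
// Property access on a List is applied per element, so this is still a
// one-element List; its single entry holds both <li> nodes
def liFromGrep = viaGrep.ol.li
assert liFromGrep instanceof List
assert liFromGrep.size() == 1
// Asking that single entry for the hrefs concatenates them
assert liFromGrep[0].a.@href.text() == '#href1#href2'

// find returns the first matching node itself, so .ol.li yields the two
// <li> nodes and each/collect visit them one at a time
def viaFind = root.'**'.find { it.@class == 'divclass' }
def liFromFind = viaFind.ol.li
assert liFromFind.size() == 2
assert liFromFind.collect { it.a.@href.text() } == ['#href1', '#href2']
```

The asserts mirror the two outputs above: one pass with merged values for grep, one pass per 'li' for find.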