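On this page, I have to click the Busqueda de Companias tab, and that is where the interesting part begins. When I click it, I get the next screen:
On this page, I have to select the Nombre option and then enter a string containing the company name. For example, if I enter the string PROAÑO & ASOCIADOS CIA. LTDA., I get the next screen:
This screen shows the information for that company. I then have to click the Informacion Estados Financieros tab, which takes me to the next screen:
On this final screen, I have to click the Estado Situacion tab, and I want to collect the company's information from the columns Codigo de la cuenta contable, Nombre de la cuenta contable and Valor. I would like to save this information in a data frame. The hardest part for me is the sequence where I have to select the Nombre element, enter a string, click Buscar, and then keep clicking until I reach the Informacion Estados Financieros tab. I have tried html_session and html_form from the rvest package, but the elements come back empty.
Could you help me solve this problem?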
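Solution
This is a self-contained code example that uses the web site referenced in the question.
Note: please do not run this code.
Why? Having 1k Stack users hit the web site would amount to a DDoS attack.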
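Introduction
Prerequisites
> Install Firefox and the Selenium IDE add-on: https://addons.mozilla.org/en-US/firefox/addon/selenium-ide/
> Install RStudio [recommended]
> Create a project and open the code file below
The code below will take you from the second page [http://appscvs.supercias.gob.ec/portaldeinformacion/consulta_cia_param.zul] to the final page, where the information you are interested in is…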
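Useful references:
If you are interested in using RSelenium, I strongly recommend reading the following references. Thanks to John Harrison for developing the RSelenium package.
- RSelenium Basics
- RSelenium Headless Browsing
- RSelenium Vignette
Code example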
    # We want to make this as easy as possible to use,
    # so we install the required packages for the user...
    if (!require(RSelenium)) install.packages("RSelenium")
    if (!require(XML)) install.packages("XML")
    if (!require(RJSONIO)) install.packages("RJSONIO")
    if (!require(stringr)) install.packages("stringr")

    # Data
    mainPage <- "http://appscvs.supercias.gob.ec/portalInformacion/sector_societario.zul"
    businessPage <- "http://appscvs.supercias.gob.ec/portaldeinformacion/consulta_cia_param.zul"

    # Start the server
    # We assume RSelenium is not set up, so we check whether the Selenium
    # server is available and install it if it is not.
    # (Newer RSelenium releases mark checkForServer() and startServer() as
    # defunct; rsDriver() is the replacement there.)
    RSelenium::checkForServer()
    # OK, now we start the server
    RSelenium::startServer()
    remDr <- RSelenium::remoteDriver$new()

    # We assume the user has installed Firefox and the Selenium IDE
    # https://addons.mozilla.org/en-US/firefox/addon/selenium-ide/
    #
    # Open a Firefox window...
    remDr$open(silent = TRUE)
    # ...and navigate to the page that matters
    remDr$navigate(businessPage)

    # First things first: on the first page, get the radio buttons,
    # the name element and the search button. We need all three.
    radioButton <- remDr$findElements(using = 'css selector', ".z-radio-cnt")
    nameElement <- remDr$findElements(using = 'css selector', ".z-comboBox-inp")
    searchButton <- remDr$findElements(using = 'css selector', ".z-button-cm")

    # Optional: highlight the elements that were returned
    # lapply(radioButton, function(x){x$highlightElement()})
    # lapply(nameElement, function(x){x$highlightElement()})
    # lapply(searchButton, function(x){x$highlightElement()})

    # Select the third radio button (Nombre)
    radioButton[[3]]$clickElement()
    # Fill in the required name...
    nameElement[[1]]$sendKeysToElement(list("PROAÑO & ASOCIADOS CIA. LTDA."))

    # This is subtle but required: typing the name triggers a drop-down list, so
    # rather than hitting the search button straight away, we first pick the
    # matching entry in the drop-down menu...
    selectElement <- remDr$findElements(using = 'css selector', ".z-comboitem-text")
    selectElement[[1]]$clickElement()
    # OK, now we can click the search button, which opens the next page
    searchButton[[1]]$clickElement()

    # New page opens...
    #
    # First pull the list of buttons...
    finPageButton <- remDr$findElements(using = 'class name', "m_iconos")
    # ...then press the button that opens the page we want to get to
    finPageButton[[9]]$clickElement()
    # We are now on the required page.

Extracting table values…
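The next step is to extract the table values. To do this, we pull the elements that match the .z-listitem CSS selector. We can then check that we are actually seeing the data rows, and from there extract the returned values and populate a list or data frame.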
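The "populate a data frame" step is not spelled out in the answer's code, so here is a minimal sketch of one way it might be done. It assumes each row's text arrives as a single string with exactly three newline-separated fields (account code, account name, value), which is the shape of the sample output shown in this answer. The helper name `rows_to_df`, the column names and the sample strings are hypothetical, and the number format on the real site has not been checked.

```r
# Hypothetical helper (not part of the original answer): turn row strings of
# the form "code\nname\nvalue" into a data frame with one row per account.
rows_to_df <- function(row_text) {
  # Split every row on the newline that separates the three cells
  parts <- strsplit(row_text, "\n", fixed = TRUE)
  # Keep only rows that really have three fields
  parts <- parts[lengths(parts) == 3]
  data.frame(
    codigo = vapply(parts, `[`, character(1), 1),
    nombre = vapply(parts, `[`, character(1), 2),
    # Assumes plain decimal values such as "0.00"; anything else becomes NA
    valor  = as.numeric(vapply(parts, `[`, character(1), 3)),
    stringsAsFactors = FALSE
  )
}

# Made-up strings standing in for the text of each .z-listitem row.
# In a live session they might instead be collected with something like:
#   row_text <- vapply(modalWindow, function(el) el$getElementText()[[1]], character(1))
sample_rows <- c(
  "10\nEXAMPLE ACCOUNT A\n0.00",
  "1\nEXAMPLE ACCOUNT B\n1234.50"
)
balance <- rows_to_df(sample_rows)
print(balance)
```

Rows that do not split into three fields are dropped rather than guessed at, which keeps header or separator rows out of the result.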
    # Now extract the table: find the '.z-listitem' elements
    # and assign them to modalWindow
    modalWindow <- remDr$findElements(using = 'css selector', ".z-listitem")
    # Each row comes back as a single string, so we split it into
    # three fields on the newline marker "\n"
    lineText <- str_split(modalWindow[[1]]$getElementText()[1], '\n')
    lineText

Here, the result is:
    > lineText <- stringr::str_split(modalWindow[[1]]$getElementText()[1], '\n')
    > lineText
    [[1]]
    [1] "10"
    [2] "OPERACIONES DE INGRESO CON PARTES RELACIONADAS EN PARAÍSOS FISCALES,JURISDICCIONES DE MENOR IMPOSICIÓN Y REGÍMENES FISCALES PREFERENTES"
    [3] "0.00"

Handling hidden data
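Selenium WebDriver and RSelenium only interact with the visible elements of a web page. If we try to read the whole table, we only get back the table items that are visible (not hidden).
We can work around this by scrolling to the bottom of the table. The scroll forces the table to populate, after which we can extract the complete table.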
    # Select the .z-listBox-body
    modalWindow <- remDr$findElements(using = 'css selector', ".z-listBox-body")
    # Tell the window to scroll to the bottom of the table.
    # This triggers the table to populate all the rows
    modalWindow[[1]]$executeScript("window.scrollTo(0,document.body.scrollHeight)")
    # Now we can extract the complete table
    modalWindow <- remDr$findElements(using = 'css selector', ".z-listitem")
    lineText <- stringr::str_split(modalWindow[[9]]$getElementText(), '\n')
    lineText

What the code does
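The code example above is self-contained. By that I mean it should install everything you need, including the required packages. Once the dependent R packages are installed, the R code calls checkForServer(), which installs the Selenium server if it is not already installed. This may take a while.
My advice is to step through the code one line at a time, because I have not included any delays (which you would want in production). Note also that I have not optimized for speed, but rather for a bit of clarity [from my perspective]…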
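For the missing delays, one option is a small polling helper instead of fixed pauses. This is a sketch only: `wait_until` and `page_ready` are hypothetical names, and the timeout and interval defaults are arbitrary. The idea is to call a check function repeatedly until it returns TRUE or a time limit is reached.

```r
# Hypothetical helper: poll `check` until it returns TRUE or `timeout`
# seconds have passed. Returns TRUE on success and FALSE on timeout.
wait_until <- function(check, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # An error raised while the page is still loading counts as "not ready yet"
    ready <- tryCatch(isTRUE(check()), error = function(e) FALSE)
    if (ready) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Stand-in for a page that becomes ready on the third poll
polls <- 0
page_ready <- function() {
  polls <<- polls + 1
  polls >= 3
}
ok <- wait_until(page_ready, timeout = 5, interval = 0.1)
print(ok)
```

In the scraping code, the check could be something like `function() length(remDr$findElements(using = 'css selector', ".z-listitem")) > 0`, placed before the rows are read, so that the script only continues once at least one row is present.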
> Mac OS X 10.11.5
> RStudio 0.99.893
> R version 3.2.4 (2016-03-10) -- "Very Secure Dishes"