I am trying to scrape data from this site using R:
http://www.soccer24.com/kosovo/superliga/results/#
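I can do the following:

    library(rvest)
    # read_html() replaces html(), which is deprecated in newer rvest releases
    doc <- read_html("http://www.soccer24.com/kosovo/superliga/results/")

But I am struggling to work out how to get at the data, because the actual data on the site appears to be generated by JavaScript. What I can do is

    html_text(doc)

but that gives a strange blob of text. It does contain the data, but it is interspersed with odd code, and it is not at all clear how I would parse it.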
Use Selenium with phantomjs:

    library(RSelenium)
    # start a phantomjs process and connect a remote driver to it
    pJS <- phantom()
    remDr <- remoteDriver(browserName = "phantomjs")
    appURL <- "http://www.soccer24.com/kosovo/superliga/results/#"
    remDr$open()
    remDr$navigate(appURL)
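If you want to keep pressing the "more data" button until it is no longer visible (at which point all matches are presumed to be showing):

    library(XML)  # provides htmlParse, readHTMLTable and removeNodes
    webElem <- remDr$findElement("css", "#tournament-page-results-more a")
    while (webElem$isElementDisplayed()[[1]]) {
      webElem$clickElement()
      Sys.sleep(5)  # give the page time to load the extra rows
      webElem <- remDr$findElement("css", "#tournament-page-results-more a")
    }
    doc <- htmlParse(remDr$getPageSource()[[1]])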
Remove the unwanted round rows and use XML::readHTMLTable for simplicity:

    # remove unwanted rounds html. Sometimes there are end of season extra games.
    # These are presented in a separate table.
    invisible(doc["//table/*/tr[@class='event_round']", fun = removeNodes])
    appData <- readHTMLTable(doc, which = seq(length(doc["//table"]) - 1),
                             stringsAsFactors = FALSE, trim = TRUE)
    if (!is.data.frame(appData)) {
      appData <- do.call(rbind, appData)
    }
    row.names(appData) <- NULL
    names(appData) <- c("blank", "Date", "hteam", "ateam", "score")
    pJS$stop()

    > head(appData)
      blank         Date           hteam            ateam score
    1       01.04. 18:00     Ferronikeli          Ferizaj 4 : 0
    2       01.04. 18:00          Istogu         Hajvalia 2 : 1
    3       01.04. 18:00 Kosova Vushtrri Trepca Mitrovice 1 : 0
    4       01.04. 18:00       Prishtina          Drenica 3 : 0
    5       31.03. 18:00       Besa Peje            Drita 1 : 0
    6       31.03. 18:00       Trepca 89       Vellaznimi 2 : 0
    > tail(appData)
        blank         Date            hteam     ateam score
    115       17.08. 22:00        Besa Peje Trepca 89 3 : 3
    116       17.08. 22:00      Ferronikeli  Hajvalia 2 : 5
    117       17.08. 22:00 Trepca Mitrovice   Ferizaj 1 : 0
    118       17.08. 22:00       Vellaznimi   Drenica 2 : 1
    119       16.08. 22:00  Kosova Vushtrri     Drita 0 : 1
    120       16.08. 22:00        Prishtina    Istogu 2 : 1
Carry out further formatting as needed.
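As one possible direction for that formatting, here is a minimal sketch in base R that drops the empty column and splits the score and date strings into usable fields. The two sample rows and the new column names (hgoals, agoals, day, month, time) are assumptions made for illustration, not part of the original answer. The site shows no year, so none is inferred here.

```r
# Hypothetical sample rows in the same shape as the scraped appData
appData <- data.frame(
  blank = c("", ""),
  Date  = c("01.04. 18:00", "17.08. 22:00"),
  hteam = c("Ferronikeli", "Besa Peje"),
  ateam = c("Ferizaj", "Trepca 89"),
  score = c("4 : 0", "3 : 3"),
  stringsAsFactors = FALSE
)

# Drop the empty first column
appData$blank <- NULL

# Split "4 : 0" into integer home and away goals
goals <- strsplit(appData$score, ":", fixed = TRUE)
appData$hgoals <- as.integer(trimws(sapply(goals, `[`, 1)))
appData$agoals <- as.integer(trimws(sapply(goals, `[`, 2)))

# Split "01.04. 18:00" into day, month and kick-off time
appData$day   <- as.integer(sub("^([0-9]{2})\\..*$", "\\1", appData$Date))
appData$month <- as.integer(sub("^[0-9]{2}\\.([0-9]{2})\\..*$", "\\1", appData$Date))
appData$time  <- sub("^.*[[:space:]]([0-9]{2}:[0-9]{2})$", "\\1", appData$Date)
```

With numeric goal columns in place, results can be filtered or aggregated directly, for example `appData[appData$hgoals > appData$agoals, ]` for home wins.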