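I'm working on a school project for which I need to collect career statistics for individual NCAA football players. Each player's data is in this format:
http://www.sports-reference.com/cfb/players/ryan-aplin-1.html
I couldn't find an aggregate of all players, so I need to go page by page and pull out the bottom row of each HTML table (Passing, Scoring, Rushing & Receiving, etc.). The players are listed alphabetically by last name, starting from this index:
http://www.sports-reference.com/cfb/players/
For instance, every player whose last name starts with A is found here:
http://www.sports-reference.com/cfb/players/a-index.html
This is my first time really getting into data scraping, so I tried to find similar questions with answers. The closest answer I found was this question.
I believe I could use something very similar, swapping the page number for each collected player's name. However, I'm not sure how to change it to look up a player name instead of a page number.
Samuel L. Ventura also recently gave a talk about scraping NFL data, which can be found here.
EDIT:
Ben was really helpful and provided some great code. The first part works really well, but when I try to run the second part I run into this: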
    > # unlist into a single character vector
    > links <- unlist(links)
    > # Go to each URL in the list and scrape all the data from the tables
    > # this will take some time... don't interrupt it!
    > all_tables <- lapply(links,readHTMLTable,stringsAsFactors = FALSE)
    Error in UseMethod("xmlNamespaceDefinitions") :
      no applicable method for 'xmlNamespaceDefinitions' applied to an object of class "NULL"
    > # Put player names in the list so we know who the data belong to
    > # extract names from the URLs to their stats page...
    > toMatch <- c("http://www.sports-reference.com/cfb/players/","-1.html")
    > player_names <- unique (gsub(paste(toMatch,collapse="|"),"",links))
    Error: cannot allocate vector of size 512 Kb
    > # assign player names to list of tables
    > names(all_tables) <- player_names
    Error: object 'player_names' not found
    > fix(inx_page)
    Error in edit(name,file,title,editor) :
      unexpected '<' occurred on line 1
     use a command like
     x <- edit()
     to recover
    In addition: Warning message:
    In edit.default(name,editor = defaultEditor) : deparse may be incomplete
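Some of this may be due to not having enough memory (the computer I'm currently using only has 4GB). However, I don't understand this error:

    > all_tables <- lapply(links,readHTMLTable,stringsAsFactors = FALSE)
    Error in UseMethod("xmlNamespaceDefinitions") :
      no applicable method for 'xmlNamespaceDefinitions' applied to an object of class "NULL"

Looking through my other datasets, my players really only go back to 2007. If there were some way to pull only players from 2007 onwards, that might help shrink the data. If I had a list of the names I wanted to pull, could I just replace lnk in

    links[[i]] <- paste0("http://www.sports-reference.com",lnk)

with only the players that I need?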
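Solution
Here's how you can easily get all the data in all the tables on all the player pages...
First, make a list of the URLs for all the players' pages...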
    require(RCurl); require(XML)
    n <- length(letters)
    # pre-allocate list to fill
    links <- vector("list", length = n)
    for(i in 1:n){
      print(i) # keep track of what the function is up to
      # get all html on each page of the a-z index pages
      inx_page <- htmlParse(getURI(paste0("http://www.sports-reference.com/cfb/players/", letters[i], "-index.html")))
      # scrape URLs for each player from each index page
      lnk <- unname(xpathSApply(inx_page, "//a/@href"))
      # skip first 63 and last 10 links as they are constant on each page
      lnk <- lnk[-c(1:63, (length(lnk)-10):length(lnk))]
      # only keep links that go to players (exclude schools)
      lnk <- lnk[grep("players", lnk)]
      # now we have a list of all the URLs to all the players on that index page
      # but the URLs are incomplete, so let's complete them so we can use them from
      # anywhere
      links[[i]] <- paste0("http://www.sports-reference.com", lnk)
    }
    # unlist into a single character vector
    links <- unlist(links)
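The fixed offsets in the loop (first 63, last 10) depend on how the index pages happened to be laid out. A slightly more defensive variant, sketched here on a made-up vector of hrefs rather than a live page, keeps only links whose path looks like a player page. The `-<number>.html` suffix pattern is an assumption inferred from the example URL in the question, and `player_pattern`, `player_lnk`, `base_url` and `player_links` are illustrative names, not part of the original code.

```r
# Hypothetical hrefs standing in for xpathSApply(inx_page, "//a/@href")
lnk <- c("/cfb/",
         "/cfb/schools/utah/",
         "/cfb/players/a-index.html",
         "/cfb/players/neli-aasa-1.html",
         "/cfb/players/ryan-aplin-1.html",
         "/cfb/players/neli-aasa-1.html")

# Assumed pattern: player pages live under /cfb/players/ and end in "-<number>.html"
player_pattern <- "^/cfb/players/.+-[0-9]+\\.html$"

# Keep matching links only, and drop duplicates
player_lnk <- unique(grep(player_pattern, lnk, value = TRUE))

# Complete the relative URLs so they can be used from anywhere
base_url <- "http://www.sports-reference.com"
player_links <- paste0(base_url, player_lnk)
print(player_links)
```

Whether this pattern matches every player page on the real site has not been checked here, so it would be worth comparing its output against the offset-based version on one index page first.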
Now we have a vector of some 67,000 URLs (that seems like a lot of players, can that be right?), so:
Second, scrape all the tables at each URL to get their data, like so:
    # Go to each URL in the list and scrape all the data from the tables
    # this will take some time... don't interrupt it!
    # start edit1 here - just so you can see what's changed
    # pre-allocate list
    all_tables <- vector("list", length = length(links))
    for(i in seq_along(links)){
      print(i)
      # error handling - skips to next URL if it gets an error
      result <- try(
        all_tables[[i]] <- readHTMLTable(links[i], stringsAsFactors = FALSE)
      )
      if(inherits(result, "try-error")) next
    }
    # end edit1 here
    # Put player names in the list so we know who the data belong to
    # extract names from the URLs to their stats page...
    toMatch <- c("http://www.sports-reference.com/cfb/players/", "-1.html")
    player_names <- unique(gsub(paste(toMatch, collapse = "|"), "", links))
    # assign player names to list of tables
    names(all_tables) <- player_names
The result looks like this (this is just a snippet of the output):
    all_tables
    $`neli-aasa`
    $`neli-aasa`$defense
    Year School Conf Class Pos Solo Ast Tot Loss Sk Int Yds Avg TD PD FR Yds TD FF
    1 *2007 Utah MWC FR DL 2 1 3 0.0 0.0 0 0 0 0 0 0 0 0
    2 *2010 Utah MWC SR DL 4 4 8 2.5 1.5 0 0 0 1 0 0 0 0

    $`neli-aasa`$kick_ret
    Year School Conf Class Pos Ret Yds Avg TD Ret Yds Avg TD
    1 *2007 Utah MWC FR DL 0 0 0 0 0 0
    2 *2010 Utah MWC SR DL 2 24 12.0 0 0 0 0

    $`neli-aasa`$receiving
    Year School Conf Class Pos Rec Yds Avg TD Att Yds Avg TD Plays Yds Avg TD
    1 *2007 Utah MWC FR DL 1 41 41.0 0 0 0 0 1 41 41.0 0
    2 *2010 Utah MWC SR DL 0 0 0 0 0 0 0 0 0
Finally, let's say we just want to look at the passing tables...
    # just show passing tables
    passing <- lapply(all_tables, function(i) i$passing)
    # but lots of NULL in here, and not a convenient format, so...
    passing <- do.call(rbind, passing)
And we end up with a data frame that's ready for further analysis (again, just a snippet)...
                 Year             School Conf Class Pos Cmp Att  Pct  Yds Y/A AY/A TD Int  Rate
    james-aaron  1978          Air Force  Ind        QB  28  56 50.0  316 5.6  3.6  1   3  92.6
    jeff-aaron.1 2000 Alabama-Birmingham CUSA    JR  QB 100 182 54.9 1135 6.2  6.0  5   3 113.1
    jeff-aaron.2 2001 Alabama-Birmingham CUSA    SR  QB  77 148 52.0  828 5.6  4.3  4   6  99.8
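
On the follow-up about limiting the scrape to known players or to 2007 onwards: one possibility, sketched below on toy data rather than the real site, is to filter the URL vector against a list of wanted names before the scraping loop, and to filter the combined table by `Year` afterwards. The names `wanted`, `slug`, `links_wanted`, `stats`, `year_num` and `stats_recent` are illustrative only, and the approach assumes player page names end in `-<number>.html` as in the example URL, so players who share a name would all match the same entry in `wanted`.

```r
# Toy stand-in for the full vector of player URLs
links <- c("http://www.sports-reference.com/cfb/players/neli-aasa-1.html",
           "http://www.sports-reference.com/cfb/players/james-aaron-1.html",
           "http://www.sports-reference.com/cfb/players/ryan-aplin-1.html")

# Hypothetical list of the players actually needed, written the way they appear in URLs
wanted <- c("ryan-aplin", "neli-aasa")

# Strip the trailing "-<number>.html" from each file name to get the player part
slug <- sub("-[0-9]+\\.html$", "", basename(links))

# Keep only the URLs for wanted players; scrape these instead of all links
links_wanted <- links[slug %in% wanted]

# Toy stand-in for a combined stats table; Year is character and may carry a "*"
stats <- data.frame(Year = c("1978", "*2007", "2010"),
                    School = c("Air Force", "Utah", "Utah"),
                    stringsAsFactors = FALSE)

# Keep digits only, convert to integer, then keep 2007 onwards
year_num <- as.integer(gsub("[^0-9]", "", stats$Year))
stats_recent <- stats[!is.na(year_num) & year_num >= 2007, ]

print(links_wanted)
print(stats_recent)
```

Filtering by name cuts the number of pages requested, which is likely to matter most on a 4GB machine. Filtering by year only works after the tables have been scraped, so it shrinks the result but not the number of requests. Rows whose `Year` contains no digits at all are dropped by the `!is.na()` check.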