I recently started learning about web crawlers and built a sample crawler with Ruby, Anemone, and MongoDB for storage. I'm testing the crawler on a massive public website that probably has billions of links.
crawler.rb is indexing the correct information, but when I check the memory usage in Activity Monitor, it shows the memory constantly growing. I have only run the crawler for about 6-7 hours and the memory shows 1.38 GB for mongod and 1.37 GB for the Ruby process. It seems to grow by about 100 MB every hour or so.
It seems like I might have a memory leak? Is there a more optimal way to achieve the same crawl without the memory escalating out of control, so that it can run longer?
```ruby
# Sample web_crawler.rb with Anemone, MongoDB and Ruby.
require 'anemone'

# Do not store the page's body.
module Anemone
  class Page
    def to_hash
      { 'url'     => @url.to_s,
        'links'   => links.map(&:to_s),
        'code'    => @code,
        'visited' => @visited,
        'depth'   => @depth,
        'referer' => @referer.to_s,
        'fetched' => @fetched }
    end

    def self.from_hash(hash)
      page = self.new(URI(hash['url']))
      { '@links'   => hash['links'].map { |link| URI(link) },
        '@code'    => hash['code'].to_i,
        '@visited' => hash['visited'],
        '@depth'   => hash['depth'].to_i,
        '@referer' => hash['referer'],
        '@fetched' => hash['fetched'] }.each do |var, value|
        page.instance_variable_set(var, value)
      end
      page
    end
  end
end

Anemone.crawl("http://www.example.com/",
              :discard_page_bodies => true,
              :threads => 1,
              :obey_robots_txt => true,
              :user_agent => "Example - Web Crawler",
              :large_scale_crawl => true) do |anemone|
  anemone.storage = Anemone::Storage.MongoDB

  # Only crawl pages that contain /example in the URL.
  anemone.focus_crawl do |page|
    page.links.delete_if do |link|
      (link.to_s =~ /example/).nil?
    end
  end

  # Only process pages in the /example directory.
  anemone.on_pages_like(/example/) do |page|
    regex = /some type of regex/
    example = page.doc.css('#example_div').inner_html.gsub(regex, '') rescue next

    # Save to text file.
    if !example.nil? and example != ""
      open('example.txt', 'a') { |f| f.puts "#{example}" }
    end
    page.discard_doc!
  end
end
```
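The monkey-patch above works because Anemone's storage backends persist each page through `to_hash`; by leaving the body out of that hash, the large HTML payload never reaches MongoDB. A minimal sketch of the idea, using a hypothetical stand-in class (not Anemone's real `Page`) so it runs without the gem:

```ruby
require 'uri'

# Hypothetical stand-in for Anemone::Page, illustrating the same
# serialization trick: drop the large @body before persisting.
class Page
  attr_reader :url, :body

  def initialize(url, body)
    @url  = URI(url)
    @body = body # potentially megabytes of HTML
  end

  # Serialize everything EXCEPT the body, so the storage
  # backend only ever sees the small metadata.
  def to_hash
    { 'url' => @url.to_s }
  end
end

page = Page.new('http://www.example.com/', '<html>huge body</html>')
puts page.to_hash.inspect  # no 'body' key: the payload stays out of storage
```

The same pattern applies to any field that is expensive to store but cheap to refetch or recompute.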
Solution
I had this problem too, but I used Redis as the datastore.
This is my crawler:
```ruby
require "rubygems"
require "anemone"

urls = File.open("urls.csv")
opts = { discard_page_bodies: true, skip_query_strings: true,
         depth_limit: 2000, read_timeout: 10 }

File.open("results.csv", "a") do |result_file|
  while row = urls.gets
    row_ = row.strip.split(',')
    if row_[1].start_with?("http://")
      url = row_[1]
    else
      url = "http://#{row_[1]}"
    end

    Anemone.crawl(url, opts) do |anemone|
      anemone.storage = Anemone::Storage.Redis
      puts "crawling #{url}"
      anemone.on_every_page do |page|
        next if page.body == nil
        if page.body.downcase.include?("sometext")
          puts "found one at #{url}"
          result_file.puts "#{row_[0]},#{row_[1]}"
          next
        end
      end # end on_every_page
    end # end crawl
  end # end while

  # we're done
  puts "We're done."
end # end File.open
```
I applied the patch from here to the core.rb file in the anemone gem:
```
35    # Prevent page_queue from using excessive RAM. Can indirectly limit rate
      # of crawling. You'll additionally want to use discard_page_bodies and/or
      # a non-memory 'storage' option
36    :max_page_queue_size => 100,
```
…
(The following used to be at line 155:)

```
157   page_queue = SizedQueue.new(@opts[:max_page_queue_size])
```
I have an hourly cron job:
```python
#!/usr/bin/env python
import redis
r = redis.Redis()
r.flushall()
```
to try to keep Redis's memory usage down. I'm restarting a giant crawl now, so we'll see how it goes!
I'll report back with the results…