I am using Scrapy to crawl some pages and I get the following error:
twisted.internet.error.ConnectionLost
My command-line output:
2015-05-04 18:40:32+0800 [cnproxy] INFO: Spider opened
2015-05-04 18:40:32+0800 [cnproxy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-04 18:40:32+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-05-04 18:40:32+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-05-04 18:40:32+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy1.html> (Failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:32+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy1.html> (Failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:32+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy1.html> (Failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:32+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy1.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy3.html> (Failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy3.html> (Failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy3.html> (Failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy3.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy8.html> (Failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy8.html> (Failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy2.html> (Failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu1.html> (Failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy9.html> (Failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy10.html> (Failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy9.html> (Failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy8.html> (Failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy2.html> (Failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy8.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu1.html> (Failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy10.html> (Failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy9.html> (Failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy2.html> (Failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy9.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy2.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy10.html> (Failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy10.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxyedu1.html> (Failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxyedu1.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy5.html> (Failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy7.html> (Failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy5.html> (Failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy7.html> (Failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy7.html> (Failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy7.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy5.html> (Failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy5.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy6.html> (Failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy6.html> (Failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy6.html> (Failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy6.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:34+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu2.html> (Failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:34+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu2.html> (Failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:34+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxyedu2.html> (Failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:34+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxyedu2.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:35+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy4.html> (Failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:35+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy4.html> (Failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:35+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy4.html> (Failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:35+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy4.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:35+0800 [cnproxy] INFO: Closing spider (finished)
2015-05-04 18:40:35+0800 [cnproxy] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 36,
     'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 36,
     'downloader/request_bytes': 8121,
     'downloader/request_count': 36,
     'downloader/request_method_count/GET': 36,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 5, 4, 10, 40, 35, 608377),
     'log_count/DEBUG': 38,
     'log_count/ERROR': 12,
     'log_count/INFO': 7,
     'scheduler/dequeued': 36,
     'scheduler/dequeued/memory': 36,
     'scheduler/enqueued': 36,
     'scheduler/enqueued/memory': 36,
     'start_time': datetime.datetime(2015,32,624695)}
2015-05-04 18:40:35+0800 [cnproxy] INFO: Spider closed (finished)
My settings.py:
SPIDER_MODULES = ['proxy.spiders']
NEWSPIDER_MODULE = 'proxy.spiders'
DOWNLOAD_DELAY = 0
DOWNLOAD_TIMEOUT = 30
ITEM_PIPELINES = {
    'proxy.pipelines.ProxyPipeline': 100,
}
CONCURRENT_ITEMS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 64
#CONCURRENT_SPIDERS = 128
LOG_ENABLED = True
LOG_ENCODING = 'utf-8'
LOG_FILE = '/home/hadoop/modules/scrapy/myapp/proxy/proxy.log'
LOG_LEVEL = 'DEBUG'
LOG_STDOUT = False
My spider, proxy_spider.py:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from proxy.items import ProxyItem
import re


class ProxycrawlerSpider(CrawlSpider):
    name = 'cnproxy'
    allowed_domains = ['www.cnproxy.com']
    indexes = [1, 2, 3, 6, 7, 8, 9, 10]
    start_urls = []
    for i in indexes:
        url = 'http://www.cnproxy.com/proxy%s.html' % i
        start_urls.append(url)
    start_urls.append('http://www.cnproxy.com/proxyedu1.html')
    start_urls.append('http://www.cnproxy.com/proxyedu2.html')

    def parse_ip(self, response):
        sel = HtmlXPathSelector(response)
        addresses = sel.select('//tr[position()>1]/td[position()=1]').re('\d{1,3}\.\d{1,3}')
        protocols = sel.select('//tr[position()>1]/td[position()=2]').re('<td>(.*)<\/td>')
        locations = sel.select('//tr[position()>1]/td[position()=4]').re('<td>(.*)<\/td>')
        ports_re = re.compile('write\(":"(.*)\)')
        raw_ports = ports_re.findall(response.body)
        port_map = {'z': '3', 'm': '4', 'k': '2', 'l': '9', 'd': '0', 'b': '5',
                    'i': '7', 'w': '6', 'r': '8', 'c': '1', '+': ''}
        ports = []
        for port in raw_ports:
            tmp = port
            for key in port_map:
                tmp = tmp.replace(key, port_map[key])
            ports.append(tmp)
        items = []
        for i in range(len(addresses)):
            item = ProxyItem()
            item['address'] = addresses[i]
            item['protocol'] = protocols[i]
            item['location'] = locations[i]
            item['port'] = ports[i]
            items.append(item)
        return items
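The port-decoding loop in parse_ip can be exercised on its own: the site emits obfuscated port strings via document.write, and the spider maps each letter back to a digit. Below is a minimal standalone sketch of that step; the sample input string '+l+d+r+d' is made up for illustration.

```python
# Letter-to-digit substitution table used by the spider; '+' is a separator
# that is simply stripped.
PORT_MAP = {'z': '3', 'm': '4', 'k': '2', 'l': '9', 'd': '0', 'b': '5',
            'i': '7', 'w': '6', 'r': '8', 'c': '1', '+': ''}


def decode_port(raw):
    """Decode an obfuscated port string by applying every substitution."""
    for key, value in PORT_MAP.items():
        raw = raw.replace(key, value)
    return raw


print(decode_port('+l+d+r+d'))  # → 9080
```

Because the replacement values are digits and no digit appears as a key, the order in which substitutions are applied does not matter.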
Is there anything wrong with my pipeline or settings?
If not, how can I prevent the twisted.internet.error.ConnectionLost error?
I tried the Scrapy shell:
$ scrapy shell http://www.cnproxy.com/proxy1.html
and got the same error as in the title.
But I can open that page in Chrome. I also tried other, similar sites, e.g.
$ scrapy shell http://stackoverflow.com
and they all worked fine.
Solution
You need to set a user-agent string. Some sites do not like being scraped and block requests when the user agent is not a browser.
You can find examples of user agent strings.
This article outlines best practices for keeping your spider from being blocked.
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36'
You can also try a user-agent randomiser.
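One way to randomise the user agent is a small Scrapy downloader middleware. The sketch below is illustrative, not from the original post: the class name, module path, and the user-agent strings in the list are all assumptions you would replace with your own.

```python
import random

# Illustrative list; fill in real browser user-agent strings of your choice.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Safari/537.36',
]


class RandomUserAgentMiddleware(object):
    """Downloader middleware that assigns a random User-Agent per request."""

    def process_request(self, request, spider):
        # Overwrite the outgoing request's User-Agent header with a random pick.
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
```

You would then register it in settings.py via DOWNLOADER_MIDDLEWARES, e.g. {'proxy.middlewares.RandomUserAgentMiddleware': 400} (the module path here is an assumption about your project layout).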