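I am writing a web scraper for a specific website. The application is a VB.Net
Windows Forms application that does not use multiple threads: every web request is made sequentially. However, after ten successful page retrievals, every subsequent request times out.
I have reviewed the similar questions already posted on SO and implemented the recommended techniques in my GetPage routine, shown below: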
    ' Requires: Imports System.IO and Imports System.Net
    Public Function GetPage(ByVal url As String) As String
        Dim result As String = String.Empty
        Dim uri As New Uri(url)
        ' Raise the connection limit for the service point that serves this URI.
        Dim sp As ServicePoint = ServicePointManager.FindServicePoint(uri)
        sp.ConnectionLimit = 100
        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(uri), HttpWebRequest)
        request.KeepAlive = False
        request.Timeout = 15000
        Try
            Using response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
                Using dataStream As Stream = response.GetResponseStream()
                    Using reader As New StreamReader(dataStream)
                        If response.StatusCode <> HttpStatusCode.OK Then
                            ' Use & and ToString() so the enum is concatenated as text.
                            Throw New Exception("Got response status code: " & response.StatusCode.ToString())
                        End If
                        result = reader.ReadToEnd()
                    End Using
                End Using
                ' Redundant with the enclosing Using block, kept from the original attempt.
                response.Close()
            End Using
        Catch ex As Exception
            Dim msg As String = "Error reading page """ & url & """. " & ex.Message
            Logger.LogMessage(msg, logoutputLevel.Diagnostics)
        End Try
        Return result
    End Function
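Am I missing something? Am I failing to close or dispose of an object that I should be? It seems odd that this always happens after exactly ten consecutive requests.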
Notes:
- ServicePointManager.DefaultConnectionLimit is set to 100.
- If I set KeepAlive to True, the timeouts begin after five requests.
- All requests are for pages in the same domain.
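Edit
I think the site has some kind of DoS protection that kicks in when it is hit by a burst of rapid requests. You might want to try setting the UserAgent on the web request.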
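
As a rough illustration of that suggestion, here is a minimal sketch that sets a User-Agent on the request and pauses between sequential requests. The module name, the User-Agent string, and the two-second delay are all assumptions made for the example, not values from the post or known to satisfy any particular site; whether this avoids the timeouts depends on how the site's protection actually works.

```vbnet
Imports System
Imports System.Net
Imports System.Threading

Module PoliteFetch
    ' Hypothetical values: neither the User-Agent string nor the delay
    ' comes from the original post. Tune both for the target site.
    Private Const AssumedUserAgent As String = "Mozilla/5.0 (compatible; ExampleScraper/1.0)"
    Private Const AssumedDelayMs As Integer = 2000

    ' Builds a request that identifies itself with a User-Agent header.
    ' No network traffic happens here; the request is only configured.
    Public Function CreateRequest(ByVal url As String) As HttpWebRequest
        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(New Uri(url)), HttpWebRequest)
        request.UserAgent = AssumedUserAgent
        request.Timeout = 15000
        Return request
    End Function

    ' Blocks the calling thread so sequential requests do not arrive in a burst.
    Public Sub PauseBetweenRequests()
        Thread.Sleep(AssumedDelayMs)
    End Sub
End Module
```

In GetPage, the request could be obtained from CreateRequest(url) in place of the direct WebRequest.Create call, with PauseBetweenRequests() called between successive pages. Note that Thread.Sleep blocks the UI thread in a single-threaded Windows Forms application for the length of the pause.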