Tuesday, November 6, 2007, 11:41
I want to write a simple crawler program (really, it hardly counts as a crawler): I want to download all of the news pages under "http://news.qq.com/a/". A download tool could do this, of course, but I want to write my own program and make it general-purpose, so that later I only need to change the start page address to download a new batch, instead of using a download tool each time to figure out which pages to fetch. Here is my source code:

> # -*- coding:utf-8 -*-
> # file: collect.py
> #
>
> import urllib
>
> from sgmllib import SGMLParser
>
> class URLLister(SGMLParser):
>     def reset(self):
>         SGMLParser.reset(self)
>         self.urls = []
>
>     def start_a(self, attrs):
>         href = [v for k, v in attrs if k == 'href']
>         if href:
>             self.urls.extend(href)
>
> PreURL = "http://news.qq.com/"
> usock = urllib.urlopen("http://news.qq.com/a/20071101/")
> parser = URLLister()
> parser.feed(usock.read())
> parser.close()
> usock.close()
> i = 1
> for url in parser.urls:
>     page = urllib.urlopen(PreURL + url)
>     data = page.read()
>     filename = 'D:\\test\\' + str(i) + '.htm'
>     i = i + 1
>     f = open(filename, 'wb')
>     f.write(data)
>     f.close()
>     page.close()

I have tried this several times, and every time it fails while downloading the 24th page with the following error:

> Traceback (most recent call last):
>   File "C:\Inetpub\wwwroot\MySite\collect\collect1.py", line 28, in <module>
>     page = urllib.urlopen(PreURL + url)
>   File "C:\Python25\lib\urllib.py", line 82, in urlopen
>     return opener.open(url)
>   File "C:\Python25\lib\urllib.py", line 190, in open
>     return getattr(self, name)(url)
>   File "C:\Python25\lib\urllib.py", line 328, in open_http
>     errcode, errmsg, headers = h.getreply()
>   File "C:\Python25\lib\httplib.py", line 1195, in getreply
>     response = self._conn.getresponse()
>   File "C:\Python25\lib\httplib.py", line 924, in getresponse
>     response.begin()
>   File "C:\Python25\lib\httplib.py", line 385, in begin
>     version, status, reason = self._read_status()
>   File "C:\Python25\lib\httplib.py", line 343, in _read_status
>     line = self.fp.readline()
>   File "C:\Python25\lib\socket.py", line 331, in readline
>     data = recv(1)
> IOError: [Errno socket error] (10054, 'Connection reset by peer')

What is causing this? Also, why does this code run so slowly? When it first parses the http://news.qq.com/a/20071101/ page, the IDLE window nearly freezes; only after several minutes does the downloading start, and each page then takes at least a minute to download.
Tuesday, November 6, 2007, 14:00
"The IDLE window nearly freezes": the slowness is probably because your URL open is blocking, much like IE often appears to hang. I think you would do better with multiple threads; after all, fetching pages one at a time is slow.

On 11/6/07, aaron <aaronkowk at gmail.com> wrote:
> I want to write a simple crawler program (really, it hardly counts as a crawler): I want to download all of the news pages under "http://news.qq.com/a/"... [snip]
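[Editor's sketch] For reference, a minimal sketch of what a threaded version might look like on Python 2.5, using the standard threading and Queue modules. The worker count, the placeholder url list, and the error handling are illustrative assumptions, not code from the thread:

# Hypothetical threaded downloader sketch (Python 2.5); not the poster's code.
import threading
import urllib
from Queue import Queue

PreURL = "http://news.qq.com/"
tasks = Queue()

def worker():
    # each thread pulls (index, url) pairs until the queue is drained
    while True:
        i, url = tasks.get()
        try:
            page = urllib.urlopen(PreURL + url)
            data = page.read()
            page.close()
            f = open('D:\\test\\' + str(i) + '.htm', 'wb')
            f.write(data)
            f.close()
        except IOError, e:
            print 'failed on %s: %s' % (url, e)   # skip bad pages, keep going
        tasks.task_done()

for _ in range(5):                 # 5 threads is an arbitrary choice
    t = threading.Thread(target=worker)
    t.setDaemon(True)
    t.start()

# in the original script, urls would be parser.urls from URLLister
urls = ['a/20071101/000001.htm']   # placeholder list
for i, url in enumerate(urls):
    tasks.put((i + 1, url))
tasks.join()

Queue.join() blocks until a task_done() call has been made for every queued item, so the main thread exits only after all pages have been written.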
Tuesday, November 6, 2007, 14:10
Thanks. I am a beginner and know almost nothing about how network access actually works at the lower levels; it seems this job is not as easy to implement as I thought, and I still need to learn a lot of the basics. As for multithreaded programming, I assumed I would not need it and have not started reading about it yet. Thanks for the reminder; I will try it right away.

On 07-11-6, Cyril. Liu <terry6394 at gmail.com> wrote:
> "The IDLE window nearly freezes": the slowness is probably because your URL open is blocking... [snip]
Tuesday, November 6, 2007, 14:34
Could the reset be because the pages you are downloading contain some sensitive keywords?
Also, try changing the HTTP request headers, e.g. the User-Agent.

On 11/6/07, aaron <aaronkowk at gmail.com> wrote:
> Thanks. I am a beginner and know almost nothing about how network access actually works at the lower levels... [snip]
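[Editor's sketch] One way to send a browser-like User-Agent on Python 2.5 is through urllib2; the header string below is an arbitrary example, not something prescribed in the thread:

# Hypothetical sketch of a request with a custom User-Agent header.
import urllib2

req = urllib2.Request(
    "http://news.qq.com/a/20071101/",
    headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'})
page = urllib2.urlopen(req)   # raises urllib2.HTTPError / URLError on failure
data = page.read()
page.close()

Some servers reset or refuse connections from clients whose User-Agent looks like a script, so identifying as a common browser can help, though it is no guarantee.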
Tuesday, November 6, 2007, 14:41
I would suggest not running the program under IDLE; it can produce some inexplicable errors. Try it from the command line.

On 07-11-6, Shao Feng <sevenever at gmail.com> wrote:
> Could the reset be because the pages you are downloading contain some sensitive keywords?
> Also, try changing the HTTP request headers, e.g. the User-Agent. [snip]

--
wayne
Friday, November 9, 2007, 13:58
urllib does not seem to handle some exceptional conditions, such as connection timeouts, abnormal disconnects, and errors like 404. I use HTTPConnection instead.

On 07-11-6, Wayne <moonbingbing at gmail.com> wrote:
> I would suggest not running the program under IDLE; it can produce some inexplicable errors. Try it from the command line. [snip]
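[Editor's sketch] A rough sketch of the httplib.HTTPConnection approach on Python 2.5, where the status code is visible and network errors can be caught explicitly. The fetch helper and the 30-second default timeout are illustrative assumptions:

# Hypothetical httplib-based fetch with explicit error handling.
import httplib
import socket

socket.setdefaulttimeout(30)   # HTTPConnection gains a timeout arg only in Python 2.6

def fetch(host, path):
    try:
        conn = httplib.HTTPConnection(host)
        conn.request('GET', path)
        resp = conn.getresponse()
        if resp.status != 200:
            print 'HTTP %d %s for %s' % (resp.status, resp.reason, path)
            return None
        return resp.read()
    except (httplib.HTTPException, socket.error), e:
        print 'network error for %s: %s' % (path, e)
        return None

data = fetch('news.qq.com', '/a/20071101/')

With this structure, a 10054 reset surfaces as a socket.error that the caller can log and skip, instead of killing the whole run.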