Python Forum - Discussion Board

Subject: [python-chinese] Problem with a small program for downloading web pages

Tuesday, November 6, 2007 11:41

aaron aaronkowk at gmail.com
Tue Nov 6 11:41:30 HKT 2007

I want to write a simple crawler program. Actually it hardly counts as a crawler: I want to download every news page under "http://news.qq.com/a/". A download tool could of course do this, but I want to write my own program and make it general, so that later I only have to change the start-page address to download a new batch, instead of using a download tool each time to work out which pages to fetch. Here is my source code:

> # -*- coding:utf-8 -*-
> # file: collect.py
> #
>
> import urllib
> from sgmllib import SGMLParser
>
> class URLLister(SGMLParser):
>     # Collect the href attribute of every <a> tag the parser sees.
>     def reset(self):
>         SGMLParser.reset(self)
>         self.urls = []
>
>     def start_a(self, attrs):
>         href = [v for k, v in attrs if k == 'href']
>         if href:
>             self.urls.extend(href)
>
> PreURL = "http://news.qq.com/"
>
> # Fetch the index page and collect every link it contains.
> usock = urllib.urlopen("http://news.qq.com/a/20071101/")
> parser = URLLister()
> parser.feed(usock.read())
> parser.close()
> usock.close()
>
> # Download each linked page into a numbered local file.
> i = 1
> for url in parser.urls:
>     page = urllib.urlopen(PreURL + url)
>     data = page.read()
>     filename = 'D:\\test\\' + str(i) + '.htm'
>     i = i + 1
>     f = open(filename, 'wb')  # 'f' rather than 'file', to avoid shadowing the builtin
>     f.write(data)
>     f.close()
>     page.close()
>

I have tried several times, and it always fails while downloading the 24th page, with the following error:


> Traceback (most recent call last):
>   File "C:\Inetpub\wwwroot\MySite\collect\collect1.py", line 28, in
> 
>     page = urllib.urlopen(PreURL + url)
>   File "C:\Python25\lib\urllib.py", line 82, in urlopen
>     return opener.open(url)
>   File "C:\Python25\lib\urllib.py", line 190, in open
>     return getattr(self, name)(url)
>   File "C:\Python25\lib\urllib.py", line 328, in open_http
>     errcode, errmsg, headers = h.getreply()
>   File "C:\Python25\lib\httplib.py", line 1195, in getreply
>     response = self._conn.getresponse()
>   File "C:\Python25\lib\httplib.py", line 924, in getresponse
>     response.begin()
>   File "C:\Python25\lib\httplib.py", line 385, in begin
>     version, status, reason = self._read_status()
>   File "C:\Python25\lib\httplib.py", line 343, in _read_status
>     line = self.fp.readline()
>   File "C:\Python25\lib\socket.py", line 331, in readline
>     data = recv(1)
> IOError: [Errno socket error] (10054, 'Connection reset by peer')


I would like to ask what causes this. Also, why does this code run so slowly? While it parses the http://news.qq.com/a/20071101/ page at the start, the IDLE window almost freezes; only after a few minutes does downloading begin, and each page then takes at least a minute to download.

[Imported from the Mailman archive: http://www.zeuux.org/pipermail/zeuux-python]

Tuesday, November 6, 2007 14:00

Cyril.Liu terry6394 at gmail.com
Tue Nov 6 14:00:41 HKT 2007

"IDLE界面几乎死掉"
慢的原因估计是因为你打开URL被阻塞的问题, 就像用IE的时候经常会假死。我想你用多线程去做的话可能会比较好一点。毕竟网页一个一个比较慢。
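
A minimal multithreaded sketch of that idea (Python 2, threading plus a Queue); the worker count and save path are illustrative, and parser.urls is assumed to be filled exactly as in collect.py above:

import threading
import urllib
from Queue import Queue

NUM_WORKERS = 5                      # illustrative; tune to taste
task_queue = Queue()

def worker():
    while True:
        i, url = task_queue.get()
        try:
            page = urllib.urlopen("http://news.qq.com/" + url)
            data = page.read()
            page.close()
            f = open('D:\\test\\' + str(i) + '.htm', 'wb')
            f.write(data)
            f.close()
        except IOError:
            pass                     # a real version would log and retry
        task_queue.task_done()

for _ in range(NUM_WORKERS):
    t = threading.Thread(target=worker)
    t.setDaemon(True)                # do not keep the process alive
    t.start()

i = 1
for url in parser.urls:              # filled as in collect.py
    task_queue.put((i, url))
    i = i + 1
task_queue.join()                    # block until every page is handled

With several fetches in flight at once, one slow or stalled connection no longer blocks the whole run.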


Tuesday, November 6, 2007 14:10

aaron aaronkowk at gmail.com
Tue Nov 6 14:10:50 HKT 2007

Thanks. I am a beginner and know almost nothing about how network access actually works underneath, so it seems this job is not as easy to implement as I thought; there is a lot of basic knowledge I still need.
As for multithreaded programming, I assumed I would not need it and have not started reading about it yet. Thanks for the reminder; I will try it right away.


Tuesday, November 6, 2007 14:34

Shao Feng sevenever at gmail.com
Tue Nov 6 14:34:39 HKT 2007

Could the reset be because the pages you are downloading contain some sensitive keywords?
Also, try changing the HTTP request headers, such as the User-Agent.
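
A hedged sketch of that header suggestion using urllib2, which, unlike urllib.urlopen, lets you set request headers; the User-Agent string below is only an example value:

import urllib2

req = urllib2.Request("http://news.qq.com/a/20071101/")
# Present a browser-like User-Agent instead of Python's default one.
req.add_header('User-Agent',
               'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)')
page = urllib2.urlopen(req)
data = page.read()
page.close()

Some servers reset or throttle clients whose User-Agent looks like a script, which may be why changing it sometimes helps.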


Tuesday, November 6, 2007 14:41

Wayne moonbingbing at gmail.com
Tue Nov 6 14:41:24 HKT 2007

I suggest not running the program inside IDLE; it can produce some baffling errors. Try it from the command line.

-- 
wayne

Friday, November 9, 2007 13:58

Chen Memo ooozid at gmail.com
Fri Nov 9 13:58:08 HKT 2007

urllib does not seem to handle some abnormal conditions, such as connection timeouts, unexpected disconnects, and 404 errors, so what I use is httpconnection.
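
A minimal sketch of that httplib.HTTPConnection approach (Python 2), adding the retry and status handling that plain urllib.urlopen lacks; the retry count and timeout are illustrative:

import httplib
import socket

socket.setdefaulttimeout(30)   # Python 2.5 has no per-connection timeout argument

def fetch(host, path, retries=3):
    # GET http://host/path, retrying on resets and timeouts.
    for attempt in range(retries):
        conn = httplib.HTTPConnection(host)
        try:
            conn.request("GET", path)
            resp = conn.getresponse()
            if resp.status == 200:
                data = resp.read()
                conn.close()
                return data
            # non-200 (e.g. 404): fall through, close, and retry or give up
        except (httplib.HTTPException, socket.error):
            pass                   # e.g. errno 10054, connection reset by peer
        conn.close()
    return None

data = fetch("news.qq.com", "/a/20071101/")

Wrapping each download like this lets the loop skip or retry a page that resets, instead of dying with an uncaught IOError.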

