Tuesday, November 6, 2007, 11:41
I want to write a simple crawler program (really, it hardly counts as a crawler): I want to download all of the news pages under "http://news.qq.com/a/". A download tool could do this, of course, but I want to write my own program and make it general-purpose, so that later I only need to change the start page address to download a new batch, instead of using a download tool each time to figure out which pages to fetch. Here is my source code:

> # -*- coding:utf-8 -*-
> # file: collect.py
> #
>
> import urllib
>
> from sgmllib import SGMLParser
>
> class URLLister(SGMLParser):
>     def reset(self):
>         SGMLParser.reset(self)
>         self.urls = []
>
>     def start_a(self, attrs):
>         href = [v for k, v in attrs if k == 'href']
>         if href:
>             self.urls.extend(href)
>
> PreURL = "http://news.qq.com/"
> usock = urllib.urlopen("http://news.qq.com/a/20071101/")
> parser = URLLister()
> parser.feed(usock.read())
> parser.close()
> usock.close()
> i = 1
> for url in parser.urls:
>     page = urllib.urlopen(PreURL + url)
>     data = page.read()
>     filename = 'D:\\test\\' + str(i) + '.htm'
>     i = i + 1
>     f = open(filename, 'wb')
>     f.write(data)
>     f.close()
>     page.close()

I have tried this several times, and every time it fails while downloading the 24th page with the following error:

> Traceback (most recent call last):
>   File "C:\Inetpub\wwwroot\MySite\collect\collect1.py", line 28, in <module>
>     page = urllib.urlopen(PreURL + url)
>   File "C:\Python25\lib\urllib.py", line 82, in urlopen
>     return opener.open(url)
>   File "C:\Python25\lib\urllib.py", line 190, in open
>     return getattr(self, name)(url)
>   File "C:\Python25\lib\urllib.py", line 328, in open_http
>     errcode, errmsg, headers = h.getreply()
>   File "C:\Python25\lib\httplib.py", line 1195, in getreply
>     response = self._conn.getresponse()
>   File "C:\Python25\lib\httplib.py", line 924, in getresponse
>     response.begin()
>   File "C:\Python25\lib\httplib.py", line 385, in begin
>     version, status, reason = self._read_status()
>   File "C:\Python25\lib\httplib.py", line 343, in _read_status
>     line = self.fp.readline()
>   File "C:\Python25\lib\socket.py", line 331, in readline
>     data = recv(1)
> IOError: [Errno socket error] (10054, 'Connection reset by peer')

What is causing this? Also, why does this code run so slowly? When it first parses the http://news.qq.com/a/20071101/ page, the IDLE window nearly freezes; only after several minutes does the downloading start, and each page then takes at least a minute to download.
Tuesday, November 6, 2007, 14:00
"The IDLE window nearly freezes": the slowness is probably because your URL open is blocking, much like IE often appears to hang. I think you would do better with multiple threads; after all, fetching pages one at a time is slow.

On 11/6/07, aaron <aaronkowk at gmail.com> wrote:
> I want to write a simple crawler program (really, it hardly counts as a crawler): I want to download all of the news pages under "http://news.qq.com/a/"... [snip]
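[Editor's sketch] For reference, a minimal sketch of what a threaded version might look like on Python 2.5, using the standard threading and Queue modules. The worker count, the placeholder url list, and the error handling are illustrative assumptions, not code from the thread:

# Hypothetical threaded downloader sketch (Python 2.5); not the poster's code.
import threading
import urllib
from Queue import Queue

PreURL = "http://news.qq.com/"
tasks = Queue()

def worker():
    # each thread pulls (index, url) pairs until the queue is drained
    while True:
        i, url = tasks.get()
        try:
            page = urllib.urlopen(PreURL + url)
            data = page.read()
            page.close()
            f = open('D:\\test\\' + str(i) + '.htm', 'wb')
            f.write(data)
            f.close()
        except IOError, e:
            print 'failed on %s: %s' % (url, e)   # skip bad pages, keep going
        tasks.task_done()

for _ in range(5):                 # 5 threads is an arbitrary choice
    t = threading.Thread(target=worker)
    t.setDaemon(True)
    t.start()

# in the original script, urls would be parser.urls from URLLister
urls = ['a/20071101/000001.htm']   # placeholder list
for i, url in enumerate(urls):
    tasks.put((i + 1, url))
tasks.join()

Queue.join() blocks until a task_done() call has been made for every queued item, so the main thread exits only after all pages have been written.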
Tuesday, November 6, 2007, 14:10
Thanks. I am a beginner and know almost nothing about how network access actually works at the lower levels; it seems this job is not as easy to implement as I thought, and I still need to learn a lot of the basics. As for multithreaded programming, I assumed I would not need it and have not started reading about it yet. Thanks for the reminder; I will try it right away.

On 07-11-6, Cyril. Liu <terry6394 at gmail.com> wrote:
> "The IDLE window nearly freezes": the slowness is probably because your URL open is blocking... [snip]
Tuesday, November 6, 2007, 14:34
Could the reset be because the pages you are downloading contain some sensitive keywords?
Also, try changing the HTTP request headers, e.g. the User-Agent.

On 11/6/07, aaron <aaronkowk at gmail.com> wrote:
> Thanks. I am a beginner and know almost nothing about how network access actually works at the lower levels... [snip]
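[Editor's sketch] One way to send a browser-like User-Agent on Python 2.5 is through urllib2; the header string below is an arbitrary example, not something prescribed in the thread:

# Hypothetical sketch of a request with a custom User-Agent header.
import urllib2

req = urllib2.Request(
    "http://news.qq.com/a/20071101/",
    headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'})
page = urllib2.urlopen(req)   # raises urllib2.HTTPError / URLError on failure
data = page.read()
page.close()

Some servers reset or refuse connections from clients whose User-Agent looks like a script, so identifying as a common browser can help, though it is no guarantee.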
Tuesday, November 6, 2007, 14:41
I would suggest not running the program under IDLE; it can produce some inexplicable errors. Try it from the command line.

On 07-11-6, Shao Feng <sevenever at gmail.com> wrote:
> Could the reset be because the pages you are downloading contain some sensitive keywords?
> Also, try changing the HTTP request headers, e.g. the User-Agent. [snip]

--
wayne
Friday, November 9, 2007, 13:58
urllib does not seem to handle some exceptional conditions, such as connection timeouts, abnormal disconnects, and errors like 404. I use HTTPConnection instead.

On 07-11-6, Wayne <moonbingbing at gmail.com> wrote:
> I would suggest not running the program under IDLE; it can produce some inexplicable errors. Try it from the command line. [snip]
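[Editor's sketch] A rough sketch of the httplib.HTTPConnection approach on Python 2.5, where the status code is visible and network errors can be caught explicitly. The fetch helper and the 30-second default timeout are illustrative assumptions:

# Hypothetical httplib-based fetch with explicit error handling.
import httplib
import socket

socket.setdefaulttimeout(30)   # HTTPConnection gains a timeout arg only in Python 2.6

def fetch(host, path):
    try:
        conn = httplib.HTTPConnection(host)
        conn.request('GET', path)
        resp = conn.getresponse()
        if resp.status != 200:
            print 'HTTP %d %s for %s' % (resp.status, resp.reason, path)
            return None
        return resp.read()
    except (httplib.HTTPException, socket.error), e:
        print 'network error for %s: %s' % (path, e)
        return None

data = fetch('news.qq.com', '/a/20071101/')

With this structure, a 10054 reset surfaces as a socket.error that the caller can log and skip, instead of killing the whole run.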