March 13, 2007 (Tuesday) 14:47

This is the question from my first-round interview at douban.com recently; I went home and implemented it. It is a site-specific crawler for dangdang.com. At first I used pysqlite2 with SQLite as the database, but I couldn't sort out the concurrent-access problems, so I switched to BerkeleyDB, i.e. the dbhash module. The BerkeleyDB data-model part was written in a basement, so don't blame me, heh.

Anyone who is interested can email me and I will reply with a compressed dump of the crawler's SVN repository. The repository was made with svn 1.3.2. I strongly recommend against reading only the source, because many of the problems found while debugging are written straight into the commit logs.

Current status: 2 threads and roughly 3,000 URLs, and it is fairly slow. The earlier SQLite version was even slower, though it got past 20,000 URLs.

Also, I hope the experts here can take a look at one thing: since switching to BerkeleyDB, using threading.Lock() goes wrong after running for a while — no exception is raised, the Python interpreter simply aborts.

--
Once there was a very cold caterpillar who wanted a little warmth. His only chance of warmth was to drop from the tree into someone's collar.
A moment of warmth, and then life is lost. Many of his kind never even get that moment..
Will I get warmth? I try ever so carefully, yet still get hurt.
I am willing to fight for that one moment of warmth, but who is willing to accept it?

Welcome to visit my blog:
http://blog.csdn.net/gashero
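[A hedged sketch related to the crash described above: a plain dbhash/bsddb hash file is not safe to touch from several threads at once, so one common workaround is to funnel every read and write through a single threading.Lock. The UrlStore class and the urls.db file name below are my own illustration, not code from the crawler being discussed; opening the database through a pybsddb DBEnv with thread support is another route, not shown here.]

# -*- coding: utf-8 -*-
# Illustrative only: serialize all BerkeleyDB (dbhash) access behind one lock.
# Class name and file name are made up for this example.
import dbhash
import threading

class UrlStore(object):
    def __init__(self, path='urls.db'):
        self.db = dbhash.open(path, 'c')   # BerkeleyDB hash file, created if missing
        self.lock = threading.Lock()       # one lock guards every access

    def set_status(self, url, status):
        self.lock.acquire()
        try:
            self.db[url] = status
        finally:
            self.lock.release()            # release even if the write fails

    def get_status(self, url, default=None):
        self.lock.acquire()
        try:
            if self.db.has_key(url):
                return self.db[url]
            return default
        finally:
            self.lock.release()

    def close(self):
        self.lock.acquire()
        try:
            self.db.sync()
            self.db.close()
        finally:
            self.lock.release()

if __name__ == '__main__':
    store = UrlStore()
    store.set_status('http://www.dangdang.com/', '0')   # '0' = not yet fetched
    print store.get_status('http://www.dangdang.com/')
    store.close()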
March 13, 2007 (Tuesday) 15:10

Whoa........ the interview asks you to write a crawler?
March 13, 2007 (Tuesday) 15:47

I'd like to study it, thanks.

On 07-3-13, gashero <harry.python在gmail.com> wrote:
> This is the question from my first-round interview at douban.com recently; I went home and implemented it. A site-specific crawler for dangdang.com...
March 13, 2007 (Tuesday) 16:52

Very interested in studying it. My learning lately has been aimless and I don't feel I'm getting much out of it, and I can't think of anything to build either. The problem is I don't know how to see your original email address, so I have no way to mail you :(
My email is zbbstar在gmail.com — when you see this, please send me a copy. Thanks.
March 13, 2007 (Tuesday) 17:10

I'm also very interested in learning from it. Could you send a copy of the source code to my mailbox as well? Thanks.
March 13, 2007 (Tuesday) 17:25

Please send me a copy too, thanks.

--
Best Regards
JesseZhao(ZhaoGuang)
Blog : Http://JesseZhao.cnblogs.com
E-Mail : Prolibertine在gmail.com
IM(Live Messenger) : Prolibertine在gmail.com
March 13, 2007 (Tuesday) 17:27

Please send me a copy of the source code as well. Thanks.

----- Original Message -----
From: Xell Zhang
To: python-chinese在lists.python.cn
Sent: Tuesday, March 13, 2007 5:10 PM
Subject: Re: [python-chinese] The crawler from the basement
March 13, 2007 (Tuesday) 22:22

Please send me a copy too so I can study it. Thanks.
March 13, 2007 (Tuesday) 22:29

Interested — thanks for providing a copy!

--
Honker Network http://www.allhonker.com
The first hacker site in China built on wiki technology!
March 13, 2007 (Tuesday) 23:03

A quick & dirty spider I wrote last year for crawling a specified site. I've forgotten most of it by now — see whether it can serve as a reference for everyone.

# -*- coding: utf-8 -*-
from twisted.python import threadable
threadable.init()
from twisted.internet import reactor, threads

import urllib2
import urllib
import urlparse
import time
from sgmllib import SGMLParser

from usearch import USearch  # this part handles the database operations; its source cannot be published


class URLLister(SGMLParser):

    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)


class Filter:

    def __init__(self, Host, denys=None, allows=None):
        self.deny_words = denys
        self.allow_words = allows

    # Check url is valid or not.
    def verify(self, url):
        for k in self.deny_words:
            if url.find(k) != -1:
                return False
        for k in self.allow_words:
            if url.find(k) != -1:
                return True
        return True


class Host:

    def __init__(self, hostname, entry_url=None, description=None,
                 encoding=None, charset=None):
        self.hostname = hostname
        self.entry_url = entry_url
        self.encoding = encoding
        self.charset = charset
        self.description = description

    def configxml(self):
        import elementtree.ElementTree as ET
        root = ET.Element("config")
        en = ET.SubElement(root, "encoding")
        en.text = self.encoding
        ch = ET.SubElement(root, "charset")
        ch.text = self.charset
        entry = ET.SubElement(root, "entry_url")
        entry.text = self.entry_url
        return ET.tostring(root)

    def parse_config(self, configstring):
        import elementtree.ElementTree as ET
        from StringIO import StringIO
        tree = ET.parse(StringIO(configstring))
        self.encoding = tree.findtext(".//encoding")
        self.charset = tree.findtext(".//charset")
        self.entry_url = tree.findtext(".//entry_url")

    def create(self):
        u = USearch()
        self.configs = self.configxml()
        ret = u.CreateDomain(self.hostname, self.description, self.configs)
        #print ret

    def load(self, flag='A'):  # 'A' means all, 0 means unvisited, 1 == visiting, 2 = visited.
        # TODO: load domain data from backend database.
        u = USearch()
        try:
            ret = u.ListDomain(flag)['result']
            for d in ret:
                if d.domain == self.hostname:
                    self.parse_config(d.parse_config)
                    self.description = d.description
                    return True
        except:
            pass
        return False


class Page:

    def __init__(self, url, host, description=None):
        self.url = url
        self.description = description
        self.host = host
        self.page_request = None
        self.content = None
        self.status_code = None
        self.encoding = None
        self.charset = None
        self.length = 0
        self.md5 = None
        self.urls = []

    # Read web page.
    def get_page(self, url=None):
        if not url: url = self.url
        type = get_type(self.host.hostname, url)
        if type != 0: return None
        try:
            opener = urllib2.build_opener()
            opener.addheaders = [('User-agent', 'Mozilla/5.0')]
            self.page_request = opener.open(urllib.unquote(url))
            #self.page_request = urllib2.urlopen(url)
            self.content = self.page_request.read()
            self.status_code = self.page_request.code
            return self.status_code
        except:
            self.status_code = 500
            print "ERROR READING: %s" % self.url
            return None

    def get_header(self):
        if not self.page_request:
            self.get_page()
        header = self.page_request.info()
        try:
            self.length = header['Content-Length']
            content_type = header['Content-Type']
            #if content_type.find('charset') == -1:
            self.charset = self.host.charset
            self.encoding = self.host.encoding
        except:
            pass

    def get_urls(self):
        if not self.page_request:
            self.get_page()
        if self.status_code != 200:
            return
        parser = URLLister()
        try:
            parser.feed(self.content)
        except:
            print "ERROR: Parse urls error!"
            return
        #print "URLS: ", parser.urls
        #self.urls = parser.urls
        if not self.charset: self.charset = "gbk"
        for i in parser.urls:
            try:
                type = get_type(self.host.hostname, i)
                if type == 4:
                    i = join_url(self.host.hostname, self.url, i)
                if type == 0 or type == 4:
                    if i:
                        i = urllib.quote(i)
                        self.urls.append(i.decode(self.charset).encode('utf-8'))
            except:
                pass
        parser.close()
        self.page_request.close()

    def save_header(self):
        # Save header info into db.
        pass

    def save_current_url(self):
        save_url = urllib.quote(self.url)
        usearch = USearch()
        usearch.CreateUrl(domain=self.host.hostname, url=save_url,
                          length=self.length, status_code=self.status_code)

    # Set URL's flag
    def flag_url(self, flag):
        usearch = USearch()
        usearch.UpdateUrl(status=flag)

    def save_urls(self):
        # Save all the found urls into db
        print "RELATED_URLS:", len(self.urls)
        usearch = USearch()
        usearch.CreateRelateUrl(urllib.quote(self.url), self.urls)

    def save_page(self):
        usearch = USearch()
        import cgi
        try:
            content = self.content.decode(self.charset).encode('utf-8')
            usearch.CreateSearchContent(self.url.decode(self.charset).encode('utf-8'), content)
        except:
            print "ERROR to save page"
            return -1
        print "SAVE PAGE Done", self.url
        return 0


def get_type(domain, url):
    if not url: return 5
    import urlparse
    tup = urlparse.urlparse(url)
    if tup[0] == "http":
        # check if the same domain
        if tup[1] == domain: return 0
        else: return 1  # outside link
    if tup[0] == "javascript":
        return 2
    if tup[0] == "ftp":
        return 3
    if tup[0] == "mailto":
        return 5
    return 4  # internal link


def join_url(domain, referral, url):
    if not url or len(url) == 0: return None
    tup = urlparse.urlparse(url)
    if not tup: return None
    if tup[0] == "javascript" or tup[0] == "ftp": return None
    else:
        if url[0] == "/":  # means root link begins
            newurl = "http://%s%s" % (domain, url)
            return newurl
        if url[0] == ".": return None  # ignore relative link at first.
        else:
            # if referral.rfind("/") != -1:
            #     referral = referral[0:referral.rfind("/")+1]
            #     newurl = "%s%s" % (referral, url)
            newurl = urlparse.urljoin(referral, url)
            return newurl


if __name__ == '__main__':

    def done(x):
        u = USearch()
        x = urllib.quote(x.decode('gbk').encode('utf-8'))
        u.SetUrlStatus(x, '2')
        time.sleep(2)
        print "DONE: ", x
        url = next_url(h)
        if not url: reactor.stop()
        else: threads.deferToThread(spider, h, url).addCallback(done)

    def next_url(host):
        u = USearch()
        ret = u.GetTaskUrls(host.hostname, '0', 1)['result']
        try:
            url = urllib.unquote(ret[0].url)
        except:
            return None
        if urlparse.urlparse(url)[1] != host.hostname: next_url(host)
        return urllib.unquote(ret[0].url)

    def spider(host, surf_url):
        #surf_url = surf_url.decode(host.charset).encode('utf-8')
        surf_url = urllib.unquote(surf_url)
        p = Page(surf_url, host)
        #try:
        if not p.get_page():
            print "ERROR: GET %s error!" % surf_url
            return surf_url  # Something Wrong!
        p.get_header()        # Get page's header
        p.get_urls()          # Get all the urls in page
        #print p.urls
        p.save_current_url()  # Save current page's url info into DB
        p.save_urls()
        p.save_page()
        #except:
        #    pass
        return surf_url

    import sys
    #host = Host("www.chilema.cn", "/Eat/", "Shenzhen Local", "", "gb2312")
    #host.create()

    #~ h = Host("www.chilema.cn")
    #~ h.load()
    #~ #reactor.callInThread(Spider, h, "http://beta.u2m.cn/")
    #~ #reactor.callInThread(Spider, h, "http://beta.u2m.cn/canyin/")
    #~ #reactor.callInThread(Spider, h, "http://beta.u2m.cn/fb/")
    #~ threads.deferToThread(spider, h, "http://www.chilema.cn/Eat/").addCallback(done)

    #host = Host("www.ziye114.com", "", "Beijing Local", "gb2312")
    #host.create()

    hostname = sys.argv[1]
    entry_url = ""
    if len(sys.argv) == 3: entry_url = sys.argv[2]

    h = Host(hostname)
    hostname_url = "http://%s/%s" % (hostname, entry_url)
    h.load()
    threads.deferToThread(spider, h, hostname_url).addCallback(done)
    threads.deferToThread(spider, h, next_url(h)).addCallback(done)
    threads.deferToThread(spider, h, next_url(h)).addCallback(done)
    threads.deferToThread(spider, h, next_url(h)).addCallback(done)
    reactor.run()

------------------------------

Best Regards,

Devin Deng
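[The spider above imports a private usearch module that was never posted, so it cannot run as-is. Below is a minimal, purely hypothetical in-memory stand-in, written only so the control flow above can be exercised: the class name USearch and every method signature are inferred from how the spider calls them, not taken from the real, unpublished implementation.]

# Hypothetical stand-in for the unpublished 'usearch' module (save as usearch.py).
# All names and signatures below are guessed from the calls in the spider above.

class _Row(object):
    """Tiny record object so callers can use attribute access (d.domain, r.url)."""
    def __init__(self, **kw):
        self.__dict__.update(kw)


class USearch(object):
    # Shared in-memory state across all USearch() instances.
    _domains = {}   # hostname -> _Row(domain, description, parse_config)
    _urls = {}      # url -> status string: '0' pending, '1' visiting, '2' visited
    _pages = {}     # url -> page content

    def CreateDomain(self, domain, description, configs):
        USearch._domains[domain] = _Row(domain=domain, description=description,
                                        parse_config=configs)
        return {'result': 'ok'}

    def ListDomain(self, flag='A'):
        return {'result': USearch._domains.values()}

    def CreateUrl(self, domain=None, url=None, length=0, status_code=None):
        USearch._urls.setdefault(url, '0')

    def CreateRelateUrl(self, url, urls):
        for u in urls:
            USearch._urls.setdefault(u, '0')

    def CreateSearchContent(self, url, content):
        USearch._pages[url] = content

    def SetUrlStatus(self, url, status):
        USearch._urls[url] = status

    def UpdateUrl(self, status=None):
        pass  # the spider never passes a URL here, so nothing useful can be updated

    def GetTaskUrls(self, domain, status, limit):
        pending = [_Row(url=u) for u, s in USearch._urls.items() if s == status]
        return {'result': pending[:int(limit)]}

[With a file like this next to the spider, "python spider.py www.dangdang.com" would at least start the fetch loop; anything meant for real use would of course need the persistent database layer the original usearch provided.]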
March 14, 2007 (Wednesday) 01:30

I find the original poster's willingness to share his code admirable, and the thirst for knowledge of everyone replying to ask for it is likewise commendable, but it would be even better if you could send those requests directly to his email inbox.
March 14, 2007 (Wednesday) 09:21

The key is to find the performance bottleneck. If it is the database, you can switch to a sqlrelay connection pool or to a MySQL database; if there aren't enough threads, raise the thread count or use a thread pool; if the logic inside Python is what's slow, move the hot spots out into C modules; and if it's the network that is slow.......
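[Before picking any of those fixes, it helps to measure where the time actually goes. A small sketch of how one might profile a single crawl pass with the standard-library cProfile module; the crawl_once function and its workload are placeholders for whatever the crawler's main loop really does.]

# Illustrative profiling harness; crawl_once() stands in for the crawler's real work.
import cProfile
import pstats
import time
import urllib2

def crawl_once(url):
    # placeholder workload: fetch one page and pretend to parse/store it
    data = urllib2.urlopen(url).read()
    time.sleep(0.1)   # stands in for database writes
    return len(data)

def main():
    for i in range(5):
        crawl_once('http://www.dangdang.com/')

if __name__ == '__main__':
    cProfile.run('main()', 'crawl.prof')             # write raw timings to a file
    stats = pstats.Stats('crawl.prof')
    stats.sort_stats('cumulative').print_stats(15)   # top 15 entries by cumulative time

[If most of the cumulative time lands in the urllib2/socket calls, the job is network-bound; if it lands in the database layer, that is where to optimize first.]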
March 14, 2007 (Wednesday) 09:50

If the crawler is too aggressive, the target site will probably block it, won't it?
March 14, 2007 (Wednesday) 10:47

2007/3/14, vicalloy <zbirder在gmail.com>:
> If the crawler is too aggressive, the target site will probably block it, won't it?

Then better to crawl slowly — more haste, less speed... haha...
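[One simple way to "crawl slowly" is to enforce a minimum delay between requests to the same host (checking robots.txt with the standard robotparser module is another common courtesy, not shown here). A small sketch; the delay value and the PoliteFetcher name are made up for illustration.]

# Illustrative per-host politeness delay for a crawler.
import time
import urllib2
import urlparse

class PoliteFetcher(object):
    def __init__(self, min_delay=2.0):
        self.min_delay = min_delay   # seconds to wait between hits on one host
        self.last_hit = {}           # hostname -> time of the previous request

    def fetch(self, url):
        host = urlparse.urlparse(url)[1]
        elapsed = time.time() - self.last_hit.get(host, 0)
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)   # back off before hitting the host again
        self.last_hit[host] = time.time()
        return urllib2.urlopen(url).read()

if __name__ == '__main__':
    f = PoliteFetcher(min_delay=2.0)
    for u in ['http://www.dangdang.com/', 'http://www.dangdang.com/book/']:
        print len(f.fetch(u)), u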
March 14, 2007 (Wednesday) 11:14

I'd like to study it too — please send me a copy by replying directly to this mail. Thanks.
March 14, 2007 (Wednesday) 11:31

On 3/13/07, Devin Deng <deng.devin在gmail.com> wrote:
> A quick & dirty spider I wrote last year for crawling a specified site.
> I've forgotten most of it by now — see whether it can serve as a reference for everyone.

Collected!
http://wiki.woodpecker.org.cn/moin/MicroProj/2007-03-14

--
'''Time is unimportant, only life important!
http://zoomquiet.org
blog在http://blog.zoomquiet.org/pyblosxom/
wiki在http://wiki.woodpecker.org.cn/moin/ZoomQuiet
scrap在http://floss.zoomquiet.org
douban在http://www.douban.com/people/zoomq/
____________________________________
Pls. use OpenOffice.org to replace M$ Office.
http://zh.openoffice.org
Pls. use 7-zip to replace WinRAR/WinZip.
http://7-zip.org/zh-cn/
You can get the truely Freedom 4 software.
'''
March 14, 2007 (Wednesday) 23:15

The bottleneck is most likely the CPU, or else the network.
March 14, 2007 (Wednesday) 23:15

I misspoke — it's the database; things like the indexes need to be done properly..
March 15, 2007 (Thursday) 00:05

Please send me a copy of the source code as well. I'd like to study it, thanks.

--
sleepy right brain
March 15, 2007 (Thursday) 10:39

This is a mailing list — anyone who wants a copy should just send the request to his **personal** mailbox. What is the point of posting to the whole list, where a big pile of people receive it? And I don't know what gashero, who started this thread, was thinking: if he wants to share the code, wouldn't it be better to just post it on the list directly? It ends up looking like those spammers who harvest email addresses.
March 15, 2007 (Thursday) 10:56

Heh — everyone here is just eager to learn. But it seems that besides the one line "send me the code", people should also share their own impressions and dig into the problems gashero ran into. The approach may not be ideal, but gashero's willingness to share is still admirable.