2007-03-13 Tuesday 14:47

This was the first-round question from my recent interview at douban.com; after I got home I went ahead and implemented it: a site-specific crawler targeting dangdang.com. I started out using pysqlite2 with SQLite as the database, but I could not sort out the concurrent-access problems, so I switched to BerkeleyDB, i.e. the dbhash module. The BerkeleyDB data-model part was written in a basement, so don't blame me, heh.

Anyone who is interested can email me and I will reply with a compressed dump of the crawler's SVN repository. It is at SVN revision 1.3.2. I strongly advise against reading only the source, because many of the problems found while debugging are written straight into the commit logs.

Current status: 2 threads and roughly 3,000 URLs, and it is fairly slow. It was even slower back when I used SQLite, although by then it had collected more than 20,000 URLs.

One more thing I hope the experts here can look at: since switching to BerkeleyDB, using threading.Lock() goes wrong after the program has run for a while — no exception is raised, the Python interpreter simply aborts.

--
Once there was a very cold caterpillar who longed for a little warmth. His only chance of getting it was to drop from a tree into someone's collar: a moment of warmth, and then his life is over. Yet many of his kind never even get that moment...
Will I find warmth? I try ever so carefully, yet still get hurt.
I would fight for that one moment of warmth, but who is willing to accept it?

My blog:
http://blog.csdn.net/gashero
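One way to sidestep this kind of interpreter abort is to stop sharing the BerkeleyDB handle between threads altogether: a single writer thread owns the dbhash handle and the crawler threads hand it work through a Queue, so no threading.Lock() is needed around database calls. Below is a minimal sketch under that assumption, written against Python 2's dbhash, Queue and threading modules; the file name 'urls.db' and the key/value layout are made up for illustration and are not taken from gashero's crawler.

import dbhash
import Queue
import threading

write_queue = Queue.Queue()

def db_writer(path):
    # this thread is the only one that ever touches the BerkeleyDB handle,
    # so no lock around the database calls is needed
    db = dbhash.open(path, 'c')
    while True:
        item = write_queue.get()
        if item is None:          # sentinel value tells the writer to stop
            break
        key, value = item
        db[key] = value
        db.sync()
    db.close()

writer = threading.Thread(target=db_writer, args=('urls.db',))
writer.start()

# a crawler thread queues a write instead of touching the database itself:
write_queue.put(('http://www.dangdang.com/', '0'))

# at shutdown, push the sentinel and wait for the writer to drain the queue:
write_queue.put(None)
writer.join()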
2007-03-13 Tuesday 15:10

Whoa........ an interview exam on how to write a crawler?
2007-03-13 Tuesday 15:47

I'd like to study it too, thank you.
2007-03-13 Tuesday 16:52

I'm very interested in studying it. My learning lately has been aimless and I don't feel I'm getting much out of it, and I can't think of anything to build either. The problem is I can't see your original email address from the list post, so I have no way to mail you :( My address is zbbstar在gmail.com — please send me a copy when you see this. Thanks.
2007-03-13 Tuesday 17:10

I'm also very interested in learning from it. Could you send a copy of the source code to my mailbox? Thank you.
2007-03-13 Tuesday 17:25

Please send me a copy as well, thanks.

--
Best Regards
JesseZhao (ZhaoGuang)
Blog: Http://JesseZhao.cnblogs.com
E-Mail: Prolibertine在gmail.com
IM (Live Messenger): Prolibertine在gmail.com
2007-03-13 Tuesday 17:27

Please send me a copy of the source code too. Thank you.
2007-03-13 Tuesday 22:22

Please send me a copy as well — I'd like to study it. Thanks.
2007-03-13 Tuesday 22:29

I'm interested — thanks for providing a copy!

--
Honker Network: http://www.allhonker.com
The first hacker-oriented site in China built on wiki technology!
2007-03-13 Tuesday 23:03

Here is a quick & dirty spider I wrote last year for crawling a specified site. I've forgotten most of it by now, but perhaps it can serve as a reference for everyone.
# -*- coding: utf-8 -*-
from twisted.python import threadable
threadable.init()
from twisted.internet import reactor, threads
import urllib2
import urllib
import urlparse
import time
from sgmllib import SGMLParser
from usearch import USearch # this part handles the database operations; its source cannot be published
class URLLister(SGMLParser):
def reset(self):
SGMLParser.reset(self)
self.urls = []
def start_a(self, attrs):
href = [v for k, v in attrs if k=='href']
if href:
self.urls.extend(href)
class Filter:
def __init__(self, Host, denys=None, allows=None):
self.deny_words = denys
self.allow_words = allows
# Check url is valid or not.
def verify(self, url):
for k in self.deny_words:
if url.find(k) != -1:
return False
for k in self.allow_words:
if url.find(k) !=-1:
return True
return True
class Host:
def __init__(self, hostname, entry_url=None, description=None,
encoding=None, charset=None):
self.hostname = hostname
self.entry_url = entry_url
self.encoding = encoding
self.charset = charset
self.description = description
def configxml(self):
import elementtree.ElementTree as ET
root = ET.Element("config")
en = ET.SubElement(root, "encoding")
en.text = self.encoding
ch = ET.SubElement(root, "charset")
ch.text = self.charset
entry = ET.SubElement(root, "entry_url")
entry.text = self.entry_url
return ET.tostring(root)
def parse_config(self, configstring):
import elementtree.ElementTree as ET
from StringIO import StringIO
tree = ET.parse(StringIO(configstring))
self.encoding = tree.findtext(".//encoding")
self.charset = tree.findtext(".//charset")
self.entry_url = tree.findtext(".//entry_url")
def create(self):
u = USearch()
self.configs = self.configxml()
ret = u.CreateDomain(self.hostname,self.description, self.configs)
#print ret
def load(self, flag='A'): # 'A' means all, 0 means unvisited, 1 == visiting, 2 == visited.
# TODO: load domain data from backend database.
u = USearch()
try:
ret = u.ListDomain(flag)['result']
for d in ret:
if d.domain == self.hostname:
self.parse_config(d.parse_config)
self.description = d.description
return True
except:
pass
return False
class Page:
def __init__(self, url, host, description=None):
self.url = url
self.description = description
self.host = host
self.page_request = None
self.content = None
self.status_code = None
self.encoding = None
self.charset = None
self.length = 0
self.md5 = None
self.urls = []
# Read web page.
def get_page(self, url=None):
if not url: url = self.url
type = get_type(self.host.hostname,url)
if type != 0: return None
try:
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
self.page_request = opener.open(urllib.unquote(url))
#self.page_request = urllib2.urlopen(url)
self.content = self.page_request.read()
self.status_code = self.page_request.code
return self.status_code
except:
self.status_code = 500
print "ERROR READING: %s" % self.url
return None
def get_header(self):
if not self.page_request:
self.get_page()
header = self.page_request.info()
try:
self.length = header['Content-Length']
content_type = header['Content-Type']
#if content_type.find('charset') == -1:
self.charset = self.host.charset
self.encoding = self.host.encoding
except:
pass
def get_urls(self):
if not self.page_request:
self.get_page()
if self.status_code != 200:
return
parser = URLLister()
try:
parser.feed(self.content)
except:
print "ERROR: Parse urls error!"
return
#print "URLS: ", parser.urls
#self.urls = parser.urls
if not self.charset: self.charset = "gbk"
for i in parser.urls:
try:
type = get_type(self.host.hostname,i)
if type == 4:
i = join_url(self.host.hostname, self.url, i)
if type == 0 or type ==4:
if i:
i = urllib.quote(i)
self.urls.append(i.decode(self.charset).encode('utf-8'))
except:
pass
parser.close()
self.page_request.close()
def save_header(self):
# Save header info into db.
pass
def save_current_url(self):
save_url = urllib.quote(self.url)
usearch = USearch()
usearch.CreateUrl( domain=self.host.hostname, url=save_url,
length=self.length, status_code=self.status_code)
# Set URL's flag
def flag_url(self, flag):
usearch = USearch()
usearch.UpdateUrl(status=flag)
def save_urls(self):
# Save all the founded urls into db
print "RELEATED_URLS:", len(self.urls)
usearch = USearch()
usearch.CreateRelateUrl(urllib.quote(self.url), self.urls)
def save_page(self):
usearch = USearch()
import cgi
try:
content = self.content.decode(self.charset).encode('utf-8')
usearch.CreateSearchContent(self.url.decode(self.charset).encode('utf-8'),
content)
except:
print "ERROR to save page"
return -1
print "SAVE PAGE Done", self.url
return 0
def get_type(domain, url):
if not url: return 5
import urlparse
tup = urlparse.urlparse(url)
if tup[0] == "http":
# check if the same domain
if tup[1] == domain: return 0
else: return 1 # outside link
if tup[0] == "javascript":
return 2
if tup[0] == "ftp":
return 3
if tup[0] == "mailto":
return 5
return 4 # internal link
def join_url(domain, referral, url):
if not url or len(url) ==0: return None
tup = urlparse.urlparse(url)
if not tup: return None
if tup[0] == "javascript" or tup[0] == "ftp": return None
else:
if url[0] == "/": # means root link begins
newurl = "http://%s%s" % ( domain, url)
return newurl
if url[0] == ".": return None # ignore relative link at first.
else:
# if referral.rfind("/") != -1:
# referral = referral[0:referral.rfind("/")+1]
# newurl = "%s%s" % (referral, url)
newurl = urlparse.urljoin(referral, url)
return newurl
if __name__ == '__main__':
def done(x):
u = USearch()
x = urllib.quote(x.decode('gbk').encode('utf-8'))
u.SetUrlStatus(x, '2')
time.sleep(2)
print "DONE: ",x
url = next_url(h)
if not url: reactor.stop()
else:threads.deferToThread(spider, h, url ).addCallback(done)
def next_url(host):
u = USearch()
ret = u.GetTaskUrls(host.hostname,'0',1)['result']
try:
url = urllib.unquote(ret[0].url)
except:
return None
if urlparse.urlparse(url)[1] != host.hostname: return next_url(host) # skip URLs that belong to a different host
return urllib.unquote(ret[0].url)
def spider(host, surf_url):
#surf_url = surf_url.decode(host.charset).encode('utf-8')
surf_url = urllib.unquote(surf_url)
p = Page(surf_url, host)
#try:
if not p.get_page():
print "ERROR: GET %s error!" % surf_url
return surf_url # Something Wrong!
p.get_header() # Get page's header
p.get_urls() # Get all the urls in page
#print p.urls
p.save_current_url() # Save current page's url info into DB
p.save_urls()
p.save_page()
#except:
# pass
return surf_url
import sys
#host = Host("www.chilema.cn", "/Eat/", "Shenzhen Local", "","gb2312")
#host.create()
#~ h = Host("www.chilema.cn")
#~ h.load()
#~ #reactor.callInThread(Spider, h, "http://beta.u2m.cn/")
#~ #reactor.callInThread(Spider, h, "http://beta.u2m.cn/canyin/")
#~ #reactor.callInThread(Spider, h, "http://beta.u2m.cn/fb/")
#~ threads.deferToThread(spider, h, "http://www.chilema.cn/Eat/").addCallback(done)
#host = Host("www.ziye114.com", "", "Beijing Local", "gb2312")
#host.create()
hostname = sys.argv[1]
entry_url = ""
if len(sys.argv) == 3: entry_url = sys.argv[2]
h = Host(hostname)
hostname_url = "http://%s/%s" % (hostname,entry_url)
h.load()
threads.deferToThread(spider, h, hostname_url).addCallback(done)
threads.deferToThread(spider, h, next_url(h)).addCallback(done)
threads.deferToThread(spider, h, next_url(h)).addCallback(done)
threads.deferToThread(spider, h, next_url(h)).addCallback(done)
reactor.run()
------------------------------
Best Regards,
Devin Deng
2007-03-14 Wednesday 01:30

I find the original poster's willingness to share his code truly admirable, and the thirst for knowledge of everyone replying to ask for it is just as commendable — but it would be even better if those requests were sent directly to his email address.
2007-03-14 Wednesday 09:21

The key is to find the performance bottleneck. If it is the database, you can switch to an sqlrelay connection pool or to a MySQL database; if the thread count is too low, add threads or use a thread pool; if the Python-side logic is slow, move the hot spots out into a C module; if it is the network that is slow.......
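Before choosing among these fixes, it may help to measure where the time actually goes. A minimal sketch using the standard cProfile and pstats modules; crawl_once() is a hypothetical stand-in for one fetch-parse-store cycle of the crawler, not a function from the code in this thread.

import cProfile
import pstats

def crawl_once():
    # hypothetical stand-in: fetch one page, parse its links, store the results
    pass

cProfile.run('crawl_once()', 'crawl.prof')
stats = pstats.Stats('crawl.prof')
stats.sort_stats('cumulative').print_stats(20)   # 20 most expensive call paths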
2007-03-14 Wednesday 09:50

If the spider hits too hard, it will probably get itself blocked by the target site.
2007-03-14 Wednesday 10:47

Then better to crawl slowly — more haste, less speed... haha...
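If getting banned is the worry, the simplest throttle is a fixed delay between consecutive requests. A minimal single-threaded sketch assuming Python 2's urllib2; the 2-second default and the example URLs are arbitrary choices, not something the crawlers in this thread actually do.

import time
import urllib2

_last_request = 0.0   # time of the previous request

def polite_fetch(url, delay=2.0):
    # leave at least `delay` seconds between consecutive requests
    global _last_request
    wait = _last_request + delay - time.time()
    if wait > 0:
        time.sleep(wait)
    _last_request = time.time()
    return urllib2.urlopen(url).read()

# example: fetch two pages at least 2 seconds apart
page1 = polite_fetch('http://www.dangdang.com/')
page2 = polite_fetch('http://www.dangdang.com/book/')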
2007-03-14 Wednesday 11:14

I'd like to study it too — please send me a copy by replying directly to this email. Thank you.
2007-03-14 Wednesday 11:31

On 3/13/07, Devin Deng <deng.devin在gmail.com> wrote:
> Here is a quick & dirty spider I wrote last year for crawling a specified site.
> I've forgotten most of it by now, but perhaps it can serve as a reference for everyone.

Collected!
http://wiki.woodpecker.org.cn/moin/MicroProj/2007-03-14

--
'''Time is unimportant, only life important!
http://zoomquiet.org
'''
2007-03-14 Wednesday 23:15

The bottleneck is most likely the CPU, or else the network.
2007-03-14 Wednesday 23:15

Correction — it's the database; things like indexes need to be set up properly.
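For the SQLite variant mentioned at the start of the thread, the lookup an index has to cover is "have we already seen this URL". A minimal sketch using the sqlite3 module; the database file, table and column names are made up for illustration.

import sqlite3

conn = sqlite3.connect('crawler.db')           # hypothetical database file
conn.execute('CREATE TABLE IF NOT EXISTS urls (url TEXT, status INTEGER)')
# a unique index keeps the "was this URL crawled already?" lookup fast
# and doubles as a duplicate guard when inserting newly discovered links
conn.execute('CREATE UNIQUE INDEX IF NOT EXISTS idx_urls_url ON urls (url)')
conn.commit()

# the dedup check then stays an index lookup instead of a table scan:
row = conn.execute('SELECT status FROM urls WHERE url = ?',
                   ('http://www.dangdang.com/',)).fetchone()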
2007-03-15 Thursday 00:05

Please send me a copy of the source code too. I'd like to study it, thanks.

--
sleepy right brain
2007-03-15 Thursday 10:39

This is a mailing list. If you want a copy, just send your request to his **personal** mailbox. What is the point of posting to the list so that a whole crowd of people receive it? I also don't know what gashero, who started this thread, was thinking: if you want to share code, wouldn't it be better to paste it to the list directly? This looks just like the spammers who post in order to harvest email addresses.
2007-03-15 Thursday 10:56

Heh — everyone here is simply eager to learn. But besides asking "send me the code", it would be better to also share your own impressions and actually discuss the problem gashero ran into. His approach may not have been ideal, but gashero's willingness to share is still admirable.