2007-07-06, Friday, 09:22
Posting this directly to python-chinese at lists.python.cn somehow failed, no idea why; since the script is similar in spirit to others posted there, I am hanging it here instead.

*Download the latest Southern Weekly (南方周末) pages and save them as one txt file*

Every Thursday evening Southern Weekly publishes its articles on its website. To make them easy to copy onto a phone for reading, I wrote a small script that automatically downloads the latest issue and saves it as a plain-text file:

# download html from zm and save as txt (Python 2)
# -*- coding:utf-8 -*-
import htmllib, formatter, urllib, re

website = 'http://www.nanfangdaily.com.cn/zm/'
f = urllib.urlopen(website)
html = f.read().lower()
# The front page redirects via a meta refresh; pull the issue date out of 'url='.
i = html.find('url=')
j = html.find('/', i + 4)
date = html[i + 4:j]
website += date

# Parse the issue index and collect every anchor target.
f = urllib.urlopen(website)
p = htmllib.HTMLParser(formatter.NullFormatter())
p.feed(f.read())
p.close()
# Keep only article pages (*.asp), dropping duplicates.
seen = set()
for url in p.anchorlist:
    if url[-3:] == 'asp':
        if url in seen:
            continue
        seen.add(url)

urls = list(seen)
k = len(urls)
doc = open(u'南方周末'.encode('gb18030') + date + '.txt', 'a')
for l, url in enumerate(urls):
    f = urllib.urlopen(website + url[1:])
    html = f.read()
    # The headline is the first text run after the '#ff0000' color attribute.
    i = html.find('#ff0000')
    i = html.find('>', i + 7)
    j = html.find('<', i + 1)
    doc.write(html[i + 1:j])
    # The article body follows the 'content01' marker.
    i = html.find('content01', j + 1)
    i = html.find('>', i + 9)
    j = html.find('</td', i + 1)  # end-of-body marker; the exact string was eaten by the mail archive, '</td' is a plausible reconstruction
    content = html[i + 1:j]
    # Strip the remaining HTML tags (pattern restored from the archive remnant r']*>').
    reobj = re.compile(r'<[^>]*>', re.IGNORECASE)
    doc.write(reobj.sub('\n', content) + '\n------------\n')
    print l + 1, '-->', k
doc.close()
print u'下载结束'  # "download finished"
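The script above targets Python 2, whose htmllib module (and its anchorlist attribute) no longer exists in Python 3. For readers on a modern interpreter, the anchor-collecting and deduplication step might be sketched with html.parser instead; AnchorCollector and the sample HTML below are illustrative names and data, not part of the original script:

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags, mimicking the old
    htmllib anchorlist attribute that Python 3 removed."""
    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.anchorlist.append(value)

# Made-up index page standing in for the real issue index.
sample = ('<a href="./080101.asp">headline</a>'
          '<a href="./080102.asp">story</a>'
          '<a href="./080101.asp">duplicate</a>'
          '<a href="./pic.jpg">image</a>')

p = AnchorCollector()
p.feed(sample)
p.close()

# Keep only article pages (*.asp), dropping duplicates,
# but preserving page order (the original set loses it).
seen = set()
urls = []
for u in p.anchorlist:
    if u.endswith('asp') and u not in seen:
        seen.add(u)
        urls.append(u)
```

Using a list alongside the set also fixes a small wart in the original: `list(seen)` yields the articles in arbitrary order, so the saved file does not follow the paper's layout.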
2007-07-06, Friday, 09:41
On 7/6/07, kergee!z <kergee at gmail.com> wrote:
> Posting this directly to python-chinese at lists.python.cn somehow failed, no idea why; since the script is similar in spirit to others posted there, I am hanging it here instead.

Thanks for sharing! Collected at:
http://wiki.woodpecker.org.cn/moin/MicroProj/2007-07-06
Heh heh heh!

> *Download the latest Southern Weekly (南方周末) pages and save them as one txt file*
> [...]

_______________________________________________
python-chinese
Post: send python-chinese at lists.python.cn
Subscribe: send subscribe to python-chinese-request at lists.python.cn
Unsubscribe: send unsubscribe to python-chinese-request at lists.python.cn
Detail Info: http://python.cn/mailman/listinfo/python-chinese

--
'''Time is unimportant, only life important!
http://zoomquiet.org
blog: http://blog.zoomquiet.org/pyblosxom/
wiki: http://wiki.woodpecker.org.cn/moin/ZoomQuiet
scrap: http://floss.zoomquiet.org
douban: http://www.douban.com/people/zoomq/
____________________________________
Pls. use OpenOffice.org to replace M$ Office.
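The tag-stripping regex line was mangled by the list archiver in both copies of the script (only the remnant r']*>' survives). Its intent, replacing every HTML tag with a newline, can be reproduced in a few lines; the sample content string here is made up for illustration:

```python
import re

# Pattern reconstructed from the archive remnant r']*>':
# match any single HTML tag, case-insensitively.
tag_re = re.compile(r'<[^>]*>', re.IGNORECASE)

# Made-up stand-in for the extracted article body.
content = '<p>first paragraph</p><br/><p>second paragraph</p>'

# Replace every tag with a newline, leaving plain text.
text = tag_re.sub('\n', content)
```

Note that `[^>]*` deliberately stops at the first `>`, so two adjacent tags never get merged into one match the way a greedy `.*` would.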
http://zh.openoffice.org
Pls. use 7-zip to replace WinRAR/WinZip.
http://7-zip.org/zh-cn/
You can get the truly free software. '''