Python Forum post - Zeuux

Python Forum - Discussion Board

Subject: [python-chinese] How do I fetch web pages served over the https protocol?

Xu Jizhe

Original poster, Wednesday, July 19, 2006, 02:38

Neil chenrong2003 at gmail.com
Wed Jul 19 02:38:50 HKT 2006

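Today I wrote a web page scraper in Python, and I found that pages served over https cannot be fetched with urllib.urlopen() (it raises an error). I searched for a solution but found nothing. Any pointers would be appreciated!
Here is my code:

# Download a single resource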
def download_resource(src, src_type):
	url = get_resource_url(src)
	usock = urllib.urlopen(url)
	data = usock.read()
	usock.close()
	# save the file
	fname = sys.path[0] + '\\' + get_new_link(src, src_type).replace('/', '\\')
	fsock = file(fname, 'w')
	fsock.write(data)
	fsock.close()

[Imported from the Mailman archive: http://www.zeuux.org/pipermail/zeuux-python]

Xu Jizhe

Floor 0, Wednesday, July 19, 2006, 09:12

netkiller openunix at 163.com
Wed Jul 19 09:12:18 HKT 2006

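After the https handshake you have to request the certificate, among other steps.

I have a Java example, but I haven't written a Python one. I haven't had time these past few days, so work on it has stalled. Haha.

http://netkiller.hikz.com/article/security/book.html#id497614

I don't know when I'll get it finished.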


Xu Jizhe

Floor 0, Wednesday, July 19, 2006, 10:13

IQDoctor huanghao.c at gmail.com
Wed Jul 19 10:13:36 HKT 2006

curl



-- 
Best regards,
IQDoctor


Xu Jizhe

Floor 0, Wednesday, July 19, 2006, 10:18

Xiao Lei Wu xiaoleiw at cn.ibm.com
Wed Jul 19 10:18:16 HKT 2006

With https even the URL is sent in encrypted form, so of course you can't just use the plain urlopen.
Python should have a library for this, and of course an ssl library is required.
Just guessing, haha.

Best Regards,

Zachary Wu
Software Engineer, Enterprise Content Management FVT, IBM China Software
Development Lab
Tel: +86 10 82782244-3235. Fax: 82782244-2886 Tie Line: 915-2244-3235
Internet: xiaoleiw at cn.ibm.com
Notes ID: Xiao Lei Wu/China/Contr/IBM at IBMCN
Address: 8/F, Block A, Power Creative Building, No.1, East Road, Shang Di,
Beijing 100085, P.R. China


Xu Jizhe

Floor 0, Wednesday, July 19, 2006, 13:14

Zhang Jun (张骏) zhangj at foreseen-info.com
Wed Jul 19 13:14:19 HKT 2006


See the urllib2 documentation:

>>> o = urllib2.build_opener( urllib2.HTTPSHandler())
>>> a = o.open( 'https://gmail.google.com' )
>>> a
>
>>> a.read()
......
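
For readers on Python 3, where urllib2 was merged into urllib.request, the same approach might look like the sketch below. It is only an illustration: the function names build_https_opener and download_resource and the dest_path parameter are assumptions for this example, not code from the thread.

```python
import urllib.request


def build_https_opener():
    """Return an opener with an explicit HTTPSHandler, mirroring the session above."""
    return urllib.request.build_opener(urllib.request.HTTPSHandler())


def download_resource(url, dest_path, opener=None):
    """Fetch url and write the raw bytes to dest_path; return the byte count."""
    opener = opener or build_https_opener()
    with opener.open(url) as response:
        data = response.read()
    # Binary mode keeps the payload byte-for-byte identical on every platform.
    with open(dest_path, "wb") as out:
        out.write(data)
    return len(data)
```

A call such as download_resource("https://example.org/", "page.html") would then save the page locally (the URL and file name here are placeholders). Writing in binary mode is a deliberate choice, since text mode on Windows can alter line endings in downloaded data.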

-- 
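Zhang Jun (张骏) <zhangj at foreseen-info.com>

Agility comes from Python
Simplicity stems from us
Fengyuanxin Information Technology Co., Ltd.

Python technical discussion group: 22507237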


Xu Jizhe

Floor 0, Wednesday, July 19, 2006, 13:16

bird devdoer devdoer at gmail.com
Wed Jul 19 13:16:13 HKT 2006

httplib: I seem to recall seeing HTTPS and SSL support in there.
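
As a rough sketch of that lower-level route: in Python 3, httplib is named http.client, and an HTTPS connection can be paired with an ssl context that verifies the server certificate during the handshake. The helper names make_https_connection and fetch, and the default timeout, are assumptions made for this illustration.

```python
import http.client
import ssl


def make_https_connection(host, port=443, timeout=10.0):
    """Create (but do not yet open) an HTTPS connection that verifies certificates."""
    # create_default_context() turns on certificate and hostname verification.
    context = ssl.create_default_context()
    return http.client.HTTPSConnection(host, port, timeout=timeout, context=context)


def fetch(host, path="/"):
    """Issue a GET request and return (status, body bytes)."""
    conn = make_https_connection(host)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()
```

No network traffic happens until request() is called, so the connection object can be configured first. For simple page downloads the opener-based approach is usually less code; this form is mainly useful when finer control over the connection is needed.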



-- 
devdoer
devdoer at gmail.com
http://project.mytianwang.cn/cgi-bin/blog

Xu Jizhe

Floor 0, Wednesday, July 19, 2006, 14:25

Neil chenrong2003 at gmail.com
Wed Jul 19 14:25:28 HKT 2006

Thanks, all.
Zhang Jun's code solved the problem.
