Python Forum - Discussion Board

Title: [python-chinese] How do I fetch pages served over HTTPS?

Wednesday, July 19, 2006, 02:38

Neil chenrong2003 at gmail.com
Wed Jul 19 02:38:50 HKT 2006

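Today I wrote a web page scraper in Python, and I have found that pages served over HTTPS cannot be fetched with urllib.urlopen()
(it raises an error). I searched for information but could not find a solution. Any advice would be appreciated!
Here is my code:

# Download a single resource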
def download_resource(src, src_type):
	url = get_resource_url(src)
	usock = urllib.urlopen(url)
	data = usock.read()
	usock.close()
	# save the file
	fname = sys.path[0] + '\\' + get_new_link(src, src_type).replace('/', '\\')
	fsock = file(fname, 'w')
	fsock.write(data)
	fsock.close()
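
For readers following along on Python 3, the download-and-save flow above might be sketched as follows. This is an illustration under assumptions, not the poster's code: get_resource_url and get_new_link are not shown in the post, so the sketch takes the URL and a relative link directly, and the names save_path and download_to are made up here. It writes the data in binary mode so that non-text resources are not altered on Windows.

```python
import os
import urllib.request


def save_path(base_dir, link):
    """Map a '/'-separated link onto a path below base_dir (hypothetical helper)."""
    parts = [p for p in link.split("/") if p]
    return os.path.join(base_dir, *parts)


def download_to(url, base_dir, link):
    """Fetch url and write the raw bytes to the file derived from link."""
    target = save_path(base_dir, link)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    # urlopen in Python 3 accepts http://, https:// and file:// URLs alike.
    with urllib.request.urlopen(url) as response:
        data = response.read()
    # 'wb' keeps the bytes unchanged; text mode would rewrite line endings on Windows.
    with open(target, "wb") as out:
        out.write(data)
    return target
```

Building the path with os.path.join avoids hard-coding the backslash separator, so the same sketch should behave the same way outside Windows.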

[Imported from the Mailman archive: http://www.zeuux.org/pipermail/zeuux-python]

Wednesday, July 19, 2006, 09:12

netkiller openunix at 163.com
Wed Jul 19 09:12:18 HKT 2006

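After the HTTPS handshake, the client still has to request the certificate and carry out further steps.

I have a Java example, but I haven't written a Python one. I've had no time these past few days and work on it has stalled, haha.

http://netkiller.hikz.com/article/security/book.html#id497614

I don't know when I'll get it finished.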


Wednesday, July 19, 2006, 10:13

IQDoctor huanghao.c at gmail.com
Wed Jul 19 10:13:36 HKT 2006

curl
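
One way to read this suggestion is to hand the download to the curl command-line tool. The sketch below assumes that reading (the reply may equally have meant a libcurl binding), assumes curl is installed and on the PATH, and uses helper names invented for illustration.

```python
import shutil
import subprocess


def curl_command(url, out_file):
    """Build the argument list for a quiet curl download that fails on HTTP errors."""
    return [
        "curl",
        "--fail",        # exit non-zero on HTTP errors such as 404
        "--silent",      # no progress meter
        "--show-error",  # but still print error messages
        "--location",    # follow redirects
        "--output", out_file,
        url,
    ]


def curl_download(url, out_file):
    """Run curl on the URL; raises CalledProcessError if curl reports failure."""
    if shutil.which("curl") is None:
        raise RuntimeError("curl executable not found on PATH")
    subprocess.run(curl_command(url, out_file), check=True)
```

Passing the arguments as a list rather than a shell string means the URL is never interpreted by a shell.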


-- 
Best regards,
IQDoctor


Wednesday, July 19, 2006, 10:18

Xiao Lei Wu xiaoleiw at cn.ibm.com
Wed Jul 19 10:18:16 HKT 2006

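With HTTPS even the URL is sent in encrypted form, so of course you can't just use the ordinary urlopen directly.
Python should have a corresponding library, and of course an SSL library is required.
Just a wild guess, haha.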
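
A small check can make the guess above concrete. The client handles the URL itself as ordinary text; what an https:// URL additionally needs is SSL/TLS support in the interpreter. The sketch below is written for present-day Python 3 (an assumption, since the thread predates it), and the function names are illustrative only.

```python
from urllib.parse import urlsplit


def ssl_available():
    """Return True if this interpreter can import the ssl module."""
    try:
        import ssl  # noqa: F401  (imported only to test availability)
    except ImportError:
        return False
    return True


def needs_ssl(url):
    """Return True when the URL scheme is https."""
    return urlsplit(url).scheme.lower() == "https"


def can_fetch(url):
    """http needs no ssl support; https needs the ssl module to be present."""
    return ssl_available() if needs_ssl(url) else True
```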

Best Regards,

Zachary Wu
Software Engineer, Enterprise Content Management FVT, IBM China Software
Development Lab
Tel: +86 10 82782244-3235. Fax: 82782244-2886 Tie Line: 915-2244-3235
Internet: xiaoleiw at cn.ibm.com
Notes ID: Xiao Lei Wu/China/Contr/IBM at IBMCN
Address: 8/F, Block A, Power Creative Building, No.1, East Road, Shang Di,
Beijing 100085, P.R. China


Wednesday, July 19, 2006, 13:14

张骏 zhangj at foreseen-info.com
Wed Jul 19 13:14:19 HKT 2006


See the urllib2 documentation:

>>> o = urllib2.build_opener( urllib2.HTTPSHandler())
>>> a = o.open( 'https://gmail.google.com' )
>>> a
<addinfourl at ... whose fp = <socket._fileobject object at ...>>
>>> a.read()
......
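
urllib2 exists only in Python 2. For readers on Python 3, a rough equivalent of the session above can be sketched with urllib.request, where the same classes now live. The function names below are illustrative, and the explicit ssl context is an addition of this sketch rather than part of the original reply.

```python
import ssl
import urllib.request


def build_https_opener():
    """Return an opener that handles https:// URLs with certificate checks."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    handler = urllib.request.HTTPSHandler(context=context)
    return urllib.request.build_opener(handler)


def fetch(url, timeout=10):
    """Open url with the HTTPS-capable opener and return the body as bytes."""
    opener = build_https_opener()
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

In Python 3, urllib.request.urlopen already handles https:// URLs when ssl support is present, so an explicit opener like this is mainly useful when the context needs customizing.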

-- 
张骏 <zhangj at foreseen-info.com>

Agility comes from Python
Simplicity starts with us
丰元信信息技术有限公司

Python technical discussion group: 22507237




Wednesday, July 19, 2006, 13:16

bird devdoer devdoer at gmail.com
Wed Jul 19 13:16:13 HKT 2006

In httplib I seem to have seen HTTPS, and SSL as well.
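
httplib was renamed http.client in Python 3, and its HTTPSConnection class is the lower-level route hinted at here. A minimal sketch, assuming Python 3; the helper names are made up for illustration.

```python
import http.client
import ssl
from urllib.parse import urlsplit


def split_https_url(url):
    """Return (host, port, request_target) for an https:// URL."""
    parts = urlsplit(url)
    if parts.scheme != "https" or not parts.hostname:
        raise ValueError("expected an absolute https:// URL: %r" % (url,))
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # 443 is the default HTTPS port when the URL does not name one.
    return parts.hostname, parts.port or 443, target


def https_get(url, timeout=10):
    """Send one GET request over TLS and return (status, body bytes)."""
    host, port, target = split_https_url(url)
    conn = http.client.HTTPSConnection(
        host, port, timeout=timeout, context=ssl.create_default_context()
    )
    try:
        conn.request("GET", target)
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()
```

Unlike the urllib-level calls, this route does not follow redirects, so the caller has to inspect the returned status itself.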


-- 
devdoer
devdoer at gmail.com
http://project.mytianwang.cn/cgi-bin/blog

Wednesday, July 19, 2006, 14:25

Neil chenrong2003 at gmail.com
Wed Jul 19 14:25:28 HKT 2006

Thanks, all.
张骏's code solved the problem.
