Python forum post - Zeuux (哲思)

Subject: [python-chinese] Help needed: separating web page content

徐继哲

Original post, Tuesday, September 7, 2004, 02:15

Wang Chao cnw at vip.sina.com
Tue Sep 7 02:15:39 HKT 2004

>>> import urllib
>>> def getpage(url):
...     f = urllib.urlopen(url)
...     s = f.read()
...     print s
...
>>>

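I have now fetched an htm file from a remote server by URL. Without using any built-in lib, how can I separate the HTML code inside all the tags () from the text outside the tags? And once the text is separated out, how do I count the words in it and see which word occurs most often? The way I wrote it, I seem to get individual letters one at a time, so I cannot work on whole words at all, let alone do the separation, and I don't know how to write anything more complicated.

This is only my fourth day learning Python, so experts, please don't laugh at me, HOHO

I hope to get some pointers from you all. Thanks.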

[Imported from the Mailman archive: http://www.zeuux.org/pipermail/zeuux-python]

徐继哲

Reply 0, Tuesday, September 7, 2004, 02:28

Qiangning Hong hongqn at gmail.com
Tue Sep 7 02:28:42 HKT 2004

Why not use the built-in libs? Using httplib or sgmllib would be so much more convenient.





徐继哲

Reply 0, Tuesday, September 7, 2004, 02:49

Qiangning Hong hongqn at gmail.com
Tue Sep 7 02:49:46 HKT 2004

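Never mind for now how you separate out the text; the counting part is simple.
Assume s is the text that has already been separated out:

To count how many words there are:
print len(s.split())

To find the most frequent word:
d = {}
for w in s.split():
    d[w] = d.get(w, 0) + 1   # tally each whitespace-separated word
words = d.keys()
counts = d.values()          # same order as d.keys() while d is unmodified
max_counts = max(counts)
index = counts.index(max_counts)
print words[index], max_counts

There should be a better way; let's discuss.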
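
One possible "better way", as a minimal sketch. It assumes a modern Python 3 interpreter rather than the Python 2 used in this thread. The tally is the same, but `max` is asked to compare words by their counts, so the two parallel lists are not needed. The function name `most_frequent_word` and the sample sentence are made up for illustration.

```python
def most_frequent_word(s):
    """Return (word, count) for the most frequent whitespace-separated word in s."""
    d = {}
    for w in s.split():
        d[w] = d.get(w, 0) + 1  # same tally as in the post above
    if not d:
        return None  # the text contained no words at all
    word = max(d, key=d.get)  # the key whose count is largest
    return word, d[word]


print(most_frequent_word("lin is here and lin is not lin"))
```

If several words tie for the highest count, `max` returns whichever of them it meets first, so the choice among tied words should not be relied on.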


徐继哲

Reply 0, Tuesday, September 7, 2004, 08:22

Zoom.Quiet zoomq at infopro.cn
Tue Sep 7 08:22:02 HKT 2004

Hello Wang:

  Heh heh heh! Regular expressions!!!!
http://wiki.woodpecker.org.cn/moin.cgi/Zoom_2eQuiet?action=show#head-a6eec1a9841eae9c097ec88b317a7397adb3e1f7

There are two small tools there; both work by parsing the HTML first and then processing it...
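
To make the regular-expression suggestion concrete, here is a minimal sketch. It is not the tools from the linked wiki page, and it assumes Python 3 and the simplification that everything between `<` and `>` is a tag. The names `TAG_RE` and `split_tags_and_text` are hypothetical.

```python
import re

# Assumption: a tag is "<", then any run of characters other than ">", then ">".
TAG_RE = re.compile(r"<[^>]*>")


def split_tags_and_text(html):
    """Return (list of tags, text with the tags removed)."""
    tags = TAG_RE.findall(html)
    # Replace each tag with a space so words on either side do not run together.
    text = TAG_RE.sub(" ", html)
    return tags, text


tags, text = split_tags_and_text("<html><body><p>Hello <b>world</b></p></body></html>")
print(tags)
print(text.split())
```

A pattern this simple will misread comments, scripts, and attribute values that contain `>`, so it only suits input that really follows the stated assumption.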



-- 
Free as in Freedom

 Zoom.Quiet                           

#=========================================#
]Time is unimportant, only life important![
#=========================================#

sender is the Bat!2.12.00


徐继哲

Reply 0, Tuesday, September 7, 2004, 08:22

Zoom.Quiet zoomq at infopro.cn
Tue Sep 7 08:22:33 HKT 2004

Hello Qiangning:

  This... won't work once the text contains Chinese!
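
A small sketch of the objection, assuming Python 3 strings: `split()` cuts only at whitespace, so a run of Chinese characters with no spaces in it comes back as a single "word". The sample sentence is made up. Counting single characters is shown only as a crude stand-in; real Chinese word segmentation needs a dictionary or a dedicated tool, which is outside what this thread covers.

```python
s = "我喜欢Python 我喜欢编程"

# Whitespace splitting yields only two chunks, not individual Chinese words.
chunks = s.split()
print(chunks)

# Crude fallback: count single characters, skipping whitespace.
char_counts = {}
for ch in s:
    if not ch.isspace():
        char_counts[ch] = char_counts.get(ch, 0) + 1
print(char_counts["我"], char_counts["喜"])
```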



徐继哲

Reply 0, Tuesday, September 7, 2004, 09:37

Wang Chao cnw at vip.sina.com
Tue Sep 7 09:37:52 HKT 2004

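Hong, thank you for your answer. :)


This is a practice assignment for an artificial intelligence course, and it requires that we not use the built-in libs. The requirements read as follows.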

Use urllib to get an HTML document, then print out
a) only the HTML tags
b) everything but the HTML tags
Assume everything in < > is a tag.
You may use xxx.py -t http://www.123.com to get the tags, and xxx.py -nt http://www.123.com to get the text.
Do not use any of the included parser libraries; this is easy enough to do by hand. If you want, you can use the regular expression module, but this is not required.

Then, store each word and its number of appearances in a dictionary. Sort the dictionary so that the highest-frequency word is first and the lowest-frequency word is last.
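
The last requirement can be sketched as follows, assuming Python 3. A dictionary is not a sorted container, so this sketch produces a sorted list of (word, count) pairs instead. The function names are hypothetical, and breaking ties alphabetically is my own assumption, since the assignment does not say how ties should be ordered.

```python
def word_frequencies(text):
    """Map each whitespace-separated word in text to its number of appearances."""
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def sorted_by_frequency(counts):
    """Return (word, count) pairs, highest count first; ties in alphabetical order."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


pairs = sorted_by_frequency(word_frequencies("b a b c a b"))
print(pairs)
```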



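The approach I can come up with is this: first open an empty txt file and set up a flag T, then walk through the fetched document, comparing it character by character. When the current character == "<", set T=1; when the current character == ">", set T=0. When T=0, write the current character to the txt file; when T=1, do nothing and continue the loop.

It feels like a really dumb algorithm, and comparing character by character seems a bit slow. But at my current level, this is all I can come up with. Would this work?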
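
The flag-based scan described above can be sketched like this, assuming Python 3 and collecting the output in lists rather than writing to a txt file. One adjustment: depending on how the checks are ordered, the plan as stated would also write the `>` itself, because T is already 0 by the time the write step runs, so the sketch handles `>` explicitly. The names `split_by_scanning` and `in_tag` are hypothetical.

```python
def split_by_scanning(html):
    """Walk the document once, routing each character to the tags or to the text."""
    tag_chars = []
    text_chars = []
    in_tag = False  # plays the role of the flag T in the post
    for ch in html:
        if ch == "<":
            in_tag = True
            tag_chars.append(ch)
        elif ch == ">":
            in_tag = False
            tag_chars.append(ch)    # ">" still belongs to the tag, not the text
            text_chars.append(" ")  # keep words on either side of a tag apart
        elif in_tag:
            tag_chars.append(ch)
        else:
            text_chars.append(ch)
    return "".join(tag_chars), "".join(text_chars)


tags, text = split_by_scanning("<p>Hello <b>world</b></p>")
print(tags)
print(text.split())
```

The scan touches each character exactly once, so its cost grows linearly with the size of the page, which is usually fast enough for a single document.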

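I tried the program below.
I put a test document at http://218.57.8.101:6000/temp1/test.txt, and the output was lin 5. Heh, this morning I still couldn't understand that code; after a day of reading things I looked up online, I can follow it now.

Thanks again :)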



徐继哲

Reply 0, Tuesday, September 7, 2004, 09:40

Wang Chao cnw at vip.sina.com
Tue Sep 7 09:40:31 HKT 2004

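Hehe, now that you mention it, I realize that the "regular expression" in the assignment description means regex (正则表达式); before, I had taken it to mean some kind of "ordinary expression" module.
The sentence reads:
If you want, you can use the regular expression module, but this is not required.

But I couldn't find the two small tools you mentioned~~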

thanks

