Subject: [python-chinese] How can I get "distinct" behavior for a list?

Original poster, Wednesday, August 17, 2005, 01:23

马踏飞燕 honeyday.mj at gmail.com
Wed Aug 17 01:23:09 HKT 2005

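I have a very large list, more than 1,000,000 entries, and a large share of them are duplicates.
I want something like SQL's select distinct, that is, to remove the duplicate entries.
The only approach I can think of is to build a temporary list, iterate over the original list, and compare each item against the temporary list, skipping it if it is already there. But that adds up to an enormous number of loop iterations. Is there a better way?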

    fin = open('d:\\ppp.txt','r')
    fout = open('d:\\ppp2.txt','w')
    lines = fin.readlines()
    distinct_line = []
    for line in lines:
        if line not in distinct_line:
            distinct_line.append(line)
            fout.write(line)
    
    fout.close()
    fin.close()

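ppp.txt has 1,000,000 lines. When I ran this, the program looked like it would never finish... After 5 minutes it still had not completed, and in the end I had to kill it.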

[Imported from the Mailman archive: http://www.zeuux.org/pipermail/zeuux-python]

徐继哲

Reply 0, Wednesday, August 17, 2005, 02:47

panhudie nirvana117 at gmail.com
Wed Aug 17 02:47:55 HKT 2005

def do():
    fin=open('chap1.txt')
    lines = [line for line in fin]
    slines=set(line)
    print 'lines :', len(lines)
    print 'sline :', len(slines)

With a 4 MB 'chap1.txt':
%time do()
lines : 102996
slines : 4303
CPU times: user 0.20 s, sys: 0.00 s, total: 0.20 s
Wall time: 0.20

With a 16 MB 'chap1.txt':
%time do()
lines : 411984
slines : 4303
CPU times: user 0.80 s, sys: 0.00 s, total: 0.80 s
Wall time: 0.80
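
One thing to keep in mind: a set does not remember the order of the lines, while the loop in the original question writes each line the first time it appears. A minimal sketch that keeps that order and still uses a set for the membership test might look like the following. It is written for Python 3 rather than the Python 2 used in this thread, and the names `distinct_in_order` and `write_distinct_lines` are made up for illustration.

```python
def distinct_in_order(lines):
    """Yield each line the first time it appears, preserving input order."""
    seen = set()
    for line in lines:
        # Membership in a set is a hash lookup, not a scan of all earlier lines.
        if line not in seen:
            seen.add(line)
            yield line


def write_distinct_lines(src_path, dst_path):
    """Copy src_path to dst_path, skipping lines that already appeared."""
    with open(src_path, 'r') as fin, open(dst_path, 'w') as fout:
        fout.writelines(distinct_in_order(fin))
```

Calling `write_distinct_lines('d:\\ppp.txt', 'd:\\ppp2.txt')` would then stand in for the original loop, assuming the set of distinct lines fits in memory.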
 

Reply 0, Wednesday, August 17, 2005, 09:53

马踏飞燕 honeyday.mj at gmail.com
Wed Aug 17 09:53:17 HKT 2005


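Brilliant! Truly brilliant!!
It got through my data in 1 second! Even with writing the file it took only a little over 1 second. So fast!!
There is probably a typo in your code, though:
>     slines=set(line)
should be
slines=set(lines)

I had never used set before. It looks like for huge amounts of data, this kind of unindexed collection is the fastest!
I still don't understand how set(lines) works, though... I couldn't find anything on Google either...
Could you explain it?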

Reply 0, Wednesday, August 17, 2005, 10:30

panhudie nirvana117 at gmail.com
Wed Aug 17 10:30:48 HKT 2005

It seems to be hash-based, much like dict. I looked it up in PEP 218, *Adding a Built-In Set Object
Type* <http://www.python.org/peps/pep-0218.html>.
frozenset() might be a little faster still.
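
To make the "hash-based, like dict" idea concrete, here is a toy sketch in Python 3. It is not how CPython implements set (the real one is written in C, uses open addressing, and resizes its table); the class name `TinyHashSet` and the fixed bucket count are assumptions made purely for illustration. The point is only that the hash of an item picks a small bucket, so a lookup never scans the whole collection.

```python
class TinyHashSet:
    """Toy hash set using separate chaining; for illustration only."""

    def __init__(self, n_buckets=8):
        # A fixed number of buckets; a real implementation grows its table.
        self._buckets = [[] for _ in range(n_buckets)]
        self._size = 0

    def _bucket_for(self, item):
        # hash() maps the item to an integer; the remainder picks a bucket.
        return self._buckets[hash(item) % len(self._buckets)]

    def add(self, item):
        bucket = self._bucket_for(item)
        if item not in bucket:  # only this one bucket is scanned
            bucket.append(item)
            self._size += 1

    def __contains__(self, item):
        return item in self._bucket_for(item)

    def __len__(self):
        return self._size


tiny = TinyHashSet()
for line in ['a\n', 'b\n', 'a\n', 'c\n', 'b\n']:
    tiny.add(line)  # adding a duplicate leaves the set unchanged
```

After the loop, `len(tiny)` is 3, which is the same deduplication that `set(lines)` performs.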


Reply 0, Wednesday, August 17, 2005, 11:21

马踏飞燕 honeyday.mj at gmail.com
Wed Aug 17 11:21:17 HKT 2005

I looked up the documentation for the set and frozenset functions:
set( [iterable]) 

Return a set whose elements are taken from iterable. The elements must
be immutable. To represent sets of sets, the inner sets should be
frozenset objects. If iterable is not specified, returns a new empty
set, set([]). New in version 2.4.

frozenset( [iterable]) 

Return a frozenset object whose elements are taken from iterable.
Frozensets are sets that have no update methods but can be hashed and
used as members of other sets or as dictionary keys. The elements of a
frozenset must be immutable themselves. To represent sets of sets, the
inner sets should also be frozenset objects. If iterable is not
specified, returns a new empty set, frozenset([]). New in version 2.4.

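In theory, frozenset drops set's update methods, so it should be a tiny bit faster. And the defining property of a set is that it holds no duplicate items, which is exactly what I needed.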
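
The practical difference described in the quoted documentation is hashability rather than speed. A short Python 3 sketch of that difference follows; the variable names are illustrative only, and it makes no claim about which type is faster.

```python
plain = set([1, 2])
frozen = frozenset([1, 2])

# A frozenset compares equal to a set with the same elements.
same_elements = (plain == frozen)

# A frozenset is hashable, so it can be a set member or a dictionary key.
set_of_sets = {frozen, frozenset([2, 1]), frozenset([3])}
labels = {frozen: 'pair'}

# A plain set is unhashable, so it cannot be nested the same way.
try:
    nested = {plain}
    plain_is_hashable = True
except TypeError:
    plain_is_hashable = False
```

Here `set_of_sets` ends up with two members, because `frozenset([1, 2])` and `frozenset([2, 1])` are equal and hash alike.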
