Python论坛的帖子： - 哲思

Python论坛 - 讨论区

返回群组主页

标题：[python-chinese] 请教一个计数算法

分享

徐继哲

楼主 2007年01月04日星期四 16:25

daifei bonycamel在gmail.com
星期四一月 4 16:25:57 HKT 2007

¸÷Î»ºÃ£¬ÎÒÒªÊµÏÖÒÔÏÂÐèÇó£¬¸Ð¾õpython¿ÉÄÜÓÐÏÖ³ÉµÄ·½·¨£¬Çë¸÷Î»Ö¸½Ì¡£

 

ÔÚÒ»¸ölistÖÐ£¬ÓÐ´óÁ¿ÕûÊý£¨10000¸ö»ò¸ü¶à£©£¬ÀýÈç£º
[1,1,3,3,9999,9,3,512,1,8,9,¡¡]

ÎÒÐèÒªÐ´Ò»¸öËã·¨£¬Í³¼ÆÕâ¸öÊý×ÖlistÖÐ³öÏÖ´ÎÊý×î¶àµÄ10¸öÕûÊý£¬·Ö±ð´æµ½±äÁ¿
i1-i10¡£

 

ÓÃÒ»°ãµÄÑ»·Ó¦¸Ã¿ÉÒÔÊµÏÖ£¬µ«ÊÇÐ§ÂÊ±È½Ï²î£¬Ò²±È½ÏÂé·³£¬ÇëÎÊpythonÓÐÊ²Ã´ÏÖ³ÉµÄ
ÊµÏÖ·½Ê½Âð£¿Ð»Ð»¡£

-------------- 下一部分 --------------
Ò»¸öHTML¸½¼þ±»ÒÆ³ý...
URL: http://python.cn/pipermail/python-chinese/attachments/20070104/b851a05c/attachment.html

[导入自Mailman归档：http://www.zeuux.org/pipermail/zeuux-python]

李迎辉

0楼 2007年01月04日星期四 16:40

limodou limodou在gmail.com
星期四一月 4 16:40:01 HKT 2007

On 1/4/07, daifei <bonycamel在gmail.com> wrote:
>
>
>
>
>
> 各位好，我要实现以下需求，感觉python可能有现成的方法，请各位指教。
>
>
>
> 在一个list中，有大量整数（10000个或更多），例如：[1,1,3,3,9999,9,3,512,1,8,9,……]
>
> 我需要写一个算法，统计这个数字list中出现次数最多的10个整数，分别存到变量i1-i10。
>
>
>
> 用一般的循环应该可以实现，但是效率比较差，也比较麻烦，请问python有什么现成的实现方式吗？谢谢。

循环效率低吗？


-- 
I like python!
UliPad <>: http://wiki.woodpecker.org.cn/moin/UliPad
My Blog: http://www.donews.net/limodou

[导入自Mailman归档：http://www.zeuux.org/pipermail/zeuux-python]

徐继哲

0楼 2007年01月04日星期四 16:47

Leo Jay python.leojay在gmail.com
星期四一月 4 16:47:08 HKT 2007

On 1/4/07, daifei <bonycamel在gmail.com> wrote:
>
> 各位好，我要实现以下需求，感觉python可能有现成的方法，请各位指教。
>
> 在一个list中，有大量整数（10000个或更多），例如：[1,1,3,3,9999,9,3,512,1,8,9,……]
>
> 我需要写一个算法，统计这个数字list中出现次数最多的10个整数，分别存到变量i1-i10。
>
> 用一般的循环应该可以实现，但是效率比较差，也比较麻烦，请问python有什么现成的实现方式吗？谢谢。


10000个数不叫大量，这种应用，1百万个数以内应该都是瞬间能出结果的。
你就大胆的用循环吧。


-- 
Best Regards,
Leo Jay

[导入自Mailman归档：http://www.zeuux.org/pipermail/zeuux-python]

徐继哲

0楼 2007年01月04日星期四 17:00

尹祥龙 yinxianglong在gmail.com
星期四一月 4 17:00:46 HKT 2007

遍历一遍list的值，就可以得到结果了。

On 1/4/07, Leo Jay <python.leojay at gmail.com> wrote:
>
> On 1/4/07, daifei <bonycamel at gmail.com> wrote:
> >
> > 各位好，我要实现以下需求，感觉python可能有现成的方法，请各位指教。
> >
> > 在一个list中，有大量整数（10000个或更多），例如：[1,1,3,3,9999,9,3,512,1,8,9,……]
> >
> > 我需要写一个算法，统计这个数字list中出现次数最多的10个整数，分别存到变量i1-i10。
> >
> > 用一般的循环应该可以实现，但是效率比较差，也比较麻烦，请问python有什么现成的实现方式吗？谢谢。
>
>
> 10000个数不叫大量，这种应用，1百万个数以内应该都是瞬间能出结果的。
> 你就大胆的用循环吧。
>
>
> --
> Best Regards,
> Leo Jay
> _______________________________________________
> python-chinese
> Post: send python-chinese at lists.python.cn
> Subscribe: send subscribe to python-chinese-request at lists.python.cn
> Unsubscribe: send unsubscribe to  python-chinese-request at lists.python.cn
> Detail Info: http://python.cn/mailman/listinfo/python-chinese
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://python.cn/pipermail/python-chinese/attachments/20070104/7f81bce2/attachment.htm

[导入自Mailman归档：http://www.zeuux.org/pipermail/zeuux-python]

tocer

0楼 2007年01月04日星期四 20:20

tocer tocer.deng在gmail.com
星期四一月 4 20:20:30 HKT 2007

你先把你的写出来，再让大家帮你优化一下，如何？

daifei wrote::
> 各位好，我要实现以下需求，感觉python可能有现成的方法，请各位指教。
> 
>  
> 
> 在一个list中，有大量整数（10000个或更多），例如：
> [1,1,3,3,9999,9,3,512,1,8,9,……]
> 
> 我需要写一个算法，统计这个数字list中出现次数最多的10个整数，分别存到变量
> i1-i10。
> 
>  
> 
> 用一般的循环应该可以实现，但是效率比较差，也比较麻烦，请问python有什么现成的
> 实现方式吗？谢谢。
> 
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> python-chinese
> Post: send python-chinese在lists.python.cn
> Subscribe: send subscribe to python-chinese-request在lists.python.cn
> Unsubscribe: send unsubscribe to  python-chinese-request在lists.python.cn
> Detail Info: http://python.cn/mailman/listinfo/python-chinese

-- 
Vim 中文 Google 论坛 http://groups.google.com/group/Vim-cn

[导入自Mailman归档：http://www.zeuux.org/pipermail/zeuux-python]

徐继哲

0楼 2007年01月04日星期四 22:26

batfree batfreelist在gmail.com
星期四一月 4 22:26:40 HKT 2007

至少要遍历一遍的，遍历的同时计数并同时对计数进行排序。
或者是遍历一遍计数，然后对计数结果进行排序。这两者那个效率高不一定。
如果是稀疏数列的话后者效率会高。
如果是密集数列，前者效率会高。不过如果是在千万级别之内，速度应该看不出差别。

daifei 写道:
>
>     各位好，我要实现以下需求，感觉python可能有现成的方法，请各位指教。
>
>     在一个list中，有大量整数（10000个或更多），例如：
>     [1,1,3,3,9999,9,3,512,1,8,9,……]
>
>     我需要写一个算法，统计这个数字list中出现次数最多的10个整数，分别存
>     到变量i1-i10。
>
>     用一般的循环应该可以实现，但是效率比较差，也比较麻烦，请问python有
>     什么现成的实现方式吗？谢谢。
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> python-chinese
> Post: send python-chinese在lists.python.cn
> Subscribe: send subscribe to python-chinese-request在lists.python.cn
> Unsubscribe: send unsubscribe to  python-chinese-request在lists.python.cn
> Detail Info: http://python.cn/mailman/listinfo/python-chinese

[导入自Mailman归档：http://www.zeuux.org/pipermail/zeuux-python]

安雁同

0楼 2007年01月05日星期五 11:57

twinsant twinsant在gmail.com
星期五一月 5 11:57:39 HKT 2007

> 在一个list中，有大量整数（10000个或更多），例如：[1,1,3,3,9999,9,3,512,1,8,9,……]
>
> 我需要写一个算法，统计这个数字list中出现次数最多的10个整数，分别存到变量i1-i10。
>

顺手写了一个，
贴一下代码和性能测试结果：
def most_common(hist, num=1):
    '''
    Sort a dictionary. Taken from book ?... I forget...
    '''
    pairs = []
    for pair in hist.items():
        pairs.append((pair[1],pair[0]))
    pairs.sort()
    pairs.reverse()
    return pairs[:num]

def top10(l):
    # Count top 10
    d = {}
    for i in l:
        if d.has_key(i):
            d[i] += 1
        else:
            d[i] = 1
    #print d
    print most_common(d, 10)

def prof(num):
    # Init random list
    l = [random.randint(0, x) for x in xrange(num)]
    #print l

    top10(l)


def main():
    LEN = 10000
    prof(LEN)
    # 10,000
    # 1    0.051    0.051    0.143    0.143 q070105.py:22(top10)
    # 100,000
    # 1    0.503    0.503    1.468    1.468 q070105.py:22(top10)
    # 1,000,000
    # 1    5.286    5.286   20.165   20.165 q070105.py:22(top10)
    # 10,000,000
    # ... long long long long long ... time after, i kill it.

if __name__ == '__main__':
    main()

[导入自Mailman归档：http://www.zeuux.org/pipermail/zeuux-python]

徐继哲

0楼 2007年01月05日星期五 13:34

Leo Jay python.leojay在gmail.com
星期五一月 5 13:34:22 HKT 2007

On 1/5/07, twinsant <twinsant在gmail.com> wrote:
> 顺手写了一个，
> 贴一下代码和性能测试结果：
> def most_common(hist, num=1):
>     '''
>     Sort a dictionary. Taken from book ?... I forget...
>     '''
>     pairs = []
>     for pair in hist.items():
>         pairs.append((pair[1],pair[0]))
>     pairs.sort()
>     pairs.reverse()
>     return pairs[:num]
>
> def top10(l):
>     # Count top 10
>     d = {}
>     for i in l:
>         if d.has_key(i):
>             d[i] += 1
>         else:
>             d[i] = 1
>     #print d
>     print most_common(d, 10)
>
> def prof(num):
>     # Init random list
>     l = [random.randint(0, x) for x in xrange(num)]
>     #print l
>
>     top10(l)
>
>
> def main():
>     LEN = 10000
>     prof(LEN)
>     # 10,000
>     # 1    0.051    0.051    0.143    0.143 q070105.py:22(top10)
>     # 100,000
>     # 1    0.503    0.503    1.468    1.468 q070105.py:22(top10)
>     # 1,000,000
>     # 1    5.286    5.286   20.165   20.165 q070105.py:22(top10)
>     # 10,000,000
>     # ... long long long long long ... time after, i kill it.
>
> if __name__ == '__main__':
>     main()


你在循环中放了一个if判断，这会严重影响速度的。
你的程序在我这里的profile：
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    3.554    3.554    6.287    6.287 t.py:15(top10)

如果把top10改为：
def top10(l):
   # Count top 10
   d = {}
   for i in set(l):
      d[i] = 1
   for i in l:
      d[i] += 1
   #print d
   print most_common(d, 10)
就可以让速度大为提高：
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.441    0.441    0.567    0.567 t.py:15(top10)



-- 
Best Regards,
Leo Jay

[导入自Mailman归档：http://www.zeuux.org/pipermail/zeuux-python]

徐继哲

0楼 2007年01月05日星期五 14:52

Julius F. Huang fozzold在gmail.com
星期五一月 5 14:52:30 HKT 2007

不错, set是个好东西, 谢谢Leo Jay!

d = {}
for x in set(l):
    d[x] = l.count(x)

[导入自Mailman归档：http://www.zeuux.org/pipermail/zeuux-python]

徐继哲

0楼 2007年01月05日星期五 15:08

Leo Jay python.leojay在gmail.com
星期五一月 5 15:08:47 HKT 2007

On 1/5/07, Julius F. Huang <fozzold在gmail.com> wrote:
> 不错, set是个好东西, 谢谢Leo Jay!
>
> d = {}
> for x in set(l):
>     d[x] = l.count(x)


不客气，
btw, 你后面附的那个程序，复杂度都冲到O(n^2)去了。会很慢很慢的。
而且数字越多越是慢得厉害。

-- 
Best Regards,
Leo Jay

[导入自Mailman归档：http://www.zeuux.org/pipermail/zeuux-python]

徐继哲

0楼 2007年01月05日星期五 15:15

Julius F. Huang fozzold在gmail.com
星期五一月 5 15:15:39 HKT 2007

daifei不是想问python的现成方法吗? 这个看上去比较简单, 性能就没考虑了.
:P

[导入自Mailman归档：http://www.zeuux.org/pipermail/zeuux-python]

徐继哲

0楼 2007年01月07日星期日 01:10

Rocky Lee nuffin在gmail.com
星期日一月 7 01:10:46 HKT 2007

如果数据集合特别大，且重复度不高的话，这个 set 岂不是耗费了双倍的内存，一个 dict 和一个
set？想个办法把这两块内存合而为一比较好。我这边有个类似的应用，会占用 1G 内存，所以这个内存优化就不能不考虑了。

C ++ 里可以用 hash_map （相当于 dict），"值"的部分是个简单的 struct，但是 count 会初始化为 0，象这样：

struct v
{
    int count;
    v () : count(0) {}
};

访问 d[x].count 时，如果节点不存在，它会自动创建，我用的 hash_map
实现应该是可以不用查找第二次的。这样实际上和你的做法差不多（都是设好初始值）。正在想 python 怎么按照这种方式实现，可以省去 set
耗费的内存。

BTW:
for x in set(l):
    d[x] = 0

这样才对，否则计数都比实际的大了 1

On 1/5/07, Leo Jay <python.leojay在gmail.com> wrote:
> 你在循环中放了一个if判断，这会严重影响速度的。
> 你的程序在我这里的profile：
>    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>         1    3.554    3.554    6.287    6.287 t.py:15(top10)
>
> 如果把top10改为：
> def top10(l):
>    # Count top 10
>    d = {}
>    for i in set(l):
>       d[i] = 1
>    for i in l:
>       d[i] += 1
>    #print d
>    print most_common(d, 10)
> 就可以让速度大为提高：
>    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>         1    0.441    0.441    0.567    0.567 t.py:15(top10)
>
>
>
> --
> Best Regards,
> Leo Jay
> _______________________________________________
> python-chinese
> Post: send python-chinese在lists.python.cn
> Subscribe: send subscribe to python-chinese-request在lists.python.cn
> Unsubscribe: send unsubscribe to  python-chinese-request在lists.python.cn
> Detail Info: http://python.cn/mailman/listinfo/python-chinese

[导入自Mailman归档：http://www.zeuux.org/pipermail/zeuux-python]

0楼 2007年01月07日星期日 01:47

刘鑫 march.liu在gmail.com
星期日一月 7 01:47:57 HKT 2007

PythonÀïÕâ¸öÎÊÌâ²»´ó£¬ÒòÎªÓÐ¶ÔÏó³ØµÄ´æÔÚ£¬ÕâÖÖ·³ÄÕ½»¸øVMÈ¥ºÃÁË£º£©¡£

2007/1/7, Rocky Lee <nuffin在gmail.com>:
>
> Èç¹ûÊý¾Ý¼¯ºÏÌØ±ð´ó£¬ÇÒÖØ¸´¶È²»¸ßµÄ»°£¬Õâ¸ö set Æñ²»ÊÇºÄ·ÑÁËË«±¶µÄÄÚ´æ£¬Ò»¸ö dict ºÍÒ»¸ö
> set£¿



-- 
Blog°á¼ÒÁË

ÁõöÎ
March.Liu
-------------- 下一部分 --------------
Ò»¸öHTML¸½¼þ±»ÒÆ³ý...
URL: http://python.cn/pipermail/python-chinese/attachments/20070107/e7c71830/attachment.html

[导入自Mailman归档：http://www.zeuux.org/pipermail/zeuux-python]

徐继哲

0楼 2007年01月08日星期一 11:04

Julius F. Huang fozzold在gmail.com
星期一一月 8 11:04:30 HKT 2007

这样可不可以?

d = {}
for i in l:
    d[i] = d.get(i, 0) + 1

[导入自Mailman归档：http://www.zeuux.org/pipermail/zeuux-python]

何宇祺

0楼 2007年01月08日星期一 19:31

heyuqi yuqi.he在gmail.com
星期一一月 8 19:31:39 HKT 2007

On 1/8/07, Julius F. Huang <fozzold at gmail.com> wrote:
>
> 这样可不可以?
>
> d = {}
> for i in l:
>     d[i] = d.get(i, 0) + 1
> _______________________________________________
> python-chinese
> Post: send python-chinese at lists.python.cn
> Subscribe: send subscribe to python-chinese-request at lists.python.cn
> Unsubscribe: send unsubscribe to   python-chinese-request at lists.python.cn
> Detail Info: http://python.cn/mailman/listinfo/python-chinese


在排序的部分改为堆排序，这个排序在最坏时候的复杂度为
O(nlog2n)，要只想得到10次数最多的最大整数，则只做前面10次排序，即O(10log2n)=O(log2n)

想看看用 python 是如何实现的，而我对 python 不太熟悉，如今又忙着考试，只有 C++ 的实现，附上：

//------------------------------------------------------------------------------
// 堆排序
//------------------------------------------------------------------------------
class Element {
    public:
        // ...
        bool operator<( Element e ) { return this.key < e.key; }
        void operator=( Element e ) { key = e.key; }
    private:
        int key;
}

void build_heap( Element* tree, const int root, const int n ) {
    int half_n = n/2;
    int max = -1;
    int i = root;
    int left, right;        // 左右孩子

    while( i <= half_n ) {

            left = 2*i;
            right = left+1;

            // 记录下左右孩子中值最大的下标
            if( left < n && ( tree[left] < tree[right] ) ) {
                    max = right;
            } else {
                    max = left;
            }

            // 对父结点及其孩子中值最大的结点进行排序
            if( tree[i] < tree[max] ) {
                    swap( tree, i, max );
                    i = max;
            } else {
            // 子树是大顶堆，以 root 为父结点的树也满足大顶堆的要求，退出循环
                    break;
            }
    }
}

void HeapSort( Element* root, const int n ) {

        // 初始建堆
        for( int i = n/2; i >= i; --i ) {
                build_heap( root, i, n );
        }

        // 这是对全部数据排序，如果只想得到最大10个数据的，则写成
        // for( int i = 10; i > 1; --i )
        for( int i = n; i > 1; --i ) {
                swap( root, 1, i );
                build_heap( root, 1, i-1 );
        }
}
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://python.cn/pipermail/python-chinese/attachments/20070108/a0093eac/attachment-0001.html

[导入自Mailman归档：http://www.zeuux.org/pipermail/zeuux-python]

徐继哲

0楼 2007年01月09日星期二 00:33

Leo Jay python.leojay在gmail.com
星期二一月 9 00:33:50 HKT 2007

On 1/8/07, heyuqi <yuqi.he在gmail.com> wrote:
>
> 在排序的部分改为堆排序，这个排序在最坏时候的复杂度为
> O(nlog2n)，要只想得到10次数最多的最大整数，则只做前面10次排序，即O(10log2n)=O(log2n)
>
> 想看看用 python 是如何实现的，而我对 python 不太熟悉，如今又忙着考试，只有 C++ 的实现，附上：
>
> //------------------------------------------------------------------------------
> // 堆排序
> //------------------------------------------------------------------------------
> class Element {
>     public:
>         // ...
>         bool operator<( Element e ) { return this.key < e.key; }
>         void operator=( Element e ) { key = e.key; }
>     private:
>         int key;
>  }
>
> void build_heap( Element* tree, const int root, const int n ) {
>     int half_n = n/2;
>     int max = -1;
>     int i = root;
>     int left, right;        // 左右孩子
>
>     while( i <= half_n ) {
>
>             left = 2*i;
>             right = left+1;
>
>             // 记录下左右孩子中值最大的下标
>             if( left < n && ( tree[left] < tree[right] ) ) {
>                     max = right;
>             } else {
>                     max = left;
>             }
>
>             // 对父结点及其孩子中值最大的结点进行排序
>             if( tree[i] < tree[max] ) {
>                     swap( tree, i, max );
>                     i = max;
>             } else {
>             // 子树是大顶堆，以 root 为父结点的树也满足大顶堆的要求，退出循环
>                     break;
>             }
>     }
> }
>
> void HeapSort( Element* root, const int n ) {
>
>         // 初始建堆
>         for( int i = n/2; i >= i; --i ) {
>                 build_heap( root, i, n );
>         }
>
>         // 这是对全部数据排序，如果只想得到最大10个数据的，则写成
>         // for( int i = 10; i > 1; --i )
>         for( int i = n; i > 1; --i ) {
>                 swap( root, 1, i );
>                 build_heap( root, 1, i-1 );
>         }
> }
>


修正你程序的几处问题：
首先，看到你写bool operator<( Element e ) { return this.key < e.key;
}我想你的程序没有编译过。至少我没见过this.key的用法。
其次，堆排的时间复杂度是O(nlogn)没有错，但是，取前10个的时间复杂度并不是O(10logn)
而应该是O(n)，确切的说大概是O((n-10)*log(10))，注：所有log均以2为底
最后，如果要用c++的话，不用自己实现堆排的，现成的就有，partial_sort就可以实现把前k个值排出来

另，我自己写了个c++版本的程序用g++4.1试了一下，在我AMD Barton
2500+的机器上，1kw个数据，包括生成数据和出结果的所有时间在内，6秒左右。
不过这跟数据有关的，我生成的1kw数据用的是0到5000000之内的随机数，如果上限再小一些的话，重复的数据就会多很多，速度就会更快了。

-- 
Best Regards,
Leo Jay

[导入自Mailman归档：http://www.zeuux.org/pipermail/zeuux-python]

何宇祺

0楼 2007年01月09日星期二 08:39

heyuqi yuqi.he在gmail.com
星期二一月 9 08:39:47 HKT 2007

On 1/9/07, Leo Jay <python.leojay at gmail.com> wrote:
>
> On 1/8/07, heyuqi <yuqi.he at gmail.com> wrote:
> >
> > 在排序的部分改为堆排序，这个排序在最坏时候的复杂度为
> > O(nlog2n)，要只想得到10次数最多的最大整数，则只做前面10次排序，即O(10log2n)=O(log2n)
> >
> > 想看看用 python 是如何实现的，而我对 python 不太熟悉，如今又忙着考试，只有 C++ 的实现，附上：
> >
> >
> //------------------------------------------------------------------------------
> > // 堆排序
> >
> //------------------------------------------------------------------------------
> > class Element {
> >     public:
> >         // ...
> >         bool operator<( Element e ) { return this.key < e.key; }
> >         void operator=( Element e ) { key = e.key; }
> >     private:
> >         int key;
> >  }
> >
> > void build_heap( Element* tree, const int root, const int n ) {
> >     int half_n = n/2;
> >     int max = -1;
> >     int i = root;
> >     int left, right;        // 左右孩子
> >
> >     while( i <= half_n ) {
> >
> >             left = 2*i;
> >             right = left+1;
> >
> >             // 记录下左右孩子中值最大的下标
> >             if( left < n && ( tree[left] < tree[right] ) ) {
> >                     max = right;
> >             } else {
> >                     max = left;
> >             }
> >
> >             // 对父结点及其孩子中值最大的结点进行排序
> >             if( tree[i] < tree[max] ) {
> >                     swap( tree, i, max );
> >                     i = max;
> >             } else {
> >             // 子树是大顶堆，以 root 为父结点的树也满足大顶堆的要求，退出循环
> >                     break;
> >             }
> >     }
> > }
> >
> > void HeapSort( Element* root, const int n ) {
> >
> >         // 初始建堆
> >         for( int i = n/2; i >= i; --i ) {
> >                 build_heap( root, i, n );
> >         }
> >
> >         // 这是对全部数据排序，如果只想得到最大10个数据的，则写成
> >         // for( int i = 10; i > 1; --i )
> >         for( int i = n; i > 1; --i ) {
> >                 swap( root, 1, i );
> >                 build_heap( root, 1, i-1 );
> >         }
> > }
> >
>
>
> 修正你程序的几处问题：
> 首先，看到你写bool operator<( Element e ) { return this.key < e.key;
> }我想你的程序没有编译过。至少我没见过this.key的用法。
> 其次，堆排的时间复杂度是O(nlogn)没有错，但是，取前10个的时间复杂度并不是O(10logn)
> 而应该是O(n)，确切的说大概是O((n-10)*log(10))，注：所有log均以2为底
> 最后，如果要用c++的话，不用自己实现堆排的，现成的就有，partial_sort就可以实现把前k个值排出来
>
> 另，我自己写了个c++版本的程序用g++4.1试了一下，在我AMD Barton
> 2500+的机器上，1kw个数据，包括生成数据和出结果的所有时间在内，6秒左右。
> 不过这跟数据有关的，我生成的1kw数据用的是0到5000000之内的随机数，如果上限再小一些的话，重复的数据就会多很多，速度就会更快了。
>
> --
> Best Regards,
> Leo Jay
> _______________________________________________
> python-chinese
> Post: send python-chinese at lists.python.cn
> Subscribe: send subscribe to python-chinese-request at lists.python.cn
> Unsubscribe: send unsubscribe to  python-chinese-request at lists.python.cn
> Detail Info: http://python.cn/mailman/listinfo/python-chinese


确实没编译过，这是给同学考前背的，之前把

>         void operator=( Element e ) { key = e.key; }

编译了，没问题

>         bool operator<( Element e ) { return this.key < e.key; }

是后加上去的。要说明一点的是，用算符重构只是为了记忆方便，不知道实际效果好不好，Leo Jay ，你研究过吗？要追求速度的程序中用好不好？

关于

>         bool operator<( Element e ) { return this.key < e.key; }

应该是 bool operator<( Element e ) { return this->key < e.key; }
或者
bool operator<( Element e ) { return key < e.key; }

this 是指针，犯了一个低级错误。
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://python.cn/pipermail/python-chinese/attachments/20070109/77a711e2/attachment.html

[导入自Mailman归档：http://www.zeuux.org/pipermail/zeuux-python]

徐继哲

0楼 2007年01月09日星期二 10:02

Yunfeng Tao taoyfeng在gmail.com
星期二一月 9 10:02:20 HKT 2007

python 2.5自带的heapq模块已经实现了。

在 07-1-8，heyuqi<yuqi.he在gmail.com> 写道：
>
>
> On 1/8/07, Julius F. Huang <fozzold在gmail.com > wrote:
> > 这样可不可以?
> >
> > d = {}
> > for i in l:
> >     d[i] = d.get(i, 0) + 1
> > _______________________________________________
> > python-chinese
> > Post: send python-chinese在lists.python.cn
> > Subscribe: send subscribe to
> python-chinese-request在lists.python.cn
> > Unsubscribe: send unsubscribe to
> python-chinese-request在lists.python.cn
> > Detail Info:
> http://python.cn/mailman/listinfo/python-chinese
>
> 在排序的部分改为堆排序，这个排序在最坏时候的复杂度为
> O(nlog2n)，要只想得到10次数最多的最大整数，则只做前面10次排序，即O(10log2n)=O(log2n)
>
> 想看看用 python 是如何实现的，而我对 python 不太熟悉，如今又忙着考试，只有 C++ 的实现，附上：
>
> //------------------------------------------------------------------------------
> // 堆排序
> //------------------------------------------------------------------------------
> class Element {
>     public:
>         // ...
>         bool operator<( Element e ) { return this.key < e.key; }
>         void operator=( Element e ) { key = e.key; }
>     private:
>         int key;
>  }
>
> void build_heap( Element* tree, const int root, const int n ) {
>     int half_n = n/2;
>     int max = -1;
>     int i = root;
>     int left, right;        // 左右孩子
>
>     while( i <= half_n ) {
>
>             left = 2*i;
>             right = left+1;
>
>             // 记录下左右孩子中值最大的下标
>             if( left < n && ( tree[left] < tree[right] ) ) {
>                     max = right;
>             } else {
>                     max = left;
>             }
>
>             // 对父结点及其孩子中值最大的结点进行排序
>             if( tree[i] < tree[max] ) {
>                     swap( tree, i, max );
>                     i = max;
>             } else {
>             // 子树是大顶堆，以 root 为父结点的树也满足大顶堆的要求，退出循环
>                     break;
>             }
>     }
> }
>
> void HeapSort( Element* root, const int n ) {
>
>         // 初始建堆
>         for( int i = n/2; i >= i; --i ) {
>                 build_heap( root, i, n );
>         }
>
>         // 这是对全部数据排序，如果只想得到最大10个数据的，则写成
>         // for( int i = 10; i > 1; --i )
>         for( int i = n; i > 1; --i ) {
>                 swap( root, 1, i );
>                 build_heap( root, 1, i-1 );
>         }
> }
>
> _______________________________________________
> python-chinese
> Post: send python-chinese在lists.python.cn
> Subscribe: send subscribe to
> python-chinese-request在lists.python.cn
> Unsubscribe: send unsubscribe to
> python-chinese-request在lists.python.cn
> Detail Info:
> http://python.cn/mailman/listinfo/python-chinese
>

[导入自Mailman归档：http://www.zeuux.org/pipermail/zeuux-python]

徐继哲

0楼 2007年01月09日星期二 10:09

Yunfeng Tao taoyfeng在gmail.com
星期二一月 9 10:09:24 HKT 2007

n长序列求前k大元素，根本不必排序。用半个快速排序就能搞定，期望时间复杂度O(n)，系数非常小，而且容易实现。

在 07-1-8，heyuqi<yuqi.he在gmail.com> 写道：
>
>
> On 1/8/07, Julius F. Huang <fozzold在gmail.com > wrote:
> > 这样可不可以?
> >
> > d = {}
> > for i in l:
> >     d[i] = d.get(i, 0) + 1
> > _______________________________________________
> > python-chinese
> > Post: send python-chinese在lists.python.cn
> > Subscribe: send subscribe to
> python-chinese-request在lists.python.cn
> > Unsubscribe: send unsubscribe to
> python-chinese-request在lists.python.cn
> > Detail Info:
> http://python.cn/mailman/listinfo/python-chinese
>
> 在排序的部分改为堆排序，这个排序在最坏时候的复杂度为
> O(nlog2n)，要只想得到10次数最多的最大整数，则只做前面10次排序，即O(10log2n)=O(log2n)
>
> 想看看用 python 是如何实现的，而我对 python 不太熟悉，如今又忙着考试，只有 C++ 的实现，附上：
>
> //------------------------------------------------------------------------------
> // 堆排序
> //------------------------------------------------------------------------------
> class Element {
>     public:
>         // ...
>         bool operator<( Element e ) { return this.key < e.key; }
>         void operator=( Element e ) { key = e.key; }
>     private:
>         int key;
>  }
>
> void build_heap( Element* tree, const int root, const int n ) {
>     int half_n = n/2;
>     int max = -1;
>     int i = root;
>     int left, right;        // 左右孩子
>
>     while( i <= half_n ) {
>
>             left = 2*i;
>             right = left+1;
>
>             // 记录下左右孩子中值最大的下标
>             if( left < n && ( tree[left] < tree[right] ) ) {
>                     max = right;
>             } else {
>                     max = left;
>             }
>
>             // 对父结点及其孩子中值最大的结点进行排序
>             if( tree[i] < tree[max] ) {
>                     swap( tree, i, max );
>                     i = max;
>             } else {
>             // 子树是大顶堆，以 root 为父结点的树也满足大顶堆的要求，退出循环
>                     break;
>             }
>     }
> }
>
> void HeapSort( Element* root, const int n ) {
>
>         // 初始建堆
>         for( int i = n/2; i >= i; --i ) {
>                 build_heap( root, i, n );
>         }
>
>         // 这是对全部数据排序，如果只想得到最大10个数据的，则写成
>         // for( int i = 10; i > 1; --i )
>         for( int i = n; i > 1; --i ) {
>                 swap( root, 1, i );
>                 build_heap( root, 1, i-1 );
>         }
> }
>
> _______________________________________________
> python-chinese
> Post: send python-chinese在lists.python.cn
> Subscribe: send subscribe to
> python-chinese-request在lists.python.cn
> Unsubscribe: send unsubscribe to
> python-chinese-request在lists.python.cn
> Detail Info:
> http://python.cn/mailman/listinfo/python-chinese
>

[导入自Mailman归档：http://www.zeuux.org/pipermail/zeuux-python]

徐继哲

0楼 2007年01月09日星期二 12:34

wkssy delux25在gmail.com
星期二一月 9 12:34:06 HKT 2007

Ö÷ÌùËµµÄÊÇÏëÒª³öÏÖ´ÎÊý×î¶àµÄÊý£¬²»ÊÇÇ°K´óÔªËØ
-------------- 下一部分 --------------
Ò»¸öHTML¸½¼þ±»ÒÆ³ý...
URL: http://python.cn/pipermail/python-chinese/attachments/20070109/e20fcec4/attachment.html

[导入自Mailman归档：http://www.zeuux.org/pipermail/zeuux-python]

何宇祺

0楼 2007年01月09日星期二 17:37

heyuqi yuqi.he在gmail.com
星期二一月 9 17:37:21 HKT 2007

On 1/9/07, wkssy <delux25 at gmail.com> wrote:
>
> 主贴说的是想要出现次数最多的数，不是前K大元素


是的，主贴是这么说，但是这个问题应该先对出现过的数进行统计，然后对出现的次数进行排序吧。要不然怎么能知道哪几个数出现的次数最多。而第二步也就了求前
k  大的元素了。



On 1/9/07, Leo Jay <python.leojay at gmail.com> wrote:
>
>
> 修正你程序的几处问题：
> 首先，看到你写bool operator<( Element e ) { return this.key < e.key;
> }我想你的程序没有编译过。至少我没见过this.key的用法。


写错了应该是 this->key


其次，堆排的时间复杂度是O(nlogn)没有错，但是，取前10个的时间复杂度并不是O(10logn)
> 而应该是O(n)，确切的说大概是O((n-10)*log(10))，注：所有log均以2为底


不对吧，若一个数组 array[n] 是堆，从中删除一个数据，并使用后来的 array[n-1]仍是堆，这种算法的时候复杂度为 O(logn)

这是我们课后练习的一道题，以前证过，现在一时想不起了。只能把结果写出来


On 1/9/07, Yunfeng Tao <taoyfeng at gmail.com> wrote:
n长序列求前k大元素，根本不必排序。用半个快速排序就能搞定
>
> ，期望时间复杂度O(n)，系数非常小，而且容易实现。


半个快速排序？可以实现需求吗？

在我这里快速排序也叫分划排序，每一趟排序都确定一个数的位置，使它左边的数都小于它，右边的数都大于它。而一趟排序的时间复杂度应该是
O(n)，每个数都要扫描一遍。

假如
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://python.cn/pipermail/python-chinese/attachments/20070109/696f9fbd/attachment.htm

[导入自Mailman归档：http://www.zeuux.org/pipermail/zeuux-python]

请登录后回复。还没有在Zeuux哲思注册吗？现在注册！

Zeuux © 2025

京ICP备05028076号