Python forum post - Zeuux

Python Forum - Discussion Board

Subject: [python-chinese] What is the best solution for storing a fairly large dictionary structure?

孙君意

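Original poster, Friday, December 8, 2006, 11:32

junyi sun ccnusjy at gmail.com
Fri Dec 8 11:32:51 HKT 2006

I tried cPickle; the drawback is that loading takes a long time.
Then I switched to shelve; the drawback is that insert performance degrades exponentially as the dictionary grows.
Then I switched to dbhash, which performs slightly better than shelve.
Then I switched to bsddb and found that its btopen() is faster than hashopen(), but once the data gets large, performance drops off quickly.

------ Any advice from the experts here would be appreciated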
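
For readers comparing the first two options, here is a minimal sketch of the difference, written against the Python 3 standard library (`pickle` stands in for the `cPickle` of the post; the file names and sample data are made up for illustration). `pickle` has to deserialize the whole dictionary before any key can be read, while `shelve` keeps the values in a dbm file and fetches one key at a time.

```python
import os
import pickle
import shelve
import tempfile

# Hypothetical sample data: word -> list of record offsets.
data = {"word%d" % i: [i, i + 1] for i in range(1000)}

tmpdir = tempfile.mkdtemp()
pickle_path = os.path.join(tmpdir, "index.pickle")
shelf_path = os.path.join(tmpdir, "index.shelf")

# pickle: the dictionary is one blob, so reading any key
# means loading and decoding everything first.
with open(pickle_path, "wb") as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
with open(pickle_path, "rb") as f:
    whole = pickle.load(f)
from_pickle = whole["word42"]

# shelve: each value is pickled separately and stored under its key
# in a dbm file, so a lookup reads only that one entry.
with shelve.open(shelf_path) as db:
    db.update(data)
with shelve.open(shelf_path) as db:
    from_shelf = db["word42"]
```

Which dbm backend `shelve` picks depends on the platform, so timings from a sketch like this will not necessarily match what the post observed with bsddb.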

[Imported from the Mailman archive: http://www.zeuux.org/pipermail/zeuux-python]

Reply 0, Friday, December 8, 2006, 11:38

刘鑫 march.liu at gmail.com
Fri Dec 8 11:38:36 HKT 2006

Even BDB isn't good enough? That makes things hard. Have you added an index to the BDB file? It would be faster with one.

-- 
Welcome to visit:
http://blog.csdn.net/ccat

刘鑫
March.Liu

徐继哲

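Reply 0, Friday, December 8, 2006, 11:41

bird devdoer devdoer at gmail.com
Fri Dec 8 11:41:08 HKT 2006

dbhash, shelve and hashopen are all the same thing, aren't they? They all use Berkeley DB's hash method.
How much data do you have?
bsddb should still be fast.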

-- 
devdoer
devdoer at gmail.com
http://devdoer.blog.sohu.com/

孙君意

Reply 0, Friday, December 8, 2006, 11:42

junyi sun ccnusjy at gmail.com
Fri Dec 8 11:42:41 HKT 2006

Add an index to the BDB file?
How is that done? Is it done from Python? Thanks!


Reply 0, Friday, December 8, 2006, 11:44

刘鑫 march.liu at gmail.com
Fri Dec 8 11:44:35 HKT 2006

Take a look at this documentation; it should be useful to you:
http://pybsddb.sourceforge.net/


徐继哲

Reply 0, Friday, December 8, 2006, 11:48

lu_zi_2000 lu_zi_2000 at 163.com
Fri Dec 8 11:48:44 HKT 2006

Write your own C++ module that fits your requirements, then call it from Python.




lu_zi_2000
2006-12-08




Reply 0, Friday, December 8, 2006, 12:16

Zoom.Quiet zoom.quiet at gmail.com
Fri Dec 8 12:16:42 HKT 2006

Try http://buzhug.sourceforge.net/
It is a pure-Python object DB that you can use in place of the dict ...
Karrigell chose buzhug as its default embedded DB.



-- 
'''Time is unimportant, only life important!
blog@  http://blog.zoomquiet.org/pyblosxom/
wiki@    http://wiki.woodpecker.org.cn/moin/ZoomQuiet
douban@ http://www.douban.com/people/zoomq/
____________________________________
Please use OpenOffice.org to replace M$ office.
     http://zh.openoffice.org
Please use 7-zip to replace WinRAR/WinZip.
     http://7-zip.org/zh-cn/
You can get true freedom from software.
'''

孙君意

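Reply 0, Friday, December 8, 2006, 12:27

junyi sun ccnusjy at gmail.com
Fri Dec 8 12:27:46 HKT 2006

Thanks, everyone. I'll keep experimenting.

I want to experiment with the inverted-index principle behind full-text search: crawl web pages, strip out the HTML and JavaScript code,
and store the resulting plain text and other information in a file called "page.dat". Each record has the format
FMT="256s256sLf50000s" # storage format {url,title,size,updatetime,content}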
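
A minimal sketch of how fixed-size records in that format could be written and read back by offset with the `struct` module. It is written for Python 3 (where `s` fields are bytes); the helper names `append_record` and `read_record`, the UTF-8 encoding, and the in-memory `io.BytesIO` standing in for page.dat are all assumptions for illustration, not the poster's code.

```python
import io
import struct

FMT = "256s256sLf50000s"            # from the post: url, title, size, updatetime, content
RECORD_SIZE = struct.calcsize(FMT)  # every record occupies the same number of bytes
CONTENT_MAX = 50000

def append_record(f, url, title, content, updatetime):
    """Append one record and return the byte offset where it starts."""
    f.seek(0, io.SEEK_END)
    offset = f.tell()
    body = content.encode("utf-8")[:CONTENT_MAX]  # longer text is truncated
    f.write(struct.pack(FMT, url.encode("utf-8"), title.encode("utf-8"),
                        len(body), updatetime, body))
    return offset

def read_record(f, offset):
    """Read the record that starts at the given byte offset."""
    f.seek(offset)
    url, title, size, updatetime, content = struct.unpack(FMT, f.read(RECORD_SIZE))
    return (url.rstrip(b"\0").decode("utf-8"),    # strip the NUL padding added by 's'
            title.rstrip(b"\0").decode("utf-8"),
            size,
            updatetime,
            content[:size].decode("utf-8", errors="ignore"))

page_dat = io.BytesIO()  # stands in for page.dat
off1 = append_record(page_dat, "http://example.com/a", "Page A", "hello world", 0.0)
off2 = append_record(page_dat, "http://example.com/b", "Page B", "中国 python", 0.0)
record = read_record(page_dat, off2)
```

Two things worth knowing about this format string: without a byte-order prefix such as `<`, `struct` uses native sizes and alignment, so `RECORD_SIZE` can differ between platforms; and `f` is a 32-bit float, so a Unix timestamp stored in it is only accurate to roughly two minutes (`d` would keep full precision).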

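Then I run the indexing program, which reads each record from page.dat in turn and segments the text into words. This is where the dictionary comes in: each word maps to a set, and the set holds the offsets of the "blocks" that contain that word,
e.g. {'中国':set([55666L,54323L,1234234L])}. This big dictionary is saved in index.idx.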
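
The indexing step described here could be sketched roughly as follows (Python 3, standard library only). The `tokenize` function is a whitespace-splitting placeholder, since real Chinese text needs a proper word segmenter; the sample records and offsets are invented, and storing the result through `shelve` is just one of the options discussed in this thread.

```python
import os
import shelve
import tempfile
from collections import defaultdict

def tokenize(text):
    # Placeholder: whitespace split. Chinese text needs a real word segmenter.
    return text.lower().split()

def build_index(records):
    """records: iterable of (offset, text) pairs -> {word: set of offsets}."""
    index = defaultdict(set)
    for offset, text in records:
        for word in tokenize(text):
            index[word].add(offset)
    return index

# Hypothetical records: (offset of the block in page.dat, plain text).
records = [(0, "python dict storage"),
           (50524, "python full text search"),
           (101048, "inverted index search")]
index = build_index(records)

# Persist one word per key, so a later lookup reads only that word's set.
idx_path = os.path.join(tempfile.mkdtemp(), "index.idx")
with shelve.open(idx_path) as db:
    for word, offsets in index.items():
        db[word] = offsets
with shelve.open(idx_path) as db:
    python_offsets = db["python"]
```

Building the whole index in memory first and writing each word once avoids repeatedly rewriting a growing set for the same key, which is one plausible reason inserts get slower as the index grows.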

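When a user enters a "sentence" as a query, the sentence is split into words, and the words are looked up in the dictionary,
which yields several sets. The sets are then intersected (the intersection may be empty, so results are ranked by degree of overlap). Finally, the records are read using the "block" offsets in the sets.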
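
One way to read "ranked by degree of overlap" is to count, for each block, how many of the query words it contains, and sort by that count. The sketch below does this; the `search` function, the tie-breaking by offset, and the sample index are assumptions for illustration.

```python
from collections import Counter

def search(index, query_words):
    """Return block offsets ranked by how many query words each block contains."""
    hits = Counter()
    for word in set(query_words):          # count each distinct word once
        for offset in index.get(word, ()):
            hits[offset] += 1
    # Blocks that match every word (the true intersection) sort first;
    # ties are broken by offset so the order is deterministic.
    return sorted(hits, key=lambda off: (-hits[off], off))

# Hypothetical index: word -> set of block offsets.
index = {"python": {0, 50524}, "search": {50524, 101048}, "index": {101048}}
ranked = search(index, ["python", "search"])
```

Here block 50524 contains both query words and comes first; blocks that match only one word follow, so the query still returns something when the strict intersection is empty.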

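The above is only a preliminary idea. There are still many difficulties in implementing it, and I will be asking all of you for more advice.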






徐继哲

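Reply 0, Friday, December 8, 2006, 12:41

bird devdoer devdoer at gmail.com
Fri Dec 8 12:41:11 HKT 2006

Tens of millions of documents is no problem, don't worry.
That's what I do too, haha: the inverted index goes into bdb. Mine is partitioned, though. I use multiple inverted indexes, which allows distributed storage and processing.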



孙君意

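Reply 0, Friday, December 8, 2006, 13:27

junyi sun ccnusjy at gmail.com
Fri Dec 8 13:27:09 HKT 2006

Thanks, bird devdoer:
    Could you explain what kind of strategy "partitioned, with multiple inverted indexes, which allows distributed storage and processing" is?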


徐继哲

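Reply 0, Friday, December 8, 2006, 14:07

bird devdoer devdoer at gmail.com
Fri Dec 8 14:07:21 HKT 2006

Right, you split the documents into several disjoint document sets and build an index for each one. The resulting indexes and document sets can then be spread across multiple machines.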
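
A toy sketch of that partitioning idea, kept inside one process so it can be run as is. The reply does not say how the documents were split, so assigning them by `doc_id % n_shards` is an assumption, as are the function names and sample documents. In a real deployment each shard's index would live in its own store (for example its own bdb file) on its own machine.

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text} -> {word: set of doc_ids}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

def partition(docs, n_shards):
    """Split docs into n_shards disjoint sets (here: by doc_id modulo n_shards)."""
    shards = [{} for _ in range(n_shards)]
    for doc_id, text in docs.items():
        shards[doc_id % n_shards][doc_id] = text
    return shards

def search_all(shard_indexes, word):
    """Ask every shard for the word and merge the per-shard answers."""
    result = set()
    for index in shard_indexes:
        result |= index.get(word, set())
    return result

docs = {1: "python bsddb", 2: "inverted index", 3: "python index", 4: "full text"}
shard_indexes = [build_index(shard) for shard in partition(docs, 2)]
found = search_all(shard_indexes, "python")
```

Because the document sets are disjoint, the per-shard results never overlap, so merging them is a plain union and each shard's index stays a fraction of the size of one global index.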

