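Wednesday, December 18, 2013, 11:32
Full-text search over site content is an essential component of modern web applications, and there are many ways to implement it, such as Sphinx, Solr, and ElasticSearch.
Today I will introduce what is probably the simplest and quickest way to implement Chinese full-text search: the Sphinx search engine combined with N-gram indexing. The advantages of this approach are simple configuration and efficient operation. It needs no complex Chinese word-segmentation library and no hunting for dictionaries or word lists, and index building is extremely fast. Search quality is acceptable (do not expect it to match a dedicated segmentation-based search system), which makes it a good fit for the stage where a company is taking its search feature "from nothing to something".
The installation steps below were carried out on CentOS 6 x86_64.
1. Download and install the Sphinx RPM package (2.1.3, the latest version at the time of writing):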
rpm -ivh http://sphinxsearch.com/files/sphinx-2.1.3-1.rhel6.x86_64.rpm
2. Configuration file (/etc/sphinx/sphinx.conf):
source comment
{
    type = mysql
    sql_host = hostname
    sql_user = user
    sql_pass = pass
    sql_db = database
    sql_port = 3306
    sql_query_pre = SET NAMES utf8
    # The first column is the document id; sku and date_added become attributes below
    sql_query = select id, sku, comment, UNIX_TIMESTAMP(UpDT) as date_added from UserCmt
    sql_attr_uint = sku
    sql_attr_timestamp = date_added
}
index cmt
{
    source = comment
    path = /var/lib/sphinx/okaybuy_cmt
    docinfo = extern
    charset_type = utf-8
    # Characters to be indexed as N-grams (duplicate ranges from the original list removed)
    ngram_chars = U+3400..U+4DB5, U+4E00..U+9FBB, U+20000..U+2A6D6, U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, U+FA29, U+3105..U+312C, U+31A0..U+31B7, U+3041, U+3043, U+3045, U+3047, U+3049, U+304B, U+304D, U+304F, U+3051, U+3053, U+3055, U+3057, U+3059, U+305B, U+305D, U+305F, U+3061, U+3063, U+3066, U+3068, U+306A..U+306F, U+3072, U+3075, U+3078, U+307B, U+307E..U+3083, U+3085, U+3087, U+3089..U+308E, U+3090..U+3093, U+30A1, U+30A3, U+30A5, U+30A7, U+30A9, U+30AD, U+30AF, U+30B3, U+30B5, U+30BB, U+30BD, U+30BF, U+30C1, U+30C3, U+30C4, U+30C6, U+30CA, U+30CB, U+30CD, U+30CE, U+30DE, U+30DF, U+30E1, U+30E2, U+30E3, U+30E5, U+30E7, U+30EE, U+30F0..U+30F3, U+30F5, U+30F6, U+31F0, U+31F1, U+31F2, U+31F3, U+31F4, U+31F5, U+31F6, U+31F7, U+31F8, U+31F9, U+31FA, U+31FB, U+31FC, U+31FD, U+31FE, U+31FF, U+AC00..U+D7A3, U+1100..U+1159, U+1161..U+11A2, U+11A8..U+11F9, U+A000..U+A48C, U+A492..U+A4C6
    # 1 = index each listed character as its own token (unigram)
    ngram_len = 1
}
indexer
{
    mem_limit = 512M
}
searchd
{
    # Native API listener
    listen = 127.0.0.1:9312
    # MySQL protocol (SphinxQL) listener
    listen = 9306:mysql41
    log = /var/log/sphinx/searchd.log
    query_log = /var/log/sphinx/query.log
    read_timeout = 5
    max_children = 30
    pid_file = /var/run/sphinx/searchd.pid
    max_matches = 1000
    seamless_rotate = 1
    preopen_indexes = 1
    unlink_old = 1
    workers = threads # for RT to work
    binlog_path = /var/lib/sphinx
}
The key to Chinese search lies in the ngram_chars and ngram_len settings; adjust the other options to suit your own application.
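To make the effect of these two settings more concrete, here is a minimal Python sketch of what unigram indexing (ngram_len = 1) does conceptually to a mixed Chinese/ASCII string. It is only an illustration and not Sphinx's actual tokenizer: NGRAM_RANGES covers just two of the ranges from ngram_chars, WORD_CHARS is a simplified stand-in for the default character table, and the names is_ngram_char and tokenize are made up for this example.

```python
import string

# Assumption: only two of the ranges listed in ngram_chars, to keep the sketch short.
NGRAM_RANGES = [(0x3400, 0x4DB5), (0x4E00, 0x9FBB)]
# Assumption: ordinary "word" characters are limited to ASCII letters, digits and "_".
WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def is_ngram_char(ch):
    """Return True if the character falls inside one of the N-gram ranges."""
    code = ord(ch)
    return any(low <= code <= high for low, high in NGRAM_RANGES)


def tokenize(text):
    """Split text into tokens: one token per N-gram character,
    while runs of ordinary word characters stay together as one word."""
    tokens = []
    word = []

    def flush():
        # Emit the pending ordinary word (lowercased), if there is one.
        if word:
            tokens.append("".join(word).lower())
            del word[:]

    for ch in text:
        if is_ngram_char(ch):
            flush()
            tokens.append(ch)   # each CJK character is indexed on its own
        elif ch in WORD_CHARS:
            word.append(ch)     # keep building the current word
        else:
            flush()             # anything else acts as a separator
    flush()
    return tokens
```

With this sketch, tokenize("Sphinx中文搜索 v2") returns ['sphinx', '中', '文', '搜', '索', 'v2']. Because every Chinese character becomes a separate token, no dictionary or word segmenter is needed, which is where the simplicity of this approach comes from.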
3. Build the index:
indexer --all --rotate
4. Start the search daemon:
service searchd start
5. Test a search (中文关键词 below is a placeholder meaning "Chinese keywords"; substitute your own query):
search 中文关键词
If everything runs correctly and you are satisfied with the search results, congratulations: the search system is ready to go live.
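Once the daemon is up, an application could also query the cmt index through the MySQL protocol listener on port 9306 defined in the configuration. The following Python sketch shows one possible way to do that. It assumes the third-party pymysql client (not mentioned in this post, and not verified here against this searchd version), and the helper names escape_match, build_phrase_query and search_comments are hypothetical. Each whitespace-separated keyword is wrapped in quotes so that, after Sphinx splits it into single characters, it is still matched as a phrase.

```python
import re

# Characters that act as operators in Sphinx's extended query syntax.
_SPECIAL = re.compile(r'([\\()|\-!@~"&/^$=<])')


def escape_match(text):
    """Backslash-escape query-syntax operators found in user input."""
    return _SPECIAL.sub(r"\\\1", text)


def build_phrase_query(keywords):
    """Wrap every whitespace-separated keyword in quotes (the phrase operator)."""
    return " ".join('"%s"' % escape_match(word) for word in keywords.split())


def search_comments(keywords, limit=10):
    """Query the cmt index over SphinxQL and return the matching rows."""
    import pymysql  # assumption: third-party MySQL client library

    conn = pymysql.connect(host="127.0.0.1", port=9306, charset="utf8")
    try:
        with conn.cursor() as cur:
            # id is the document id; sku and date_added are the attributes
            # declared in the source section of sphinx.conf.
            cur.execute(
                "SELECT id, sku, date_added FROM cmt WHERE MATCH(%s) LIMIT %s",
                (build_phrase_query(keywords), limit),
            )
            return cur.fetchall()
    finally:
        conn.close()
```

For example, build_phrase_query("中文 搜索") produces the query string "中文" "搜索", and search_comments("中文 搜索") would send it to searchd. Treat the connection details as a starting point to adapt, not as a tested client.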
Reference documentation:
http://sphinxsearch.com/docs/2.1.3/
Thursday, December 19, 2013, 13:17
Just use Solr instead; anyone who has used it knows why.
Thursday, December 19, 2013, 13:58
o(∩∩)o... Haha, I agree with the comment above :)