Elasticsearch Basics: Tokenization and Installing/Configuring the IK Analyzer


This post covers basic Elasticsearch tokenization and how to install and configure the IK analyzer for Chinese text.

Tokenization:

POST _analyze
{
  "analyzer": "standard",
  "text": "Today is what sunny."
}
{
  "tokens" : [
    {
      "token" : "today",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "is",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "what",
      "start_offset" : 9,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "sunny",
      "start_offset" : 14,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}

The default (standard) analyzer is designed for English. With Chinese text:

POST _analyze
{
  "analyzer": "standard",
  "text": "我是中国人."
}
As shown below, Chinese words are not segmented; each character becomes its own token:
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "中",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "国",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "人",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    }
  ]
}

To segment Chinese properly, use a plugin: the IK analyzer.

Download the IK analyzer:

https://github.com/medcl/elasticsearch-analysis-ik/releases

Download the zip file, unzip it, and place the contents in an ik folder under the plugins directory of your Elasticsearch installation, as sketched below.
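
A minimal sketch of these steps, assuming the host plugin directory is /mydata/elasticsearch/plugins (the path shown further down) and the 7.4.2 release to match the plugin jar listed later; adjust the version to match your Elasticsearch version:

cd /mydata/elasticsearch/plugins
mkdir ik && cd ik
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip
unzip elasticsearch-analysis-ik-7.4.2.zip
rm elasticsearch-analysis-ik-7.4.2.zip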

Grant permissions on the folder:

chmod -R 777  ik

Enter the container if you installed Elasticsearch with Docker (for a non-Docker install, skip straight to the next step).
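
For example (assuming the container is named elasticsearch, the same name used in the restart command later):

docker exec -it elasticsearch /bin/bash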

The IK analyzer is a plugin, so we can verify its installation.

Locate elasticsearch-plugin; in a Docker-based install it is located at:

/usr/share/elasticsearch/bin/elasticsearch-plugin

Steps:

Run elasticsearch-plugin with no arguments:

Output:
Option         Description        
------         -----------        
-h, --help     show help          
-s, --silent   show minimal output
-v, --verbose  show verbose output
ERROR: Missing command

Check the help:

[root@6a850788e223 bin]# elasticsearch-plugin -h
A tool for managing installed elasticsearch plugins

Commands
--------
list - Lists installed elasticsearch plugins
install - Install a plugin
remove - removes a plugin from Elasticsearch
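
Note that the install command offers an alternative to unpacking the zip by hand; a sketch, assuming version 7.4.2 to match the plugin jar shown later (the plugin version must match your Elasticsearch version):

elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip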


List the installed plugins:
[root@6a850788e223 bin]# elasticsearch-plugin  list
ik

The output shows that the plugin is installed successfully.

After installation, restart the Elasticsearch service.
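
For a Docker install, assuming the container is named elasticsearch as in the restart step later:

docker restart elasticsearch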

Test the IK analyzer:

Smart segmentation (ik_smart):
POST _analyze
{
  "analyzer": "ik_smart",
  "text": "我是中国人."
}
Result:
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}
Maximum word combinations (ik_max_word):
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "我是中国人."
}
Result:
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "中国",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "国人",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}
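
In real indices the analyzer is usually set in the mapping rather than passed to _analyze each time. A minimal sketch, assuming a hypothetical index my_index with a text field named content, analyzed with ik_max_word at index time and ik_smart at search time:

PUT my_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}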
Creating a custom (extension) dictionary:

Appendix: set up an Nginx server

First run a temporary Nginx container just to copy out its default configuration, then remove it:

docker run -p 80:80 --name nginx -d nginx:1.10
mkdir -p /mydata/nginx1.10
docker container cp nginx:/etc/nginx /mydata/nginx1.10/conf
docker stop nginx
docker rm nginx

Then start the real Nginx container with the configuration, html, and log directories mounted from the host:
docker run -p 80:80 --name nginx1.10  \
-v /mydata/nginx1.10/html:/usr/share/nginx/html \
-v /mydata/nginx1.10/logs:/var/log/nginx \
-v /mydata/nginx1.10/conf:/etc/nginx \
-d nginx:1.10

Place a dictionary file under the Nginx web root to act as the remote dictionary:

/mydata/nginx1.10/html/es
[root@bogon es]# ls
fenci.txt

fenci.txt contains one custom word per line:

赵一
钱二
孙三
李四
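
To confirm that Nginx serves the file (assuming the host IP 192.168.31.125 used in the IK configuration below):

curl http://192.168.31.125/es/fenci.txt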

Configuring the IK analyzer:

The config directory sits inside the ik plugin directory under Elasticsearch's plugins directory:

[root@bogon ik]# pwd
/mydata/elasticsearch/plugins/ik
[root@bogon ik]# ls
commons-codec-1.9.jar    config                               httpclient-4.5.2.jar  plugin-descriptor.properties
commons-logging-1.2.jar  elasticsearch-analysis-ik-7.4.2.jar  httpcore-4.4.4.jar    plugin-security.policy


[root@bogon ik]# cd config/
[root@bogon config]# ls
extra_main.dic         extra_single_word_full.dic      extra_stopword.dic  main.dic         quantifier.dic  suffix.dic
extra_single_word.dic  extra_single_word_low_freq.dic  IKAnalyzer.cfg.xml  preposition.dic  stopword.dic    surname.dic

Edit the configuration file:

vim IKAnalyzer.cfg.xml

The original configuration file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer extension configuration</comment>
        <!-- Users can configure their own extension dictionary here -->
        <entry key="ext_dict"></entry>
        <!-- Users can configure their own extension stop-word dictionary here -->
        <entry key="ext_stopwords"></entry>
        <!-- Users can configure a remote extension dictionary here -->
        <!-- <entry key="remote_ext_dict">words_location</entry> -->
        <!-- Users can configure a remote extension stop-word dictionary here -->
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

Modify it to point at the remote extension dictionary:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer extension configuration</comment>
        <!-- Users can configure their own extension dictionary here -->
        <entry key="ext_dict"></entry>
        <!-- Users can configure their own extension stop-word dictionary here -->
        <entry key="ext_stopwords"></entry>
        <!-- Users can configure a remote extension dictionary here -->
        <entry key="remote_ext_dict">http://192.168.31.125/es/fenci.txt</entry>
        <!-- Users can configure a remote extension stop-word dictionary here -->
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
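
The remote file should be UTF-8 encoded with one word per line, and the HTTP response should include Last-Modified or ETag headers so IK can detect changes and reload the dictionary; a quick check (same assumed IP as above):

curl -I http://192.168.31.125/es/fenci.txt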

Save the changes and restart Elasticsearch:

[root@bogon config]# docker restart elasticsearch 
elasticsearch

Test the custom words with the IK analyzer:

POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "赵一钱二孙三李四."
}
Result:
{
  "tokens" : [
    {
      "token" : "赵一",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "一钱",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "一",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "TYPE_CNUM",
      "position" : 2
    },
    {
      "token" : "钱二",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "钱",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "COUNT",
      "position" : 4
    },
    {
      "token" : "二",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "TYPE_CNUM",
      "position" : 5
    },
    {
      "token" : "孙三",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "三",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "TYPE_CNUM",
      "position" : 7
    },
    {
      "token" : "李四",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 8
    },
    {
      "token" : "四",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "TYPE_CNUM",
      "position" : 9
    }
  ]
}

After the dictionary is updated, Elasticsearch applies the new words only when analyzing newly indexed data; existing documents are not re-analyzed. To re-analyze historical data (by re-indexing the documents in place), run:

POST my_index/_update_by_query?conflicts=proceed
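
To verify the effect on existing data, a hedged sketch assuming the hypothetical my_index mapping above, with its content field analyzed by ik_max_word:

GET my_index/_search
{
  "query": {
    "match": { "content": "赵一" }
  }
}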

