bulk批量操作的json格式解析-摩杜云开发者社区

3.17 bulk批量操作的json格式解析

bulk的格式：

{action:{metadata}}\n

{requstbody}\n

为什么不使用如下格式：

[{

"action": {

"data": {

}

}]

这种方式可读性好，但是内部处理就麻烦了：

1.将json数组解析为JSONArray对象，在内存中就需要有一份json文本的拷贝，另外还有一个JSONArray对象。

2.解析json数组里的每个json，对每个请求中的document进行路由

3.为路由到同一个shard上的多个请求，创建一个请求数组

4.将这个请求数组序列化

5.将序列化后的请求数组发送到对应的节点上去

耗费更多内存，增加java虚拟机开销

1.不用将其转换为json对象，直接按照换行符切割json，内存中不需要json文本的拷贝

2.对每两个一组的json，读取meta，进行document路由

3.直接将对应的json发送到node上去

3.18 查询结果分析

{ "took": 419, "timed_out": false, "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 }, "hits": { "total": 3, "max_score": 0.6931472, "hits": [ { "_index": "lib3", "_type": "user", "_id": "3", "_score": 0.6931472, "_source": { "address": "bei jing hai dian qu qing he zhen", "name": "lisi" } }, { "_index": "lib3", "_type": "user", "_id": "2", "_score": 0.47000363, "_source": { "address": "bei jing hai dian qu qing he zhen", "name": "zhaoming" } }

took：查询耗费的时间，单位是毫秒

_shards：共请求了多少个shard

total：查询出的文档总个数

max_score：本次查询中，相关度分数的最大值，文档和此次查询的匹配度越高，_score的值越大，排位越靠前

hits：默认查询前10个文档

timed_out：

GET /lib3/user/_search?timeout=10ms { "_source": ["address","name"], "query": { "match": { "interests": "changge" } } }

3.19 多index，多type查询模式

GET _search

GET /lib/_search

GET /lib,lib3/_search

GET /*3,*4/_search

GET /lib/user/_search

GET /lib,lib4/user,items/_search

GET /_all/_search

GET /_all/user,items/_search

3.20 分页查询中的deep paging问题

GET /lib3/user/_search { "from":0, "size":2, "query":{ "terms":{ "interests": ["hejiu","changge"] } } }

GET /_search?from=0&size=3

deep paging:查询的很深，比如一个索引有三个primary shard，分别存储了6000条数据，我们要得到第100页的数据(每页10条)，类似这种情况就叫deep paging

如何得到第100页的10条数据？

在每个shard中搜索990到999这10条数据，然后用这30条数据排序，排序之后取10条数据就是要搜索的数据，这种做法是错的，因为3个shard中的数据的_score分数不一样，可能这某一个shard中第一条数据的_score分数比另一个shard中第1000条都要高，所以在每个shard中搜索990到999这10条数据然后排序的做法是不正确的。

正确的做法是每个shard把0到999条数据全部搜索出来（按排序顺序），然后全部返回给coordinate node，由coordinate node按_score分数排序后，取出第100页的10条数据，然后返回给客户端。

deep paging性能问题

1.耗费网络带宽，因为搜索过深的话，各shard要把数据传送给coordinate node，这个过程是有大量数据传递的，消耗网络，

2.消耗内存，各shard要把数据传送给coordinate node，这个传递回来的数据，是被coordinate node保存在内存中的，这样会大量消耗内存。

3.消耗cpu coordinate node要把传回来的数据进行排序，这个排序过程很消耗cpu.

鉴于deep paging的性能问题，所以应尽量减少使用。

3.21 query string查询及copy_to解析

GET /lib3/user/_search?q=interests:changge

GET /lib3/user/_search?q=+interests:changge

GET /lib3/user/_search?q=-interests:changge

copy_to字段是把其它字段中的值，以空格为分隔符组成一个大字符串，然后被分析和索引，但是不存储，也就是说它能被查询，但不能被取回显示。

注意:copy_to指向的字段字段类型要为：text

当没有指定field时，就会从copy_to字段中查询 GET /lib3/user/_search?q=changge

3.22字符串排序问题

对一个字符串类型的字段进行排序通常不准确，因为已经被分词成多个词条了

解决方式：对字段索引两次，一次索引分词（用于搜索），一次索引不分词(用于排序)

GET /lib3/_search

GET /lib3/user/_search { "query": { "match_all": {} }, "sort": [ { "interests": { "order": "desc" } } ] }

GET /lib3/user/_search { "query": { "match_all": {} }, "sort": [ { "interests.raw": { "order": "asc" } } ] }

DELETE lib3

PUT /lib3 { "settings":{ "number_of_shards" : 3, "number_of_replicas" : 0 }, "mappings":{ "user":{ "properties":{ "name": {"type":"text"}, "address": {"type":"text"}, "age": {"type":"integer"}, "birthday": {"type":"date"}, "interests": { "type":"text", "fields": { "raw":{ "type": "keyword" } }, "fielddata": true } } } } }

3.23 如何计算相关度分数

使用的是TF/IDF算法(Term Frequency&Inverse Document Frequency)

1.Term Frequency:我们查询的文本中的词条在document本中出现了多少次，出现次数越多，相关度越高

搜索内容： hello world

Hello，I love china.

Hello world,how are you!

2.Inverse Document Frequency：我们查询的文本中的词条在索引的所有文档中出现了多少次，出现的次数越多，相关度越低

搜索内容：hello world

hello，what are you doing?

I like the world.

hello 在索引的所有文档中出现了500次，world出现了100次

3.Field-length(字段长度归约) norm:field越长，相关度越低

搜索内容：hello world

{"title":"hello,what's your name?","content":{"owieurowieuolsdjflk"}}

{"title":"hi,good morning","content":{"lkjkljkj.......world"}}

查看分数是如何计算的：

GET /lib3/user/_search?explain=true { "query":{ "match":{ "interests": "duanlian,changge" } } }

查看一个文档能否匹配上某个查询：

GET /lib3/user/2/_explain { "query":{ "match":{ "interests": "duanlian,changge" } } }