Elasticsearch

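Elasticsearch (ES) is an open-source distributed search and analytics engine, used mainly to store, search, and analyze large volumes of data quickly. It is one of the most widely used enterprise search solutions today.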

1. Install, configure, and start

ES download link: https://www.elastic.co/downloads/past-releases

# Edit elasticsearch.yml in the config directory
xpack.security.enabled: false
xpack.security.enrollment.enabled: false

# Edit jvm.options: adjust the heap size and set the file encoding to avoid garbled console output
-Xms2g
-Xmx2g
-Dfile.encoding=UTF-8

IK analyzer download link: https://release.infinilabs.com/analysis-ik/stable/

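Unzip the plugin into plugins/ik/ so that its files sit directly under that directory, not in a nested subfolder:

plugins/ik/
  ├── plugin-descriptor.properties  ← directly under ik/
  └── ...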

Start
bin\elasticsearch.bat
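
Once the process is up, a quick way to confirm it is answering requests is to fetch the root endpoint. The sketch below assumes the default HTTP port 9200 on localhost and that security is disabled as configured above; the function name `cluster_info` is made up for this example.

```python
import json
import urllib.request
from http.client import HTTPException


def cluster_info(base_url="http://localhost:9200", timeout=3.0):
    """Return the JSON served at the ES root endpoint, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError, HTTPException):
        # Connection refused, timeout, HTTP error, or a non-JSON reply.
        return None


info = cluster_info()
if isinstance(info, dict):
    # The root endpoint normally reports the version under version.number.
    print("Elasticsearch version:", info.get("version", {}).get("number"))
else:
    print("Elasticsearch is not reachable yet")
```

Startup can take a while, so a "not reachable" result right after launching the script does not necessarily mean the configuration is wrong.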

2. Inverted index

Unlike the B+Tree indexes in MySQL, ES uses an inverted index: it first tokenizes the data, then finds documents by looking up the resulting terms.

Document: each record is a document.
Term: a word produced by splitting a document's text along semantic boundaries.

Original data (forward index)           Inverted index
┌────┬──────────────┬───────┐           ┌────────┬──────────┐
│ id │ title        │ price │           │ term   │ doc ids  │
├────┼──────────────┼───────┤           ├────────┼──────────┤
│ 1  │ 小米手机     │ 3499  │     →     │ 小米   │ 1, 3, 4  │
│ 2  │ 华为手机     │ 4999  │           │ 手机   │ 1, 2     │
│ 3  │ 华为小米充电器│ 49   │           │ 华为   │ 2, 3     │
│ 4  │ 小米手环     │ 299   │           │ 充电器 │ 3        │
└────┴──────────────┴───────┘           │ 手环   │ 4        │
                                        └────────┴──────────┘

(小米 = Xiaomi, 华为 = Huawei, 手机 = phone, 充电器 = charger, 手环 = wristband)

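For example, when a query arrives, the query text is tokenized first and each term is looked up in the inverted index. The matching document id lists are merged and deduplicated, and the full rows are then fetched from the original data by id.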
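
The lookup flow described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how ES or Lucene store the index internally: the `tokenize` function is a hypothetical stand-in for a real analyzer (greedy matching against a tiny hand-written vocabulary), and the data is the four-row table from this section.

```python
from collections import defaultdict

# Sample rows from the table above (the forward index).
docs = {
    1: {"title": "小米手机", "price": 3499},
    2: {"title": "华为手机", "price": 4999},
    3: {"title": "华为小米充电器", "price": 49},
    4: {"title": "小米手环", "price": 299},
}

# Hypothetical vocabulary standing in for a real analyzer's dictionary.
VOCAB = sorted(["充电器", "小米", "华为", "手机", "手环"], key=len, reverse=True)


def tokenize(text):
    """Greedy longest-match split of text into vocabulary words."""
    tokens, i = [], 0
    while i < len(text):
        for word in VOCAB:
            if text.startswith(word, i):
                tokens.append(word)
                i += len(word)
                break
        else:
            i += 1  # skip a character that no vocabulary word covers
    return tokens


def build_inverted_index(documents):
    """Map each term to the set of document ids whose title contains it."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for term in tokenize(doc["title"]):
            index[term].add(doc_id)
    return index


def search(index, documents, query):
    """Tokenize the query, union the posting lists, then fetch rows by id."""
    ids = set()
    for term in tokenize(query):
        ids |= index.get(term, set())
    return [documents[i] for i in sorted(ids)]


index = build_inverted_index(docs)
for row in search(index, docs, "华为手机"):
    print(row)
```

Searching for "华为手机" splits into "华为" and "手机", whose posting lists {2, 3} and {1, 2} merge into documents 1, 2, and 3. The sketch only collects matches; it does not rank them by relevance.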

3. IK analyzer

Analyzers: ik_smart (coarsest-grained split) and ik_max_word (finest-grained split)

POST /_analyze
{
  "analyzer": "ik_smart",
  "text": "黑马程序员学习java太棒了"
}

#! Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security.
{
  "tokens" : [
    {
      "token" : "黑马",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "程序员",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "学习",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "java",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "ENGLISH",
      "position" : 3
    },
    {
      "token" : "太棒了",
      "start_offset" : 11,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}
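
To read the response, it helps to see that start_offset and end_offset are character positions in the submitted text. The sketch below copies the tokens from the response above and slices the original string with them. It assumes every character is in the Basic Multilingual Plane, where these offsets line up with Python string indices.

```python
text = "黑马程序员学习java太棒了"

# (token, start_offset, end_offset, type) copied from the response above.
tokens = [
    ("黑马", 0, 2, "CN_WORD"),
    ("程序员", 2, 5, "CN_WORD"),
    ("学习", 5, 7, "CN_WORD"),
    ("java", 7, 11, "ENGLISH"),
    ("太棒了", 11, 14, "CN_WORD"),
]

# Slice the source text with each token's offsets.
slices = [text[start:end] for _, start, end, _ in tokens]

for (token, start, end, kind), piece in zip(tokens, slices):
    print(f"{start:>2}-{end:<2} {kind:<8} {token} -> {piece}")
```

Each slice reproduces its token, and in this ik_smart result the tokens cover the input end to end without overlapping.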

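Extension dictionary: to extend the analyzer's built-in dictionary, create a .dic file (one word per line, UTF-8 encoded) and register it in plugins/ik/config/IKAnalyzer.cfg.xml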

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<!-- Configure your own extension dictionary here -->
	<entry key="ext_dict">yyh_dict.dic</entry>
	<!-- Configure your own extension stop-word dictionary here -->
	<entry key="ext_stopwords">yyh_stop.dic</entry>
	<!-- Configure a remote extension dictionary here -->
	<!-- <entry key="remote_ext_dict">words_location</entry> -->
	<!-- Configure a remote extension stop-word dictionary here -->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
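
The dictionary files named in the configuration are plain text, one word per line. As a rough sketch, a file like yyh_dict.dic could be generated as below. The helper `write_dictionary` and the sample words are made up for illustration, and the example writes to a temporary directory rather than the real plugins/ik/config folder.

```python
import tempfile
from pathlib import Path


def write_dictionary(path, words):
    """Write one word per line as UTF-8, dropping blanks and duplicates."""
    unique = []
    for word in words:
        word = word.strip()
        if word and word not in unique:
            unique.append(word)
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique


# Demo in a throwaway directory; in practice the target would be the
# IK config directory that holds IKAnalyzer.cfg.xml.
with tempfile.TemporaryDirectory() as tmp:
    dict_path = Path(tmp) / "yyh_dict.dic"
    written = write_dictionary(dict_path, ["黑马程序员", "太棒了", "黑马程序员", " "])
    content = dict_path.read_text(encoding="utf-8")
    print(content)
```

A locally configured dictionary is typically read when the node starts, so expect to restart ES after changing it.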

4. Basic concepts

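Index: a collection of documents of the same type, similar to a table in MySQL.
Mapping: the constraints on the fields of the documents in an index, similar to a table schema in MySQL.
DSL: the JSON-style request language, analogous to SQL in MySQL.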

5. Index operations

Mapping properties

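type: the field's data type. Common simple types:
    String: text (analyzed, tokenizable text), keyword (exact values, e.g. brand, country, IP address)
    Numeric: long, integer, short, byte, double, float
    Boolean: boolean
    Date: date
    Object: object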

index: whether the field is indexed (and therefore searchable); defaults to true

analyzer: which analyzer to use for the field

properties: the subfields of this field
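
To see how these four properties combine, the sketch below builds a mapping body as a Python dict and prints it as JSON. The field names (title, brand, image_url, seller, and so on) and the helper `field` are invented for this example; the printed JSON has the shape that a create-index request such as PUT /<index-name> expects under "mappings".

```python
import json


def field(type_, *, index=True, analyzer=None, properties=None):
    """Build one field definition, emitting only the non-default settings."""
    spec = {"type": type_}
    if not index:
        spec["index"] = False  # stored but not searchable
    if analyzer is not None:
        spec["analyzer"] = analyzer
    if properties is not None:
        spec["properties"] = properties  # subfields of an object field
    return spec


# Hypothetical product index used only for illustration.
mapping = {
    "mappings": {
        "properties": {
            "title": field("text", analyzer="ik_smart"),
            "brand": field("keyword"),
            "price": field("integer"),
            "image_url": field("keyword", index=False),
            "created_at": field("date"),
            "seller": field("object", properties={
                "name": field("keyword"),
                "verified": field("boolean"),
            }),
        }
    }
}

body = json.dumps(mapping, ensure_ascii=False, indent=2)
print(body)
```

Here title is analyzed text, brand is an exact-match keyword, image_url is kept but not indexed, and seller shows properties nesting subfields inside an object.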

Kibana

Kibana is the visual web interface for ES.

1. Install, configure, and start

Download link: https://www.elastic.co/downloads/past-releases?product=kibana

Configuration (config/kibana.yml)

# Server port
server.port: 5601

# Allow external access (or set a specific IP)
server.host: "0.0.0.0"

# Connect to Elasticsearch
elasticsearch.hosts: ["http://localhost:9200"]

# Chinese-language UI
i18n.locale: "zh-CN"

Start
bin\kibana.bat