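Monitoring Applications with Prometheus

Using Prometheus

Installation

Installing Prometheus on a single machine is straightforward, and that is the setup used here. The download locations for Prometheus and its related components are given in the commands in each section.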

# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.17.1/prometheus-2.17.1.linux-amd64.tar.gz
tar zxf prometheus-2.17.1.linux-amd64.tar.gz && mv prometheus-2.17.1.linux-amd64 /opt/prometheus

# Create the Prometheus data directory
mkdir -p /opt/prometheus/data

Create the Prometheus systemd unit file

cat >/usr/lib/systemd/system/prometheus.service<<EOF
[Unit]
Description=Prometheus
Documentation=https://prometheus.io/docs
Wants=network-online.target
After=network-online.target

[Service]
User=root
Group=root
Type=simple
ExecStart=/opt/prometheus/prometheus --config.file=/opt/prometheus/prometheus.yml --storage.tsdb.path=/opt/prometheus/data
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Reload systemd so it picks up the new unit, then start Prometheus
systemctl daemon-reload
systemctl restart prometheus.service && systemctl status prometheus.service

# Default port: 9090

Deploy node_exporter

node_exporter collects basic host metrics such as CPU, memory, disk, and network interfaces.

# Download node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.0.0-rc.0/node_exporter-1.0.0-rc.0.linux-amd64.tar.gz
tar zxf node_exporter-1.0.0-rc.0.linux-amd64.tar.gz && mv node_exporter-1.0.0-rc.0.linux-amd64/node_exporter /usr/bin/ && rm -rf node_exporter-1.0.0-rc.0.linux-amd64*

Start node_exporter at boot

cat >/usr/lib/systemd/system/node_exporter.service<<EOF
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/docs
After=network.target

[Service]
User=root
Group=root
Type=simple
ExecStart=/usr/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Enable node_exporter at boot and start it
systemctl daemon-reload
systemctl enable node_exporter.service
systemctl start node_exporter.service && systemctl status node_exporter.service

# Default port: 9100
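
Once the service is up, a quick way to confirm that it is exporting data is to fetch its /metrics page and pull out one metric. The following is a minimal sketch, not part of the original setup: the helper name metric_samples is made up for illustration, and it assumes the plain-text exposition format without trailing timestamps (which is what node_exporter normally emits).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read Prometheus text-format metrics on stdin and
# print the value of every sample whose metric name is exactly $1.
metric_samples() {
  awk -v name="$1" '
    /^#/ { next }                 # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)    # drop the {label="..."} part, if any
      if (metric == name) print $NF
    }'
}

# Example (not run here): show the 1-minute load average reported by the exporter.
# curl -s http://localhost:9100/metrics | metric_samples node_load1
```

If the command prints nothing, either the exporter is not reachable on port 9100 or the metric name is not exposed on that host.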

Install mysqld_exporter

mysqld_exporter collects metrics from a MySQL database.

# Download mysqld_exporter and install the binary to /usr/bin (the path used by the unit file)
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.12.1/mysqld_exporter-0.12.1.linux-amd64.tar.gz \
&& tar zxf mysqld_exporter-0.12.1.linux-amd64.tar.gz && mv mysqld_exporter-0.12.1.linux-amd64/mysqld_exporter /usr/bin/ && rm -rf mysqld_exporter-0.12.1.linux-amd64*

Create the MySQL user and grants

mysqld_exporter needs to connect to MySQL, so first create a user for it and grant the required privileges:

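-- Run in the mysql client. Replace '123456' with a strong password.
CREATE USER 'exporter'@'localhost' IDENTIFIED BY '123456';
GRANT REPLICATION CLIENT, PROCESS ON *.* TO 'exporter'@'localhost';
GRANT SELECT ON performance_schema.* TO 'exporter'@'localhost';
FLUSH PRIVILEGES;

# Create the .my.cnf file (run in a shell)
# It is created in the current user's home directory here; the location can be changed,
# but it must match the --config.my-cnf path passed to mysqld_exporter (/root/.my.cnf in this article).
cat > .my.cnf<<EOF
[client]
user=exporter
password=123456
EOF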

Start mysqld_exporter at boot

cat >/usr/lib/systemd/system/mysql_exporter.service<<EOF
[Unit]
Description=mysqld_exporter
Documentation=https://prometheus.io/docs
After=network.target

[Service]
User=root
Group=root
Type=simple
ExecStart=/usr/bin/mysqld_exporter \
--collect.info_schema.processlist \
--collect.info_schema.innodb_tablespaces \
--collect.info_schema.innodb_metrics \
--collect.perf_schema.tableiowaits \
--collect.perf_schema.indexiowaits \
--collect.perf_schema.tablelocks \
--collect.engine_innodb_status \
--collect.perf_schema.file_events \
--collect.binlog_size \
--collect.info_schema.clientstats \
--collect.perf_schema.eventswaits \
--config.my-cnf=/root/.my.cnf
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Enable mysql_exporter at boot and start it
systemctl daemon-reload
systemctl enable mysql_exporter.service
systemctl start mysql_exporter.service && systemctl status mysql_exporter.service

# Default port: 9104

Add Grafana dashboards for mysqld_exporter:

  • Master/slave replication monitoring (dashboard 7371)
  • MySQL status monitoring (dashboard 7362)
  • Buffer pool status (dashboard 7365)

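File-based dynamic target loading in Prometheus

File-based service discovery is the most general-purpose mechanism. It does not depend on any platform or third-party service, which matters because Prometheus cannot support every platform or environment natively. With file-based discovery, Prometheus periodically re-reads the latest target information from files, so the targets can be written to those files by any means.
Prometheus can take the definition of all monitoring targets from JSON or YAML files; YAML is used below. When adding instances, some extra labels such as env, service, and group are attached. Every sample scraped from those instances carries these labels, so the data can later be aggregated by environment using them.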
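
Because the target files can be produced by any tool, a small script is often enough. The sketch below is one possible way to do it, under stated assumptions: the function name write_target_group and its argument order are invented for illustration, and the label names (env, service, group) simply mirror the ones used in this article. It writes to a temporary file first and then renames it, so that a reader of the file should not see a half-written version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write one file_sd target group as YAML.
# Usage: write_target_group FILE ENV SERVICE GROUP TARGET...
write_target_group() {
  local file=$1 env=$2 service=$3 group=$4
  shift 4
  local tmp t
  tmp=$(mktemp "${file}.XXXXXX") || return 1
  {
    printf -- '- labels:\n'
    printf '    env: %s\n' "$env"
    printf '    service: %s\n' "$service"
    printf '    group: %s\n' "$group"
    printf '  targets:\n'
    for t in "$@"; do
      printf '  - %s\n' "$t"
    done
  } > "$tmp"
  chmod 644 "$tmp"      # mktemp creates the file with mode 600
  mv "$tmp" "$file"     # rename within one directory replaces the file in one step
}

# Example (not run here):
# write_target_group /opt/prometheus/sd_config/vm.yml test vm linux_node 172.21.1.30:9100 172.21.1.52:9100
```

The temporary file gets a random suffix after the .yml extension, so it does not match the exact path configured under file_sd_configs while it is being written.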

Edit prometheus.yml

cat prometheus.yml
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  scrape_timeout: 10s
  # scrape_timeout is set to the global default (10s).
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
scrape_configs:
  - job_name: 'kxl_docker'
    file_sd_configs:
    - files:
      - /opt/prometheus/sd_config/docker.yml
      refresh_interval: 5s
  - job_name: 'kxl_vm'
    file_sd_configs:
    - files:
      - /opt/prometheus/sd_config/vm.yml
      refresh_interval: 5s
  - job_name: 'kxl_mysql'
    file_sd_configs:
    - files:
      - /opt/prometheus/sd_config/mysql.yml
      refresh_interval: 5s

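Three jobs are defined under scrape_configs here, for monitoring Docker, VMs, and MySQL, and each job reads its targets from its own yml file. To monitor another instance of a service, add it to the corresponding file. Further jobs can be added in the same way for services such as ZooKeeper, Elasticsearch, Nginx, and Redis.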
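
Since every job depends on its target file being present, it can help to check that each file listed in the configuration actually exists. The following is a rough sketch rather than a real YAML parser: the function name list_sd_files is made up, and it assumes the layout used in this article, where each path sits on its own "- path" line directly below a "files:" line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a Prometheus config on stdin and print every
# path that appears as a list item directly under a "files:" key.
list_sd_files() {
  awk '
    /^[ \t]*-?[ \t]*files:[ \t]*$/ { in_files = 1; next }
    in_files && /^[ \t]*-[ \t]+/ {
      sub(/^[ \t]*-[ \t]+/, "")   # strip the leading "- "
      print
      next
    }
    { in_files = 0 }              # any other line ends the list
  '
}

# Example (not run here): report target files that are referenced but missing.
# list_sd_files < /opt/prometheus/prometheus.yml | while read -r f; do
#   [ -f "$f" ] || echo "missing: $f"
# done
```

The pattern is anchored so that the rule_files: key is not mistaken for a files: list.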

Create the target files

mkdir -p /opt/prometheus/sd_config && cd /opt/prometheus/sd_config
  • docker.yml
- labels:
    service: docker
    env: test
    group: docker
  targets:
  - 172.21.1.30:8080
  • vm.yml
- labels:
    env: test
    group: linux_node
    service: vm
  targets:
  - 172.21.1.30:9100
  - 172.21.1.52:9100
  • mysql.yml
- labels:
    service: mysql
    env: test
    group: mysql
  targets:
  - 172.21.1.30:9104

- labels:
    service: mysql
    env: dev
    group: mysql
  targets:
  - 172.21.1.52:9104

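The Targets page of the Prometheus UI now shows the instances discovered dynamically from the yml files defined above, together with the scrape status of each job. The Labels column includes the custom labels that were added.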

The Service Discovery page of the Prometheus UI shows the jobs that were defined.

Deploy Alertmanager

wget https://github.com/prometheus/alertmanager/releases/download/v0.20.0/alertmanager-0.20.0.linux-amd64.tar.gz
tar zxf alertmanager-0.20.0.linux-amd64.tar.gz && mv alertmanager-0.20.0.linux-amd64 /opt/alertmanager && rm -rf alertmanager-0.20.0.linux-amd64*

# Alertmanager stores its data locally; the default storage path is data/. Create the directory before starting Alertmanager.
mkdir -p /opt/alertmanager/data

Start Alertmanager at boot

cat >/usr/lib/systemd/system/alertmanager.service<<EOF
[Unit]
Description=alertmanager
Documentation=https://prometheus.io/docs
Wants=network-online.target
After=network-online.target

[Service]
User=root
Group=root
Type=simple
ExecStart=/opt/alertmanager/alertmanager --config.file=/opt/alertmanager/alertmanager.yml --storage.path=/opt/alertmanager/data
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Enable Alertmanager at boot and start it
systemctl daemon-reload
systemctl enable alertmanager
systemctl start alertmanager && systemctl status alertmanager

Update the Prometheus configuration to load Alertmanager and the alerting rules

cat prometheus.yml
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  scrape_timeout: 10s
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 172.21.1.30:9093
rule_files:
  - 'rules/*.rules'
scrape_configs:
  - job_name: 'kxl_prometheus'
    file_sd_configs:
    - files:
      - /opt/prometheus/sd_config/data.yml
      refresh_interval: 5s
  - job_name: 'kxl_docker'
    file_sd_configs:
    - files:
      - /opt/prometheus/sd_config/docker.yml
      refresh_interval: 5s
  - job_name: 'kxl_vm'
    file_sd_configs:
    - files:
      - /opt/prometheus/sd_config/vm.yml
      refresh_interval: 5s
  - job_name: 'kxl_mysql'
    file_sd_configs:
    - files:
      - /opt/prometheus/sd_config/mysql.yml
      refresh_interval: 5s

# Restart Prometheus
systemctl restart prometheus

Create the alerting rules

  • node rules

    mkdir -p /opt/prometheus/rules

    # The quoted 'EOF' stops the shell from expanding $labels and $value inside the file.
    cat > /opt/prometheus/rules/node.rules <<'EOF'
    groups:
    - name: kxl_Instances
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        # Prometheus templates apply here in the annotation and label fields of the alert.
        annotations:
          description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.'
          summary: 'Instance {{ $labels.instance }} down'

      - alert: High memory usage
        expr: 100-(node_memory_Buffers_bytes+node_memory_Cached_bytes+node_memory_MemFree_bytes)/node_memory_MemTotal_bytes*100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} memory usage is too high"
          description: "{{ $labels.instance }} of job {{ $labels.job }}: memory usage is above 80%, current value [{{ $value }}]."

      - alert: High CPU usage
        expr: 100-avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance, job)*100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} CPU usage is too high"
          description: "{{ $labels.instance }} of job {{ $labels.job }}: CPU usage is above 80%, current value [{{ $value }}]."
    EOF
  • mysql rules

    # The quoted 'EOF' stops the shell from expanding $labels inside the file.
    cat > /opt/prometheus/rules/mysql.rules <<'EOF'
    groups:
    - name: MySQLStatsAlert
      rules:
      - alert: MySQL is down
        expr: mysql_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} MySQL is down"
          description: "MySQL database is down. This requires immediate action!"
      - alert: open files high
        expr: mysql_global_status_innodb_num_open_files > (mysql_global_variables_open_files_limit) * 0.75
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} open files high"
          description: "Open files is high. Please consider increasing open_files_limit."
      - alert: Read buffer size is bigger than max. allowed packet size
        expr: mysql_global_variables_read_buffer_size > mysql_global_variables_slave_max_allowed_packet
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} Read buffer size is bigger than max. allowed packet size"
          description: "Read buffer size (read_buffer_size) is bigger than max. allowed packet size (max_allowed_packet). This can break your replication."
      - alert: Sort buffer possibly misconfigured
        expr: mysql_global_variables_innodb_sort_buffer_size < 256*1024 or mysql_global_variables_read_buffer_size > 4*1024*1024
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} Sort buffer possibly misconfigured"
          description: "Sort buffer size is either too big or too small. A good value for sort_buffer_size is between 256k and 4M."
      - alert: Thread stack size is too small
        expr: mysql_global_variables_thread_stack < 196608
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} Thread stack size is too small"
          description: "Thread stack size is too small. This can cause problems when you use Stored Language constructs for example. A typical value is 256k for thread_stack_size."
      - alert: Used more than 80% of max connections limited
        expr: mysql_global_status_max_used_connections > mysql_global_variables_max_connections * 0.8
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} Used more than 80% of max connections limited"
          description: "Used more than 80% of max connections limited"
      - alert: InnoDB Force Recovery is enabled
        expr: mysql_global_variables_innodb_force_recovery != 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} InnoDB Force Recovery is enabled"
          description: "InnoDB Force Recovery is enabled. This mode should be used for data recovery purposes only. It prohibits writing to the data."
      - alert: InnoDB Log File size is too small
        expr: mysql_global_variables_innodb_log_file_size < 16777216
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} InnoDB Log File size is too small"
          description: "The InnoDB Log File size is possibly too small. Choosing a small InnoDB Log File size can have significant performance impacts."
      - alert: InnoDB Flush Log at Transaction Commit
        expr: mysql_global_variables_innodb_flush_log_at_trx_commit != 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} InnoDB Flush Log at Transaction Commit"
          description: "InnoDB Flush Log at Transaction Commit is set to a value != 1. This can lead to a loss of committed transactions in case of a power failure."
      - alert: Table definition cache too small
        expr: mysql_global_status_open_table_definitions > mysql_global_variables_table_definition_cache
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} Table definition cache too small"
          description: "Your Table Definition Cache is possibly too small. If it is much too small this can have significant performance impacts!"
      - alert: Table open cache too small
        expr: mysql_global_status_open_tables > mysql_global_variables_table_open_cache * 99/100
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} Table open cache too small"
          description: "Your Table Open Cache is possibly too small (old name Table Cache). If it is much too small this can have significant performance impacts!"
      - alert: Thread stack size is possibly too small
        expr: mysql_global_variables_thread_stack < 262144
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} Thread stack size is possibly too small"
          description: "Thread stack size is possibly too small. This can cause problems when you use Stored Language constructs for example. A typical value is 256k for thread_stack_size."
      - alert: InnoDB Buffer Pool Instances is too small
        expr: mysql_global_variables_innodb_buffer_pool_instances == 1
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} InnoDB Buffer Pool Instances is too small"
          description: "If you are using MySQL 5.5 and higher you should use several InnoDB Buffer Pool Instances for performance reasons. Some rules are: InnoDB Buffer Pool Instance should be at least 1 Gbyte in size. InnoDB Buffer Pool Instances you can set equal to the number of cores of your machine."
      - alert: InnoDB Plugin is enabled
        expr: mysql_global_variables_ignore_builtin_innodb == 1
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} InnoDB Plugin is enabled"
          description: "InnoDB Plugin is enabled"
      - alert: Binary Log is disabled
        expr: mysql_global_variables_log_bin != 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} Binary Log is disabled"
          description: "Binary Log is disabled. This prohibits you to do Point in Time Recovery (PiTR)."
      - alert: Binlog Cache size too small
        expr: mysql_global_variables_binlog_cache_size < 1048576
        for: 1m
        labels:
          severity: page
        annotations:
          env: "{{ $labels.env }}"
          summary: "Instance {{ $labels.instance }} Binlog Cache size too small"
          description: "Binlog Cache size is possibly too small. A value of 1 Mbyte or higher is OK."
      - alert: Binlog Statement Cache size too small
        expr: mysql_global_variables_binlog_stmt_cache_size < 1048576 and mysql_global_variables_binlog_stmt_cache_size > 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} Binlog Statement Cache size too small"
          description: "Binlog Statement Cache size is possibly too small. A value of 1 Mbyte or higher is typically OK."
      - alert: Binlog Transaction Cache size too small
        expr: mysql_global_variables_binlog_cache_size < 1048576
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} Binlog Transaction Cache size too small"
          description: "Binlog Transaction Cache size is possibly too small. A value of 1 Mbyte or higher is typically OK."
      - alert: Sync Binlog is enabled
        expr: mysql_global_variables_sync_binlog == 1
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} Sync Binlog is enabled"
          description: "Sync Binlog is enabled. This leads to higher data security but at the cost of write performance."
      - alert: IO thread stopped
        expr: mysql_slave_status_slave_io_running != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} IO thread stopped"
          description: "IO thread has stopped. This is usually because it cannot connect to the Master any more."
      - alert: SQL thread stopped
        expr: mysql_slave_status_slave_sql_running == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} SQL thread stopped"
          description: "SQL thread has stopped. This is usually because it cannot apply a SQL statement received from the master."
      - alert: SQL thread stopped
        expr: mysql_slave_status_slave_sql_running != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} SQL thread stopped"
          description: "SQL thread has stopped. This is usually because it cannot apply a SQL statement received from the master."
      - alert: Slave lagging behind Master
        # seconds_behind_master is a gauge, so it is compared directly instead of through rate()
        expr: mysql_slave_status_seconds_behind_master > 30
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} Slave lagging behind Master"
          description: "Slave is lagging behind Master. Please check if Slave threads are running and if there are some performance issues!"
      - alert: Slave is NOT read only(Please ignore this warning indicator.)
        expr: mysql_global_variables_read_only != 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} Slave is NOT read only"
          description: "Slave is NOT set to read only. You can accidentally manipulate data on the slave and get inconsistencies..."
    EOF
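
To make the memory rule in node.rules easier to reason about, the same arithmetic can be worked through by hand. The sketch below mirrors the rule's expression, 100 - (Buffers + Cached + MemFree) / MemTotal * 100; the function name mem_used_percent and the sample numbers are purely illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the memory expression in node.rules.
# Usage: mem_used_percent BUFFERS CACHED MEMFREE MEMTOTAL   (all in the same unit)
mem_used_percent() {
  awk -v b="$1" -v c="$2" -v f="$3" -v t="$4" \
    'BEGIN { printf "%.1f\n", 100 - (b + c + f) / t * 100 }'
}

# 200 + 1800 + 2000 = 4000 of 8000 is reclaimable or free, so usage is 50%.
mem_used_percent 200 1800 2000 8000   # prints 50.0
```

The alert fires only when this value stays above the threshold in the rule's expr for the whole "for" duration.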

Configure alert routing and receivers

cat alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.exmail.qq.com:465'
  smtp_from: 'zxc@xxlaila.cn.com'
  smtp_auth_username: 'zxc@xxlaila.cn.com'
  smtp_auth_password: '123456'
  smtp_require_tls: true
  hipchat_api_url: 'https://hipchat.foobar.org/'
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_secret: 'KJfj93r21389usdas0i--234jsnjkhf23sjkfjsfs' # WeChat Work Secret
  wechat_api_corp_id: 'wwa98423u9skdnkjahs' # WeChat Work CorpId

templates:
  - 'template/*.tmpl' # alert notification templates

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  #receiver: 'web.hook'
  receiver: default
  routes:
  - receiver: 'wechat'
    continue: true

receivers:
#- name: 'web.hook'
- name: 'default'
  email_configs:
  - to: 'cq_xxlaila@163.com'
    html: '{{ template "test.html" . }}'
    headers: { Subject: "[WARN] email"}
    send_resolved: true
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
- name: 'wechat'
  wechat_configs:
  - send_resolved: true
    to_user: '@all' # recipients; @all means everyone
    to_party: '4' # ID of the receiving department
    agent_id: '1000002' # ID of the custom WeChat Work application
    corp_id: 'wwa98457kdsnkdnsadmsdnas' # WeChat Work CorpId
    message: '{{ template "test_wechat.html" . }}' # message template

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

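Alertmanager is responsible for handling the alerts generated by Prometheus in one place, so an Alertmanager configuration generally contains the following main sections:

  • Global configuration (global): defines shared parameters such as the global SMTP settings, Slack settings, and so on;
  • Templates (templates): defines the templates used for alert notifications, such as HTML templates and email templates;
  • Alert routing (route): decides how an alert is handled by matching on its labels;
  • Receivers (receivers): a receiver is an abstract concept; it can be a mailbox, WeChat, Slack, a webhook, and so on, and is normally used together with alert routing;
  • Inhibition rules (inhibit_rules): well-chosen inhibition rules reduce the number of noisy alerts.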
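
One way to see how these sections work together is to push a hand-made alert into Alertmanager and watch which receiver it reaches. The sketch below only builds the JSON body for Alertmanager's POST /api/v2/alerts endpoint; the function name test_alert_payload and the label values are invented for illustration, and the address in the commented curl call is the Alertmanager host used in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a minimal JSON body for POST /api/v2/alerts.
# Usage: test_alert_payload ALERTNAME SEVERITY INSTANCE SUMMARY
# Values are inserted verbatim, so they must not contain quotes or backslashes.
test_alert_payload() {
  printf '[{"labels":{"alertname":"%s","severity":"%s","instance":"%s"},"annotations":{"summary":"%s"}}]\n' \
    "$1" "$2" "$3" "$4"
}

# Example (not run here): send a test alert and check where the notification arrives.
# curl -s -XPOST -H 'Content-Type: application/json' \
#   -d "$(test_alert_payload RouteTest warning 172.21.1.30:9100 'routing test')" \
#   http://172.21.1.30:9093/api/v2/alerts
```

Because no endsAt is set, Alertmanager should treat the alert as resolved once resolve_timeout has passed without it being sent again.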

Configuring the .tmpl templates

# Create the directory that holds the .tmpl templates
mkdir -p /opt/alertmanager/template && cd /opt/alertmanager/template

# WeChat Work template (the quoted 'EOF' stops the shell from expanding $i and $alert)
cat >test_wechat.tmpl <<'EOF'
{{ define "test_wechat.html" }}
{{ range $i, $alert := .Alerts.Firing }}
[Alert]: {{ index $alert.Labels "alertname" }}
[Environment]: {{ index $alert.Labels "env" }}
[Instance]: {{ index $alert.Labels "instance" }}
[Severity]: {{ index $alert.Labels "severity" }}
[Summary]: {{ index $alert.Annotations "summary" }}
[Description]: {{ index $alert.Annotations "description" }}
[Started at]: {{ $alert.StartsAt }}
{{ end }}
{{ end }}
EOF

# Email template
cat >test.tmpl <<'EOF'
{{ define "test.html" }}
<table border="1">
<tr>
<td>Alert</td>
<td>Environment</td>
<td>Instance</td>
<td>Severity</td>
<td>Summary</td>
<td>Description</td>
<td>Started at</td>
</tr>
{{ range $i, $alert := .Alerts }}
<tr>
<td>{{ index $alert.Labels "alertname" }}</td>
<td>{{ index $alert.Labels "env"}}</td>
<td>{{ index $alert.Labels "instance" }}</td>
<td>{{ index $alert.Labels "severity" }}</td>
<td>{{ index $alert.Annotations "summary" }}</td>
<td>{{ index $alert.Annotations "description" }}</td>
<td>{{ $alert.StartsAt }}</td>
</tr>
{{ end }}
</table>
{{ end }}
EOF

# Restart Alertmanager
systemctl restart alertmanager

[Screenshot: alert notification received in WeChat Work]
