Monitoring Applications with Prometheus

Using Prometheus

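Installation

A standalone Prometheus installation is straightforward, so a single node is used here. Prometheus and its related components are downloaded from their GitHub release pages.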

# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.17.1/prometheus-2.17.1.linux-amd64.tar.gz
tar zxf prometheus-2.17.1.linux-amd64.tar.gz && mv prometheus-2.17.1.linux-amd64 /opt/prometheus

# Create the Prometheus data directory
mkdir -p /opt/prometheus/data

Create the Prometheus systemd unit file

cat >/usr/lib/systemd/system/prometheus.service<<EOF
[Unit]
Description=Prometheus
Documentation=https://prometheus.io/docs
Wants=network-online.target
After=network-online.target

[Service]
User=root
Group=root
Type=simple
ExecStart=/opt/prometheus/prometheus --config.file=/opt/prometheus/prometheus.yml --storage.tsdb.path=/opt/prometheus/data
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

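# Start Prometheus (reload systemd first so it picks up the new unit file)
systemctl daemon-reload && systemctl restart prometheus.service && systemctl status prometheus.service

# The default port is 9090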
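
Once the service is running, a quick readiness probe confirms that Prometheus is actually serving requests on port 9090. The sketch below assumes Prometheus runs on the local host and uses its `/-/ready` endpoint; the `PROM_URL` variable and the `prom_ready` function name are illustrative, not part of the original setup.

```shell
# Assumption: Prometheus runs on this host on the default port 9090.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# prom_ready URL: succeeds when URL/-/ready answers with HTTP 200.
prom_ready() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 3 "$1/-/ready" || true)
  [ "$code" = "200" ]
}

if prom_ready "$PROM_URL"; then
  echo "Prometheus is ready at $PROM_URL"
else
  echo "Prometheus is not ready at $PROM_URL (check: journalctl -u prometheus)"
fi
```

Set `PROM_URL` to probe a remote instance instead of the local one.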

Deploying node_exporter

It is mainly used to monitor basic server information such as CPU, memory, disk, and network interfaces.

# Download node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.0.0-rc.0/node_exporter-1.0.0-rc.0.linux-amd64.tar.gz
tar zxf node_exporter-1.0.0-rc.0.linux-amd64.tar.gz && mv node_exporter-1.0.0-rc.0.linux-amd64/node_exporter /usr/bin/ && rm -rf node_exporter-1.0.0-rc.0.linux-amd64*

Configure node_exporter to start at boot

cat >/usr/lib/systemd/system/node_exporter.service<<EOF
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/docs
After=network.target

[Service]
User=root
Group=root
Type=simple
ExecStart=/usr/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Enable node_exporter at boot and start it now
systemctl daemon-reload && systemctl enable --now node_exporter.service && systemctl status node_exporter.service

# The default port is 9100
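
To confirm that node_exporter is exposing data, it helps to look at a single metric rather than the whole `/metrics` page. The following is a minimal sketch, assuming node_exporter is reachable on localhost:9100; `metric_samples` is a hypothetical helper that filters the text exposition format by metric name.

```shell
# metric_samples NAME: read Prometheus text-format metrics on stdin and
# print the sample lines for metric NAME, skipping "# HELP"/"# TYPE" comments.
metric_samples() {
  awk -v name="$1" '
    /^#/ { next }
    {
      metric = $1
      sub(/[{].*/, "", metric)   # drop the {label="..."} part, if any
      if (metric == name) print
    }'
}

# Assumption: node_exporter is reachable on localhost:9100.
curl -s --max-time 3 "http://localhost:9100/metrics" | metric_samples node_load1 || true
```

The same filter works for labelled metrics such as `node_cpu_seconds_total`, because the label set is stripped before the name is compared.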

Installing mysqld_exporter

It mainly monitors MySQL database information.

# Download mysqld_exporter
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.12.1/mysqld_exporter-0.12.1.linux-amd64.tar.gz
tar zxf mysqld_exporter-0.12.1.linux-amd64.tar.gz && mv mysqld_exporter-0.12.1.linux-amd64/mysqld_exporter /usr/bin/ && rm -rf mysqld_exporter-0.12.1.linux-amd64*

Create the MySQL connection privileges

mysqld_exporter needs to connect to MySQL, so first create a user for it and grant the required privileges:

GRANT REPLICATION CLIENT, PROCESS ON *.* TO 'exporter'@'localhost' identified by '123456';
GRANT SELECT ON performance_schema.* TO 'exporter'@'localhost';
flush privileges;

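# Create the .my.cnf file in the current user's home directory
# (the location can be changed, but it must match the --config.my-cnf path in the unit file)
cat > ~/.my.cnf<<EOF
[client]
user=exporter
password=123456
EOF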
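
Because the option file holds a plaintext password, it is worth creating it with owner-only permissions. A minimal sketch under that assumption follows; `write_client_cnf` is a hypothetical helper, and the demo writes to a temporary directory rather than the real home directory.

```shell
# write_client_cnf FILE USER PASSWORD: create a MySQL client option file
# that only its owner can read or write.
write_client_cnf() {
  (
    umask 077   # new files are created without group/other permissions
    printf '[client]\nuser=%s\npassword=%s\n' "$2" "$3" > "$1"
  )
  chmod 600 "$1"   # also tighten permissions if the file already existed
}

# Demo in a temporary directory (hypothetical location)
demo_dir=$(mktemp -d)
write_client_cnf "$demo_dir/.my.cnf" exporter 123456
ls -l "$demo_dir/.my.cnf"
```

For the real setup, pass the path that the service is configured to read.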

Configure mysqld_exporter to start at boot

cat >/usr/lib/systemd/system/mysql_exporter.service<<EOF
[Unit]
Description=mysqld_exporter
Documentation=https://prometheus.io/docs
After=network.target

[Service]
User=root
Group=root
Type=simple
ExecStart=/usr/bin/mysqld_exporter \
         --collect.info_schema.processlist \
         --collect.info_schema.innodb_tablespaces \
         --collect.info_schema.innodb_metrics  \
         --collect.perf_schema.tableiowaits \
         --collect.perf_schema.indexiowaits \
         --collect.perf_schema.tablelocks \
         --collect.engine_innodb_status \
         --collect.perf_schema.file_events \
         --collect.binlog_size \
         --collect.info_schema.clientstats \
         --collect.perf_schema.eventswaits \
         --config.my-cnf=/root/.my.cnf
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

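# Enable mysqld_exporter at boot and start it now
systemctl daemon-reload && systemctl enable --now mysql_exporter.service && systemctl status mysql_exporter.service

# The default port is 9104

Use Grafana to add monitoring dashboards for mysqld_exporter:

Master/slave replication cluster monitoring (dashboard 7371)
MySQL status monitoring (dashboard 7362)
Buffer pool status (dashboard 7365)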

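File-based dynamic loading in Prometheus

File-based service discovery is the most general-purpose approach. It does not depend on any platform or third-party service, and Prometheus cannot natively support every platform or environment anyway. With file-based service discovery, Prometheus periodically reads the latest target information from files, so the targets can be written into those files by any means.
Prometheus can define all monitoring targets in JSON or YAML files. The configuration below uses YAML. Extra labels such as env, service, and group are attached when an instance is added. The samples collected from that instance carry these labels, so the data can be aggregated by environment through them.

Modify prometheus.yml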
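
For comparison with the YAML files used in this article, the same target group can be expressed as JSON. The sketch below generates such a file from the shell; `write_json_targets` and the `vm.json` file name are hypothetical, and the file is written under a temporary name and then renamed so that a reader never sees a half-written file.

```shell
# write_json_targets FILE ENV SERVICE GROUP TARGET...
# Writes a file_sd target file in JSON form, via a temporary file and rename.
write_json_targets() {
  file=$1 env_label=$2 service=$3 group=$4
  shift 4
  targets=""
  for t in "$@"; do
    targets="${targets:+$targets, }\"$t\""
  done
  tmp="$file.tmp.$$"
  cat > "$tmp" <<EOF
[
  {
    "targets": [$targets],
    "labels": {"env": "$env_label", "service": "$service", "group": "$group"}
  }
]
EOF
  mv "$tmp" "$file"
}

# Demo in a temporary directory (hypothetical location and file name)
demo_dir=$(mktemp -d)
write_json_targets "$demo_dir/vm.json" test vm linux_node 172.21.1.30:9100 172.21.1.52:9100
cat "$demo_dir/vm.json"
```

A JSON file like this would be listed under `file_sd_configs` in the same way as a YAML one; file names must end in `.json`, `.yml`, or `.yaml`.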

cat prometheus.yml 
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  scrape_timeout: 10s
  # scrape_timeout is set to the global default (10s).
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
scrape_configs:
  - job_name: 'kxl_docker'
    file_sd_configs:
      - files:
          - /opt/prometheus/sd_config/docker.yml
        refresh_interval: 5s
  - job_name: 'kxl_vm'
    file_sd_configs:
      - files:
          - /opt/prometheus/sd_config/vm.yml
        refresh_interval: 5s
  - job_name: 'kxl_mysql'
    file_sd_configs:
      - files:
          - /opt/prometheus/sd_config/mysql.yml
        refresh_interval: 5s

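Three groups are defined under scrape_configs here, for monitoring docker, vm, and mysql, and each group maps to one yml file. To add a target, add it to the file of the corresponding service. Further services (zk, es, ng, redis, and so on) can be added in the same way.

Create the target files to be scanned

mkdir -p /opt/prometheus/sd_config && cd /opt/prometheus/sd_config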

docker.yml

- labels:
    service: docker
    env: test
    group: docker
  targets:
    - 172.21.1.30:8080

vm.yml

- labels:
    env: test
    group: linux_node
    service: vm
  targets:
    - 172.21.1.30:9100
    - 172.21.1.52:9100

mysql.yml

- labels:
    service: mysql
    env: test
    group: mysql
  targets:
    - 172.21.1.30:9104
    
- labels:
    service: mysql
    env: dev
    group: mysql
  targets:
    - 172.21.1.52:9104

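On the Targets page of the Prometheus UI you can see the instances discovered dynamically from the yml files defined above, along with the scrape status of each monitoring job. The Labels column includes the custom labels added by the user.

On the service-discovery page of the Prometheus UI you can see the jobs we defined.

Deploying Alertmanager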
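
The information shown on the Targets page is also available from the HTTP API at `/api/v1/targets`, which is handy for scripted checks. The sketch below counts targets by health state with plain text matching, assuming the API returns compact JSON with fields like `"health":"up"`; `count_health` is a hypothetical helper and `PROM_URL` is an assumed address.

```shell
# Assumption: Prometheus runs on this host on the default port 9090.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# count_health STATE: count targets whose "health" field equals STATE
# in the JSON read from stdin (plain-text matching, no JSON parser needed).
count_health() {
  grep -o "\"health\":\"$1\"" | wc -l | tr -d ' '
}

echo "targets up:   $(curl -s --max-time 3 "$PROM_URL/api/v1/targets" | count_health up)"
echo "targets down: $(curl -s --max-time 3 "$PROM_URL/api/v1/targets" | count_health down)"
```

If a JSON parser such as jq is available, it is a more robust choice than text matching.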

wget https://github.com/prometheus/alertmanager/releases/download/v0.20.0/alertmanager-0.20.0.linux-amd64.tar.gz
tar zxf alertmanager-0.20.0.linux-amd64.tar.gz && mv alertmanager-0.20.0.linux-amd64 /opt/alertmanager && rm -rf alertmanager-0.20.0.linux-amd64*

# Alertmanager stores its data locally; the default storage path is data/. Create the directory before starting Alertmanager
mkdir -p /opt/alertmanager/data

Configure Alertmanager to start at boot

cat >/usr/lib/systemd/system/alertmanager.service<<EOF
[Unit]
Description=alertmanager
Documentation=https://prometheus.io/docs
Wants=network-online.target
After=network-online.target

[Service]
User=root
Group=root
Type=simple
ExecStart=/opt/alertmanager/alertmanager --config.file=/opt/alertmanager/alertmanager.yml --storage.path=/opt/alertmanager/data
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

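# Enable Alertmanager at boot and start it now
systemctl daemon-reload && systemctl enable --now alertmanager && systemctl status alertmanager

Modify the Prometheus configuration to load Alertmanager and the alerting rules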

cat prometheus.yml
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  scrape_timeout: 10s
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 172.21.1.30:9093
rule_files:
  - 'rules/*.rules'
scrape_configs:
  - job_name: 'kxl_promethes'
    file_sd_configs:
      - files:
          - /opt/prometheus/sd_config/data.yml
        refresh_interval: 5s
  - job_name: 'kxl_docker'
    file_sd_configs:
      - files:
          - /opt/prometheus/sd_config/docker.yml
        refresh_interval: 5s
  - job_name: 'kxl_vm'
    file_sd_configs:
      - files:
          - /opt/prometheus/sd_config/vm.yml
        refresh_interval: 5s
  - job_name: 'kxl_mysql'
    file_sd_configs:
      - files:
          - /opt/prometheus/sd_config/mysql.yml
        refresh_interval: 5s

# Restart Prometheus
systemctl restart prometheus
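
A syntax error in prometheus.yml prevents Prometheus from starting, so it can be worth validating the file before restarting. The sketch below assumes `promtool` was unpacked alongside the `prometheus` binary in /opt/prometheus; `safe_restart` is a hypothetical wrapper around `promtool check config`.

```shell
# Assumption: promtool was unpacked alongside prometheus in /opt/prometheus.
PROMTOOL="${PROMTOOL:-/opt/prometheus/promtool}"
CONFIG="${CONFIG:-/opt/prometheus/prometheus.yml}"

# safe_restart: restart Prometheus only when the configuration validates.
safe_restart() {
  if "$PROMTOOL" check config "$CONFIG"; then
    systemctl restart prometheus
  else
    echo "configuration check failed, Prometheus was not restarted" >&2
    return 1
  fi
}
```

Run `safe_restart` in place of a bare restart; on a failed check the running instance is left untouched.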

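Create the alerting rules

Node rules

mkdir -p /opt/prometheus/rules && cd /opt/prometheus/rules

# EOF is quoted so the shell does not expand $labels and $value
cat >node.rules<<'EOF'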
groups:
- name: kxl_Instances 
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: page
    # Prometheus templates apply here in the annotation and label fields of the alert.
    annotations:
      description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.'
      summary: 'Instance {{ $labels.instance }} down'
      
  - alert: Memory usage too high
    expr: 100-(node_memory_Buffers_bytes+node_memory_Cached_bytes+node_memory_MemFree_bytes)/node_memory_MemTotal_bytes*100 > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} memory usage too high"
      description: "{{ $labels.instance }} of job {{$labels.job}} memory usage is above 80%, current usage [{{ $value }}]."

  - alert: CPU usage too high
    expr: 100-avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance, job)*100 > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} CPU usage too high"
      description: "{{ $labels.instance }} of job {{$labels.job}} CPU usage is above 80%, current usage [{{ $value }}]."
EOF
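
The memory rule computes used memory as 100 minus the share of buffers, cache, and free memory in the total. The arithmetic can be checked by hand with sample numbers; the sketch below mirrors the expression with awk, and the byte values are hypothetical.

```shell
# mem_used_pct BUFFERS CACHED FREE TOTAL (all in bytes)
# Mirrors the rule expression: 100 - (buffers + cached + free) / total * 100
mem_used_pct() {
  awk -v b="$1" -v c="$2" -v f="$3" -v t="$4" \
    'BEGIN { printf "%.1f\n", 100 - (b + c + f) / t * 100 }'
}

# Hypothetical sample: 8 GiB total, 0.5 GiB buffers, 1.5 GiB cached, 1 GiB free
mem_used_pct 536870912 1610612736 1073741824 8589934592   # prints 62.5
```

Here 3 GiB of the 8 GiB total is reclaimable or free, so the rule's left-hand side evaluates to 62.5, which is then compared with the rule's threshold.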

MySQL rules

cat > mysql.rules <<'EOF'
groups:
- name: MySQLStatsAlert
  rules:
  - alert: MySQL is down
    expr: mysql_up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} MySQL is down"
      description: "MySQL database is down. This requires immediate action!"
  - alert: open files high
    expr: mysql_global_status_innodb_num_open_files > (mysql_global_variables_open_files_limit) * 0.75
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} open files high"
      description: "Open files is high. Please consider increasing open_files_limit."
  - alert: Read buffer size is bigger than max. allowed packet size
    expr: mysql_global_variables_read_buffer_size > mysql_global_variables_slave_max_allowed_packet 
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Read buffer size is bigger than max. allowed packet size"
      description: "Read buffer size (read_buffer_size) is bigger than max. allowed packet size (max_allowed_packet).This can break your replication."
  - alert: Sort buffer possibly misconfigured
    expr: mysql_global_variables_innodb_sort_buffer_size <256*1024 or mysql_global_variables_read_buffer_size > 4*1024*1024 
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Sort buffer possibly misconfigured"
      description: "Sort buffer size is either too big or too small. A good value for sort_buffer_size is between 256k and 4M."
  - alert: Thread stack size is too small
    expr: mysql_global_variables_thread_stack <196608
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Thread stack size is too small"
      description: "Thread stack size is too small. This can cause problems when you use Stored Language constructs for example. A typical value for thread_stack is 256k."
  - alert: Used more than 80% of max connections limit
    expr: mysql_global_status_max_used_connections > mysql_global_variables_max_connections * 0.8
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Used more than 80% of max connections limit"
      description: "Used more than 80% of max connections limit"
  - alert: InnoDB Force Recovery is enabled
    expr: mysql_global_variables_innodb_force_recovery != 0 
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Force Recovery is enabled"
      description: "InnoDB Force Recovery is enabled. This mode should be used for data recovery purposes only. It prohibits writing to the data."
  - alert: InnoDB Log File size is too small
    expr: mysql_global_variables_innodb_log_file_size < 16777216 
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Log File size is too small"
      description: "The InnoDB Log File size is possibly too small. Choosing a small InnoDB Log File size can have significant performance impacts."
  - alert: InnoDB Flush Log at Transaction Commit
    expr: mysql_global_variables_innodb_flush_log_at_trx_commit != 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Flush Log at Transaction Commit"
      description: "InnoDB Flush Log at Transaction Commit is set to a value != 1. This can lead to a loss of committed transactions in case of a power failure."
  - alert: Table definition cache too small
    expr: mysql_global_status_open_table_definitions > mysql_global_variables_table_definition_cache
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Table definition cache too small"
      description: "Your Table Definition Cache is possibly too small. If it is much too small this can have significant performance impacts!"
  - alert: Table open cache too small
    expr: mysql_global_status_open_tables >mysql_global_variables_table_open_cache * 99/100
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Table open cache too small"
      description: "Your Table Open Cache is possibly too small (old name Table Cache). If it is much too small this can have significant performance impacts!"
  - alert: Thread stack size is possibly too small
    expr: mysql_global_variables_thread_stack < 262144
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Thread stack size is possibly too small"
      description: "Thread stack size is possibly too small. This can cause problems when you use Stored Language constructs for example. A typical value for thread_stack is 256k."
  - alert: InnoDB Buffer Pool Instances is too small
    expr: mysql_global_variables_innodb_buffer_pool_instances == 1
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Buffer Pool Instances is too small"
      description: "If you are using MySQL 5.5 and higher you should use several InnoDB Buffer Pool Instances for performance reasons. Some rules are: InnoDB Buffer Pool Instance should be at least 1 Gbyte in size. InnoDB Buffer Pool Instances you can set equal to the number of cores of your machine."
  - alert: InnoDB Plugin is enabled
    expr: mysql_global_variables_ignore_builtin_innodb == 1
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Plugin is enabled"
      description: "InnoDB Plugin is enabled"
  - alert: Binary Log is disabled
    expr: mysql_global_variables_log_bin != 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Binary Log is disabled"
      description: "Binary Log is disabled. This prevents you from doing Point in Time Recovery (PiTR)."
  - alert: Binlog Cache size too small
    expr: mysql_global_variables_binlog_cache_size < 1048576
    for: 1m
    labels:
      severity: page
    annotations:
      env: "{{ $labels.env }}"
      summary: "Instance {{ $labels.instance }} Binlog Cache size too small"
      description: "Binlog Cache size is possibly too small. A value of 1 Mbyte or higher is OK."
  - alert: Binlog Statement Cache size too small
    expr: mysql_global_variables_binlog_stmt_cache_size <1048576 and mysql_global_variables_binlog_stmt_cache_size > 0
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Binlog Statement Cache size too small"
      description: "Binlog Statement Cache size is possibly too small. A value of 1 Mbyte or higher is typically OK."
  - alert: Binlog Transaction Cache size too small
    expr: mysql_global_variables_binlog_cache_size  <1048576
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Binlog Transaction Cache size too small"
      description: "Binlog Transaction Cache size is possibly too small. A value of 1 Mbyte or higher is typically OK."
  - alert: Sync Binlog is enabled
    expr: mysql_global_variables_sync_binlog == 1
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Sync Binlog is enabled"
      description: "Sync Binlog is enabled. This leads to higher data security but at the cost of write performance."
  - alert: IO thread stopped
    expr: mysql_slave_status_slave_io_running != 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} IO thread stopped"
      description: "IO thread has stopped. This is usually because it cannot connect to the Master any more."
  - alert: SQL thread stopped 
    expr: mysql_slave_status_slave_sql_running == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} SQL thread stopped"
      description: "SQL thread has stopped. This is usually because it cannot apply a SQL statement received from the master."
  - alert: SQL thread stopped
    expr: mysql_slave_status_slave_sql_running != 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} SQL thread stopped"
      description: "SQL thread has stopped. This is usually because it cannot apply a SQL statement received from the master."
  - alert: Slave lagging behind Master
    expr: mysql_slave_status_seconds_behind_master > 30
    for: 1m
    labels:
      severity: warning 
    annotations:
      summary: "Instance {{ $labels.instance }} Slave lagging behind Master"
      description: "Slave is lagging behind Master. Please check if Slave threads are running and if there are some performance issues!"
  - alert: Slave is NOT read only(Please ignore this warning indicator.)
    expr: mysql_global_variables_read_only == 0
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Slave is NOT read only"
      description: "Slave is NOT set to read only. You can accidentally manipulate data on the slave and get inconsistencies..."
EOF
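
In a rules file of this length it is easy to end up with two alerts that share a name, which makes notifications hard to tell apart. The sketch below lists alert names that occur more than once; `duplicate_alerts` is a hypothetical helper, and the default path assumes the rules were written to /opt/prometheus/rules.

```shell
# duplicate_alerts FILE: print each alert name that appears more than once.
duplicate_alerts() {
  sed -n 's/^ *- alert: *//p' "$1" | sed 's/ *$//' | sort | uniq -d
}

# Assumption: the MySQL rules live in /opt/prometheus/rules/mysql.rules.
RULES_FILE="${RULES_FILE:-/opt/prometheus/rules/mysql.rules}"
if [ -f "$RULES_FILE" ]; then
  duplicate_alerts "$RULES_FILE"
fi
```

This only compares names. The syntax of a rules file can be validated separately with `promtool check rules`.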

Configure the alerting policy

cat alertmanager.yml 
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.exmail.qq.com:465'
  smtp_from: 'zxc@xxlaila.cn.com'
  smtp_auth_username: 'zxc@xxlaila.cn.com'
  smtp_auth_password: '123456'
  smtp_require_tls: true
  hipchat_api_url: 'https://hipchat.foobar.org/'
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_secret: 'KJfj93r21389usdas0i--234jsnjkhf23sjkfjsfs'    # WeChat Work secret
  wechat_api_corp_id: 'wwa98423u9skdnkjahs'    # WeChat Work CorpId

templates:
  - 'template/*.tmpl'   # alert notification templates

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  #receiver: 'web.hook'
  receiver: default
  routes:
  - receiver: 'wechat'
    continue: true

receivers:
#- name: 'web.hook'
  - name: 'default'
    email_configs:
    - to: 'cq_xxlaila@163.com'
      html: '{{ template "test.html" . }}'
      headers: { Subject: "[WARN] email"}
      send_resolved: true
    webhook_configs:
    - url: 'http://127.0.0.1:5001/'
  - name: 'wechat'
    wechat_configs:
    - send_resolved: true
      to_user: '@all'              # recipients; here, everyone
      to_party: '4'                # ID of the receiving department
      agent_id: '1000002'          # ID of the custom WeChat Work application
      corp_id: 'wwa98457kdsnkdnsadmsdnas'   # WeChat Work CorpId
      message: '{{ template "test_wechat.html" . }}'  # template used for the message

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

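Alertmanager is mainly responsible for handling the alerts produced by Prometheus in one place, so an Alertmanager configuration generally contains the following main parts:

Global configuration (global): defines global common parameters, such as the global SMTP settings and Slack settings;
Templates (templates): defines the templates used for alert notifications, such as HTML templates and email templates;
Alert routing (route): decides how an alert is handled, based on label matching;
Receivers (receivers): a receiver is an abstract concept; it can be a mailbox, WeChat, Slack, a webhook, and so on, and is generally used together with alert routing;
Inhibition rules (inhibit_rules): well-chosen inhibition rules reduce the number of noise alerts.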

Configuring the .tmpl templates

# Create the directory for .tmpl templates
mkdir -p /opt/alertmanager/template && cd /opt/alertmanager/template

# WeChat Work (EOF is quoted so the shell does not expand $i and $alert)
cat >test_wechat.tmpl <<'EOF'
{{ define "test_wechat.html" }}
  {{ range $i, $alert := .Alerts.Firing }}
    [Alert]: {{ index $alert.Labels "alertname" }}
    [Environment]: {{ index $alert.Labels "env" }}
    [Instance]: {{ index $alert.Labels "instance" }}
    [Severity]: {{ index $alert.Labels "severity" }}
    [Alert summary]: {{ index $alert.Annotations "summary" }}
    [Alert description]: {{ index $alert.Annotations "description" }}
    [Start time]: {{ $alert.StartsAt }}
  {{ end }}
{{ end }}
EOF

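# Email alerts (EOF is quoted so the shell does not expand $i and $alert)
cat >test.tmpl <<'EOF'
{{ define "test.html" }}
<table border="1">
        <tr>
                <td>Alert</td>
                <td>Environment</td>
                <td>Instance</td>
                <td>Severity</td>
                <td>Alert summary</td>
                <td>Alert description</td>
                <td>Start time</td>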
        </tr>
        {{ range $i, $alert := .Alerts }}
                <tr>
                        <td>{{ index $alert.Labels "alertname" }}</td>
                        <td>{{ index $alert.Labels "env"}}</td>
                        <td>{{ index $alert.Labels "instance" }}</td>
                        <td>{{ index $alert.Labels "severity" }}</td>
                        <td>{{ index $alert.Annotations "summary" }}</td>
                        <td>{{ index $alert.Annotations "description" }}</td>
                        <td>{{ $alert.StartsAt }}</td>
                </tr>
        {{ end }}
</table>
{{ end }}
EOF

# Restart Alertmanager
systemctl restart alertmanager
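
After the restart, the notification path can be exercised without waiting for a real failure by posting a hand-made alert to Alertmanager. The sketch below assumes Alertmanager listens on localhost:9093 and that its v2 API endpoint `/api/v2/alerts` is available in the installed version; `test_alert_json`, `send_test_alert`, and the alert name `ManualTest` are hypothetical.

```shell
# Assumption: Alertmanager listens on localhost:9093 (its default port).
AM_URL="${AM_URL:-http://localhost:9093}"

# test_alert_json NAME INSTANCE ENV SEVERITY: print a one-alert JSON payload.
test_alert_json() {
  cat <<EOF
[{"labels":{"alertname":"$1","instance":"$2","env":"$3","severity":"$4"},
  "annotations":{"summary":"Test alert for $2","description":"Manually fired to verify notifications."}}]
EOF
}

# send_test_alert NAME INSTANCE ENV SEVERITY: POST the payload to Alertmanager.
send_test_alert() {
  test_alert_json "$@" | curl -s --max-time 5 -X POST \
    -H 'Content-Type: application/json' --data-binary @- "$AM_URL/api/v2/alerts"
}

# Show the payload that would be sent (nothing is posted here)
test_alert_json ManualTest 172.21.1.30:9100 test warning
```

Calling `send_test_alert ManualTest 172.21.1.30:9100 test warning` posts the alert, which should then reach the configured receivers once the group wait time has passed.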

(Screenshot: alert notification received in WeChat Work)