
Prometheus

Deployment

Download the release tarball first:

wget https://github.com/prometheus/prometheus/releases/download/v2.33.5/prometheus-2.33.5.linux-amd64.tar.gz
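Before unpacking, it can be worth checking the tarball against a published checksum. The sketch below assumes the release page also offers a `sha256sums.txt` file in the usual `sha256sum` output format; `verify_release` is a hypothetical helper name, not part of Prometheus.

```bash
# verify_release TARBALL SUMS_FILE
# Hypothetical helper: checks TARBALL against its entry in SUMS_FILE.
# Assumes SUMS_FILE uses the "<hash>  <filename>" format produced by sha256sum.
verify_release() {
  local tarball="$1" sums="$2"
  local dir name
  dir="$(dirname "$tarball")"
  name="$(basename "$tarball")"
  # Keep only the line for this tarball, then let sha256sum verify it
  # from the directory that contains the file.
  grep -F "  $name" "$sums" | (cd "$dir" && sha256sum -c -)
}
```

For example, `verify_release prometheus-2.33.5.linux-amd64.tar.gz sha256sums.txt` returns a non-zero status if the hash does not match or if the file has no entry in the list.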

Start the service

Bash
root@pts/12 # tar -zxvf prometheus-2.33.5.linux-amd64.tar.gz -C /opt/ 
root@pts/12 # cd /opt
root@pts/12 # mv prometheus-2.33.5.linux-amd64 prometheus
root@pts/12 # screen -S prome
root@pts/12 # cd /opt/prometheus/
root@pts/12 # ./prometheus --config.file=prometheus.yml --web.enable-lifecycle --storage.tsdb.retention.time=30d

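The flags above enable the lifecycle HTTP API (`--web.enable-lifecycle`, which the `/-/reload` endpoint requires) and keep collected data for 30 days (`--storage.tsdb.retention.time=30d`).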

Open the web UI in a browser: http://172.16.1.18:9090/

Reload the configuration

curl -X POST http://172.16.1.18:9090/-/reload
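A reload with a broken configuration is rejected by the server, so it can help to validate the file first. The sketch below is one way to combine the two steps. It assumes `promtool` sits next to the `prometheus` binary in `/opt/prometheus`; `safe_reload`, `PROMTOOL`, `PROM_URL` and `PROM_CONFIG` are hypothetical names chosen for illustration.

```bash
# Assumed defaults; override them in the environment if your layout differs.
PROMTOOL="${PROMTOOL:-/opt/prometheus/promtool}"
PROM_URL="${PROM_URL:-http://172.16.1.18:9090}"
PROM_CONFIG="${PROM_CONFIG:-/opt/prometheus/prometheus.yml}"

# safe_reload: validate the config, then ask Prometheus to reload it.
safe_reload() {
  local code
  # promtool checks the config file and the rule files it references.
  if ! "$PROMTOOL" check config "$PROM_CONFIG"; then
    echo "config check failed, not reloading" >&2
    return 1
  fi
  # Capture only the HTTP status code of the reload request.
  code="$(curl -s -o /dev/null -w '%{http_code}' -X POST "$PROM_URL/-/reload")"
  if [ "$code" != "200" ]; then
    echo "reload returned HTTP $code" >&2
    return 1
  fi
  echo "reloaded"
}
```

Calling `safe_reload` after editing `prometheus.yml` leaves the running configuration untouched when the check fails.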

Manage the service with systemd

vim /usr/lib/systemd/system/prometheus.service

Bash
[Unit]
Description=Prometheus-Server
Documentation=https://prometheus.io/
After=network.target

[Service]
ExecStart=/opt/prometheus/prometheus --web.listen-address=0.0.0.0:9090  --config.file=/opt/prometheus/prometheus.yml --web.enable-lifecycle --storage.tsdb.retention.time=30d
User=root

[Install]
WantedBy=multi-user.target

Reload systemd, then start and enable the service

Bash
/usr/bin/systemctl daemon-reload
systemctl start prometheus.service
systemctl enable prometheus.service  
systemctl status prometheus.service
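`systemctl status` only shows that the process is running. To confirm that the server is actually serving requests, the `/-/ready` endpoint can be polled. This is a minimal sketch; `wait_ready` and `WAIT_SECONDS` are hypothetical names, and the default URL is the address used elsewhere on this page.

```bash
# wait_ready [URL] [TRIES]
# Polls URL/-/ready until it answers successfully or TRIES attempts are used up.
wait_ready() {
  local url="${1:-http://172.16.1.18:9090}"
  local tries="${2:-10}"
  local i=1
  while [ "$i" -le "$tries" ]; do
    # -f makes curl return non-zero on HTTP errors such as 503.
    if curl -fsS -o /dev/null "$url/-/ready"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "${WAIT_SECONDS:-1}"
    i=$((i + 1))
  done
  echo "not ready after $tries attempts" >&2
  return 1
}
```

Running `wait_ready` right after `systemctl start prometheus.service` gives a clear pass or fail instead of a status page to read.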

Integrate Alertmanager and alerting rules

Edit the configuration

vim /opt/prometheus/prometheus.yml

Bash
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - 172.16.1.18:9093

rule_files:
   - "/opt/prometheus/rule/node_exporter.yml"
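After the configuration is reloaded, Prometheus lists the Alertmanager instances it has picked up under `/api/v1/alertmanagers`. The sketch below checks that the configured target appears among the active ones. `alertmanager_active` is a hypothetical helper, and the text matching assumes the compact JSON that the API returns and an address without `]` characters.

```bash
# alertmanager_active [PROMETHEUS_URL] [ALERTMANAGER_HOST:PORT]
# Succeeds if the given Alertmanager is listed under "activeAlertmanagers".
alertmanager_active() {
  local prom="${1:-http://172.16.1.18:9090}"
  local am="${2:-172.16.1.18:9093}"
  curl -fsS "$prom/api/v1/alertmanagers" \
    | grep -o '"activeAlertmanagers":\[[^]]*]' \
    | grep -qF "$am"
}
```

A call such as `alertmanager_active && echo "Alertmanager is wired up"` fails when the target is missing from the active list, for example when it only appears under dropped instances.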

Add alerting rules

vim /opt/prometheus/rule/node_exporter.yml

Bash
groups:
- name: Server resource monitoring
  rules:
  - alert: HighMemoryUsage
    expr: 100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 80
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} memory usage is too high, please handle it as soon as possible!"
      description: "{{ $labels.instance }} memory usage is above 80%, current usage {{ $value }}%."

  - alert: InstanceDown
    expr: up == 0
    for: 1s
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} is down, please handle it as soon as possible!"
      description: "{{$labels.instance}} cannot be scraped (up == 0), current state {{ $value }}."

  - alert: HighCpuLoad
    expr: 100 - (avg by (instance,job)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} CPU usage is too high, please handle it as soon as possible!"
      description: "{{$labels.instance}} CPU usage is above 90%, current usage {{ $value }}%."

  - alert: HighDiskIO
    expr: avg(irate(node_disk_io_time_seconds_total[1m])) by (instance,job) * 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} disk I/O utilization is too high, please handle it as soon as possible!"
      description: "{{$labels.instance}} disk I/O utilization is above 90%, current utilization {{ $value }}%."


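  - alert: HighNetworkReceive
    # As written, this fires once the receive rate exceeds 102400 * 100 = 10240000 bytes/s.
    expr: ((sum(rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance,job)) / 100) > 102400
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} inbound network bandwidth is too high, please handle it as soon as possible!"
      description: "{{$labels.instance}} inbound network bandwidth has stayed above 100M for 5 minutes. RX bandwidth usage {{$value}}."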

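  - alert: HighNetworkTransmit
    # As written, this fires once the transmit rate exceeds 102400 * 100 = 10240000 bytes/s.
    expr: ((sum(rate(node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance,job)) / 100) > 102400
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} outbound network bandwidth is too high, please handle it as soon as possible!"
      description: "{{$labels.instance}} outbound network bandwidth has stayed above 100M for 5 minutes. TX bandwidth usage {{$value}}."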

  - alert: HighTcpConnections
    expr: node_netstat_Tcp_CurrEstab > 10000
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} has too many established TCP connections!"
      description: "{{$labels.instance}} has more than 10000 connections in ESTABLISHED state, current count {{ $value }}."

  - alert: HighDiskUsage
    expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 90
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.mountpoint}} disk partition usage is too high, please handle it as soon as possible!"
      description: "{{$labels.instance}} disk partition usage is above 90%, current usage {{ $value }}%."
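The two network rules compare `rate / 100` against `102400`, so it is worth working out which raw byte rate that corresponds to. The small calculation below does that, and also shows the result for a divisor of 1024 for comparison. `threshold_bytes_per_sec` is a hypothetical helper, and whether a divisor of 1024 was the original intent is only an assumption.

```bash
# threshold_bytes_per_sec DIVISOR LIMIT
# Prints the raw rate in bytes/s above which "rate / DIVISOR > LIMIT" holds.
threshold_bytes_per_sec() {
  echo $(( $1 * $2 ))
}

# Divisor used in the rules above.
threshold_bytes_per_sec 100 102400    # prints 10240000 (about 9.8 MiB/s)

# For comparison: dividing by 1024 turns bytes/s into KiB/s.
threshold_bytes_per_sec 1024 102400   # prints 104857600 (exactly 100 MiB/s)
```

With the divisor of 100, the alerts fire at roughly 9.8 MiB/s, which is well below 100 MiB/s. Adjust the divisor or the limit if the annotations are meant to be taken literally.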

Here is a prometheus.yml configuration taken from production: