Setting up microservice monitoring and a simple logging system with Grafana + Prometheus + Node Exporter + Promtail + Loki

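1. Problems

❌ A microservice goes down and nobody knows

❌ A server's disk fills up and nobody knows

❌ No unified view of JVM memory, thread pools, or abnormal API response times

❌ No proactive alerting; problems are found only when someone notices them

❌ No central place to query logs; you have to log in to each server and run tail -f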

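2. Solution

Component	Role
Prometheus	Metrics collection + storage (the core)
Grafana	Metrics visualization + alert rules
Alertmanager	Alert routing (email, WeCom, DingTalk)
Node Exporter	Server-level monitoring (disk, CPU, memory)
Spring Boot Actuator + Micrometer	JVM and API metrics for microservices
Promtail + Loki + Grafana	Log collection, storage, and viewing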

3. Deployment

Download Grafana 12.3.1: https://dl.grafana.com/grafana-enterprise/release/12.3.1/grafana-enterprise_12.3.1_20271043721_linux_amd64.tar.gz
Download Node Exporter 1.10.2:

Download Prometheus 3.8.1

Upload the archives to the Linux server

under the /web directory

Extract them:

tar -zxvf  grafana.tar.gz
tar -zxvf  node_exporter.tar.gz
tar -zxvf  prometheus-3.8.1.linux-amd64.tar.gz
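
The systemd units later in this post point at directories such as /web/node_exporter-1.10.2 and /web/prometheus-3.8.1, while a release tarball often unpacks into a directory whose name carries a platform suffix. A minimal sketch, assuming GNU tar, that unpacks an archive into a directory name you choose; the helper name extract_to and the example paths are assumptions, not part of the original setup:

```shell
# Hypothetical helper: unpack a tarball into a directory of your choosing,
# dropping the archive's own top-level directory.
extract_to() {
  local archive="$1"
  local target="$2"
  mkdir -p "$target"
  tar -xzf "$archive" -C "$target" --strip-components=1
}

# Example (assumed paths; adjust to the files you uploaded):
# extract_to /web/node_exporter.tar.gz /web/node_exporter-1.10.2
```

This way the ExecStart paths in the unit files stay valid no matter what the archive calls its top-level directory.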

Start node_exporter:

vi /etc/systemd/system/node_exporter.service

Write the following content:

[Unit]
Description=Node Exporter
After=network.target

[Service]
ExecStart=/web/node_exporter-1.10.2/node_exporter
Restart=always

[Install]
WantedBy=multi-user.target

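To use a different port, append --web.listen-address=:xxxx to ExecStart.

For example, to listen on port 9999:

ExecStart=/web/node_exporter-1.10.2/node_exporter --web.listen-address=:9999

Reload systemd, enable the service at boot, and start it:

systemctl daemon-reload

systemctl enable node_exporter

systemctl start node_exporter

Check that it is running (9100 is the default port; substitute your own if you changed it):

ss -tnlp | grep 9100

Verify over HTTP:

curl -I http://127.0.0.1:9100/metrics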
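
Beyond checking the HTTP status, it can help to confirm that a specific metric is actually exported. A small sketch that filters the text-format output for one metric; the helper name metric_lines is hypothetical, and the metric in the usage line is just one that node_exporter normally exposes:

```shell
# Hypothetical helper: print the sample lines whose metric name starts with $1,
# reading Prometheus text-format output on stdin. Comment lines ('# HELP',
# '# TYPE') are skipped.
metric_lines() {
  local name="$1"
  awk -v n="$name" '$0 !~ /^#/ && index($0, n) == 1'
}

# Usage against a running node_exporter (port 9100 assumed):
# curl -s http://127.0.0.1:9100/metrics | metric_lines node_filesystem_avail_bytes
```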

Start Grafana:

Create a Grafana configuration file:

cd /web/grafana-12.3.1/conf/

vi custom.ini

Enter the following, replacing 66.222.241.214 with your own public IP:

[server]
http_addr = 0.0.0.0
http_port = 3000
domain = 66.222.241.214
root_url = http://66.222.241.214/grafana/
serve_from_sub_path = true

Next, create the systemd unit file:

vi /etc/systemd/system/grafana.service

Enter the following, then save and quit with :wq:

[Unit]
Description=Grafana Server
After=network.target

[Service]
Type=simple
User=root
WorkingDirectory=/web/grafana-12.3.1
ExecStart=/web/grafana-12.3.1/bin/grafana-server \
  --homepath=/web/grafana-12.3.1 \
  --config=/web/grafana-12.3.1/conf/custom.ini

Restart=always
LimitNOFILE=10000

[Install]
WantedBy=multi-user.target

Start Grafana:

systemctl daemon-reload
systemctl enable grafana
systemctl start grafana

Configure the nginx reverse proxy

Here 11.11.11.5 is the internal address of the host where Grafana is deployed:

location /grafana/ {
    proxy_pass http://11.11.11.5:3000;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    
    proxy_buffering off;
    proxy_request_buffering off;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    
    # Add these headers to work around CORS errors
    add_header Access-Control-Allow-Private-Network true always;
    add_header Access-Control-Allow-Origin * always;
}

Grafana is now reachable at http://66.222.241.214/grafana/

where 66.222.241.214 is the public IP.

Start Prometheus 3.8.1

vi /etc/systemd/system/prometheus.service

Write the following content:

[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
ExecStart=/web/prometheus-3.8.1/prometheus \
  --config.file=/web/prometheus-3.8.1/prometheus.yml \
  --storage.tsdb.path=/web/prometheus-3.8.1/data \
  --web.listen-address=:9090

Restart=always

[Install]
WantedBy=multi-user.target

Configure Prometheus

cd /web/prometheus-3.8.1
vi prometheus.yml

Write the following content:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
      # To monitor more servers, append more host:port entries to the targets list, separated by commas
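
As an illustration of the multi-server case, the sketch below writes a two-host variant of the node_exporter job to a scratch file and validates it when promtool (shipped in the Prometheus tarball) is on the PATH. The second target, 192.168.0.220:9100, and the scratch file name are assumptions for the example; use the hosts where your own node_exporter instances run.

```shell
# Sketch: a node_exporter job with two targets, written to a scratch file.
# 192.168.0.220:9100 is an assumed second host running node_exporter.
cfg="${TMPDIR:-/tmp}/prometheus-multi.yml"
cat > "$cfg" <<'EOF'
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100', '192.168.0.220:9100']
EOF

# Validate the syntax if promtool is available on this machine.
if command -v promtool >/dev/null 2>&1; then
  promtool check config "$cfg"
fi
```

Each target then shows up in Prometheus with its own instance label, which is what the alert query later in this post filters on.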

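Start Prometheus:

systemctl daemon-reload
systemctl enable prometheus
systemctl start prometheus
systemctl status prometheus

Confirm the listening port:

ss -tnlp | grep 9090

Verify through the API (10.0.0.5 is the Prometheus host in this example; use your own address, or 127.0.0.1 on the server itself):

curl http://10.0.0.5:9090/-/ready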
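
The readiness endpoint only says that Prometheus itself is up. To see whether the scrape targets are reachable, one option is to run the built-in up metric through the HTTP query API. A minimal sketch; the helper name prom_up and the default base URL are assumptions:

```shell
# Hypothetical helper: run the instant query `up` against Prometheus.
# In the JSON response, a sample value of "1" means the target was scraped
# successfully and "0" means the last scrape failed.
prom_up() {
  local base="${1:-http://127.0.0.1:9090}"
  curl -fsS -G "$base/api/v1/query" --data-urlencode 'query=up'
}

# Usage on the Prometheus host:
# prom_up
# prom_up http://10.0.0.5:9090
```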

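4. Configure the data source and dashboards

Switch the Grafana menus to Chinese

Add Prometheus as a data source

Then save.

Configure a dashboard

The Linux server monitoring dashboard has ID 8919

Enter the shared dashboard ID 8919 and click Load.

You can click Edit to rename the dashboard.

Once saved, the dashboard also appears on the home page: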

5. Add email alerts

Example:

Send an email alert when usage of the / partition on host 192.168.0.220 exceeds 80%.

Query A returns time series data:

(1 - (
  node_filesystem_avail_bytes{mountpoint="/report", instance="192.168.0.220:9100"}
  /
  node_filesystem_size_bytes{mountpoint="/report", instance="192.168.0.220:9100"}
)) * 100

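The IP and port in the instance label are those of the machine where node_exporter is deployed. Note that the query as written selects the /report mount point; set mountpoint="/" to watch the root partition described in the example.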
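
The expression computes used space as a percentage: (1 - avail / size) * 100. The same arithmetic as a small shell sketch, with a hypothetical helper name and made-up byte counts, which can be handy for cross-checking a value against df:

```shell
# Hypothetical helper: used percentage from available and total bytes,
# mirroring the PromQL expression (1 - avail / size) * 100.
disk_used_pct() {
  awk -v avail="$1" -v size="$2" 'BEGIN { printf "%.2f\n", (1 - avail / size) * 100 }'
}

disk_used_pct 20 100   # prints 80.00

# With GNU df (assumed), compare against the real root partition:
# disk_used_pct $(df -B1 --output=avail,size / | tail -n 1)
```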

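Next, add a Reduce expression (B) that collapses the time series from A to its latest value:

Then add a threshold expression (C):

For Input choose B, select IS ABOVE, and enter 80.

Then click Set "C" as alert condition so that the result of C decides whether the alert fires.

Fill in steps 3, 4 and 5 of the form as needed.

Step 5 sets the recipients of the alert email.

Step 6 sets the content of the notification.

For reference:

192.168.0.220 / partition capacity alert. Expand the disk immediately!

/ partition usage has exceeded 80%. Current value: {{ printf "%.2f" $values.B.Value }}%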

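6. Monitoring Java microservices (TODO)

7. Viewing Java logs (Promtail + Loki + Grafana)

Download version 2.8.4 (newer releases depend on libraries that CentOS 7 does not provide):

loki-linux-amd64.zip
promtail-linux-amd64.zip

Upload both to /web

Extract: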

unzip  loki-linux-amd64.zip
unzip  promtail-linux-amd64.zip

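Promtail collects logs, so it must run on the machine where the log files live.

Loki stores and queries the logs; deploy it on a machine with a reasonably large disk.

Start Loki first

Create the Loki configuration file (the service unit below reads /web/loki.yaml) with the following content:

vi /web/loki.yaml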

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 3110
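  grpc_server_max_recv_msg_size: 1073741824  # maximum gRPC message size the server accepts, in bytes (default 4 MiB)
  grpc_server_max_send_msg_size: 1073741824  # maximum gRPC message size the server sends, in bytes (default 4 MiB)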

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s
  max_transfer_retries: 0
  max_chunk_age: 20m  # maximum time a chunk may stay in memory; once a stream's chunk is older than this, it is flushed to storage and a new chunk is started

schema_config:
  configs:
    - from: 2021-01-01
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 168h

storage_config:
  boltdb:
    directory: /web/loki-data/index # index storage directory
  filesystem:
    directory: /web/loki-data/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  ingestion_rate_mb: 30  # per-tenant ingestion rate limit in MB per second (default 4)
  ingestion_burst_size_mb: 15  # per-tenant ingestion burst size in MB (default 6)

chunk_store_config:
  #max_look_back_period: 168h   # how far back queries may look
  max_look_back_period: 0s

table_manager:
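  retention_deletes_enabled: true # enable deletion of expired logs (default false)
  retention_period: 720h  # log retention period; the Loki docs ask for a multiple of the index period (168h above), for example 672h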

Create the Loki systemd service:

# File that collects the startup log
touch /web/loki.out
vi /etc/systemd/system/loki.service
# Enter the following content
[Unit]
Description=Loki service
After=network.target

[Service]
Type=simple
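# Change the paths to match your installation
ExecStart=/web/loki-linux-amd64 -config.file=/web/loki.yaml
Restart=on-failure
RestartSec=5s
# append: requires systemd 240 or newer; CentOS 7 ships systemd 219, where the output can be read with journalctl -u loki instead
StandardOutput=append:/web/loki.out
StandardError=append:/web/loki.out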

[Install]
WantedBy=multi-user.target

Start Loki:

systemctl daemon-reload
systemctl enable loki
systemctl start loki
systemctl status loki

# View the startup log
tail -f /web/loki.out
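
Loki can take a short while after startup before it accepts traffic, so a readiness check is useful before starting Promtail. A minimal sketch that polls Loki's /ready endpoint; the helper name wait_for_loki, the default URL, and the retry count are assumptions:

```shell
# Hypothetical helper: poll Loki's /ready endpoint once per second until it
# answers with a success status or the number of tries is used up.
wait_for_loki() {
  local base="${1:-http://127.0.0.1:3100}"
  local tries="${2:-30}"
  local i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -fsS "$base/ready" >/dev/null 2>&1; then
      echo "loki is ready"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "loki did not become ready" >&2
  return 1
}

# Usage on the Loki host:
# wait_for_loki
```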

Next, start Promtail

Create the Promtail configuration file:

cd /web
vi promtail.yaml

Add the following content:

server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /web/positions.txt

clients:
  - url: http://<LOKI_HOST_IP>:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          app: ks-app-10.0.0.30
          __path__: /path/to/your/logs/*.log

Replace the __path__ value with the log directory on your own machine.

To collect logs from several services, configure it like this:

scrape_configs:
  # First service (the existing one)
  - job_name: ks-exam
    static_configs:
      - targets:
          - localhost
        labels:
          app: ks-app-10.0.0.30
          __path__: /usr/local/ks-cloud/logs/ks-app/**/*.log

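  # Second service (newly added)
  - job_name: new-service  # job name; pick something easy to recognize
    static_configs:
      - targets:
          - localhost
        labels:
          app: new-service-name  # this name appears in the "service name" dropdown in Grafana
          __path__: /path/to/your/new/logs/*.log  # replace with the new service's actual log path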

Create the Promtail systemd service:

# File that collects the startup log
touch /web/promtail.out
vi /etc/systemd/system/promtail.service
# Enter the following content
[Unit]
Description=Promtail service
After=network.target

[Service]
Type=simple
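# Change the paths to match your installation
ExecStart=/web/promtail-linux-amd64 -config.file=/web/promtail.yaml
Restart=on-failure
RestartSec=5s
# append: requires systemd 240 or newer; CentOS 7 ships systemd 219, where the output can be read with journalctl -u promtail instead
StandardOutput=append:/web/promtail.out
StandardError=append:/web/promtail.out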

[Install]
WantedBy=multi-user.target

Start Promtail:

systemctl daemon-reload
systemctl enable promtail
systemctl start promtail
systemctl status promtail

# View the startup log
tail -f /web/promtail.out

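Data flow: Promtail tails the log files and pushes them to Loki; Grafana queries Loki.

Add Loki to Grafana

Search for loki in the data source list

Enter the Loki address (an internal address is recommended)

Configure the log viewing dashboard in Grafana

Click Explore on the Loki data source

Add to dashboard

Open it

Set up the query variables and apply them

The goal is as follows:

Provide three search filters:

Application name (when writing the Promtail configuration, it helps to put the IP and port into the app label)
For example:
```
labels:
        app: ks-exam-10.0.0.30:8989
```
File name (to choose which log file to view)
Content search (to search by log content)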
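
The three filters map onto a single LogQL query: a matcher on the app label, a regex matcher on the filename label that Promtail attaches to file targets, and a line filter for the search text. A minimal sketch that builds such a query and optionally runs it against Loki's query_range API; the helper names build_logql and loki_query, the sample values, and the result limit are assumptions, and the values are inserted without escaping, so they must not contain double quotes:

```shell
# Hypothetical helper: build a LogQL query from the three filters.
#   $1 app label value, $2 regex for the filename label, $3 text to search for
build_logql() {
  printf '{app="%s", filename=~"%s"} |= "%s"\n' "$1" "$2" "$3"
}

# Hypothetical helper: run the query against Loki's query_range endpoint.
#   $1 Loki base URL (for example http://127.0.0.1:3100), $2..$4 as above
loki_query() {
  curl -fsS -G "$1/loki/api/v1/query_range" \
    --data-urlencode "query=$(build_logql "$2" "$3" "$4")" \
    --data-urlencode 'limit=20'
}

build_logql 'ks-exam-10.0.0.30:8989' '.*error.*' 'Exception'
# prints: {app="ks-exam-10.0.0.30:8989", filename=~".*error.*"} |= "Exception"
```

In a Grafana panel the same shape applies with dashboard variables in place of the literal values, for example {app="$app", filename=~"$filename"} |= "$search", where the variable names are whatever you defined in the dashboard.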