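Monitoring and Alerting System Configuration Guide

📋 Table of Contents

  1. Quick Start
  2. Detailed Configuration
  3. Alert Configuration
  4. Dashboard Configuration
  5. Troubleshooting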

🚀 Quick Start

Step 1: Check the environment

# Run the environment check script
chmod +x scripts/check-monitoring-env.sh
./scripts/check-monitoring-env.sh

Step 2: Configure the email service

Edit monitoring/alertmanager.yml and update the email settings:

receivers:
  - name: 'default'
    email_configs:
      - to: 'admin@novalon.cn'           # Address that receives alerts
        from: 'alertmanager@novalon.cn'   # Sender address
        smarthost: 'smtp.resend.com:587'  # SMTP server
        auth_username: 'resend'            # SMTP username
        auth_password: 're_xxxxxxxxxxxxxx' # Resend API key
        require_tls: true                  # Require TLS

Step 3: Start the monitoring services

# Start all monitoring services
docker-compose -f docker-compose.monitoring.yml up -d

# Check service status
docker-compose -f docker-compose.monitoring.yml ps

Step 4: Open the monitoring interfaces

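🔧 Detailed Configuration

1. Prometheus Configuration

File location: monitoring/prometheus.yml

Basic configuration

global:
  scrape_interval: 15s      # How often targets are scraped
  evaluation_interval: 15s   # How often rules are evaluated

scrape_configs:
  - job_name: 'novalon-website'
    static_configs:
      - targets: ['localhost:3000']  # Application address
    metrics_path: '/api/health'      # Health check endpoint, scraped for metrics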
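
Prometheus parses whatever metrics_path returns as its text exposition format, so the endpoint has to answer in that format rather than, for example, a JSON health payload. This guide does not show what /api/health returns, so the following is only a minimal sketch of the expected response shape. The function name render_counter and the sample labels are illustrative assumptions, not part of the project.

```python
def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    samples maps a tuple of (label_name, label_value) pairs to the counter value.
    Label values are not escaped here, so keep them free of quotes and backslashes.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical sample: 42 successful GET requests so far.
body = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    {(("method", "GET"), ("status", "200")): 42},
)
print(body)
```

If the endpoint returns something else, the target shows up as down in Prometheus even though the application itself is healthy.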

Adding more scrape targets

scrape_configs:
  - job_name: 'novalon-website'
    static_configs:
      - targets: ['localhost:3000']
    metrics_path: '/api/health'

  - job_name: 'node-exporter'  # System metrics
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'postgres-exporter'  # Database metrics
    static_configs:
      - targets: ['localhost:9187']

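2. Alert Rule Configuration

File location: monitoring/alerts.yml

Each rule below is one entry in the rules list of a group defined under groups: in this file.

Service availability alert

- alert: ServiceDown
  expr: up{job="novalon-website"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Service unavailable"
    description: "The Novalon website has been unreachable for more than 1 minute"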

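Error rate alert

- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error rate"
    description: "The share of 5xx responses over the last 5 minutes exceeded 5%: {{ $value }}"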
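
The 5% threshold in this rule is a ratio of 5xx requests to all requests, whereas rate() on its own yields requests per second. The sketch below works through that arithmetic on two counter samples taken 5 minutes apart. It ignores the counter-reset handling and extrapolation that the real rate() performs, and the sample numbers are made up for illustration.

```python
def counter_rate(first, last):
    """Per-second increase between two (timestamp_seconds, counter_value) samples."""
    (t0, v0), (t1, v1) = first, last
    return (v1 - v0) / (t1 - t0)


def error_ratio(errors_first, errors_last, total_first, total_last):
    """Share of requests that were errors over the window, as a value in [0, 1]."""
    total = counter_rate(total_first, total_last)
    if total == 0:
        return 0.0  # no traffic in the window, so no error ratio to report
    return counter_rate(errors_first, errors_last) / total


# Hypothetical window of 300 s: 30 new 5xx responses out of 300 new requests.
errors_per_second = counter_rate((0, 10), (300, 40))
ratio = error_ratio((0, 10), (300, 40), (0, 1000), (300, 1300))
should_alert = ratio > 0.05
```

Here the per-second error rate (0.1 requests/s) and the error ratio (10%) happen to be the same number only because total traffic is 1 request/s; at any other traffic level the two diverge.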

Response time alert

- alert: HighResponseTime
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High response time"
    description: "P95 response time exceeded 1 second: {{ $value }}s"
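
To make the P95 expression less opaque, here is a simplified sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation, which is the approach histogram_quantile takes. It assumes 0 < q <= 1, a lowest bucket that starts at 0, and buckets sorted by upper bound with +Inf last. The bucket counts are invented for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical request durations: 98 of 100 requests finished within 1 second.
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
```

Because the result is interpolated, its precision depends on the bucket boundaries the application exposes; a threshold of 1 second is only meaningful if there is a bucket boundary at or near 1.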

Resource usage alerts

- alert: HighCPUUsage
  expr: rate(process_cpu_seconds_total[5m]) > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage"
    description: "Process CPU usage exceeded 80% of one core: {{ $value }}"

- alert: HighMemoryUsage
  expr: process_resident_memory_bytes / 1024 / 1024 / 1024 > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High memory usage"
    description: "Resident memory exceeded 1 GiB: {{ $value }} GiB"

3. Alertmanager Configuration

Email notification configuration

receivers:
  - name: 'critical-alerts'
    email_configs:
      - to: 'admin@novalon.cn,ops@novalon.cn'
        from: 'alertmanager@novalon.cn'
        smarthost: 'smtp.resend.com:587'
        auth_username: 'resend'
        auth_password: 're_xxxxxxxxxxxxxx'
        require_tls: true
        headers:
          Subject: '🚨 CRITICAL: Novalon Website Alert'
          X-Priority: '1'  # Mark the email as high priority

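Alert routing configuration

route:
  receiver: 'default'   # Fallback receiver; required on the top-level route
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s      # Wait before the first notification so related alerts can be grouped
  group_interval: 10s   # Wait before notifying about new alerts added to an existing group
  repeat_interval: 12h  # Wait before re-sending a notification for alerts that are still firing

  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      continue: true    # Keep evaluating the sibling routes below

    - match:
        severity: warning
      receiver: 'warning-alerts'   # Must also be defined under receivers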
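
The effect of continue: true is easier to see in a small simulation. The sketch below is a simplified, single-level model of route matching, not Alertmanager's actual implementation: child routes are checked in order, the first match wins unless continue is set, and an alert that matches nothing falls back to the parent's receiver. The fallback name 'default' is an assumption taken from the receiver defined in the Quick Start.

```python
def matching_receivers(alert_labels, routes, default_receiver):
    """Return the receivers an alert would be sent to under a one-level route tree."""
    receivers = []
    for route in routes:
        if all(alert_labels.get(key) == value for key, value in route["match"].items()):
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                break  # first match wins unless the route asks to continue
    return receivers or [default_receiver]


# The two child routes from the configuration above.
ROUTES = [
    {"match": {"severity": "critical"}, "receiver": "critical-alerts", "continue": True},
    {"match": {"severity": "warning"}, "receiver": "warning-alerts"},
]

critical = matching_receivers({"severity": "critical"}, ROUTES, "default")
warning = matching_receivers({"severity": "warning"}, ROUTES, "default")
info = matching_receivers({"severity": "info"}, ROUTES, "default")
```

With only these two routes, continue: true changes nothing for critical alerts, because no later sibling matches them. It starts to matter once a further route is added that should also receive critical alerts.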

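Inhibition rules

# Suppress ServiceDown notifications for an instance while another critical alert
# is firing on that same instance. For temporary muting during maintenance,
# create a silence in the Alertmanager UI rather than editing this file.
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      alertname: 'ServiceDown'
    equal: ['instance']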
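
As a rough illustration of what this rule does, the sketch below models inhibition in a simplified way: a target alert is muted when some firing alert matches source_match and agrees with it on every label listed in equal. It also mirrors the documented safeguard that an alert matching both sides of a rule is not inhibited by alerts that also match both sides, which prevents self-inhibition. The alert label sets are hypothetical.

```python
def matches(labels, matcher):
    """True if every label in the matcher has exactly that value on the alert."""
    return all(labels.get(key) == value for key, value in matcher.items())


def is_inhibited(target, firing, rule):
    """Return True if `target` is muted by any alert in `firing` under `rule`."""
    if not matches(target, rule["target_match"]):
        return False
    target_is_also_source = matches(target, rule["source_match"])
    for source in firing:
        if not matches(source, rule["source_match"]):
            continue
        if target_is_also_source and matches(source, rule["target_match"]):
            continue  # guards against an alert inhibiting itself
        if all(source.get(label) == target.get(label) for label in rule["equal"]):
            return True
    return False


RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"alertname": "ServiceDown"},
    "equal": ["instance"],
}
service_down = {"alertname": "ServiceDown", "severity": "critical", "instance": "localhost:3000"}
high_errors = {"alertname": "HighErrorRate", "severity": "critical", "instance": "localhost:3000"}

alone = is_inhibited(service_down, [service_down], RULE)
with_other_critical = is_inhibited(service_down, [service_down, high_errors], RULE)
```

Under this model ServiceDown keeps notifying when it fires alone, and is muted only while a different critical alert is active on the same instance.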

📊 Dashboard Configuration

Grafana dashboard setup

1. Log in to Grafana

Open http://localhost:3001

  • Username: admin
  • Password: admin

2. Add the Prometheus data source

  1. Go to Configuration → Data Sources
  2. Click "Add data source"
  3. Select "Prometheus"
  4. Configure:
  5. Click "Save & Test"

3. Import the dashboard

  1. Go to Dashboards → Import
  2. Upload monitoring/grafana-dashboard.json
  3. Select the Prometheus data source
  4. Click "Import"

4. Customize the dashboard

Create panels for the key metrics:

HTTP requests panel

{
  "title": "HTTP Requests",
  "targets": [
    {
      "expr": "rate(http_requests_total[5m])",
      "legendFormat": "{{ method }} {{ status }}"
    }
  ],
  "type": "graph"
}

Response time panel

{
  "title": "Response Time (P95)",
  "targets": [
    {
      "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
      "legendFormat": "P95"
    }
  ],
  "type": "graph"
}

Error rate panel

{
  "title": "Error Rate",
  "targets": [
    {
      "expr": "rate(http_requests_total{status=~\"5..\"}[5m])",
      "legendFormat": "5xx Errors"
    }
  ],
  "type": "graph"
}

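🧪 Testing Alerts

1. Test the service-down alert

# Stop the application
npm stop

# Wait at least 1 minute, then check Alertmanager
# (API v2; the v1 API was removed in Alertmanager 0.27)
curl http://localhost:9093/api/v2/alerts

# Restore the service
npm start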

2. Test the high error rate alert

# Simulate a high error rate
for i in {1..100}; do
  curl -X POST http://localhost:3000/api/test/error
done

# Check Prometheus (curl URL-encodes the PromQL expression)
curl -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=rate(http_requests_total{status=~"5.."}[5m])'
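
Hand-encoding a PromQL expression into a URL is easy to get wrong, for instance by dropping a closing parenthesis. As a small sketch of a safer approach, the helper below builds the instant-query URL with the standard library and then decodes it again to confirm nothing was lost. The helper name instant_query_url is an illustrative assumption.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def instant_query_url(base_url, promql):
    """Build a Prometheus instant-query URL with the PromQL expression percent-encoded."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


QUERY = 'rate(http_requests_total{status=~"5.."}[5m])'
url = instant_query_url("http://localhost:9090", QUERY)

# Decoding the query string gives back the original expression,
# closing parenthesis included.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(url)
```

The resulting URL can be passed to curl in single quotes, which also keeps the shell from interpreting any remaining special characters.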

3. Test email notifications

# Send a test alert (API v2; the v1 API was removed in Alertmanager 0.27)
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[
    {
      "labels": {
        "alertname": "TestAlert",
        "severity": "warning"
      },
      "annotations": {
        "description": "This is a test alert"
      }
    }
  ]'
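
A test alert posted this way stays active until Alertmanager's resolve timeout expires. The alerts API also accepts optional startsAt and endsAt timestamps, so one way to make a test alert clean up after itself is to set endsAt a few minutes ahead. The sketch below only builds such a request body; the helper name build_test_alert and the 5-minute lifetime are assumptions for illustration.

```python
import json
from datetime import datetime, timedelta, timezone


def build_test_alert(name, severity, description, now, ttl_minutes=5):
    """Return a JSON request body holding one alert that ends after ttl_minutes."""
    alert = {
        "labels": {"alertname": name, "severity": severity},
        "annotations": {"description": description},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=ttl_minutes)).isoformat(),
    }
    return json.dumps([alert])


# A fixed timestamp keeps the example reproducible; use datetime.now(timezone.utc) in practice.
fixed_now = datetime(2024, 1, 1, tzinfo=timezone.utc)
payload = build_test_alert("TestAlert", "warning", "This is a test alert", fixed_now)
print(payload)
```

The printed JSON can be saved to a file and sent with curl using -d @file in place of the inline body.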

🔧 Troubleshooting

Problem 1: Services fail to start

# Check the Docker logs
docker-compose -f docker-compose.monitoring.yml logs

# Check whether the ports are already in use
netstat -tulpn | grep -E '3000|9090|3001|9093'

# Restart the services
docker-compose -f docker-compose.monitoring.yml restart

Problem 2: Alerts are not sent

# Inspect the Alertmanager configuration
docker-compose -f docker-compose.monitoring.yml exec alertmanager \
  cat /etc/alertmanager/alertmanager.yml

# Check the Alertmanager logs
docker-compose -f docker-compose.monitoring.yml logs alertmanager

# Test the SMTP connection
telnet smtp.resend.com 587

Problem 3: Grafana cannot connect to Prometheus

# Check that Prometheus is running
docker-compose -f docker-compose.monitoring.yml ps prometheus

# Test the Prometheus API
curl http://localhost:9090/api/v1/status/config

# Check network connectivity
docker-compose -f docker-compose.monitoring.yml exec grafana \
  ping prometheus

Problem 4: Scraping fails

# Check the application's health check endpoint
curl http://localhost:3000/api/health

# Check the Prometheus targets
curl http://localhost:9090/api/v1/targets

# View the Prometheus logs
docker-compose -f docker-compose.monitoring.yml logs prometheus

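📈 Best Practices

1. Alert severity levels

  • Critical: needs immediate action; users are affected
  • Warning: needs attention but does not affect core functionality
  • Info: informational; kept for the record

2. Controlling alert frequency

  • Avoid alert storms
  • Use a sensible for duration
  • Set an appropriate repeat_interval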
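
To show why the for duration helps against noisy alerts, the sketch below simulates how a rule moves between inactive, pending, and firing across evaluation cycles. It is a simplified model: the condition must hold continuously for the whole for duration before the alert fires, and a single false evaluation resets the clock. The 15 s evaluation interval matches the Prometheus configuration in this guide; the sequence of results is invented.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Map per-evaluation condition results to inactive/pending/firing states."""
    states = []
    active_since = None
    for index, is_true in enumerate(condition_by_eval):
        now = index * eval_interval_s
        if not is_true:
            active_since = None  # any false evaluation resets the pending timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# A brief blip (two true results) followed by a sustained problem, with for: 1m.
results = [True, True, False, True, True, True, True, True]
states = alert_states(results, eval_interval_s=15, for_s=60)
```

The short blip never leaves the pending state, so it produces no notification; only the sustained condition reaches firing, 60 seconds after it began.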

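3. Choosing metrics

  • Key metrics: must be monitored
  • Important metrics: should be monitored
  • Supporting metrics: optional

4. Performance tuning

  • Adjust the Prometheus scrape interval
  • Configure a data retention policy
  • Optimize PromQL queries

📞 Contact Support

If you run into problems, contact: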