# Monitoring and Alerting System Configuration Guide

## 📋 Table of contents

1. [Quick start](#quick-start)
2. [Detailed configuration](#detailed-configuration)
3. [Alert configuration](#alert-configuration)
4. [Dashboard configuration](#dashboard-configuration)
5. [Troubleshooting](#troubleshooting)

## 🚀 Quick start

### Step 1: Check the environment

```bash
# Run the environment check script
chmod +x scripts/check-monitoring-env.sh
./scripts/check-monitoring-env.sh
```

### Step 2: Configure the mail service

Edit `monitoring/alertmanager.yml` and update the email settings:

```yaml
receivers:
  - name: 'default'
    email_configs:
      - to: 'admin@novalon.cn'               # address that receives alerts
        from: 'alertmanager@novalon.cn'      # address that sends alerts
        smarthost: 'smtp.resend.com:587'     # SMTP server
        auth_username: 'resend'              # SMTP username
        auth_password: 're_xxxxxxxxxxxxxx'   # Resend API key
        require_tls: true                    # enable TLS
```

### Step 3: Start the monitoring services

```bash
# Start all monitoring services
docker-compose -f docker-compose.monitoring.yml up -d

# Check service status
docker-compose -f docker-compose.monitoring.yml ps
```

### Step 4: Open the monitoring UIs

- **Prometheus**: http://localhost:9090
- **Grafana**: http://localhost:3001 (admin/admin)
- **Alertmanager**: http://localhost:9093

## 🔧 Detailed configuration

### 1. Prometheus configuration

File location: `monitoring/prometheus.yml`

#### Basic configuration

```yaml
global:
  scrape_interval: 15s       # how often metrics are scraped
  evaluation_interval: 15s   # how often rules are evaluated

scrape_configs:
  - job_name: 'novalon-website'
    static_configs:
      - targets: ['localhost:3000']   # application address
    # Health check endpoint; it must serve metrics in the Prometheus text format
    metrics_path: '/api/health'
```

#### Adding more scrape targets

```yaml
scrape_configs:
  - job_name: 'novalon-website'
    static_configs:
      - targets: ['localhost:3000']
    metrics_path: '/api/health'

  - job_name: 'node-exporter'       # system metrics
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'postgres-exporter'   # database metrics
    static_configs:
      - targets: ['localhost:9187']
```
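Prometheus can only ingest a scrape target that answers in the Prometheus text exposition format, and a health check endpoint such as `/api/health` often returns JSON instead. The sketch below is a rough heuristic for telling the two apart, not a full parser; the function name `looks_like_metrics` is hypothetical and not part of this project.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if stdin looks like Prometheus text exposition
# format, meaning at least one sample line and no line that is neither
# blank, a comment, nor a sample. This is a heuristic, not a validator.
looks_like_metrics() {
  awk '
    /^[ \t]*$/ { next }   # blank lines are allowed
    /^#/       { next }   # HELP, TYPE and other comment lines
    /^[a-zA-Z_:][a-zA-Z0-9_:]*([{].*[}])?[ \t]+(NaN|[-+]?Inf|[-+]?[0-9.]+([eE][-+]?[0-9]+)?)([ \t]+-?[0-9]+)?[ \t]*$/ { samples++; next }
    { bad++ }             # anything else (for example JSON) is not a sample
    END { exit !(samples > 0 && bad == 0) }
  '
}

# Example (not run here): check the endpoint configured in metrics_path.
# curl -s http://localhost:3000/api/health | looks_like_metrics \
#   && echo "looks scrapeable" || echo "not in Prometheus text format"
```

If the check fails, the target will most likely show as down on the Prometheus targets page even though the application itself is healthy.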
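### 2. Alert rule configuration

File location: `monitoring/alerts.yml`

Each rule below is an entry in the `rules` list of a rule group (`groups[].rules`) in this file.

#### Service availability alert

```yaml
- alert: ServiceDown
  expr: up{job="novalon-website"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Service unavailable"
    description: "The Novalon website has not responded for more than 1 minute"
```

#### Error rate alert

The expression divides the 5xx request rate by the total request rate, so the 0.05 threshold is a ratio (5%) rather than a number of errors per second.

```yaml
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error rate"
    description: "5xx error ratio over the last 5 minutes exceeds 5%: {{ $value }}"
```

#### Response time alert

```yaml
- alert: HighResponseTime
  expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High response time"
    description: "P95 response time exceeds 1 second: {{ $value }}s"
```

#### Resource usage alerts

```yaml
- alert: HighCPUUsage
  expr: rate(process_cpu_seconds_total[5m]) > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage"
    description: "CPU usage exceeds 80% of one core: {{ $value }}"

- alert: HighMemoryUsage
  expr: process_resident_memory_bytes / 1024 / 1024 / 1024 > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High memory usage"
    description: "Resident memory exceeds 1GB: {{ $value }}GB"
```

### 3. Alertmanager configuration

#### Email notification configuration

```yaml
receivers:
  - name: 'critical-alerts'
    email_configs:
      - to: 'admin@novalon.cn,ops@novalon.cn'
        from: 'alertmanager@novalon.cn'
        smarthost: 'smtp.resend.com:587'
        auth_username: 'resend'
        auth_password: 're_xxxxxxxxxxxxxx'
        require_tls: true
        headers:
          Subject: '🚨 CRITICAL: Novalon Website Alert'
          X-Priority: '1'   # high-priority mail
```

#### Alert routing configuration

The top-level route must name a fallback receiver, and every receiver referenced by a route, including `warning-alerts`, must be defined under `receivers`.

```yaml
route:
  receiver: 'default'       # fallback when no child route matches
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s           # wait for more alerts to join a new group
  group_interval: 10s       # wait before notifying about new alerts in a group
  repeat_interval: 12h      # wait before repeating a notification
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      continue: true        # keep matching the following routes
    - match:
        severity: warning
      receiver: 'warning-alerts'
```

#### Suppression rules (maintenance periods)

`inhibit_rules` are a permanent part of the configuration, not a temporary switch: they suppress notifications for the target alerts whenever a matching source alert is firing. For a one-off maintenance window, create a silence in the Alertmanager UI instead.

```yaml
# While a critical alert is firing on an instance,
# suppress ServiceDown notifications for that same instance
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      alertname: 'ServiceDown'
    equal: ['instance']
```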
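For a temporary maintenance window, Alertmanager also supports silences, which are created at runtime rather than written into the config file. A minimal sketch of building a silence payload for the `/api/v2/silences` endpoint follows. The helper name `silence_payload` and the `createdBy` value `ops` are hypothetical, the values are not JSON-escaped, and the script assumes GNU `date` for the `-d` option.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a JSON silence payload for POST /api/v2/silences.
# Usage: silence_payload <alertname> <minutes> [comment]
# Assumes GNU date; arguments are inserted as-is (no JSON escaping).
silence_payload() {
  local alertname="$1" minutes="$2" comment="${3:-planned maintenance}"
  local starts ends
  starts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"                         # now, in UTC
  ends="$(date -u -d "+${minutes} minutes" +%Y-%m-%dT%H:%M:%SZ)"  # now + N minutes
  cat <<EOF
{
  "matchers": [
    {"name": "alertname", "value": "${alertname}", "isRegex": false}
  ],
  "startsAt": "${starts}",
  "endsAt": "${ends}",
  "createdBy": "ops",
  "comment": "${comment}"
}
EOF
}

# Example (not run here): silence ServiceDown for 60 minutes.
# silence_payload ServiceDown 60 | curl -s -X POST http://localhost:9093/api/v2/silences \
#   -H 'Content-Type: application/json' -d @-
```

Unlike an inhibition rule, a silence expires on its own at `endsAt`, so nothing needs to be reverted after the maintenance window.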
## 📊 Dashboard configuration

### Grafana dashboard configuration

#### 1. Log in to Grafana

Open http://localhost:3001

- Username: admin
- Password: admin

#### 2. Add the Prometheus data source

1. Go to Configuration → Data Sources
2. Click "Add data source"
3. Select "Prometheus"
4. Configure:
   - Name: Prometheus
   - URL: http://prometheus:9090
   - Access: Server (default)
5. Click "Save & Test"

#### 3. Import the dashboard

1. Go to Dashboards → Import
2. Upload `monitoring/grafana-dashboard.json`
3. Select the Prometheus data source
4. Click "Import"

#### 4. Customize the dashboard

Create panels for the key metrics:

**HTTP requests panel**

```json
{
  "title": "HTTP Requests",
  "targets": [
    {
      "expr": "rate(http_requests_total[5m])",
      "legendFormat": "{{ method }} {{ status }}"
    }
  ],
  "type": "graph"
}
```

**Response time panel**

```json
{
  "title": "Response Time (P95)",
  "targets": [
    {
      "expr": "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
      "legendFormat": "P95"
    }
  ],
  "type": "graph"
}
```

**Error rate panel**

```json
{
  "title": "Error Rate",
  "targets": [
    {
      "expr": "rate(http_requests_total{status=~\"5..\"}[5m])",
      "legendFormat": "5xx Errors"
    }
  ],
  "type": "graph"
}
```

## 🧪 Testing alerts

### 1. Test the service-down alert

```bash
# Stop the application service
npm stop

# Wait at least 1 minute, then check Alertmanager
curl http://localhost:9093/api/v2/alerts

# Restore the service
npm start
```

### 2. Test the high error rate alert

```bash
# Simulate a high error rate
for i in {1..100}; do
  curl -X POST http://localhost:3000/api/test/error
done

# Query Prometheus for the 5xx request rate
curl -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=rate(http_requests_total{status=~"5.."}[5m])'
```
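An alert only reaches Alertmanager after its `for` window has elapsed, plus scrape and rule evaluation delays, so a single check after a fixed wait can come too early. The sketch below polls instead. The helper names `has_alert`, `retry`, and `alert_is_firing` are hypothetical, and the match on the JSON response is a plain text search rather than real JSON parsing.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if the JSON on stdin contains an alert
# whose alertname label equals the first argument (text match, not a parser).
has_alert() {
  grep -Eq "\"alertname\"[[:space:]]*:[[:space:]]*\"$1\""
}

# Hypothetical helper: run a command up to <attempts> times, sleeping
# <delay> seconds between tries; succeed as soon as the command succeeds.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    if ((i < attempts)); then sleep "$delay"; fi
  done
  return 1
}

# Ask Alertmanager whether an alert with the given name is present.
alert_is_firing() {
  curl -s http://localhost:9093/api/v2/alerts | has_alert "$1"
}

# Example (not run here): poll every 10 seconds for up to about 3 minutes.
# retry 18 10 alert_is_firing ServiceDown && echo "ServiceDown is firing"
```

The same `retry` wrapper can be reused with `HighErrorRate`, using a longer budget because that rule has a 5 minute `for` window.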
### 3. Test email notifications

```bash
# Send a test alert
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[
    {
      "labels": {
        "alertname": "TestAlert",
        "severity": "warning"
      },
      "annotations": {
        "description": "This is a test alert"
      }
    }
  ]'
```

## 🔧 Troubleshooting

### Problem 1: Services fail to start

```bash
# Check the Docker logs
docker-compose -f docker-compose.monitoring.yml logs

# Check for port conflicts
netstat -tulpn | grep -E '3000|9090|3001|9093'

# Restart the services
docker-compose -f docker-compose.monitoring.yml restart
```

### Problem 2: Alerts are not sent

```bash
# Check the Alertmanager configuration
docker-compose -f docker-compose.monitoring.yml exec alertmanager \
  cat /etc/alertmanager/alertmanager.yml

# Check the Alertmanager logs
docker-compose -f docker-compose.monitoring.yml logs alertmanager

# Test the SMTP connection
telnet smtp.resend.com 587
```

### Problem 3: Grafana cannot connect to Prometheus

```bash
# Check that Prometheus is running
docker-compose -f docker-compose.monitoring.yml ps prometheus

# Test the Prometheus API
curl http://localhost:9090/api/v1/status/config

# Check network connectivity
docker-compose -f docker-compose.monitoring.yml exec grafana \
  ping prometheus
```

### Problem 4: Metrics are not collected

```bash
# Check the application health check endpoint
curl http://localhost:3000/api/health

# Check the Prometheus targets
curl http://localhost:9090/api/v1/targets

# View the Prometheus logs
docker-compose -f docker-compose.monitoring.yml logs prometheus
```

## 📈 Best practices

### 1. Alert severity levels

- **Critical**: needs immediate action; affects the user experience
- **Warning**: needs attention, but core functionality is not affected
- **Info**: informational; kept for the record

### 2. Alert frequency control

- Avoid alert storms
- Use a reasonable `for` duration
- Set an appropriate `repeat_interval`

### 3. Choosing metrics

- **Key metrics**: must be monitored
- **Important metrics**: should be monitored
- **Supporting metrics**: optional

### 4. Performance tuning

- Adjust the Prometheus scrape interval
- Configure a data retention policy
- Optimize PromQL queries

## 📞 Contact support

If you run into problems, contact:

- Operations team: ops@novalon.cn
- Business inquiries: contact@novalon.cn