docs: organize documentation structure and create index (task 2.3/20)
@@ -0,0 +1,412 @@
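# Monitoring and Alerting System Configuration Guide

## 📋 Table of Contents

1. [Quick Start](#quick-start)
2. [Detailed Configuration](#detailed-configuration)
3. [Alert Configuration](#2-alert-rule-configuration)
4. [Dashboard Configuration](#dashboard-configuration)
5. [Testing Alerts](#testing-alerts)
6. [Troubleshooting](#troubleshooting)
7. [Best Practices](#best-practices)
8. [Contact Support](#contact-support)
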
## 🚀 Quick Start

### Step 1: Check the Environment

```bash
# Run the environment check script
chmod +x scripts/check-monitoring-env.sh
./scripts/check-monitoring-env.sh
```

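The contents of `scripts/check-monitoring-env.sh` are not shown in this guide. As a rough sketch of the kind of check such a script might perform, the functions below report which required tools are missing from `PATH`. The tool list (`docker`, `docker-compose`, `curl`) and the function names are assumptions for illustration, not the actual script.

```bash
# Hypothetical sketch; the real check script may verify different things.

# Print every command from the argument list that is not found on PATH.
missing_commands() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Return 0 when all assumed prerequisites exist; otherwise list what is missing.
check_monitoring_env() {
  local missing
  missing="$(missing_commands docker docker-compose curl)"
  if [ -n "$missing" ]; then
    printf 'Missing required tools:\n%s\n' "$missing" >&2
    return 1
  fi
  echo "All required tools found"
}
```

Calling `check_monitoring_env` before starting the stack gives an early, readable failure instead of a confusing error from a later step.
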
### Step 2: Configure the Email Service

Edit `monitoring/alertmanager.yml` and update the email settings:

```yaml
receivers:
  - name: 'default'
    email_configs:
      - to: 'admin@novalon.cn'               # Address that receives alerts
        from: 'alertmanager@novalon.cn'      # Address that sends alerts
        smarthost: 'smtp.resend.com:587'     # SMTP server
        auth_username: 'resend'              # SMTP username
        auth_password: 're_xxxxxxxxxxxxxx'   # Resend API key
        require_tls: true                    # Enable TLS
```

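Because the configuration above contains an API key, it may be preferable to keep the real key out of version control. One possible approach, sketched below, is to commit a template that holds a placeholder and render the final file at deploy time. The placeholder `__RESEND_API_KEY__`, the template path, and the `RESEND_API_KEY` environment variable are all hypothetical names, not part of this project.

```bash
# Read a config template on stdin and write it to stdout with the
# placeholder replaced by the given key.
# Assumes the key contains no '|', '&', or '\' characters, which are
# special in a sed replacement.
render_alertmanager_config() {
  local api_key="$1"
  if [ -z "$api_key" ]; then
    echo "API key is empty" >&2
    return 1
  fi
  sed "s|__RESEND_API_KEY__|${api_key}|g"
}

# Usage (hypothetical paths and variable):
# render_alertmanager_config "$RESEND_API_KEY" \
#   < monitoring/alertmanager.yml.tmpl > monitoring/alertmanager.yml
```
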
### Step 3: Start the Monitoring Services

```bash
# Start all monitoring services
docker-compose -f docker-compose.monitoring.yml up -d

# Check service status
docker-compose -f docker-compose.monitoring.yml ps
```

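Containers can report as running a few seconds before the services inside them accept requests. The sketch below polls each service until it answers, using the ports listed in this guide. It relies on the `/-/ready` endpoints of Prometheus and Alertmanager and on Grafana's `/api/health` endpoint; the function names, the 30 attempts, and the 2-second delay are arbitrary assumptions.

```bash
# Run a probe command repeatedly until it succeeds or the attempts run out.
# Usage: wait_until <attempts> <delay_seconds> <command...>
wait_until() {
  local attempts="$1"
  local delay="$2"
  shift 2
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Probe each service in turn; stops at the first one that never becomes ready.
wait_for_monitoring_stack() {
  wait_until 30 2 curl -fsS http://localhost:9090/-/ready &&
    wait_until 30 2 curl -fsS http://localhost:9093/-/ready &&
    wait_until 30 2 curl -fsS http://localhost:3001/api/health
}
```

Running `wait_for_monitoring_stack` right after `up -d` makes scripted setups less likely to fail on a slow start.
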
### Step 4: Open the Monitoring Interfaces

- **Prometheus**: http://localhost:9090
- **Grafana**: http://localhost:3001 (default login: admin/admin)
- **Alertmanager**: http://localhost:9093

## 🔧 Detailed Configuration

### 1. Prometheus Configuration

File location: `monitoring/prometheus.yml`

#### Basic Configuration

```yaml
global:
  scrape_interval: 15s       # How often to scrape targets
  evaluation_interval: 15s   # How often to evaluate rules

scrape_configs:
  - job_name: 'novalon-website'
    static_configs:
      - targets: ['localhost:3000']   # Application address
    metrics_path: '/api/health'       # Health check endpoint; must return metrics in the Prometheus text format
```

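Prometheus can only scrape endpoints that respond in its text exposition format. If `/api/health` returns a JSON status document instead, the target will be reported as down. The sketch below is a rough heuristic for checking a response; the function name is an assumption, and the pattern only looks for one plausible sample line rather than validating the whole payload.

```bash
# Succeed if stdin contains at least one line that looks like a Prometheus
# sample, for example: http_requests_total{method="GET"} 42
looks_like_prometheus_metrics() {
  grep -Eq '^[a-zA-Z_:][a-zA-Z0-9_:]*(\{[^}]*\})?[[:space:]]+([-+]?([0-9.]+([eE][-+]?[0-9]+)?|Inf)|NaN)([[:space:]]+-?[0-9]+)?[[:space:]]*$'
}

# Usage:
# curl -fsS http://localhost:3000/api/health | looks_like_prometheus_metrics \
#   && echo "endpoint looks scrapeable" \
#   || echo "endpoint does not return Prometheus metrics"
```
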
#### Adding More Scrape Targets

```yaml
scrape_configs:
  - job_name: 'novalon-website'
    static_configs:
      - targets: ['localhost:3000']
    metrics_path: '/api/health'

  - job_name: 'node-exporter'       # System metrics
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'postgres-exporter'   # Database metrics
    static_configs:
      - targets: ['localhost:9187']
```

### 2. Alert Rule Configuration

File location: `monitoring/alerts.yml`

Each rule below belongs in the `rules:` list of a group under the file's top-level `groups:` key.

#### Service Availability Alert

```yaml
- alert: ServiceDown
  expr: up{job="novalon-website"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Service unavailable"
    description: "The Novalon website has not responded for more than 1 minute"
```

#### Error Rate Alert

```yaml
- alert: HighErrorRate
  # Share of 5xx responses among all responses; a bare rate() would be
  # requests per second, not a percentage
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error rate"
    description: "5xx error rate exceeded 5% over the last 5 minutes: {{ $value }}"
```

#### Response Time Alert

```yaml
- alert: HighResponseTime
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High response time"
    description: "P95 response time is above 1 second: {{ $value }}s"
```

#### Resource Usage Alerts

```yaml
- alert: HighCPUUsage
  # rate() of CPU seconds is measured in cores, so 0.8 means 80% of one core
  expr: rate(process_cpu_seconds_total[5m]) > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage"
    description: "CPU usage is above 80% of one core: {{ $value }}"

- alert: HighMemoryUsage
  expr: process_resident_memory_bytes / 1024 / 1024 / 1024 > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High memory usage"
    description: "Memory usage is above 1 GiB: {{ $value }} GiB"
```

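Rule files are easy to break with an indentation slip or a missing `groups:` wrapper, so it can help to validate `monitoring/alerts.yml` before reloading Prometheus. The sketch below does a quick structural check and then hands the file to `promtool check rules` when `promtool` is installed locally. The function name is an assumption, and the `groups:` check is only a heuristic.

```bash
# Validate a Prometheus rule file.
# Usage: check_alert_rules <rule-file>
check_alert_rules() {
  local file="$1"
  # Rule files must start from a top-level "groups:" key.
  if ! grep -Eq '^groups:[[:space:]]*$' "$file"; then
    echo "$file: missing top-level 'groups:' key" >&2
    return 1
  fi
  # Full syntax and expression validation, if promtool is available.
  if command -v promtool >/dev/null 2>&1; then
    promtool check rules "$file"
  else
    echo "promtool not found; skipped full validation" >&2
  fi
}

# Usage:
# check_alert_rules monitoring/alerts.yml
```
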
### 3. Alertmanager Configuration

#### Email Notification Settings

```yaml
receivers:
  - name: 'critical-alerts'
    email_configs:
      - to: 'admin@novalon.cn,ops@novalon.cn'
        from: 'alertmanager@novalon.cn'
        smarthost: 'smtp.resend.com:587'
        auth_username: 'resend'
        auth_password: 're_xxxxxxxxxxxxxx'
        require_tls: true
        headers:
          Subject: '🚨 CRITICAL: Novalon Website Alert'
          X-Priority: '1'   # High-priority email
```

#### Alert Routing

```yaml
route:
  receiver: 'default'      # Fallback receiver; the root route requires one
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s          # Wait before sending the first notification for a new group
  group_interval: 10s      # Wait before notifying about new alerts in an existing group
  repeat_interval: 12h     # Wait before resending a notification

  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      continue: true       # Keep matching the routes that follow

    - match:
        severity: warning
      receiver: 'warning-alerts'   # Must also be defined under receivers
```

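To see which receiver a given set of labels would reach without firing a real alert, `amtool` offers a `config routes test` subcommand. The wrapper below is a small sketch around it; the function name is an assumption, and it expects `amtool` to be available on `PATH` (it can also be run inside the Alertmanager container).

```bash
# Print the receiver(s) an alert with the given labels would be routed to.
# Usage: show_route <config-file> <label=value>...
show_route() {
  local config="$1"
  shift
  if ! command -v amtool >/dev/null 2>&1; then
    echo "amtool not found; install it or run it inside the Alertmanager container" >&2
    return 127
  fi
  amtool config routes test --config.file="$config" "$@"
}

# Usage:
# show_route monitoring/alertmanager.yml severity=critical
# show_route monitoring/alertmanager.yml severity=warning
```
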
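#### Silencing and Inhibition Rules (Maintenance Windows)

```yaml
# Inhibition rule: while a critical alert is firing, suppress ServiceDown
# notifications for the same instance.
# Note: inhibit_rules are not time-based. For a temporary maintenance window,
# create a silence in the Alertmanager UI or with amtool instead.
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      alertname: 'ServiceDown'
    equal: ['instance']
```
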
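A time-limited silence for a maintenance window can also be created through the Alertmanager v2 HTTP API by posting a JSON document to `/api/v2/silences`. The sketch below only builds that document; the function name and example values are assumptions, and the arguments are inserted without JSON escaping, so they should not contain quotes or backslashes.

```bash
# Build the JSON body for a silence that matches one alert name.
# Usage: silence_payload <alertname> <starts_at> <ends_at> <created_by> <comment>
# Timestamps are RFC 3339 strings such as 2024-01-01T00:00:00Z.
silence_payload() {
  local alertname="$1"
  local starts_at="$2"
  local ends_at="$3"
  local created_by="$4"
  local comment="$5"
  printf '{"matchers":[{"name":"alertname","value":"%s","isRegex":false}],"startsAt":"%s","endsAt":"%s","createdBy":"%s","comment":"%s"}\n' \
    "$alertname" "$starts_at" "$ends_at" "$created_by" "$comment"
}

# Usage:
# silence_payload ServiceDown 2024-01-01T00:00:00Z 2024-01-01T02:00:00Z ops "Planned maintenance" |
#   curl -fsS -X POST http://localhost:9093/api/v2/silences \
#     -H 'Content-Type: application/json' --data-binary @-
```

The silence expires on its own at `endsAt`, so nothing has to be reverted in the configuration file once the maintenance is over.
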
## 📊 Dashboard Configuration

### Grafana Dashboard Setup

#### 1. Log In to Grafana

Open http://localhost:3001
- Username: admin
- Password: admin

#### 2. Add the Prometheus Data Source

1. Go to Configuration → Data Sources
2. Click "Add data source"
3. Select "Prometheus"
4. Configure:
   - Name: Prometheus
   - URL: http://prometheus:9090
   - Access: Server (default)
5. Click "Save & Test"

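The same data source can be created without the UI by posting to Grafana's `/api/datasources` endpoint, which is handy for repeatable setups. The sketch below uses the values from the steps above and the default `admin/admin` login; the function names are assumptions, and `"access": "proxy"` is the API value that corresponds to the "Server" access mode.

```bash
# Build the JSON body describing a Prometheus data source.
# Usage: datasource_payload <name> <url>
datasource_payload() {
  local name="$1"
  local url="$2"
  printf '{"name":"%s","type":"prometheus","url":"%s","access":"proxy","isDefault":true}\n' \
    "$name" "$url"
}

# Create the data source through the Grafana HTTP API.
create_prometheus_datasource() {
  datasource_payload "Prometheus" "http://prometheus:9090" |
    curl -fsS -X POST http://localhost:3001/api/datasources \
      -u admin:admin \
      -H 'Content-Type: application/json' \
      --data-binary @-
}
```
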
#### 3. Import the Dashboard

1. Go to Dashboards → Import
2. Upload `monitoring/grafana-dashboard.json`
3. Select the Prometheus data source
4. Click "Import"

#### 4. Customize the Dashboard

Create panels for the key metrics:

**HTTP Requests Panel**
```json
{
  "title": "HTTP Requests",
  "targets": [
    {
      "expr": "rate(http_requests_total[5m])",
      "legendFormat": "{{ method }} {{ status }}"
    }
  ],
  "type": "graph"
}
```

**Response Time Panel**
```json
{
  "title": "Response Time (P95)",
  "targets": [
    {
      "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
      "legendFormat": "P95"
    }
  ],
  "type": "graph"
}
```

**Error Rate Panel**
```json
{
  "title": "Error Rate",
  "targets": [
    {
      "expr": "rate(http_requests_total{status=~\"5..\"}[5m])",
      "legendFormat": "5xx Errors"
    }
  ],
  "type": "graph"
}
```

## 🧪 Testing Alerts

### 1. Test the Service Unavailable Alert

```bash
# Stop the application
npm stop

# Wait at least 1 minute, then check Alertmanager
# (the v1 API has been removed from recent Alertmanager releases; use v2)
curl http://localhost:9093/api/v2/alerts

# Restore the service
npm start
```

### 2. Test the High Error Rate Alert

```bash
# Simulate a high error rate
for i in {1..100}; do
  curl -X POST http://localhost:3000/api/test/error
done

# Check the 5xx request rate in Prometheus
# (-G with --data-urlencode avoids hand-encoding the PromQL expression)
curl -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=rate(http_requests_total{status=~"5.."}[5m])'
```

### 3. Test Email Notifications

```bash
# Send a test alert
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[
    {
      "labels": {
        "alertname": "TestAlert",
        "severity": "warning"
      },
      "annotations": {
        "description": "This is a test alert"
      }
    }
  ]'
```

## 🔧 Troubleshooting

### Problem 1: Services Fail to Start

```bash
# Check the Docker logs
docker-compose -f docker-compose.monitoring.yml logs

# Check whether the ports are already in use
netstat -tulpn | grep -E '3000|9090|3001|9093'

# Restart the services
docker-compose -f docker-compose.monitoring.yml restart
```

### Problem 2: Alerts Are Not Sent

```bash
# Check the Alertmanager configuration
docker-compose -f docker-compose.monitoring.yml exec alertmanager \
  cat /etc/alertmanager/alertmanager.yml

# Check the Alertmanager logs
docker-compose -f docker-compose.monitoring.yml logs alertmanager

# Test the SMTP connection
telnet smtp.resend.com 587
```

### Problem 3: Grafana Cannot Connect to Prometheus

```bash
# Check whether Prometheus is running
docker-compose -f docker-compose.monitoring.yml ps prometheus

# Test the Prometheus API
curl http://localhost:9090/api/v1/status/config

# Check network connectivity from the Grafana container
docker-compose -f docker-compose.monitoring.yml exec grafana \
  ping prometheus
```

### Problem 4: Metrics Collection Fails

```bash
# Check the application's health check endpoint
curl http://localhost:3000/api/health

# Check the Prometheus targets
curl http://localhost:9090/api/v1/targets

# View the Prometheus logs
docker-compose -f docker-compose.monitoring.yml logs prometheus
```

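The `/api/v1/targets` response is a large JSON document, so a quick count of unhealthy targets can be easier to read than the raw output. The sketch below counts targets whose `health` field is `down`. It assumes the compact JSON that Prometheus returns (no spaces around the colon) and uses `grep` rather than a JSON parser, so treat it as a rough check; the function name is an assumption.

```bash
# Read a /api/v1/targets response on stdin and print how many targets are down.
count_down_targets() {
  # grep exits non-zero when nothing matches; "|| true" keeps the count at 0
  # without failing under "set -e" or "pipefail".
  { grep -o '"health":"down"' || true; } | wc -l | tr -d ' '
}

# Usage:
# curl -fsS http://localhost:9090/api/v1/targets | count_down_targets
```
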
## 📈 Best Practices

### 1. Alert Severity Levels

- **Critical**: Requires immediate action; affects the user experience
- **Warning**: Needs attention but does not affect core functionality
- **Info**: Informational alerts, kept for the record

### 2. Alert Frequency Control

- Avoid alert storms
- Use a reasonable `for` duration
- Set an appropriate `repeat_interval`

### 3. Choosing Metrics

- **Key metrics**: Must be monitored
- **Important metrics**: Recommended
- **Auxiliary metrics**: Optional

### 4. Performance Tuning

- Tune the Prometheus scrape interval
- Configure a data retention policy
- Optimize PromQL queries

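When choosing a retention policy, the Prometheus documentation suggests a rough sizing rule: needed disk space is approximately retention time in seconds × ingested samples per second × bytes per sample, with about 1 to 2 bytes per sample. The sketch below applies that arithmetic; the function name and the example ingestion rate are assumptions.

```bash
# Estimate TSDB disk usage in bytes.
# Usage: estimate_disk_bytes <retention_days> <samples_per_second> [bytes_per_sample]
# bytes_per_sample defaults to 2, the upper end of the documented 1-2 byte range.
estimate_disk_bytes() {
  local retention_days="$1"
  local samples_per_second="$2"
  local bytes_per_sample="${3:-2}"
  echo $((retention_days * 86400 * samples_per_second * bytes_per_sample))
}

# Example: 15 days of retention at 1000 samples per second
# estimate_disk_bytes 15 1000
```

Retention itself is set with the `--storage.tsdb.retention.time` flag on the Prometheus command line, which defaults to `15d`.
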
## 📞 Contact Support

If you run into problems, contact:
- Operations team: ops@novalon.cn
- Business inquiries: contact@novalon.cn