# Deployment and Operations Guide

> Document ID: GYM-OPS-DEPLOY-001
> Version: v1.0
> Date: 2026-03-04
> Author: 张翔
> Status: Released

---

## Revision History

| Version | Date | Author | Changes |
| ---- | ---------- | ---- | ------------------ |
| v1.0 | 2026-03-04 | 张翔 | Created the deployment and operations guide |

---

## References

- Gym Management System Technical Architecture Design Document (GYM-HLD-TECH-001)
- Gym Management System Reactive Programming Standards (GYM-STD-REACTIVE-001)
- Docker official documentation
- Docker Compose official documentation

---

## 1. Deployment Architecture

### 1.1 Deployment Topology

```mermaid
flowchart TB
    subgraph topology["Deployment Topology"]
        A["User layer<br/>• Member mini-program<br/>• Coach app<br/>• Admin console (PC)"]
        B["Load-balancing layer (Nginx)<br/>• Load balancing<br/>• SSL termination<br/>• Static assets<br/>• Rate limiting"]
        C["Application layer (Docker Compose)<br/>• gym-manage application<br/>• postgres database<br/>• redis cache<br/>• rabbitmq message queue<br/>• elasticsearch search engine<br/>• prometheus monitoring<br/>• grafana visualization<br/>• kibana log visualization"]
        D["Monitoring layer (Prometheus + Grafana)<br/>• Metrics collection<br/>• Alerting rules<br/>• Dashboards"]
    end

    A --> B
    B --> C
    C --> D
```

### 1.2 Server Configuration

#### 1.2.1 Production Environment

| Component | CPU | Memory | Disk | Purpose |
|------|------|------|------|------|
| **Application server** | 4 cores | 8GB | 100GB | Runs the application |
| **Database server** | 8 cores | 16GB | 500GB | PostgreSQL |
| **Cache server** | 2 cores | 4GB | 50GB | Redis |
| **Message queue server** | 2 cores | 4GB | 100GB | RabbitMQ |
| **Search server** | 4 cores | 8GB | 200GB | Elasticsearch |
| **Monitoring server** | 2 cores | 4GB | 50GB | Prometheus + Grafana |

**Recommended rollout**:
- Initial stage: application, database, and cache on a single server (8 cores, 16GB)
- Mid-term: application on its own server (4 cores, 8GB), database on its own server (8 cores, 16GB)
- Long-term: every component on its own server for higher availability

#### 1.2.2 Development Environment

| Component | CPU | Memory | Disk | Purpose |
|------|------|------|------|------|
| **Development server** | 4 cores | 8GB | 100GB | Development and testing |

---

## 2. Environment Preparation

### 2.1 System Requirements

#### 2.1.1 Operating System

- **Recommended**: Ubuntu 20.04 LTS / 22.04 LTS
- **Compatible**: CentOS 7+ / Debian 10+
- **Kernel version**: >= 4.15

#### 2.1.2 Software Dependencies

| Software | Version | Purpose |
|------|------|------|
| **Docker** | 24.x+ | Containerized deployment |
| **Docker Compose** | 2.20.x+ | Container orchestration |
| **Git** | 2.30+ | Version control |
| **JDK** | 17+ | Runtime |
| **Maven** | 3.9.x+ | Project build |

### 2.2 Installing the Environment

#### 2.2.1 Install Docker

```bash
# Ubuntu/Debian
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Start the Docker service and enable it at boot
sudo systemctl start docker
sudo systemctl enable docker

# Verify the installation
docker --version
docker info
```

#### 2.2.2 Install Docker Compose

```bash
# Download Docker Compose
sudo curl -L "https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose

# Make it executable
sudo chmod +x /usr/local/bin/docker-compose

# Verify the installation
docker-compose --version
```

#### 2.2.3 Install the JDK

```bash
# Ubuntu/Debian
sudo apt update
sudo apt install -y openjdk-17-jdk

# Verify the installation
java -version
```

#### 2.2.4 Install Maven

```bash
# Download Maven
wget https://dlcdn.apache.org/maven/maven-3/3.9.5/binaries/apache-maven-3.9.5-bin.tar.gz

# Extract
tar -xzf apache-maven-3.9.5-bin.tar.gz

# Move to /opt
sudo mv apache-maven-3.9.5 /opt/maven

# Add Maven to PATH
echo 'export PATH=/opt/maven/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

# Verify the installation
mvn -version
```

---

## 3. Deployment Procedure

### 3.1 Deploying the Code

#### 3.1.1 Clone the Repository

```bash
# Clone the repository
git clone <repository-url>
cd gym-manage

# List branches
git branch -a

# Switch to the production branch
git checkout production

# Pull the latest code
git pull origin production
```

#### 3.1.2 Configure Environment Variables

```bash
# Copy the environment variable template
cp .env.example .env

# Edit the environment variables
vim .env
```

**Example `.env` file**:

```bash
# Database
DB_USERNAME=postgres
DB_PASSWORD=your-strong-password

# Redis
REDIS_PASSWORD=your-strong-password

# RabbitMQ
MQ_USERNAME=admin
MQ_PASSWORD=your-strong-password

# Grafana
GRAFANA_USER=admin
GRAFANA_PASSWORD=your-strong-password

# Spring
SPRING_PROFILES_ACTIVE=prod

# JVM (reactive-programming best practices)
JAVA_OPTS=-Xms512m -Xmx1024m -XX:+UseZGC -XX:ZAllocationSpikeTolerance=5 -XX:+UnlockExperimentalVMOptions -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
```

#### 3.1.3 Build the Image

```bash
# Build the application image
docker-compose build gym-manage

# List the images
docker images | grep gym-manage
```

### 3.2 Deploying the Services

#### 3.2.1 Start All Services

```bash
# Start all services
docker-compose up -d

# Check service status
docker-compose ps

# Follow the application logs
docker-compose logs -f gym-manage
```

#### 3.2.2 Start a Single Service

```bash
# Start the database
docker-compose up -d postgres

# Start the application
docker-compose up -d gym-manage

# Follow the application logs
docker-compose logs -f gym-manage
```

#### 3.2.3 Health Checks

```bash
# Check application health
curl http://localhost:8080/actuator/health

# Check the database connection
docker-compose exec postgres pg_isready -U postgres

# Check the Redis connection
docker-compose exec redis redis-cli ping

# Check the RabbitMQ connection
# (uses the credentials from .env; load them into the shell first: set -a; . ./.env; set +a)
curl -u "${MQ_USERNAME}:${MQ_PASSWORD}" http://localhost:15672/api/overview
```

### 3.3 Database Initialization

#### 3.3.1 Create the Database

```bash
# Connect to PostgreSQL (the statements below are entered at the psql prompt)
docker-compose exec postgres psql -U postgres

-- Create the database
CREATE DATABASE gym_manage;

-- Create the user
CREATE USER gym_manage WITH PASSWORD 'your-password';

-- Grant privileges
GRANT ALL PRIVILEGES ON DATABASE gym_manage TO gym_manage;

-- Quit
\q
```

#### 3.3.2 Run the Initialization Script

```bash
# Run the initialization script
docker-compose exec -T postgres psql -U postgres -d gym_manage < sql/init.sql
```

---

## 4. Updating a Deployment

### 4.1 Code Updates

#### 4.1.1 Pull the Latest Code

```bash
# Pull the latest code
git pull origin production

# Review recent changes
git log --oneline -5
```

#### 4.1.2 Rebuild

```bash
# Stop the services
docker-compose down

# Rebuild the image
docker-compose build gym-manage

# Start the services
docker-compose up -d
```

### 4.2 Rolling Updates

#### 4.2.1 Zero-Downtime Update

```bash
# Start a new instance alongside the old one
# (--no-recreate keeps the running container untouched; the service must not
#  publish a fixed host port, otherwise the second instance cannot start)
docker-compose up -d --scale gym-manage=2 --no-recreate

# Wait for the new instance to become ready
sleep 30

# Scale back to one instance; check with `docker-compose ps` that the
# remaining container is the new one
docker-compose up -d --scale gym-manage=1 --no-recreate
```
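
The fixed `sleep 30` above can be too short on a slow start and wastefully long on a fast one. A minimal sketch of a polling alternative follows; `wait_until` is a hypothetical helper written for this guide (not part of Docker or Compose), in POSIX shell, that retries any command until it succeeds or a timeout expires.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it exits 0 (returns 0) or until
# TIMEOUT_SECONDS have elapsed (returns 1).
wait_until() {
  wu_timeout=$1
  wu_interval=$2
  shift 2
  wu_deadline=$(( $(date +%s) + wu_timeout ))
  while :; do
    # Try the command; discard its output so only the exit code matters
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Give up once the deadline has passed
    if [ "$(date +%s)" -ge "$wu_deadline" ]; then
      return 1
    fi
    sleep "$wu_interval"
  done
}
```

Assuming the health endpoint from section 3.2.3, the wait step could then read `wait_until 120 5 curl -fsS http://localhost:8080/actuator/health || exit 1`, so the old instance is only removed after the new one reports healthy.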

### 4.3 Rolling Back

#### 4.3.1 Quick Rollback

```bash
# Go back to the previous commit
git checkout HEAD~1

# Rebuild
docker-compose build gym-manage

# Start the services
docker-compose up -d
```

#### 4.3.2 Rolling Back with a Docker Image

```bash
# List the available images
docker images | grep gym-manage

# Start the service from the previous image
# (assumes the gym-manage image tag in docker-compose.yml has been set to the previous version)
docker-compose up -d --no-deps gym-manage
```

---

## 5. Monitoring and Operations

### 5.1 Monitoring Stack

#### 5.1.1 Prometheus

**URL**: http://your-server:9090

**Main functions**:
- Metrics collection
- Data storage
- Alerting rules
- Query interface

#### 5.1.2 Grafana

**URL**: http://your-server:3000

**Default account**:
- Username: admin
- Password: admin123 (default only; production deployments should use the `GRAFANA_USER` / `GRAFANA_PASSWORD` values from `.env`)

**Main functions**:
- Data visualization
- Dashboard configuration
- Alert notifications
- User management

#### 5.1.3 Kibana

**URL**: http://your-server:5601

**Main functions**:
- Log search
- Log analysis
- Charts and visualizations
- Alert configuration

### 5.2 Log Management

#### 5.2.1 Application Logs

```bash
# Follow logs in real time
docker-compose logs -f gym-manage

# Show the last 100 lines
docker-compose logs --tail=100 gym-manage

# Show logs since a given time
docker-compose logs --since 2026-01-01T00:00:00 gym-manage
```

#### 5.2.2 Log Files

```bash
# Follow the log file
tail -f logs/gym-manage.log

# Show error entries
grep ERROR logs/gym-manage.log

# Count error entries
grep -c ERROR logs/gym-manage.log
```

### 5.3 Alerting

#### 5.3.1 Alerting Rules

**File location**: `monitoring/alerts.yml`

**Alert types**:
- High error rate
- High response time
- High memory usage
- Database connection pool exhausted
- Low cache hit rate

#### 5.3.2 Alert Notifications

**Notification channels**:
- Email
- DingTalk
- WeCom (企业微信)
- SMS

**Configuration example**:

```yaml
# alertmanager.yml (excerpt)
receivers:
  - name: 'email'
    email_configs:
      - to: 'your-email@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'your-email@example.com'
        auth_password: 'your-password'
```

---
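
## 6. Performance Tuning

### 6.1 Application Tuning

#### 6.1.1 JVM Parameters

```bash
# Recommended production parameters (reactive-programming best practices)
JAVA_OPTS=-Xms1024m -Xmx2048m -XX:+UseZGC -XX:ZAllocationSpikeTolerance=5 -XX:+UnlockExperimentalVMOptions -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/app/logs/heapdump.hprof
```

**Parameter reference**:
- `-Xms`: initial heap size
- `-Xmx`: maximum heap size
- `-XX:+UseZGC`: use the ZGC garbage collector (recommended for reactive workloads)
- `-XX:ZAllocationSpikeTolerance`: tolerance for allocation spikes
- `-XX:+UnlockExperimentalVMOptions`: unlock experimental options
- `-XX:+UseTransparentHugePages`: use transparent huge pages
- `-XX:+AlwaysPreTouch`: touch (pre-commit) heap memory at startup
- `-XX:+HeapDumpOnOutOfMemoryError`: write a heap dump on out-of-memory errors
- `-XX:HeapDumpPath`: heap dump file path

**Advantages of ZGC**:
- Low latency: GC pauses are typically < 10ms
- High throughput: suited to the high-concurrency scenarios of reactive programming
- Large heaps: supports heaps in the terabyte range
- Adaptive: tunes its own GC parameters

#### 6.1.2 Connection Pool Tuning

```yaml
# application-prod.yml (reactive-programming best practices)
spring:
  r2dbc:
    pool:
      initial-size: 5                  # Initial connections (reactive workloads need few)
      max-size: 20                     # Maximum connections (reactive workloads need few)
      max-idle-time: 30m               # Maximum idle time
      max-life-time: 1h                # Maximum connection lifetime
      max-acquire-time: 10s            # Timeout for acquiring a connection (a longer timeout is recommended for reactive workloads)
      max-create-connection-time: 30s  # Maximum time to create a connection
      max-validation-time: 5s          # Maximum time to validate a connection
```

**Connection pool notes**:
- A reactive application can sustain high concurrency with a small pool (5-20 connections)
- The acquire timeout is set to 10s so that short bursts wait for a connection instead of failing fast
- Pooled connections are reused, which avoids the cost of creating new ones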
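
As a rough sanity check of the 5-20 range, Little's law says the number of connections busy at once is about the query rate multiplied by the average query time. The sketch below applies that formula; the function name `estimate_pool_size` and the sample numbers are illustrative assumptions, not measurements from this system.

```shell
# estimate_pool_size QPS AVG_QUERY_MS
# Prints the approximate number of connections in use at once,
# rounded up: ceil(QPS * AVG_QUERY_MS / 1000).
estimate_pool_size() {
  echo $(( ($1 * $2 + 999) / 1000 ))
}

estimate_pool_size 1000 10   # 1000 queries/s at 10 ms each -> prints 10
estimate_pool_size 2000 8    # 2000 queries/s at 8 ms each  -> prints 16
```

Both hypothetical workloads fit inside `max-size: 20`. If the estimate for real traffic approaches the maximum, slow queries are usually worth investigating before the pool is enlarged.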

### 6.2 Database Tuning

#### 6.2.1 PostgreSQL Configuration (Tuned for Reactive Workloads)

```bash
# postgresql.conf (reactive-programming best practices)
# Memory
shared_buffers = 512MB                # Shared buffers (a larger value is recommended)
effective_cache_size = 2GB            # Effective cache size
maintenance_work_mem = 128MB          # Memory for maintenance operations
work_mem = 32MB                       # Per-operation working memory (a larger value is recommended)

# WAL
wal_buffers = 64MB                    # WAL buffers
min_wal_size = 2GB                    # Minimum WAL size
max_wal_size = 8GB                    # Maximum WAL size
checkpoint_completion_target = 0.9    # Checkpoint completion target

# Concurrency
max_connections = 200                 # Maximum connections (reactive workloads need few)
max_worker_processes = 8              # Maximum worker processes
max_parallel_workers_per_gather = 4   # Maximum parallel workers per query
max_parallel_workers = 8              # Maximum parallel workers

# I/O
random_page_cost = 1.1                # Random page cost (tuned for SSD)
effective_io_concurrency = 300        # Effective I/O concurrency (tuned for SSD)
# max_io_concurrency = 200            # Maximum I/O concurrency. Disabled: this is not a PostgreSQL setting,
                                      # and unknown settings stop the server from starting;
                                      # maintenance_io_concurrency (PostgreSQL 13+) may be what was meant

# Query planning
default_statistics_target = 100       # Default statistics target
from_collapse_limit = 8               # FROM-clause collapse limit
join_collapse_limit = 8               # JOIN-clause collapse limit

# Logging
log_min_duration_statement = 1000     # Log statements that run longer than 1s
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '  # Log line prefix
log_checkpoints = on                  # Log checkpoints
log_connections = on                  # Log connections
log_disconnections = on               # Log disconnections
log_lock_waits = on                   # Log lock waits
```

#### 6.2.2 Index Tuning

```sql
-- Column statistics (distinct values and physical correlation) that guide index choices
SELECT schemaname, tablename, attname, n_distinct, correlation
FROM pg_stats
WHERE schemaname = 'public'
ORDER BY correlation DESC;

-- Slowest queries (requires the pg_stat_statements extension)
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```

### 6.3 Cache Tuning

#### 6.3.1 Redis Configuration

```bash
# redis.conf
maxmemory 2gb
maxmemory-policy allkeys-lru
save 900 1
save 300 10
save 60 10000
```

**Parameter reference**:
- `maxmemory`: maximum memory Redis may use
- `maxmemory-policy`: eviction policy once the limit is reached
- `save`: RDB persistence rule (`save <seconds> <changes>`)
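
Each `save <seconds> <changes>` line asks Redis to write an RDB snapshot once at least `<changes>` keys have changed and at least `<seconds>` seconds have passed since the last save; any one rule is enough. The sketch below is a simplified model of that decision for the three rules above. `rdb_snapshot_due` is a hypothetical helper for illustration only and is not how Redis is implemented.

```shell
# rdb_snapshot_due SECONDS_SINCE_LAST_SAVE CHANGES_SINCE_LAST_SAVE
# Returns 0 when any "save <seconds> <changes>" rule is satisfied, 1 otherwise.
rdb_snapshot_due() {
  rs_elapsed=$1
  rs_changes=$2
  # The rules mirror the redis.conf above, written as seconds:changes
  for rs_rule in "900:1" "300:10" "60:10000"; do
    rs_seconds=${rs_rule%%:*}
    rs_min_changes=${rs_rule##*:}
    if [ "$rs_elapsed" -ge "$rs_seconds" ] && [ "$rs_changes" -ge "$rs_min_changes" ]; then
      return 0
    fi
  done
  return 1
}
```

For example, `rdb_snapshot_due 120 50` fails (50 changes in two minutes satisfies no rule), while `rdb_snapshot_due 300 10` succeeds. Under this configuration a lightly written cache may therefore go up to 15 minutes between snapshots.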

---

## 7. Troubleshooting

### 7.1 Common Problems

#### 7.1.1 Application Fails to Start

**Symptom**: the application does not start

**Diagnostic steps**:

```bash
# Check the application logs
docker-compose logs gym-manage

# Check the configuration file
cat application-prod.yml

# Check the resolved environment variables
docker-compose config

# Check the database connection
docker-compose exec postgres pg_isready -U postgres
```

**Common causes**:
- Database connection failure
- Configuration file errors
- Port conflicts
- Insufficient memory

#### 7.1.2 Database Connection Failure

**Symptom**: the application cannot connect to the database

**Diagnostic steps**:

```bash
# Check the database container status
docker-compose ps postgres

# Check the database logs
docker-compose logs postgres

# Test the database connection
docker-compose exec postgres psql -U postgres -d gym_manage -c "SELECT 1;"

# Check network connectivity
docker-compose exec gym-manage ping postgres
```

**Common causes**:
- Database not running
- Network unreachable
- Wrong username or password
- Database does not exist

#### 7.1.3 Performance Degradation

**Symptom**: response times increase

**Diagnostic steps**:

```bash
# Check the application logs for slow queries
docker-compose logs gym-manage | grep "Slow query"

# Check slow queries in the database
docker-compose exec postgres psql -U postgres -d gym_manage -c "SELECT * FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"

# Check system resources
top
htop

# Check the number of database connections
docker-compose exec postgres psql -U postgres -d gym_manage -c "SELECT count(*) FROM pg_stat_activity;"
```

**Common causes**:
- Slow queries
- Database connection pool exhausted
- Low cache hit rate
- Insufficient system resources

### 7.2 Emergency Response

#### 7.2.1 Restart Services

```bash
# Restart the application
docker-compose restart gym-manage

# Restart the database
docker-compose restart postgres

# Restart all services
docker-compose restart
```

#### 7.2.2 Roll Back

```bash
# Go back to the previous commit
git checkout HEAD~1

# Rebuild
docker-compose build gym-manage

# Start the services
docker-compose up -d
```

#### 7.2.3 Scale Out

```bash
# Add application instances
docker-compose up -d --scale gym-manage=2

# Add database resources
# Do not run `docker-compose up -d --scale postgres=2`: two PostgreSQL containers
# cannot safely share one data volume. Raise the container's CPU and memory limits
# or move the database to a larger server (see 1.2.1) instead.
```

---

## 8. Backup and Restore

### 8.1 Data Backup

#### 8.1.1 Database Backup

```bash
# Back up the database
# (the timestamp is computed once so both commands refer to the same file;
#  -T disables the pseudo-TTY so the redirected dump stays clean)
BACKUP_FILE="backup/gym_manage_$(date +%Y%m%d_%H%M%S).sql"
docker-compose exec -T postgres pg_dump -U postgres gym_manage > "${BACKUP_FILE}"

# Compress the backup file
gzip "${BACKUP_FILE}"
```

#### 8.1.2 Scheduled Backup

```bash
# Edit the crontab
crontab -e

# cron does not start in the project directory, so change into it first
# (/path/to/gym-manage is a placeholder for the actual checkout location)

# Back up the database every day at 02:00
0 2 * * * cd /path/to/gym-manage && docker-compose exec -T postgres pg_dump -U postgres gym_manage > backup/gym_manage_$(date +\%Y\%m\%d_\%H\%M\%S).sql

# Every Sunday at 03:00, delete backups older than 7 days
0 3 * * 0 cd /path/to/gym-manage && find backup -name "gym_manage_*.sql" -mtime +7 -delete
```

### 8.2 Data Restore

#### 8.2.1 Database Restore

```bash
# Stop the application
docker-compose stop gym-manage

# Restore the database
docker-compose exec -T postgres psql -U postgres gym_manage < backup/gym_manage_20260101_020000.sql

# Start the application
docker-compose start gym-manage
```

---

## 9. Security Hardening

### 9.1 Network Security

#### 9.1.1 Firewall

```bash
# Configure the firewall
sudo ufw allow 22/tcp    # SSH
sudo ufw allow 80/tcp    # HTTP
sudo ufw allow 443/tcp   # HTTPS
sudo ufw enable
```

#### 9.1.2 SSL Certificates

```bash
# Obtain a free SSL certificate from Let's Encrypt
sudo apt install certbot
sudo certbot certonly --standalone -d your-domain.com

# Configure SSL in Nginx
vim nginx/nginx.conf
```

### 9.2 Application Security

#### 9.2.1 Protecting Sensitive Data

```bash
# Generate strong random passwords for the environment variables
export DB_PASSWORD=$(openssl rand -base64 32)
export REDIS_PASSWORD=$(openssl rand -base64 32)
export MQ_PASSWORD=$(openssl rand -base64 32)
```

#### 9.2.2 Access Control

```yaml
# application-prod.yml
spring:
  security:
    user:
      name: admin
      password: ${ADMIN_PASSWORD}
      roles: ADMIN
```

---

## 6. Detailed Monitoring and Alerting Configuration

### 6.1 Prometheus Configuration

#### 6.1.1 prometheus.yml

**File location**: `monitoring/prometheus.yml`

```yaml
global:
  scrape_interval: 15s       # Scrape interval
  evaluation_interval: 15s   # Rule evaluation interval
  external_labels:
    monitor: 'gym-manage'
    environment: 'production'

# Alerting rules
rule_files:
  - "alerts.yml"

# Alertmanager
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Scrape configuration
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: 'prometheus-server'

  # Application
  - job_name: 'gym-manage'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 10s
    static_configs:
      - targets: ['gym-manage:8080']
        labels:
          application: 'gym-manage'
          environment: 'production'

  # Node exporter
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          instance: 'server-node'

  # Redis exporter
  - job_name: 'redis-exporter'
    static_configs:
      - targets: ['redis-exporter:9121']
        labels:
          instance: 'redis-server'

  # PostgreSQL exporter
  - job_name: 'postgres-exporter'
    static_configs:
      - targets: ['postgres-exporter:9187']
        labels:
          instance: 'postgres-server'

  # RabbitMQ exporter
  - job_name: 'rabbitmq-exporter'
    static_configs:
      - targets: ['rabbitmq-exporter:9419']
        labels:
          instance: 'rabbitmq-server'
```

#### 6.1.2 alerts.yml Alerting Rules

**File location**: `monitoring/alerts.yml`

```yaml
groups:
  - name: gym-manage-alerts
    interval: 30s
    rules:
      # Application availability
      - alert: ApplicationDown
        expr: up{job="gym-manage"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Application unavailable"
          description: "Application {{ $labels.instance }} has been down for more than 1 minute"

      # High error rate
      - alert: HighErrorRate
        expr: sum(rate(http_server_requests_seconds_count{status=~"5..", job="gym-manage"}[5m])) / sum(rate(http_server_requests_seconds_count{job="gym-manage"}[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate"
          description: "Application error rate is above 5% (current value: {{ $value | humanizePercentage }})"

      # High response time
      - alert: HighResponseTime
        expr: histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket{job="gym-manage"}[5m])) by (le)) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time"
          description: "Application P95 response time is above 1 second (current value: {{ $value | humanizeDuration }})"

      # High memory usage
      - alert: HighMemoryUsage
        expr: (jvm_memory_used_bytes{area="heap", job="gym-manage"} / jvm_memory_max_bytes{area="heap", job="gym-manage"}) > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "JVM heap usage is above 85% (current value: {{ $value | humanizePercentage }})"

      # Imminent out-of-memory
      - alert: OutOfMemory
        expr: (jvm_memory_used_bytes{area="heap", job="gym-manage"} / jvm_memory_max_bytes{area="heap", job="gym-manage"}) > 0.95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Memory almost exhausted"
          description: "JVM heap usage is above 95% (current value: {{ $value | humanizePercentage }})"

      # Database connection pool exhausted
      # The application uses the R2DBC pool (section 6.1.2 of the tuning chapter), so this
      # rule assumes Spring Boot's r2dbc_pool_* metrics; confirm the exact names
      # at /actuator/prometheus before relying on it.
      - alert: DatabaseConnectionPoolExhausted
        expr: r2dbc_pool_acquired_connections{job="gym-manage"} / r2dbc_pool_max_allocated_connections{job="gym-manage"} > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database connection pool exhausted"
          description: "Database connection pool usage is above 90% (current value: {{ $value | humanizePercentage }})"

      # Redis connection failure
      - alert: RedisConnectionFailed
        expr: redis_up{job="redis-exporter"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Redis connection failed"
          description: "Connection to Redis {{ $labels.instance }} failed"

      # PostgreSQL connection failure
      - alert: PostgresConnectionFailed
        expr: pg_up{job="postgres-exporter"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL connection failed"
          description: "Connection to PostgreSQL {{ $labels.instance }} failed"

      # RabbitMQ queue backlog
      - alert: RabbitMQQueueBacklog
        expr: rabbitmq_queue_messages{job="rabbitmq-exporter"} > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Message queue backlog"
          description: "Queue {{ $labels.queue }} holds more than 1000 messages (current value: {{ $value }})"

      # Low disk space
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space"
          description: "Free space on the root partition of {{ $labels.instance }} is below 15% (current value: {{ $value | humanizePercentage }})"

      # High CPU usage
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "CPU usage on {{ $labels.instance }} is above 85% (current value: {{ $value | humanize }}%)"
```

### 6.2 Grafana Dashboards

#### 6.2.1 Application Dashboard

**Dashboard ID**: `gym-manage-overview`

**Main panels**:
1. **Application health**
   - Application up/down status
   - Health check status
   - Uptime

2. **Traffic**
   - QPS (requests per second)
   - Concurrent connections
   - Network throughput

3. **Response time**
   - Average response time
   - P95 response time
   - P99 response time

4. **Error rate**
   - HTTP 5xx error rate
   - HTTP 4xx error rate
   - Business error rate

5. **JVM**
   - Heap usage
   - Non-heap usage
   - GC count and time
   - Thread count

6. **Database connection pool**
   - Active connections
   - Idle connections
   - Pool usage
   - Average connection acquire time

7. **Redis cache**
   - Cache hit rate
   - Number of keys
   - Memory usage
   - Command execution time

8. **Message queue**
   - Messages in queue
   - Publish rate
   - Consume rate
   - Backlog

#### 6.2.2 System Dashboard

**Dashboard ID**: `system-overview`

**Main panels**:
1. **CPU**
   - CPU usage
   - Load average (1/5/15 minutes)
   - Number of cores

2. **Memory**
   - Memory usage
   - Available memory
   - Swap usage

3. **Disk**
   - Disk usage
   - Disk I/O
   - Read/write rates

4. **Network**
   - Network traffic
   - Network connections
   - Network error rate

### 6.3 Alert Notification Configuration

#### 6.3.1 Alertmanager Configuration

**File location**: `monitoring/alertmanager.yml`

```yaml
global:
  # Email
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'your-password'

# Templates
templates:
  - '/etc/alertmanager/templates/*.tmpl'

# Routing
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts: notify immediately
    - match:
        severity: critical
      receiver: 'critical-receiver'
      group_wait: 10s
      repeat_interval: 1h
    # Warning alerts: delayed notification
    - match:
        severity: warning
      receiver: 'warning-receiver'
      group_wait: 5m
      repeat_interval: 4h

# Receivers
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'devops-team@example.com'
        send_resolved: true

  - name: 'critical-receiver'
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
    # DingTalk: Alertmanager has no built-in DingTalk receiver, so alerts are sent
    # to a webhook bridge. The URL assumes a prometheus-webhook-dingtalk container
    # (hypothetical host name) that is configured with the robot URL
    # https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN and YOUR_SECRET.
    webhook_configs:
      - url: 'http://dingtalk-webhook:8060/dingtalk/webhook1/send'
        send_resolved: true
    # WeCom (企业微信)
    wechat_configs:
      - corp_id: 'YOUR_CORP_ID'
        agent_id: 'YOUR_AGENT_ID'
        api_secret: 'YOUR_SECRET'
        to_user: '@all'
        send_resolved: true

  - name: 'warning-receiver'
    email_configs:
      - to: 'dev-team@example.com'
        send_resolved: true

# Inhibition rules
inhibit_rules:
  # When the application is down, suppress its warning-level alerts
  - source_match:
      alertname: 'ApplicationDown'
    target_match:
      severity: 'warning'
    equal: ['instance']
```
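
#### 6.3.2 Alert Escalation Policy

**Escalation rules**:
1. **P0 (Critical)**
   - Immediate notification: DingTalk + WeCom + SMS + phone call
   - No response within 15 minutes: escalate to the technical director
   - No response within 30 minutes: escalate to the CTO

2. **P1 (Warning)**
   - Immediate notification: DingTalk + WeCom
   - No response within 1 hour: escalate to the department manager
   - No response within 2 hours: escalate to the technical director

3. **P2 (Info)**
   - Notification during working hours: email
   - Not handled within 24 hours: upgrade to Warning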
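
The escalation rules can be read as a lookup from alert level and minutes without a response to the person who should be contacted next. The following is a minimal sketch of that lookup, for example for use in an on-call reminder script. The function name `escalation_target` is hypothetical, and "on-call engineer" as the first responder is an assumption, since the policy only names the escalation targets.

```shell
# escalation_target LEVEL MINUTES_WITHOUT_RESPONSE
# Prints who should be handling the alert according to the escalation policy.
escalation_target() {
  case "$1" in
    P0)
      if [ "$2" -ge 30 ]; then echo "CTO"
      elif [ "$2" -ge 15 ]; then echo "technical director"
      else echo "on-call engineer"   # assumed first responder
      fi ;;
    P1)
      if [ "$2" -ge 120 ]; then echo "technical director"
      elif [ "$2" -ge 60 ]; then echo "department manager"
      else echo "on-call engineer"
      fi ;;
    P2)
      # 24 hours = 1440 minutes
      if [ "$2" -ge 1440 ]; then echo "upgrade to P1"
      else echo "on-call engineer"
      fi ;;
    *)
      echo "unknown level: $1" >&2
      return 1 ;;
  esac
}
```

For instance, `escalation_target P0 20` prints `technical director`, and `escalation_target P1 45` prints `on-call engineer` because the one-hour threshold has not been reached.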

#### 6.3.3 On-Call Schedule

**Schedule configuration**:
```yaml
# Working hours
work_hours:
  - Monday to Friday: 09:00-18:00

# On-call staff
on_call_schedule:
  - name: "张三"
    email: "zhangsan@example.com"
    phone: "13800138000"
    schedule: "Monday, Wednesday"
  - name: "李四"
    email: "lisi@example.com"
    phone: "13900139000"
    schedule: "Tuesday, Thursday"
  - name: "王五"
    email: "wangwu@example.com"
    phone: "13700137000"
    schedule: "Friday"

# Weekend on-call
weekend_on_call:
  - name: "On-call team"
    email: "weekend-team@example.com"
    phone: "400-xxx-xxxx"
```

---

## 7. Detailed Backup and Restore Strategy

### 7.1 Backup Strategy

#### 7.1.1 Backup Types

**Full backup**:
- Frequency: daily at 02:00
- Retention: 30 days
- Contents: complete database, configuration files

**Incremental backup**:
- Frequency: hourly
- Retention: 7 days
- Contents: WAL logs, changed data

**Differential backup**:
- Frequency: every 6 hours
- Retention: 7 days
- Contents: changes since the last full backup
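
The retention periods above can be enforced with one small helper that deletes files older than a given number of days. This is a sketch only: `prune_backups` is a hypothetical function name, and the directories in the usage note are the ones that appear in the backup scripts of this chapter.

```shell
# prune_backups DIR NAME_PATTERN RETENTION_DAYS
# Deletes regular files under DIR whose name matches NAME_PATTERN and whose
# modification time is more than RETENTION_DAYS days ago; prints each deleted path.
prune_backups() {
  find "$1" -type f -name "$2" -mtime +"$3" -print -delete
}
```

A daily cron job could then call `prune_backups /backup/postgres '*.sql.gz' 30` for full backups and `prune_backups /backup/wal '*' 7` for archived WAL files, matching the 30-day and 7-day retention periods.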

#### 7.1.2 Backup Contents

**Database backup**:
```bash
#!/bin/bash
# PostgreSQL full backup script
BACKUP_DIR="/backup/postgres"
DATE=$(date +%Y%m%d_%H%M%S)
DB_NAME="gym_manage"
DB_USER="postgres"

# Create the backup directory
mkdir -p "${BACKUP_DIR}"

# Full backup
pg_dump -U "${DB_USER}" -h localhost "${DB_NAME}" | gzip > "${BACKUP_DIR}/${DB_NAME}_${DATE}.sql.gz"

# WAL archiving
# Configure in postgresql.conf:
# wal_level = replica
# archive_mode = on
# archive_command = 'cp %p /backup/wal/%f'

# Remove old backups (keep 30 days)
find "${BACKUP_DIR}" -name "*.sql.gz" -mtime +30 -delete
```

**Configuration file backup**:
```bash
#!/bin/bash
# Back up the application configuration
BACKUP_DIR="/backup/config"
DATE=$(date +%Y%m%d_%H%M%S)

# Create the backup directory
mkdir -p "${BACKUP_DIR}"

# Back up the configuration files
tar -czf "${BACKUP_DIR}/config_${DATE}.tar.gz" application-prod.yml docker-compose.yml nginx/nginx.conf monitoring/prometheus.yml monitoring/alerts.yml

# Back up the environment variables (-T: no pseudo-TTY, so the output redirects cleanly)
docker-compose exec -T gym-manage env > "${BACKUP_DIR}/env_${DATE}.txt"
```

**Data file backup**:
```bash
#!/bin/bash
# Back up Redis data
BACKUP_DIR="/backup/redis"
DATE=$(date +%Y%m%d_%H%M%S)

# Create the backup directory
mkdir -p "${BACKUP_DIR}"

# Trigger an RDB save
docker-compose exec -T redis redis-cli BGSAVE

# Wait for the save to finish
sleep 5

# Copy the RDB file
docker cp gym-manage-redis:/data/dump.rdb "${BACKUP_DIR}/dump_${DATE}.rdb"

# Back up Elasticsearch data through the snapshot REST API
# (assumes a snapshot repository named "backup" is already registered
#  and that Elasticsearch is reachable on localhost:9200)
curl -X PUT "http://localhost:9200/_snapshot/backup/gym_manage_${DATE}?wait_for_completion=true"
```

#### 7.1.3 Backup Verification

**Regular verification**:
- Frequency: every Sunday at 03:00
- Scope: integrity of the backup files
- Method: restore test

```bash
#!/bin/bash
# Backup verification script
BACKUP_DIR="/backup/postgres"
LATEST_BACKUP=$(ls -t "${BACKUP_DIR}"/*.sql.gz | head -1)

# Verify the integrity of the backup file
if gzip -t "${LATEST_BACKUP}"; then
    echo "Backup file is intact: ${LATEST_BACKUP}"
else
    echo "Backup file is corrupted: ${LATEST_BACKUP}"
    # Send an alert
    curl -X POST "https://alert.example.com/backup-failed"
fi

# Restore test (in the test environment)
# gunzip -c "${LATEST_BACKUP}" | psql -U postgres -h test-db gym_manage_test
```

### 7.2 Restore Strategy

#### 7.2.1 Restore Priorities

**P0 - Core services** (RTO ≤ 30 minutes):
1. Database
2. Application service
3. Cache

**P1 - Important services** (RTO ≤ 2 hours):
4. Message queue
5. Search engine
6. Logging system

**P2 - Auxiliary services** (RTO ≤ 4 hours):
7. Monitoring system
8. Reporting system
9. Backup system

#### 7.2.2 Database Restore Procedure

**Full restore procedure**:
```bash
#!/bin/bash
# Database restore script
# Abort on the first failure so the original database is never dropped
# after a failed restore or verification step.
set -euo pipefail

BACKUP_FILE="${1:?usage: $0 <backup-file.sql.gz>}"
DB_NAME="gym_manage"
DB_USER="postgres"

echo "Starting database restore..."

# 1. Stop the application
echo "Stopping the application..."
docker-compose stop gym-manage

# 2. Create a temporary database
echo "Creating the temporary database..."
docker-compose exec -T postgres psql -U "${DB_USER}" -c "CREATE DATABASE ${DB_NAME}_restore;"

# 3. Restore the data
echo "Restoring data..."
gunzip -c "${BACKUP_FILE}" | docker-compose exec -T postgres psql -U "${DB_USER}" "${DB_NAME}_restore"

# 4. Verify the data
echo "Verifying data..."
docker-compose exec -T postgres psql -U "${DB_USER}" -d "${DB_NAME}_restore" -c "SELECT COUNT(*) FROM members;"

# 5. Back up the current database (if it exists)
if docker-compose exec -T postgres psql -U "${DB_USER}" -lqt | cut -d \| -f 1 | grep -qw "${DB_NAME}"; then
    echo "Backing up the current database..."
    docker-compose exec -T postgres pg_dump -U "${DB_USER}" "${DB_NAME}" | gzip > "/backup/emergency_${DB_NAME}_$(date +%Y%m%d_%H%M%S).sql.gz"
fi

# 6. Drop the original database
echo "Dropping the original database..."
docker-compose exec -T postgres psql -U "${DB_USER}" -c "DROP DATABASE IF EXISTS ${DB_NAME};"

# 7. Rename the restored database
echo "Renaming the database..."
docker-compose exec -T postgres psql -U "${DB_USER}" -c "ALTER DATABASE ${DB_NAME}_restore RENAME TO ${DB_NAME};"

# 8. Start the application
echo "Starting the application..."
docker-compose start gym-manage

# 9. Verify the application
echo "Verifying the application..."
sleep 10
curl -f http://localhost:8080/actuator/health

echo "Database restore complete!"
```

#### 7.2.3 Application Restore Procedure

```bash
#!/bin/bash
# Application restore script

echo "Starting application restore..."

# 1. Stop the application
docker-compose stop gym-manage

# 2. Remove the old container
docker-compose rm -f gym-manage

# 3. Pull the latest image
docker-compose pull gym-manage

# 4. Restore the configuration
cp backup/application/application-prod.yml.bak ./config/application-prod.yml

# 5. Start the application
docker-compose up -d gym-manage

# 6. Wait for startup
sleep 30

# 7. Health check
curl -f http://localhost:8080/actuator/health || exit 1

echo "Application restore complete!"
```

#### 7.2.4 Cache Restore Procedure

```bash
#!/bin/bash
# Redis restore script

echo "Starting Redis restore..."

# 1. Stop Redis
docker-compose stop redis

# 2. Remove the old data
# (sh -c makes the glob expand inside the container rather than on the host)
docker-compose run --rm redis sh -c 'rm -rf /data/*'

# 3. Restore the RDB file
LATEST_RDB=$(ls -t /backup/redis/dump_*.rdb | head -1)
cp "${LATEST_RDB}" docker/redis/data/dump.rdb

# 4. Start Redis
docker-compose up -d redis

# 5. Verify
docker-compose exec -T redis redis-cli PING

echo "Redis restore complete!"
```
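
### 7.3 Disaster Recovery

#### 7.3.1 Disaster Recovery Scenarios

**Scenario 1: Single server failure**
- Recovery time: RTO ≤ 1 hour
- Recovery point: RPO ≤ 15 minutes
- Recovery steps:
  1. Switch to the standby server
  2. Restore data from backup
  3. Update DNS records
  4. Verify service availability

**Scenario 2: Data center failure**
- Recovery time: RTO ≤ 4 hours
- Recovery point: RPO ≤ 1 hour
- Recovery steps:
  1. Activate the off-site disaster recovery center
  2. Restore data from off-site backups
  3. Switch traffic to the disaster recovery center
  4. Verify service availability

**Scenario 3: Data corruption or loss**
- Recovery time: RTO ≤ 2 hours
- Recovery point: RPO ≤ 15 minutes
- Recovery steps:
  1. Determine when the data was corrupted
  2. Restore from the last backup taken before the corruption
  3. Apply incremental backups
  4. Verify data integrity

#### 7.3.2 Disaster Recovery Drills

**Drill frequency**:
- Tabletop exercise: monthly
- Live drill: quarterly
- End-to-end drill: every six months

**Drill scope**:
1. Backup restore verification
2. Failover verification
3. Monitoring and alerting verification
4. Communication process verification
5. Documentation update verification

**Drill report**:
- Drill objectives
- Drill process
- Issues found
- Improvement actions
- Owners and deadlines

---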
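
## 10. Summary

### 10.1 Deployment Essentials

1. ✅ One-command deployment with Docker Compose
2. ✅ Health checks and automatic restarts configured
3. ✅ Comprehensive monitoring and alerting
4. ✅ Regular data backups
5. ✅ Security hardening and access control

### 10.2 Operations Essentials

1. ✅ Review logs and monitoring regularly
2. ✅ Handle alerts promptly
3. ✅ Back up data regularly
4. ✅ Keep the system and dependencies up to date
5. ✅ Run regular security audits

### 10.3 Continuous Improvement

1. ✅ Performance monitoring and tuning
2. ✅ Incident postmortems and follow-up improvements
3. ✅ Documentation updates and maintenance
4. ✅ Team training and knowledge sharing
5. ✅ Development of operations automation tooling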