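# Redis Migration Implementation Plan (Full Version: Cache + Queue)

> **Document version:** V2.0  
> **Last updated:** 2025-12-12  
> **Target completion:** 2025-12-18 (7 days)  
> **Owner:** Engineering team  
> **Risk level:** 🟡 Medium (a fallback path exists)  
> **Key change:** The Redis queue moves from "optional" to "required"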

---

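## ⚠️ **Important Note (V2.0 update)**

After a closer analysis, the Redis queue is **not optional**; it is **required for core functionality**:

1. **ASL literature screening**: 1,000 papers take about 2 hours; without a Redis queue the failure rate is > 95%
2. **DC Tool B medical-record extraction**: 1,000 records take 2-3 hours and hit the same problem
3. **SAE instance behavior**: instances scale in automatically after 15 minutes without traffic, so long-running tasks are bound to fail

**The plan is therefore changed to: implement cache and queue together (7 days in total)**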

---

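## 📋 Table of Contents

1. [Background and Goals](#1-background-and-goals)
2. [Current System State](#2-current-system-state)
3. [Redis Configuration](#3-redis-configuration)
4. [Detailed Migration Steps](#4-detailed-migration-steps) (✨ **Updated: now includes the queue**)
5. [Test Plan](#5-test-plan)
6. [Risk Assessment and Mitigation](#6-risk-assessment-and-mitigation)
7. [Rollout Plan](#7-rollout-plan)
8. [Rollback Plan](#8-rollback-plan)
9. [Monitoring and Operations](#9-monitoring-and-operations)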

---

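## 1. Background and Goals

### 1.1 Why migrate?

#### **Current problems**:
1. ❌ **Violates cloud-native guidelines**: the system uses an in-process memory cache, which breaks our own cloud-native development guidelines
2. ❌ **Uncontrolled LLM cost**: the cache is not persistent, so DeepSeek/Qwen API calls are repeated
3. ❌ **Unreliable long-running tasks**: literature-screening tasks that run 30-60 minutes are lost when an SAE instance restarts
4. ❌ **No cross-instance consistency**: after SAE scales out, each instance keeps its own cache
5. ❌ **Poor fit for serverless**: in-memory state is unreliable in a serverless environment

#### **Goals**:
- ✅ **Comply with the architecture guidelines**: use a distributed cache (Redis)
- ✅ **Lower API cost**: persist cached LLM results to avoid repeated calls
- ✅ **Durable tasks**: long-running tasks survive instance restarts
- ✅ **Multi-instance support**: the cache is shared across instances
- ✅ **Smooth transition**: keep a fallback path so the system stays stable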

---

## 2. Current System State

### 2.1 Where the cache is already used

#### **Location 1: HealthCheckService.ts**
```typescript
// File: backend/src/modules/dc/tool-b/services/HealthCheckService.ts
// Line 47: read from the cache
const cached = await cache.get<HealthCheckResult>(cacheKey);

// Line 145: write to the cache
await cache.set(cacheKey, result, 86400);  // 24 hours
```

**Purpose**: caches Excel health-check results  
**Importance**: 🟡 Medium (avoids re-parsing the same Excel file)  
**Size**: ~5 KB per entry  

---

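#### **Location 2: LLM12FieldsService.ts**
```typescript
// File: backend/src/modules/asl/common/llm/LLM12FieldsService.ts
// Line 516: read from the cache
const cached = await cache.get(cacheKey);

// Line 530: write to the cache
await cache.set(cacheKey, JSON.stringify(result), 3600);  // 1 hour
```

**Purpose**: caches the LLM 12-field extraction results  
**Importance**: 🔴 High (directly affects API cost)  
**Size**: ~50 KB per entry  
**Cost impact**:
```
Cost of one extraction: ~¥0.43 per paper
If the cache is lost, repeated calls cost:
- 10 calls = ¥4.3
- 100 calls = ¥43
- 1000 calls = ¥430
```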
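
The cost figures above are a simple multiplication. As a minimal sketch (the function name and the constant are illustrative, not part of the codebase), the estimate can be reproduced with integer arithmetic in fen (1/100 yuan) to avoid floating-point drift:

```typescript
// Hypothetical helper: cost of repeated LLM extraction calls after a cache loss.
// The unit price is the ~¥0.43 per paper estimate quoted above, stored in fen.
const COST_PER_EXTRACTION_FEN = 43;

function repeatedCallCostYuan(calls: number): number {
  if (!Number.isInteger(calls) || calls < 0) {
    throw new RangeError('calls must be a non-negative integer');
  }
  // Multiply in integer fen first, convert to yuan last
  return (calls * COST_PER_EXTRACTION_FEN) / 100;
}

for (const calls of [10, 100, 1000]) {
  console.log(`${calls} calls = ¥${repeatedCallCostYuan(calls)}`);
}
```
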

---

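### 2.2 Long-running asynchronous tasks

#### **ASL module: literature-screening task**
```typescript
// File: backend/src/modules/asl/services/screeningService.ts
// Lines 63-65
processLiteraturesInBackground(task.id, projectId, literatures);
```

**Problems**:
- 199 papers take 33-66 minutes
- The task currently runs on the in-memory queue (MemoryQueue)
- The task is lost when the SAE instance restarts or scales in

**Impact**:
- Very poor user experience (tasks disappear without warning)
- Results already processed are lost, wasting API fees
- Task state cannot be traced afterwards

---

### 2.3 Current configuration

```env
# backend/.env
CACHE_TYPE=memory    # ← must change to redis
QUEUE_TYPE=memory    # ← must change to redis
```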

---

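## 3. Redis Configuration

### 3.1 Alibaba Cloud Redis purchase options

| Option | Value | Notes |
|--------|-------|-------|
| **Product** | Redis Open-Source Edition | Full Redis feature set |
| **Billing** | Subscription (yearly/monthly) | 40% off on the first purchase |
| **Deployment mode** | Cloud-native (high availability) | Automatic master-replica failover |
| **Series** | Standard Edition | Meets our requirements |
| **Region** | China North 2 (Beijing) | Same region as SAE |
| **Instance type** | High availability | ✅ 99.95% availability |
| **Major version** | Redis 7.0 | Latest stable version offered |
| **Architecture** | Cluster disabled (single shard) | Sufficient for the current scale |
| **Shard size** | 256 MB | Enough for the initial phase |
| **Shard count** | 1 | Single shard |
| **Read/write splitting** | Off | Keeps the configuration simple |

### 3.2 Estimated cost

```
Base price: ¥72/year (standalone edition)
High-availability edition: ¥180/year (estimate)
First purchase: ¥108/year (¥180 × 0.6 after the discount)

Expected return:
- LLM API fees saved: > ¥500/year
- Better user satisfaction: priceless
- ROI: > 400% (¥500 / ¥108)
```

### 3.3 Connection details (available after purchase)

```env
# Alibaba Cloud console → Redis instance → Connection information
REDIS_HOST=r-xxxxxxxxxxxx.redis.rds.aliyuncs.com
REDIS_PORT=6379
REDIS_PASSWORD=your_secure_password_here
REDIS_DB=0

# Or use a connection string
REDIS_URL=redis://:your_password@r-xxxxxxxxxxxx.redis.rds.aliyuncs.com:6379/0
```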
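
The two forms above carry the same information. The sketch below shows one way to turn a `REDIS_URL` into the discrete host/port/password/db fields; `parseRedisUrl` and `RedisConnectionOptions` are hypothetical names introduced for illustration. ioredis can also take a connection URL directly in its constructor, so this helper only matters when the rest of the code expects separate fields.

```typescript
// Hypothetical shape, mirroring the REDIS_* variables above
interface RedisConnectionOptions {
  host: string;
  port: number;
  password?: string;
  db: number;
}

// Split a redis:// or rediss:// URL into discrete connection fields
function parseRedisUrl(redisUrl: string): RedisConnectionOptions {
  const url = new URL(redisUrl);
  if (url.protocol !== 'redis:' && url.protocol !== 'rediss:') {
    throw new Error(`Unsupported protocol: ${url.protocol}`);
  }

  // The path segment is the database index; default to 0 when it is absent
  const dbText = url.pathname.replace(/^\//, '');
  const db = dbText === '' ? 0 : Number.parseInt(dbText, 10);
  if (Number.isNaN(db)) {
    throw new Error(`Invalid database index: ${dbText}`);
  }

  return {
    host: url.hostname,
    port: url.port === '' ? 6379 : Number.parseInt(url.port, 10),
    // Passwords may be percent-encoded inside a URL
    password: url.password === '' ? undefined : decodeURIComponent(url.password),
    db,
  };
}

console.log(parseRedisUrl('redis://:your_password@localhost:6379/0'));
```
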

---

## 4. Detailed Migration Steps

### 4.1 Phase 1: Prepare the local development environment ✅

#### **Step 1.1: Install dependencies**
```bash
cd backend
npm install ioredis --save
# ioredis v5+ ships its own TypeScript types; @types/ioredis is only needed for ioredis v4
```

#### **Step 1.2: Set up local Redis**
```bash
# Confirm the Docker Redis container is running (Windows; use `grep redis` on macOS/Linux)
docker ps | findstr redis

# Start it if it is not running
docker start ai-clinical-redis

# Test the connection
docker exec -it ai-clinical-redis redis-cli ping
# Expected reply: PONG
```

#### **Step 1.3: Update the local .env**
```env
# backend/.env
DATABASE_URL=postgresql://postgres:postgres123@localhost:5432/ai_clinical_research

# ==================== Redis ====================
# Enable the Redis cache
CACHE_TYPE=redis
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_DB=0
# REDIS_PASSWORD=  # no password locally

# Keep the queue in memory for now (enabled in a later phase)
QUEUE_TYPE=memory

# ==================== JWT ====================
JWT_SECRET=your-secret-key-change-in-production
JWT_EXPIRES_IN=7d

# ==================== LLM API ====================
# Never commit real keys to documentation or Git
DEEPSEEK_API_KEY=your_deepseek_api_key_here
DASHSCOPE_API_KEY=your_dashscope_api_key_here

# ==================== Dify ====================
DIFY_API_URL=http://localhost/v1
DIFY_API_KEY=your_dify_api_key_here

# ==================== Server ====================
PORT=3001
NODE_ENV=development

# ==================== CloseAI ====================
CLOSEAI_API_KEY=your_closeai_api_key_here
CLOSEAI_OPENAI_BASE_URL=https://api.openai-proxy.org/v1
CLOSEAI_CLAUDE_BASE_URL=https://api.openai-proxy.org/anthropic

# ==================== Storage ====================
STORAGE_TYPE=local
LOCAL_STORAGE_DIR=uploads
LOCAL_STORAGE_URL=http://localhost:3001/uploads

# ==================== CORS ====================
CORS_ORIGIN=http://localhost:5173

# ==================== Logging ====================
LOG_LEVEL=debug
```

---

### 4.2 Phase 2: Implement RedisCacheAdapter ✅

#### **Step 2.1: Modify RedisCacheAdapter.ts**

```typescript
// File: backend/src/common/cache/RedisCacheAdapter.ts

import Redis from 'ioredis';
import type { CacheAdapter } from './CacheAdapter.js';
import { logger } from '../logging/index.js';

/**
 * Redis cache adapter
 * 
 * Built on the ioredis client. Supports:
 * - Automatic string/object serialization
 * - TTL expiry
 * - Connection management (keep-alive, automatic reconnect)
 * - Retry on error
 * 
 * @example
 * const cache = new RedisCacheAdapter({
 *   host: 'localhost',
 *   port: 6379,
 *   password: 'xxx',
 *   db: 0
 * });
 * 
 * await cache.set('key', { data: 'value' }, 60);
 * const value = await cache.get('key');
 */
export class RedisCacheAdapter implements CacheAdapter {
  private redis: Redis;

  constructor(options?: {
    host?: string;
    port?: number;
    password?: string;
    db?: number;
  }) {
    this.redis = new Redis({
      host: options?.host || 'localhost',
      port: options?.port || 6379,
      password: options?.password || undefined,
      db: options?.db || 0,
      // Reconnect with a linearly growing delay, capped at 2 seconds
      retryStrategy: (times) => {
        const delay = Math.min(times * 50, 2000);
        logger.warn(`Redis reconnect attempt ${times}, retrying in ${delay}ms`);
        return delay;
      },
      maxRetriesPerRequest: 3,
      enableReadyCheck: true,
      // Connection options (ioredis uses one long-lived connection, not a pool)
      lazyConnect: false,
      keepAlive: 30000,
    });

    // Connection lifecycle events
    this.redis.on('connect', () => {
      logger.info('Redis connected');
    });

    this.redis.on('error', (error) => {
      logger.error('Redis connection error', { error: error.message });
    });

    this.redis.on('close', () => {
      logger.warn('Redis connection closed');
    });

    this.redis.on('reconnecting', () => {
      logger.warn('Redis reconnecting...');
    });
  }

  /**
   * Read a cached value
   */
  async get<T>(key: string): Promise<T | null> {
    try {
      const value = await this.redis.get(key);
      
      // Compare with null explicitly so a cached empty string still counts as a hit
      if (value === null) {
        logger.debug('Redis cache miss', { key });
        return null;
      }

      logger.debug('Redis cache hit', { key, size: value.length });

      // Try to parse as JSON
      try {
        return JSON.parse(value) as T;
      } catch {
        // Not JSON: return the raw string
        return value as unknown as T;
      }
    } catch (error) {
      logger.error('Redis GET failed', { key, error });
      return null;  // Degrade: return null instead of throwing
    }
  }

  /**
   * Write a cached value
   * @param ttl expiry in seconds; if omitted the key never expires (not recommended)
   */
  async set<T>(key: string, value: T, ttl?: number): Promise<void> {
    try {
      // Serialize the value
      const serialized = typeof value === 'string' 
        ? value 
        : JSON.stringify(value);

      // Write the value (with TTL when a positive one is given)
      if (ttl && ttl > 0) {
        await this.redis.setex(key, ttl, serialized);
        logger.debug('Redis SET ok (with TTL)', { 
          key, 
          ttl, 
          size: serialized.length 
        });
      } else {
        await this.redis.set(key, serialized);
        logger.warn('Redis SET ok (no TTL)', { key });  // Warning: no expiry
      }
    } catch (error) {
      logger.error('Redis SET failed', { key, ttl, error });
      // Do not throw, so the application keeps running
    }
  }

  /**
   * Delete a cached value
   */
  async delete(key: string): Promise<boolean> {
    try {
      const result = await this.redis.del(key);
      logger.debug('Redis DEL', { key, deleted: result > 0 });
      return result > 0;
    } catch (error) {
      logger.error('Redis DEL failed', { key, error });
      return false;
    }
  }

  /**
   * Check whether a key exists
   */
  async has(key: string): Promise<boolean> {
    try {
      const result = await this.redis.exists(key);
      return result > 0;
    } catch (error) {
      logger.error('Redis EXISTS failed', { key, error });
      return false;
    }
  }

  /**
   * Clear the whole database (dangerous!)
   * FLUSHDB removes every key in the selected DB, including queue data stored there.
   */
  async clear(): Promise<void> {
    try {
      await this.redis.flushdb();
      logger.warn('Redis FLUSHDB executed (all keys in this DB removed)');
    } catch (error) {
      logger.error('Redis FLUSHDB failed', { error });
    }
  }

  /**
   * Test the Redis connection
   */
  async ping(): Promise<boolean> {
    try {
      const result = await this.redis.ping();
      return result === 'PONG';
    } catch (error) {
      logger.error('Redis PING failed', { error });
      return false;
    }
  }

  /**
   * Close the connection (for graceful shutdown)
   */
  async disconnect(): Promise<void> {
    try {
      await this.redis.quit();
      logger.info('Redis connection closed');
    } catch (error) {
      logger.error('Failed to close the Redis connection', { error });
    }
  }
}
```
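
One consequence of the adapter's "store strings as-is, try `JSON.parse` on read" convention is that string values which happen to be valid JSON do not round-trip as strings. The sketch below isolates that logic in two standalone functions (`serialize` and `deserialize` are illustrative names that mirror the adapter, not exports of it) and contrasts it with an always-JSON variant, which is one possible alternative rather than the project's chosen design.

```typescript
// Mirrors the adapter's write path: strings are stored unchanged
function serialize(value: unknown): string {
  return typeof value === 'string' ? value : JSON.stringify(value);
}

// Mirrors the adapter's read path: parse JSON, fall back to the raw string
function deserialize(raw: string): unknown {
  try {
    return JSON.parse(raw);
  } catch {
    return raw;
  }
}

// Alternative (assumption): always encode as JSON so every value round-trips exactly
function serializeStrict(value: unknown): string {
  return JSON.stringify(value);
}

function deserializeStrict(raw: string): unknown {
  return JSON.parse(raw);
}

console.log(typeof deserialize(serialize('world')));             // string
console.log(typeof deserialize(serialize('123')));               // number: the string became a number
console.log(typeof deserialize(serialize('{"a":1}')));           // object: a pre-stringified payload comes back parsed
console.log(typeof deserializeStrict(serializeStrict('123')));   // string
```

This matters for callers such as LLM12FieldsService, which passes `JSON.stringify(result)` to `cache.set`: with this adapter, `cache.get` hands back the parsed object, not the string that was written, so the reading code should be checked against both behaviors.
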

---

#### **Step 2.2: Update CacheFactory.ts (add a fallback strategy)**

```typescript
// File: backend/src/common/cache/CacheFactory.ts

import { config } from '../../config/env.js';
import type { CacheAdapter } from './CacheAdapter.js';
import { MemoryCacheAdapter } from './MemoryCacheAdapter.js';
import { RedisCacheAdapter } from './RedisCacheAdapter.js';
import { logger } from '../logging/index.js';

/**
 * Cache factory (singleton)
 * 
 * Picks the cache implementation from the environment:
 * - CACHE_TYPE=memory → MemoryCacheAdapter
 * - CACHE_TYPE=redis → RedisCacheAdapter (with fallback handling)
 * 
 * @example
 * import { cache } from '@/common/cache'
 * await cache.set('user:123', userData, 60)
 * const user = await cache.get<User>('user:123')
 */
export class CacheFactory {
  private static instance: CacheAdapter | null = null;
  private static fallbackToMemory = false;  // fallback flag

  /**
   * Get the cache instance (singleton)
   */
  static getInstance(): CacheAdapter {
    if (!this.instance) {
      this.instance = this.createCache();
    }
    return this.instance;
  }

  /**
   * Create the cache instance
   */
  private static createCache(): CacheAdapter {
    const cacheType = config.cacheType || 'memory';

    logger.info('[CacheFactory] Initializing cache', { cacheType });

    switch (cacheType) {
      case 'redis':
        return this.createRedisCache();
      case 'memory':
      default:
        return this.createMemoryCache();
    }
  }

  /**
   * Create the Redis cache (with fallback handling)
   */
  private static createRedisCache(): CacheAdapter {
    try {
      logger.info('[CacheFactory] Connecting to Redis...', {
        host: config.redisHost,
        port: config.redisPort,
        db: config.redisDb,
      });

      const redisCache = new RedisCacheAdapter({
        host: config.redisHost,
        port: config.redisPort,
        password: config.redisPassword,
        db: config.redisDb,
      });

      // Test the connection asynchronously (this method is synchronous and cannot await).
      // NOTE: the instance returned below is still the Redis adapter; a failed ping only
      // sets the flag. Because RedisCacheAdapter swallows errors, failed operations then
      // behave like cache misses instead of being served from memory.
      redisCache.ping().then((isConnected) => {
        if (isConnected) {
          logger.info('[CacheFactory] ✅ Redis connected');
        } else {
          logger.error('[CacheFactory] ❌ Redis ping failed, fallback flag set');
          this.fallbackToMemory = true;
        }
      }).catch((error) => {
        logger.error('[CacheFactory] ❌ Redis connection error, fallback flag set', { error });
        this.fallbackToMemory = true;
      });

      return redisCache;
    } catch (error) {
      logger.error('[CacheFactory] ❌ Redis initialization failed, falling back to the memory cache', { error });
      this.fallbackToMemory = true;
      return this.createMemoryCache();
    }
  }

  /**
   * Create the memory cache
   */
  private static createMemoryCache(): MemoryCacheAdapter {
    logger.info('[CacheFactory] Using the memory cache');
    return new MemoryCacheAdapter();
  }

  /**
   * Whether the fallback flag has been set
   */
  static isFallbackMode(): boolean {
    return this.fallbackToMemory;
  }
}

/**
 * Export the singleton
 */
export const cache = CacheFactory.getInstance();
```
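
If the cache should really be served from memory while Redis is down, one option is a small wrapper that delegates to a primary cache and switches to a secondary one after the first failure. The following is a minimal sketch under stated assumptions: `SimpleCache`, `InMemoryCache`, and `FallbackCache` are hypothetical names, the interface is narrowed to string values, the primary must throw on failure (the adapter from step 2.1 would need to rethrow instead of swallowing errors), and there is no logic for switching back once Redis recovers.

```typescript
// Hypothetical, simplified cache contract used only in this sketch
interface SimpleCache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds?: number): Promise<void>;
}

// Tiny in-process cache with optional expiry
class InMemoryCache implements SimpleCache {
  private store = new Map<string, { value: string; expiresAt: number | null }>();

  async get(key: string): Promise<string | null> {
    const entry = this.store.get(key);
    if (!entry) return null;
    if (entry.expiresAt !== null && entry.expiresAt <= Date.now()) {
      this.store.delete(key);
      return null;
    }
    return entry.value;
  }

  async set(key: string, value: string, ttlSeconds?: number): Promise<void> {
    const expiresAt = ttlSeconds ? Date.now() + ttlSeconds * 1000 : null;
    this.store.set(key, { value, expiresAt });
  }
}

// Delegates to `primary` until it throws once, then serves from `fallback`
class FallbackCache implements SimpleCache {
  private primary: SimpleCache;
  private fallback: SimpleCache;
  private degraded = false;

  constructor(primary: SimpleCache, fallback: SimpleCache) {
    this.primary = primary;
    this.fallback = fallback;
  }

  isDegraded(): boolean {
    return this.degraded;
  }

  async get(key: string): Promise<string | null> {
    if (!this.degraded) {
      try {
        return await this.primary.get(key);
      } catch {
        this.degraded = true;  // first failure flips the switch
      }
    }
    return this.fallback.get(key);
  }

  async set(key: string, value: string, ttlSeconds?: number): Promise<void> {
    if (!this.degraded) {
      try {
        await this.primary.set(key, value, ttlSeconds);
        return;
      } catch {
        this.degraded = true;
      }
    }
    await this.fallback.set(key, value, ttlSeconds);
  }
}
```

A production version would also need a periodic health check to leave degraded mode, and it would have to accept that entries written to memory during an outage are local to one instance.
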

---

#### **Step 2.3: Update the env.ts configuration**

```typescript
// File: backend/src/config/env.ts
// Make sure the Redis settings are read correctly

export const config = {
  // ... other settings ...

  /** Cache type */
  cacheType: process.env.CACHE_TYPE || 'memory',

  /** Redis settings */
  redisHost: process.env.REDIS_HOST || 'localhost',
  redisPort: parseInt(process.env.REDIS_PORT || '6379', 10),
  redisPassword: process.env.REDIS_PASSWORD || undefined,
  redisDb: parseInt(process.env.REDIS_DB || '0', 10),
  redisUrl: process.env.REDIS_URL || 'redis://localhost:6379',

  /** Queue type */
  queueType: process.env.QUEUE_TYPE || 'memory',

  // ... other settings ...
};

// Print the loaded configuration at startup
export function validateConfig() {
  console.log('✅ [Config] Environment variables loaded');
  console.log('[Config] Application settings:');
  console.log(`  - Cache: ${config.cacheType}`);
  console.log(`  - Queue: ${config.queueType}`);
  
  if (config.cacheType === 'redis') {
    console.log(`  - Redis: ${config.redisHost}:${config.redisPort}/${config.redisDb}`);
  }
}
```
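
As written, `validateConfig()` prints the settings but does not reject bad values: a typo in `CACHE_TYPE` or a non-numeric `REDIS_PORT` (which `parseInt` turns into `NaN`) would pass silently. A minimal sketch of stricter parsing follows; `parseBackendType` and `parsePort` are hypothetical helpers, not existing functions in env.ts.

```typescript
type BackendType = 'memory' | 'redis';

// Accept only the two supported backends; unset or empty means "memory"
function parseBackendType(name: string, raw: string | undefined): BackendType {
  const value = raw === undefined || raw === '' ? 'memory' : raw;
  if (value !== 'memory' && value !== 'redis') {
    throw new Error(`${name} must be "memory" or "redis", got "${value}"`);
  }
  return value;
}

// Parse a TCP port, rejecting NaN, fractions, and out-of-range numbers
function parsePort(name: string, raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw === '') return fallback;
  const port = Number(raw);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`${name} must be an integer between 1 and 65535, got "${raw}"`);
  }
  return port;
}

// Usage with plain values (in env.ts these would come from process.env)
console.log(parseBackendType('CACHE_TYPE', 'redis'));   // redis
console.log(parsePort('REDIS_PORT', undefined, 6379));  // 6379
```

Failing fast at startup is usually easier to diagnose than a cache that quietly runs in memory mode because of a misspelled variable.
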

---

### 4.3 Phase 3: Local testing ✅

#### **Step 3.1: Create a Redis test script**

```typescript
// File: backend/src/scripts/test-redis.ts

import { cache } from '../common/cache/index.js';

// console.assert only prints a message; throw instead so a failed check ends the run
function check(condition: boolean, message: string): void {
  if (!condition) throw new Error(`Assertion failed: ${message}`);
}

async function testRedis() {
  console.log('\n🧪 Starting Redis cache tests...\n');

  try {
    // Test 1: basic read/write
    console.log('📝 Test 1: basic read/write');
    await cache.set('test:hello', 'world', 10);
    const value1 = await cache.get('test:hello');
    console.log(`  ✅ Wrote: "hello" → "world"`);
    console.log(`  ✅ Read: "${value1}"`);
    check(value1 === 'world', 'value should match');

    // Test 2: object serialization
    console.log('\n📝 Test 2: object serialization');
    const obj = { id: 123, name: '测试', data: [1, 2, 3] };
    await cache.set('test:object', obj, 10);
    const value2 = await cache.get<typeof obj>('test:object');
    console.log(`  ✅ Wrote object:`, obj);
    console.log(`  ✅ Read object:`, value2);
    check(value2?.id === 123, 'id should match');

    // Test 3: TTL expiry
    console.log('\n📝 Test 3: TTL expiry (2 seconds)');
    await cache.set('test:expire', 'will-expire', 2);
    console.log(`  ✅ Wrote (TTL=2s)`);
    
    console.log(`  ⏳ Waiting 1 second...`);
    await sleep(1000);
    const value3a = await cache.get('test:expire');
    console.log(`  ✅ Read after 1 second: "${value3a}"`);
    check(value3a === 'will-expire', 'key should still exist');
    
    console.log(`  ⏳ Waiting 2 seconds...`);
    await sleep(2000);
    const value3b = await cache.get('test:expire');
    console.log(`  ✅ Read after 3 seconds: ${value3b}`);
    check(value3b === null, 'key should have expired');

    // Test 4: has and delete
    console.log('\n📝 Test 4: has and delete');
    await cache.set('test:delete', 'to-be-deleted', 10);
    const exists1 = await cache.has('test:delete');
    console.log(`  ✅ exists after write: ${exists1}`);
    
    await cache.delete('test:delete');
    const exists2 = await cache.has('test:delete');
    console.log(`  ✅ exists after delete: ${exists2}`);
    check(!exists2, 'key should not exist');

    // Test 5: large object (~40 KB)
    console.log('\n📝 Test 5: large object (simulated LLM result)');
    const bigObj = {
      literatureId: 'xxx',
      fields: {
        研究类型: 'RCT',              // study type
        样本量: '500',                // sample size
        干预措施: '药物A 100mg',       // intervention
        // ... 12 fields in total
      },
      metadata: {
        model: 'deepseek-v3',
        tokens: 8000,
        timestamp: Date.now(),
      },
      rawOutput: 'x'.repeat(40000),  // simulate a large output
    };
    
    const start = Date.now();
    await cache.set('test:bigobj', bigObj, 3600);
    const value5 = await cache.get('test:bigobj');
    const duration = Date.now() - start;
    
    const size = JSON.stringify(bigObj).length;
    console.log(`  ✅ Write + read large object: ${size} bytes, took ${duration}ms`);
    check(value5 !== null, 'large object should be readable');

    console.log('\n✅ All tests passed!\n');
    process.exit(0);
  } catch (error) {
    console.error('\n❌ Test failed:', error);
    process.exit(1);
  }
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Run the tests
testRedis();
```

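#### **Step 3.2: Run the tests**
```bash
cd backend

# 1. Start the backend (confirms the Redis settings take effect)
npm run dev

# 2. In a second terminal, run the test script
npx tsx src/scripts/test-redis.ts
```

**Expected output** (the byte count and timing in test 5 are approximate and will vary):
```
🧪 Starting Redis cache tests...

📝 Test 1: basic read/write
  ✅ Wrote: "hello" → "world"
  ✅ Read: "world"

📝 Test 2: object serialization
  ✅ Wrote object: { id: 123, name: '测试', data: [ 1, 2, 3 ] }
  ✅ Read object: { id: 123, name: '测试', data: [ 1, 2, 3 ] }

📝 Test 3: TTL expiry (2 seconds)
  ✅ Wrote (TTL=2s)
  ⏳ Waiting 1 second...
  ✅ Read after 1 second: "will-expire"
  ⏳ Waiting 2 seconds...
  ✅ Read after 3 seconds: null

📝 Test 4: has and delete
  ✅ exists after write: true
  ✅ exists after delete: false

📝 Test 5: large object (simulated LLM result)
  ✅ Write + read large object: 40123 bytes, took 5ms

✅ All tests passed!
```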

---

#### **Step 3.3: Test the business code**

```bash
# 1. Test HealthCheckService (DC module)
# Upload an Excel file and watch the log:

[HealthCheck] Cache miss, processing file
[HealthCheck] Check completed
[HealthCheck] Cache SET: health:xxx, TTL=86400

# Upload the same file a second time
[HealthCheck] Cache hit  ← read from Redis successfully
```

```bash
# 2. Test LLM12FieldsService (ASL module)
# Submit a full-text re-screening task and watch the log:

[LLM12FieldsService] Calling the LLM to extract the 12 fields
[LLM12FieldsService] Result cached with key: fulltext:xxx
[LLM12FieldsService] Cache write succeeded

# Run the same PDF again
[LLM12FieldsService] Cache hit, returning cached result  ← API fee saved!
```

---

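### 4.4 Phase 4: Configure Alibaba Cloud Redis ✅

#### **Step 4.1: Purchase the Redis instance**

1. Log in to the Alibaba Cloud console
2. Open the **ApsaraDB for Redis** product page
3. Click **Create Instance**
4. Select the configuration listed in section 3.1:
   - Product: Redis Open-Source Edition
   - Deployment mode: Cloud-native (high availability)
   - Region: China North 2 (Beijing), **the same region as SAE!**
   - Version: Redis 7.0
   - Shard size: 256 MB
   - Billing: Subscription (yearly/monthly)
5. Submit the order and pay

#### **Step 4.2: Configure the whitelist**

```
Alibaba Cloud console → Redis instance → Whitelist settings

Add:
1. Local development IP (for local testing)
   - your public IP/32

2. SAE application IPs (production)
   - 0.0.0.0/0 (temporary only: this admits every source IP, so replace it
     with the SAE VPC CIDR before go-live)
   or
   - the VPC CIDR block of the SAE instances
```

#### **Step 4.3: Get the connection details**

```
Alibaba Cloud console → Redis instance → Connection information

Copy the following:
- Endpoint: r-xxxxxxxxxxxx.redis.rds.aliyuncs.com
- Port: 6379
- Instance ID: r-xxxxxxxxxxxx
- Password: click "Reset Password" to set one
```

#### **Step 4.4: Update the SAE environment variables**

```
Alibaba Cloud console → SAE application → Configuration management → Environment variables

Add:
CACHE_TYPE=redis
REDIS_HOST=r-xxxxxxxxxxxx.redis.rds.aliyuncs.com
REDIS_PORT=6379
REDIS_PASSWORD=<the password you set>
REDIS_DB=0
```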

---

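### 4.5 Phase 5: Enable the Redis queue 🔴 **Required**

> **Key change**: the analysis shows that the Redis queue is **required** for the ASL and DC Tool B modules, not optional!  
> **Reason**: for 2-hour tasks on SAE, the failure rate without a Redis queue is > 95%

#### **Step 5.1: Install BullMQ**

```bash
cd backend
npm install bullmq --save

# If BullMQ is already listed in package.json, just confirm that it is installed
npm list bullmq
# Expected: bullmq@5.65.0
```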

#### **Step 5.2: Implement RedisQueue.ts**

```typescript
// File: backend/src/common/jobs/RedisQueue.ts

import { Queue, Worker, Job as BullJob, QueueEvents } from 'bullmq';
import type { Job, JobQueue, JobHandler } from './types.js';
import { logger } from '../logging/index.js';
import { config } from '../../config/env.js';

/**
 * Redis queue implementation (built on BullMQ)
 * 
 * Core features:
 * - Durable jobs (survive instance restarts)
 * - Automatic retry (exponential backoff after failure)
 * - Distributed job assignment (coordination across instances)
 * - Progress tracking
 */
export class RedisQueue implements JobQueue {
  private queues: Map<string, Queue> = new Map();
  private workers: Map<string, Worker> = new Map();
  private queueEvents: Map<string, QueueEvents> = new Map();
  
  private connection = {
    host: config.redisHost,
    port: config.redisPort,
    password: config.redisPassword,
    db: config.redisDb,
  };

  /**
   * Map a job type to a BullMQ queue name.
   * Recent BullMQ versions reject ':' in queue names (it is the Redis key separator),
   * so a type such as 'asl:title-screening' becomes 'asl-title-screening'.
   */
  private queueName(type: string): string {
    return type.replace(/:/g, '-');
  }

  /**
   * Get or create the queue for a job type
   */
  private getQueue(type: string): Queue {
    let queue = this.queues.get(type);
    if (!queue) {
      queue = new Queue(this.queueName(type), { 
        connection: this.connection,
        defaultJobOptions: {
          removeOnComplete: 100,  // keep the latest 100 completed jobs
          removeOnFail: false,    // keep failed jobs (for troubleshooting)
          attempts: 3,            // 3 attempts in total, i.e. at most 2 retries
          backoff: {
            type: 'exponential',
            delay: 2000,          // retry after 2 s, then 4 s
          },
        }
      });
      this.queues.set(type, queue);
    }
    return queue;
  }

  /**
   * Push a job onto the queue
   */
  async push<T = any>(type: string, data: T, options?: any): Promise<Job> {
    try {
      const queue = this.getQueue(type);

      // Add the job
      const job = await queue.add(type, data, {
        ...options,
        jobId: options?.jobId,  // custom jobId is supported
      });

      logger.info(`[RedisQueue] Job enqueued`, { 
        type, 
        jobId: job.id,
        dataSize: JSON.stringify(data).length 
      });

      return {
        id: job.id!,
        type,
        data,
        status: 'pending',
        createdAt: new Date(),
      };
    } catch (error) {
      logger.error(`[RedisQueue] Failed to enqueue job`, { type, error });
      throw error;
    }
  }

  /**
   * Register a job handler
   */
  process<T = any>(type: string, handler: JobHandler<T>): void {
    try {
      // Register the queue too, so getJob/updateProgress can find jobs on
      // instances that only consume and never call push()
      this.getQueue(type);

      // Create the Worker
      const worker = new Worker(
        this.queueName(type),
        async (job: BullJob) => {
          logger.info(`[RedisQueue] Job started`, { 
            type, 
            jobId: job.id,
            attemptsMade: job.attemptsMade,
            attemptsTotal: job.opts.attempts
          });
          
          const startTime = Date.now();
          
          try {
            // Call the business handler
            const result = await handler({
              id: job.id!,
              type,
              data: job.data as T,
              status: 'processing',
              createdAt: new Date(job.timestamp),
            });

            const duration = Date.now() - startTime;
            logger.info(`[RedisQueue] Job handler succeeded`, { 
              type, 
              jobId: job.id,
              duration: `${duration}ms`
            });
            
            return result;
          } catch (error) {
            const duration = Date.now() - startTime;
            logger.error(`[RedisQueue] Job handler failed`, { 
              type, 
              jobId: job.id,
              attemptsMade: job.attemptsMade,
              duration: `${duration}ms`,
              error: error instanceof Error ? error.message : 'Unknown error'
            });
            throw error;  // rethrow to trigger a retry
          }
        },
        { 
          connection: this.connection,
          concurrency: 1,  // each Worker processes one job at a time
        }
      );

      this.workers.set(type, worker);

      // Worker events
      worker.on('completed', (job) => {
        logger.info(`[RedisQueue] ✅ Job completed`, { 
          type, 
          jobId: job.id,
          returnvalue: job.returnvalue 
        });
      });

      worker.on('failed', (job, err) => {
        logger.error(`[RedisQueue] ❌ Job failed`, { 
          type, 
          jobId: job?.id,
          attemptsMade: job?.attemptsMade,
          error: err.message,
          stack: err.stack
        });
      });

      worker.on('error', (err) => {
        logger.error(`[RedisQueue] Worker error`, { type, error: err.message });
      });

      logger.info(`[RedisQueue] Worker registered`, { type });

    } catch (error) {
      logger.error(`[RedisQueue] Failed to register Worker`, { type, error });
      throw error;
    }
  }

  /**
   * Look up a job
   */
  async getJob(id: string): Promise<Job | null> {
    try {
      // Search every known queue for the job
      for (const [type, queue] of this.queues) {
        const job = await queue.getJob(id);
        if (job) {
          return {
            id: job.id!,
            type,
            data: job.data,
            status: await this.getJobStatus(job),
            progress: job.progress as number || 0,
            createdAt: new Date(job.timestamp),
            error: job.failedReason,
          };
        }
      }
      return null;
    } catch (error) {
      logger.error(`[RedisQueue] getJob failed`, { id, error });
      return null;
    }
  }

  /**
   * Map the BullMQ state to the application's job status
   */
  private async getJobStatus(job: BullJob): Promise<string> {
    const state = await job.getState();
    switch (state) {
      case 'completed': return 'completed';
      case 'failed': return 'failed';
      case 'active': return 'processing';
      case 'waiting': return 'pending';
      case 'delayed': return 'pending';
      default: return 'pending';
    }
  }

  /**
   * Update job progress
   */
  async updateProgress(id: string, progress: number, message?: string): Promise<void> {
    try {
      for (const queue of this.queues.values()) {
        const job = await queue.getJob(id);
        if (job) {
          await job.updateProgress(progress);
          if (message) {
            await job.log(message);
          }
          logger.debug(`[RedisQueue] Progress updated`, { id, progress, message });
          return;
        }
      }
      logger.warn(`[RedisQueue] Job not found, cannot update progress`, { id });
    } catch (error) {
      logger.error(`[RedisQueue] Failed to update progress`, { id, error });
    }
  }

  /**
   * Cancel a job
   * Note: removing a job that a worker is currently processing fails,
   * in which case this method logs the error and returns false.
   */
  async cancelJob(id: string): Promise<boolean> {
    try {
      for (const queue of this.queues.values()) {
        const job = await queue.getJob(id);
        if (job) {
          await job.remove();
          logger.info(`[RedisQueue] Job cancelled`, { id });
          return true;
        }
      }
      logger.warn(`[RedisQueue] Job not found, cannot cancel`, { id });
      return false;
    } catch (error) {
      logger.error(`[RedisQueue] Failed to cancel job`, { id, error });
      return false;
    }
  }

  /**
   * Retry a failed job
   */
  async retryJob(id: string): Promise<boolean> {
    try {
      for (const queue of this.queues.values()) {
        const job = await queue.getJob(id);
        if (job) {
          await job.retry();
          logger.info(`[RedisQueue] Job retried`, { id });
          return true;
        }
      }
      logger.warn(`[RedisQueue] Job not found, cannot retry`, { id });
      return false;
    } catch (error) {
      logger.error(`[RedisQueue] Failed to retry job`, { id, error });
      return false;
    }
  }

  /**
   * Clean up old jobs
   * @param olderThan minimum age in milliseconds (default: 24 hours)
   */
  async cleanup(olderThan: number = 86400000): Promise<number> {
    try {
      let totalCleaned = 0;
      
      for (const [type, queue] of this.queues) {
        // Remove completed jobs older than the threshold (at most 100 per call)
        const completed = await queue.clean(olderThan, 100, 'completed');
        // Remove failed jobs older than the threshold (at most 50 per call)
        const failed = await queue.clean(olderThan, 50, 'failed');
        
        const cleaned = completed.length + failed.length;
        totalCleaned += cleaned;
        
        if (cleaned > 0) {
          logger.info(`[RedisQueue] Queue cleaned`, { 
            type, 
            completed: completed.length,
            failed: failed.length
          });
        }
      }
      
      return totalCleaned;
    } catch (error) {
      logger.error(`[RedisQueue] Failed to clean jobs`, { error });
      return 0;
    }
  }

  /**
   * Close all connections (graceful shutdown)
   */
  async close(): Promise<void> {
    try {
      // Close all Workers
      for (const [type, worker] of this.workers) {
        await worker.close();
        logger.info(`[RedisQueue] Worker closed`, { type });
      }
      
      // Close all Queues
      for (const [type, queue] of this.queues) {
        await queue.close();
        logger.info(`[RedisQueue] Queue closed`, { type });
      }
      
      // Close all QueueEvents
      for (const [type, events] of this.queueEvents) {
        await events.close();
        logger.info(`[RedisQueue] QueueEvents closed`, { type });
      }
      
      logger.info(`[RedisQueue] All connections closed`);
    } catch (error) {
      logger.error(`[RedisQueue] Failed to close connections`, { error });
    }
  }
}
```
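
To make the retry settings concrete: assuming BullMQ's built-in `exponential` strategy waits `delay × 2^(attemptsMade − 1)` before each retry, as its documentation describes, `attempts: 3` with `delay: 2000` yields two retries. The helper below is an illustrative sketch for reasoning about the schedule (the function name is an assumption); it does not call BullMQ.

```typescript
// Compute the waits between attempts for an exponential backoff policy.
// `attempts` counts the first try, so there are attempts - 1 retries.
function exponentialBackoffDelays(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  for (let attemptsMade = 1; attemptsMade < attempts; attemptsMade++) {
    delays.push(Math.round(Math.pow(2, attemptsMade - 1) * baseDelayMs));
  }
  return delays;
}

console.log(exponentialBackoffDelays(3, 2000));  // [ 2000, 4000 ]
console.log(exponentialBackoffDelays(5, 2000));  // [ 2000, 4000, 8000, 16000 ]
```

For tasks that run for an hour or more, a few seconds of backoff is negligible; the more important question is whether a retried job repeats work that was already done.
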

#### **Step 5.3: Update JobFactory to support the Redis queue**

```typescript
// File: backend/src/common/jobs/JobFactory.ts

import { JobQueue } from './types.js'
import { MemoryQueue } from './MemoryQueue.js'
import { RedisQueue } from './RedisQueue.js'  // ← new
import { logger } from '../logging/index.js'
import { config } from '../../config/env.js'

export class JobFactory {
  private static instance: JobQueue | null = null

  static getInstance(): JobQueue {
    if (!this.instance) {
      this.instance = this.createQueue()
    }
    return this.instance
  }

  private static createQueue(): JobQueue {
    const queueType = config.queueType || 'memory'

    logger.info('[JobFactory] Initializing job queue', { queueType });

    switch (queueType) {
      case 'redis':  // ← new
        return this.createRedisQueue()
      
      case 'memory':
        return this.createMemoryQueue()
      
      default:
        logger.warn(`[JobFactory] Unknown QUEUE_TYPE: ${queueType}, fallback to memory`)
        return this.createMemoryQueue()
    }
  }

  /**
   * Create the Redis queue (with fallback handling)
   * Note: RedisQueue opens its connections lazily, on the first push()/process() call,
   * so this try/catch only covers construction errors, not an unreachable Redis.
   */
  private static createRedisQueue(): JobQueue {
    try {
      logger.info('[JobFactory] Creating Redis queue...');
      
      const redisQueue = new RedisQueue();
      
      logger.info('[JobFactory] ✅ Redis queue initialized');
      return redisQueue;
      
    } catch (error) {
      logger.error('[JobFactory] ❌ Redis queue initialization failed, falling back to the memory queue', { error });
      return this.createMemoryQueue();
    }
  }

  private static createMemoryQueue(): MemoryQueue {
    logger.info('[JobFactory] Using the memory queue')
    
    const queue = new MemoryQueue()
    
    // Clean up periodically (avoids memory leaks)
    if (process.env.NODE_ENV !== 'test') {
      setInterval(() => {
        queue.cleanup()
      }, 60 * 60 * 1000)
    }
    
    return queue
  }

  static reset(): void {
    this.instance = null
  }
}
```

#### **Step 5.4: Change the business code to use the queue**

```typescript
// Example: migrating ASL literature screening
// File: backend/src/modules/asl/services/screeningService.ts
// `prisma`, `logger`, and `processSingleLiterature` are assumed to come from existing modules.

import { jobQueue } from '../../../common/jobs/index.js';

export async function startScreeningTask(projectId: string, userId: string) {
  // NOTE: `literatures` is assumed to have been loaded for this project earlier
  // (the query is elided in this example).

  // 1. Create the task record in the database
  const task = await prisma.aslScreeningTask.create({
    data: {
      projectId,
      status: 'pending',
      totalItems: literatures.length,
      // ...
    }
  });

  // 2. Push to the Redis queue (does not block the request)
  await jobQueue.push('asl:title-screening', {
    taskId: task.id,
    projectId,
    literatureIds: literatures.map(lit => lit.id),
  });

  logger.info('Task enqueued', { taskId: task.id });

  // 3. Return immediately (the frontend polls for progress)
  return task;
}

// Register the Worker (at application startup)
// File: backend/src/index.ts
jobQueue.process('asl:title-screening', async (job) => {
  const { taskId, projectId, literatureIds } = job.data;
  
  logger.info('Screening task started', { taskId, total: literatureIds.length });
  
  for (let i = 0; i < literatureIds.length; i++) {
    const literatureId = literatureIds[i];
    
    // Process one paper
    await processSingleLiterature(literatureId, projectId);
    
    // Update progress (0-100)
    const progress = Math.round(((i + 1) / literatureIds.length) * 100);
    await jobQueue.updateProgress(job.id, progress);
    
    // Update the database
    await prisma.aslScreeningTask.update({
      where: { id: taskId },
      data: { processedItems: i + 1 }
    });
  }
  
  // Mark as completed
  await prisma.aslScreeningTask.update({
    where: { id: taskId },
    data: { status: 'completed', completedAt: new Date() }
  });
  
  logger.info('Screening task completed', { taskId });
  
  return { success: true, processed: literatureIds.length };
});
```
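
The handler above always starts from the first paper. When a job is retried or re-run after an instance restart, that repeats LLM calls for papers that were already screened. A minimal sketch of resumable processing follows, assuming the handler can look up which literature IDs already have a stored result; `remainingLiteratureIds` and `progressPercent` are hypothetical helpers, and how the processed IDs are queried is left to the existing data model.

```typescript
// Return the IDs that still need processing, preserving the original order
function remainingLiteratureIds(
  allIds: readonly string[],
  processedIds: Iterable<string>,
): string[] {
  const done = new Set(processedIds);
  return allIds.filter((id) => !done.has(id));
}

// Progress as a whole percentage of the full task, including work done before a restart
function progressPercent(total: number, remaining: number): number {
  if (total <= 0) return 100;
  return Math.round(((total - remaining) / total) * 100);
}

// Usage: a task of 5 papers where 2 were finished before the restart
const allIds = ['lit-1', 'lit-2', 'lit-3', 'lit-4', 'lit-5'];
const alreadyProcessed = ['lit-1', 'lit-2'];
const todo = remainingLiteratureIds(allIds, alreadyProcessed);
console.log(todo);                                        // [ 'lit-3', 'lit-4', 'lit-5' ]
console.log(progressPercent(allIds.length, todo.length)); // 40
```

Inside the worker, the loop would then iterate over `todo` instead of `literatureIds`, which keeps a re-run idempotent with respect to API cost.
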

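#### **Step 5.5: Update the .env configuration**

```env
# backend/.env

# ==================== Job queue ====================
QUEUE_TYPE=redis  # ← change to redis

# Redis settings (shared with the cache)
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_PASSWORD=
REDIS_DB=0
```

#### **Step 5.6: Test the Redis queue**

```bash
# 1. Start the backend
cd backend
npm run dev

# The log should show:
# [JobFactory] Creating Redis queue...
# [JobFactory] ✅ Redis queue initialized

# 2. Submit a test task (with REST Client or Postman)
POST http://localhost:3001/api/v1/asl/projects/:projectId/screening

# 3. Watch the log
# [RedisQueue] Job enqueued { type: 'asl:title-screening', jobId: '1' }
# [RedisQueue] Job started { type: 'asl:title-screening', jobId: '1' }
# [RedisQueue] ✅ Job completed { type: 'asl:title-screening', jobId: '1' }

# 4. Test recovery after an instance restart
# Submit a task → wait until about 50% → stop with Ctrl+C → start again
# The job should be picked up from Redis again once BullMQ detects it as stalled.
# Note: the handler is re-run from the start of the job, so it only "continues"
# if it skips papers that were already processed.
```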

---

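## 5. Test Plan

### 5.1 Unit test checklist

| Test | Expected result | Actual result | Status |
|------|-----------------|---------------|--------|
| Redis connection | PONG | - | ⬜ Pending |
| Basic read/write | Values match | - | ⬜ Pending |
| Object serialization | Object intact | - | ⬜ Pending |
| TTL expiry | null after 2 seconds | - | ⬜ Pending |
| Large object (~40-50 KB) | < 10 ms | - | ⬜ Pending |
| Fallback strategy | Switches to memory automatically | - | ⬜ Pending |

### 5.2 Integration test checklist

| Test | Steps | Expected result | Status |
|------|-------|-----------------|--------|
| **HealthCheckService** | Upload the same Excel file twice | Cache hit on the 2nd upload | ⬜ Pending |
| **LLM12FieldsService** | Extract the same PDF twice | Cache hit on the 2nd run | ⬜ Pending |
| **Cache sharing across instances** | Start 2 backend instances; instance A writes, instance B reads | B reads A's cached value | ⬜ Pending |
| **Persistence across restarts** | Write to the cache → restart the backend → read | Value is still readable | ⬜ Pending |

### 5.3 Load test

```bash
# Use ab or wrk to test concurrent reads and writes

# Test 1: concurrent writes
ab -n 1000 -c 10 http://localhost:3001/api/test/cache-write

# Test 2: concurrent reads
ab -n 10000 -c 50 http://localhost:3001/api/test/cache-read

# Expected:
# - QPS > 1000
# - Response time < 50ms
# - Error rate = 0%
```

### 5.4 Fault-injection tests

| Fault | How to simulate | Expected behavior | Status |
|-------|-----------------|-------------------|--------|
| **Redis goes down** | `docker stop ai-clinical-redis` | The system falls back to the memory cache and the application keeps running | ⬜ Pending |
| **Redis network latency** | `tc qdisc add dev eth0 root netem delay 500ms` (Linux, root) | Requests time out and retry, and finally return null | ⬜ Pending |
| **Redis memory full** | Write data until 256 MB is reached | Keys are evicted according to the instance's `maxmemory-policy` and new writes still succeed (verify the policy: BullMQ recommends `noeviction` for queue data) | ⬜ Pending |
| **Wrong Redis password** | Change the password | Connection fails, falls back to the memory cache | ⬜ Pending |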
---

## 6. 风险评估与缓解

### 6.1 风险矩阵

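| Risk | Severity | Probability | Impact | Mitigation | Status |
|------|----------|-------------|--------|------------|--------|
| **Redis connection failure** | 🔴 High | 🟡 Medium | System unavailable | Fallback strategy | ✅ Implemented |
| **Data loss** | 🟡 Medium | 🟢 Low | Cache invalidated | Dual-write critical data to the DB | ⏳ To be implemented |
| **Out of memory (OOM)** | 🔴 High | 🟡 Medium | Redis crash | Strict TTLs + monitoring | ⏳ To be implemented |
| **Network latency** | 🟢 Low | 🟢 Low | Slower responses | Batched operations | ⏳ Optional optimization |
| **Misconfiguration** | 🟡 Medium | 🟡 Medium | Startup failure | Config validation | ✅ Implemented |
| **Password leak** | 🔴 High | 🟡 Medium | Data exposure | KMS-managed secrets | ⏳ To be implemented |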

### 6.2 关键缓解措施

#### **缓解措施1：降级策略（必须）** ✅
```typescript
// 已在CacheFactory中实现
// Redis不可用时自动切换到MemoryCache
```
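
To make the fallback behaviour concrete, here is a minimal sketch of a wrapper that tries a primary cache and falls back to a second one when the primary throws. It is an illustration under assumptions, not the actual `CacheFactory` code: `CacheAdapter`, `SimpleMemoryCache`, and `FallbackCache` are hypothetical names, and the interface only mirrors the `get`/`set(key, value, ttl)` calls used elsewhere in this plan.

```typescript
// Minimal cache contract, assumed from the cache.get / cache.set calls in this plan.
interface CacheAdapter {
  get<T>(key: string): Promise<T | null>;
  set(key: string, value: unknown, ttlSeconds: number): Promise<void>;
}

// Hypothetical stand-in for the in-memory cache (TTL handling omitted for brevity).
class SimpleMemoryCache implements CacheAdapter {
  private store = new Map<string, unknown>();

  async get<T>(key: string): Promise<T | null> {
    return this.store.has(key) ? (this.store.get(key) as T) : null;
  }

  async set(key: string, value: unknown, _ttlSeconds: number): Promise<void> {
    this.store.set(key, value);
  }
}

// Tries the primary cache first; any error routes the call to the fallback.
class FallbackCache implements CacheAdapter {
  private primary: CacheAdapter;
  private fallback: CacheAdapter;
  degraded = false; // becomes true once the primary has failed at least once

  constructor(primary: CacheAdapter, fallback: CacheAdapter) {
    this.primary = primary;
    this.fallback = fallback;
  }

  async get<T>(key: string): Promise<T | null> {
    try {
      return await this.primary.get<T>(key);
    } catch {
      this.degraded = true;
      return this.fallback.get<T>(key);
    }
  }

  async set(key: string, value: unknown, ttlSeconds: number): Promise<void> {
    try {
      await this.primary.set(key, value, ttlSeconds);
    } catch {
      this.degraded = true;
      await this.fallback.set(key, value, ttlSeconds);
    }
  }
}
```

A per-call fallback like this keeps requests flowing, but each instance then holds its own memory cache, so multi-instance sharing is lost until Redis is reachable again.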

#### **缓解措施2：关键数据双写（推荐）** ⏳
```typescript
// 需要在业务代码中添加

// 示例：任务进度双写
export async function updateTaskProgress(taskId: string, progress: number) {
  // 1. 写Redis（快速查询）
  await cache.set(`task:${taskId}:progress`, progress, 3600);
  
  // 2. 同时写DB（持久化）
  await prisma.aslScreeningTask.update({
    where: { id: taskId },
    data: { processedItems: progress }
  });
}
```
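
One detail worth considering in the dual-write above: if the Redis write throws first, the DB update never runs. A possible variant, sketched under assumptions, writes the DB first and treats the cache write as best-effort. `ProgressCache`, `ProgressStore`, and `updateTaskProgressSafe` are hypothetical names introduced for illustration; in the real code the store would wrap the Prisma call shown above.

```typescript
// Hypothetical minimal contracts so the sketch is self-contained.
interface ProgressCache {
  set(key: string, value: number, ttlSeconds: number): Promise<void>;
}
interface ProgressStore {
  saveProgress(taskId: string, processedItems: number): Promise<void>;
}

// DB write first (source of truth), then a best-effort cache write.
async function updateTaskProgressSafe(
  taskId: string,
  progress: number,
  cache: ProgressCache,
  store: ProgressStore,
): Promise<{ cached: boolean }> {
  // A DB failure propagates to the caller: progress must not be silently lost.
  await store.saveProgress(taskId, progress);

  try {
    await cache.set(`task:${taskId}:progress`, progress, 3600);
    return { cached: true };
  } catch {
    // A cache failure is tolerated: readers can still fall back to the DB value.
    return { cached: false };
  }
}
```

The trade-off is that the fast path can briefly lag behind the DB when Redis is down, which is usually acceptable for a progress indicator.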

#### **缓解措施3：内存监控（推荐）** ⏳
```typescript
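// Create: backend/src/scripts/monitor-redis.ts

import Redis from 'ioredis';

// Assumption: connection settings come from the REDIS_* variables in Appendix A.
const redis = new Redis({
  host: process.env.REDIS_HOST ?? 'localhost',
  port: Number(process.env.REDIS_PORT ?? 6379),
  password: process.env.REDIS_PASSWORD || undefined,
});
const MAX_BYTES = 256 * 1024 * 1024; // 256MB instance

setInterval(async () => {
  try {
    // Check memory usage
    const info = await redis.info('memory');
    const match = info.match(/used_memory:(\d+)/);
    if (!match) return; // field missing: skip this round instead of crashing
    const used = parseInt(match[1], 10);

    if (used > MAX_BYTES * 0.8) {
      // console is used here; swap in the project logger if one is available
      console.warn('⚠️ Redis memory usage above 80%', { used, max: MAX_BYTES });
      // TODO: send DingTalk/email alert
    }
  } catch (err) {
    console.error('Redis memory check failed', err);
  }
}, 60000);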
```
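
The parsing step of the monitor can be isolated into a pure function so it is testable without a Redis connection. This is a sketch under assumptions: `parseRedisInfo` and `memoryUsageRatio` are hypothetical helpers, and it assumes the `INFO memory` reply uses `field:value` lines and reports `maxmemory` (with `0` meaning no limit configured), in which case the caller's fallback size is used.

```typescript
// Parse an INFO reply ("field:value" lines, "#" section headers) into a map.
function parseRedisInfo(info: string): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const line of info.split(/\r?\n/)) {
    if (line === '' || line.startsWith('#')) continue;
    const idx = line.indexOf(':');
    if (idx > 0) fields[line.slice(0, idx)] = line.slice(idx + 1);
  }
  return fields;
}

// Returns used/max as a fraction, or null when used_memory is absent.
function memoryUsageRatio(info: string, fallbackMaxBytes: number): number | null {
  const fields = parseRedisInfo(info);
  const used = Number(fields['used_memory']);
  if (!Number.isFinite(used)) return null;

  // Prefer the server-reported limit; fall back to the purchased instance size.
  const configuredMax = Number(fields['maxmemory']);
  const max = configuredMax > 0 ? configuredMax : fallbackMaxBytes;
  return used / max;
}
```

Reading the limit from the server, when it is reported, avoids a stale hard-coded 256MB after the instance is upgraded.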

#### **缓解措施4：配置验证（已实现）** ✅
```typescript
// env.ts中的validateConfig()
// 启动时检查Redis配置
```
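
For readers who have not seen `validateConfig()`, the following is a rough sketch of the kind of startup check it could perform for Redis. It is not the project's implementation: `validateRedisEnv` is a hypothetical function, and the rules (allowed `CACHE_TYPE` values, `REDIS_URL` taking precedence over host/port) are assumptions based on the variables listed in Appendix A.

```typescript
// Returns a list of human-readable problems; an empty list means the config looks usable.
function validateRedisEnv(env: Record<string, string | undefined>): string[] {
  const errors: string[] = [];
  const cacheType = env['CACHE_TYPE'] ?? 'memory';

  if (cacheType !== 'redis' && cacheType !== 'memory') {
    errors.push(`CACHE_TYPE must be "redis" or "memory", got "${cacheType}"`);
  }

  // Host and port are only checked when Redis is selected and no REDIS_URL is given.
  if (cacheType === 'redis' && !env['REDIS_URL']) {
    if (!env['REDIS_HOST']) {
      errors.push('REDIS_HOST is required when CACHE_TYPE=redis and REDIS_URL is unset');
    }
    const port = Number(env['REDIS_PORT'] ?? '6379');
    if (!Number.isInteger(port) || port < 1 || port > 65535) {
      errors.push(`REDIS_PORT must be an integer between 1 and 65535, got "${env['REDIS_PORT']}"`);
    }
  }
  return errors;
}
```

Returning all problems at once, rather than throwing on the first, makes a failed SAE deployment easier to fix in a single pass.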

---

## 7. 上线计划

### 7.1 上线时间表（V2.0更新）

| 阶段 | 时间 | 任务 | 负责人 | 状态 |
|------|------|------|--------|------|
| **Phase 1** | Day 1上午 | 本地开发环境准备 | 开发 | ⬜ 待开始 |
| **Phase 2** | Day 1下午 | 实现RedisCacheAdapter | 开发 | ⬜ 待开始 |
| **Phase 3** | Day 2全天 | Redis缓存本地测试 | 开发+测试 | ⬜ 待开始 |
| **Phase 4** | Day 3上午 | 阿里云Redis购买&配置 | 运维 | ⬜ 待开始 |
| **Phase 5** | Day 3下午-Day 5 | 🔴 实现RedisQueue（必须）| 开发 | ⬜ 待开始 |
| **Phase 6** | Day 6全天 | Redis队列本地测试 + 业务集成 | 开发+测试 | ⬜ 待开始 |
| **Phase 7** | Day 7上午 | SAE测试环境验证 | 开发+测试 | ⬜ 待开始 |
| **Phase 8** | Day 7下午 | 生产环境上线 | 全员 | ⬜ 待开始 |
| **Phase 9** | Day 7晚+ | 监控观察（24小时） | 运维 | ⬜ 待开始 |

**总工作量**：7天（比原计划增加4天，但确保核心功能可用）

### 7.2 上线步骤（生产环境）

#### **Step 1：发布前检查（15分钟）**
```bash
✅ 代码已提交Git
✅ 本地测试全部通过
✅ 阿里云Redis已就绪
✅ SAE环境变量已配置
✅ 回滚方案已准备
✅ 监控已就绪
```

#### **Step 2：灰度发布（30分钟）**
```
1. SAE控制台 → 选择应用
2. 应用部署 → 分批发布
3. 第1批：10%实例（观察15分钟）
4. 第2批：50%实例（观察10分钟）
5. 第3批：100%实例
```

#### **Step 3：验证（15分钟）**
```bash
# 1. 检查Redis连接
curl https://your-api.com/api/health
# 应该返回：{ cache: "redis", status: "ok" }

# 2. 测试缓存写入
# 上传Excel → 查看日志 → 确认Redis写入

# 3. 测试缓存读取
# 再次上传 → 查看日志 → 确认缓存命中

# 4. 检查Redis内存
# Alibaba Cloud console → Redis monitoring → memory usage
```

#### **Step 4：监控观察（24小时）**
```
关注指标：
- Redis连接数（应该稳定）
- 内存使用率（应该 < 50%）
- 缓存命中率（应该 > 80%）
- 应用错误日志（应该无Redis相关错误）
- API成本（应该下降）
```

---

## 8. 回滚方案

### 8.1 快速回滚（5分钟内）

#### **场景1：Redis连接失败，应用无法启动**

```bash
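# Option 1: change the SAE environment variable
#   Alibaba Cloud console → SAE application → Configuration management → Environment variables
#   Set CACHE_TYPE=memory
#   Save → the application restarts

# Option 2: redeploy the previous version
#   SAE console → Application deployment → Version management → Roll back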
```

#### **场景2：Redis性能问题，响应变慢**

```bash
# 临时降级到内存缓存
CACHE_TYPE=memory

# 或保留Redis但检查网络
ping r-xxxxxxxxxxxx.redis.rds.aliyuncs.com
```

#### **场景3：Redis内存满，无法写入**

```bash
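# Option 1: flush Redis (dangerous!)
# FLUSHDB deletes every key in the selected DB. If the task queue shares that DB,
# queued and in-progress jobs are wiped too, so confirm no long tasks are running first.
redis-cli -h r-xxx.redis.rds.aliyuncs.com -a password
> FLUSHDB

# Option 2: upgrade the Redis instance
# Alibaba Cloud console → Redis instance → Change configuration → 512MB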
```

### 8.2 回滚检查清单

```bash
✅ 应用能否正常启动？
✅ 缓存是否工作（内存模式）？
✅ API响应是否正常？
✅ 错误日志是否清除？
✅ 用户是否能正常使用？
```

---

## 9. 监控与运维

### 9.1 监控指标

#### **Redis指标（阿里云控制台）**

| 指标 | 正常范围 | 告警阈值 | 处理方案 |
|------|---------|---------|---------|
| **内存使用率** | < 50% | > 80% | 检查大key，考虑升配 |
| **连接数** | < 50 | > 100 | 检查连接泄漏 |
| **QPS** | < 1000 | > 5000 | 考虑分片 |
| **命中率** | > 80% | < 50% | 检查缓存策略 |
| **响应时间** | < 5ms | > 50ms | 检查网络 |

#### **应用指标（SAE日志）**

```bash
# 查找Redis相关错误
grep "Redis" logs/app.log | grep "ERROR"

# 查找缓存命中情况
grep "Cache hit" logs/app.log | wc -l
grep "Cache miss" logs/app.log | wc -l

# 计算命中率
# Hit rate = hits / (hits + misses) * 100%
```
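
The same hit-rate calculation can be scripted instead of done by hand. The sketch below assumes the log lines contain the literal markers `Cache hit` and `Cache miss`, as the grep commands above do; `cacheHitRate` is a hypothetical helper name.

```typescript
// Computes the cache hit rate (in percent) from raw log lines.
// Returns null when the logs contain neither hits nor misses.
function cacheHitRate(logLines: string[]): number | null {
  let hits = 0;
  let misses = 0;
  for (const line of logLines) {
    if (line.includes('Cache hit')) hits++;
    else if (line.includes('Cache miss')) misses++;
  }
  const total = hits + misses;
  return total === 0 ? null : (hits / total) * 100;
}
```

Returning `null` for an empty sample keeps a quiet period from being reported as a 0% hit rate.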

### 9.2 运维命令

#### **常用Redis CLI命令**
```bash
# 连接Redis
redis-cli -h r-xxx.redis.rds.aliyuncs.com -p 6379 -a your_password

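# List keys incrementally. Avoid KEYS * in production: it blocks Redis while it walks the whole keyspace.
# Repeat SCAN with the returned cursor until the cursor comes back as 0.
SCAN 0 COUNT 100

# List task-related keys
SCAN 0 MATCH task:* COUNT 100

# List cache-related keys
SCAN 0 MATCH fulltext:* COUNT 100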

# 查看key的TTL
TTL task:abc123:progress

# 查看key的值
GET task:abc123:progress

# 查看内存使用
INFO memory

# 查看连接数
INFO clients

# Stream every command in real time (debugging only: MONITOR adds noticeable load on a busy instance)
MONITOR

# 清空数据库（危险！）
FLUSHDB
```
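
When keys need to be enumerated from application code, a cursor-based SCAN loop is the non-blocking alternative to `KEYS`. The sketch below is self-contained and takes the scan call as a parameter: `ScanFn` and `scanKeys` are hypothetical names. With ioredis, the function could presumably be wired up as `(c, p, n) => redis.scan(c, 'MATCH', p, 'COUNT', n)`, but treat that wiring as an assumption to verify against the client version in use.

```typescript
// One SCAN round trip: takes a cursor, returns [nextCursor, keysInThisBatch].
type ScanFn = (cursor: string, pattern: string, count: number) => Promise<[string, string[]]>;

// Walks the keyspace batch by batch until the server returns cursor "0".
async function scanKeys(scan: ScanFn, pattern: string, count = 100): Promise<string[]> {
  const found = new Set<string>(); // SCAN may return the same key more than once
  let cursor = '0';
  do {
    const [nextCursor, keys] = await scan(cursor, pattern, count);
    for (const key of keys) found.add(key);
    cursor = nextCursor;
  } while (cursor !== '0');
  return Array.from(found);
}
```

`COUNT` is only a hint for how much work each round trip does, so a batch may contain more or fewer keys than requested, including none.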

#### **内存分析**
```bash
# Sample the keyspace and report the largest key of each type (--bigkeys has no size threshold)
redis-cli -h xxx -a password --bigkeys

# 查看key的内存占用
MEMORY USAGE task:abc123:progress
```

### 9.3 故障排查流程

```
问题：Redis连接失败
  ↓
1. 检查Redis实例状态（阿里云控制台）
  ↓
2. 检查白名单配置（是否包含SAE IP）
  ↓
3. 检查密码是否正确
  ↓
4. ping测试网络连通性
  ↓
5. 查看应用日志
  ↓
6. 如无法快速解决 → 降级到内存缓存
```

### 9.4 定期维护

| 维度 | 频率 | 内容 |
|------|------|------|
| **日常监控** | 每天 | 查看内存使用、连接数、错误日志 |
| **性能分析** | 每周 | 分析缓存命中率、响应时间 |
| **容量评估** | 每月 | 评估256MB是否够用，是否需要升配 |
| **安全检查** | 每月 | 检查白名单、密码强度、访问日志 |

---

## 10. 成功标准

### 10.1 技术指标（V2.0更新）

| 指标 | 当前值（内存） | 目标值（Redis） | 衡量方法 |
|------|---------------|----------------|---------|
| **缓存持久化** | ❌ 实例重启丢失 | ✅ 持久化保存 | 重启后仍能读取 |
| **多实例共享** | ❌ 各实例独立 | ✅ 全局共享 | 实例A写，实例B能读 |
| **LLM API重复调用** | 🔴 高 | 🟢 低 | 同一PDF只调用1次 |
| **缓存命中率** | N/A | > 60% | 监控日志统计 |
| **任务持久化** | ❌ 实例销毁丢失 | ✅ 任务继续 | 🔴 **新增** |
| **长任务成功率** | 10-30% | > 99% | 🔴 **关键指标** |
| **系统可用性** | 99% | 99.9% | 降级策略保障 |

### 10.2 业务指标（V2.0更新）

| 指标 | 改造前 | 改造后 | 衡量方法 |
|------|--------|--------|---------|
| **LLM API成本** | ¥X/月 | 降低40-60% | 对比账单 |
| **任务丢失率** | 🔴 70-95% | < 1% | 🔴 **最重要** |
| **任务完成时间** | 不确定 | 稳定 | 监控日志 |
| **用户重复提交次数** | 平均3次 | 几乎为0 | 用户行为分析 |
| **用户满意度** | 基线 | 显著提升 | 问卷调查 |

### 10.3 验收标准（V2.0更新）

#### **Redis缓存验收**
```bash
✅ 所有缓存单元测试通过
✅ HealthCheckService缓存命中测试通过
✅ LLM12FieldsService缓存命中测试通过
✅ 实例重启后缓存仍存在
✅ 缓存命中率 > 60%
✅ LLM API调用次数下降 > 40%
```

#### **Redis队列验收** 🔴 **新增关键项**
```bash
✅ Redis队列单元测试通过
✅ 长任务（2小时）测试通过
✅ 实例重启后任务自动恢复
✅ 任务失败自动重试（3次）
✅ 1000篇文献筛选成功率 > 99%
✅ 无用户投诉任务丢失
✅ 进度实时更新正常
```
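
The "automatic retry (3 times)" item can be illustrated with a small retry helper. This is a generic sketch, not the queue library's own retry mechanism: `runWithRetry` is a hypothetical name, "3 times" is read here as three attempts in total, and the exponential backoff is an added assumption.

```typescript
// Runs a job up to maxAttempts times, doubling the wait between attempts.
async function runWithRetry<T>(
  job: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await job();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Backoff: baseDelayMs, 2x, 4x, ... between consecutive attempts.
        const delayMs = baseDelayMs * Math.pow(2, attempt - 1);
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // All attempts failed: surface the last error to the caller.
  throw lastError;
}
```

For LLM calls, retried jobs should be idempotent or guarded by the result cache, otherwise each retry is billed again.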

#### **故障恢复测试** 🔴 **重要**
```bash
✅ 模拟实例销毁 → 任务自动恢复
✅ 模拟Redis宕机 → 系统降级运行
✅ 模拟网络延迟 → 任务正常完成
✅ 模拟并发任务 → 正确分配处理
```

#### **生产环境验收**
```bash
✅ 生产环境运行48小时无错误
✅ 2个完整的1000篇文献筛选任务成功
✅ 监控指标正常（内存、连接数、QPS）
✅ 无用户投诉
```

---

## 11. FAQ

### Q1：如果Redis在半夜突然挂了怎么办？
**A1**：系统会自动降级到内存缓存，应用继续运行。第二天运维检查并恢复Redis。

### Q2：256MB够用吗？什么时候需要升配？
**A2**：
- 当前预估：< 50% 使用率
- 触发升配信号：
  - 内存使用 > 80%
  - 频繁触发LRU驱逐
  - 监控告警
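- Upgrade path: change the instance specification in the Alibaba Cloud console. No application restart is needed, but connections may drop briefly during the switch, so the client should be able to reconnect.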

### Q3：Redis会影响系统性能吗？
**A3**：
- 本地内存：< 0.1ms
- 本地Redis：~1ms
- 阿里云Redis（同地域）：~2-5ms
- 影响可忽略，且有批量操作优化

### Q4：如果发现Redis不适合，能回退到内存缓存吗？
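**A4**: Yes. Set `CACHE_TYPE=memory` and restart the application (see section 8.1); no code change is needed. This only reverts the cache; the queue is controlled separately by `QUEUE_TYPE`.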

### Q5：需要学习Redis命令吗？
**A5**：
- 开发：不需要，代码已封装
- 运维：建议学习5个基础命令（GET/SET/KEYS/TTL/INFO）

---

## 12. 相关文档

- [云原生开发规范](../04-开发规范/08-云原生开发规范.md)
- [SAE部署完全指南](./02-SAE部署完全指南(产品经理版).md)
- [SAE环境变量配置指南](./03-SAE环境变量配置指南.md)
- [Redis官方文档](https://redis.io/docs/)
- [ioredis文档](https://github.com/redis/ioredis)
- [阿里云Redis文档](https://help.aliyun.com/product/26340.html)

---

## 13. 附录

### 附录A：完整的.env配置模板

```env
# ==================== 数据库 ====================
DATABASE_URL=postgresql://postgres:password@localhost:5432/ai_clinical_research

# ==================== Redis配置 ====================
CACHE_TYPE=redis
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_PASSWORD=
REDIS_DB=0
# 或使用连接字符串
# REDIS_URL=redis://localhost:6379

# ==================== 队列配置 ====================
# V2.0: the Redis queue is required; use memory only for local debugging
QUEUE_TYPE=redis

# ==================== JWT ====================
JWT_SECRET=your-secret-key-change-in-production
JWT_EXPIRES_IN=7d

# ==================== LLM API ====================
DEEPSEEK_API_KEY=sk-xxxxxxxxxxxxxx
DASHSCOPE_API_KEY=sk-xxxxxxxxxxxxxx
CLOSEAI_API_KEY=sk-xxxxxxxxxxxxxx
CLOSEAI_OPENAI_BASE_URL=https://api.openai-proxy.org/v1
CLOSEAI_CLAUDE_BASE_URL=https://api.openai-proxy.org/anthropic

# ==================== Dify ====================
DIFY_API_URL=http://localhost/v1
DIFY_API_KEY=dataset-xxxxxxxxxxxxxx

# ==================== Server ====================
PORT=3001
NODE_ENV=development

# ==================== 存储配置 ====================
STORAGE_TYPE=local
LOCAL_STORAGE_DIR=uploads
LOCAL_STORAGE_URL=http://localhost:3001/uploads

# ==================== CORS配置 ====================
CORS_ORIGIN=http://localhost:5173

# ==================== 日志配置 ====================
LOG_LEVEL=debug
```

### 附录B：Redis内存计算器

```
单个LLM结果缓存：~50KB
单个健康检查缓存：~5KB

预估容量：
- 1000个LLM结果 = 50MB
- 1000个健康检查 = 5MB
- 系统开销 = 20MB
-----------------------------
总计 = 75MB / 256MB = 29% 使用率
```
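
The same estimate can be written as a small function so it is easy to rerun with different volumes. This sketch uses only the per-item sizes and the 20MB overhead stated above, and follows the document's rounding of 1MB = 1000KB; `estimateRedisUsageMb` is a hypothetical helper name.

```typescript
// Per-item sizes taken from the estimate above.
const LLM_ENTRY_KB = 50;
const HEALTH_ENTRY_KB = 5;

// Estimated Redis usage in MB (1MB treated as 1000KB, matching the table above).
function estimateRedisUsageMb(llmEntries: number, healthEntries: number, overheadMb = 20): number {
  const payloadKb = llmEntries * LLM_ENTRY_KB + healthEntries * HEALTH_ENTRY_KB;
  return payloadKb / 1000 + overheadMb;
}

const estimatedMb = estimateRedisUsageMb(1000, 1000); // 50 + 5 + 20 = 75
const usagePercent = Math.round((estimatedMb / 256) * 100); // 29
```

The estimate covers cache entries only; queue job payloads stored in the same instance would add to it and are not sized in this plan.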

### 附录C：故障演练脚本

```bash
#!/bin/bash
# 文件：backend/scripts/disaster-recovery-drill.sh

echo "🚨 Redis故障演练开始..."

# 1. 停止Redis
echo "1. 停止Redis..."
docker stop ai-clinical-redis
sleep 2

# 2. 测试应用是否正常
echo "2. 测试应用健康检查..."
response=$(curl -s http://localhost:3001/api/health)
echo "响应: $response"

if [[ $response == *"memory"* ]]; then
  echo "✅ 降级成功，使用内存缓存"
else
  echo "❌ 降级失败"
  exit 1
fi

# 3. 恢复Redis
echo "3. 恢复Redis..."
docker start ai-clinical-redis
sleep 5

# 4. 测试Redis恢复
echo "4. 测试Redis恢复..."
response=$(curl -s http://localhost:3001/api/health)
echo "响应: $response"

if [[ $response == *"redis"* ]]; then
  echo "✅ Redis恢复成功"
else
  echo "⚠️ Redis未恢复，仍使用内存缓存"
fi

echo "🎉 故障演练完成！"
```

---

**文档维护者：** 技术团队  
**最后更新：** 2025-12-12  
**Document status:** ⏳ Pending review  
**下次更新：** 改造完成后总结经验教训

---

## ✅ 改造完成检查清单

在完成Redis改造后，请逐项检查：

### 代码层面
- [ ] `ioredis` 已安装
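- [ ] `RedisCacheAdapter` is implemented
- [ ] `CacheFactory` includes the fallback logic
- [ ] `RedisQueue` is implemented and `QUEUE_TYPE=redis` is configured (required in V2.0)
- [ ] `.env` configuration is updated
- [ ] Every `cache.set()` call sets a TTL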

### 测试层面
- [ ] 单元测试全部通过
- [ ] 集成测试全部通过
- [ ] 压力测试达标
- [ ] 故障模拟测试通过

### 部署层面
- [ ] 阿里云Redis已购买
- [ ] 白名单已配置
- [ ] SAE环境变量已配置
- [ ] 生产环境已验证

### 文档层面
- [ ] 改造文档已更新
- [ ] 运维文档已补充
- [ ] 监控指标已记录
- [ ] 经验教训已总结

---

**祝改造顺利！如有问题，请及时沟通。** 🚀