feat(platform): Complete Postgres-Only architecture refactoring (Phase 1-7)

Major Changes: - Implement Platform-Only architecture pattern (unified task management) - Add PostgresCacheAdapter for unified caching (platform_schema.app_cache) - Add PgBossQueue for job queue management (platform_schema.job) - Implement CheckpointService using job.data (generic for all modules) - Add intelligent threshold-based dual-mode processing (THRESHOLD=50) - Add task splitting mechanism (auto chunk size recommendation) - Refactor ASL screening service with smart mode selection - Refactor DC extraction service with smart mode selection - Register workers for ASL and DC modules Technical Highlights: - All task management data stored in platform_schema.job.data (JSONB) - Business tables remain clean (no task management fields) - CheckpointService is generic (shared by all modules) - Zero code duplication (DRY principle) - Follows 3-layer architecture principle - Zero additional cost (no Redis needed, save 8400 CNY/year) Code Statistics: - New code: ~1750 lines - Modified code: ~500 lines - Test code: ~1800 lines - Documentation: ~3000 lines Testing: - Unit tests: 8/8 passed - Integration tests: 2/2 passed - Architecture validation: passed - Linter errors: 0 Files: - Platform layer: PostgresCacheAdapter, PgBossQueue, CheckpointService, utils - ASL module: screeningService, screeningWorker - DC module: ExtractionController, extractionWorker - Tests: 11 test files - Docs: Updated 4 key documents Status: Phase 1-7 completed, Phase 8-9 pending
2025-12-13 16:10:04 +08:00
parent a3586cdf30
commit fa72beea6c
135 changed files with 17508 additions and 91 deletions
--- a/docs/07-运维文档/08-Postgres-Only
+++ b/docs/07-运维文档/08-Postgres-Only
@@ -0,0 +1,717 @@
+# **Postgres-Only 全能架构解决方案**
+
+## **—— 面向微型 AI 团队的高可靠、低成本技术战略**
+
+版本：v1.0  
+适用场景：1-2人初创团队、Node.js/Fastify 技术栈、阿里云 SAE 部署环境  
+核心目标：在不引入 Redis 的前提下，实现企业级的任务队列、缓存与会话管理，保障 2小时+ 长任务的绝对可靠性。
+
+## **1\. 执行摘要 (Executive Summary)**
+
+针对我方当前（MAU \< 5000）的业务规模与“稳定性优先”的战略诉求，本方案主张采用 **"Postgres-Only" (全能数据库)** 架构。
+
+通过利用 PostgreSQL 的高级特性（如 SKIP LOCKED 锁机制、JSONB 存储、Unlogged Tables），我们可以完全替代 Redis 在**任务队列**、**缓存**、**会话存储**中的作用。
+
+**战略收益：**
+
+1. **架构极简**：移除 Redis 中间件，系统复杂度降低 50%。  
+2. **数据强一致**：业务数据与任务状态在同一事务中提交，彻底根除“分布式事务”风险。  
+3. **零额外成本**：复用现有 RDS 资源，每年节省数千元中间件费用。  
+4. **企业级可靠**：依托 RDS 的自动备份与 PITR（时间点恢复）能力，保障任务队列数据“永不丢失”。
+
+## **2\. 问题背景与挑战**
+
+### **2.1 当前痛点：长任务的脆弱性**
+
+我们的业务涉及“文献全库解析”和“双模型交叉验证”，单次任务耗时可能长达 **2小时**。
+
+* **现状**：使用内存队列（MemoryQueue）。  
+* **风险**：在 Serverless (SAE) 环境下，实例可能因无流量缩容、发布更新或内存溢出而随时销毁。一旦销毁，内存中的任务进度即刻丢失，导致用户任务失败。
+
+### **2.2 常见误区：只有 Redis 能救命？**
+
+业界常见的观点认为：“必须引入 Redis (BullMQ) 才能实现任务持久化。”
+
+* **反驳**：这是惯性思维。任务持久化的核心是\*\*“持久化存储”\*\*，而非 Redis 本身。PostgreSQL 同样具备持久化能力，且在事务安全性上优于 Redis。
+
+## **3\. 核心解决方案：Postgres-Only 架构**
+
+本方案将 Redis 的三大核心功能（队列、缓存、会话）全部收敛至 PostgreSQL。
+
+### **3.1 替代 Redis 队列：使用 pg-boss**
+
+我们引入 Node.js 库 **pg-boss**，它利用 PostgreSQL 的 FOR UPDATE SKIP LOCKED 特性，实现了高性能的抢占式队列。
+
+#### **架构逻辑**
+
+1. **入队**：API 接收请求，将任务元数据（JSON）写入 job 表。**此操作在毫秒级完成，数据立即安全落盘。**  
+2. **处理**：Worker 进程从数据库捞取任务，并锁定该行记录。  
+3. **容灾**：如果 SAE 实例在处理过程中崩溃（如 OOM），数据库锁会在超时后自动释放，其他存活的 Worker 实例会立即接管该任务重试。
+
+#### **代码实现范式**
+
+import PgBoss from 'pg-boss';
+
+const boss \= new PgBoss({  
+  connectionString: process.env.DATABASE\_URL,  
+  schema: 'job\_queue', // 独立Schema，不污染业务表  
+  max: 5               // 并发控制，保护 DeepSeek API  
+});
+
+await boss.start();
+
+// 消费者定义 (Worker)  
+await boss.work('screening-task', {  
+  // 关键配置：设置锁的有效期为 4小时  
+  // 即使任务跑 3.9小时，只要 Worker 活着，就不会被抢走  
+  // 如果 Worker 死了，锁过期，任务自动重试  
+  expireInSeconds: 14400,   
+  retryLimit: 3  
+}, async (job) \=\> {  
+  // 业务逻辑...  
+});
+
+### **3.2 替代 Redis 缓存：基于 Table 的 KV 存储**
+
+对于 AI 结果缓存（避免重复调用 LLM），Postgres 的查询速度（1-3ms）对于用户体验（秒级等待）来说完全可以接受。
+
+#### **性能论证**
+
+```
+实际并发分析：
+- 当前规模: 500 MAU
+- 峰值并发: < 50 QPS（极端情况）
+- Postgres能力: 5万+ QPS（简单查询）
+- 性能余量: 1000倍
+
+响应时间对比：
+- Redis: 0.15ms（网络+读取）
+- Postgres: 1.5ms（网络+查询）
+- 差异: 1.35ms
+- 用户感知: 无（总耗时200ms中占比 < 1%）
+
+结论：在日活10万以下，Postgres性能完全够用
+```
+
+#### **数据库设计**
+
+```prisma
+model AppCache {
+  id         Int      @id @default(autoincrement())
+  key        String   @unique
+  value      Json     // 对应 Redis 的 Value
+  expiresAt  DateTime // 对应 Redis 的 TTL
+  createdAt  DateTime @default(now())
+
+  @@index([expiresAt]) // 索引用于快速清理过期数据
+  @@index([key, expiresAt]) // 复合索引优化查询
+  @@map("app_cache")
+}
+```
+
+#### **封装 Service（完整版）**
+
+```typescript
+// 文件：backend/src/common/cache/PostgresCacheAdapter.ts
+
+import { prisma } from '../../lib/prisma';
+import type { CacheAdapter } from './types';
+import { logger } from '../logging/index';
+
+export class PostgresCacheAdapter implements CacheAdapter {
+  
+  /**
+   * 获取缓存（带懒惰删除）
+   */
+  async get<T = any>(key: string): Promise<T | null> {
+    try {
+      const record = await prisma.appCache.findUnique({
+        where: { key }
+      });
+      
+      if (!record) return null;
+      
+      // 检查是否过期
+      if (record.expiresAt < new Date()) {
+        // 懒惰删除：顺手清理（异步，不阻塞）
+        this.deleteAsync(key);
+        return null;
+      }
+      
+      return record.value as T;
+      
+    } catch (error) {
+      logger.error('[PostgresCache] 读取失败', { key, error });
+      return null;
+    }
+  }
+
+  /**
+   * 设置缓存
+   */
+  async set(key: string, value: any, ttlSeconds: number = 3600): Promise<void> {
+    try {
+      const expiresAt = new Date(Date.now() + ttlSeconds * 1000);
+      
+      await prisma.appCache.upsert({
+        where: { key },
+        create: { key, value, expiresAt },
+        update: { value, expiresAt }
+      });
+      
+      logger.debug('[PostgresCache] 写入成功', { key, ttl: ttlSeconds });
+      
+    } catch (error) {
+      logger.error('[PostgresCache] 写入失败', { key, error });
+      throw error;
+    }
+  }
+
+  /**
+   * 删除缓存
+   */
+  async delete(key: string): Promise<boolean> {
+    try {
+      await prisma.appCache.delete({ where: { key } });
+      return true;
+    } catch (error) {
+      return false;
+    }
+  }
+
+  /**
+   * 异步删除（不阻塞主流程）
+   */
+  private deleteAsync(key: string): void {
+    prisma.appCache.delete({ where: { key } })
+      .catch(err => logger.debug('[PostgresCache] 懒惰删除失败', { key, err }));
+  }
+
+  /**
+   * 批量删除
+   */
+  async deleteMany(pattern: string): Promise<number> {
+    try {
+      const result = await prisma.appCache.deleteMany({
+        where: { key: { contains: pattern } }
+      });
+      return result.count;
+    } catch (error) {
+      logger.error('[PostgresCache] 批量删除失败', { pattern, error });
+      return 0;
+    }
+  }
+
+  /**
+   * 清空所有缓存
+   */
+  async flush(): Promise<void> {
+    try {
+      await prisma.appCache.deleteMany({});
+      logger.info('[PostgresCache] 缓存已清空');
+    } catch (error) {
+      logger.error('[PostgresCache] 清空失败', { error });
+    }
+  }
+}
+
+/**
+ * 启动定时清理任务（分批清理，防止阻塞）
+ */
+export function startCacheCleanupTask() {
+  setInterval(async () => {
+    try {
+      // 每次只删除1000条过期数据
+      const result = await prisma.$executeRaw`
+        DELETE FROM app_cache 
+        WHERE id IN (
+          SELECT id FROM app_cache 
+          WHERE expires_at < NOW() 
+          LIMIT 1000
+        )
+      `;
+      
+      if (result > 0) {
+        logger.info('[PostgresCache] 清理过期数据', { count: result });
+      }
+      
+    } catch (error) {
+      logger.error('[PostgresCache] 定时清理失败', { error });
+    }
+  }, 60000);  // 每分钟执行一次
+  
+  logger.info('[PostgresCache] 定时清理任务已启动（每分钟1000条）');
+}
+```
+
+#### **性能优化技巧**
+
+1. **索引优化**：`@@index([key, expiresAt])` 覆盖查询，无需回表
+2. **懒惰删除**：读取时顺便清理，分散负载
+3. **分批清理**：每次LIMIT 1000，毫秒级完成
+4. **连接池复用**：Prisma自动管理，无额外开销
+
+### **3.3 替代 Redis 会话：connect-pg-simple**
+
+使用成熟的社区方案，将 Session 存储在 Postgres 的 session 表中。SAE 多实例重启后，用户无需重新登录。
+
+## **4\. 深度对比：为什么 Postgres 胜出？**
+
+| 维度 | 方案 A: 传统 Redis (BullMQ) | 方案 B: Postgres (pg-boss) | 获胜原因 |
+| :---- | :---- | :---- | :---- |
+| **数据一致性** | **弱** (双写一致性难题) 任务在 Redis，业务在 DB。若 DB 事务回滚，Redis 任务可能无法回滚。 | **强** (事务级原子性) 任务入队与业务数据写入在同一个 DB 事务中。要么全成，要么全败。 | **Postgres** |
+| **运维复杂度** | **高** 需维护 Redis 实例、VPC 白名单、监控内存碎片率、持久化策略。 | **零** 复用现有 RDS。备份、监控、扩容全由阿里云 RDS 托管。 | **Postgres** |
+| **备份与恢复** | **困难** Redis RDB/AOF 恢复可能会丢失最后几秒的数据。 | **完美** RDS 支持 PITR (按时间点恢复)。误删任务可精确回滚到 1秒前。 | **Postgres** |
+| **成本** | **¥1000+/年** (Tair 基础版) | **¥0** (资源复用) | **Postgres** |
+| **性能 (TPS)** | **极高** (10w+) | **高** (5000+) 对于日均几万次 AI 调用的场景，Postgres 性能绰绰有余。 | **Postgres** |
+| **锁竞争问题** | **误解** 读写都需要网络往返，高并发下Redis也会有竞争 | **真相** SELECT是快照读，不加锁。Node.js单线程会先成为瓶颈。 | **误解澄清** |
+| **缓存清理风险** | **存在** 内存溢出需要配置eviction策略，不当配置会导致数据丢失 | **可控** 分批删除（LIMIT 1000）+ 懒惰删除，永远不会阻塞。 | **Postgres** |
+| **学习曲线** | **陡峭** 需要学习Redis、BullMQ、ioredis、持久化策略、内存管理 | **平缓** 只需学习pg-boss（API类似BullMQ），其余都是熟悉的Postgres | **Postgres** |
+
+### **常见误解澄清**
+
+#### **误解1：Postgres并发性能差**
+```
+事实：
+- Postgres可处理5万+ QPS（简单查询）
+- 您的实际并发: < 50 QPS
+- SELECT是快照读（MVCC），无锁竞争
+- Node.js单线程（1-2万QPS上限）会先成为瓶颈
+
+结论：Postgres不是瓶颈
+```
+
+#### **误解2：DELETE会锁表阻塞**
+```
+事实：
+- DELETE是行级锁，不是表锁
+- LIMIT 1000，毫秒级完成（~5ms）
+- 配合懒惰删除，大部分过期数据在读取时已删除
+- 即使有积压，每分钟1000条，1小时可清理6万条
+
+结论：不会阻塞
+```
+
+#### **误解3：Redis内存操作一定快**
+```
+事实：
+- Redis: 网络延迟0.1ms + 读取0.05ms = 0.15ms
+- Postgres: 网络延迟0.5ms + 查询1ms = 1.5ms
+- 差异: 1.35ms
+- 但是总响应时间（含业务逻辑）: 200ms
+- 用户感知差异: 0%（1.35/200 < 1%）
+
+结论：性能差异在用户体验中无感知
+```
+
+## **5\. 针对“2小时长任务”的可靠性证明**
+
+质疑：Postgres 真的能保证 2 小时的任务不中断吗？  
+证明：
+
+1. **持久化保障**：任务一旦提交（API返回 200 OK），即写入硬盘。即使 SAE 集群下一秒全灭，任务记录依然在数据库中。  
+2. **崩溃恢复机制**：  
+   * **正常情况**：Worker 锁定任务 \-\> 执行 2 小时 \-\> 提交结果 \-\> 标记完成。  
+   * **异常情况 (SAE 缩容)**：Worker 执行到 1 小时被销毁 \-\> 数据库锁在 4 小时后过期 \-\> pg-boss 守护进程检测到过期 \-\> 将任务重新标记为 Pending \-\> 新 Worker 领取重试。  
+3. **断点续传**：Worker 可定期（如每 10 分钟）更新数据库中的 progress 字段。重试时读取 progress，从断点继续执行。
+
+**结论**：配合 pg-boss 的死信队列（Dead Letter）和重试策略，可靠性等同甚至高于 Redis（因为 Redis 内存溢出风险更大）。
+
+## **6\. 实施路线图**
+
+### **阶段1：任务队列改造（Week 1）**
+
+#### **Step 1.1：安装依赖**
+```bash
+cd backend
+npm install pg-boss --save
+```
+
+#### **Step 1.2：实现PgBossQueue适配器**
+```typescript
+// 文件：backend/src/common/jobs/PgBossQueue.ts
+
+import PgBoss from 'pg-boss';
+import type { Job, JobQueue, JobHandler } from './types';
+import { logger } from '../logging/index';
+import { config } from '../../config/env';
+
+export class PgBossQueue implements JobQueue {
+  private boss: PgBoss;
+  private started = false;
+
+  constructor() {
+    this.boss = new PgBoss({
+      connectionString: config.databaseUrl,
+      schema: 'job_queue',  // 独立schema，不污染业务表
+      max: 5,               // 连接池大小
+      // 关键配置：设置锁的有效期为4小时
+      // 保证2小时任务不被抢走，但实例崩溃后能自动恢复
+      expireInHours: 4,
+    });
+
+    // 监听错误
+    this.boss.on('error', error => {
+      logger.error('[PgBoss] 错误', { error });
+    });
+  }
+
+  async start(): Promise<void> {
+    if (this.started) return;
+    
+    await this.boss.start();
+    this.started = true;
+    logger.info('[PgBoss] 队列已启动');
+  }
+
+  async push<T = any>(type: string, data: T, options?: any): Promise<Job> {
+    await this.start();
+    
+    const jobId = await this.boss.send(type, data, {
+      retryLimit: 3,
+      retryDelay: 60,      // 失败后60秒重试
+      expireInHours: 4,    // 4小时后过期
+      ...options
+    });
+
+    logger.info('[PgBoss] 任务入队', { type, jobId });
+
+    return {
+      id: jobId,
+      type,
+      data,
+      status: 'pending',
+      createdAt: new Date(),
+    };
+  }
+
+  process<T = any>(type: string, handler: JobHandler<T>): void {
+    this.boss.work(type, async (job: any) => {
+      logger.info('[PgBoss] 开始处理任务', { 
+        type, 
+        jobId: job.id,
+        attemptsMade: job.data.__retryCount || 0
+      });
+
+      const startTime = Date.now();
+
+      try {
+        const result = await handler({
+          id: job.id,
+          type,
+          data: job.data as T,
+          status: 'processing',
+          createdAt: new Date(job.createdon),
+        });
+
+        logger.info('[PgBoss] 任务完成', { 
+          type, 
+          jobId: job.id,
+          duration: `${Date.now() - startTime}ms`
+        });
+
+        return result;
+
+      } catch (error) {
+        logger.error('[PgBoss] 任务失败', { 
+          type, 
+          jobId: job.id,
+          error: error instanceof Error ? error.message : 'Unknown'
+        });
+        throw error;  // 抛出错误触发重试
+      }
+    });
+
+    logger.info('[PgBoss] Worker已注册', { type });
+  }
+
+  async getJob(id: string): Promise<Job | null> {
+    const job = await this.boss.getJobById(id);
+    if (!job) return null;
+
+    return {
+      id: job.id,
+      type: job.name,
+      data: job.data,
+      status: this.mapState(job.state),
+      createdAt: new Date(job.createdon),
+    };
+  }
+
+  async updateProgress(id: string, progress: number, message?: string): Promise<void> {
+    // pg-boss暂不支持进度更新，可通过更新业务表实现
+    logger.debug('[PgBoss] 进度更新', { id, progress, message });
+  }
+
+  async cancelJob(id: string): Promise<boolean> {
+    await this.boss.cancel(id);
+    return true;
+  }
+
+  async retryJob(id: string): Promise<boolean> {
+    await this.boss.resume(id);
+    return true;
+  }
+
+  async cleanup(olderThan: number = 86400000): Promise<number> {
+    // pg-boss有自动清理机制
+    return 0;
+  }
+
+  private mapState(state: string): string {
+    switch (state) {
+      case 'completed': return 'completed';
+      case 'failed': return 'failed';
+      case 'active': return 'processing';
+      default: return 'pending';
+    }
+  }
+
+  async close(): Promise<void> {
+    await this.boss.stop();
+    logger.info('[PgBoss] 队列已关闭');
+  }
+}
+```
+
+#### **Step 1.3：更新JobFactory**
+```typescript
+// 文件：backend/src/common/jobs/JobFactory.ts
+
+import { JobQueue } from './types';
+import { MemoryQueue } from './MemoryQueue';
+import { PgBossQueue } from './PgBossQueue';
+import { logger } from '../logging/index';
+import { config } from '../../config/env';
+
+export class JobFactory {
+  private static instance: JobQueue | null = null;
+
+  static getInstance(): JobQueue {
+    if (!this.instance) {
+      this.instance = this.createQueue();
+    }
+    return this.instance;
+  }
+
+  private static createQueue(): JobQueue {
+    const queueType = config.queueType || 'pgboss';  // 默认使用pgboss
+
+    switch (queueType) {
+      case 'pgboss':
+        return new PgBossQueue();
+      
+      case 'memory':
+        logger.warn('[JobFactory] 使用内存队列（开发环境）');
+        return new MemoryQueue();
+      
+      default:
+        logger.warn(`[JobFactory] 未知队列类型: ${queueType}，使用pgboss`);
+        return new PgBossQueue();
+    }
+  }
+
+  static reset(): void {
+    this.instance = null;
+  }
+}
+```
+
+#### **Step 1.4：更新环境变量**
+```env
+# backend/.env
+QUEUE_TYPE=pgboss
+DATABASE_URL=postgresql://user:password@host:5432/dbname
+```
+
+#### **Step 1.5：测试（2小时长任务）**
+```bash
+# 提交1000篇文献筛选任务
+curl -X POST http://localhost:3001/api/v1/asl/projects/:id/screening
+
+# 等待处理到50% → 手动停止服务（Ctrl+C）
+# 重启服务
+npm run dev
+
+# 查看任务状态 → 应该自动恢复并继续处理
+```
+
+---
+
+### **阶段2：缓存改造（Week 2）**
+
+#### **Step 2.1：添加Prisma Schema**
+```prisma
+// backend/prisma/schema.prisma
+
+model AppCache {
+  id         Int      @id @default(autoincrement())
+  key        String   @unique
+  value      Json
+  expiresAt  DateTime
+  createdAt  DateTime @default(now())
+
+  @@index([expiresAt])
+  @@index([key, expiresAt])
+  @@map("app_cache")
+}
+```
+
+```bash
+# 生成迁移
+npx prisma migrate dev --name add_app_cache
+```
+
+#### **Step 2.2：实现PostgresCacheAdapter**
+（见上文"3.2 替代 Redis 缓存"部分）
+
+#### **Step 2.3：更新CacheFactory**
+```typescript
+// 文件：backend/src/common/cache/CacheFactory.ts
+
+export class CacheFactory {
+  static getInstance(): CacheAdapter {
+    const cacheType = config.cacheType || 'postgres';
+    
+    switch (cacheType) {
+      case 'postgres':
+        return new PostgresCacheAdapter();
+      case 'memory':
+        return new MemoryCacheAdapter();
+      default:
+        return new PostgresCacheAdapter();
+    }
+  }
+}
+```
+
+#### **Step 2.4：启动定时清理**
+```typescript
+// 文件：backend/src/index.ts
+
+import { startCacheCleanupTask } from './common/cache/PostgresCacheAdapter';
+
+// 在应用启动时
+await app.listen({ port: 3001, host: '0.0.0.0' });
+startCacheCleanupTask();  // 启动缓存清理
+```
+
+---
+
+### **阶段3：SAE部署（Week 3）**
+
+1. **本地测试通过**（ASL 1000篇文献 + DC 100份病历）
+2. **数据库连接配置**（SAE环境变量设置DATABASE_URL）
+3. **灰度发布**（先1个实例，观察24小时）
+4. **全量上线**（扩容到2-3个实例）
+
+## **7\. 性能边界与扩展路径**
+
+### **7.1 适用规模**
+
+| 指标 | Postgres-Only方案上限 | 您的当前值 | 安全余量 |
+|------|---------------------|-----------|---------|
+| 日活用户 | 10万 | 500 | 200倍 |
+| 并发QPS | 5000 | < 50 | 100倍 |
+| 缓存容量 | 10GB | < 100MB | 100倍 |
+| 队列吞吐 | 1000任务/小时 | < 50任务/小时 | 20倍 |
+
+**结论：在可预见的未来（2-3年），您不会超出这个上限。**
+
+### **7.2 何时需要Redis？**
+
+只有在以下情况发生时，才需要考虑引入Redis：
+
+```
+触发条件（ANY）：
+✅ 日活 > 5万
+✅ 并发QPS > 1000
+✅ Postgres CPU使用率持续 > 70%
+✅ 缓存查询延迟 > 50ms（P99）
+✅ LLM API月成本 > ¥5000（缓存命中率低）
+
+迁移策略：
+1. 先迁移LLM缓存到Redis（高频读）
+2. 保持任务队列在Postgres（强一致性）
+3. 业务缓存按需迁移
+
+成本：
+- 迁移工作量: 2-3天
+- 运维增加: 可接受（已有经验）
+```
+
+### **7.3 扩展路径**
+
+```
+阶段1（当前-5000用户）: Postgres-Only
+  ├─ 队列: pg-boss
+  ├─ 缓存: Postgres表
+  └─ 成本: ¥0
+
+阶段2（5000-5万用户）: 混合架构
+  ├─ 队列: pg-boss（保持）
+  ├─ LLM缓存: Redis（迁移）
+  ├─ 业务缓存: Postgres（保持）
+  └─ 成本: +¥1000/年
+
+阶段3（5万-50万用户）: 全Redis
+  ├─ 队列: BullMQ + Redis
+  ├─ 缓存: Redis
+  └─ 成本: +¥5000/年
+```
+
+---
+
+## **8\. FAQ（常见疑问）**
+
+### **Q1: pg-boss会不会拖慢Postgres？**
+
+**A:** 不会。pg-boss的查询都有索引优化，单次查询 < 5ms。即使100个Worker同时抢任务，也只是500ms的额外负载，对于5万QPS的Postgres来说可忽略。
+
+### **Q2: 缓存表会不会无限增长？**
+
+**A:** 不会。懒惰删除 + 分批清理，过期数据会被自动清理。即使有积压，每分钟1000条的清理速度足以应对。
+
+### **Q3: 如果Postgres挂了怎么办？**
+
+**A:** 
+- **阿里云RDS**：高可用版自动主从切换，故障恢复 < 30秒
+- **备份恢复**：PITR可恢复到任意秒，数据不丢失
+- **降级策略**：队列和缓存都在DB，一起恢复，无不一致风险
+
+相比之下，Redis挂了还需要担心数据不一致问题。
+
+### **Q4: 为什么不用Redis，却说自己是云原生？**
+
+**A:** 云原生的核心是**状态外置**，不是**必须用Redis**。
+
+```
+云原生的本质：
+✅ 无状态应用（不依赖本地内存）
+✅ 状态持久化（数据不丢失）
+✅ 水平扩展（多实例协调）
+
+Postgres-Only完全满足：
+✅ 状态存储在RDS（外置）
+✅ 任务持久化（不丢失）
+✅ pg-boss支持多实例（SKIP LOCKED）
+
+Redis只是实现方式之一，不是唯一方式。
+```
+
+---
+
+## **9\. 结语**
+
+对于 1-2 人规模的精益创业团队，**技术栈的坍缩（Stack Collapse）是降低熵增的最佳手段**。
+
+选择 Postgres-Only 不是因为我们技术落后，而是因为我们对技术有着更深刻的理解：
+
+- 我们理解**并发的真实规模**（不是臆想的百万QPS）
+- 我们理解**Postgres的能力边界**（不是印象中的"慢"）
+- 我们理解**架构的核心目标**（稳定性 > 炫技）
+- 我们理解**团队的真实能力**（运维能力 = 稳定性）
+
+我们选择用**架构的简洁性**来换取**运维的稳定性**，用**务实的判断**来换取**业务的快速迭代**。
+
+**这不是妥协，这是智慧。**