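ASL Literature Processing: Technology Selection

Document version: V1.0
Created: 2025-11-15
Applicable module: AI Smart Literature (ASL)
Goal: define the technology stack and implementation path for title/abstract screening, full-text re-screening, and full-text data extraction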

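📋 Document Overview

The ASL module covers three literature-processing scenarios, each with its own technical characteristics and implementation approach:

Scenario	Input format	Core technology	Main challenge
Title/abstract screening	Excel file	Excel parsing + LLM screening	Batch-processing throughput
Full-text re-screening	PDF full text	PDF extraction + LLM screening	PDF parsing accuracy
Full-text data extraction	PDF full text	PDF extraction + LLM structured extraction	Accurate extraction of tables and formulas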

🎯 Technical Architecture Overview

┌─────────────────────────────────────────────────────────┐
│             ASL literature processing flow              │
└─────────────────────────────────────────────────────────┘
           │
           ├─ Scenario 1: title/abstract screening
           │   └─ User uploads Excel → parse → LLM batch screening → export results
           │
           ├─ Scenario 2: full-text re-screening
           │   └─ User uploads PDF → PDF extraction → LLM screening → review
           │
           └─ Scenario 3: full-text data extraction
               └─ PDF → extraction + structuring → LLM data extraction → manual review

┌─────────────────────────────────────────────────────────┐
│             Shared layered technology stack             │
├─────────────────────────────────────────────────────────┤
│  Frontend: React 19 + Ant Design 5 + xlsx/exceljs       │
├─────────────────────────────────────────────────────────┤
│  Backend: Node.js (Fastify) + TypeScript                │
├─────────────────────────────────────────────────────────┤
│  Documents: Python microservice (extraction_service)    │
│    ├─ PyMuPDF: fast PDF extraction                      │
│    ├─ Nougat: high-quality English paper extraction ⭐  │
│    └─ Language Detector: automatic language detection   │
├─────────────────────────────────────────────────────────┤
│  LLM: DeepSeek-V3 + Qwen3 / GPT-5 + Claude-4.5          │
├─────────────────────────────────────────────────────────┤
│  Database: PostgreSQL 15 (asl_schema)                   │
└─────────────────────────────────────────────────────────┘

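📌 Scenario 1: Title/Abstract Screening

1.1 Technical Characteristics

Input format: Excel file (.xlsx / .xls)
Data volume: 50-500 records per batch
Main fields: title, abstract, DOI, authors, publication year, journal
Processing focus: efficient batch processing; no PDF parsing required

1.2 Technology Selection

Frontend: Excel upload and parsing

Technology	Library	Purpose	Advantage
Excel upload	`antd Upload`	File upload component	Drag-and-drop upload, progress bar
Excel parsing	`xlsx` / `exceljs`	Parse Excel in the browser	Pure frontend processing, fast preview
Template validation	Custom logic	Check column names and data format	Catches format errors early

Recommended: the xlsx library (SheetJS)

✅ Supports both .xlsx and .xls
✅ Pure JavaScript; parses directly in the browser
✅ Small footprint (~600KB) with good performance
✅ Handles large files (1,000+ rows)

Code example: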

import * as XLSX from 'xlsx';

function parseExcel(file: File): Promise<Literature[]> {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    
    reader.onload = (e) => {
      try {
        const data = new Uint8Array(e.target?.result as ArrayBuffer);
        const workbook = XLSX.read(data, { type: 'array' });
        
        // Read the first worksheet
        const sheetName = workbook.SheetNames[0];
        const worksheet = workbook.Sheets[sheetName];
        
        // Convert to JSON (one object per row, keyed by column header)
        const jsonData = XLSX.utils.sheet_to_json(worksheet);
        
        // Map to the standard shape; English and Chinese headers are both accepted
        const literatures = jsonData.map((row: any) => ({
          title: row['Title'] || row['标题'],
          abstract: row['Abstract'] || row['摘要'],
          doi: row['DOI'],
          authors: row['Authors'] || row['作者'],
          year: row['Year'] || row['年份'],
          journal: row['Journal'] || row['期刊'],
        }));
        
        resolve(literatures);
      } catch (error) {
        reject(new Error('Failed to parse the Excel file'));
      }
    };
    
    reader.onerror = () => reject(new Error('Failed to read the file'));
    reader.readAsArrayBuffer(file);
  });
}
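
The template validation row in the table above is described only as custom logic. A minimal sketch of what it might look like, assuming the header names used by `parseExcel` and assuming that title and abstract are the mandatory fields (`REQUIRED_COLUMNS`, `ValidationResult`, and `validateColumns` are hypothetical names, not part of the project):

```typescript
// Hypothetical rule set: each required field lists the header names accepted for it.
const REQUIRED_COLUMNS: Record<string, string[]> = {
  title: ['Title', '标题'],
  abstract: ['Abstract', '摘要'],
};

interface ValidationResult {
  valid: boolean;
  missingColumns: string[]; // required fields with no matching header in the sheet
  emptyRows: number[];      // 1-based data rows where a required field is blank
}

function validateColumns(rows: Record<string, unknown>[]): ValidationResult {
  // Collect headers across all rows: a parsed row may omit keys for its empty cells.
  const headers = new Set<string>();
  rows.forEach(row => Object.keys(row).forEach(key => headers.add(key)));

  const missingColumns = Object.keys(REQUIRED_COLUMNS).filter(
    field => !REQUIRED_COLUMNS[field].some(name => headers.has(name)),
  );

  const emptyRows: number[] = [];
  rows.forEach((row, index) => {
    // A field is blank when every accepted header for it is missing or whitespace.
    const hasBlank = Object.keys(REQUIRED_COLUMNS).some(field =>
      REQUIRED_COLUMNS[field].every(name => String(row[name] ?? '').trim() === ''),
    );
    if (hasBlank) emptyRows.push(index + 1);
  });

  return {
    valid: rows.length > 0 && missingColumns.length === 0 && emptyRows.length === 0,
    missingColumns,
    emptyRows,
  };
}
```

In `parseExcel`, a check like this could run on `jsonData` before the mapping step, so the user sees which columns or rows need fixing instead of a generic parse error.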

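Backend: batch screening

Processing flow:

Excel data → split into batches (10-20 records each) → call the LLM in parallel → merge results

Key technical points:

Batching: keeps each request small; 10-20 records per batch works best
Parallelism: call the LLM in parallel with Promise.all
Progress updates: push real-time progress over WebSocket
Resumable jobs: an interrupted task can continue where it stopped

Code example: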

import { chunk } from 'lodash';  // assumption: chunk() is lodash's array-splitting helper

async function batchScreening(
  literatures: Literature[],
  protocol: Protocol,
  progressCallback: (progress: number) => void
) {
  const batchSize = 15;
  const batches = chunk(literatures, batchSize);
  const results = [];
  
  for (let i = 0; i < batches.length; i++) {
    const batch = batches[i];
    
    // Screen every record in the current batch in parallel
    const batchResults = await Promise.all(
      batch.map(lit => dualModelScreening(lit, protocol))
    );
    
    results.push(...batchResults);
    
    // Report progress as the percentage of completed batches
    const progress = Math.round(((i + 1) / batches.length) * 100);
    progressCallback(progress);
  }
  
  return results;
}

1.3 Data Flow

User                 Frontend              Backend               LLM
   │                    │                     │                   │
   ├─ Upload Excel      │                     │                   │
   │    └──────────────→│                     │                   │
   │                    ├─ Parse Excel        │                   │
   │                    ├─ Validate format    │                   │
   │                    ├─ Show preview       │                   │
   │                    │                     │                   │
   │                    ├─ Submit task        │                   │
   │                    │    └───────────────→│                   │
   │                    │                     ├─ Save task        │
   │                    │                     ├─ Split (15/batch) │
   │                    │                     │                   │
   │                    │                     ├─ Batch 1          │
   │                    │                     │    └──────────────→│
   │                    │                     │                   ├─ DeepSeek screening
   │                    │                     │                   ├─ Qwen3 screening
   │                    │                     │                   ├─ Compare results
   │                    │                     │    ←──────────────┘
   │                    │                     ├─ Save results     │
   │                    │                     │                   │
   │                    │                     ├─ Batch 2...       │
   │                    │                     │                   │
   │                    │    ←───────────────┤ Return all results │
   │    ←──────────────┤ Show results        │                   │
   └─ Manual review     │                     │                   │
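
The flow above sends each record to both DeepSeek and Qwen3 and then compares the two results, but it does not say how disagreements are resolved. A minimal sketch of the comparison step, under the assumption that agreement is accepted automatically and anything else is routed to manual review (the `Decision`, `ModelVerdict`, and `ScreeningResult` types and `compareVerdicts` are hypothetical):

```typescript
type Decision = 'include' | 'exclude' | 'uncertain';

interface ModelVerdict {
  model: string;      // e.g. 'deepseek-v3' or 'qwen3'
  decision: Decision;
  reason: string;     // the model's short justification
}

interface ScreeningResult {
  finalDecision: Decision | null; // null while the record waits for manual review
  needsReview: boolean;
  verdicts: ModelVerdict[];
}

function compareVerdicts(a: ModelVerdict, b: ModelVerdict): ScreeningResult {
  // Accept only a confident, matching decision; two 'uncertain' votes still need a human.
  const agree = a.decision === b.decision && a.decision !== 'uncertain';
  return {
    finalDecision: agree ? a.decision : null,
    needsReview: !agree,
    verdicts: [a, b],
  };
}
```

Keeping both verdicts in the result lets the review screen show each model's reasoning next to the record.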

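📌 Scenarios 2 & 3: Full-Text Re-screening and Data Extraction

2.1 Technical Characteristics

Input format: PDF files (English-language medical literature)
File characteristics:
- Scientific paper layout (title, abstract, introduction, methods, results, discussion, references)
- Contains complex tables, formulas, and figures
- Typically 10-30 pages
Processing focus: high-accuracy extraction that preserves structure and formatting

2.2 Technology Selection: PDF Extraction

Core approach: Nougat + PyMuPDF sequential fallback strategy ⭐

Existing architecture (already implemented, in extraction_service/):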

# Sequential fallback strategy
def extract_pdf(file_path: str):
    # Step 1: detect the language
    language = detect_language(file_path)
    
    # Step 2: Chinese PDF → PyMuPDF (fast)
    if language == 'chinese':
        return extract_pdf_pymupdf(file_path)
    
    # Step 3: English PDF → try Nougat
    if check_nougat_available():
        result = extract_pdf_nougat(file_path)
        
        # Quality check (threshold 0.7); the score is returned under "metadata"
        if result['metadata']['quality_score'] >= 0.7:
            return result  # ✅ Nougat succeeded
    
    # Step 4: fall back to PyMuPDF
    return extract_pdf_pymupdf(file_path)

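Technology Comparison

Option	Advantages	Disadvantages	Best suited for
Nougat ⭐	• Designed for scientific literature • High accuracy on formulas and tables • Markdown output • Preserves document structure	• Slow (1-2 minutes per 20 pages) • Needs GPU acceleration • High memory use (~4GB)	Full-text extraction from English medical literature
PyMuPDF	• Fast (seconds) • Low memory use • Simple to deploy	• Formulas and tables are easily lost • Plain-text output • Layout is easily scrambled	Chinese literature, quick preview
Adobe API	• Commercial-grade accuracy • Cloud processing	• Paid • Depends on network access • Privacy risk	Not recommended (high cost)
Tesseract OCR	• Open source and free • Multilingual	• Requires image preprocessing • Inconsistent accuracy	Scanned PDFs (backup option)

Recommended: Nougat (primary) + PyMuPDF (fallback) ⭐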

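Nougat's Core Advantages (for medical literature)

✅ Designed for scientific literature
   ├─ Training data: arXiv papers + scientific journals
   ├─ Formula recognition: LaTeX output
   ├─ Table preservation: Markdown table format
   └─ Structured output: clear sections and paragraphs

✅ Output format: Markdown
   ├─ Heading levels: # ## ###
   ├─ Tables: | Header | Data |
   ├─ Formulas: $$ formula $$
   └─ Citations: [1] [2] [3]

✅ Quality assessment
   ├─ Automatic quality score (0-1)
   ├─ Low-quality output falls back to PyMuPDF automatically
   └─ Keeps the extraction success rate high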
Implementation Details

Service architecture:

Node.js Backend (Port 3001)
    │
    ├─ Calls ExtractionClient.ts
    │   └─ HTTP request → Python microservice
    │
Python Extraction Service (Port 8000)
    │
    ├─ /api/extract/pdf
    │   ├─ detect_language()
    │   ├─ extract_pdf_nougat() → Nougat Model
    │   └─ extract_pdf_pymupdf() → PyMuPDF
    │
    └─ /api/health
        └─ Checks Nougat availability

Node.js calling code:

import { extractionClient } from '@common/document/ExtractionClient';

async function extractLiteraturePDF(file: Buffer, filename: string) {
  try {
    // Option 1: automatic selection (recommended)
    const result = await extractionClient.extractPdf(
      file, 
      filename, 
      'auto'
    );
    
    // Option 2: force Nougat
    // const result = await extractionClient.extractPdf(file, filename, 'nougat');
    
    return {
      text: result.text,
      method: result.method,  // "nougat" | "pymupdf"
      quality: result.metadata.quality_score,
      pageCount: result.metadata.page_count,
      hasTables: result.metadata.has_tables,
      hasFormulas: result.metadata.has_formulas
    };
  } catch (error) {
    console.error('PDF extraction failed:', error);
    throw error;
  }
}

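Python extraction code:

# extraction_service/services/nougat_extractor.py

import subprocess
from typing import Any, Dict

def extract_pdf_nougat(file_path: str) -> Dict[str, Any]:
    """
    Extract PDF text with Nougat.
    
    Equivalent command line:
    nougat <pdf_path> -o <output_dir> --markdown --no-skipping
    """
    # output_dir and the helper functions used below are assumed to be
    # defined elsewhere in the service; this excerpt does not show them
    cmd = [
        'nougat',
        file_path,
        '-o', output_dir,
        '--markdown',      # emit Markdown
        '--no-skipping'    # do not skip any page
    ]
    
    # Run Nougat (5-minute timeout); the Popen arguments are elided in the source
    process = subprocess.Popen(cmd, ...)
    stdout, stderr = process.communicate(timeout=300)
    
    # Read the output file (.mmd)
    markdown_text = read_output_file()
    
    # Quality assessment
    quality_score = evaluate_nougat_quality(markdown_text)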
    
    return {
        "success": True,
        "method": "nougat",
        "text": markdown_text,
        "format": "markdown",
        "metadata": {
            "quality_score": quality_score,
            "has_tables": detect_tables(markdown_text),
            "has_formulas": detect_formulas(markdown_text)
        }
    }

2.3 Text Post-processing

Optimizing Nougat output:

function postProcessNougatOutput(markdown: string): ProcessedText {
  return {
    // Raw Markdown
    raw: markdown,
    
    // Section splitting
    sections: extractSections(markdown),  // {abstract, methods, results, ...}
    
    // Table extraction
    tables: extractTables(markdown),
    
    // Formula extraction
    formulas: extractFormulas(markdown),
    
    // Plain text (formatting removed)
    plainText: markdownToPlainText(markdown),
    
    // Structured data (for the LLM)
    structured: {
      title: extractTitle(markdown),
      abstract: extractAbstract(markdown),
      methodology: extractMethodology(markdown),
      results: extractResults(markdown),
    }
  };
}
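
The helpers called above (`extractSections`, `extractAbstract`, and so on) are not shown in this document. As one possible starting point, the following sketch splits Nougat-style Markdown on its `#` headings and looks sections up by heading text; `Section`, `splitByHeadings`, and `findSection` are hypothetical names, and real papers will need more tolerant heading matching than this:

```typescript
interface Section {
  heading: string; // heading text without the leading '#' characters
  level: number;   // 1 for '#', 2 for '##', ...
  body: string;    // text between this heading and the next one
}

function splitByHeadings(markdown: string): Section[] {
  const sections: Section[] = [];
  const bodyLines: string[] = [];
  let current: Section | null = null;

  // Close the section being built; text before the first heading is discarded.
  const flush = () => {
    if (current) {
      current.body = bodyLines.join('\n').trim();
      sections.push(current);
    }
    bodyLines.length = 0;
  };

  for (const line of markdown.split(/\r?\n/)) {
    const match = /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      flush();
      current = { heading: match[2].trim(), level: match[1].length, body: '' };
    } else {
      bodyLines.push(line);
    }
  }
  flush();
  return sections;
}

// Find the first section whose heading matches, e.g. /^methods?$/i.
function findSection(sections: Section[], pattern: RegExp): Section | undefined {
  return sections.find(section => pattern.test(section.heading));
}
```

An `extractMethodology` helper could then be as small as `findSection(splitByHeadings(markdown), /method/i)?.body`.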

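📌 Scenario 4: Literature Download (Unpaywall API) ⭐

3.1 Background

Unpaywall is a free API for Open Access literature:

✅ Looks up by DOI whether a free full text exists
✅ Returns legal PDF download links
✅ Completely free to use
✅ Covers more than 30 million articles

Website: https://unpaywall.org/products/api

3.2 Technology Selection

Calling the API

Basics:

API endpoint: https://api.unpaywall.org/v2/{doi}?email={your_email}
HTTP method: GET
Authentication: no API key; only an email address is required
Rate limit: 100,000 requests per day (free)

Example request:

curl "https://api.unpaywall.org/v2/10.1038/nature12373?email=YOUR_EMAIL"

Example response (abridged):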

{
  "doi": "10.1038/nature12373",
  "title": "Nanometre-scale thermometry in a living cell",
  "is_oa": true,
  "oa_status": "gold",
  "best_oa_location": {
    "url": "https://www.nature.com/articles/nature12373.pdf",
    "url_for_pdf": "https://www.nature.com/articles/nature12373.pdf",
    "url_for_landing_page": "https://www.nature.com/articles/nature12373",
    "license": "cc-by",
    "version": "publishedVersion"
  },
  "oa_locations": [...]
}

Node.js Implementation

Service wrapper:

// backend/src/common/literature/UnpaywallClient.ts

import axios from 'axios';
import { config } from '../../config/env';

export interface UnpaywallResult {
  doi: string;
  title: string;
  isOA: boolean;              // whether the work is Open Access
  oaStatus: string;           // "gold" | "green" | "hybrid" | "bronze" | "closed"
  pdfUrl: string | null;      // PDF download link
  landingPageUrl: string;     // article landing page
  license: string | null;     // license
  version: string | null;     // "publishedVersion" | "acceptedVersion" | "submittedVersion"
}

class UnpaywallClient {
  private baseUrl = 'https://api.unpaywall.org/v2';
  private email: string;

  constructor(email: string = config.unpaywallEmail) {
    this.email = email;
  }

  /**
   * Look up a work by DOI
   */
  async getByDoi(doi: string): Promise<UnpaywallResult> {
    try {
      // Pass the email as a query parameter so that axios URL-encodes it
      const response = await axios.get(`${this.baseUrl}/${doi}`, {
        params: { email: this.email },
        timeout: 10000,  // 10-second timeout
      });

      const data = response.data;

      // Best available Open Access location (null when none exists)
      const bestOA = data.best_oa_location;

      return {
        doi: data.doi,
        title: data.title,
        isOA: data.is_oa,
        oaStatus: data.oa_status,
        pdfUrl: bestOA?.url_for_pdf || null,
        landingPageUrl: bestOA?.url_for_landing_page || data.doi_url,
        license: bestOA?.license || null,
        version: bestOA?.version || null,
      };
    } catch (error) {
      if (axios.isAxiosError(error) && error.response?.status === 404) {
        throw new Error(`DOI not found: ${doi}`);
      }
      // A caught value is `unknown` in strict TypeScript, so narrow it first
      const message = error instanceof Error ? error.message : String(error);
      throw new Error(`Unpaywall API error: ${message}`);
    }
  }

  /**
   * Batch lookup (rate-limited); failed DOIs are logged and skipped
   */
  async getBatch(dois: string[]): Promise<UnpaywallResult[]> {
    const results: UnpaywallResult[] = [];
    
    for (const doi of dois) {
      try {
        results.push(await this.getByDoi(doi));
      } catch (error) {
        const message = error instanceof Error ? error.message : String(error);
        console.error(`Failed to fetch ${doi}:`, message);
      }
      
      // Rate limiting: wait 100 ms after every request, whether it succeeded or not
      await new Promise(resolve => setTimeout(resolve, 100));
    }
    
    return results;
  }

  /**
   * Download a PDF file
   */
  async downloadPdf(pdfUrl: string, outputPath: string): Promise<void> {
    try {
      const response = await axios.get(pdfUrl, {
        responseType: 'arraybuffer',
        timeout: 60000,  // 1-minute timeout
      });

      // Write asynchronously so the event loop is not blocked
      const { writeFile } = await import('fs/promises');
      await writeFile(outputPath, response.data);
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      throw new Error(`PDF download failed: ${message}`);
    }
  }
}

export const unpaywallClient = new UnpaywallClient();

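Environment variable configuration:

# .env
UNPAYWALL_EMAIL=your-email@example.com

Business Integration

Use case 1: batch-check whether full texts are downloadable

async function checkLiteratureAvailability(literatures: Literature[]) {
  const dois = literatures
    .map(lit => lit.doi)
    .filter((doi): doi is string => Boolean(doi));  // drop empty DOIs

  const results = await unpaywallClient.getBatch(dois);

  // Index results by lowercased DOI (DOIs are case-insensitive) for constant-time lookup
  const byDoi = new Map(results.map(r => [r.doi.toLowerCase(), r] as const));

  return literatures.map(lit => {
    const hit = lit.doi ? byDoi.get(lit.doi.toLowerCase()) : undefined;
    return {
      ...lit,
      downloadable: hit?.isOA ?? false,
      pdfUrl: hit?.pdfUrl ?? null,
    };
  });
}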
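
DOI cells in uploaded spreadsheets often arrive as resolver URLs or with a `doi:` label, which would not match the bare DOI that the API endpoint expects. A small normalization sketch that could run before the batch lookup (`normalizeDoi` is a hypothetical helper, and the pattern is a common heuristic for modern DOIs rather than a full validator):

```typescript
function normalizeDoi(raw: string | null | undefined): string | null {
  if (!raw) return null;

  const cleaned = raw
    .trim()
    .replace(/^https?:\/\/(dx\.)?doi\.org\//i, '') // strip a resolver URL prefix
    .replace(/^doi:\s*/i, '')                       // strip a "doi:" label
    .toLowerCase();                                 // DOIs are case-insensitive

  // Heuristic shape check: "10.", a 4-9 digit registrant code, "/", then a suffix.
  return /^10\.\d{4,9}\/\S+$/.test(cleaned) ? cleaned : null;
}
```

Values that come back as `null` can be reported to the user as unusable instead of being sent to Unpaywall.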

Use case 2: the user clicks "download full text"

import fs from 'fs';

async function downloadLiteratureFullText(doi: string) {
  // Step 1: query Unpaywall
  const unpaywallResult = await unpaywallClient.getByDoi(doi);

  if (!unpaywallResult.pdfUrl) {
    throw new Error('No free full text is available for this record');
  }

  // Step 2: download the PDF (replace every character that is unsafe in a file name)
  const filename = `${doi.replace(/[^\w.-]+/g, '_')}.pdf`;
  const outputPath = `./downloads/${filename}`;
  
  await unpaywallClient.downloadPdf(unpaywallResult.pdfUrl, outputPath);

  // Step 3: extract the text (calls extraction_service)
  const extractionResult = await extractionClient.extractPdf(
    fs.readFileSync(outputPath),
    filename,
    'auto'
  );

  return {
    pdfPath: outputPath,
    text: extractionResult.text,
    method: extractionResult.method,
  };
}

3.3 Frontend Integration

Batch download button:

// Batch-check downloadability
async function checkDownloadable(selectedRows: Literature[]) {
  setLoading(true);
  
  try {
    const results = await api.checkLiteratureAvailability(selectedRows);
    const downloadableCount = results.filter(r => r.downloadable).length;
    
    message.success(`Found ${downloadableCount} records with a downloadable full text`);
    setLiteratures(results);
  } finally {
    setLoading(false);  // clear the loading state even when the request fails
  }
}

// Download the full text
async function downloadFullText(literature: Literature) {
  if (!literature.downloadable) {
    message.warning('No free full text is available for this record');
    return;
  }

  try {
    const result = await api.downloadLiteratureFullText(literature.doi);
    message.success('Download complete');
    
    // Open the PDF viewer
    openPdfViewer(result.pdfPath);
  } catch (error) {
    const reason = error instanceof Error ? error.message : String(error);
    message.error(`Download failed: ${reason}`);
  }
}

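🔍 Additional Technical Points

4.1 Summary of the Technologies You Mentioned

Technology	Status	Notes
✅ Nougat model	Implemented	`extraction_service/services/nougat_extractor.py`
✅ PyMuPDF	Implemented	`extraction_service/services/pdf_extractor.py`
✅ Sequential fallback strategy	Implemented	English → Nougat, Chinese → PyMuPDF
🆕 Unpaywall API	To be added	This document provides an implementation plan
🆕 Excel parsing	To be added	Uses the `xlsx` library (frontend)

4.2 Potentially Missing Technical Points ⭐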

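(1) Enhanced table extraction

Problem: Nougat preserves table structure, but an LLM reading Markdown tables directly may be inaccurate.

Solution: Table Transformer

# Uses Microsoft's Table Transformer model
# https://github.com/microsoft/table-transformer

import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

MODEL_NAME = "microsoft/table-transformer-detection"

def detect_tables(page_image: Image.Image, threshold: float = 0.9):
    """
    Locate tables on one rendered PDF page with Table Transformer.
    The model works on images, so each page has to be rasterized first
    (for example with PyMuPDF). Returns a list of (score, [x0, y0, x1, y1]).
    """
    processor = AutoImageProcessor.from_pretrained(MODEL_NAME)
    model = TableTransformerForObjectDetection.from_pretrained(MODEL_NAME)
    
    # Detect table positions
    inputs = processor(images=page_image.convert("RGB"), return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Convert raw predictions to pixel coordinates on the page image
    target_sizes = torch.tensor([page_image.size[::-1]])  # (height, width)
    detections = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes
    )[0]
    
    # Each box can then be cropped and passed to structure recognition or OCR
    return [
        (score.item(), box.tolist())
        for score, box in zip(detections["scores"], detections["boxes"])
    ]

Priority: V2.0 (Nougat is sufficient for the MVP)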

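(2) Citation parsing and linking

Problem: scientific papers contain many citations such as [1] [2] [3], which need to be parsed and linked to the reference list.

Solution: GROBID

# GROBID: open-source parser for scientific documents
# https://github.com/kermitt2/grobid

import requests
import xml.etree.ElementTree as ET

TEI_NS = {'tei': 'http://www.tei-c.org/ns/1.0'}

def parse_references(pdf_path: str):
    """
    Parse the reference list with GROBID.
    GROBID answers with TEI XML rather than JSON.
    """
    with open(pdf_path, 'rb') as f:
        files = {'input': f}
        response = requests.post(
            'http://localhost:8070/api/processFulltextDocument',
            files=files,
            timeout=120
        )
    response.raise_for_status()
    
    # Build a structured list from the bibliography entries
    root = ET.fromstring(response.content)
    references = []
    for bibl in root.iterfind('.//tei:listBibl/tei:biblStruct', TEI_NS):
        title = bibl.find('.//tei:title', TEI_NS)
        doi = bibl.find('.//tei:idno[@type="DOI"]', TEI_NS)
        references.append({
            'title': title.text if title is not None else None,
            'doi': doi.text if doi is not None else None,
        })
    return references

Priority: V2.0 (not a core feature)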

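(3) Formula recognition and rendering

Problem: Nougat outputs formulas as LaTeX, which the frontend has to render.

Solution: KaTeX / MathJax

// Render a LaTeX formula in the frontend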
import katex from 'katex';
import 'katex/dist/katex.min.css';

function renderFormula(latex: string) {
  return katex.renderToString(latex, {
    throwOnError: false,
    displayMode: true,
  });
}

Priority: MVP (improves the user experience)
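
`renderFormula` expects a bare LaTeX string, while extracted text mixes prose and formulas. A minimal sketch of the splitting step, assuming formulas are delimited by `$$ ... $$` as listed in the Nougat output format above (`Segment` and `splitFormulas` are hypothetical names; other delimiters would need their own pattern):

```typescript
type Segment =
  | { kind: 'text'; value: string }
  | { kind: 'formula'; value: string };

function splitFormulas(markdown: string): Segment[] {
  const segments: Segment[] = [];
  const pattern = /\$\$([\s\S]+?)\$\$/g; // non-greedy, so each $$...$$ pair is one match
  let lastIndex = 0;
  let match: RegExpExecArray | null;

  while ((match = pattern.exec(markdown)) !== null) {
    // Keep the prose that precedes this formula.
    if (match.index > lastIndex) {
      segments.push({ kind: 'text', value: markdown.slice(lastIndex, match.index) });
    }
    segments.push({ kind: 'formula', value: match[1].trim() });
    lastIndex = match.index + match[0].length;
  }

  // Keep any prose after the last formula.
  if (lastIndex < markdown.length) {
    segments.push({ kind: 'text', value: markdown.slice(lastIndex) });
  }
  return segments;
}
```

A renderer can then map `formula` segments through `renderFormula` and output `text` segments as they are.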

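(4) PDF preview and annotation

Problem: during manual review, the reviewer needs to see the original document with highlights.

Solution: PDF.js + Annotator.js

// React component (sketch with @react-pdf-viewer, which is built on PDF.js)
import { Viewer, Worker } from '@react-pdf-viewer/core';
import { highlightPlugin, Trigger } from '@react-pdf-viewer/highlight';
import type { HighlightArea, RenderHighlightsProps } from '@react-pdf-viewer/highlight';
import '@react-pdf-viewer/core/lib/styles/index.css';
import '@react-pdf-viewer/highlight/lib/styles/index.css';

function PdfViewer({ pdfUrl, annotations }: { pdfUrl: string; annotations: HighlightArea[] }) {
  // Draw one translucent overlay per annotation on the page being rendered
  const renderHighlights = (props: RenderHighlightsProps) => (
    <div>
      {annotations
        .filter(area => area.pageIndex === props.pageIndex)
        .map((area, i) => (
          <div
            key={i}
            style={{ background: 'yellow', opacity: 0.4, ...props.getCssProperties(area, props.rotation) }}
          />
        ))}
    </div>
  );
  const highlights = highlightPlugin({ renderHighlights, trigger: Trigger.None });

  // Assumption: workerUrl points at a pdf.js worker that matches the installed pdfjs-dist
  return (
    <Worker workerUrl="/pdf.worker.min.js">
      <Viewer fileUrl={pdfUrl} plugins={[highlights]} />
    </Worker>
  );
}

Priority: MVP (core feature)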

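(5) Literature deduplication

Problem: an uploaded Excel file may contain duplicates (different versions of the same paper).

Solution: deduplicate on DOI and title

function deduplicateLiteratures(literatures: Literature[]) {
  const seen = new Set<string>();
  
  return literatures.filter(lit => {
    // Prefer the DOI (case-insensitive); otherwise use the normalized title
    const key = lit.doi
      ? `doi:${lit.doi.trim().toLowerCase()}`
      : `title:${normalizeTitle(lit.title ?? '')}`;
    
    // Records with neither a DOI nor a usable title cannot be compared, so keep them
    if (key === 'title:') return true;
    
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

function normalizeTitle(title: string): string {
  return title
    .toLowerCase()
    // Strip punctuation but keep letters and digits of any script;
    // \w alone is ASCII-only and would erase Chinese titles entirely
    .replace(/[^\p{L}\p{N}\s]/gu, '')
    .replace(/\s+/g, ' ')  // collapse whitespace
    .trim();
}

Priority: MVP (required)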

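(6) Metadata enrichment

Problem: uploaded Excel data may be incomplete (missing DOI, year, and so on).

Solution: Crossref API

// Look up the DOI by title
async function enrichMetadata(literature: Literature) {
  if (literature.doi) return literature;  // DOI already present

  // Call the Crossref API; passing the title via params lets axios URL-encode it
  const response = await axios.get('https://api.crossref.org/works', {
    params: { 'query.title': literature.title, rows: 1 },
  });

  // The top hit is only a candidate and may be a different paper
  const match = response.data.message.items[0];
  if (!match) return literature;
  
  return {
    ...literature,
    doi: match.DOI,
    year: (match['published-print'] ?? match.issued)?.['date-parts']?.[0]?.[0],
    journal: match['container-title']?.[0],
  };
}

Priority: V1.0 (enhancement)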
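
Because `enrichMetadata` takes the first search hit, a loosely matching title could attach the wrong DOI to a record. One way to guard against that is to compare the returned title with the queried one before accepting the match. The sketch below uses Jaccard similarity on word sets; the helper names and the 0.8 threshold are assumptions to be tuned on real data, and the tokenizer only handles ASCII punctuation:

```typescript
function titleTokens(title: string): Set<string> {
  const words = title
    .toLowerCase()
    .replace(/[.,;:!?'"()\[\]{}\/\\-]/g, ' ') // ASCII punctuation becomes a separator
    .split(/\s+/)
    .filter(word => word.length > 0);
  return new Set(words);
}

// Jaccard similarity: shared words divided by all distinct words, in the range 0..1.
function titleSimilarity(a: string, b: string): number {
  const tokensA = titleTokens(a);
  const tokensB = titleTokens(b);
  if (tokensA.size === 0 || tokensB.size === 0) return 0;

  let shared = 0;
  tokensA.forEach(token => {
    if (tokensB.has(token)) shared++;
  });
  return shared / (tokensA.size + tokensB.size - shared);
}

function isLikelySameWork(queryTitle: string, candidateTitle: string, threshold = 0.8): boolean {
  return titleSimilarity(queryTitle, candidateTitle) >= threshold;
}
```

In `enrichMetadata`, the Crossref `title` field (an array of strings) could be checked with `isLikelySameWork(literature.title, match.title?.[0] ?? '')` before the DOI is copied over.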

(7) Persisting batch progress

Problem: batch screening is slow (more than 10 minutes for 1,000 records), so interrupted jobs must be resumable.

Solution: Redis + a task queue

// Uses a Bull queue
import Queue from 'bull';

const screeningQueue = new Queue('literature-screening', {
  redis: { host: 'localhost', port: 6379 }
});

// Enqueue a job
screeningQueue.add({
  projectId: 'xxx',
  literatures: [...],
  protocol: {...}
});

// Process jobs
screeningQueue.process(async (job) => {
  const { projectId, literatures, protocol } = job.data;
  
  for (let i = 0; i < literatures.length; i++) {
    // Screen one record
    await screenLiterature(literatures[i], protocol);
    
    // Update progress (percentage)
    job.progress((i + 1) / literatures.length * 100);
  }
});

Priority: V1.0 (experience improvement)
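
The processor above reports progress, but it starts again from the first record whenever a job is retried. A minimal sketch of the resume step, assuming every record has a stable `id` and each result is persisted as soon as it is produced (the `QueueItem` shape and both helper names are hypothetical):

```typescript
interface QueueItem {
  id: string;
}

// Records that still need screening, given the ids whose results are already stored.
function remainingItems<T extends QueueItem>(items: T[], finishedIds: string[]): T[] {
  const finished = new Set(finishedIds);
  return items.filter(item => !finished.has(item.id));
}

// Progress for the whole job, counting work finished in earlier runs.
function overallProgress(total: number, remaining: number): number {
  if (total === 0) return 100;
  return Math.round(((total - remaining) / total) * 100);
}
```

Inside the queue processor, the job would first load the finished ids for its project from the database, loop over `remainingItems(...)` only, and pass `overallProgress(...)` to `job.progress` after each record.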

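(8) Error handling and retries

Problem: LLM calls can fail (network errors, timeouts, rate limiting).

Solution: retry with exponential backoff

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries: number = 3
): Promise<T> {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === maxRetries - 1) throw error;
      
      // Exponential backoff between attempts: 1s, then 2s, then 4s, ...
      const delay = Math.pow(2, i) * 1000;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  // Reached only when maxRetries < 1; also gives every code path a result
  throw new Error('retryWithBackoff: maxRetries must be at least 1');
}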
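
To make the schedule concrete: `retryWithBackoff` waits only between attempts, so the default of three attempts produces two waits. The sketch below computes that schedule and adds two common refinements that the function above does not have, a cap on the delay and random jitter so that parallel workers do not retry in lockstep (`backoffDelays`, `withJitter`, and the 30-second cap are assumptions for illustration):

```typescript
// Waits between attempts: `attempts` tries need `attempts - 1` waits.
function backoffDelays(attempts: number, baseMs = 1000, capMs = 30000): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts - 1; i++) {
    delays.push(Math.min(capMs, baseMs * Math.pow(2, i)));
  }
  return delays;
}

// "Full jitter": pick a random wait between 0 and the computed delay.
function withJitter(delayMs: number, random: () => number = Math.random): number {
  return Math.floor(random() * delayMs);
}

console.log(backoffDelays(3)); // [ 1000, 2000 ]
```

The `random` parameter exists so the jitter can be made deterministic in tests.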