feat(admin): Add user management and upgrade to module permission system

Features - User Management (Phase 4.1):
- Database: Add user_modules table for fine-grained module permissions
- Database: Add 4 user permissions (view/create/edit/delete) to role_permissions
- Backend: UserService (780 lines) - CRUD with tenant isolation
- Backend: UserController + UserRoutes (648 lines) - 13 API endpoints
- Backend: Batch import users from Excel
- Frontend: UserListPage (412 lines) - list/filter/search/pagination
- Frontend: UserFormPage (341 lines) - create/edit with module config
- Frontend: UserDetailPage (393 lines) - details/tenant/module management
- Frontend: 3 modal components (592 lines) - import/assign/configure
- API: GET/POST/PUT/DELETE /api/admin/users/* endpoints

Architecture Upgrade - Module Permission System:
- Backend: Add getUserModules() method in auth.service
- Backend: Login API returns modules array in user object
- Frontend: AuthContext adds hasModule() method
- Frontend: Navigation filters modules based on user.modules
- Frontend: RouteGuard checks requiredModule instead of requiredVersion
- Frontend: Remove deprecated version-based permission system
- UX: Only show accessible modules in navigation (clean UI)
- UX: Smart redirect after login (avoid 403 for regular users)

Fixes:
- Fix UTF-8 encoding corruption in ~100 docs files
- Fix pageSize type conversion in userService (String to Number)
- Fix authUser undefined error in TopNavigation
- Fix login redirect logic with role-based access check
- Update Git commit guidelines v1.2 with UTF-8 safety rules

Database Changes:
- CREATE TABLE user_modules (user_id, tenant_id, module_code, is_enabled)
- ADD UNIQUE CONSTRAINT (user_id, tenant_id, module_code)
- INSERT 4 permissions + role assignments
- UPDATE PUBLIC tenant with 8 module subscriptions

Technical:
- Backend: 5 new files (~2400 lines)
- Frontend: 10 new files (~2500 lines)
- Docs: 1 development record + 2 status updates + 1 guideline update
- Total: ~4900 lines of code

Status: User management 100% complete, module permission system operational
2026-01-16 13:42:10 +08:00
parent 98d862dbd4
commit 66255368b7
560 changed files with 70424 additions and 52353 deletions


@@ -1,86 +1,86 @@
# ASL Literature Processing Technology Selection
> **Document version:** V1.0
> **Created:** 2025-11-15
> **Applicable module:** AI Intelligent Literature (ASL)
> **Goal:** Define the technology stack and implementation path for initial screening, full-text rescreening, and full-text extraction
---
## 📋 Document Overview
The ASL module covers three distinct literature-processing scenarios, each with its own technical characteristics and implementation approach:
| Scenario | Input format | Core technology | Main challenge |
|------|---------|---------|---------|
| **Title/abstract initial screening** | Excel file | Excel parsing + LLM screening | Batch processing efficiency |
| **Full-text rescreening** | PDF full text | PDF extraction + LLM screening | PDF parsing accuracy |
| **Full-text data extraction** | PDF full text | PDF extraction + LLM structured extraction | Accurate extraction of tables and formulas |
---
## 🎯 Technical Architecture Overview
```
┌──────────────────────────────────────────────────────────────────────┐
│                    ASL Literature Processing Flow                    │
└──────────────────────────────────────────────────────────────────────┘
  │
  ├─ Scenario 1: Title/abstract initial screening
  │   └─ User uploads Excel → Parse → LLM batch screening → Export results
  │
  ├─ Scenario 2: Full-text rescreening
  │   └─ User uploads PDF → PDF extraction → LLM screening → Review
  │
  └─ Scenario 3: Full-text data extraction
      └─ PDF → Extraction + structuring → LLM data extraction → Manual review

┌──────────────────────────────────────────────────────────────────────┐
│                  Layered Technology Stack (shared)                   │
├──────────────────────────────────────────────────────────────────────┤
│ Frontend layer: React 19 + Ant Design 5 + xlsx/exceljs               │
├──────────────────────────────────────────────────────────────────────┤
│ Backend layer: Node.js (Fastify) + TypeScript                        │
├──────────────────────────────────────────────────────────────────────┤
│ Document processing layer: Python microservice (extraction_service)  │
│   ├─ PyMuPDF: fast PDF extraction                                    │
│   ├─ Nougat: high-quality extraction of English scientific papers ⭐ │
│   └─ Language Detector: automatic language detection                 │
├──────────────────────────────────────────────────────────────────────┤
│ LLM layer: DeepSeek-V3 + Qwen3 / GPT-5 + Claude-4.5                  │
├──────────────────────────────────────────────────────────────────────┤
│ Database: PostgreSQL 15 (asl_schema)                                 │
└──────────────────────────────────────────────────────────────────────┘
```
---
## 📌 Scenario 1: Title/Abstract Initial Screening
### 1.1 Technical Characteristics
- **Input format**: Excel file (`.xlsx` / `.xls`)
- **Data volume**: 50-500 papers per batch
- **Main fields**: title, abstract, DOI, authors, publication year, journal
- **Processing focus**: efficient batch processing; no PDF parsing required
### 1.2 Technology Selection
#### Frontend: Excel Upload and Parsing
| Technology | Library | Purpose | Advantages |
|------|-----|------|------|
| **Excel upload** | `antd Upload` | File upload component | Drag-and-drop upload, progress bar |
| **Excel parsing** | `xlsx` / `exceljs` | Parse Excel in the frontend | Pure frontend processing, fast preview |
| **Template validation** | Custom logic | Validate column names and data formats | Catches format errors early |
**Recommended approach: the `xlsx` library (SheetJS)**
- ✅ Supports `.xlsx` and `.xls` formats
- ✅ Pure JavaScript, parsed directly in the frontend
- ✅ Small footprint (~600KB), good performance
- ✅ Handles large files (1000+ rows)
**Code example:**
```typescript
import * as XLSX from 'xlsx';
@@ -97,15 +97,15 @@ function parseExcel(file: File): Promise<Literature[]> {
const sheetName = workbook.SheetNames[0];
const worksheet = workbook.Sheets[sheetName];
// Convert to JSON
const jsonData = XLSX.utils.sheet_to_json(worksheet);
// Map to the standard format
const literatures = jsonData.map((row: any) => ({
title: row['Title'] || row['标题'],
abstract: row['Abstract'] || row['摘要'],
doi: row['DOI'],
authors: row['Authors'] || row['作者'],
year: row['Year'] || row['年份'],
journal: row['Journal'] || row['期刊'],
}));
@@ -122,20 +122,20 @@ function parseExcel(file: File): Promise<Literature[]> {
}
```
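
The template validation step listed in the table above (checking column names and data formats) is not shown in this diff. A minimal sketch of such a check might look like the following. It is written in Python for brevity and is not the frontend implementation. The alias table reuses the column headers from `parseExcel`; `REQUIRED_FIELDS`, `normalize_row`, and `validate_rows` are hypothetical names, and treating title and abstract as mandatory is an assumption.

```python
from typing import Dict, List

# Accepted headers per field, taken from the parseExcel mapping above.
COLUMN_ALIASES: Dict[str, tuple] = {
    "title": ("Title", "标题"),
    "abstract": ("Abstract", "摘要"),
    "doi": ("DOI",),
    "authors": ("Authors", "作者"),
    "year": ("Year", "年份"),
    "journal": ("Journal", "期刊"),
}

# Assumption: screening needs at least a title and an abstract.
REQUIRED_FIELDS = ("title", "abstract")


def normalize_row(row: Dict[str, object]) -> Dict[str, str]:
    """Map one spreadsheet row onto the standard field names."""
    normalized = {}
    for field, aliases in COLUMN_ALIASES.items():
        # Take the first alias that carries a non-empty value.
        value = next((row[a] for a in aliases if row.get(a) not in (None, "")), "")
        normalized[field] = str(value).strip()
    return normalized


def validate_rows(rows: List[Dict[str, object]]) -> List[str]:
    """Return readable errors; an empty list means the sheet is usable."""
    errors = []
    for number, row in enumerate(rows, start=1):
        normalized = normalize_row(row)
        for field in REQUIRED_FIELDS:
            if not normalized[field]:
                errors.append(f"row {number}: missing {field}")
    return errors
```

Reporting errors by row number lets the preview highlight problem rows before any LLM call is made.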
#### Backend: Batch Screening
**Processing flow:**
```
Excel data → Batch grouping (10-20 papers/group) → Parallel LLM calls → Aggregate results
```
**Key technical points:**
1. **Batch grouping**: avoid oversized single requests (10-20 papers per group works best)
2. **Parallel processing**: use `Promise.all` to call the LLM in parallel
3. **Progress push**: push processing progress in real time over WebSocket
4. **Resumable tasks**: support resuming a task after an interruption
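
The grouping and parallel-call points above can be sketched as follows. This is a Python illustration using `asyncio.gather` in the role that `Promise.all` plays in the Node.js backend, not the project's code. `screen_one` is a hypothetical coroutine standing in for one LLM screening call, and the default batch size of 15 is an assumption within the stated 10-20 range.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional

Literature = Dict[str, str]
Decision = Dict[str, str]


def split_into_batches(items: List[Literature], batch_size: int) -> List[List[Literature]]:
    """Split the list into consecutive groups of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


async def screen_in_batches(
    literatures: List[Literature],
    screen_one: Callable[[Literature], Awaitable[Decision]],
    batch_size: int = 15,
    on_progress: Optional[Callable[[int], None]] = None,
) -> List[Decision]:
    """Process batches one after another; items inside a batch run concurrently."""
    batches = split_into_batches(literatures, batch_size)
    results: List[Decision] = []
    for index, batch in enumerate(batches):
        # gather preserves input order, so results line up with the papers.
        batch_results = await asyncio.gather(*(screen_one(lit) for lit in batch))
        results.extend(batch_results)
        if on_progress is not None:
            # Percentage of finished batches, rounded to a whole number.
            on_progress(round((index + 1) / len(batches) * 100))
    return results
```

Running batches sequentially keeps the number of in-flight LLM requests bounded by the batch size. The progress callback is where a WebSocket push would hook in. Resuming after an interruption would also need the finished batches to be persisted, which this sketch leaves out.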
**Code example:**
```typescript
async function batchScreening(
literatures: Literature[],
@@ -156,7 +156,7 @@ async function batchScreening(
results.push(...batchResults);
// Push progress
const progress = Math.round(((i + 1) / batches.length) * 100);
progressCallback(progress);
}
@@ -165,55 +165,55 @@ async function batchScreening(
}
```
### 1.3 Data Flow
```
User action           Frontend                Backend                 LLM
  │                     │                       │                       │
  ├─ Upload Excel       │                       │                       │
  └────────────────────→│                       │                       │
  │                     ├─ Parse Excel          │                       │
  │                     ├─ Validate format      │                       │
  │                     ├─ Show preview         │                       │
  │                     │                       │                       │
  │                     ├─ Submit screening task│                       │
  │                     └──────────────────────→│                       │
  │                     │                       ├─ Save task            │
  │                     │                       ├─ Group (15 papers/group)
  │                     │                       │                       │
  │                     │                       ├─ Batch 1              │
  │                     │                       │  └───────────────────→│
  │                     │                       │                       ├─ DeepSeek screening
  │                     │                       │                       ├─ Qwen3 screening
  │                     │                       │                       ├─ Compare results
  │                     │                       │  ←────────────────────┘
  │                     │                       ├─ Save results         │
  │                     │                       │                       │
  │                     │                       ├─ Batch 2...           │
  │                     │                       │                       │
  │                     │  ←────────────────────┤ Return full results   │
  │  ←──────────────────┤ Show results          │                       │
  └─ Manual review      │                       │                       │
```
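
The data flow has each batch screened by both DeepSeek and Qwen3, followed by a "compare results" step and a final manual review. The document does not say how the comparison works. One plausible reading, sketched here under that assumption, is that agreement is accepted and disagreement is flagged for the reviewer. The verdict labels (`include`, `exclude`, `needs_review`) and the function name are hypothetical.

```python
from typing import Dict


def compare_decisions(deepseek: str, qwen: str) -> Dict[str, object]:
    """Merge two model verdicts; any disagreement is routed to manual review."""
    agreed = deepseek == qwen
    return {
        "deepseek": deepseek,
        "qwen": qwen,
        "conflict": not agreed,
        # Keep the shared verdict, otherwise defer to a human.
        "final": deepseek if agreed else "needs_review",
    }
```

Storing both raw verdicts next to the merged one keeps the conflict visible in the results table during manual review.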
---
## 📌 Scenarios 2 & 3: Full-Text Rescreening and Data Extraction
### 2.1 Technical Characteristics
- **Input format**: PDF files (English medical literature)
- **File characteristics**:
  - Scientific paper format (title, abstract, introduction, methods, results, discussion, references)
  - Contains complex tables, formulas, and figures
  - Typically 10-30 pages
- **Processing focus**: high-accuracy extraction that preserves structure and formatting
### 2.2 Technology Selection: PDF Extraction
#### Core Approach: Nougat + PyMuPDF Sequential Fallback Strategy ⭐
**Existing architecture** (already implemented, located in `extraction_service/`):
```python
# Sequential fallback strategy
@@ -221,75 +221,75 @@ def extract_pdf(file_path: str):
    # Step 1: Detect the language
    language = detect_language(file_path)
    # Step 2: Chinese PDF → PyMuPDF (fast)
    if language == 'chinese':
        return extract_pdf_pymupdf(file_path)
    # Step 3: English PDF → try Nougat
    if check_nougat_available():
        result = extract_pdf_nougat(file_path)
        # Quality check (threshold 0.7)
        if result['quality_score'] >= 0.7:
            return result  # Nougat succeeded
    # Step 4: Fall back to PyMuPDF
    return extract_pdf_pymupdf(file_path)
```
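
The strategy above depends on a `quality_score` between 0 and 1, but this diff does not show how the score is computed. The sketch below is a hypothetical heuristic for illustration only, not the project's scoring. It penalizes sparse output and repeated lines, two symptoms a failed extraction might show. The function names and the `min_chars_per_page` default are assumptions.

```python
def quality_score(text: str, page_count: int, min_chars_per_page: int = 500) -> float:
    """Hypothetical 0-1 score: text density multiplied by line uniqueness."""
    if page_count <= 0 or not text.strip():
        return 0.0
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Share of distinct lines; repetition loops push this toward 0.
    unique_ratio = len(set(lines)) / len(lines)
    # Extracted characters relative to a rough per-page expectation, capped at 1.
    density = min(1.0, len(text) / (page_count * min_chars_per_page))
    return round(unique_ratio * density, 3)


def should_fall_back(text: str, page_count: int, threshold: float = 0.7) -> bool:
    """True when the score is below the threshold used in the strategy above."""
    return quality_score(text, page_count) < threshold
```

Whatever the real scoring looks like, keeping it a pure function of the extracted text makes the 0.7 threshold easy to tune against sample PDFs.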
#### Technology Comparison
| Approach | Advantages | Disadvantages | Use case |
|------|------|------|---------|
| **Nougat** ⭐ | • Designed for scientific literature<br>• High accuracy on formulas and tables<br>• Outputs Markdown<br>• Preserves document structure | • Slow (1-2 minutes per 20 pages)<br>• Requires GPU acceleration<br>• High memory usage (~4GB) | Full-text extraction of English medical literature |
| **PyMuPDF** | • Fast (seconds)<br>• Low memory usage<br>• Simple deployment | • Formulas and tables are easily lost<br>• Plain-text output<br>• Layout is easily scrambled | Chinese literature, quick preview |
| **Adobe API** | • Commercial-grade accuracy<br>• Cloud processing | • Paid<br>• Network dependency<br>• Privacy risk | Not recommended (high cost) |
| **Tesseract OCR** | • Open source and free<br>• Multilingual support | • Requires image preprocessing<br>• Unstable accuracy | Scanned PDFs (fallback option) |
**Recommended approach: Nougat + PyMuPDF (fallback)**
#### Nougat Core Advantages (Medical Literature Scenario)
```
✅ Designed for scientific literature
  ├─ Training data: arXiv papers + scientific journals
  ├─ Formula recognition: LaTeX output
  ├─ Table preservation: Markdown table format
  └─ Structured output: clear sections and paragraphs

✅ Output format: Markdown
  ├─ Heading levels: # ## ###
  ├─ Tables: | Header | Data |
  ├─ Formulas: $$ formula $$
  └─ References: [1] [2] [3]

✅ Quality assessment mechanism
  ├─ Automatic quality score (0-1)
  ├─ Low-quality output automatically falls back to PyMuPDF
  └─ Ensures a high extraction success rate
```
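
Since the output is Markdown with the conventions listed above, downstream code can recover structure with simple pattern matching. The following is a minimal sketch under the assumption that headings use `#` prefixes, display formulas are wrapped in `$$`, and citations appear as `[n]`, as stated above. If the actual Nougat output uses other math delimiters, the formula pattern would need adjusting. `summarize_markdown` is a hypothetical helper name.

```python
import re
from typing import Dict

HEADING_RE = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)
FORMULA_RE = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)
CITATION_RE = re.compile(r"\[(\d+)\]")


def summarize_markdown(markdown: str) -> Dict[str, list]:
    """Collect headings, display formulas, and citation numbers from Markdown."""
    # Each heading becomes (level, text), e.g. (2, "Methods").
    headings = [(len(m.group(1)), m.group(2)) for m in HEADING_RE.finditer(markdown)]
    # Formula bodies without the surrounding $$ markers.
    formulas = [body.strip() for body in FORMULA_RE.findall(markdown)]
    # Distinct citation numbers in ascending order.
    citations = sorted({int(n) for n in CITATION_RE.findall(markdown)})
    return {"headings": headings, "formulas": formulas, "citations": citations}
```

A summary like this could feed the structured fields handed to the LLM, for example by locating the Methods or Results section from the heading list.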
#### Implementation Details
**Service architecture:**
```
Node.js Backend (Port 3001)
  │
  ├─ Calls ExtractionClient.ts
  │   └─ HTTP request → Python microservice
  │
Python Extraction Service (Port 8000)
  │
  ├─ /api/extract/pdf
  │   ├─ detect_language()
  │   ├─ extract_pdf_nougat() → Nougat Model
  │   └─ extract_pdf_pymupdf() → PyMuPDF
  │
  └─ /api/health
      └─ Checks Nougat availability
```
**Node.js calling code:**
```typescript
import { extractionClient } from '@common/document/ExtractionClient';
@@ -320,7 +320,7 @@ async function extractLiteraturePDF(file: Buffer, filename: string) {
}
```
**Python extraction code:**
```python
# extraction_service/services/nougat_extractor.py
@@ -336,14 +336,14 @@ def extract_pdf_nougat(file_path: str) -> Dict[str, Any]:
file_path,
'-o', output_dir,
'--markdown', # Output Markdown format
'--no-skipping' # Do not skip any pages
]
# Run Nougat (5-minute timeout)
process = subprocess.Popen(cmd, ...)
stdout, stderr = process.communicate(timeout=300)
# Read the output file (.mmd)
markdown_text = read_output_file()
# Quality assessment
@@ -362,9 +362,9 @@ def extract_pdf_nougat(file_path: str) -> Dict[str, Any]:
}
```
### 2.3 Text Post-Processing
**Nougat output optimization:**
```typescript
function postProcessNougatOutput(markdown: string): ProcessedText {
return {
@@ -380,10 +380,10 @@ function postProcessNougatOutput(markdown: string): ProcessedText {
// Formula extraction
formulas: extractFormulas(markdown),
// Plain text (formatting removed)
plainText: markdownToPlainText(markdown),
// Structured data (for the LLM)
structured: {
title: extractTitle(markdown),
abstract: extractAbstract(markdown),
@@ -398,13 +398,13 @@ function postProcessNougatOutput(markdown: string): ProcessedText {
## 📌 Scenario 4: Literature Download (Unpaywall API)
### 3.1 Technical Background
**Unpaywall** is a free Open Access literature API. It offers:
- ✅ DOI lookup of whether a paper has a free full text
- ✅ Legal PDF download links
- ✅ Completely free use, no payment required
- ✅ Database coverage of 30+ million papers
**Official site**: https://unpaywall.org/products/api
@@ -412,18 +412,18 @@ function postProcessNougatOutput(markdown: string): ProcessedText {
#### API Usage
**Basic information:**
- **API endpoint**: `https://api.unpaywall.org/v2/{doi}?email={your_email}`
- **Request method**: GET
- **Authentication**: no API key needed; only an email address is required
- **Rate limit**: 100,000 requests/day (free)
**Example request:**
```bash
curl "https://api.unpaywall.org/v2/10.1038/nature12373?email=YOUR_EMAIL"
```
**Example response:**
```json
{
"doi": "10.1038/nature12373",
@@ -443,7 +443,7 @@ curl "https://api.unpaywall.org/v2/10.1038/nature12373?email=YOUR_EMAIL"
#### Node.js Implementation
**Service wrapper:**
```typescript
// backend/src/common/literature/UnpaywallClient.ts
@@ -453,7 +453,7 @@ import { config } from '../../config/env';
export interface UnpaywallResult {
doi: string;
title: string;
isOA: boolean; // Whether the paper is open access
oaStatus: string; // "gold" | "green" | "hybrid" | "bronze" | "closed"
pdfUrl: string | null; // PDF download link
landingPageUrl: string; // Paper landing page link
@@ -476,12 +476,12 @@ class UnpaywallClient {
try {
const url = `${this.baseUrl}/${doi}?email=${this.email}`;
const response = await axios.get(url, {
timeout: 10000, // 10-second timeout
});
const data = response.data;
// Get the best download location
const bestOA = data.best_oa_location;
return {
@@ -505,7 +505,7 @@ class UnpaywallClient {
}
/**
* Batch lookup (with rate limiting)
*/
async getBatch(dois: string[]): Promise<UnpaywallResult[]> {
const results = [];
@@ -515,7 +515,7 @@ class UnpaywallClient {
const result = await this.getByDoi(doi);
results.push(result);
// Rate limit: 100ms per request
await new Promise(resolve => setTimeout(resolve, 100));
} catch (error) {
console.error(`Failed to fetch ${doi}:`, error.message);
@@ -547,7 +547,7 @@ class UnpaywallClient {
export const unpaywallClient = new UnpaywallClient();
```
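
Parts of the client are elided between the diff hunks above, so the following is a language-neutral illustration in Python of its three moving parts, not a copy of the TypeScript class: building the request URL, mapping the response onto the `UnpaywallResult` fields, and pacing a batch at one request per 100 ms. The response keys (`is_oa`, `oa_status`, `best_oa_location`, `url_for_pdf`, `url_for_landing_page`) follow the Unpaywall v2 schema. The function names and the injected `fetch` and `sleep` parameters are hypothetical, introduced so the sketch runs without network access.

```python
import time
from typing import Callable, Dict, List
from urllib.parse import quote, urlencode

BASE_URL = "https://api.unpaywall.org/v2"


def build_url(doi: str, email: str) -> str:
    """Build the lookup URL; the DOI keeps its slash and the email is escaped."""
    path = quote(doi, safe="/")
    query = urlencode({"email": email})
    return f"{BASE_URL}/{path}?{query}"


def parse_response(data: Dict) -> Dict[str, object]:
    """Map an Unpaywall JSON object onto the UnpaywallResult field names."""
    # best_oa_location is null for closed-access papers.
    best = data.get("best_oa_location") or {}
    return {
        "doi": data.get("doi"),
        "title": data.get("title"),
        "isOA": bool(data.get("is_oa")),
        "oaStatus": data.get("oa_status", "closed"),
        "pdfUrl": best.get("url_for_pdf"),
        "landingPageUrl": best.get("url_for_landing_page"),
    }


def get_batch(
    dois: List[str],
    fetch: Callable[[str], Dict],
    delay_seconds: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict[str, object]]:
    """Look up DOIs one by one, skipping failures and pausing between calls."""
    results = []
    for doi in dois:
        try:
            results.append(parse_response(fetch(doi)))
        except Exception as error:  # one failed DOI should not abort the batch
            print(f"Failed to fetch {doi}: {error}")
        sleep(delay_seconds)
    return results
```

At 100 ms per request a 500-paper batch takes just under a minute, well inside the 100,000 requests/day limit quoted above.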
**Environment variable configuration:**
```env
# .env
UNPAYWALL_EMAIL=your-email@example.com
@@ -560,7 +560,7 @@ UNPAYWALL_EMAIL=your-email@example.com
async function checkLiteratureAvailability(literatures: Literature[]) {
const dois = literatures
.map(lit => lit.doi)
.filter(doi => doi); // Filter out empty DOIs
const results = await unpaywallClient.getBatch(dois);
@@ -572,7 +572,7 @@ async function checkLiteratureAvailability(literatures: Literature[]) {
}
```
**Scenario 2: the user clicks to download the full text**
```typescript
async function downloadLiteratureFullText(doi: string) {
// Step 1: Query Unpaywall
@@ -588,7 +588,7 @@ async function downloadLiteratureFullText(doi: string) {
await unpaywallClient.downloadPdf(unpaywallResult.pdfUrl, outputPath);
// Step 3: Extract text (calls extraction_service)
const extractionResult = await extractionClient.extractPdf(
fs.readFileSync(outputPath),
filename,
@@ -605,9 +605,9 @@ async function downloadLiteratureFullText(doi: string) {
### 3.3 Frontend Integration
**Batch download button:**
```typescript
// Batch-check downloadability
async function checkDownloadable(selectedRows: Literature[]) {
setLoading(true);
@@ -631,7 +631,7 @@ async function downloadFullText(literature: Literature) {
const result = await api.downloadLiteratureFullText(literature.doi);
message.success('Download succeeded');
// Open the PDF viewer
openPdfViewer(result.pdfPath);
} catch (error) {
message.error(`Download failed: ${error.message}`);
@@ -645,23 +645,23 @@ async function downloadFullText(literature: Literature) {
### 4.1 Summary of the technical points you mentioned
| Technical point | Status | Notes |
|--------|------|------|
| ✅ Nougat model | Implemented | `extraction_service/services/nougat_extractor.py` |
| ✅ PyMuPDF | Implemented | `extraction_service/services/pdf_extractor.py` |
| ✅ Sequential fallback strategy | Implemented | English → Nougat, Chinese → PyMuPDF |
| 🆕 Unpaywall API | To be added | Implementation plan provided in this document |
| 🆕 Excel parsing | To be added | Uses the `xlsx` library (frontend) |
### 4.2 Technical points that may have been overlooked ⭐
#### (1) Enhanced table extraction
**Problem**: Nougat preserves table structure, but an LLM working directly on Markdown tables may be inaccurate.
**Solution: Table Transformer**
```python
# Use Microsoft's Table Transformer model
# https://github.com/microsoft/table-transformer
from transformers import TableTransformerForObjectDetection
@@ -675,7 +675,7 @@ def extract_tables_enhanced(pdf_path: str):
"microsoft/table-transformer-detection"
)
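# Detect table locations
# NOTE: `detect_tables` is a placeholder; the Hugging Face model class exposes no such method,
# so real inference would go through an image processor and the model's forward pass.
tables = model.detect_tables(pdf_path)
# Extract each table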
@@ -686,22 +686,22 @@ def extract_tables_enhanced(pdf_path: str):
return structured_tables
```
**Priority: V2.0** (Nougat is sufficient for the MVP stage)
#### (2) Reference parsing and linking
**Problem**: Scientific literature contains many citations such as `[1] [2] [3]`, which need to be parsed and linked to the reference list.
**Solution: GROBID**
```python
# GROBID: open-source tool for parsing scientific literature
# https://github.com/kermitt2/grobid
import requests
def parse_references(pdf_path: str):
"""
Parse references with GROBID
"""
with open(pdf_path, 'rb') as f:
files = {'input': f}
@@ -714,11 +714,11 @@ def parse_references(pdf_path: str):
# NOTE: GROBID responds with TEI XML, not JSON, so this line likely needs an XML parser instead
return response.json()['references']
```
**Priority: V2.0** (not a core feature)
#### (3) Formula recognition and rendering
**Problem**: Nougat outputs LaTeX formulas, which the frontend needs to render.
**Solution: KaTeX / MathJax**
```typescript
@@ -736,9 +736,9 @@ function renderFormula(latex: string) {
**Priority: MVP** (improves user experience)
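Before a formula can be handed to KaTeX or MathJax, it has to be located in the extracted text. The following is a minimal sketch, assuming that Nougat's Markdown output delimits inline math with `\(...\)` and display math with `\[...\]`; that is an assumption about the output format and should be checked against real extraction results. `Segment` and `splitMath` are hypothetical names that do not appear elsewhere in this document.

```typescript
// Hypothetical helper: split extracted text into plain-text and math segments.
type Segment = { kind: 'text' | 'inline' | 'display'; value: string };

function splitMath(source: string): Segment[] {
  // Group 1 captures inline math \( ... \), group 2 captures display math \[ ... \].
  const pattern = /\\\(([\s\S]+?)\\\)|\\\[([\s\S]+?)\\\]/g;
  const segments: Segment[] = [];
  let last = 0;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    if (match.index > last) {
      segments.push({ kind: 'text', value: source.slice(last, match.index) });
    }
    if (match[1] !== undefined) {
      segments.push({ kind: 'inline', value: match[1] });
    } else {
      segments.push({ kind: 'display', value: match[2] });
    }
    last = match.index + match[0].length;
  }
  if (last < source.length) {
    segments.push({ kind: 'text', value: source.slice(last) });
  }
  return segments;
}
```

Each `inline` or `display` segment could then be passed to the formula renderer, with display mode switched on for `display` segments, while `text` segments are rendered as ordinary Markdown.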
#### (4) PDF preview and annotation
**Problem**: Manual review requires viewing the original text and highlighting annotations in it.
**Solution: PDF.js + Annotator.js**
```typescript
@@ -762,11 +762,11 @@ function PdfViewer({ pdfUrl, annotations }) {
**Priority: MVP** (core feature)
#### (5) Literature deduplication
**Problem**: An uploaded Excel file may contain duplicate literature (different versions of the same paper).
**Solution: deduplication based on DOI and title**
```typescript
function deduplicateLiteratures(literatures: Literature[]) {
const seen = new Set();
@@ -791,16 +791,16 @@ function normalizeTitle(title: string): string {
return title
.toLowerCase()
.replace(/[^\w\s]/g, '') // remove punctuation
.replace(/\s+/g, ' ') // normalize whitespace
.trim();
}
```
**Priority: MVP** (required feature)
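One caveat about `normalizeTitle`: in JavaScript, `\w` matches only ASCII letters, digits, and the underscore, so a title written entirely in Chinese is reduced to an empty string and all such titles would collide. Since the test data includes a Chinese sample, a Unicode-aware variant may be safer. The sketch below is one possible approach, not the document's implementation; `normalizeTitleUnicode` and `dedupeKey` are hypothetical names, and the `doi` and `title` fields are assumed to exist on a literature record.

```typescript
// Hypothetical Unicode-aware variant of normalizeTitle.
function normalizeTitleUnicode(title: string): string {
  return title
    .normalize('NFKC')                 // fold full-width forms into their ASCII equivalents
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, '')  // drop punctuation, keep letters and digits in any script
    .replace(/\s+/g, ' ')              // collapse runs of whitespace
    .trim();
}

// Prefer the DOI as the deduplication key; fall back to the normalized title when it is missing.
function dedupeKey(lit: { doi?: string; title: string }): string {
  const doi = lit.doi?.trim().toLowerCase();
  return doi ? `doi:${doi}` : `title:${normalizeTitleUnicode(lit.title)}`;
}
```

Lowercasing the DOI is reasonable because DOIs are case-insensitive; records that share a key can then be treated as duplicates.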
#### (6) Literature metadata completion
**Problem**: Data uploaded via Excel may be incomplete (missing DOI, year, etc.).
**Solution: Crossref API**
```typescript
@@ -826,9 +826,9 @@ async function enrichMetadata(literature: Literature) {
**Priority: V1.0** (enhancement)
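To make the Crossref lookup concrete, the following sketch separates URL construction from the merge step so that both can be tested without network access. It assumes the public Crossref REST endpoint `https://api.crossref.org/works` with its `query.bibliographic`, `rows`, and `mailto` parameters. The `CrossrefWork` interface declares only the fields used here, and the helper names and the `year` and `journal` fields on the literature record are hypothetical.

```typescript
// Subset of a Crossref work record; only the fields used below are declared.
interface CrossrefWork {
  DOI?: string;
  title?: string[];
  issued?: { 'date-parts'?: number[][] };
  'container-title'?: string[];
}

// Build a bibliographic search URL for a title (hypothetical helper).
function buildCrossrefSearchUrl(title: string, mailto?: string): string {
  let url = `https://api.crossref.org/works?query.bibliographic=${encodeURIComponent(title)}&rows=1`;
  if (mailto) {
    url += `&mailto=${encodeURIComponent(mailto)}`; // identifies the caller to Crossref
  }
  return url;
}

// Fill only the fields that are missing, so user-supplied data is never overwritten.
function mergeCrossrefMetadata(
  lit: { title: string; doi?: string; year?: number; journal?: string },
  work: CrossrefWork,
) {
  return {
    ...lit,
    doi: lit.doi ?? work.DOI,
    year: lit.year ?? work.issued?.['date-parts']?.[0]?.[0],
    journal: lit.journal ?? work['container-title']?.[0],
  };
}
```

In the JSON response the first hit is found at `message.items[0]`. A top search hit is not guaranteed to be the same paper, so it is worth comparing the returned title with the uploaded one before accepting the merged metadata.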
#### (7) Batch-processing progress persistence
**Problem**: Batch screening takes a long time (1000 papers take more than 10 minutes), so resuming from a checkpoint must be supported.
**Solution: Redis + task queue**
```typescript
@@ -862,11 +862,11 @@ screeningQueue.process(async (job) => {
**Priority: V1.0** (experience improvement)
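The checkpoint-and-resume idea can be illustrated independently of Redis and the queue library. The sketch below is a simplified model under stated assumptions: `ProgressStore` is a hypothetical key-value interface that a Redis client wrapper could implement, `MemoryStore` is an in-memory stand-in so the example runs on its own, and the key format is invented for illustration.

```typescript
// Minimal key-value interface; a Redis client wrapper could implement it (assumption).
interface ProgressStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
}

// In-memory stand-in so the sketch runs without Redis.
class MemoryStore implements ProgressStore {
  private data = new Map<string, string>();
  async get(key: string): Promise<string | null> {
    return this.data.get(key) ?? null;
  }
  async set(key: string, value: string): Promise<void> {
    this.data.set(key, value);
  }
}

// Process items in order and persist the index of the next unprocessed item after each one.
async function runWithCheckpoint<T>(
  jobId: string,
  items: T[],
  store: ProgressStore,
  handle: (item: T) => Promise<void>,
): Promise<void> {
  const key = `screening:progress:${jobId}`;
  const start = Number((await store.get(key)) ?? 0); // resume where the last run stopped
  for (let i = start; i < items.length; i++) {
    await handle(items[i]);
    await store.set(key, String(i + 1));
  }
}
```

Because the checkpoint is written after each item is handled, a crash between the two steps causes that item to be processed again on resume, so `handle` should be idempotent.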
#### (8) Error handling and retries
**Problem**: LLM calls can fail (network errors, timeouts, rate limiting).
**Solution: retry with exponential backoff**
```typescript
async function retryWithBackoff<T>(
fn: () => Promise<T>,
@@ -892,30 +892,30 @@ async function retryWithBackoff<T>(
## 📊 Technology selection summary
### Required technologies for the MVP stage
| Layer | Technology | Purpose |
|------|------|------|
| **Frontend** | `xlsx` | Excel parsing |
| **Frontend** | `PDF.js` | PDF preview |
| **Frontend** | `KaTeX` | Formula rendering |
| **Backend** | `ExtractionClient` | Calls the Python microservice |
| **Backend** | `UnpaywallClient` | Literature download |
| **Python** | `Nougat` | English PDF extraction |
| **Python** | `PyMuPDF` | Fast PDF extraction |
| **Database** | `asl_schema` | Data storage |
### V1.0 enhancement technologies
| Technology | Purpose |
|------|------|
| Crossref API | Metadata completion |
| Bull Queue | Task queue |
| Redis | Progress persistence |
### V2.0 advanced technologies
| Technology | Purpose |
|------|------|
| Table Transformer | Precise table extraction |
| GROBID | Reference parsing |
@@ -932,63 +932,63 @@ AIclinicalresearch/docs/03-业务模块/ASL-AI智能文献/
└── 05-测试文档/
├── 01-测试计划.md
├── 02-标题摘要初筛测试用例.md
└── 03-测试数据/ ← new folder
├── README.md ← documentation
├── screening-test-data/
│ ├── literature-list-199.xlsx ← list of 199 papers
│ ├── picos-criteria.txt ← PICOS criteria
│ └── expected-results.json ← expected results (gold standard)
├── pdf-samples/
│ ├── sample-rct-01.pdf
│ ├── sample-cohort-01.pdf
│ └── README.md
└── extraction-test-data/
└── README.md
```
**Recommended structure:**
```
05-测试文档/
├── 01-测试计划.md
├── 02-标题摘要初筛测试用例.md
└── 03-测试数据/
├── README.md ← Important! Describes the test data's source, copyright, and usage
├── screening/
│ ├── literature-list-199.xlsx
│ ├── picos-criteria.txt
│ ├── inclusion-criteria.txt
│ ├── exclusion-criteria.txt
│ └── gold-standard.json ← manually annotated correct answers
└── pdf-extraction/
├── sample-01-high-quality.pdf
├── sample-02-with-tables.pdf
└── sample-03-chinese.pdf
```
**README.md example:**
```markdown
# ASL test dataset
## 📋 Data description
### 1. Title/abstract screening test data
- **File**: `literature-list-199.xlsx`
- **Count**: 199 English-language medical papers
- **Fields**: title, abstract, DOI, authors, year, journal
- **Source**: [describe the data source]
- **Copyright**: [state the copyright information]
### 2. PICOS criteria
- **File**: `picos-criteria.txt`
- **Content**: Population, Intervention, Comparison, Outcome, Study Design
- **Inclusion criteria**: 5
- **Exclusion criteria**: 8
### 3. Gold standard (manual annotation results)
- **File**: `gold-standard.json`
- **Annotator**: [annotator information]
- **Annotation date**: [date]
- **Expected accuracy**: 90%
## 🎯 Usage
@@ -997,15 +997,15 @@ AIclinicalresearch/docs/03-业务模块/ASL-AI智能文献/
npm run test:asl:screening
```
### Evaluate accuracy
```bash
npm run test:asl:evaluate -- --gold-standard gold-standard.json
```
## 📊 Expected results
- Included: 45 papers
- Excluded: 132 papers
- Uncertain: 22 papers
```
---
@@ -1013,13 +1013,13 @@ npm run test:asl:evaluate -- --gold-standard gold-standard.json
## 📚 Related documents
- [Quality assurance and traceability strategy](./06-质量保障与可追溯策略.md)
- [Database design](./01-数据库设计.md)
- [API design specification](./02-API设计规范.md)
- [Document extraction microservice](../../../../extraction_service/README.md)
---
**Changelog**:
- 2025-11-15: Document created; defines technology choices for initial screening, full-text processing, and literature download