
Document Extraction Microservice

A FastAPI-based document text extraction service that supports the PDF, Docx, and Txt formats.

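Features

  • PDF extraction: fast PDF text extraction with PyMuPDF
  • Docx extraction: Word document extraction with Mammoth (Day 3)
  • Txt extraction: supports multiple encodings (Day 3)
  • Language detection: automatically detects the language of a PDF (Day 2)
  • Nougat integration: high-quality parsing of academic PDFs (Day 2)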
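
The multi-encoding Txt support listed above could work roughly as follows. This is a minimal sketch, not the service's actual code: the function name decode_text and the candidate encoding list are assumptions chosen for illustration, since this README does not say which encodings txt_extractor.py tries.

```python
from typing import Iterable, Tuple

# Hypothetical candidate list; the encodings the service really tries are not
# stated in this README. "utf-8-sig" also decodes plain UTF-8 and strips a BOM.
CANDIDATE_ENCODINGS = ("utf-8-sig", "gb18030")


def decode_text(data: bytes,
                encodings: Iterable[str] = CANDIDATE_ENCODINGS) -> Tuple[str, str]:
    """Return (text, encoding) for the first encoding that decodes cleanly."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a character, so it never raises.
    return data.decode("latin-1"), "latin-1"
```

Trying strict decoders first and keeping latin-1 as the final fallback means the function always returns something, at the cost of possibly wrong characters when none of the candidates match.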

Quick Start

1. Install dependencies

cd extraction_service

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. Configure environment variables

# Copy the example configuration
cp .env.example .env

# Edit the configuration (optional)
# SERVICE_PORT=8000
# DEBUG=True
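
For reference, settings like these could be read on the Python side as sketched below. The Settings class and load_settings function are hypothetical names, and the DEBUG default of False is an assumption; only SERVICE_PORT, DEBUG, TEMP_DIR, port 8000, and /tmp/extraction_service come from this README. The sketch reads the process environment only; loading the .env file into the environment is assumed to happen elsewhere.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    service_port: int
    debug: bool
    temp_dir: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from an environment mapping, falling back to defaults."""
    return Settings(
        service_port=int(env.get("SERVICE_PORT", "8000")),
        # Assumed default: debug stays off unless DEBUG is explicitly truthy.
        debug=env.get("DEBUG", "False").strip().lower() in {"1", "true", "yes"},
        temp_dir=env.get("TEMP_DIR", "/tmp/extraction_service"),
    )
```

Passing the mapping as a parameter keeps the function easy to test without touching the real environment.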

3. Start the service

# Development mode (auto-reload)
python main.py

# Or use uvicorn
uvicorn main:app --reload --port 8000

The service will start at http://localhost:8000

4. Test the service

Health check

curl http://localhost:8000/api/health

Returns:

{
  "status": "healthy",
  "checks": {
    "pymupdf": {
      "available": true,
      "version": "1.23.8"
    },
    "temp_dir": {
      "path": "/tmp/extraction_service",
      "writable": true
    }
  }
}
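
A payload of this shape could be assembled along the following lines. This is a sketch under assumptions, not the code in main.py: the function names are hypothetical, and the "unhealthy" status value is invented for illustration because the README only shows the healthy case.

```python
import os
import tempfile
from importlib import metadata


def check_package(dist_name: str) -> dict:
    """Report whether an installed distribution exists and which version it is."""
    try:
        return {"available": True, "version": metadata.version(dist_name)}
    except metadata.PackageNotFoundError:
        return {"available": False, "version": None}


def check_temp_dir(path: str) -> dict:
    """Create the directory if needed and probe it by creating a temporary file."""
    try:
        os.makedirs(path, exist_ok=True)
        with tempfile.TemporaryFile(dir=path):
            pass
        writable = True
    except OSError:
        writable = False
    return {"path": path, "writable": writable}


def build_health(temp_dir: str) -> dict:
    checks = {
        "pymupdf": check_package("PyMuPDF"),
        "temp_dir": check_temp_dir(temp_dir),
    }
    healthy = checks["pymupdf"]["available"] and checks["temp_dir"]["writable"]
    # "unhealthy" is an assumed value; the README only documents "healthy".
    return {"status": "healthy" if healthy else "unhealthy", "checks": checks}
```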

PDF text extraction

curl -X POST http://localhost:8000/api/extract/pdf \
  -F "file=@test.pdf"

Returns:

{
  "success": true,
  "method": "pymupdf",
  "text": "Extracted text content...",
  "metadata": {
    "page_count": 20,
    "char_count": 50000,
    "file_size": 1024000,
    "filename": "test.pdf"
  }
}
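
The same request can be sent from Python using only the standard library. This is a sketch: build_multipart and extract_pdf are hypothetical helper names, while the endpoint path and the form field name "file" are taken from the curl command above.

```python
import json
import os
import urllib.request
import uuid
from typing import Tuple


def build_multipart(field: str, filename: str, data: bytes,
                    content_type: str = "application/pdf") -> Tuple[bytes, str]:
    """Encode one file as multipart/form-data; return (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def extract_pdf(path: str, base_url: str = "http://localhost:8000") -> dict:
    """POST a PDF to /api/extract/pdf and return the decoded JSON response."""
    with open(path, "rb") as fh:
        body, header = build_multipart("file", os.path.basename(path), fh.read())
    req = urllib.request.Request(
        f"{base_url}/api/extract/pdf",
        data=body,
        headers={"Content-Type": header},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the service running, extract_pdf("test.pdf")["metadata"]["page_count"] would read the page count from a response shaped like the one above.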

API Documentation

After starting the service, visit:

Project Structure

extraction_service/
├── main.py              # Main application entry point
├── requirements.txt     # Python dependencies
├── .env.example         # Example environment variables
├── README.md           # This file
├── services/           # Service modules
│   ├── __init__.py
│   ├── pdf_extractor.py      # PDF extraction (PyMuPDF)
│   ├── nougat_extractor.py   # Nougat extraction (Day 2)
│   ├── docx_extractor.py     # Docx extraction (Day 3)
│   ├── txt_extractor.py      # Txt extraction (Day 3)
│   ├── language_detector.py  # Language detection (Day 2)
│   └── file_utils.py         # File utilities
└── tests/              # Test files (to be added)

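Development Plan

Day 1 (completed)

  • FastAPI project setup
  • PyMuPDF integration
  • PDF text extraction
  • Health check API

Day 2 (in progress)

  • Install Nougat
  • Language detection
  • Nougat extraction logic
  • Sequential fallback mechanism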
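
A sequential fallback mechanism like the one planned for Day 2 can be sketched as a loop over extractors. Everything here is an assumption for illustration: the function name, the result fields beyond success/method/text, and the Nougat-then-PyMuPDF order are not specified in this README, and the two extractors are stand-ins rather than real Nougat or PyMuPDF calls.

```python
from typing import Callable, Dict, Sequence, Tuple

Extractor = Tuple[str, Callable[[bytes], str]]


def extract_with_fallback(data: bytes, extractors: Sequence[Extractor]) -> dict:
    """Try each (name, extractor) in order; return the first non-empty result."""
    errors: Dict[str, str] = {}
    for name, extract in extractors:
        try:
            text = extract(data)
        except Exception as exc:  # one failing extractor must not abort the chain
            errors[name] = str(exc)
            continue
        if text.strip():
            return {"success": True, "method": name, "text": text, "errors": errors}
        errors[name] = "empty result"
    return {"success": False, "method": None, "text": "", "errors": errors}


# Stand-in extractors, used only to show the control flow.
def fake_nougat(data: bytes) -> str:
    raise RuntimeError("model not installed")


def fake_pymupdf(data: bytes) -> str:
    return "plain text from the fast path"


result = extract_with_fallback(b"%PDF-", [("nougat", fake_nougat), ("pymupdf", fake_pymupdf)])
```

Recording each failure in errors keeps the reason for a downgrade visible to the caller instead of silently switching methods.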

Day 3

  • Docx extraction (Mammoth)
  • Txt extraction (multiple encodings)
  • File format validation
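
File format validation could combine an extension allow-list with a check of the file's leading bytes. This is a sketch with hypothetical names; the README does not describe how the service will validate uploads. It relies on two general facts: PDF files normally begin with "%PDF-", and .docx files are ZIP archives that begin with "PK\x03\x04".

```python
import os

# Hypothetical allow-list built from the three formats this README names.
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt"}


def validate_file(filename: str, head: bytes) -> str:
    """Return the normalized extension, or raise ValueError if the file looks wrong.

    `head` is the first few bytes of the uploaded file.
    """
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported format: {ext or '(none)'}")
    if ext == ".pdf" and not head.startswith(b"%PDF-"):
        raise ValueError("extension is .pdf but the content is not a PDF")
    if ext == ".docx" and not head.startswith(b"PK\x03\x04"):
        raise ValueError("extension is .docx but the content is not a ZIP container")
    # Plain text has no signature, so .txt is accepted on extension alone.
    return ext
```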

Dependencies

Package      Version   Purpose
fastapi      0.104.1   Web framework
uvicorn      0.24.0    ASGI server
PyMuPDF      1.23.8    PDF text extraction
pdfplumber   0.10.3    PDF language detection
mammoth      1.6.0     Docx extraction
langdetect   1.0.9     Language detection
loguru       0.7.2     Logging

Performance Targets

Operation                Target time
20-page PDF (PyMuPDF)    < 30 seconds
10-page Docx             < 10 seconds
1 MB Txt                 < 5 seconds
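
One way to check an extraction call against these targets is a small timing helper. This is a sketch: the dictionary keys and function names are hypothetical labels, and only the second values come from the table.

```python
import time
from typing import Any, Callable, Tuple

# Target times in seconds from the table; the keys are hypothetical labels.
TARGET_SECONDS = {"pdf_20_pages": 30.0, "docx_10_pages": 10.0, "txt_1mb": 5.0}


def timed(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def within_target(operation: str, elapsed: float) -> bool:
    """True if the elapsed time is strictly below the target for that operation."""
    return elapsed < TARGET_SECONDS[operation]
```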

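FAQ

Q: PyMuPDF fails to install?

A: Make sure your Python version is >= 3.8, then install with pip: pip install PyMuPDF

Q: The service won't start?

A: Check whether port 8000 is already in use; you can change SERVICE_PORT in .env.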
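
A quick way to check the port from Python is to attempt a TCP connection to it. This is a sketch with a hypothetical function name; it only detects a listener reachable on the given host, which is assumed here to be the local loopback address.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0
```

If port_in_use(8000) returns True before the service is started, another process holds the port and SERVICE_PORT should be set to a different value.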

Q: Where are temporary files stored?

A: By default in the /tmp/extraction_service directory; this can be changed with the TEMP_DIR environment variable.

License

MIT