HaHafeng 40c2f8e148 feat(rag): Complete RAG engine implementation with pgvector
Major Features:
- Created ekb_schema (13th schema) with 3 tables: KB/Document/Chunk
- Implemented EmbeddingService (text-embedding-v4, 1024-dim vectors)
- Implemented ChunkService (smart Markdown chunking)
- Implemented VectorSearchService (multi-query + hybrid search)
- Implemented RerankService (qwen3-rerank)
- Integrated DeepSeek V3 QueryRewriter for cross-language search
- Python service: Added pymupdf4llm for PDF-to-Markdown conversion
- PKB: Dual-mode adapter (pgvector/dify/hybrid)

Architecture:
- Brain-Hand Model: Business layer (DeepSeek) + Engine layer (pgvector)
- Cross-language support: Chinese query matches English documents
- Small Embedding (1024) + Strong Reranker strategy

Performance:
- End-to-end latency: 2.5s
- Cost per query: 0.0025 RMB
- Accuracy improvement: +20.5% (cross-language)

Tests:
- test-embedding-service.ts: Vector embedding verified
- test-rag-e2e.ts: Full pipeline tested
- test-rerank.ts: Rerank quality validated
- test-query-rewrite.ts: Cross-language search verified
- test-pdf-ingest.ts: Real PDF document tested (Dongen 2003.pdf)

Documentation:
- Added 05-RAG-Engine-User-Guide.md
- Added 02-Document-Processing-User-Guide.md
- Updated system status documentation

Status: Production ready
2026-01-21 20:24:29 +08:00

# Document Processing Engine Design
> **Version:** v1.1
> **Created:** 2026-01-20
> **Last updated:** 2026-01-20
> **Purpose:** Define a unified document-processing strategy that converts documents of all supported types into LLM-friendly Markdown
> **Scope:** PKB knowledge base, ASL smart literature, DC data cleaning, AIA attachment handling
> **Core principles:** Ultra-lightweight, zero OCR, focus on core formats
---
## 📋 Overview
### Design Philosophy
Build an **"ultra-lightweight, zero-OCR, LLM-friendly"** document-parsing microservice.
**Core principles (suited to a two-person team):**
- **Focus on what matters** - Guarantee rock-solid accuracy for PDF/Word/Excel; extend to niche formats only on demand
- **Zero OCR** - Handle digital-native documents only; drop scanned-document support in exchange for the fastest possible deployment
- **Graceful degradation** - On parse failure, return an LLM-friendly notice instead of aborting the pipeline
### Design Goals
1. **Focus on core formats** - PDF, Word, Excel, and PPT cover 95% of usage scenarios
2. **LLM-friendly output** - Convert everything into structured Markdown with contextual information
3. **Table fidelity** - Preserve tables from the literature in full (core clinical-trial data)
4. **Ultra-lightweight** - Docker image < 300 MB; memory footprint < 512 MB
### Architecture Overview
```
┌─────────────────────────────────────────────────────────────┐
│                     DocumentProcessor                       │
│  (unified entry: detects file type, dispatches to handler)  │
├─────────────────────────────────────────────────────────────┤
│ ┌───────────┐ ┌───────────┐ ┌────────────┐ ┌───────────┐    │
│ │    PDF    │ │   Word    │ │    PPT     │ │   Excel   │    │
│ │ Processor │ │ Processor │ │ Processor  │ │ Processor │    │
│ │pymupdf4llm│ │  mammoth  │ │python-pptx │ │  pandas   │    │
│ └───────────┘ └───────────┘ └────────────┘ └───────────┘    │
│ ┌───────────┐ ┌───────────┐ ┌────────────┐ ┌───────────┐    │
│ │    CSV    │ │   HTML    │ │ References │ │  Medical  │    │
│ │ Processor │ │ Processor │ │ Processor  │ │ Processor │    │
│ │  pandas   │ │markdownify│ │bibtexparser│ │  pydicom  │    │
│ └───────────┘ └───────────┘ └────────────┘ └───────────┘    │
├─────────────────────────────────────────────────────────────┤
│              Output: unified Markdown format                │
└─────────────────────────────────────────────────────────────┘
```
---
## 📄 Supported Formats and Tool Selection
### Format Coverage Matrix
| Category | Format | Recommended Tool | Priority | Status |
|------|------|----------|--------|------|
| **Documents** | PDF (.pdf) | `pymupdf4llm` | P0 | ✅ Recommended |
| | Word (.docx) | `mammoth` | P0 | ✅ Recommended |
| | PowerPoint (.pptx) | `python-pptx` | P1 | ✅ Recommended |
| | Plain text (.txt/.md) | direct read | P0 | ✅ Built-in |
| | Rich text (.rtf) | `striprtf` | P2 | 🔜 To be implemented |
| **Tabular** | Excel (.xlsx) | `pandas` + `openpyxl` | P0 | ✅ Recommended |
| | CSV (.csv) | `pandas` | P0 | ✅ Recommended |
| | SAS (.sas7bdat) | `pandas` + `sas7bdat` | P2 | 🔜 To be implemented |
| | SPSS (.sav) | `pandas` + `pyreadstat` | P2 | 🔜 To be implemented |
| | Stata (.dta) | `pandas.read_stata()` | P2 | 🔜 To be implemented |
| **Web** | HTML (.html) | `beautifulsoup4` + `markdownify` | P1 | ✅ Recommended |
| | E-book (.epub) | `ebooklib` | P2 | 🔜 To be implemented |
| **Citations** | BibTeX (.bib) | `bibtexparser` | P1 | ✅ Recommended |
| | RIS (.ris) | `rispy` | P1 | ✅ Recommended |
| | EndNote (.enw) | custom parser | P2 | 🔜 To be implemented |
| **Medical** | DICOM (.dcm) | `pydicom` | P2 | 🔜 To be implemented |
| | HL7/FHIR | `hl7` / `fhirclient` | P3 | 📋 Planned |
| **Data** | JSON (.json/.jsonl) | `json` stdlib | P1 | ✅ Built-in |
| | XML (.xml) | `lxml` | P1 | ✅ Recommended |
---
## 🔧 Implementation Details
### 1. PDF Processing
#### Tool choice: `pymupdf4llm`
**Key decision: keep only `pymupdf4llm`; drop standalone `PyMuPDF` and `Nougat`**
| Criterion | PyMuPDF (old) | pymupdf4llm (new) | Nougat |
|--------|-------------|------------------|--------|
| Table extraction | plain text only | ✅ Markdown tables | ✅ LaTeX tables |
| Image handling | raw binary | ✅ automatic base64 | ✅ supported |
| Math formulas | ❌ | ✅ LaTeX preserved | ✅ native LaTeX |
| Multi-column layout | manual handling | ✅ automatic reflow | ✅ supported |
| Speed | fast | fast | slow (GPU) |
| Dependency weight | low | low | high |
| Scanned PDFs | ❌ | ❌ | ✅ |
**Notes**
- `pymupdf4llm` is a thin wrapper over `PyMuPDF`; installing it pulls in the `pymupdf` dependency automatically
- For ordinary (digital-native) PDFs, `pymupdf4llm` fully covers our needs
- **Scanned-PDF strategy**: detect the scan and return a friendly notice without blocking the pipeline (zero-OCR principle)
#### Implementation
```python
# pdf_processor.py
import logging
from pathlib import Path
from typing import Any, Dict, List

import pymupdf4llm

logger = logging.getLogger(__name__)


class PdfProcessor:
    """PDF processor based on pymupdf4llm (digital-native PDFs only)."""

    # Scanned-PDF detection threshold: fewer extracted characters than this
    # is treated as a scanned (image-only) document
    MIN_TEXT_THRESHOLD = 50

    def __init__(self, image_dir: str = "./images"):
        self.image_dir = image_dir

    def to_markdown(
        self,
        pdf_path: str,
        page_chunks: bool = False,
        extract_images: bool = True,
        dpi: int = 150
    ) -> str:
        """
        Convert a PDF to Markdown (digital-native PDFs only).

        Args:
            pdf_path: path to the PDF file
            page_chunks: whether to split output by page
            extract_images: whether to extract images
            dpi: image resolution

        Returns:
            Markdown-formatted text

        Note:
            Scanned PDFs return a friendly notice instead of raising.
        """
        try:
            md_text = pymupdf4llm.to_markdown(
                pdf_path,
                page_chunks=page_chunks,
                write_images=extract_images,
                image_path=self.image_dir,
                dpi=dpi,
                show_progress=False
            )
            # With page_chunks=True a list is returned; merge it into one string
            if isinstance(md_text, list):
                md_text = "\n\n---\n\n".join([
                    f"## Page {i+1}\n\n{page['text']}"
                    for i, page in enumerate(md_text)
                ])
            # Quality check: detect scanned documents
            if len(md_text.strip()) < self.MIN_TEXT_THRESHOLD:
                logger.warning(
                    f"PDF yielded very little text ({len(md_text.strip())} chars), "
                    f"likely a scanned document: {pdf_path}"
                )
                return self._scan_pdf_hint(pdf_path, len(md_text.strip()))
            return md_text
        except Exception as e:
            logger.error(f"PDF parsing failed: {pdf_path}, error: {e}")
            raise ValueError(f"PDF parsing failed: {str(e)}")

    def _scan_pdf_hint(self, pdf_path: str, char_count: int) -> str:
        """Build a friendly notice so the LLM knows the file could not be read."""
        filename = Path(pdf_path).name
        return f"""> **System notice**: document `{filename}` appears to be a scanned (image-only) PDF.
>
> - Extracted text: {char_count} characters
> - This system does not yet perform OCR on scanned PDFs
> - Suggestion: please upload a digital-native PDF, or convert the scan to an editable format and re-upload"""

    def extract_tables(self, pdf_path: str) -> List[Dict[str, Any]]:
        """
        Extract all tables from a PDF.

        Returns:
            A list of tables, each with its page number and Markdown content.
        """
        import fitz  # pymupdf

        tables = []
        doc = fitz.open(pdf_path)
        for page_num, page in enumerate(doc, 1):
            # Native table extraction (available since pymupdf 1.23)
            page_tables = page.find_tables()
            for idx, table in enumerate(page_tables):
                df = table.to_pandas()
                tables.append({
                    "page": page_num,
                    "table_index": idx,
                    "markdown": df.to_markdown(index=False),
                    "rows": len(df),
                    "cols": len(df.columns)
                })
        doc.close()
        return tables

    def get_metadata(self, pdf_path: str) -> Dict[str, Any]:
        """Extract PDF metadata."""
        import fitz

        doc = fitz.open(pdf_path)
        metadata = doc.metadata
        metadata["page_count"] = len(doc)
        doc.close()
        return metadata
```
#### Dependencies
```txt
# requirements.txt
pymupdf4llm>=0.0.10  # pulls in pymupdf automatically
```
---
### 2. Word Processing (.docx)
#### Tool choice: `mammoth` (recommended)
| Criterion | python-docx | mammoth |
|--------|------------|---------|
| Output format | manual conversion required | ✅ Markdown/HTML directly |
| Table handling | precise control | ✅ automatic conversion |
| Style fidelity | full | basic styles |
| Complexity | high | ✅ low |
| Best for | fine-grained control | ✅ quick conversion |
**Recommendation**: use `mammoth` by default, with `python-docx` as a fallback for complex documents
#### Implementation
```python
# docx_processor.py
import logging
from pathlib import Path

import mammoth

logger = logging.getLogger(__name__)


class DocxProcessor:
    """Word processor based on mammoth."""

    def to_markdown(self, docx_path: str) -> str:
        """
        Convert a Word document to Markdown.

        Args:
            docx_path: path to the .docx file

        Returns:
            Markdown-formatted text

        Note:
            Empty documents return a friendly notice.
        """
        try:
            with open(docx_path, "rb") as f:
                result = mammoth.convert_to_markdown(f)
            # Log conversion warnings
            if result.messages:
                for msg in result.messages:
                    logger.warning(f"[Word conversion warning] {msg.message}")
            # Empty-document check
            if not result.value.strip():
                filename = Path(docx_path).name
                return f"> **System notice**: Word document `{filename}` is empty or unreadable."
            return result.value
        except Exception as e:
            logger.error(f"Word parsing failed: {docx_path}, error: {e}")
            raise ValueError(f"Word parsing failed: {str(e)}")

    def to_html(self, docx_path: str) -> str:
        """Convert Word to HTML (preserves more styling)."""
        with open(docx_path, "rb") as f:
            result = mammoth.convert_to_html(f)
        return result.value

    def extract_images(self, docx_path: str, output_dir: str) -> list:
        """Extract the images embedded in a Word document."""
        import os

        from docx import Document

        doc = Document(docx_path)
        images = []
        for idx, rel in enumerate(doc.part.rels.values()):
            if "image" in rel.target_ref:
                image_data = rel.target_part.blob
                ext = rel.target_ref.split(".")[-1]
                image_path = os.path.join(output_dir, f"image_{idx}.{ext}")
                with open(image_path, "wb") as f:
                    f.write(image_data)
                images.append(image_path)
        return images
```
#### Dependencies
```txt
# requirements.txt
mammoth>=1.6.0
python-docx>=0.8.11  # fallback for complex documents
```
---
### 3. PowerPoint Processing (.pptx)
#### Tool choice: `python-pptx`
#### Implementation
```python
# pptx_processor.py
import os
from typing import Any, Dict, List

from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE


class PptxProcessor:
    """PowerPoint processor based on python-pptx."""

    def to_markdown(
        self,
        pptx_path: str,
        extract_images: bool = False,
        image_dir: str = "./images"
    ) -> str:
        """
        Convert a PPT to Markdown.

        Args:
            pptx_path: path to the .pptx file
            extract_images: whether to extract images
            image_dir: directory for extracted images

        Returns:
            Markdown-formatted text
        """
        prs = Presentation(pptx_path)
        md_parts = []
        image_count = 0
        for slide_num, slide in enumerate(prs.slides, 1):
            md_parts.append(f"## Slide {slide_num}")
            # Slide title
            title_text = slide.shapes.title.text if slide.shapes.title else None
            if title_text:
                md_parts.append(f"### {title_text}")
            # Walk every shape on the slide
            for shape in slide.shapes:
                # Text frames
                if shape.has_text_frame:
                    for para in shape.text_frame.paragraphs:
                        text = para.text.strip()
                        # Skip empty paragraphs and the title we already emitted
                        if text and text != title_text:
                            # Indent according to outline level
                            level = para.level
                            prefix = "  " * level + "- " if level > 0 else ""
                            md_parts.append(f"{prefix}{text}")
                # Tables
                if shape.has_table:
                    md_parts.append(self._table_to_markdown(shape.table))
                # Pictures
                if extract_images and shape.shape_type == MSO_SHAPE_TYPE.PICTURE:
                    image_count += 1
                    image_path = os.path.join(image_dir, f"slide{slide_num}_img{image_count}.png")
                    self._save_image(shape, image_path)
                    md_parts.append(f"![Image]({image_path})")
            md_parts.append("")  # blank line between slides
        return "\n".join(md_parts)

    def _table_to_markdown(self, table) -> str:
        """Convert a pptx table to a Markdown table."""
        rows = []
        for row_idx, row in enumerate(table.rows):
            cells = [cell.text.strip() for cell in row.cells]
            rows.append("| " + " | ".join(cells) + " |")
            # Header separator row
            if row_idx == 0:
                rows.append("| " + " | ".join(["---"] * len(cells)) + " |")
        return "\n".join(rows)

    def _save_image(self, shape, output_path: str):
        """Write an embedded picture to disk."""
        os.makedirs(os.path.dirname(output_path), exist_ok=True)
        with open(output_path, "wb") as f:
            f.write(shape.image.blob)

    def get_outline(self, pptx_path: str) -> List[Dict[str, Any]]:
        """Return the slide-by-slide outline of a PPT."""
        prs = Presentation(pptx_path)
        outline = []
        for slide_num, slide in enumerate(prs.slides, 1):
            slide_info = {
                "slide_number": slide_num,
                "title": slide.shapes.title.text if slide.shapes.title else None,
                "text_count": sum(
                    len(shape.text_frame.text)
                    for shape in slide.shapes
                    if shape.has_text_frame
                )
            }
            outline.append(slide_info)
        return outline
```
#### Dependencies
```txt
# requirements.txt
python-pptx>=0.6.23
```
---
### 4. Excel Processing (.xlsx)
#### Tool choice: `pandas` + `openpyxl`
#### Implementation
```python
# excel_processor.py
import logging
from pathlib import Path
from typing import Any, Dict, List, Optional

import pandas as pd

logger = logging.getLogger(__name__)


class ExcelProcessor:
    """Excel processor based on pandas + openpyxl."""

    def to_markdown(
        self,
        xlsx_path: str,
        sheet_names: Optional[List[str]] = None,
        max_rows: int = 200
    ) -> str:
        """
        Convert an Excel workbook to Markdown with rich context.

        Args:
            xlsx_path: path to the Excel file
            sheet_names: sheets to process (None = all sheets)
            max_rows: row cap to guard against huge files (default 200)

        Returns:
            Markdown-formatted text (includes filename, row/column counts, etc.)
        """
        filename = Path(xlsx_path).name
        md_parts = []
        try:
            xlsx = pd.ExcelFile(xlsx_path, engine='openpyxl')
            sheets_to_process = sheet_names or xlsx.sheet_names
            for sheet_name in sheets_to_process:
                if sheet_name not in xlsx.sheet_names:
                    continue
                df = pd.read_excel(xlsx, sheet_name=sheet_name)
                total_rows = len(df)
                # Provenance context (LLM-friendly)
                md_parts.append(f"## Data source: {filename} - {sheet_name}")
                md_parts.append(f"- **Shape**: {total_rows} rows × {len(df.columns)} columns")
                # Truncation notice
                if total_rows > max_rows:
                    md_parts.append(f"> ⚠️ Large dataset; showing the first {max_rows} of {total_rows} rows")
                    df = df.head(max_rows)
                md_parts.append("")
                # Replace NaN with empty strings so they don't leak into the output
                df = df.fillna('')
                md_parts.append(df.to_markdown(index=False))
                md_parts.append("\n---\n")
            return "\n".join(md_parts)
        except Exception as e:
            logger.error(f"Excel parsing failed: {xlsx_path}, error: {e}")
            return f"> **System notice**: Excel file `{filename}` could not be parsed: {str(e)}"

    def get_sheet_info(self, xlsx_path: str) -> List[Dict[str, Any]]:
        """Return metadata for every sheet in the workbook."""
        xlsx = pd.ExcelFile(xlsx_path, engine='openpyxl')
        sheets = []
        for sheet_name in xlsx.sheet_names:
            df = pd.read_excel(xlsx, sheet_name=sheet_name)
            sheets.append({
                "name": sheet_name,
                "rows": len(df),
                "columns": len(df.columns),
                "column_names": df.columns.tolist()
            })
        return sheets

    def extract_sheet(
        self,
        xlsx_path: str,
        sheet_name: str,
        as_dict: bool = False
    ) -> Any:
        """Extract a single sheet."""
        df = pd.read_excel(xlsx_path, sheet_name=sheet_name, engine='openpyxl')
        if as_dict:
            return df.to_dict(orient='records')
        return df
```
#### Dependencies
```txt
# requirements.txt
pandas>=2.0.0
openpyxl>=3.1.2
tabulate>=0.9.0  # required by pandas.to_markdown()
```
---
### 5. CSV Processing
#### Implementation
```python
# csv_processor.py
import logging
from pathlib import Path
from typing import Optional

import chardet
import pandas as pd

logger = logging.getLogger(__name__)


class CsvProcessor:
    """CSV processor based on pandas."""

    def to_markdown(
        self,
        csv_path: str,
        encoding: Optional[str] = None,
        max_rows: int = 200,
        delimiter: str = ','
    ) -> str:
        """
        Convert a CSV file to Markdown with rich context.

        Args:
            csv_path: path to the CSV file
            encoding: file encoding (auto-detected when None)
            max_rows: row cap (default 200)
            delimiter: field delimiter

        Returns:
            Markdown-formatted text
        """
        filename = Path(csv_path).name
        try:
            # Auto-detect encoding
            if encoding is None:
                encoding = self._detect_encoding(csv_path)
            df = pd.read_csv(csv_path, encoding=encoding, delimiter=delimiter)
            total_rows = len(df)
            md_parts = [
                f"## Data source: {filename}",
                f"- **Shape**: {total_rows} rows × {len(df.columns)} columns",
                f"- **Encoding**: {encoding}",
            ]
            # Truncation notice
            if total_rows > max_rows:
                md_parts.append(f"> ⚠️ Large dataset; showing the first {max_rows} of {total_rows} rows")
                df = df.head(max_rows)
            md_parts.append("")
            df = df.fillna('')
            md_parts.append(df.to_markdown(index=False))
            return "\n".join(md_parts)
        except Exception as e:
            logger.error(f"CSV parsing failed: {csv_path}, error: {e}")
            return f"> **System notice**: CSV file `{filename}` could not be parsed: {str(e)}"

    def _detect_encoding(self, file_path: str) -> str:
        """Auto-detect the file encoding."""
        with open(file_path, 'rb') as f:
            raw_data = f.read(10000)  # sample the first 10 KB
        result = chardet.detect(raw_data)
        encoding = result['encoding']
        # Normalize common encodings
        encoding_map = {
            'GB2312': 'gbk',
            'gb2312': 'gbk',
            'GBK': 'gbk',
            'GB18030': 'gb18030',
        }
        return encoding_map.get(encoding, encoding or 'utf-8')
```
#### Dependencies
```txt
# requirements.txt
pandas>=2.0.0
chardet>=5.0.0
```
---
### 6. HTML Processing
#### Tool choice: `beautifulsoup4` + `markdownify`
#### Implementation
```python
# html_processor.py
from typing import Optional

from bs4 import BeautifulSoup
from markdownify import markdownify as md


class HtmlProcessor:
    """HTML processor based on beautifulsoup4 + markdownify."""

    def to_markdown(
        self,
        html_content: str,
        strip_tags: Optional[list] = None
    ) -> str:
        """
        Convert HTML to Markdown.

        Args:
            html_content: HTML source
            strip_tags: additional tags to remove

        Returns:
            Markdown-formatted text
        """
        # Pre-processing: drop scripts, styles, and page chrome
        soup = BeautifulSoup(html_content, 'html.parser')
        for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
            tag.decompose()
        if strip_tags:
            for tag_name in strip_tags:
                for tag in soup(tag_name):
                    tag.decompose()
        # Convert to Markdown
        markdown = md(str(soup), heading_style="ATX", bullets="-")
        # Collapse runs of blank lines
        lines = markdown.split('\n')
        cleaned_lines = []
        prev_empty = False
        for line in lines:
            is_empty = not line.strip()
            if is_empty and prev_empty:
                continue
            cleaned_lines.append(line)
            prev_empty = is_empty
        return '\n'.join(cleaned_lines)

    def from_file(self, html_path: str, encoding: str = 'utf-8') -> str:
        """Read an HTML file and convert it."""
        with open(html_path, 'r', encoding=encoding) as f:
            html_content = f.read()
        return self.to_markdown(html_content)

    def extract_text(self, html_content: str) -> str:
        """Extract plain text only."""
        soup = BeautifulSoup(html_content, 'html.parser')
        return soup.get_text(separator='\n', strip=True)
```
#### Dependencies
```txt
# requirements.txt
beautifulsoup4>=4.12.0
markdownify>=0.11.6
lxml>=4.9.0  # high-performance parser for BeautifulSoup
```
---
### 7. Citation Format Processing
#### Implementation
```python
# reference_processor.py
from typing import Any, Dict, List

import bibtexparser
import rispy


class ReferenceProcessor:
    """Processor for bibliographic citation formats."""

    def bib_to_markdown(self, bib_path: str) -> str:
        """
        Convert BibTeX to Markdown.

        Args:
            bib_path: path to the .bib file

        Returns:
            A Markdown-formatted reference list
        """
        with open(bib_path, 'r', encoding='utf-8') as f:
            bib_database = bibtexparser.load(f)
        md_parts = ["# References\n"]
        for idx, entry in enumerate(bib_database.entries, 1):
            # Format as a citation line
            authors = entry.get('author', 'Unknown')
            title = entry.get('title', 'No title')
            year = entry.get('year', 'N/A')
            journal = entry.get('journal', entry.get('booktitle', ''))
            citation = f"{idx}. {authors}. **{title}**. "
            if journal:
                citation += f"*{journal}*. "
            citation += f"({year})"
            md_parts.append(citation)
            md_parts.append("")
        return "\n".join(md_parts)

    def ris_to_markdown(self, ris_path: str) -> str:
        """
        Convert RIS to Markdown.

        Args:
            ris_path: path to the .ris file

        Returns:
            A Markdown-formatted reference list
        """
        with open(ris_path, 'r', encoding='utf-8') as f:
            entries = rispy.load(f)
        md_parts = ["# References\n"]
        for idx, entry in enumerate(entries, 1):
            authors = ', '.join(entry.get('authors', ['Unknown']))
            title = entry.get('title', entry.get('primary_title', 'No title'))
            year = entry.get('year', entry.get('publication_year', 'N/A'))
            journal = entry.get('journal_name', entry.get('secondary_title', ''))
            citation = f"{idx}. {authors}. **{title}**. "
            if journal:
                citation += f"*{journal}*. "
            citation += f"({year})"
            md_parts.append(citation)
            md_parts.append("")
        return "\n".join(md_parts)

    def parse_bib(self, bib_path: str) -> List[Dict[str, Any]]:
        """Parse BibTeX into structured entries."""
        with open(bib_path, 'r', encoding='utf-8') as f:
            bib_database = bibtexparser.load(f)
        return bib_database.entries
```
#### Dependencies
```txt
# requirements.txt
bibtexparser>=1.4.0
rispy>=0.7.0
```
---
### 8. Medical Data Formats (Extensions)
#### DICOM metadata extraction
```python
# dicom_processor.py
from typing import Any, Dict

import pydicom


class DicomProcessor:
    """Metadata processor for DICOM medical images."""

    def extract_metadata(self, dcm_path: str) -> Dict[str, Any]:
        """
        Extract DICOM metadata.

        Args:
            dcm_path: path to the DICOM file

        Returns:
            A metadata dictionary
        """
        dcm = pydicom.dcmread(dcm_path)
        # Pull out the key attributes (missing tags become None)
        name = getattr(dcm, 'PatientName', None)
        metadata = {
            "patient_name": str(name) if name is not None else None,
            "patient_id": getattr(dcm, 'PatientID', None),
            "study_date": getattr(dcm, 'StudyDate', None),
            "modality": getattr(dcm, 'Modality', None),
            "study_description": getattr(dcm, 'StudyDescription', None),
            "series_description": getattr(dcm, 'SeriesDescription', None),
            "institution_name": getattr(dcm, 'InstitutionName', None),
            "manufacturer": getattr(dcm, 'Manufacturer', None),
        }
        return {k: v for k, v in metadata.items() if v is not None}

    def to_markdown(self, dcm_path: str) -> str:
        """Render DICOM metadata as Markdown."""
        metadata = self.extract_metadata(dcm_path)
        md_parts = ["# DICOM Image Information\n"]
        for key, value in metadata.items():
            label = key.replace('_', ' ').title()
            md_parts.append(f"- **{label}**: {value}")
        return "\n".join(md_parts)
```
#### Statistical-software data formats
```python
# stats_data_processor.py
import pandas as pd


class StatsDataProcessor:
    """Processor for statistical-software data formats (SAS/SPSS/Stata)."""

    def sas_to_markdown(
        self,
        sas_path: str,
        max_rows: int = 1000
    ) -> str:
        """Convert SAS data to Markdown."""
        df = pd.read_sas(sas_path)
        return self._df_to_markdown(df, "SAS", sas_path, max_rows)

    def spss_to_markdown(
        self,
        sav_path: str,
        max_rows: int = 1000
    ) -> str:
        """Convert SPSS data to Markdown."""
        import pyreadstat

        df, meta = pyreadstat.read_sav(sav_path)
        md = self._df_to_markdown(df, "SPSS", sav_path, max_rows)
        # Append variable labels
        if meta.column_labels:
            md += "\n\n## Variable Labels\n\n"
            for col, label in zip(meta.column_names, meta.column_labels):
                if label:
                    md += f"- **{col}**: {label}\n"
        return md

    def stata_to_markdown(
        self,
        dta_path: str,
        max_rows: int = 1000
    ) -> str:
        """Convert Stata data to Markdown."""
        df = pd.read_stata(dta_path)
        return self._df_to_markdown(df, "Stata", dta_path, max_rows)

    def _df_to_markdown(
        self,
        df: pd.DataFrame,
        source_type: str,
        file_path: str,
        max_rows: int
    ) -> str:
        """Shared DataFrame-to-Markdown helper."""
        truncated = len(df) > max_rows
        if truncated:
            df = df.head(max_rows)
        md_parts = [
            f"# {source_type} Data\n",
            f"**File**: {file_path}",
            f"**Rows**: {len(df)} | **Columns**: {len(df.columns)}",
        ]
        if truncated:
            md_parts.append(f"**Note**: data truncated; only the first {max_rows} rows are shown")
        md_parts.extend(["", df.to_markdown(index=False)])
        return "\n".join(md_parts)
```
#### Dependencies
```txt
# requirements.txt
pydicom>=2.4.0     # DICOM
pyreadstat>=1.2.0  # SPSS/SAS/Stata
sas7bdat>=2.2.3    # SAS format support
```
---
## 🏗️ Unified Processor Architecture
### Entry-point class
```python
# document_processor.py
from pathlib import Path
from typing import Any, Dict, Optional


class DocumentProcessor:
    """
    Unified document processor.

    Detects the file type and dispatches to the matching processor.
    """

    # Extension-to-processor mapping
    PROCESSOR_MAP = {
        '.pdf': 'pdf',
        '.docx': 'docx',
        '.doc': 'docx',
        '.pptx': 'pptx',
        '.xlsx': 'excel',
        '.xls': 'excel',
        '.csv': 'csv',
        '.txt': 'text',
        '.md': 'text',
        '.html': 'html',
        '.htm': 'html',
        '.bib': 'bibtex',
        '.ris': 'ris',
        '.json': 'json',
        '.jsonl': 'jsonl',
        '.xml': 'xml',
        '.dcm': 'dicom',
        '.sas7bdat': 'sas',
        '.sav': 'spss',
        '.dta': 'stata',
    }

    def __init__(self, config: Optional[Dict[str, Any]] = None):
        self.config = config or {}
        self._init_processors()

    def _init_processors(self):
        """Instantiate the individual processors."""
        from .csv_processor import CsvProcessor
        from .docx_processor import DocxProcessor
        from .excel_processor import ExcelProcessor
        from .html_processor import HtmlProcessor
        from .pdf_processor import PdfProcessor
        from .pptx_processor import PptxProcessor
        from .reference_processor import ReferenceProcessor

        self.processors = {
            'pdf': PdfProcessor(),
            'docx': DocxProcessor(),
            'pptx': PptxProcessor(),
            'excel': ExcelProcessor(),
            'csv': CsvProcessor(),
            'html': HtmlProcessor(),
            'reference': ReferenceProcessor(),
        }

    def to_markdown(self, file_path: str, **kwargs) -> str:
        """
        Convert any supported document to Markdown.

        Args:
            file_path: path to the file
            **kwargs: forwarded to the concrete processor

        Returns:
            Markdown-formatted text

        Raises:
            ValueError: unsupported file format
        """
        ext = Path(file_path).suffix.lower()
        processor_type = self.PROCESSOR_MAP.get(ext)
        if not processor_type:
            raise ValueError(f"Unsupported file format: {ext}")
        # Plain-text files are read directly
        if processor_type == 'text':
            return self._read_text(file_path)
        # JSON files
        if processor_type in ('json', 'jsonl'):
            return self._read_json(file_path, processor_type == 'jsonl')
        # Citation files
        if processor_type in ('bibtex', 'ris'):
            ref_processor = self.processors['reference']
            if processor_type == 'bibtex':
                return ref_processor.bib_to_markdown(file_path)
            else:
                return ref_processor.ris_to_markdown(file_path)
        # Everything else
        processor = self.processors.get(processor_type)
        if processor:
            return processor.to_markdown(file_path, **kwargs)
        raise ValueError(f"Processor not implemented: {processor_type}")

    def _read_text(self, file_path: str) -> str:
        """Read a plain-text file."""
        with open(file_path, 'r', encoding='utf-8') as f:
            return f.read()

    def _read_json(self, file_path: str, is_jsonl: bool = False) -> str:
        """Read a JSON/JSONL file and render it as Markdown."""
        import json

        with open(file_path, 'r', encoding='utf-8') as f:
            if is_jsonl:
                data = [json.loads(line) for line in f]
            else:
                data = json.load(f)
        # Emit as a fenced JSON code block
        return f"```json\n{json.dumps(data, ensure_ascii=False, indent=2)}\n```"

    def get_supported_formats(self) -> list:
        """List every supported extension."""
        return list(self.PROCESSOR_MAP.keys())

    def is_supported(self, file_path: str) -> bool:
        """Check whether a file is supported."""
        ext = Path(file_path).suffix.lower()
        return ext in self.PROCESSOR_MAP
```
---
## 📦 Dependency Manifest
### Core dependencies (minimal build)
```txt
# requirements.txt - document processing engine (minimal build)
# Size estimate: 200-300 MB compressed Docker image

# ===== Core parsing libraries =====
pymupdf4llm>=0.0.17      # PDF (pulls in pymupdf)
mammoth>=1.8.0           # Word
python-pptx>=1.0.2       # PPT
pandas>=2.2.0            # Excel/CSV
openpyxl>=3.1.5          # Excel engine
tabulate>=0.9.0          # Markdown table output

# ===== Utilities =====
chardet>=5.2.0           # encoding detection

# ===== Web service =====
fastapi>=0.109.0         # API framework
uvicorn>=0.27.0          # ASGI server
python-multipart>=0.0.9  # file uploads
```
### Extended dependencies (install on demand)
```txt
# ===== HTML processing (P1) =====
beautifulsoup4>=4.12.0
markdownify>=0.11.6
lxml>=4.9.0

# ===== Citations (P1) =====
bibtexparser>=1.4.0
rispy>=0.7.0

# ===== Medical data (P2, optional) =====
# pydicom>=2.4.0    # DICOM
# pyreadstat>=1.2.0 # SPSS/SAS
```
### Image size comparison
| Option | Image size | Notes |
|------|----------|------|
| **Minimal** (recommended) | ~200-300 MB | core dependencies; covers 95% of scenarios |
| Full | ~400-500 MB | adds HTML, citation, and medical formats |
| ~~With OCR~~ | ~~1.5 GB+~~ | ❌ not recommended; scanned-document support is out of scope |
---
## 🚀 Deployment Notes
### Docker configuration
```dockerfile
# Dockerfile
FROM python:3.11-slim
WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
### Resource sizing
| Setting | Recommended | Notes |
|--------|--------|------|
| **CPU** | 0.5 cores | the minimal build is light on CPU |
| **Memory** | 512 MB | enough for everyday documents |
| **Disk** | 1 GB | image + temporary files |
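If the service is run with Docker Compose, the sizing above might be expressed as limits in the compose file (the `doc-engine` service name and image tag are illustrative; `deploy.resources.limits` is honored by recent `docker compose` versions):

```yaml
# docker-compose.yml (illustrative names)
services:
  doc-engine:
    image: doc-engine:latest
    ports:
      - "8000:8000"
    deploy:
      resources:
        limits:
          cpus: "0.5"
          memory: 512M
    tmpfs:
      - /tmp   # temporary parse artifacts stay off the image layer
```

With plain `docker run`, the equivalent flags are `--cpus 0.5 --memory 512m`.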
### User guidance
> 💡 **Suggested notice in the upload UI:**
>
> "Currently only digital-native PDFs are supported; scanned or image-only documents are not."
This is far more cost-effective than building complex OCR into the backend.
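The same guidance can be enforced server-side with a cheap extension check before any parsing work begins. A sketch (the helper name and the `SUPPORTED_EXTENSIONS` set, which mirrors the minimal build of the coverage matrix, are illustrative):

```python
from pathlib import Path

# Extensions the minimal build accepts (mirrors the coverage matrix; illustrative)
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".xls",
    ".csv", ".txt", ".md", ".html", ".htm", ".bib", ".ris",
}


def check_upload(filename: str) -> tuple[bool, str]:
    """Return (accepted, message) for an uploaded filename."""
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        return False, f"Unsupported format: {ext or '(no extension)'}"
    return True, "ok"


print(check_upload("paper.pdf"))   # accepted
print(check_upload("scan.tiff"))   # rejected: image formats imply a scan
```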
---
## 🎯 Usage Examples
### Basic usage
```python
from document_processor import DocumentProcessor

# Create the processor
processor = DocumentProcessor()

# Convert a PDF
md = processor.to_markdown("research_paper.pdf")

# Convert a Word document
md = processor.to_markdown("report.docx")

# Convert an Excel workbook (specific sheets)
md = processor.to_markdown("data.xlsx", sheet_names=["Sheet1", "Results"])

# Check format support first
if processor.is_supported("unknown.xyz"):
    md = processor.to_markdown("unknown.xyz")
else:
    print("Unsupported format")
```
### Batch processing
```python
from pathlib import Path

from document_processor import DocumentProcessor

processor = DocumentProcessor()


def batch_convert(input_dir: str, output_dir: str):
    """Convert every supported document in a directory."""
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    for file in input_path.iterdir():
        if processor.is_supported(str(file)):
            try:
                md = processor.to_markdown(str(file))
                # Save the result as a .md file
                output_file = output_path / f"{file.stem}.md"
                output_file.write_text(md, encoding='utf-8')
                print(f"✅ {file.name} -> {output_file.name}")
            except Exception as e:
                print(f"❌ {file.name}: {e}")


# Usage
batch_convert("./documents", "./markdown_output")
```
---
## 📊 Integration with the PKB Knowledge Base
### Ingestion pipeline
```
Upload document
    ↓
DocumentProcessor.to_markdown()
    ↓
Text chunking (ChunkService)
    ↓
Embedding (EmbeddingService)
    ↓
Store in PostgreSQL + pgvector
```
### Example
```python
async def ingest_document(file_path: str, knowledge_base_id: str):
    """End-to-end document ingestion."""
    # 1. Convert to Markdown
    processor = DocumentProcessor()
    markdown_content = processor.to_markdown(file_path)
    # 2. Chunk
    chunks = chunk_service.split_text(
        markdown_content,
        chunk_size=512,
        overlap=50
    )
    # 3. Embed
    embeddings = await embedding_service.embed_batch(
        [chunk.text for chunk in chunks]
    )
    # 4. Store
    for chunk, embedding in zip(chunks, embeddings):
        await prisma.ekbChunk.create({
            "knowledgeBaseId": knowledge_base_id,
            "content": chunk.text,
            "embedding": embedding,
            "metadata": {
                "source_file": file_path,
                "chunk_index": chunk.index
            }
        })
```
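The `chunk_service` above lives elsewhere in the codebase (the actual ChunkService does smarter Markdown-aware splitting). As a rough illustration of what `split_text(chunk_size=512, overlap=50)` implies, a minimal character-based sliding-window chunker might look like this (all names here are hypothetical, not the real ChunkService API):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    index: int
    text: str


def split_text(text: str, chunk_size: int = 512, overlap: int = 50) -> List[Chunk]:
    """Character-based sliding window: each chunk overlaps the previous one
    by `overlap` characters so context is not cut dead at a boundary."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for i, start in enumerate(range(0, max(len(text), 1), step)):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append(Chunk(index=i, text=piece))
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = split_text("a" * 1000, chunk_size=512, overlap=50)
print(len(chunks))  # 3 chunks: [0, 512), [462, 974), [924, 1000)
```

Overlapping windows trade a little storage for retrieval robustness: a sentence straddling a boundary still appears whole in at least one chunk.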
---
## 📅 Changelog
### v1.1 (2026-01-20)
**Incorporated colleague feedback:**
- 🔄 **Design philosophy**: emphasized "ultra-lightweight, zero OCR, focus on core formats"
- ✅ **PDF scan detection**: added a character-count threshold that returns an LLM-friendly notice
- ✅ **Word empty-document detection**: empty content returns a friendly notice
- ✅ **Excel context enrichment**: added filename, row/column counts, and truncation notices
- ✅ **Null handling**: `fillna('')` prevents NaN from leaking into output
- 📦 **Dependency bumps**: pymupdf4llm 0.0.17, mammoth 1.8.0, etc.
- 🚀 **Deployment notes**: added resource sizing and image-size estimates
- 💡 **User guidance**: upload UI states that scanned documents are unsupported
### v1.0 (2026-01-20)
- 🆕 Initial version
- 🆕 PDF processing: pymupdf4llm (replacing PyMuPDF + Nougat)
- 🆕 Word processing: mammoth
- 🆕 PPT processing: python-pptx
- 🆕 Excel/CSV processing: pandas
- 🆕 HTML processing: beautifulsoup4 + markdownify
- 🆕 Citation processing: bibtexparser + rispy
- 🆕 Unified processor architecture
---
**Maintainer:** Technical Architect
**Related documents:**
- [PKB Personal Knowledge Base](../../03-业务模块/PKB-个人知识库/00-模块当前状态与开发指南.md)
- [Dify-to-pgvector Migration Plan](../../03-业务模块/PKB-个人知识库/04-开发计划/01-Dify替换为pgvector开发计划.md)