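Survey of Document Conversion and Processing Tools

Overview

This document surveys mainstream tools for document format conversion, PDF generation, content extraction, and document comparison. These tools automate document-processing workflows, make document management more efficient, and ease the pain of converting between formats.

Core value:

Automate format conversion and save manual effort
Process documents in batches
Keep formatting consistent and avoid manual errors
Convert between many document formats

📖 Quick Navigation

By type

Markdown conversion tools - general-purpose document converters
- Pandoc, marked, markdown-it
PDF generation tools - HTML/documents to PDF
- Puppeteer, Playwright, WeasyPrint, Gotenberg
Document extraction tools - extracting content from PDF/Office files
- pdfplumber, Apache Tika, Textract, Marker
Document comparison tools - version comparison and diff analysis
- diff-pdf, Diffchecker, Beyond Compare
Office document processing - Excel/Word/PPT
- Apache POI, python-docx, openpyxl

By use case

Markdown → other formats → Pandoc (universal converter)
Web page → PDF → Puppeteer (for developers) / Gotenberg (API service)
PDF → text/images → pdfplumber (Python) / Marker (AI-enhanced)
Document version comparison → Beyond Compare (GUI) / diff-so-fancy (CLI)
Batch document processing → Apache Tika (enterprise-grade)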

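I. Markdown Conversion Tools

1. Pandoc ⭐⭐⭐⭐⭐

Website: https://pandoc.org/
GitHub: https://github.com/jgm/pandoc
Stars: 34k+
Language: Haskell
License: GPL-2.0
Platforms: Windows, macOS, Linux

Key features

Universal document converter
- 40+ input formats
- 60+ output formats
- Often called the "Swiss Army knife" of document conversion
Supported formats

Input formats:
- Markdown (CommonMark, GitHub, Pandoc extensions)
- reStructuredText
- HTML, LaTeX
- Microsoft Word (.docx)
- EPUB, ODT
- Jupyter Notebook (.ipynb)
Output formats:
- Markdown (various dialects)
- HTML5, LaTeX, PDF
- Microsoft Word (.docx)
- EPUB, ODT
- PowerPoint (.pptx)
- MediaWiki, DokuWiki
- reStructuredText
- AsciiDoc
Powerful customization
- Custom templates
- Lua filters (to extend functionality)
- CSS styling
- Metadata support
Academic writing support
- Citations and bibliographies (BibTeX)
- Cross-references (via filters such as pandoc-crossref)
- Footnotes and endnotes
- Math formulas (LaTeX)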
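Usage examples

# Install
brew install pandoc

# Basic conversion
pandoc input.md -o output.pdf           # Markdown → PDF
pandoc input.md -o output.docx          # Markdown → Word
pandoc input.md -o output.html          # Markdown → HTML
pandoc input.docx -o output.md          # Word → Markdown

# Advanced options
# (comments sit above the command: text after a trailing backslash breaks the line continuation)
#   --pdf-engine=xelatex     use XeLaTeX (handles Chinese text)
#   --toc                    generate a table of contents
#   --number-sections        number the sections
#   -V geometry:margin=1in   page margins
#   -V mainfont="SimSun"     font that covers Chinese characters
#   -V fontsize=12pt         font size
pandoc input.md -o output.pdf \
  --pdf-engine=xelatex \
  --toc \
  --number-sections \
  -V geometry:margin=1in \
  -V mainfont="SimSun" \
  -V fontsize=12pt

# Use a template
pandoc input.md -o output.pdf \
  --template=mytemplate.tex

# Use a filter
pandoc input.md -o output.html \
  --lua-filter=myfilter.lua

# Batch conversion
for file in *.md; do
  pandoc "$file" -o "${file%.md}.pdf"
done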
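
The batch loop above can also be driven from Python when options need to vary per file. The following is a minimal sketch, assuming pandoc is on the PATH; `build_pandoc_args` and `convert_all` are hypothetical helper names, and only flags already shown above are used.

```python
import subprocess
from pathlib import Path


def build_pandoc_args(src, out_ext=".pdf", toc=False, pdf_engine=None):
    """Return the argv list for converting one file with pandoc."""
    src = Path(src)
    args = ["pandoc", str(src), "-o", str(src.with_suffix(out_ext))]
    if pdf_engine:
        args.append(f"--pdf-engine={pdf_engine}")
    if toc:
        args.append("--toc")
    return args


def convert_all(folder, **options):
    """Convert every .md file in folder; stops at the first failing file."""
    for md_file in sorted(Path(folder).glob("*.md")):
        subprocess.run(build_pandoc_args(md_file, **options), check=True)
```

Building the argument list separately from running it keeps the command easy to inspect or log before anything is executed.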

Common conversion scenarios

Scenario 1: Markdown → PDF (academic paper)

pandoc paper.md -o paper.pdf \
  --pdf-engine=xelatex \
  --toc \
  --number-sections \
  --citeproc \
  --bibliography=refs.bib \
  --csl=ieee.csl \
  -V geometry:margin=1in \
  -V fontsize=11pt \
  -V documentclass=article

Scenario 2: Markdown → Word (company documents)

# --reference-doc takes its styles from an existing Word file
pandoc report.md -o report.docx \
  --reference-doc=template.docx \
  --toc \
  --number-sections

Scenario 3: Markdown → HTML (static site)

# --standalone emits a complete HTML document; --css and --template customize it
pandoc index.md -o index.html \
  --standalone \
  --css=style.css \
  --toc \
  --template=template.html

Scenario 4: Markdown → EPUB (e-book)

pandoc book.md -o book.epub \
  --toc \
  --epub-cover-image=cover.jpg \
  --metadata title="My Book" \
  --metadata author="Author Name"

Scenario 5: Jupyter Notebook → Markdown

pandoc notebook.ipynb -o output.md

Lua filter examples

-- remove-images.lua
-- Remove every image
function Image(elem)
  return {}
end

-- Usage
-- pandoc input.md -o output.html --lua-filter=remove-images.lua

-- word-count.lua
-- Count words (whitespace-separated tokens)
local words = 0

function Str(el)
  local _, count = el.text:gsub("%S+", "")
  words = words + count
end

-- Runs after all Str elements have been visited
function Pandoc(doc)
  print("Word count: " .. words)
  return doc
end

Template system

Custom LaTeX template (for PDF output)

% template.tex
\documentclass{article}
\usepackage{xeCJK}
\setCJKmainfont{SimSun}

\title{$title$}
\author{$author$}
\date{$date$}

\begin{document}
\maketitle
$body$
\end{document}

Custom HTML template

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>$title$</title>
  $if(css)$
  <link rel="stylesheet" href="$css$">
  $endif$
</head>
<body>
$if(toc)$
  <nav>$toc$</nav>
$endif$
  $body$
</body>
</html>
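
To make the `$var$` and `$if(var)$ ... $endif$` syntax in these templates concrete, here is a toy model of the substitution in Python. It is only an illustration under simplifying assumptions (no nesting, loops, or partials) and is not Pandoc's template engine; `render_template` is a hypothetical name.

```python
import re


def render_template(template, variables):
    """Toy model of $if(var)$...$endif$ and $var$ substitution (no nesting, no loops)."""
    def expand_if(match):
        name, body = match.group(1), match.group(2)
        # Keep the body only when the variable is set to a non-empty value.
        return body if variables.get(name) else ""

    # Resolve conditionals first, then plain variables; unset variables render as empty.
    text = re.sub(r"\$if\((\w+)\)\$(.*?)\$endif\$", expand_if, template, flags=re.DOTALL)
    return re.sub(r"\$(\w+)\$", lambda m: str(variables.get(m.group(1), "")), text)
```

With only `title` set, the conditional block around the stylesheet link disappears, which is why the HTML template above stays valid when no CSS file is passed.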

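Pros and cons

✅ Pros

Broadest format support (40+ input, 60+ output)
Most capable of the tools surveyed
Thorough support for academic writing
Command-line friendly and easy to automate
Highly customizable (templates, filters)
Fully open source and free
Active community, detailed documentation

❌ Cons

Command-line only, no GUI
Steep learning curve
Depends on external programs (PDF output needs LaTeX)
Some conversions may lose styling
Complex documents may need manual adjustment
Chinese text requires extra configuration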

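Installing dependencies

# PDF support (requires LaTeX)
# macOS
brew install --cask basictex
# or the full distribution
brew install --cask mactex

# Ubuntu
sudo apt install texlive-xetex

# Citation support
# Built into Pandoc 2.11 and later (the --citeproc flag); the separate
# pandoc-citeproc filter is only needed for older Pandoc versions.

# SVG image support in PDF output (optional)
brew install librsvg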

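Performance optimization

# Large documents
# --verbose shows where time is spent. -shell-escape lets LaTeX run external
# programs (some packages need it); enable it only for trusted input.
pandoc large.md -o large.pdf \
  --pdf-engine=xelatex \
  --pdf-engine-opt=-shell-escape \
  --verbose

# Parallel batch conversion (requires GNU parallel)
ls *.md | parallel pandoc {} -o {.}.pdf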

Ratings

Feature completeness: ⭐⭐⭐⭐⭐ (5/5)
Format support: ⭐⭐⭐⭐⭐ (5/5)
Ease of use: ⭐⭐⭐ (3/5 - command line)
Customizability: ⭐⭐⭐⭐⭐ (5/5)
Performance: ⭐⭐⭐⭐ (4/5)
Learning curve: ⭐⭐ (2/5 - fairly steep)

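2. marked ⭐⭐⭐⭐

GitHub: https://github.com/markedjs/marked
Stars: 33k+
Language: JavaScript
License: MIT
Platforms: Node.js, browser

Key features

Lightweight Markdown parser
- Pure JavaScript implementation
- Runs in both the browser and Node.js
- Good performance
Highly extensible
- Custom renderers
- Syntax extensions
- Hooks system
Standards support
- CommonMark specification
- GitHub Flavored Markdown (GFM)
- Configurable strictness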
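Usage examples

// Node.js (recent releases export a named `marked` object)
const { marked } = require('marked');

// Basic usage
const html = marked.parse('# Hello World');
console.log(html); // <h1>Hello World</h1>

// Options
// Older releases also accepted headerIds, mangle and sanitize; recent major
// versions have removed them, so check the options your installed version documents.
marked.setOptions({
  gfm: true,              // GitHub Flavored Markdown
  breaks: true            // turn single line breaks into <br>
});

// Asynchronous parsing (returns a Promise when async is enabled)
marked.parse('# Async', { async: true }).then((asyncHtml) => {
  console.log(asyncHtml);
});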

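Custom renderer:

// The positional arguments below match older marked releases; newer releases
// pass a single token object to renderer methods, so check your installed version.
const renderer = new marked.Renderer();

// Custom heading rendering
renderer.heading = (text, level) => {
  const escapedText = text.toLowerCase().replace(/[^\w]+/g, '-');
  return `
    <h${level} id="${escapedText}">
      <a href="#${escapedText}">${text}</a>
    </h${level}>
  `;
};

// Custom code blocks (code is inserted as-is; escape it if the input is untrusted)
renderer.code = (code, language) => {
  return `
    <pre class="language-${language}">
      <code>${code}</code>
    </pre>
  `;
};

marked.setOptions({ renderer });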
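
The heading renderer above builds anchor IDs with `text.toLowerCase().replace(/[^\w]+/g, '-')`. The Python sketch below mirrors that expression so its behavior can be checked on sample headings; `heading_slug` is a hypothetical name, and `re.ASCII` is used because `\w` in JavaScript matches only ASCII word characters.

```python
import re


def heading_slug(text):
    """Lower-case the heading and replace each run of non-word characters with '-'."""
    # re.ASCII restricts \w to [A-Za-z0-9_], matching the JavaScript behavior.
    return re.sub(r"[^\w]+", "-", text.lower(), flags=re.ASCII)
```

Trying a few headings shows two limits of this simple rule: trailing punctuation leaves a trailing hyphen, and non-ASCII text such as Chinese collapses to a hyphen, so different headings can end up with the same ID.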

Syntax extensions:

// Custom extension
// (getEmoji is a placeholder for your own lookup function; it is not part of marked)
marked.use({
  extensions: [{
    name: 'emoji',
    level: 'inline',
    start(src) { return src.match(/:/)?.index; },
    tokenizer(src) {
      const match = src.match(/^:(\w+):/);
      if (match) {
        return {
          type: 'emoji',
          raw: match[0],
          name: match[1]
        };
      }
    },
    renderer(token) {
      return `<span class="emoji">${getEmoji(token.name)}</span>`;
    }
  }]
});

Browser usage

<!DOCTYPE html>
<html>
<head>
  <script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
</head>
<body>
  <div id="content"></div>
  <script>
    const markdown = '# Hello\n\nThis is **bold**';
    document.getElementById('content').innerHTML = marked.parse(markdown);
  </script>
</body>
</html>

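Pros and cons

✅ Pros

Lightweight and fast
Runs in both the browser and Node.js
Highly customizable
Simple, easy-to-use API
Many community plugins
Good performance

❌ Cons

Markdown → HTML only
No other format conversions
Fewer features than Pandoc

Ratings

Performance: ⭐⭐⭐⭐⭐ (5/5)
Ease of use: ⭐⭐⭐⭐⭐ (5/5)
Extensibility: ⭐⭐⭐⭐⭐ (5/5)
Feature scope: ⭐⭐⭐ (3/5 - MD→HTML only)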

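3. markdown-it ⭐⭐⭐⭐⭐

GitHub: https://github.com/markdown-it/markdown-it
Stars: 18k+
Language: JavaScript
License: MIT

Key features

Plugin-based architecture
- 100+ official and community plugins
- Easy syntax extension
- Highly modular
Performance and safety
- Very fast
- Built-in XSS safeguards (raw HTML is disabled by default)
- Configurable safety level
CommonMark compliant
- Follows the specification strictly
- Extended syntax can be enabled
- Supports GFM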
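Usage examples

const MarkdownIt = require('markdown-it');
const hljs = require('highlight.js');

// Basic usage
const md = new MarkdownIt();
const html = md.render('# Hello World');

// Presets
const mdCommonMark = new MarkdownIt('commonmark');  // strict CommonMark
const mdZero = new MarkdownIt('zero');              // all rules disabled
const mdDefault = new MarkdownIt('default');        // default configuration

// Custom configuration
const mdCustom = new MarkdownIt({
  html: true,           // allow raw HTML tags
  linkify: true,        // autolink URL-like text
  typographer: true,    // smart quotes and other typographic replacements
  breaks: false,        // set to true to turn line breaks into <br>
  highlight: function (str, lang) {
    // Syntax highlighting; hljs.highlight throws for unknown languages
    return hljs.highlight(str, { language: lang }).value;
  }
});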

Plugin ecosystem:

// Tables: GFM-style tables are built into the default preset; no plugin is needed

// Emoji (v3+ of the plugin uses named exports: require('markdown-it-emoji').full)
md.use(require('markdown-it-emoji'));

// Footnotes
md.use(require('markdown-it-footnote'));

// Task lists
md.use(require('markdown-it-task-lists'));

// Math formulas
md.use(require('markdown-it-katex'));

// Containers (custom blocks)
md.use(require('markdown-it-container'), 'warning', {
  render: (tokens, idx) => {
    if (tokens[idx].nesting === 1) {
      return '<div class="warning">\n';
    } else {
      return '</div>\n';
    }
  }
});

Custom plugin:

// Simple plugin example
function myPlugin(md, options) {
  md.core.ruler.after('inline', 'my-rule', (state) => {
    // process state.tokens here
  });
}

md.use(myPlugin, { option: 'value' });

Recommended plugins

// 1. markdown-it-anchor - heading anchors
// (recent versions configure permalinks through anchor.permalink.* helpers;
//  the older permalink: true / permalinkSymbol options apply to legacy versions)
const anchor = require('markdown-it-anchor');
md.use(anchor, {
  permalink: anchor.permalink.linkInsideHeader({
    symbol: '§',
    placement: 'before'
  })
});

// 2. markdown-it-toc-done-right - table of contents
md.use(require('markdown-it-toc-done-right'));

// 3. markdown-it-attrs - add attributes
md.use(require('markdown-it-attrs'));
// Usage: # Header {#custom-id .class}

// 4. markdown-it-abbr - abbreviations
md.use(require('markdown-it-abbr'));

// 5. markdown-it-deflist - definition lists
md.use(require('markdown-it-deflist'));

// 6. markdown-it-mark - highlighted text
md.use(require('markdown-it-mark'));
// Usage: ==marked text==

// 7. markdown-it-ins - inserted text
md.use(require('markdown-it-ins'));
// Usage: ++inserted text++

// 8. markdown-it-sub / markdown-it-sup - subscript and superscript
md.use(require('markdown-it-sub'));    // H~2~O
md.use(require('markdown-it-sup'));    // 29^th^

Complete example

const MarkdownIt = require('markdown-it');
const hljs = require('highlight.js');

const md = new MarkdownIt({
  html: true,
  linkify: true,
  typographer: true,
  highlight: (str, lang) => {
    if (lang && hljs.getLanguage(lang)) {
      try {
        return hljs.highlight(str, { language: lang }).value;
      } catch (__) {}
    }
    return '';
  }
})
  .use(require('markdown-it-emoji'))
  .use(require('markdown-it-footnote'))
  .use(require('markdown-it-anchor'))
  .use(require('markdown-it-toc-done-right'))
  .use(require('markdown-it-container'), 'warning')
  .use(require('markdown-it-container'), 'tip');

const markdown = `
# Table of Contents
\${toc}

## Introduction
This is a **test** with :smile: emoji.

::: warning
This is a warning!
:::

[^1]: This is a footnote.

## Code
\`\`\`javascript
console.log('Hello');
\`\`\`
`;

const html = md.render(markdown);

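Pros and cons

✅ Pros

Richest plugin ecosystem
Good performance
Highly configurable
Safe defaults
Thorough documentation
Active community

❌ Cons

Markdown → HTML only
Moderate learning curve for advanced usage

Ratings

Plugin ecosystem: ⭐⭐⭐⭐⭐ (5/5)
Performance: ⭐⭐⭐⭐⭐ (5/5)
Extensibility: ⭐⭐⭐⭐⭐ (5/5)
Ease of use: ⭐⭐⭐⭐ (4/5)

Markdown tool comparison

Tool	Language	Format support	Best for	Rating
Pandoc	Haskell	40+ input / 60+ output	Universal conversion, academic writing	⭐⭐⭐⭐⭐
marked	JavaScript	MD→HTML	Web apps, live preview	⭐⭐⭐⭐
markdown-it	JavaScript	MD→HTML	Extensibility, rich plugin ecosystem	⭐⭐⭐⭐⭐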

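II. PDF Generation Tools

1. Puppeteer ⭐⭐⭐⭐⭐

GitHub: https://github.com/puppeteer/puppeteer
Stars: 88k+
Language: JavaScript/TypeScript
License: Apache 2.0
Maintainer: Google Chrome team

Key features

Headless browser
- Based on Chromium
- Full browser environment
- Supports modern web features
PDF generation
- HTML/CSS → PDF
- Handles complex layouts
- Media queries (@media print)
- Headers and footers
- Page numbers
- Background graphics
Screenshots
- Full-page screenshots
- Element screenshots
- Viewport screenshots
- Transparent backgrounds
Automation
- Form filling
- Clicking
- Waiting for elements
- Executing JavaScript

Installation

# Install Puppeteer (downloads a bundled browser)
npm install puppeteer

# Or use puppeteer-core, which downloads nothing and drives a browser already installed on the system
npm install puppeteer-core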
Basic PDF generation

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Generate from a URL
  await page.goto('https://example.com', {
    waitUntil: 'networkidle0'
  });

  await page.pdf({
    path: 'output.pdf',
    format: 'A4'
  });

  await browser.close();
})();

Advanced PDF options

await page.pdf({
  path: 'document.pdf',
  format: 'A4',                    // paper size
  landscape: false,                // true for landscape orientation
  printBackground: true,           // print background graphics
  margin: {                        // page margins
    top: '20mm',
    right: '20mm',
    bottom: '20mm',
    left: '20mm'
  },
  displayHeaderFooter: true,       // show header and footer
  headerTemplate: `
    <div style="font-size: 10px; text-align: center; width: 100%;">
      <span class="title"></span>
    </div>
  `,
  footerTemplate: `
    <div style="font-size: 10px; text-align: center; width: 100%;">
      Page <span class="pageNumber"></span> of <span class="totalPages"></span>
    </div>
  `,
  preferCSSPageSize: true,         // prefer the page size declared in CSS @page
  scale: 1,                        // rendering scale
  pageRanges: '1-5, 8, 11-13'      // pages to include
});
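
The `pageRanges` string above selects individual pages and inclusive ranges. As an illustration of what such a string means (this is not Chromium's own parser), the following Python sketch expands it into page numbers; `parse_page_ranges` is a hypothetical helper.

```python
def parse_page_ranges(spec):
    """Expand a spec such as '1-5, 8, 11-13' into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # An inclusive range such as '11-13'
            start, end = (int(n) for n in part.split("-", 1))
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)
```

A helper like this is handy for validating user-supplied ranges before passing the original string on to the PDF call.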

Generating from an HTML string

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const html = `
    <!DOCTYPE html>
    <html>
    <head>
      <style>
        @page {
          size: A4;
          margin: 20mm;
        }
        body {
          font-family: Arial, sans-serif;
        }
        .header {
          background: #333;
          color: white;
          padding: 20px;
        }
        .content {
          margin: 20px 0;
        }
        @media print {
          .no-print { display: none; }
        }
      </style>
    </head>
    <body>
      <div class="header">
        <h1>Document Title</h1>
      </div>
      <div class="content">
        <p>This is the content.</p>
      </div>
    </body>
    </html>
  `;

  await page.setContent(html, {
    waitUntil: 'networkidle0'
  });

  await page.pdf({
    path: 'output.pdf',
    format: 'A4',
    printBackground: true
  });

  await browser.close();
})();

Practical examples

Scenario 1: Markdown → PDF (via HTML)

const puppeteer = require('puppeteer');
const { marked } = require('marked');
const fs = require('fs');

async function markdownToPDF(mdFile, pdfFile) {
  const markdown = fs.readFileSync(mdFile, 'utf8');
  const html = marked.parse(markdown);

  const fullHTML = `
    <!DOCTYPE html>
    <html>
    <head>
      <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/github-markdown-css/5.1.0/github-markdown.min.css">
      <style>
        .markdown-body { padding: 40px; }
        @page { margin: 20mm; }
      </style>
    </head>
    <body class="markdown-body">
      ${html}
    </body>
    </html>
  `;

  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setContent(fullHTML, { waitUntil: 'networkidle0' });
  await page.pdf({ path: pdfFile, format: 'A4' });
  await browser.close();
}

markdownToPDF('README.md', 'README.pdf');

Scenario 2: PDF with a table of contents

async function generatePDFWithTOC() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('http://localhost:3000/document', {
    waitUntil: 'networkidle0'
  });

  // Build the table of contents with in-page JavaScript
  await page.evaluate(() => {
    const headings = document.querySelectorAll('h1, h2, h3');
    const toc = document.createElement('div');
    toc.id = 'toc';

    headings.forEach((heading, index) => {
      heading.id = `heading-${index}`;
      const link = document.createElement('a');
      link.href = `#heading-${index}`;
      link.textContent = heading.textContent;
      link.className = heading.tagName.toLowerCase();
      toc.appendChild(link);
      toc.appendChild(document.createElement('br'));
    });

    document.body.insertBefore(toc, document.body.firstChild);
  });

  await page.pdf({
    path: 'document-with-toc.pdf',
    format: 'A4',
    printBackground: true
  });

  await browser.close();
}

Scenario 3: Batch-generating invoice PDFs

async function generateInvoices(invoices) {
  const browser = await puppeteer.launch();

  for (const invoice of invoices) {
    const page = await browser.newPage();

    const html = `
      <!DOCTYPE html>
      <html>
      <body>
        <h1>Invoice #${invoice.id}</h1>
        <p>Customer: ${invoice.customer}</p>
        <p>Amount: $${invoice.amount}</p>
        <table>
          ${invoice.items.map(item => `
            <tr>
              <td>${item.name}</td>
              <td>$${item.price}</td>
            </tr>
          `).join('')}
        </table>
      </body>
      </html>
    `;

    await page.setContent(html);
    await page.pdf({
      path: `invoice-${invoice.id}.pdf`,
      format: 'A4'
    });

    await page.close();
  }

  await browser.close();
}

Scenario 4: Screenshots (extra feature)

async function takeScreenshot() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://example.com');

  // Full-page screenshot
  await page.screenshot({
    path: 'fullpage.png',
    fullPage: true
  });

  // Element screenshot
  const element = await page.$('.logo');
  await element.screenshot({ path: 'logo.png' });

  // Viewport screenshot with a transparent background
  await page.screenshot({
    path: 'viewport.png',
    omitBackground: true  // transparent background
  });

  await browser.close();
}

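Performance optimization

// Skip resources the output does not need.
// Note: blocking stylesheets and fonts changes how the page renders,
// so remove them from this list when the PDF must keep its styling.
await page.setRequestInterception(true);
page.on('request', (req) => {
  if (['image', 'stylesheet', 'font'].includes(req.resourceType())) {
    req.abort();
  } else {
    req.continue();
  }
});

// Set the viewport size
await page.setViewport({
  width: 1200,
  height: 800,
  deviceScaleFactor: 2  // high-resolution screenshots
});

// Cache the browser instance when generating many documents
let browserInstance;
async function getBrowser() {
  if (!browserInstance) {
    browserInstance = await puppeteer.launch();
  }
  return browserInstance;
}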

Docker deployment

FROM node:18-slim

# Install Chromium and fonts
RUN apt-get update && apt-get install -y \
    chromium \
    fonts-ipafont-gothic \
    fonts-wqy-zenhei \
    fonts-thai-tlwg \
    fonts-kacst \
    fonts-freefont-ttf \
    --no-install-recommends \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production

COPY . .

CMD ["node", "server.js"]

// server.js - launch options for the Docker environment
const browser = await puppeteer.launch({
  executablePath: '/usr/bin/chromium',
  args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-dev-shm-usage'
  ]
});

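Pros and cons

✅ Pros

Full browser environment with support for modern web features
High-quality PDF output
Handles complex CSS layouts
Can execute JavaScript
Maintained by Google
Thorough documentation
Active community

❌ Cons

Heavy resource usage (launches a browser)
Relatively slow
Needs Chromium (large Docker images)
Concurrent generation requires managing browser instances

Ratings

Feature completeness: ⭐⭐⭐⭐⭐ (5/5)
PDF quality: ⭐⭐⭐⭐⭐ (5/5)
Performance: ⭐⭐⭐ (3/5)
Ease of use: ⭐⭐⭐⭐ (4/5)
Resource usage: ⭐⭐ (2/5)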

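2. Playwright ⭐⭐⭐⭐⭐

GitHub: https://github.com/microsoft/playwright
Stars: 66k+
Language: TypeScript
License: Apache 2.0
Maintainer: Microsoft

Key features

Multi-browser support
- Chromium
- Firefox
- WebKit (Safari)
- One unified API
Modern API
- Auto-waiting
- Retry mechanisms
- Better error handling
- Native TypeScript support
PDF generation
- Similar to Puppeteer
- page.pdf() works with Chromium only, not Firefox or WebKit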
Usage examples

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.goto('https://example.com');

  await page.pdf({
    path: 'output.pdf',
    format: 'A4'
  });

  await browser.close();
})();

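Playwright vs Puppeteer

Feature	Playwright	Puppeteer
Browser support	Chromium, Firefox, WebKit	Chromium (primary), Firefox
PDF generation	Chromium only	Chromium
Maintainer	Microsoft	Google
API design	More modern	Classic
Auto-waiting	Built in	Mostly manual
Performance	Comparable (both drive a real browser)	Comparable
Community	Growing quickly	Mature

Ratings

Feature completeness: ⭐⭐⭐⭐⭐ (5/5)
Multi-browser support: ⭐⭐⭐⭐⭐ (5/5)
Modern design: ⭐⭐⭐⭐⭐ (5/5)
PDF quality: ⭐⭐⭐⭐⭐ (5/5)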

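3. WeasyPrint ⭐⭐⭐⭐

Website: https://weasyprint.org/
GitHub: https://github.com/Kozea/WeasyPrint
Stars: 7k+
Language: Python
License: BSD

Key features

PDF generation in Python
- Written in Python (relies on native libraries such as Pango for text layout)
- No browser required
- Low resource usage
CSS support
- CSS Paged Media
- CSS Generated Content
- Print-oriented CSS
Lightweight and fast
- Typically faster to start than a headless browser such as Puppeteer
- Low resource usage
- Well suited to batch generation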
Usage examples

from weasyprint import HTML, CSS

# Generate from a URL
HTML('https://example.com').write_pdf('output.pdf')

# Generate from an HTML string
html_content = """
<!DOCTYPE html>
<html>
<head>
  <style>
    @page {
      size: A4;
      margin: 2cm;
    }
    body {
      font-family: Arial, sans-serif;
    }
  </style>
</head>
<body>
  <h1>Title</h1>
  <p>Content</p>
</body>
</html>
"""

HTML(string=html_content).write_pdf('output.pdf')

# Add custom CSS
HTML('input.html').write_pdf(
    'output.pdf',
    stylesheets=[CSS(string='body { font-size: 14pt; }')]
)

Advanced features:

from weasyprint import HTML, CSS
from weasyprint.text.fonts import FontConfiguration

# Font configuration (Chinese text support)
font_config = FontConfiguration()

html = HTML(string=html_content)
css = CSS(string='''
    @font-face {
      font-family: 'SimSun';
      src: url('/path/to/simsun.ttf');
    }
    body {
      font-family: 'SimSun', sans-serif;
    }
''', font_config=font_config)

html.write_pdf('output.pdf', stylesheets=[css], font_config=font_config)

Headers and footers:

html_with_header = """
<style>
  @page {
    @top-center {
      content: "Document Title";
    }
    @bottom-center {
      content: "Page " counter(page) " of " counter(pages);
    }
  }
</style>
"""

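Pros and cons

✅ Pros

No browser needed, low resource usage
Fast (roughly 3-5x faster than Puppeteer by this report's estimate; no benchmark is given)
Fits well into the Python ecosystem
Good CSS Paged Media support
Well suited to batch generation on servers

❌ Cons

CSS support is less complete than a browser's
No JavaScript support
Complex layouts may render incorrectly
Chinese fonts must be configured manually

Ratings

Performance: ⭐⭐⭐⭐⭐ (5/5)
Resource usage: ⭐⭐⭐⭐⭐ (5/5)
CSS support: ⭐⭐⭐ (3/5)
Ease of use: ⭐⭐⭐⭐ (4/5)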

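4. Gotenberg ⭐⭐⭐⭐⭐

Website: https://gotenberg.dev/
GitHub: https://github.com/gotenberg/gotenberg
Stars: 8k+
Language: Go
License: MIT

Key features

Docker-based API service
- HTTP API
- Accepts multiple input formats
- Language-agnostic
Multiple engines
- Chromium (web pages → PDF)
- LibreOffice (Office documents → PDF)
- PDFtk (merging PDFs)
Production-ready
- Scales horizontally
- Health checks
- Prometheus metrics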
Usage examples

# Start Gotenberg
docker run -p 3000:3000 gotenberg/gotenberg:7

# HTML → PDF (curl); the uploaded file must be named index.html
curl \
  --request POST 'http://localhost:3000/forms/chromium/convert/html' \
  --form 'files=@"index.html"' \
  -o output.pdf

# URL → PDF
curl \
  --request POST 'http://localhost:3000/forms/chromium/convert/url' \
  --form 'url="https://example.com"' \
  -o output.pdf

# Markdown → PDF
# The request also needs an index.html that pulls in the Markdown file
# with the {{ toHTML "document.md" }} template function.
curl \
  --request POST 'http://localhost:3000/forms/chromium/convert/markdown' \
  --form 'files=@"index.html"' \
  --form 'files=@"document.md"' \
  -o output.pdf

JavaScript/Python clients:

// Node.js
const FormData = require('form-data');
const fs = require('fs');
const axios = require('axios');

async function convertToPDF() {
  const form = new FormData();
  form.append('files', fs.createReadStream('index.html'));

  const response = await axios.post(
    'http://localhost:3000/forms/chromium/convert/html',
    form,
    {
      headers: form.getHeaders(),
      responseType: 'stream'
    }
  );

  response.data.pipe(fs.createWriteStream('output.pdf'));
}

# Python
import requests

with open('index.html', 'rb') as f:
    response = requests.post(
        'http://localhost:3000/forms/chromium/convert/html',
        files={'files': f}
    )
response.raise_for_status()  # fail loudly instead of saving an error body as a PDF

with open('output.pdf', 'wb') as f:
    f.write(response.content)

Advanced options:

# Paper size and margins are given in inches (8.27 x 11.69 = A4; 0.39 in ≈ 10 mm)
curl \
  --request POST 'http://localhost:3000/forms/chromium/convert/html' \
  --form 'files=@"index.html"' \
  --form 'paperWidth=8.27' \
  --form 'paperHeight=11.69' \
  --form 'marginTop=0.39' \
  --form 'marginBottom=0.39' \
  --form 'marginLeft=0.39' \
  --form 'marginRight=0.39' \
  --form 'preferCssPageSize=true' \
  --form 'printBackground=true' \
  --form 'landscape=false' \
  --form 'scale=1.0' \
  --form 'nativePageRanges=1-5' \
  -o output.pdf
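
The paper and margin values in the request above are inches, while page layouts are often specified in millimetres. A minimal Python sketch of the conversion, reusing the form field names from the curl example; `mm_to_inches` and `a4_form_fields` are hypothetical helpers.

```python
MM_PER_INCH = 25.4


def mm_to_inches(mm):
    """Convert millimetres to inches, rounded to two decimals."""
    return round(mm / MM_PER_INCH, 2)


def a4_form_fields(margin_mm=10):
    """Build form fields for an A4 page with equal margins (all values in inches)."""
    margin = mm_to_inches(margin_mm)
    return {
        "paperWidth": mm_to_inches(210),   # A4 width
        "paperHeight": mm_to_inches(297),  # A4 height
        "marginTop": margin,
        "marginBottom": margin,
        "marginLeft": margin,
        "marginRight": margin,
    }
```

The resulting dictionary can be passed as the `data` argument of a `requests.post` call alongside the `files` upload shown in the Python client above.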

Docker Compose deployment

version: "3.8"

services:
  gotenberg:
    image: gotenberg/gotenberg:7
    ports:
      - "3000:3000"
    environment:
      - LOG_LEVEL=info
    restart: unless-stopped

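Pros and cons

✅ Pros

API service, language-agnostic
Accepts many input formats (HTML, URL, Office, Markdown)
Production-ready and easy to scale
Good performance
Simple Docker deployment
Fully open source and free

❌ Cons

Requires running a Docker service
Relatively high resource usage
Relatively complex configuration

Ratings

Ease of use: ⭐⭐⭐⭐⭐ (5/5 - API)
Feature completeness: ⭐⭐⭐⭐⭐ (5/5)
Production readiness: ⭐⭐⭐⭐⭐ (5/5)
Performance: ⭐⭐⭐⭐ (4/5)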

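PDF generation tool comparison

Tool	Language	Performance	Resource usage	Best for	Rating
Puppeteer	JS	⭐⭐⭐	High	Complex pages, JS rendering	⭐⭐⭐⭐⭐
Playwright	JS	⭐⭐⭐⭐	High	Multi-browser, modern API	⭐⭐⭐⭐⭐
WeasyPrint	Python	⭐⭐⭐⭐⭐	Low	Batch generation, Python projects	⭐⭐⭐⭐
Gotenberg	Go	⭐⭐⭐⭐	Medium	API service, microservices	⭐⭐⭐⭐⭐

Recommendations:

Node.js projects → Puppeteer / Playwright
Python projects → WeasyPrint
Microservice architectures → Gotenberg
Simple documents → WeasyPrint
Complex web pages → Puppeteer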

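III. Document Extraction Tools

1. pdfplumber ⭐⭐⭐⭐⭐

GitHub: https://github.com/jsvine/pdfplumber
Stars: 6k+
Language: Python
License: MIT

Key features

PDF information extraction
- Text extraction
- Table detection
- Layout preservation
- Metadata access
Fine-grained control
- Page-by-page processing
- Extraction by region
- Custom table-detection strategies
Visual debugging
- Generates debug images
- Shows detection results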
Usage examples

import pdfplumber

# Basic text extraction
with pdfplumber.open('document.pdf') as pdf:
    # Extract all text (extract_text() can return None for pages without text)
    full_text = ''
    for page in pdf.pages:
        full_text += page.extract_text() or ''
    print(full_text)

# Extract tables
with pdfplumber.open('report.pdf') as pdf:
    first_page = pdf.pages[0]
    tables = first_page.extract_tables()

    for table in tables:
        for row in table:
            print(row)

# Extract a specific region
with pdfplumber.open('invoice.pdf') as pdf:
    page = pdf.pages[0]

    # Define the region (x0, top, x1, bottom)
    bbox = (100, 100, 500, 500)
    cropped = page.crop(bbox)
    text = cropped.extract_text()
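
`extract_tables()` returns each table as a list of rows, where every cell is a string or `None`. A small sketch, assuming the first row is the header, that turns one such table into dictionaries; `table_to_records` is a hypothetical helper and needs no PDF to try out.

```python
def table_to_records(table):
    """Turn a table (list of rows, first row = header) into a list of dicts."""
    if not table:
        return []
    # Empty cells come back as None, so normalize them to empty strings.
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records
```

Records keyed by column name are easier to validate or export than positional rows, especially when a table has empty cells.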

Advanced features:

# Custom table-detection strategy
table_settings = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "text",
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "join_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
}

with pdfplumber.open('report.pdf') as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables(table_settings)

# Read metadata
with pdfplumber.open('document.pdf') as pdf:
    metadata = pdf.metadata
    print(f"Title: {metadata.get('Title')}")
    print(f"Author: {metadata.get('Author')}")
    print(f"Pages: {len(pdf.pages)}")

# Find text
with pdfplumber.open('document.pdf') as pdf:
    for page in pdf.pages:
        words = page.extract_words()
        for word in words:
            if 'keyword' in word['text'].lower():
                print(f"Found on page {page.page_number}: {word}")

# Visual debugging
with pdfplumber.open('document.pdf') as pdf:
    page = pdf.pages[0]
    im = page.to_image()
    im.debug_tablefinder()  # overlay the detected table lines and cells
    im.save('debug.png')

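Practical examples

Extracting invoice data:

import pdfplumber
import re

def extract_invoice_data(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[0]
        text = page.extract_text() or ''

        # Invoice number (None when the pattern is not found)
        number_match = re.search(r'Invoice #(\d+)', text)
        invoice_number = number_match.group(1) if number_match else None

        # Date (None when the pattern is not found)
        date_match = re.search(r'Date: (\d{4}-\d{2}-\d{2})', text)
        date = date_match.group(1) if date_match else None

        # Line items from the first table
        tables = page.extract_tables()
        items = []
        rows = tables[0][1:] if tables else []  # skip the header row
        for row in rows:
            items.append({
                'description': row[0],
                'quantity': row[1],
                'price': row[2]
            })

        return {
            'invoice_number': invoice_number,
            'date': date,
            'items': items
        }

data = extract_invoice_data('invoice.pdf')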
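
Keeping the regular expressions separate from the PDF access makes them testable on plain strings. The sketch below assumes the same `Invoice #...` and `Date: YYYY-MM-DD` layout as the example above; `parse_invoice_header` is a hypothetical helper.

```python
import re

INVOICE_RE = re.compile(r"Invoice #(\d+)")
DATE_RE = re.compile(r"Date: (\d{4}-\d{2}-\d{2})")


def parse_invoice_header(text):
    """Return invoice number and date from page text; missing fields become None."""
    text = text or ""  # tolerate pages where no text was extracted
    number = INVOICE_RE.search(text)
    date = DATE_RE.search(text)
    return {
        "invoice_number": number.group(1) if number else None,
        "date": date.group(1) if date else None,
    }
```

Returning `None` for a missing field lets the caller decide whether to skip the file, log it, or fall back to another pattern.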

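Pros and cons

✅ Pros

Accurate table detection
Preserves layout well
Visual debugging
Fits well into the Python ecosystem
Detailed documentation

❌ Cons

PDF only
Scanned PDFs need OCR first
Complex layouts may cause problems

Ratings

Table extraction: ⭐⭐⭐⭐⭐ (5/5)
Text extraction: ⭐⭐⭐⭐ (4/5)
Ease of use: ⭐⭐⭐⭐⭐ (5/5)
Documentation: ⭐⭐⭐⭐⭐ (5/5)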

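2. Marker ⭐⭐⭐⭐⭐

GitHub: https://github.com/VikParuchuri/marker
Stars: 17k+
Language: Python
License: GPL-3.0

Key features

AI-enhanced PDF conversion
- PDF → Markdown
- High-quality text extraction
- Preserves formatting and structure
- OCR support
Intelligent recognition
- Tables to Markdown
- Formulas to LaTeX
- Image extraction
- Code block detection
Batch processing
- Command-line tool
- Parallel processing of multiple files
- GPU acceleration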
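Usage examples

# Install
pip install marker-pdf

# The commands and Python API below follow an early Marker release;
# arguments and module paths differ in later versions, so check the README for your version.

# Convert a single file
marker_single /path/to/file.pdf /output/dir

# Batch conversion
marker /input/dir /output/dir

# GPU acceleration (the device is selected through PyTorch, e.g. the TORCH_DEVICE variable)
TORCH_DEVICE=cuda marker /input/dir /output/dir

Python API:

from marker.convert import convert_single_pdf
from marker.models import load_all_models

# Load the models
model_lst = load_all_models()

# Convert
full_text, images, out_meta = convert_single_pdf(
    'document.pdf',
    model_lst
)

print(full_text)  # Markdown text

# Save the extracted images
for filename, image in images.items():
    image.save(f'output/{filename}')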

Output example

Input: a PDF document (containing tables, formulas, and images)

Markdown output:

# Document Title

## Section 1

This is regular text with **bold** and *italic*.

### Table

| Header 1 | Header 2 |
|----------|----------|
| Data 1   | Data 2   |
| Data 3   | Data 4   |

### Formula

$$E = mc^2$$

### Code

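```python
def hello():
    print("Hello, World!")
```

Pros and cons

✅ Pros

AI-enhanced, accurate recognition
Well-formed Markdown output
Strong table, formula, and code recognition
OCR support
Fast with GPU acceleration
Open source and free

❌ Cons

Needs substantial memory and benefits from a GPU
Downloads models on first run
PDF → Markdown only

Ratings

AI capability: ⭐⭐⭐⭐⭐ (5/5)
Output quality: ⭐⭐⭐⭐⭐ (5/5)
Performance: ⭐⭐⭐⭐ (4/5 - needs a GPU)
Ease of use: ⭐⭐⭐⭐ (4/5)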

3. Apache Tika ⭐⭐⭐⭐⭐

Website: https://tika.apache.org/
GitHub: https://github.com/apache/tika
Language: Java
License: Apache 2.0

Key features

Enterprise-grade document parsing
- Supports 1000+ file types
- Unified API
- Apache Software Foundation project
Supported formats
- PDF, Office (Word, Excel, PPT)
- Images (OCR)
- Audio and video (metadata)
- Archives, email
Multiple ways to use it
- Java library
- REST API (Tika Server)
- Command-line tool

Usage examples

Command line:

# Download (adjust the version and URL to a current release from the Tika download page)
wget https://dlcdn.apache.org/tika/tika-app-2.9.0.jar

# Extract text
java -jar tika-app-2.9.0.jar -t document.pdf > output.txt

# Extract metadata
java -jar tika-app-2.9.0.jar -m document.pdf

# Detect the file type
java -jar tika-app-2.9.0.jar -d document.pdf
Tika Server (REST API):

# Start the server (in Tika 2.x the server jar is named tika-server-standard)
java -jar tika-server-standard-2.9.0.jar

# Extract text (curl)
curl -X PUT --data-binary @document.pdf --header "Accept: text/plain" http://localhost:9998/tika

# Python client
import requests

with open('document.pdf', 'rb') as f:
    response = requests.put(
        'http://localhost:9998/tika',
        data=f,
        headers={'Accept': 'text/plain'}
    )
    text = response.text

Java API:

import java.io.File;
import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;

Tika tika = new Tika();

// Extract text
String text = tika.parseToString(new File("document.pdf"));

// Detect the type
String mimeType = tika.detect(new File("unknown.file"));

// Extract metadata (parse fills the Metadata object as a side effect)
Metadata metadata = new Metadata();
tika.parse(new File("document.pdf"), metadata);

Python bindings:

from tika import parser

# Parse the document
parsed = parser.from_file('document.pdf')

# Extracted text
text = parsed['content']

# Extracted metadata
metadata = parsed['metadata']
print(metadata.get('pdf:PDFVersion'))
print(metadata.get('Author'))

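Pros and cons

✅ Pros

Widest format support (1000+)
Enterprise-grade stability
Backed by the Apache Software Foundation
Usable from multiple languages
REST API makes integration easy

❌ Cons

Requires Java
Relatively high resource usage
Table extraction is weaker than pdfplumber's
Relatively complex configuration

Ratings

Format support: ⭐⭐⭐⭐⭐ (5/5)
Stability: ⭐⭐⭐⭐⭐ (5/5)
Ease of use: ⭐⭐⭐ (3/5 - Java)
Enterprise readiness: ⭐⭐⭐⭐⭐ (5/5)

Document extraction tool comparison

Tool	Language	Format support	Standout feature	Recommended for
pdfplumber	Python	PDF	Strong table detection	Extracting tables from PDFs
Marker	Python	PDF	AI-enhanced	PDF→Markdown
Apache Tika	Java	1000+	Enterprise-grade	Parsing many formats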

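IV. Document Comparison Tools

1. diff-pdf ⭐⭐⭐⭐

GitHub: https://github.com/vslavik/diff-pdf
Stars: 900+
Language: C++
License: GPL-2.0

Key features

Visual PDF comparison
- Pixel-level comparison
- Highlights differences
- Can write the differences to a PDF

Usage examples

# Install
brew install diff-pdf

# Compare two PDFs
diff-pdf file1.pdf file2.pdf

# Write a PDF that highlights the differences
diff-pdf --output-diff=diff.pdf file1.pdf file2.pdf

# Only check whether they differ (exit code)
if diff-pdf file1.pdf file2.pdf; then
  echo "PDFs are identical"
else
  echo "PDFs are different"
fi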
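
The same exit-code check can be wrapped in Python, for example inside a test suite. A minimal sketch, assuming `diff-pdf` is on the PATH and exits with 0 when the files match, as the shell example above relies on; `pdfs_identical` and its injectable `run` parameter are hypothetical.

```python
import subprocess


def pdfs_identical(path_a, path_b, run=subprocess.run):
    """Return True when diff-pdf exits with 0, i.e. reports no differences."""
    # `run` is injectable so the function can be exercised without diff-pdf installed.
    result = run(["diff-pdf", path_a, path_b])
    return result.returncode == 0
```

A non-zero exit code may also signal an error such as a missing file, so callers that need to tell the two cases apart should inspect the tool's output as well.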

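2. Beyond Compare ⭐⭐⭐⭐⭐

Website: https://www.scootersoftware.com/
Platforms: Windows, macOS, Linux
License: commercial software

Key features

GUI comparison tool
- Text comparison
- Folder comparison
- Image comparison
- Table comparison (Excel)
Powerful features
- Three-way merge
- Syntax highlighting
- Ignore rules
- Batch operations

Pricing

Standard edition: $60
Pro edition: $90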

3. diff-so-fancy ⭐⭐⭐⭐⭐

GitHub: https://github.com/so-fancy/diff-so-fancy
Stars: 17k+
License: MIT

Key features

Prettier Git diffs
- More readable output
- Syntax highlighting
- Character-level comparison

Usage examples

# Install
brew install diff-so-fancy

# Configure Git
git config --global core.pager "diff-so-fancy | less --tabs=4 -RFX"
git config --global interactive.diffFilter "diff-so-fancy --patch"

# Use
git diff

V. Office Document Processing Tools

1. python-docx ⭐⭐⭐⭐⭐

GitHub: https://github.com/python-openxml/python-docx
Stars: 4k+
Language: Python

Usage examples

from docx import Document

# Create a document
doc = Document()
doc.add_heading('Document Title', 0)
doc.add_paragraph('A plain paragraph.')

# Bold and italic are applied per run
p = doc.add_paragraph('A paragraph with ')
p.add_run('bold').bold = True
p.add_run(' and ')
p.add_run('italic').italic = True

# Built-in style names contain spaces
doc.add_paragraph('An intense quote', style='Intense Quote')

# Table
table = doc.add_table(rows=3, cols=3)
table.rows[0].cells[0].text = 'Header 1'

# Save
doc.save('output.docx')

# Read
doc = Document('existing.docx')
for para in doc.paragraphs:
    print(para.text)

2. openpyxl ⭐⭐⭐⭐⭐

Repository: https://foss.heptapod.net/openpyxl/openpyxl
Stars: 2k+
Language: Python

Usage examples

from openpyxl import Workbook, load_workbook

# Create an Excel file
wb = Workbook()
ws = wb.active
ws['A1'] = 'Hello'
ws['B1'] = 'World'
wb.save('sample.xlsx')

# Read an Excel file
wb = load_workbook('existing.xlsx')
ws = wb.active

for row in ws.iter_rows(values_only=True):
    print(row)

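VI. Tool Selection Guide

By use case

Use case	Recommended tool	Reason
Markdown → multiple formats	Pandoc	Universal converter
Web page → PDF	Puppeteer	Full rendering
Batch PDF generation	WeasyPrint	Fast
API service	Gotenberg	Language-agnostic
PDF table extraction	pdfplumber	High accuracy
PDF → Markdown	Marker	AI-enhanced
Multi-format parsing	Apache Tika	Broad format coverage
Prettier Git diffs	diff-so-fancy	Readable
Word/Excel processing	python-docx/openpyxl	Python-friendly

Combined workflows

Automated documentation workflow:

Markdown (authoring)
  ↓ Pandoc
HTML (preview)
  ↓ Puppeteer
PDF (distribution)
  ↓ pdfplumber
Data extraction (archiving)

Enterprise document processing:

Documents in many formats
  ↓ Apache Tika
Unified text
  ↓ Processing/analysis
  ↓ Pandoc
Standard output format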

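VII. Best Practices

1. Performance optimization

// Reuse one browser instance across a batch (JavaScript, inside an async function)
const browser = await puppeteer.launch();
for (const file of files) {
  const page = await browser.newPage();
  // process the file
  await page.close();
}
await browser.close();

2. Error handling

import pdfplumber

try:
    with pdfplumber.open('document.pdf') as pdf:
        text = pdf.pages[0].extract_text()
except Exception as e:
    print(f"Error: {e}")

3. Concurrent processing

# Use multiple processes
from multiprocessing import Pool

def process_pdf(filename):
    # processing logic
    pass

if __name__ == '__main__':
    filenames = ['a.pdf', 'b.pdf']  # files to process
    with Pool(4) as p:
        p.map(process_pdf, filenames)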
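
The error-handling and concurrency practices above can be combined so that one unreadable file does not abort a whole batch. The following is a minimal sketch using the standard library's `concurrent.futures`; `run_batch` is a hypothetical helper, and threads are used here only to keep the example easy to run (a `ProcessPoolExecutor` can be passed as `executor_cls` for CPU-bound work, provided the worker function is importable).

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(process_one, filenames, max_workers=4, executor_cls=ThreadPoolExecutor):
    """Apply process_one to each file; return (results, errors) keyed by filename."""
    results, errors = {}, {}
    with executor_cls(max_workers=max_workers) as pool:
        futures = {name: pool.submit(process_one, name) for name in filenames}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[name] = exc
    return results, errors
```

After the batch finishes, the `errors` dictionary gives a complete list of files to retry or inspect, rather than only the first failure.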

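VIII. Summary

Core recommendations

Essential tools:

Pandoc - universal converter
Puppeteer - PDF generation
pdfplumber - PDF extraction

By scenario:

Developers → Puppeteer + pdfplumber
Python projects → WeasyPrint + pdfplumber
Enterprise → Gotenberg + Apache Tika
AI-enhanced → Marker

Last updated: 2025-11-08
Scope: 15+ document-processing tools
Top recommendations: Pandoc, Puppeteer, pdfplumber, Marker, Gotenberg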
