pandoc-php/pandoc
Latest stable version: v1.1.0
Composer install command:
composer require pandoc-php/pandoc
Package description
A native PHP 8.4 port of the Pandoc document converter.
README
A native PHP 8.4 port of the Pandoc document converter. This library converts documents between formats without requiring the system-level Pandoc binary. It currently focuses on converting Word (.docx), HTML (.html), Markdown (.md), and Jupyter Notebook (.ipynb) input to LaTeX.
Features
- Native PHP 8.4 Implementation: Uses modern PHP features like readonly classes, Enums, and property hooks.
- AST-Centric Architecture: Mirrors Pandoc's Abstract Syntax Tree (AST) for robust and accurate conversions.
- Modular Reader System: Uses a factory pattern and unified ReaderInterface for easy expansion to new formats.
- Deep Docx Parsing: Extracts paragraphs, headers, tables, lists, images/media, and advanced text styling (bold, italic, underline, strikeout, superscript/subscript, and colors).
- LaTeX Generation: Produces clean LaTeX code, available as both standalone documents and body fragments.
- Media Support: Automatically extracts images from documents and includes them in the AST's MediaBag. The web interface bundles these into a ZIP archive alongside the LaTeX source.
- Improved Robustness: Resilient Docx parsing that handles malformed XML, missing styles, and relationship collisions (e.g., images in headers/footers).
- No External Dependencies: Works purely in PHP 8.4+, making it easy to deploy in shared hosting or restricted environments.
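To make the "AST-centric" and "readonly classes, Enums" points concrete, here is a minimal sketch of an immutable document tree and a tiny LaTeX renderer that walks it. Every name in it (InlineType, Inline, Para, renderInline) is hypothetical and chosen for illustration; the library's actual AST classes may be named and structured differently, and this sketch does no character escaping.

```php
<?php
declare(strict_types=1);

// Hypothetical node types for illustration only; the library's real AST
// classes may be named and shaped differently.
enum InlineType: string
{
    case Str = 'Str';
    case Emph = 'Emph';
    case Strong = 'Strong';
}

final readonly class Inline
{
    /** @param list<Inline> $children */
    public function __construct(
        public InlineType $type,
        public string $text = '',
        public array $children = [],
    ) {}
}

final readonly class Para
{
    /** @param list<Inline> $inlines */
    public function __construct(public array $inlines) {}
}

// Render a small subset of inline nodes to LaTeX by walking the tree.
function renderInline(Inline $node): string
{
    $inner = implode('', array_map(renderInline(...), $node->children));

    return match ($node->type) {
        InlineType::Str => $node->text,
        InlineType::Emph => '\\emph{' . $inner . '}',
        InlineType::Strong => '\\textbf{' . $inner . '}',
    };
}

$para = new Para([
    new Inline(InlineType::Str, 'Hello '),
    new Inline(InlineType::Strong, children: [new Inline(InlineType::Str, 'world')]),
]);

// Prints: Hello \textbf{world}
echo implode('', array_map(renderInline(...), $para->inlines)), "\n";
```

Because the classes are readonly, a node cannot be modified after construction, so a writer can traverse the tree without worrying that a reader or another pass will change it underneath.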
Installation
Ensure you have PHP 8.4 or higher.
composer require pandoc-php/pandoc
Basic Usage
Converting a Word Document to LaTeX
use Pandoc\Reader\DocxReader;
use Pandoc\Writer\LatexWriter;

$reader = new DocxReader();
$writer = new LatexWriter();

// 1. Read the Docx file into an AST
$doc = $reader->read('document.docx');

// 2. Convert AST to LaTeX string (standalone document)
$latex = $writer->write($doc, standalone: true);
file_put_contents('document.tex', $latex);
Converting Markdown to LaTeX Fragment
use Pandoc\Reader\MarkdownReader;
use Pandoc\Writer\LatexWriter;

$reader = new MarkdownReader();
$writer = new LatexWriter();

$markdown = "# Hello World\nThis is a paragraph.";
$doc = $reader->read($markdown);

// Output just the body (no preamble)
$latexFragment = $writer->write($doc, standalone: false);
Converting HTML to LaTeX
use Pandoc\Reader\HtmlReader;
use Pandoc\Writer\LatexWriter;

$reader = new HtmlReader();
$writer = new LatexWriter();

$html = "<h1>Hello</h1><p>World</p>";
$doc = $reader->read($html);
$latex = $writer->write($doc);
Converting Jupyter Notebooks to LaTeX
use Pandoc\Reader\IpynbReader;
use Pandoc\Writer\LatexWriter;

$reader = new IpynbReader();
$writer = new LatexWriter();

$json = file_get_contents('notebook.ipynb');
$doc = $reader->read($json);
$latex = $writer->write($doc);
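When the input format is only known at runtime, one option is to pick the reader from the file extension. The sketch below maps extensions to the four reader class names used in the examples above. The helper readerClassFor() is hypothetical and not part of the library, and the extra extensions it accepts (.markdown, .htm) are assumptions; the library ships its own ReaderFactory, whose API is not shown here.

```php
<?php
declare(strict_types=1);

// Map a file extension to a reader class name taken from the examples above.
// readerClassFor() is a hypothetical helper, not part of the library.
function readerClassFor(string $path): string
{
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    return match ($ext) {
        'docx' => 'Pandoc\\Reader\\DocxReader',
        'md', 'markdown' => 'Pandoc\\Reader\\MarkdownReader',
        'html', 'htm' => 'Pandoc\\Reader\\HtmlReader',
        'ipynb' => 'Pandoc\\Reader\\IpynbReader',
        default => throw new InvalidArgumentException("Unsupported extension: .$ext"),
    };
}

// Prints: Pandoc\Reader\DocxReader
echo readerClassFor('report.DOCX'), "\n";
```

Note that in the examples above DocxReader::read() is given a file path, while the Markdown, HTML, and notebook readers are given the file contents as a string, so calling code would need to treat the two cases differently.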
Web Interface
The project includes a simple web-based demonstration tool in the web/ directory.
- Point your web server to the php-pandoc/web/ folder.
- Open index.html in your browser.
- Upload a .docx, .html, .ipynb, or .md file.
- Choose the output format (Standalone or Fragment).
- Download the converted .tex file. If the document contains images, you will receive a .zip archive containing the LaTeX file and all media files in the same directory.
Supported Structures
For a detailed list of Word document features handled by this port, see SUPPORTED_STRUCTURES.md. Highlights include:
- Headers: Heading 1-6 and Title mapping.
- Text Styling: Bold, Italic, Underline, Strikeout, Superscript, Subscript.
- Colors: Text color and background (highlight/shading).
- Lists: Bulleted and Ordered lists.
- Images/Media: Automatic extraction from Word documents, HTML, and Jupyter Notebooks.
- Headers & Footers: Extraction of content from Docx headers and footers.
- Tables: Multi-body tables with header row detection.
- Horizontal Rules: Detection of underscore sequences as rules.
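As an illustration of the horizontal-rule item, the check below treats a paragraph that consists only of underscores as a rule. The function name and the minimum length of three characters are assumptions made for this sketch; the library's actual detection logic and threshold are not documented here.

```php
<?php
declare(strict_types=1);

// Hypothetical check: treat a paragraph made only of underscores as a rule.
// The default minimum length of 3 is an assumption, not taken from the library.
function looksLikeHorizontalRule(string $paragraphText, int $minLength = 3): bool
{
    $trimmed = trim($paragraphText);

    // strspn() counts the leading run of underscores; the text is a rule
    // only if that run covers the whole trimmed string.
    return strlen($trimmed) >= $minLength
        && strspn($trimmed, '_') === strlen($trimmed);
}

var_dump(looksLikeHorizontalRule('__________')); // bool(true)
var_dump(looksLikeHorizontalRule('snake_case')); // bool(false)
```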
Development and Testing
The project uses PHPUnit for testing. To run the test suite:
./vendor/bin/phpunit
Tests cover:
- AST Integrity: Ensuring immutability and correct structure.
- Reader/Writer Modularity: Testing the ReaderFactory and interface consistency.
- Writer Accuracy: Verifying LaTeX output and character escaping.
- Reader Reliability: Testing against standardized Docx samples to ensure parity with Pandoc's behavior.
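To show what "character escaping" involves on the writer side, here is a minimal sketch of escaping LaTeX special characters in plain text. escapeLatex() is a hypothetical helper written for this example and is not the library's escaping routine; the real writer may handle more cases or use different replacements.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: escape LaTeX special characters in plain text.
// This is not the library's escaping routine.
function escapeLatex(string $text): string
{
    // strtr() with an array scans the input once, so replacement output
    // (e.g. the braces in \textbackslash{}) is never escaped a second time.
    return strtr($text, [
        '\\' => '\\textbackslash{}',
        '{'  => '\\{',
        '}'  => '\\}',
        '$'  => '\\$',
        '&'  => '\\&',
        '#'  => '\\#',
        '%'  => '\\%',
        '_'  => '\\_',
        '^'  => '\\textasciicircum{}',
        '~'  => '\\textasciitilde{}',
    ]);
}

// Prints: 50\% of \$10 \& a\_b
echo escapeLatex('50% of $10 & a_b'), "\n";
```

A single-pass replacement matters here: chained str_replace() calls could re-escape the braces or backslashes introduced by an earlier replacement.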
Credits
This project is a port of Pandoc, originally created by John MacFarlane.
License
This project is licensed under the GPL v2 or later, mirroring the original Pandoc license.
Other information
- License: GPL-2.0-or-later
- Last updated: 2026-01-07