raphaelramosds/pdf-to-txt

Composer install command:

composer require raphaelramosds/pdf-to-txt

Package description

A simple package for converting a PDF file into TXT

README documentation

README

PdfToTxt is a simple PHP package that converts a PDF file into a TXT file.

composer require raphaelramosds/pdf-to-txt

Dependencies

This package can only be used in a Linux environment. You will also need to install the following dependencies:

Tesseract OCR package

# Install Tesseract OCR with Portuguese (PT-BR) language support
sudo apt install tesseract-ocr tesseract-ocr-por

ImageMagick

# Install ImageMagick and the PHP imagick extension
sudo apt install imagemagick php-imagick

# Enable imagick extension
sudo phpenmod imagick

# (Optional) Check if it is enabled
php -m | grep imagick
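
The checks above can be gathered into one small shell function. This is only a sketch: `check_deps` is a hypothetical helper name and is not part of the package. It assumes the ImageMagick command is called `convert` and that the Portuguese language data is listed as `por`.

```shell
# check_deps (hypothetical helper): prints one line per missing dependency
# and returns non-zero if anything is missing.
check_deps() {
    missing=0

    # Command-line tools the README asks for.
    for tool in tesseract convert php; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing command: $tool"
            missing=1
        fi
    done

    # Portuguese language data (tesseract-ocr-por). Older Tesseract releases
    # print the language list on stderr, so both streams are searched.
    if ! tesseract --list-langs 2>&1 | grep -qx 'por'; then
        echo "missing Tesseract language: por"
        missing=1
    fi

    # PHP imagick extension, same idea as `php -m | grep imagick`.
    if ! php -m 2>/dev/null | grep -qix 'imagick'; then
        echo "missing PHP extension: imagick"
        missing=1
    fi

    return "$missing"
}
```

Running `check_deps && echo "all dependencies found"` prints the confirmation only when every check passes.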

How does it work?

It uses ImageMagick to convert all PDF pages into JPG format, extracts their content using Tesseract OCR and compiles the results into a single TXT file.
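
The same three steps can be reproduced by hand with the command-line tools, which may help when debugging a conversion. This is a rough sketch and not the package's actual implementation, which goes through the PHP imagick extension. `pdf_to_txt` is a hypothetical function name, and the 300 DPI density and the `por` language are assumed values.

```shell
# pdf_to_txt (hypothetical helper): PDF -> JPG pages -> OCR -> one TXT file.
# Usage: pdf_to_txt path/to/file.pdf path/to/txt file
pdf_to_txt() {
    local pdf="$1" out_dir="$2" name="$3"
    local tmp page

    tmp="$(mktemp -d)" || return 1

    # 1. Render every PDF page as a JPG (page-000.jpg, page-001.jpg, ...).
    #    300 DPI is an assumed density, chosen to keep text legible for OCR.
    convert -density 300 "$pdf" "$tmp/page-%03d.jpg" || { rm -rf "$tmp"; return 1; }

    # 2. Start with an empty output file.
    mkdir -p "$out_dir" || { rm -rf "$tmp"; return 1; }
    : > "$out_dir/$name.txt"

    # 3. OCR each page in order and append its text to the single TXT file.
    #    "stdout" as the output base makes tesseract print the recognized text.
    for page in "$tmp"/page-*.jpg; do
        tesseract "$page" stdout -l por >> "$out_dir/$name.txt" \
            || { rm -rf "$tmp"; return 1; }
    done

    rm -rf "$tmp"
}
```

The zero-padded page numbers keep the shell glob in page order, so the text is appended in the same order as the pages in the PDF.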

Ghostscript support?

While some PDF files use standard fonts that can be easily mapped to text, others rely on custom fonts which often store characters as vector graphics. In such cases, OCR becomes necessary to extract readable content. Therefore, in the future, I plan to add Ghostscript support to this package as an alternative method for handling these PDFs without relying solely on OCR.

You can use the following Ghostscript command to convert a PDF into a plain text file:

gs -sDEVICE=txtwrite -o file.txt file.pdf

Before using this approach, it's recommended to check which fonts are used in the PDF. You can do that with the following command:

gs -DPDFINFO file.pdf
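
One way to decide whether the Ghostscript route is enough for a given file is to run `txtwrite` and check whether any text came out. The sketch below illustrates that idea only. `extract_with_gs` is a hypothetical helper, and the "any non-whitespace character" test is an assumed heuristic, not something the package does.

```shell
# extract_with_gs (hypothetical helper): try Ghostscript's txtwrite device.
# Returns 0 if the output contains text, 1 if it is empty or whitespace only
# (an OCR fallback is probably needed), 2 if gs itself failed.
extract_with_gs() {
    local pdf="$1" txt="$2"

    # Same command as above, with -q to silence the startup banner.
    gs -q -sDEVICE=txtwrite -o "$txt" "$pdf" || return 2

    # Assumed heuristic: any non-whitespace character counts as extracted text.
    grep -q '[^[:space:]]' "$txt"
}
```

For example, `extract_with_gs file.pdf file.txt || echo "fall back to OCR"` would report when the text layer is missing or unusable.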

Example

Convert path/to/file.pdf into file.txt and save it in the path/to/txt directory:

$ptt = new PdfToTxt('path/to/file.pdf', 'path/to/txt', 'file');
$ptt->convert();

Tests

Unit tests are written with PHPUnit. Run them with:

./vendor/bin/phpunit tests

Statistics

  • Total downloads: 4
  • Monthly downloads: 0
  • Daily downloads: 0
  • Favorites: 0
  • Clicks: 0
  • Dependent projects: 0
  • Recommendations: 0

GitHub information

  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Language: PHP

Other information

  • License: Apache-2.0
  • Last updated: 2025-05-09