joest8/pdfinterpreter
Latest stable version: v1.0
Composer install command:
composer require joest8/pdfinterpreter
Package description
This class is designed to convert multiple PDF files, whether image-based or text-based, into an array of data. The class uses user-defined templates containing regular expressions to control the data extraction process, allowing for customized and flexible output.
README
Introduction
This class is designed to convert multiple PDF files, whether image-based or text-based, into an array of data. The class uses user-defined templates containing regular expressions to control the data extraction process, allowing for customized and flexible output.
Installation
composer require joest8/pdfinterpreter
Console Applications
To use this class, install the following command-line applications:
- Poppler (converts PDF to text and reports the number of pages in a file)
- Tesseract (reads and interprets the PNG files via OCR)
- ImageMagick (converts PDF to PNG)
Make sure a package manager is installed on your system.
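Because the tools above are invoked as command-line programs, it can help to confirm that their executables are reachable before converting anything. The following is a minimal sketch, not part of the package: the helper `find_missing_binaries` is hypothetical, and the binary names (`pdftotext` and `pdfinfo` for Poppler, `tesseract`, and `convert` for ImageMagick 6, which ImageMagick 7 replaces with `magick`) are assumptions about a typical installation.

```php
<?php
declare(strict_types=1);

/**
 * Return the names from $binaries that are not found as executable files
 * in any directory of $pathEnv (colon-separated on Unix-like systems).
 * Hypothetical helper, not part of joest8/pdfinterpreter.
 */
function find_missing_binaries(array $binaries, string $pathEnv): array
{
    $dirs = array_filter(explode(PATH_SEPARATOR, $pathEnv), 'strlen');
    $missing = [];
    foreach ($binaries as $binary) {
        $found = false;
        foreach ($dirs as $dir) {
            $candidate = rtrim($dir, '/') . '/' . $binary;
            if (is_file($candidate) && is_executable($candidate)) {
                $found = true;
                break;
            }
        }
        if (!$found) {
            $missing[] = $binary;
        }
    }
    return $missing;
}

// Assumed binary names; adjust them to your installation.
$required = ['pdftotext', 'pdfinfo', 'tesseract', 'convert'];
$missing = find_missing_binaries($required, (string) getenv('PATH'));
echo $missing === [] ? "All tools found\n" : 'Missing: ' . implode(', ', $missing) . "\n";
```

Pass the same path string you intend to give to the class so that the check reflects what the class will see.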
Automated installation
Run the following command from the source folder to install all dependencies and the Tesseract language files automatically:
php install/install_dependencies.php
Manual installation with Homebrew
If Homebrew is installed, run the following command to install the packages:
brew install poppler tesseract imagemagick
Manual installation of Tesseract Language Files
You also need to install the required Tesseract language files. You can check the available languages at: https://github.com/tesseract-ocr/tessdata_best/
Download the necessary language files and place them in the appropriate directory. To find the directory, run:
tesseract --list-langs
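To check from PHP whether a given language is installed, the output of that command can be parsed. This is a rough sketch under assumptions: the helper `parse_tesseract_langs` is hypothetical, and the sample text reflects the format recent Tesseract versions are assumed to print (one header line followed by one language code per line); the path and languages shown are made up.

```php
<?php
declare(strict_types=1);

/**
 * Extract language codes from the text printed by `tesseract --list-langs`.
 * Assumes the first line is a header and each following line is one code.
 * Hypothetical helper, not part of joest8/pdfinterpreter.
 */
function parse_tesseract_langs(string $output): array
{
    $lines = preg_split('/\R/', trim($output));
    array_shift($lines); // drop the header line
    return array_values(array_filter(array_map('trim', $lines), 'strlen'));
}

// Sample output in the assumed format; path and languages vary by system.
$sample = <<<TXT
List of available languages in "/opt/homebrew/share/tessdata/" (3):
deu
eng
osd
TXT;

$langs = parse_tesseract_langs($sample);
echo in_array('eng', $langs, true) ? "eng is installed\n" : "eng is missing\n";
```

In real use the sample string would be replaced by the captured command output, for example via `shell_exec('tesseract --list-langs 2>&1')`; redirecting stderr is a precaution in case a version prints the list there.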
Usage
Create Object
<?php
require_once '../vendor/autoload.php';

use PdfInterpreter\PdfInterpreter;

//get path from terminal: 'echo $PATH'
$path_env = "/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/homebrew/bin:/opt/homebrew/bin";
$pdf = new PdfInterpreter($path_env);
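Instead of copying the path by hand, the string could be assembled at runtime. The sketch below is an assumption-laden illustration: `build_path_env` is a hypothetical helper, not part of the package, and it assumes a colon-separated path as in the example above. Note that the PATH seen by PHP (for instance under a web server) may differ from the one in your terminal, so extra directories such as `/opt/homebrew/bin` may still need to be added explicitly.

```php
<?php
declare(strict_types=1);

/**
 * Join the current PATH with extra directories, dropping empty and
 * duplicate entries while keeping the original order.
 * Hypothetical helper, not part of joest8/pdfinterpreter.
 */
function build_path_env(string $currentPath, array $extraDirs = []): string
{
    $dirs = array_merge(explode(':', $currentPath), $extraDirs);
    $dirs = array_filter($dirs, 'strlen');
    return implode(':', array_unique($dirs));
}

// getenv() returns false when PATH is unset, hence the string cast.
$path_env = build_path_env((string) getenv('PATH'), ['/opt/homebrew/bin']);
echo $path_env . "\n";
```

The resulting `$path_env` can then be passed to the constructor in place of the hard-coded string.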
Get Sample Output
The get_sample_output method returns a sample of the raw text output, without applying any patterns.
<?php
require_once '../vendor/autoload.php';

use PdfInterpreter\PdfInterpreter;

//get path from terminal: 'echo $PATH'
$path_env = "/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/homebrew/bin:/opt/homebrew/bin";
$pdf = new PdfInterpreter($path_env);

print_r($pdf->get_sample_output());
Set new template
The add_new_template method creates a new template.
For more information about the required parameters, read the method's DocBlock.
<?php
require_once '../vendor/autoload.php';

use PdfInterpreter\PdfInterpreter;

//get path from terminal: 'echo $PATH'
$path_env = "/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/homebrew/bin:/opt/homebrew/bin";
$pdf = new PdfInterpreter($path_env);

$pdf->add_new_template("sample","Sample","/[Cc]ompany[\W]?[Aa][Bb][Cc]/","1","eng");
Add pattern to template
The add_pattern_to_template method adds a new pattern to an existing template.
For more information about the required parameters, read the method's DocBlock.
$pdf->add_pattern_to_template("sample","invoice_no","/INVOICE # *([\d]*)/","1");
$pdf->add_pattern_to_template("sample","date","/INVOICE DATE *([\d]{2}.[\d]{2}.[\d]{4})/","1");
$pdf->add_pattern_to_template("sample","positions","/([\d]{1,4}) *(.*?) *([\d]{1,8},[\d]{2}) *([\d]{1,8},[\d]{2})/m","a",true,['pieces','item','price','amount']);
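To see what these regular expressions capture, they can be tried directly with PHP's own `preg_match` and `preg_match_all` before adding them to a template. The sketch below does not reproduce the class's internals; the invoice text is made up, and labelling the four capture groups with the column names is only an assumption about what the last argument of the third call is for.

```php
<?php
declare(strict_types=1);

// Made-up invoice text standing in for the text extracted from a PDF.
$text = <<<TXT
Company ABC
INVOICE # 10045
INVOICE DATE 05.11.2023
2 Widget 10,00 20,00
1 Gadget 5,50 5,50
TXT;

// Single-value patterns: capture group 1 holds the value.
preg_match('/INVOICE # *([\d]*)/', $text, $m);
$invoice_no = $m[1] ?? null;

preg_match('/INVOICE DATE *([\d]{2}.[\d]{2}.[\d]{4})/', $text, $m);
$date = $m[1] ?? null;

// Multi-match pattern: each matching row yields four capture groups,
// which are labelled here with the column names (assumed behaviour).
$columns = ['pieces', 'item', 'price', 'amount'];
preg_match_all(
    '/([\d]{1,4}) *(.*?) *([\d]{1,8},[\d]{2}) *([\d]{1,8},[\d]{2})/m',
    $text,
    $rows,
    PREG_SET_ORDER
);
$positions = [];
foreach ($rows as $row) {
    $positions[] = array_combine($columns, array_slice($row, 1));
}

print_r(['invoice_no' => $invoice_no, 'date' => $date, 'positions' => $positions]);
```

The `.` in the date pattern is unescaped, so it matches any separator character; writing `\.` would restrict it to literal dots.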
Get Template
The get_template method returns the entire template.
For more information about the required parameters, read the method's DocBlock.
print_r($pdf->get_template("sample"));
Delete Template
The delete_template method deletes the entire template.
For more information about the required parameters, read the method's DocBlock.
print_r($pdf->delete_template("sample"));
Convert Files from Folder
The convert_folder method converts all files in a folder into an array of data.
For more information about the required parameters, read the method's DocBlock.
print_r($pdf->convert_folder("/../docs/",true,false,ocr_lang: "eng"));
Convert File
The convert_file method converts a single file into an array of data.
For more information about the required parameters, read the method's DocBlock.
print_r($pdf->convert_file("/../docs/sample-bill.pdf",true,false));
Statistics
- Total downloads: 6
- Monthly downloads: 0
- Daily downloads: 0
- Favorites: 1
- Views: 0
- Dependents: 0
- Suggesters: 0
Other information
- License: MIT
- Last updated: 2023-11-05