stevebauman/hypertext
Latest stable version: v1.1.2
Composer install command:
composer require stevebauman/hypertext
Package description
The best HTML to text transformer
README
A PHP HTML to plain text transformer that gracefully handles varied and malformed HTML.
Hypertext extracts the text content from any HTML-based document and automatically:
- Removes CSS
- Removes scripts
- Removes headers
- Removes non-HTML based content
- Preserves spacing
- Preserves links (optional)
- Preserves new lines (optional)
It is intended for producing text for LLM-related tasks, such as prompts and embeddings.
Installation
composer require stevebauman/hypertext
Usage
use Stevebauman\Hypertext\Transformer;

$transformer = new Transformer();

// (Optional) Filter out specific elements by their XPath.
$transformer->filter("//*[@id='some-element']");

// (Optional) Retain new line characters.
$transformer->keepNewLines();

// (Optional) Retain anchor tags and their href attribute.
$transformer->keepLinks();

$text = $transformer->toText($html);
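Before passing an XPath expression to `filter()`, it can help to check which elements the expression actually selects. The sketch below uses only PHP's built-in DOM extension and does not call Hypertext at all; `previewXPathMatches` is a hypothetical helper name, and how Hypertext evaluates the expression internally is not shown here and may differ.

```php
<?php

declare(strict_types=1);

/**
 * Return the tag names of the elements that an XPath expression selects in $html.
 * Hypothetical helper for illustration only; it is not part of Hypertext.
 */
function previewXPathMatches(string $html, string $expression): array
{
    $document = new DOMDocument();

    // Collect parser warnings internally so malformed HTML does not emit notices.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($document))->query($expression);

    // DOMXPath::query() returns false when the expression is malformed.
    if ($nodes === false) {
        throw new InvalidArgumentException("Invalid XPath expression: {$expression}");
    }

    $matches = [];
    foreach ($nodes as $node) {
        $matches[] = $node->nodeName;
    }

    return $matches;
}

$html = '<html><body><div id="some-element">Ad</div><p>Keep me</p></body></html>';

// Lists the elements that the expression from the usage snippet would select.
print_r(previewXPathMatches($html, "//*[@id='some-element']"));
```

If the helper returns an empty array, the expression matches nothing in that document, so it is worth checking the expression before relying on it in `filter()`.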
Example
For larger examples, please view the tests/Fixtures directory.
Input:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>My Blog</title>
</head>
<body>
    <h1>Welcome to My Blog</h1>
    <p>This is a paragraph of text on my webpage.</p>
    <a href="https://blog.com/posts">Click here</a> to view my posts.
</body>
</html>
Output (Pure Text):
echo (new Transformer)->toText($html);
Welcome to My Blog This is a paragraph of text on my webpage. Click here to view my posts.
Output (Keep New Lines):
echo (new Transformer)->keepNewLines()->toText($html);
Welcome to My Blog
This is a paragraph of text on my webpage.
Click here to view my posts.
Output (Keep Links):
echo (new Transformer)->keepLinks()->toText($html);
Welcome to My Blog This is a paragraph of text on my webpage. <a href="https://blog.com/posts">Click here</a> to view my posts.
Output (Keep Both):
echo (new Transformer)
    ->keepLinks()
    ->keepNewLines()
    ->toText($html);
Welcome to My Blog
This is a paragraph of text on my webpage.
<a href="https://blog.com/posts">Click here</a> to view my posts.
Statistics
- Total downloads: 385.86k
- Monthly downloads: 0
- Daily downloads: 0
- Favorites: 169
- Views: 1
- Dependent projects: 2
- Suggesters: 0
Other information
- License: MIT
- Last updated: 2023-10-22