pforret/pf-article-extractor
Latest stable version: 0.3.3
Composer install command:
composer require pforret/pf-article-extractor
Package description
PfArticleExtractor. Boilerplate Removal and Fulltext Extraction from HTML pages
README
Boilerplate Removal and Fulltext Extraction from HTML pages.
A rewrite of dotpack/php-boiler-pipe for PHP 8.2 and up, with tests.
Installation
composer require pforret/pf-article-extractor
Usage
use Pforret\PfArticleExtractor\ArticleExtractor;

$articleData = ArticleExtractor::getArticle($html);

/*
 * $articleData = Pforret\PfArticleExtractor\Formats\ArticleContentsDTO Object
 * (
 *     [title] => Film Podcast: Wicked Little Letters Named Film of the Month
 *     [content] => UK Film Club was back in March with a new episode of their film podcast. (...)
 *     [date] =>
 *     [images] => Array
 *         (
 *             [0] => https://static.wixstatic.com/media/.../b19cd0_dde0d59546f84127865267f43994f39b~mv2.jpg
 *         )
 *     [links] => Array
 *         (
 *             [0] => https://www.chrisolson.co.uk/
 *             (...)
 *         )
 * )
 */
Under the hood
- the package accepts a full HTML page as input
- it will walk the DOM tree and try to find the main article content
- it will remove boilerplate content (like headers, footers, sidebars, ...)
- it will try to extract the main article content
- it will try to extract the title, date, images and links from the article
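
The steps above can be illustrated with a rough sketch built only on PHP's bundled DOM extension. This is not the package's actual algorithm: the function name `sketchExtractArticle`, the list of tags treated as boilerplate, and the paragraph-length scoring are all assumptions chosen for illustration. The returned array keys mirror some of the field names shown in the usage output (`title`, `content`, `images`, `links`).

```php
<?php
// Hypothetical sketch, NOT the implementation used by pf-article-extractor.
// It strips obvious boilerplate tags, then picks the element whose direct
// <p> children carry the most non-link text.

function sketchExtractArticle(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // Read the page title before any nodes are removed.
    $titleNode = $xpath->query('//title')->item(0);
    $title = $titleNode ? trim($titleNode->textContent) : '';

    // Step 1: remove elements that are boilerplate by assumption.
    $boilerplate = $xpath->query('//script|//style|//nav|//header|//footer|//aside');
    foreach (iterator_to_array($boilerplate) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // Step 2: score each parent of a <p> by paragraph text minus link text.
    $scores = [];
    $parents = [];
    foreach ($xpath->query('//p') as $p) {
        $linkLength = 0;
        foreach ($xpath->query('.//a', $p) as $a) {
            $linkLength += strlen(trim($a->textContent));
        }
        $key = $p->parentNode->getNodePath();
        $scores[$key] = ($scores[$key] ?? 0) + strlen(trim($p->textContent)) - $linkLength;
        $parents[$key] = $p->parentNode;
    }

    $result = ['title' => $title, 'content' => '', 'images' => [], 'links' => []];
    if ($scores === []) {
        return $result; // no paragraphs found
    }

    // Step 3: the highest-scoring parent is treated as the article body.
    arsort($scores);
    $best = $parents[array_key_first($scores)];

    $paragraphs = [];
    foreach ($xpath->query('./p', $best) as $p) {
        // Collapse runs of whitespace inside each paragraph.
        $paragraphs[] = trim(preg_replace('/\s+/', ' ', $p->textContent));
    }
    $result['content'] = implode("\n\n", $paragraphs);

    // Step 4: collect images and links found inside the article body only.
    foreach ($xpath->query('.//img/@src', $best) as $src) {
        $result['images'][] = $src->nodeValue;
    }
    foreach ($xpath->query('.//a/@href', $best) as $href) {
        $result['links'][] = $href->nodeValue;
    }

    return $result;
}
```

Scoring by the text of direct `<p>` children keeps a page-wide wrapper element from winning just because it contains everything, and subtracting link text pushes down link-heavy blocks such as sidebars. A real extractor needs far more heuristics than this, plus date detection, which the sketch leaves out.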
Right now it's tested with example pages for:
- Blogger
- Drupal
- Jekyll
- Mkdocs
- Wix
- WordPress
Similar packages
- beautifulsoup4 - Python, MIT
- html-text - Python, MIT
- kohlschutter/boilerpipe - Java, Apache 2.0
- fivefilters/readability.php - PHP, GPL-3.0
- miso-belica/jusText - Python, BSD2
- codelucas/newspaper - Python, Apache
Statistics
- Total downloads: 1.11k
- Monthly downloads: 0
- Daily downloads: 0
- Favorites: 4
- Views: 2
- Dependent projects: 0
- Recommendations: 0
Other information
- License: MIT
- Last updated: 2024-06-02
