承接 vladshut/php-goose 相关项目开发

从需求分析到上线部署,全程专人跟进,保证项目质量与交付效率

邮箱:yvsm@zunyunkeji.com | QQ:316430983 | 微信:yvsm316

vladshut/php-goose

最新稳定版本:0.3.0

Composer 安装命令:

composer require vladshut/php-goose

包简介

Readability / Html Content / Article Extractor & Web Scrapping library written in PHP

README 文档

README

#PHP Goose - Article Extractor Scrutinizer Code Quality Build Status

##Intro

PHP Goose is a port of Goose originally developed in Java and converted to Scala by GravityLabs. Portions have also been ported from the Python port python-goose. It's mission is to take any news article or article type web page and not only extract what is the main body of the article but also all meta data and most probable image candidate.

The extraction goal is to try and get the purest extraction from the beginning of the article for servicing flipboard/pulse type applications that need to show the first snippet of a web article along with an image.

Goose will try to extract the following information:

  • Main text of an article
  • Main image of article
  • Any Youtube/Vimeo movies embedded in article
  • Meta Description
  • Meta tags
  • Publish Date

The php version was rewritten by:

  • Andrew Scott

##Requirement

  • PHP 5.4 or later
  • PSR-4 compatible autoloader

##Install

This library is designed to be installed via Composer.

Add the dependency into your projects composer.json.

{
  "require": {
    "scotteh/php-goose": "dev-master"
  }
}

Download the composer.phar

curl -sS https://getcomposer.org/installer | php

Install the library.

php composer.phar install

##Autoloading

This library requires an autoloader, if you aren't already using one you can include Composers autoloader.

require('vendor/autoload.php');

##Usage

use Goose\Client as GooseClient;

$goose = new GooseClient();
$article = $goose->extractContent('http://url.to/article');

$title = $article->getTitle();
$metaDescription = $article->getMetaDescription();
$metaKeywords = $article->getMetaKeywords();
$canonicalLink = $article->getCanonicalLink();
$domain = $article->getDomain();
$tags = $article->getTags();
$links = $article->getLinks();
$movies = $article->getMovies();
$articleText = $article->getCleanedArticleText();
$entities = $article->getPopularWords();
$image = $article->getTopImage();
$allImages = $article->getAllImages();

##Configuration

All config options are not required and are optional. Default (fallback) values have been used below.

$goose = new GooseClient([
    // Language - Selects common word dictionary
    //   Supported languages:
    //     ar, da, de, en, es, fi, fr, hu, id, it, ko,
    //     nb, nl, no, pl, pt, ru, sv, zh
    'language' => 'en',
    // Minimum image size (bytes)
    'image_min_bytes' => 4500,
    // Maximum image size (bytes)
    'image_max_bytes' => 5242880,
    // Minimum image size (pixels)
    'image_min_width' => 120,
    // Maximum image size (pixels)
    'image_min_height' => 120,
    // Fetch best image
    'image_fetch_best' => true,
    // Fetch all images
    'image_fetch_all' => false,
    // Guzzle configuration - All values are passed directly to Guzzle
    //   See: http://guzzle.readthedocs.org/en/latest/clients.html#request-options
    'browser' => [
        'timeout' => 60,
        'connect_timeout' => 30
    ]
]);

##Licensing

PHP Goose is licensed by Gravity.com under the Apache 2.0 license, see the LICENSE file for more details

统计信息

  • 总下载量: 16
  • 月度下载量: 0
  • 日度下载量: 0
  • 收藏数: 1
  • 点击次数: 0
  • 依赖项目数: 0
  • 推荐数: 0

GitHub 信息

  • Stars: 0
  • Watchers: 1
  • Forks: 118
  • 开发语言: PHP

其他信息

  • 授权协议: Apache-2.0
  • 更新时间: 2015-08-31