ssola/crawly

Latest stable version: 1.0.0

Composer install command:

composer require ssola/crawly

Package description

Simple web crawler library

README

Crawly is a simple web crawler able to extract and follow links, depending on the attached discovers.

Simple Example

require_once("vendor/autoload.php");

// Create a new Crawly object
$crawler = Crawly\Factory::generic();

// Discovers allow you to extract links to follow
$crawler->attachDiscover(
    new Crawly\Discovers\CssSelector('nav.pagination > ul > li > a')
);

// After a page is scraped and links are discovered, your own closures handle the extracted data
$crawler->attachExtractor(
    function($response) {
        // here we have the response, work with it!
    }
);

// set seed page
$crawler->setSeed("http://www.webpage.com/test/");

// start the crawler
$crawler->run();

Crawler object

You can create a simple crawler with the Crawler Factory; it will generate a Crawly object using Guzzle as the HTTP client.

$crawler = Crawly\Factory::generic();

You can also create a customized crawler, specifying which HTTP client, URL queue and visited-link collection to use.

$crawler = Crawly\Factory::create(new MyHttpClass(), new MyUrlQueue(), new MyVisitedCollection());

Discovers

Discovers are used to extract a set of links from the HTML and add them to the queue. You can attach as many discovers as you want, and you can create your own discover classes too.

At the moment Crawly only includes a Css Selector discover.
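
For instance, you can attach several CSS selector discovers and each one will contribute links to the URL queue; the selectors below are placeholders only:

$crawler->attachDiscover(new Crawly\Discovers\CssSelector('nav.pagination > ul > li > a'));
$crawler->attachDiscover(new Crawly\Discovers\CssSelector('article h2 > a'));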

Create your own discover

Just create a new class that implements the Discoverable interface. This new class should look like this example:

class MyOwnDiscover implements Discoverable
{
    private $configuration;

    public function __construct($configuration) 
    {
        $this->configuration = $configuration;
    }

    public function find(Crawly &$crawler, $response)
    {
        // $response contains the crawled url content
        // do some magic on the response to build $links, a collection of link nodes
        
        foreach($links as $node) {
            $uri = new Uri($node->getAttribute('href'), $crawler->getHost());

            // if the url was not visited yet, push the new link onto the url queue
            if(!$crawler->getVisitedUrl()->seen($uri->toString())) {
                $crawler->getUrlQueue()->push($uri);
            }
        }
    }
}
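
Once defined, you attach your class the same way as the built-in discover; the constructor argument below is just a placeholder configuration value:

$crawler->attachDiscover(new MyOwnDiscover('div.listing a'));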

Limiters

Limiters are used to limit the crawler's actions. For instance, you can limit how many links can be crawled or the maximum amount of bandwidth to use.
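
The README does not document the limiter API, so the sketch below is purely hypothetical: both the class shape and the attachLimiter() call are assumptions, shown only to illustrate the idea of capping how many links get crawled.

// Hypothetical example only -- not part of the documented Crawly API
class MaxLinksLimiter
{
    private $max;
    private $crawled = 0;

    public function __construct($max)
    {
        $this->max = $max;
    }

    // Returns false once the configured number of links has been crawled
    public function allows()
    {
        return $this->crawled++ < $this->max;
    }
}

// Assumed attachment point, by analogy with attachDiscover()/attachExtractor()
// $crawler->attachLimiter(new MaxLinksLimiter(100));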

Extractors
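
Extractors are the closures you attach with attachExtractor(); they receive each crawled response so you can process the data yourself, as in the Simple Example above. A minimal sketch, assuming the response is a Guzzle/PSR-7 object whose body can be read with getBody():

// Collect the <title> of every crawled page
$titles = [];
$crawler->attachExtractor(function ($response) use (&$titles) {
    // Assumption: $response->getBody() is available, as on a Guzzle response
    $html = (string) $response->getBody();
    if (preg_match('/<title>(.*?)<\/title>/si', $html, $matches)) {
        $titles[] = trim($matches[1]);
    }
});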

Statistics

  • Total downloads: 18
  • Monthly downloads: 0
  • Daily downloads: 0
  • Favorites: 4
  • Hits: 0
  • Dependent projects: 0
  • Suggesters: 0

GitHub Information

  • Stars: 4
  • Watchers: 1
  • Forks: 0
  • Language: PHP

Other Information

  • License: Unknown
  • Last updated: 2014-11-27