johnroyer/crawler-php

Latest stable version: 0.3.6

Composer install command:

composer require johnroyer/crawler-php

Package description

A crawler implemented in PHP.

README

A simple web crawler.

Note: this is a side project. Do NOT use it in production.

Usage

Create a handler by extending AbstractHandler, and return the domain that the handler is responsible for:

class MyHandler extends \Zeroplex\Crawler\Handler\AbstractHandler
{
    public function getDomain(): string
    {
        return 'test.com';
    }

    public function shouldFetch(\Psr\Http\Message\RequestInterface $request): bool
    {
        // match the extension against the URL path only, so query strings are ignored
        if (1 === preg_match('/\.(css|js|jpg|png|gif)$/i', $request->getUri()->getPath())) {
            // ignore css, js and common images
            return false;
        }
        return true;
    }

    public function handle(\Psr\Http\Message\ResponseInterface $response): void
    {
        // get content using $response->getBody()->getContents()
    }
}
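
The extension check in shouldFetch() can also be factored into a small standalone function, which makes it easy to test without a crawler instance. The following is a minimal sketch, not part of crawler-php: the name isStaticAsset and its default extension list are assumptions made for illustration. A plain suffix match on the full URL string can misfire on URLs such as /blog/nodejs or style.css?v=2, so the sketch parses the URL and compares only the file extension of the path.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of crawler-php): returns true when the URL
 * path ends in one of the given static-asset extensions.
 */
function isStaticAsset(string $url, array $extensions = ['css', 'js', 'jpg', 'png', 'gif']): bool
{
    // Look only at the path so that query strings and fragments are ignored.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return false; // no path component, or a malformed URL
    }

    // Compare case-insensitively so that IMG.PNG is treated like img.png.
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    return in_array($extension, $extensions, true);
}
```

Inside a handler, shouldFetch() could then return !isStaticAsset((string) $request->getUri()).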

Then set up the crawler and run it:

$crawler = new \Zeroplex\Crawler\Crawler();

$crawler->setDelay(0)
    ->setTimeout(3)
    ->setFollowRedirect(true)
    ->setUserAgent('Mozilla/5.0 (platform; rv:geckoversion) Gecko/geckotrail Firefox/100.1');

$crawler->addHandler(new MyHandler());

// URL to start
$crawler->run('https://test.com');

Extending

For example, you can implement the URL queue with Predis.

Install Predis with Composer:

composer require predis/predis

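Implement UrlQueueInterface:

class RedisQueue implements \Zeroplex\Crawler\UrlQueue\UrlQueueInterface
{
    // Redis list key holding pending URLs (name chosen for this example)
    private const KEY = 'crawler:urls';

    /** @var \Predis\Client */
    private $redis;

    public function __construct(string $host, int $port)
    {
        $this->redis = new \Predis\Client(['host' => $host, 'port' => $port]);
    }

    public function push(string $url): void
    {
        // LPUSH takes the list key first, then the values to add
        $this->redis->lpush(self::KEY, [$url]);
    }

    public function pop(): string
    {
        // LPOP returns null on an empty list; the cast turns that into ''.
        // How the crawler expects an empty queue to be signalled is not shown here.
        // Note: LPUSH + LPOP is last-in, first-out; use rpop() for FIFO order.
        return (string) $this->redis->lpop(self::KEY);
    }

    // and so on
}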
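
To exercise the same push()/pop() pair without a Redis server, an in-memory stand-in can be handy. The class below is a hypothetical sketch, not part of crawler-php: the name InMemoryQueue, the isEmpty() method, and the choice to throw on an empty queue are all assumptions. It deliberately does not implement UrlQueueInterface, because the README does not list the interface's remaining methods.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical in-memory queue (not part of crawler-php). It mirrors only the
 * push()/pop() pair shown above and does not implement UrlQueueInterface.
 */
final class InMemoryQueue
{
    /** @var SplQueue URLs waiting to be fetched, oldest first */
    private $pending;

    public function __construct()
    {
        $this->pending = new SplQueue();
    }

    public function push(string $url): void
    {
        // Append to the tail of the queue.
        $this->pending->enqueue($url);
    }

    public function pop(): string
    {
        // Assumption: an empty queue is reported with an exception.
        if ($this->pending->isEmpty()) {
            throw new UnderflowException('URL queue is empty');
        }

        // Remove from the head, giving first-in, first-out order.
        return $this->pending->dequeue();
    }

    public function isEmpty(): bool
    {
        return $this->pending->isEmpty();
    }
}
```

Unlike a Redis list driven by LPUSH and LPOP, which returns the most recently pushed URL first, this sketch returns URLs in the order they were pushed.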

Statistics

  • Total downloads: 65
  • Monthly downloads: 0
  • Daily downloads: 0
  • Favorites: 3
  • Views: 0
  • Dependents: 0
  • Suggesters: 0

GitHub information

  • Stars: 3
  • Watchers: 1
  • Forks: 0
  • Language: PHP

Other information

  • License: MIT
  • Last updated: 2023-01-12