fievel/webspider

Latest stable version: 0.1.0

Composer install command:

composer require fievel/webspider

Package description

webspider

README

This repository wraps Guzzle and some Symfony components to provide an easy way to spider websites.

Requirements

  • PHP >= 5.5
  • Guzzle >= 6.0
  • Doctrine ORM >= 2.2
  • Symfony Components >= 2.7

Installation

Add fievel/webspider as a required dependency in your composer.json file:

composer require fievel/webspider

Usage

Extend the WebSpiderAbstract class and implement these methods as needed:

getDataFromResponse: extracts the data from the response. The default behaviour treats the body as plain text.

protected function getDataFromResponse(ResponseInterface $response)
{
    return (string) $response->getBody();
}
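
When the target returns JSON rather than plain text, the same hook could decode the body instead of returning it as a string. The sketch below is an assumption about how one might do that and is not part of the package: `decodeJsonBody` is a hypothetical helper that works on the string obtained from `(string) $response->getBody()`, and the sample payload is made up.

```php
<?php

/**
 * Hypothetical helper (not part of fievel/webspider): decodes a JSON
 * body into an associative array, or returns null when the body is
 * not valid JSON.
 */
function decodeJsonBody($body)
{
    $decoded = json_decode($body, true);

    // json_decode() also returns null for the JSON literal "null", so
    // check the error state rather than the return value.
    if (json_last_error() !== JSON_ERROR_NONE) {
        return null;
    }

    return $decoded;
}

// Inside a spider this could be called as:
//     return decodeJsonBody((string) $response->getBody());
$data = decodeJsonBody('{"full_name": "John Doe", "age": 42}');
echo $data['full_name'], "\n";
```

Returning null on invalid JSON keeps the failure visible to parseData, which can then decide whether to skip the page.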

parseData: extracts the information you need from the data. The Symfony DomCrawler can be initialized here if needed.

protected function parseData($data)
{
    $this->crawler->addHtmlContent($data);

    $node = $this->crawler->filter('input');

    $value = null;
    if ($node->count() > 0) {
        $value = $node->first()->attr('value');
    }

    return $value;
}
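
To try the extraction step outside a spider, the sketch below performs the same lookup (the `value` attribute of the first `input` element) with PHP's built-in DOM extension instead of DomCrawler. This is a stand-in for illustration, not how the package does it: `extractFirstInputValue` is a hypothetical name and the sample HTML is made up.

```php
<?php

/**
 * Hypothetical helper: returns the "value" attribute of the first
 * <input> element in the given HTML, or null when there is none.
 */
function extractFirstInputValue($html)
{
    $document = new DOMDocument();

    // Real-world HTML is rarely valid; collect libxml warnings
    // internally instead of emitting them, then restore the setting.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);
    $nodes = $xpath->query('//input');

    if ($nodes->length === 0) {
        return null;
    }

    $first = $nodes->item(0);

    return $first->hasAttribute('value') ? $first->getAttribute('value') : null;
}

$html = '<form><input type="hidden" name="token" value="abc123">'
      . '<input name="q" value="x"></form>';

echo extractFirstInputValue($html), "\n"; // prints "abc123"
```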

handleException: handles exceptions thrown by Guzzle.

protected function handleException(\Exception $e)
{
    return null;
}
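
Instead of discarding every failure, a handler might distinguish an HTTP error response from a failure that produced no response at all. The sketch below assumes Guzzle's `RequestException` and its `hasResponse()` and `getResponse()` methods; `describeFailure` is a hypothetical helper, and what to do with the resulting message (log it, store it, rethrow) is left open.

```php
<?php

use GuzzleHttp\Exception\RequestException;

/**
 * Hypothetical helper: turns an exception into a short message
 * suitable for logging.
 */
function describeFailure(\Exception $e)
{
    // RequestException carries the response when the server answered
    // with an error status (4xx or 5xx).
    if ($e instanceof RequestException && $e->hasResponse()) {
        return 'HTTP ' . $e->getResponse()->getStatusCode() . ': ' . $e->getMessage();
    }

    // Anything else: connection failure, timeout, or a non-Guzzle error.
    return 'Request failed: ' . $e->getMessage();
}

echo describeFailure(new \RuntimeException('connection refused')), "\n";
```

Inside a spider, `handleException` could pass this message to a logger and still return null.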

All that remains is to launch the spider you created. To do that, use the SpiderManager service.

// Requires at the top of the file: use GuzzleHttp\RequestOptions;
$manager = $this->container->get('fievel_web_spider.manager.spider');
$manager->setLogger($this->logger);

$response = null;
try {
    $response = $manager->runSpider([
        AppBundle\Spiders\CustomSpider::class,  // Spider class created
        'http://localhost/test-spider',         // URL to spider
        'post',                                 // HTTP method supported by Guzzle
        ['cookies' => true],                    // Custom config supported by Guzzle Client
        [                                       // Custom options supported by Guzzle Client
            RequestOptions::FORM_PARAMS => [
                'full_name' => 'John Doe'
            ]
        ]
    ]);
} catch (\Exception $e) {
    // Handle or log the failure here; $response stays null.
}

Features

A storage object can be shared between subsequent spider calls.

$storage = new SpiderStorage();
$storage->add($sharedData);

$response = $manager->runSpider([
    AppBundle\Spiders\CustomSpider::class,  // Spider class created
    'http://localhost/test-spider',         // URL to spider
    'post',                                 // HTTP method supported by Guzzle
    ['cookies' => true],                    // Custom config supported by Guzzle Client
    [                                       // Custom options supported by Guzzle Client
        RequestOptions::FORM_PARAMS => [
            'full_name' => 'John Doe'
        ]
    ],
    $storage                                // Shared storage
]);

It's also possible to create a queue of calls and leave the entire execution to the manager.

$queue = new SpiderCallQueue();

$queue->enqueue(
    AppBundle\Spiders\FirstPageSpider::class,
    'http://localhost/test-spider',
    'post',
    ['cookies' => true],
    [
        RequestOptions::FORM_PARAMS => [
            'full_name' => 'John Doe'
        ]
    ]
);
$queue->enqueue(
    AppBundle\Spiders\SecondPageSpider::class,
    'http://localhost/test-spider',
    'get',
    ['cookies' => true],
    []
);

$response = $manager->runSpiderQueue($queue);

Last but not least, the SpiderManager handles retries on failure using a custom Guzzle middleware.
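
The package's own middleware is not shown here, but the general mechanism can be sketched with Guzzle's built-in `Middleware::retry`, which takes a decider callable and an optional delay callable. Everything below is an assumption for illustration: the helper names, the limit of 3 retries, the choice to retry on 5xx responses and transport errors, and the backoff schedule are not taken from fievel/webspider.

```php
<?php

use GuzzleHttp\Client;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;

// Builds the decider Guzzle calls after each attempt. It receives the
// number of retries so far, the request, and either a response or an
// exception, and returns true to retry.
function makeRetryDecider($maxRetries)
{
    return function ($retries, $request, $response = null, $exception = null) use ($maxRetries) {
        if ($retries >= $maxRetries) {
            return false; // give up
        }
        if ($exception !== null) {
            return true; // no usable response (connection error, timeout)
        }

        return $response !== null && $response->getStatusCode() >= 500;
    };
}

// Exponential backoff in milliseconds: 1000, 2000, 4000, ...
function retryDelay($retries)
{
    return (int) (1000 * pow(2, $retries - 1));
}

// Wires both callables into a client; requires guzzlehttp/guzzle.
function buildRetryingClient($maxRetries = 3)
{
    $stack = HandlerStack::create();
    $stack->push(Middleware::retry(makeRetryDecider($maxRetries), 'retryDelay'));

    return new Client(['handler' => $stack]);
}
```

Keeping the decider and the delay as plain callables makes the retry policy testable without sending any request.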

Proxy

Links

Statistics

  • Total downloads: 13.74k
  • Monthly downloads: 0
  • Daily downloads: 0
  • Favorites: 0
  • Clicks: 1
  • Dependent projects: 0
  • Recommendations: 0

GitHub information

  • Stars: 0
  • Watchers: 2
  • Forks: 0
  • Language: PHP

Other information

  • License: MIT
  • Last updated: 2016-07-13