octopoda/octopus

Latest stable version: 0.11.1

Composer install command:

composer require octopoda/octopus

Package description

PHP Sitemap crawler

README

Small PHP tool to crawl the collection of URLs in a Sitemap, using the ReactPHP library to load the URLs asynchronously. Both plain-text files and XML Sitemaps are supported.
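To illustrate the two supported input formats, here is a minimal sketch of how URLs could be pulled out of either one. It is not Octopus's own parser: the function name `extractUrls` and the detection heuristic (content that starts with `<` is treated as XML) are assumptions made for this example.

```php
<?php

/**
 * Hypothetical helper (not part of Octopus): returns the URLs found in
 * sitemap content, accepting either an XML Sitemap or one URL per line.
 */
function extractUrls(string $content): array
{
    $content = trim($content);
    if ($content === '') {
        return [];
    }

    // Assumption for this sketch: anything starting with "<" is XML.
    if ($content[0] === '<') {
        $previous = libxml_use_internal_errors(true); // keep parse errors quiet
        $xml = simplexml_load_string($content);
        libxml_use_internal_errors($previous);
        if ($xml === false) {
            return [];
        }

        $urls = [];
        // Expected layout: <urlset><url><loc>...</loc></url></urlset>
        foreach ($xml->url as $url) {
            $urls[] = trim((string) $url->loc);
        }
        return $urls;
    }

    // Plain text: one URL per line, blank lines ignored.
    $lines = array_map('trim', preg_split('/\R/', $content));
    return array_values(array_filter($lines, function ($line) {
        return $line !== '';
    }));
}
```

Sitemap index files and gzip-compressed sitemaps are deliberately left out of this sketch.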

Usage from the Command Line Interface (CLI)

Crawl the URLs in a Sitemap with verbose logging (-vvv).

php application.php http://www.domain.ext/sitemap.xml -vvv

Using 15 concurrent connections instead of the default 5 concurrent connections:

php application.php http://www.domain.ext/sitemap.xml --concurrency 15 -vvv

Use an HTTP GET request instead of the default HTTP HEAD. Note that HEAD requests transfer less data because the response carries no body:

php application.php http://www.domain.ext/sitemap.xml --requestType GET -vvv

Use a timeout of 3 seconds instead of the default 10 seconds:

php application.php http://www.domain.ext/sitemap.xml --timeout 3 -vvv

Use a specific User-Agent instead of the default Octopus/1.0, for example to simulate a search engine crawling the sitemap:

php application.php http://www.domain.ext/sitemap.xml --userAgent 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' -vvv

Use the TablePresenter to display intermediate results instead of the default EchoPresenter:

php application.php http://www.domain.ext/sitemap.xml --presenter Octopus\\Presenter\\TablePresenter -vvv
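The request type, timeout, and User-Agent options shown above correspond to ordinary HTTP client settings. As a rough illustration, the sketch below expresses the same three settings with PHP's built-in HTTP stream wrapper rather than Octopus's ReactPHP-based client. `buildHttpContextOptions` and `fetchStatusLine` are hypothetical names for this example, and the defaults simply mirror the ones quoted above.

```php
<?php

/**
 * Hypothetical helper (not part of Octopus): builds stream context options
 * equivalent to --requestType, --timeout and --userAgent.
 */
function buildHttpContextOptions(
    string $requestType = 'HEAD',
    float $timeout = 10.0,
    string $userAgent = 'Octopus/1.0'
): array {
    $requestType = strtoupper($requestType);
    if (!in_array($requestType, ['HEAD', 'GET'], true)) {
        throw new InvalidArgumentException("Unsupported request type: $requestType");
    }

    return [
        'http' => [
            'method'        => $requestType, // HEAD avoids transferring a body
            'timeout'       => $timeout,     // seconds
            'user_agent'    => $userAgent,
            'ignore_errors' => true,         // still read headers of 4xx/5xx replies
        ],
    ];
}

/**
 * Returns the first status line of the response (e.g. "HTTP/1.1 200 OK"),
 * or null if the request failed. Performs a real network request.
 */
function fetchStatusLine(string $url, array $options): ?string
{
    $headers = @get_headers($url, false, stream_context_create($options));
    return $headers === false ? null : $headers[0];
}
```

This checks one URL at a time and blocks while waiting; the concurrency that Octopus offers is not reproduced here.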

Usage from your own application

You can easily integrate sitemap crawling into your own application; see the Config class for all available configuration options. If required, you can inject a PSR-3 logger for logging purposes.

use Octopus\Config;
use Octopus\Processor;

$config = new Config();
$config->concurrency = 2;
$config->targetFile = 'https://www.domain.ext/sitemap.xml';
$config->additionalResponseHeadersToCount = array(
    'CF-Cache-Status', //Useful to check CloudFlare edge server cache status
);
$config->requestHeaders = array(
    'User-Agent' => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)', //Simulate Google's webcrawler
);
$processor = new Processor($config, $this->logger); //A PSR3 Logger can be injected if required
$processor->run();

$this->logger->info('Statistics: ' . print_r($processor->result->getStatusCodes(), true));
$this->logger->info('Applied concurrency: ' . $config->concurrency);
$this->logger->info('Total amount of processed data: ' . $processor->result->getTotalData());
$this->logger->info('Failed to load #URLs: ' . count($processor->result->getBrokenUrls()));
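Once a run completes, the collected statistics can be reduced to a simple pass/fail signal, for example in a CI job. The sketch below assumes that `getStatusCodes()` returns an array mapping HTTP status codes to the number of URLs that returned them; this README does not confirm that shape, and `summarizeStatusCodes` is a hypothetical helper, not part of Octopus.

```php
<?php

/**
 * Hypothetical helper (not part of Octopus).
 * Assumes input of the form [statusCode => numberOfUrls].
 */
function summarizeStatusCodes(array $statusCodes): array
{
    $summary = ['total' => 0, 'ok' => 0, 'failed' => 0];

    foreach ($statusCodes as $code => $count) {
        $code = (int) $code;
        $summary['total'] += $count;

        // Treat 2xx and 3xx as healthy; everything else counts as failed.
        if ($code >= 200 && $code < 400) {
            $summary['ok'] += $count;
        } else {
            $summary['failed'] += $count;
        }
    }

    return $summary;
}

// Made-up numbers; a real run would pass $processor->result->getStatusCodes().
$summary = summarizeStatusCodes([200 => 42, 301 => 3, 404 => 2, 500 => 1]);
printf("%d of %d URLs responded without an error status\n", $summary['ok'], $summary['total']);
```

A caller could exit with a non-zero status whenever `$summary['failed']` is greater than zero.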

Limitations

Currently, Octopus is mainly an experimental / educational tool. Advanced use cases in HTTP response handling might not be supported.

Tests

To run the test suite, you first need to clone this repository and then install all dependencies using Composer:

$ composer install

Then, from the project root, run:

$ make test

Statistics

  • Total downloads: 4.71k
  • Monthly downloads: 0
  • Daily downloads: 0
  • Favorites: 11
  • Clicks: 0
  • Dependents: 0
  • Suggesters: 0

GitHub information

  • Stars: 11
  • Watchers: 2
  • Forks: 1
  • Language: PHP

Other information

  • License: MIT
  • Last updated: 2017-12-06