nizek/crawler

Latest stable version: 1.1.2

Composer install command:

composer require nizek/crawler

README

A PHP package to automate web crawling and element retrieval using Selenium WebDriver. This package allows you to connect to a Selenium server, navigate to web pages, and interact with elements by tags, classes, IDs, and CSS selectors. The package is set up for Chrome in headless mode, making it suitable for use in server environments.

Requirements

  • PHP 7.4 or newer
  • Selenium Server
  • ChromeDriver
  • Chromium

Using Docker

If you prefer to use Docker, a compatible Dockerfile can be found here. Run the following commands to build the image and start the Selenium server:

docker build -t selenium .
docker run -d --name selenium -p 4444:4444 selenium:latest

After running these commands, you can access the Selenium server by opening http://localhost:4444 in your browser.
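
Before pointing the crawler at the container, it can help to confirm that the server reports itself as ready. The sketch below assumes the server answers the standard WebDriver status request with a JSON body shaped like {"value": {"ready": true}}. The function name isSeleniumReady is hypothetical and not part of nizek/crawler, and the /status path in the usage comment is an assumption (older Selenium versions serve it under /wd/hub/status).

```php
<?php
// Hypothetical helper, not part of nizek/crawler.
// Interprets the JSON body of a WebDriver "status" response.
function isSeleniumReady(string $statusJson): bool
{
    $data = json_decode($statusJson, true);

    // Malformed JSON or an unexpected shape counts as "not ready".
    if (!is_array($data) || !isset($data['value']) || !is_array($data['value'])) {
        return false;
    }

    return ($data['value']['ready'] ?? false) === true;
}

// Usage against a local server (assumed URL; the container must be running):
// $body = @file_get_contents('http://localhost:4444/status');
// var_dump($body !== false && isSeleniumReady($body));
```

Keeping the parsing separate from the HTTP call makes the check easy to exercise without a running server.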

Usage

Initialization

To use the Crawler, create an instance with the static init method, which provides a preconfigured WebDriver instance connected to Selenium.

use Nizek\Crawler\Crawler;

$crawler = Crawler::init();

Setting the URL

To set the URL for the crawler to navigate to:

$crawler->setUrl('https://example.com');

Methods

The following methods are provided for interacting with and retrieving elements from the webpage:

setUrl(string $url)

Sets the URL for the crawler to visit.

Parameters:
    $url: The URL to navigate to.
Returns:
    Returns the Crawler instance for method chaining.

Example:

$crawler->setUrl('https://example.com');
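
Because setUrl returns the Crawler instance, navigation and lookup can be written as one chained expression. The following is a minimal sketch that relies only on the methods documented in this README; the helper name firstHeadingText is hypothetical, and it accepts any object with the same method names so it can be tried without a Selenium server.

```php
<?php
// Hypothetical helper: relies only on setUrl() returning the crawler itself
// and on the returned element exposing getText().
function firstHeadingText(object $crawler, string $url): string
{
    return $crawler
        ->setUrl($url)               // navigate, then keep chaining
        ->getElementByTagName('h1')  // first <h1> on the page
        ->getText();
}

// Usage with the real crawler (requires a running Selenium server):
// echo firstHeadingText(\Nizek\Crawler\Crawler::init(), 'https://example.com');
```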

parseXMLUrls()

Parses all URLs from <loc> tags in XML content. Useful for sitemaps.

Returns:
    An array of URLs found in <loc> tags on the page.

Example:

$crawler->setUrl('https://example.com/sitemap.xml');
$urls = $crawler->parseXMLUrls();

foreach ($urls as $url) {
    echo $url . PHP_EOL;
}
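
To make the idea of collecting <loc> values concrete, here is a standalone sketch that does the same kind of extraction on an XML string using PHP's DOM extension. It is an illustration only and not the package's implementation of parseXMLUrls; the function name extractLocUrls is hypothetical.

```php
<?php
// Illustrative only: extracts <loc> values from a sitemap string with DOMDocument.
// This is NOT how nizek/crawler implements parseXMLUrls().
function extractLocUrls(string $xml): array
{
    if (trim($xml) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Suppress parser warnings; loadXML() returns false on malformed input.
    if (@$doc->loadXML($xml) === false) {
        return [];
    }

    $urls = [];
    foreach ($doc->getElementsByTagName('loc') as $node) {
        $url = trim($node->textContent);
        if ($url !== '') {
            $urls[] = $url;
        }
    }

    return $urls;
}
```

Returning an empty array for empty or malformed input keeps the caller's foreach loop safe without extra checks.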

getElementByTagName(string $tagName)

Finds the first element with the given tag name.

Parameters:
    $tagName: The name of the tag to search for.
Returns:
    A WebElement object representing the element.

Example:

$crawler->setUrl('https://example.com');
$element = $crawler->getElementByTagName('h1');
echo $element->getText();

getElementsByTagName(string $tagName)

Finds all elements with the given tag name.

Parameters:
    $tagName: The name of the tag to search for.
Returns:
    An array of WebElement objects representing the elements.

Example:

$crawler->setUrl('https://example.com');
$elements = $crawler->getElementsByTagName('a');

foreach ($elements as $element) {
    echo $element->getAttribute('href') . PHP_EOL;
}
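
The href values collected this way usually contain duplicates, off-site links, and non-HTTP schemes. A small post-processing step can narrow them down before crawling further. The helper below is a sketch under that assumption; filterLinksByHost is a hypothetical name and not part of the package.

```php
<?php
// Hypothetical helper: keeps unique http(s) links that point at $host.
function filterLinksByHost(array $hrefs, string $host): array
{
    $kept = [];
    foreach ($hrefs as $href) {
        // Anchors without an href may produce null or an empty string.
        if (!is_string($href) || $href === '') {
            continue;
        }

        $parts = parse_url($href);
        if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
            continue;
        }

        $scheme = strtolower($parts['scheme']);
        if (($scheme === 'http' || $scheme === 'https')
            && strcasecmp($parts['host'], $host) === 0) {
            $kept[$href] = true; // array keys de-duplicate
        }
    }

    return array_keys($kept);
}

// Usage with the loop above (assumes $elements holds the matched anchors):
// $hrefs = [];
// foreach ($elements as $element) { $hrefs[] = $element->getAttribute('href'); }
// $internal = filterLinksByHost($hrefs, 'example.com');
```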

getElementBySelector(string $selector)

Finds the first element with the given selector.

Parameters:
    $selector: The CSS selector to match.
Returns:
    A WebElement object representing the element.

Example:

$crawler->setUrl('https://example.com');
$element = $crawler->getElementBySelector('img.img-thumb[alt="Apple iphone 12"]');
echo $element->getAttribute('src');

This finds the first img element with the class img-thumb whose alt attribute equals "Apple iphone 12", in the same way a CSS selector filters page elements.

getElementsBySelector(string $selector)

Finds all elements with the given selector.

Parameters:
    $selector: The CSS selector to match.
Returns:
    An array of WebElement objects representing the elements.

Example:

$crawler->setUrl('https://example.com');
$elements = $crawler->getElementsBySelector('.list-item');

foreach ($elements as $element) {
    echo $element->getText() . PHP_EOL;
}

getElementByClassName(string $className)

Finds the first element with the given class name.

Parameters:
    $className: The name of the class to search for.
Returns:
    A WebElement object representing the element.

Example:

$crawler->setUrl('https://example.com');
$element = $crawler->getElementByClassName('header');
echo $element->getText();

getElementsByClassName(string $className)

Finds all elements with the given class name.

Parameters:
    $className: The name of the class to search for.
Returns:
    An array of WebElement objects representing the elements.

Example:

$crawler->setUrl('https://example.com');
$elements = $crawler->getElementsByClassName('list-item');

foreach ($elements as $element) {
    echo $element->getText() . PHP_EOL;
}

getElementById(string $id)

Finds the element with the given ID.

Parameters:
    $id: The ID of the element to search for.
Returns:
    A WebElement object representing the element.

Example:

$crawler->setUrl('https://example.com');
$element = $crawler->getElementById('main-content');
echo $element->getText();

getPageContent()

Retrieves the inner HTML content of the current page.

Returns:
    A string containing the inner HTML of the page.

Example:

$crawler->setUrl('https://example.com');
$content = $crawler->getPageContent();
echo $content;
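
The returned string can be handed to any HTML parser for offline processing. As a minimal sketch, the helper below reads the page title from such a string with PHP's DOM extension. The name extractTitle is hypothetical, and the example assumes the content includes the document's head section.

```php
<?php
// Hypothetical helper: pulls the <title> text out of an HTML string.
function extractTitle(string $html): ?string
{
    if (trim($html) === '') {
        return null;
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect libxml errors instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $doc->getElementsByTagName('title');
    if ($titles->length === 0) {
        return null;
    }

    return trim($titles->item(0)->textContent);
}

// Usage (assumes $content came from $crawler->getPageContent()):
// echo extractTitle($content) ?? 'no title found';
```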

Other information

  • License: Unknown
  • Last updated: 2024-11-06