crwlr/crawler
Latest stable version: v3.5.6
Composer install command:
composer require crwlr/crawler
Package description
Web crawling and scraping library.
README
Library for Rapid (Web) Crawler and Scraper Development
This library provides a lightweight framework and many ready-to-use building blocks, called steps, that you can combine to build your own crawlers and scrapers.
To give you an overview, here's a list of things that it helps you with:
- Crawler Politeness 😇 (respecting robots.txt, throttling, ...)
- Load URLs using
  - a (PSR-18) HTTP client (the default is Guzzle)
  - or a headless browser (Chrome) to get the source after JavaScript execution
- Get absolute links from HTML documents 🔗
- Get sitemaps from robots.txt and get all URLs from those sitemaps
- Crawl (load) all pages of a website 🕷
- Use cookies (or don't) 🍪
- Use any HTTP method (GET, POST, ...) and send any headers or body
- Easily iterate over paginated list pages 🔁
- Extract data from:
- Extract schema.org structured data in JSON-LD format from HTML documents
- Keep memory usage low by using PHP Generators 💪
- Cache HTTP responses during development, so you don't have to load pages again and again after every code change
- Get logs about what your crawler is doing (accepts any PSR-3 LoggerInterface)
- And a lot more...
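
The list above names PHP Generators as the way memory usage is kept low. As a rough illustration of that idea only, the plain-PHP sketch below yields page URLs one at a time instead of building the whole list up front. It does not use or reproduce the library's own API: the function name `paginatedUrls`, the `?page=` query parameter, and the example.com URL are all hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of crwlr/crawler): lazily yields the URLs
 * of a paginated list, one page at a time.
 *
 * @return Generator<int, string>
 */
function paginatedUrls(string $baseUrl, int $lastPage): Generator
{
    for ($page = 1; $page <= $lastPage; $page++) {
        // Each URL is built only when the consumer asks for the next item.
        yield $baseUrl . '?page=' . $page;
    }
}

// Iterating the generator processes one URL at a time; the full list of
// URLs is never held in memory as an array.
foreach (paginatedUrls('https://www.example.com/list', 3) as $url) {
    echo $url, PHP_EOL;
}
```

Because the consumer pulls items on demand, the same pattern scales from three pages to many thousands without the memory footprint growing with the number of pages.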
Documentation
You can find the documentation at crwlr.software.
Contributing
If you are considering contributing to this package, read the contribution guide (CONTRIBUTING.md).
Statistics
- Total downloads: 12.7k
- Monthly downloads: 0
- Daily downloads: 0
- Favorites: 371
- Views: 1
- Dependents: 2
- Suggesters: 0
Other information
- License: MIT
- Last updated: 2022-04-18
