iprodev/sitemap-generator-pro
Latest stable version: v3.0.0
Install via Composer:
composer require iprodev/sitemap-generator-pro
Package summary:
A professional, production-ready PHP XML sitemap generator with advanced features: concurrency, robots.txt, caching, database storage, change detection, SEO analysis, JavaScript rendering, proxy support, webhooks, and more.
README
A professional, production-ready PHP sitemap generator by iProDev (Hemn Chawroka) — supports concurrency, robots.txt, gzip compression, sitemap index files, and comprehensive error handling.
🚀 What's New in v3.0
- ✨ Database Storage with change detection and historical tracking
- 🔄 Resume Capability with checkpoint system
- 🎯 SEO Analysis and content quality checking
- 📊 Performance Metrics and detailed analytics
- 🖼️ Multi-format Sitemaps (Images, Videos, News)
- 🌐 JavaScript Rendering support for SPAs
- 🔐 Proxy Support with rotation
- 🔔 Webhook Notifications for events
- 📅 Scheduled Crawling with cron integration
- 🎨 Interactive Mode for easy configuration
- ⚡ Caching System (File & Redis)
- 🎛️ Smart Filtering with priority rules
- 📈 Rate Limiting with retry handling
📋 Table of Contents
- Features
- Installation
- Quick Start
- CLI Usage
- Advanced Features
- Programmatic Usage
- Examples
- Configuration
- API Reference
- Testing
- Docker
- Troubleshooting
- Contributing
✨ Features
Core Features
- 🚀 High Performance - Concurrent HTTP requests
- 🤖 Robots.txt Compliant - Respects crawling rules
- 📦 Gzip Compression - Automatic compression
- 📊 Sitemap Index - Multiple sitemap files
- 🛡️ Error Handling - Comprehensive error management
- 📝 PSR-3 Logging - Standard logging interface
Advanced Features
- 💾 Database Storage - SQLite/MySQL/PostgreSQL support
- 🔄 Change Detection - Track URL changes over time
- 📈 SEO Analysis - Analyze pages for SEO issues
- 🔍 Quality Checks - Find duplicates, broken links
- 🎯 Smart Filtering - Include/exclude patterns
- ⚡ Caching - File and Redis cache support
- 📍 Resume Support - Continue interrupted crawls
- 🔔 Webhooks - Real-time notifications
- 📅 Scheduling - Automated periodic crawls
- 🌐 JavaScript - Render SPAs with headless Chrome
- 🔐 Proxy Support - HTTP/SOCKS proxies with rotation
- 🎨 Interactive Mode - User-friendly configuration
Sitemap Types
- 📄 Standard XML Sitemap
- 🖼️ Image Sitemap
- 🎬 Video Sitemap
- 📰 News Sitemap
📥 Installation
composer require iprodev/sitemap-generator-pro
Requirements
- PHP >= 8.0
- Extensions: curl, xml, mbstring, zlib, pdo
- Optional: redis, posix
🚀 Quick Start
Basic Usage
php bin/sitemap --url=https://www.example.com
Interactive Mode
php bin/sitemap --interactive
With All Features
php bin/sitemap \
  --url=https://www.example.com \
  --out=./sitemaps \
  --concurrency=20 \
  --cache-enabled \
  --db-enabled \
  --seo-analysis \
  --image-sitemap \
  --webhook-url=https://example.com/webhook \
  --verbose
🖥️ CLI Usage
Basic Options
--url=<URL>               # Starting URL (required)
--out=<PATH>              # Output directory
--concurrency=<N>         # Concurrent requests (1-100)
--max-pages=<N>           # Maximum pages to crawl
--max-depth=<N>           # Maximum link depth
--public-base=<URL>       # Public base URL for sitemap index
--verbose, -v             # Verbose output
--help, -h                # Show help
Caching
--cache-enabled           # Enable caching
--cache-driver=file       # Cache driver: file|redis
--cache-ttl=3600          # Cache TTL in seconds
Database & Change Detection
--db-enabled              # Enable database storage
--db-dsn=<DSN>            # Database DSN
--detect-changes          # Compare with previous crawl
--only-changed            # Only include changed URLs
Resume Support
--resume                  # Resume from checkpoint
--checkpoint-interval=<N> # Save checkpoint every N pages
Rate Limiting
--rate-limit=<N>          # Requests per minute
--delay=<MS>              # Delay between requests (ms)
Filtering
--exclude=<PATTERNS>      # Exclude patterns (comma-separated)
--include=<PATTERNS>      # Include only patterns
--priority-rules=<JSON>   # Priority rules as JSON
SEO & Analysis
--seo-analysis            # Enable SEO analysis
--check-quality           # Check content quality
--find-duplicates         # Find duplicate content
--find-broken-links       # Find broken links
Advanced Sitemaps
--image-sitemap           # Generate image sitemap
--video-sitemap           # Generate video sitemap
--news-sitemap            # Generate news sitemap
JavaScript Rendering
--enable-javascript       # Enable JS rendering
--chrome-path=<PATH>      # Path to Chrome/Chromium
--wait-for-ajax=<MS>      # Wait time for AJAX
Proxy Support
--proxy=<URL>             # Proxy URL
--proxy-file=<PATH>       # Load proxies from file
--rotate-proxies          # Rotate through proxies
Webhooks
--webhook-url=<URL>       # Webhook for notifications
--notify-on-complete      # Notify when complete
--notify-on-error         # Notify on errors
🎯 Advanced Features
1. Database Storage & Change Detection
Track changes over time:
php bin/sitemap \
  --url=https://example.com \
  --db-enabled \
  --detect-changes
The system will:
- Store all URLs in database
- Compare with previous crawl
- Generate change report (new, modified, deleted)
- Track SEO metrics over time
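The comparison step above can be sketched independently of the library. This is a minimal illustration (not the bundled ChangeDetector API) that derives the new/modified/deleted sets by comparing `url => content-hash` maps from two crawls:

```php
<?php
// Hypothetical helper: compare two crawls given as url => content-hash maps.
function detectChanges(array $previous, array $current): array
{
    $new = array_diff_key($current, $previous);       // only in current crawl
    $deleted = array_diff_key($previous, $current);   // only in previous crawl
    $modified = [];
    foreach (array_intersect_key($current, $previous) as $url => $hash) {
        if ($hash !== $previous[$url]) {
            $modified[] = $url;                       // present in both, content changed
        }
    }
    return [
        'new' => array_keys($new),
        'modified' => $modified,
        'deleted' => array_keys($deleted),
    ];
}

$prev = ['/a' => 'h1', '/b' => 'h2', '/c' => 'h3'];
$curr = ['/a' => 'h1', '/b' => 'h2x', '/d' => 'h4'];
print_r(detectChanges($prev, $curr)); // new: /d, modified: /b, deleted: /c
```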
2. Resume Interrupted Crawls
Large crawls can be resumed:
php bin/sitemap \
  --url=https://example.com \
  --resume \
  --checkpoint-interval=1000
3. SEO Analysis
Analyze pages for SEO issues:
php bin/sitemap \
  --url=https://example.com \
  --seo-analysis \
  --find-duplicates \
  --find-broken-links
Reports include:
- Missing title/meta descriptions
- Duplicate content
- Broken links
- Page load times
- Mobile optimization
- Structured data
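To make the first two checks concrete, here is a minimal sketch of how a missing title or meta description can be detected with PHP's DOMDocument. This is assumed illustrative logic, not the package's SeoAnalyzer implementation:

```php
<?php
// Hypothetical helper: flag a missing <title> or meta description.
function basicSeoIssues(string $html): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from imperfect real-world HTML

    $issues = [];
    $titles = $doc->getElementsByTagName('title');
    if ($titles->length === 0 || trim($titles->item(0)->textContent) === '') {
        $issues[] = 'missing-title';
    }

    $hasDescription = false;
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description'
            && trim($meta->getAttribute('content')) !== '') {
            $hasDescription = true;
        }
    }
    if (!$hasDescription) {
        $issues[] = 'missing-meta-description';
    }
    return $issues;
}

print_r(basicSeoIssues('<html><head></head><body>Hi</body></html>'));
```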
4. JavaScript Rendering
For SPAs (React, Vue, Angular):
php bin/sitemap \
  --url=https://spa.example.com \
  --enable-javascript \
  --chrome-path=/usr/bin/chromium \
  --wait-for-ajax=5000
5. Scheduled Crawling
Setup automated crawls:
use IProDev\Sitemap\Scheduler\CronScheduler;

$scheduler = new CronScheduler();
$scheduler->addSchedule('daily-crawl', [
    'url' => 'https://example.com',
    'schedule' => 'daily', // or cron: '0 2 * * *'
    'out' => './sitemaps',
    'db_enabled' => true
]);

// Add to crontab:
// * * * * * php bin/scheduler
6. Webhooks
Get notified of events:
php bin/sitemap \
  --url=https://example.com \
  --webhook-url=https://example.com/webhook \
  --notify-on-complete \
  --notify-on-error
Webhook payload:
{
  "event": "crawl.completed",
  "timestamp": "2025-01-20T10:30:00Z",
  "data": {
    "url": "https://example.com",
    "stats": {
      "pages": 1523,
      "duration": 45.3
    }
  }
}
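On the receiving end, a webhook endpoint only needs to decode this JSON and branch on the event name. The sketch below assumes the payload shape shown above; event names other than `crawl.completed` are not documented here, so unknown events are simply ignored:

```php
<?php
// Hypothetical receiver for the webhook payload documented above.
function handleWebhook(string $json): string
{
    $payload = json_decode($json, true);
    if (!is_array($payload) || !isset($payload['event'])) {
        return 'ignored'; // not a valid payload
    }
    if ($payload['event'] === 'crawl.completed') {
        $pages = $payload['data']['stats']['pages'] ?? 0;
        return "completed: {$pages} pages";
    }
    return 'ignored: ' . $payload['event'];
}

echo handleWebhook('{"event":"crawl.completed","data":{"stats":{"pages":1523}}}');
```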
7. Proxy Support
Use proxies for crawling:
# Single proxy
php bin/sitemap \
  --url=https://example.com \
  --proxy=http://proxy.example.com:8080

# Proxy file with rotation
php bin/sitemap \
  --url=https://example.com \
  --proxy-file=./proxies.txt \
  --rotate-proxies
Proxy file format:
http://proxy1.example.com:8080
http://proxy2.example.com:8080|username:password
socks5://proxy3.example.com:1080
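A line in this format splits naturally into a proxy URL plus optional `|username:password` credentials. The sketch below parses one line with `parse_url()`; the field names in the returned array are illustrative, not the library's ProxyManager API:

```php
<?php
// Hypothetical parser for one line of the proxy file format shown above.
function parseProxyLine(string $line): array
{
    $line = trim($line);
    $auth = null;
    if (str_contains($line, '|')) {
        // Credentials follow the first '|' as username:password.
        [$line, $auth] = explode('|', $line, 2);
    }
    $parts = parse_url($line);
    return [
        'scheme' => $parts['scheme'] ?? 'http',
        'host'   => $parts['host'] ?? '',
        'port'   => $parts['port'] ?? null,
        'auth'   => $auth,
    ];
}

print_r(parseProxyLine('http://proxy2.example.com:8080|username:password'));
```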
8. Smart Filtering
Control what gets crawled:
php bin/sitemap \
  --url=https://example.com \
  --exclude="/admin/*,/test/*,*.pdf" \
  --include="/products/*,/blog/*" \
  --priority-rules='{"homepage":1.0,"/products/*":0.8}'
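The include/exclude semantics can be sketched with PHP's built-in `fnmatch()` glob matching: exclusions win, and when include patterns are given a path must match at least one. This is assumed behavior for illustration, not the package's UrlFilter source:

```php
<?php
// Hypothetical filter: exclusions take precedence; with include patterns
// present, a path must match at least one of them to be crawled.
function isAllowed(string $path, array $include, array $exclude): bool
{
    foreach ($exclude as $pattern) {
        if (fnmatch($pattern, $path)) {
            return false;
        }
    }
    if ($include === []) {
        return true; // no include list means everything not excluded is allowed
    }
    foreach ($include as $pattern) {
        if (fnmatch($pattern, $path)) {
            return true;
        }
    }
    return false;
}

var_dump(isAllowed('/products/widget', ['/products/*'], ['/admin/*', '*.pdf']));
```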
💻 Programmatic Usage
Basic Example
use IProDev\Sitemap\Fetcher;
use IProDev\Sitemap\Crawler;
use IProDev\Sitemap\SitemapWriter;
use IProDev\Sitemap\RobotsTxt;

$fetcher = new Fetcher(['concurrency' => 20]);
$robots = RobotsTxt::fromUrl('https://example.com', $fetcher);
$crawler = new Crawler($fetcher, $robots);
$pages = $crawler->crawl('https://example.com', 10000, 5);
$files = SitemapWriter::write($pages, './sitemaps');
With Database & Change Detection
use IProDev\Sitemap\Database\Database;
use IProDev\Sitemap\ChangeDetector;

// Initialize database
$db = new Database('sqlite:./sitemap.db');
$db->createTables();

// Start crawl
$domain = 'example.com';
$crawlId = $db->startCrawl($domain, 'https://example.com', []);

// Crawl and save
foreach ($pages as $page) {
    $db->saveUrl($crawlId, $page);
}

// Detect changes
$prevCrawl = $db->getPreviousCrawl($domain, $crawlId);
if ($prevCrawl) {
    $detector = new ChangeDetector($db);
    $changes = $detector->detectChanges($prevCrawl['id'], $crawlId);
    print_r($changes);
}
With SEO Analysis
use IProDev\Sitemap\Analyzer\SeoAnalyzer;

$analyzer = new SeoAnalyzer();

foreach ($pages as $page) {
    $analysis = $analyzer->analyze(
        $page['url'],
        $page['html'],
        $page['status_code']
    );
    echo "Score: {$analysis['score']}/100\n";
    echo "Issues: " . count($analysis['issues']) . "\n";
}
With Caching
use IProDev\Sitemap\Cache\FileCache;
use IProDev\Sitemap\Cache\RedisCache;

// File cache
$cache = new FileCache('./cache', 3600);

// Redis cache
$cache = new RedisCache('127.0.0.1', 6379);

// Use in fetcher
$fetcher = new Fetcher(['cache' => $cache]);
⚙️ Configuration
Configuration File
Create sitemap.config.php:
<?php

return [
    'url' => 'https://example.com',
    'out' => './sitemaps',
    'concurrency' => 20,
    'max_pages' => 10000,
    'max_depth' => 5,
    'cache_enabled' => true,
    'db_enabled' => true,
    'seo_analysis' => true,
    'exclude' => ['/admin/*', '/test/*'],
    'priority_rules' => [
        'homepage' => 1.0,
        '/products/*' => 0.8,
        '/blog/*' => 0.6
    ]
];
Use config file:
php bin/sitemap --config=sitemap.config.php
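A common convention, assumed here for illustration, is that the config file provides base values and any CLI flags override them. With the config file returning a plain array, that precedence is a one-line merge (the file's array is inlined below instead of `require`d, so the snippet is self-contained):

```php
<?php
// Assumed precedence sketch: config-file values form the base,
// CLI options layered on top override them.
$fileConfig = [
    'url' => 'https://example.com',
    'concurrency' => 20,
    'max_pages' => 10000,
];
$cliOptions = ['concurrency' => 50]; // e.g. from --concurrency=50
$effective = array_merge($fileConfig, $cliOptions);
// 'concurrency' is now 50; 'url' and 'max_pages' keep their file values.
print_r($effective);
```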
📂 Examples
The examples/ directory contains ready-to-use code examples:
Basic Example (examples/basic.php)
Simple sitemap generation with minimal configuration:
require 'vendor/autoload.php';

use IProDev\Sitemap\Fetcher;
use IProDev\Sitemap\Crawler;
use IProDev\Sitemap\SitemapWriter;
use IProDev\Sitemap\RobotsTxt;

$fetcher = new Fetcher(['concurrency' => 10]);
$robots = RobotsTxt::fromUrl('https://example.com', $fetcher);
$crawler = new Crawler($fetcher, $robots);
$pages = $crawler->crawl('https://example.com', 1000, 3);
SitemapWriter::write($pages, './output');
Advanced Example (examples/advanced.php)
Includes caching, filtering, and rate limiting.
Comprehensive Example (examples/comprehensive.php)
Full-featured example demonstrating all capabilities:
- Database storage with change detection
- SEO analysis and quality checks
- Performance metrics
- Webhook notifications
- Image sitemap generation
- Resume capability
Run the comprehensive example:
php examples/comprehensive.php
📖 API Reference
Core Classes
| Class | Description |
|---|---|
| Fetcher | HTTP client with concurrent request support |
| Crawler | Web crawler with depth and page limits |
| Parser | HTML parser for link extraction |
| RobotsTxt | Robots.txt parser with wildcard support |
| SitemapWriter | XML sitemap generator with gzip compression |
| Utils | Utility functions (URL handling, formatting) |
Advanced Classes
| Class | Description |
|---|---|
| Database | SQLite/MySQL/PostgreSQL storage |
| ChangeDetector | Detect URL changes between crawls |
| SeoAnalyzer | Page-level SEO analysis |
| ContentQualityChecker | Find duplicates, broken links, thin content |
| UrlFilter | Include/exclude patterns with priority rules |
| FileCache / RedisCache | Caching implementations |
| RateLimiter | Request throttling |
| CrawlCheckpoint | Resume interrupted crawls |
| WebhookNotifier | Event notifications |
| ProxyManager | Proxy rotation support |
| PerformanceMetrics | Crawl statistics and reporting |
Sitemap Writers
| Class | Description |
|---|---|
| SitemapWriter | Standard XML sitemap |
| ImageSitemapWriter | Image sitemap with title/caption |
| VideoSitemapWriter | Video sitemap with metadata |
| NewsSitemapWriter | News sitemap for Google News |
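For reference, the standard sitemap format these writers target is the sitemaps.org `urlset` schema. The sketch below builds a minimal document with ext-xml's XMLWriter, which escapes element content safely; it illustrates the output format, not the package's SitemapWriter source:

```php
<?php
// Illustrative generator for the sitemaps.org urlset format.
function writeSitemap(array $urls): string
{
    $w = new XMLWriter();
    $w->openMemory();
    $w->startDocument('1.0', 'UTF-8');
    $w->startElement('urlset');
    $w->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
    foreach ($urls as $url) {
        $w->startElement('url');
        $w->writeElement('loc', $url['loc']); // XMLWriter escapes &, <, > safely
        if (isset($url['lastmod'])) {
            $w->writeElement('lastmod', $url['lastmod']);
        }
        if (isset($url['priority'])) {
            $w->writeElement('priority', number_format($url['priority'], 1));
        }
        $w->endElement();
    }
    $w->endElement();
    $w->endDocument();
    return $w->outputMemory();
}

echo writeSitemap([['loc' => 'https://example.com/', 'priority' => 1.0]]);
```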
🧪 Testing
# Run tests
composer test

# With coverage
composer test-coverage

# Code style
composer lint

# Static analysis
composer analyze

# All checks
composer check
🐳 Docker
# Build
docker build -t sitemap-generator-pro .

# Run
docker run --rm \
  -v $(pwd)/sitemaps:/app/output \
  sitemap-generator-pro \
  --url=https://example.com \
  --out=/app/output
📊 Performance Tips
- Increase Concurrency - for faster crawling: --concurrency=50
- Enable Caching - reduce duplicate requests: --cache-enabled --cache-driver=redis
- Use Database - track changes efficiently: --db-enabled --detect-changes
- Smart Filtering - skip unnecessary pages: --exclude="/admin/*,*.pdf"
- Resume Support - handle large sites: --resume --checkpoint-interval=1000
🔧 Troubleshooting
Common Issues
1. Memory Exhausted Error
# Increase PHP memory limit
php -d memory_limit=512M bin/sitemap --url=https://example.com
2. Too Many Open Connections
# Reduce concurrency
php bin/sitemap --url=https://example.com --concurrency=5
3. Crawl Takes Too Long
# Use checkpoints and resume
php bin/sitemap --url=https://example.com --resume --checkpoint-interval=500
4. Rate Limited by Server
# Enable rate limiting
php bin/sitemap --url=https://example.com --rate-limit=30 --delay=1000
5. JavaScript-rendered Content Not Captured
# Enable JavaScript rendering (requires Chrome/Chromium)
php bin/sitemap --url=https://example.com --enable-javascript --chrome-path=/usr/bin/chromium
6. SSL Certificate Errors
// In code, configure Guzzle to skip verification (not recommended for production)
$fetcher = new Fetcher(['verify' => false]);
Debug Mode
Enable verbose output for debugging:
php bin/sitemap --url=https://example.com --verbose
Logs
When using database storage, check crawl history:
sqlite3 sitemap.db "SELECT * FROM crawls ORDER BY created_at DESC LIMIT 5;"
🔒 Security
- Path traversal prevention
- URL validation and sanitization
- Safe XML generation
- Proxy authentication support
- Rate limiting to prevent blocking
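As an illustration of the URL validation point, a crawler typically rejects anything that is not a well-formed http(s) URL before fetching it. The helper below is an assumed sketch using PHP's filter extension, not the library's own check:

```php
<?php
// Hypothetical validation sketch: accept only well-formed http/https URLs.
function isCrawlableUrl(string $url): bool
{
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false; // malformed URL
    }
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    // Only follow web schemes; rejects javascript:, file:, data:, ftp:, etc.
    return in_array($scheme, ['http', 'https'], true);
}

var_dump(isCrawlableUrl('https://example.com/page')); // true
var_dump(isCrawlableUrl('javascript:alert(1)'));      // false
```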
📝 License
MIT License - see LICENSE.md
🤝 Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
1. Fork the repository
2. Create a feature branch (git checkout -b feature/amazing-feature)
3. Make your changes
4. Run tests (composer check)
5. Commit your changes (git commit -m 'Add amazing feature')
6. Push to the branch (git push origin feature/amazing-feature)
7. Open a Pull Request
Development Setup
# Clone the repository
git clone https://github.com/iprodev/PHP-XML-Sitemap-Generator.git
cd sitemap-generator-pro

# Install dependencies
composer install

# Run tests
composer test

# Check code style
composer lint

# Fix code style issues
composer phpcbf

# Run static analysis
composer analyze
🙏 Credits
Created by iProDev (Hemn Chawroka) - https://github.com/iprodev
📞 Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: Wiki
Made with ❤️ by iProDev (Hemn Chawroka)
Package information
- License: MIT
- Last updated: 2025-10-08