llm-html-extractor/symfony-bundle
Latest stable version: 0.1
Composer install command:
composer require llm-html-extractor/symfony-bundle
Package description
Symfony bundle for extracting structured data from HTML using LLM providers
README
A powerful Symfony bundle for extracting structured data from HTML using LLM (Large Language Model) providers with a plugin architecture.
Features
- LLM-Based Extraction: Uses LLM providers (starting with Jina Reader) to extract structured data from HTML
- Type-Safe DTOs: Define extraction schemas using PHP attributes on your DTOs
- Hybrid Extraction: Easily combine LLM extraction with code-based extraction - use AI for complex fields and DomCrawler/XPath for simple structured data
- Extensible: Plugin architecture allows custom extractors for specific use cases
- Cacheable: Built-in caching support for LLM responses
- Logging: Optional logging for LLM requests/responses and cache operations
- Configurable: Flexible configuration for different LLM providers and caching strategies
Installation
composer require llm-html-extractor/symfony-bundle
Configuration
Create or update config/packages/llm_html_extractor.yaml:
llm_html_extractor:
    llm_client:
        client: jina_reader # or any service ID for custom client
        jina_reader:
            model: 'jinaai/readerlm-v2' # or 'jinaai/readerlm-v1.5'
            default_temperature: 0.0 # Default temperature for LLM requests (0.0 = deterministic)
            default_max_tokens: 64000 # Default max tokens for LLM responses
            http_client:
                base_uri: 'https://r.jina.ai'
                api_key: '%env(JINA_API_KEY)%'
                timeout: 600
                headers:
                    X-Custom-Header: 'value'
        cache:
            enabled: true
            ttl: 36000 # 10 hours
            pool: 'cache.app'
        logs:
            enabled: true # default: false
            logger: 'logger' # service ID of the logger to use
Alternatively, you can use an existing HTTP client service:
llm_html_extractor:
    llm_client:
        client: jina_reader
        jina_reader:
            model: 'jinaai/readerlm-v2'
            http_client: 'my_custom_http_client_service' # Service ID implementing HttpClientInterface
Using a Custom LLM Client
To use your own LLM client implementation, just set the client parameter to your service ID:
llm_html_extractor:
    llm_client:
        client: 'app.my_custom_llm_client' # Your service ID
        cache:
            enabled: true # Will automatically wrap your client with caching
Your custom client must implement LlmHtmlExtractor\SymfonyBundle\Client\LlmClientInterface. The bundle will validate this during container compilation and throw a clear error if the interface is not implemented.
Logging
The bundle provides comprehensive logging for debugging and monitoring:
- Request/Response Logging: When logs.enabled: true, all LLM requests and responses are logged at info level
- Cache Operations: Cache hits and misses are logged when both caching and logging are enabled
- Error Logging: Failed LLM requests are logged at error level with exception details
The decorators are applied in this order:
- Base LLM Client (e.g., JinaReaderLlmClient)
- LoggingLlmClient (if logs enabled) - logs requests/responses
- CacheableLlmClient (if cache enabled) - logs cache hits/misses
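Because the logging decorator sits inside the caching decorator, request/response log entries are written only for calls that actually reach the LLM (cache misses). Responses served from the cache appear only as cache-hit entries.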
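The effect of this ordering can be illustrated with a small, self-contained sketch in plain PHP. Every name in it (ClientSketch, FakeLlmClient, LoggingClientSketch, CachingClientSketch, and the complete() method) is a hypothetical stand-in; the bundle's real LlmClientInterface and decorator classes are not shown in this README and may look different.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the bundle's client interface (assumption:
// the real interface and its method names are not shown in this README).
interface ClientSketch
{
    public function complete(string $prompt): string;
}

// Innermost layer: pretends to call the LLM and counts the calls.
final class FakeLlmClient implements ClientSketch
{
    public int $calls = 0;

    public function complete(string $prompt): string
    {
        $this->calls++;

        return 'response to: ' . $prompt;
    }
}

// Middle layer: records every request that reaches it.
final class LoggingClientSketch implements ClientSketch
{
    /** @var list<string> */
    public array $log = [];

    public function __construct(private ClientSketch $inner) {}

    public function complete(string $prompt): string
    {
        $this->log[] = 'request: ' . $prompt;

        return $this->inner->complete($prompt);
    }
}

// Outermost layer: answers repeated prompts from an in-memory array.
final class CachingClientSketch implements ClientSketch
{
    /** @var array<string, string> */
    private array $cache = [];

    public function __construct(private ClientSketch $inner) {}

    public function complete(string $prompt): string
    {
        // The inner client is only called when the prompt is not cached yet.
        return $this->cache[$prompt] ??= $this->inner->complete($prompt);
    }
}

$llm = new FakeLlmClient();
$logging = new LoggingClientSketch($llm);
$client = new CachingClientSketch($logging);

$client->complete('extract title');
$client->complete('extract title'); // second call is served from the cache

// prints "1 logged request(s), 1 LLM call(s)"
echo count($logging->log), ' logged request(s), ', $llm->calls, " LLM call(s)\n";
```

The second call never reaches the logging layer, which is why only cache misses would show up as logged requests under this arrangement.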
Usage
1. Define Your Extraction DTO
use LlmHtmlExtractor\SymfonyBundle\Attribute\AsLlmExtractableProperty;

class ArticleExtractionResult
{
    public function __construct(
        #[AsLlmExtractableProperty('Extract the article title')]
        public string $title,

        #[AsLlmExtractableProperty('Extract the author name')]
        public string $author,

        #[AsLlmExtractableProperty('Extract publication date in YYYY-MM-DD format')]
        public string $publishedAt,

        #[AsLlmExtractableProperty('Extract the main article content')]
        public string $content,
    ) {}
}
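For readers unfamiliar with attribute-driven schemas, the following sketch shows how per-property instructions can be read from a DTO with PHP's reflection API. It is not the bundle's implementation: FieldPrompt, ArticleSketch, and collectPrompts() are hypothetical names used only for illustration, and how the bundle actually processes AsLlmExtractableProperty is not documented here.

```php
<?php
declare(strict_types=1);

// Hypothetical attribute standing in for AsLlmExtractableProperty.
#[Attribute(Attribute::TARGET_PROPERTY | Attribute::TARGET_PARAMETER)]
final class FieldPrompt
{
    public function __construct(public string $prompt) {}
}

// Hypothetical DTO using promoted constructor properties.
final class ArticleSketch
{
    public function __construct(
        #[FieldPrompt('Extract the article title')]
        public string $title,
        #[FieldPrompt('Extract the author name')]
        public string $author,
        public int $wordCount = 0, // no attribute, so it is skipped below
    ) {}
}

/**
 * Collects the prompt of every property carrying a FieldPrompt attribute.
 *
 * @return array<string, string> property name => prompt text
 */
function collectPrompts(string $class): array
{
    $prompts = [];
    foreach ((new ReflectionClass($class))->getProperties() as $property) {
        foreach ($property->getAttributes(FieldPrompt::class) as $attribute) {
            $prompts[$property->getName()] = $attribute->newInstance()->prompt;
        }
    }

    return $prompts;
}

print_r(collectPrompts(ArticleSketch::class));
```

In this sketch only title and author end up in the collected schema, since wordCount carries no attribute.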
2. Use the Extraction Handler
use LlmHtmlExtractor\SymfonyBundle\Extractor\ExtractionHandler;

class ArticleScraper
{
    public function __construct(
        private ExtractionHandler $extractionHandler,
    ) {}

    public function scrape(string $html): ArticleExtractionResult
    {
        return $this->extractionHandler->handle(
            ArticleExtractionResult::class,
            $html
        );
    }
}
3. Create Custom Extractors (Optional)
For specific extraction needs, implement the FromHtmlExtractorInterface:
use LlmHtmlExtractor\SymfonyBundle\Extractor\FromHtmlExtractorInterface;
use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\DependencyInjection\Attribute\AutoconfigureTag;

#[AutoconfigureTag('llm_extractor.extractor', ['priority' => 50])]
class CustomPdfUrlExtractor implements FromHtmlExtractorInterface
{
    public function extract(string $html, array $context = []): mixed
    {
        $crawler = new Crawler($html);

        return $crawler->filterXPath('//a[contains(@href, ".pdf")]')
            ->each(fn($node) => $node->attr('href'));
    }

    public function supports(string $className, string $propertyName): bool
    {
        return $className === ArticleExtractionResult::class
            && $propertyName === 'pdfUrls';
    }
}
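To see what the XPath expression in the extractor above selects, here is a standalone sketch that runs the same query against a small HTML sample. It deliberately uses PHP's built-in DOM extension (DOMDocument and DOMXPath, which require ext-dom) instead of Symfony's DomCrawler so that it runs without any dependencies; the function name extractPdfUrls() and the sample markup are made up for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Returns the href of every link whose URL contains ".pdf".
 *
 * @return list<string>
 */
function extractPdfUrls(string $html): array
{
    // Collect libxml parse warnings internally instead of emitting them.
    libxml_use_internal_errors(true);

    $document = new DOMDocument();
    $document->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($document);

    $urls = [];
    // Same XPath expression as in the extractor example.
    foreach ($xpath->query('//a[contains(@href, ".pdf")]') as $link) {
        $urls[] = $link->getAttribute('href');
    }

    return $urls;
}

// Made-up sample page: two PDF links and one ordinary link.
$html = <<<'HTML'
<html><body>
  <a href="/files/report.pdf">Report</a>
  <a href="/about">About</a>
  <a href="https://example.com/paper.pdf?download=1">Paper</a>
</body></html>
HTML;

print_r(extractPdfUrls($html));
```

Note that contains() is a substring match, so it also accepts URLs where ".pdf" is followed by a query string, as the third link shows.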
Supported LLM Providers
Currently supported:
- Jina Reader (jinaai/readerlm-v2, jinaai/readerlm-v1.5)
  - Uses vLLM OpenAI API standard endpoint (/openai/v1/chat/completions)
  - Tested with Runpod serverless deployments
  - Compatible with any vLLM deployment following the OpenAI API standard
License
MIT
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Statistics
- Total downloads: 2
- Monthly downloads: 0
- Daily downloads: 0
- Favorites: 0
- Clicks: 0
- Dependents: 0
- Suggesters: 0
Other information
- License: MIT
- Last updated: 2025-10-19