README

A Symfony bundle that extracts structured data from HTML using large language model (LLM) providers, with a plugin architecture for custom extractors.

Features

LLM-Based Extraction: Uses LLM providers (starting with Jina Reader) to extract structured data from HTML
Type-Safe DTOs: Define extraction schemas using PHP attributes on your DTOs
Hybrid Extraction: Combine LLM extraction with code-based extraction: use the LLM for complex fields and DomCrawler/XPath for simple structured data
Extensible: Plugin architecture allows custom extractors for specific use cases
Cacheable: Built-in caching support for LLM responses
Logging: Optional logging for LLM requests/responses and cache operations
Configurable: Flexible configuration for different LLM providers and caching strategies

Installation

composer require llm-html-extractor/symfony-bundle

Configuration

Create or update config/packages/llm_html_extractor.yaml:

llm_html_extractor:
    llm_client:
        client: jina_reader  # or any service ID for custom client
        jina_reader:
            model: 'jinaai/readerlm-v2'  # or 'jinaai/readerlm-v1.5'
            default_temperature: 0.0  # Default temperature for LLM requests (0.0 = deterministic)
            default_max_tokens: 64000  # Default max tokens for LLM responses
            http_client:
                base_uri: 'https://r.jina.ai'
                api_key: '%env(JINA_API_KEY)%'
                timeout: 600
                headers:
                    X-Custom-Header: 'value'
    cache:
        enabled: true
        ttl: 36000  # 10 hours
        pool: 'cache.app'
    logs:
        enabled: true  # default: false
        logger: 'logger'  # service ID of the logger to use

Alternatively, you can use an existing HTTP client service:

llm_html_extractor:
    llm_client:
        client: jina_reader
        jina_reader:
            model: 'jinaai/readerlm-v2'
            http_client: 'my_custom_http_client_service'  # Service ID implementing HttpClientInterface

Using a Custom LLM Client

To use your own LLM client implementation, set the client option to your service ID:

llm_html_extractor:
    llm_client:
        client: 'app.my_custom_llm_client'  # Your service ID
    cache:
        enabled: true  # Will automatically wrap your client with caching

Your custom client must implement LlmHtmlExtractor\SymfonyBundle\Client\LlmClientInterface. The bundle will validate this during container compilation and throw a clear error if the interface is not implemented.

Logging

The bundle can log the following for debugging and monitoring:

Request/Response Logging: When logs.enabled is true, all LLM requests and responses are logged at the info level
Cache Operations: Cache hits and misses are logged when both caching and logging are enabled
Error Logging: Failed LLM requests are logged at the error level with exception details

The decorators are applied in this order, from innermost to outermost:

Base LLM Client (e.g., JinaReaderLlmClient)
LoggingLlmClient (if logs enabled) - logs requests/responses
CacheableLlmClient (if cache enabled) - logs cache hits/misses

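Because CacheableLlmClient wraps LoggingLlmClient, request/response log entries appear only for calls that actually reach the LLM (cache misses). Responses served from the cache never pass through the logging decorator.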
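The effect of this ordering can be reproduced with a small, self-contained sketch. The SketchLlmClient interface and all three classes below are hypothetical stand-ins written for illustration; they are not the bundle's LlmClientInterface, LoggingLlmClient, or CacheableLlmClient, whose real signatures may differ.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in; the bundle's real LlmClientInterface may differ.
interface SketchLlmClient
{
    public function complete(string $prompt): string;
}

// Innermost layer: pretends to call the LLM.
final class FakeBaseClient implements SketchLlmClient
{
    public function complete(string $prompt): string
    {
        return 'response to: ' . $prompt;
    }
}

// Middle layer: records every call that reaches it.
final class SketchLoggingClient implements SketchLlmClient
{
    /** @var list<string> */
    public array $log = [];

    public function __construct(private SketchLlmClient $inner) {}

    public function complete(string $prompt): string
    {
        $this->log[] = 'request: ' . $prompt;

        return $this->inner->complete($prompt);
    }
}

// Outermost layer: answers repeated prompts from an in-memory cache.
final class SketchCachingClient implements SketchLlmClient
{
    /** @var array<string, string> */
    private array $cache = [];

    public function __construct(private SketchLlmClient $inner) {}

    public function complete(string $prompt): string
    {
        $key = hash('sha256', $prompt);

        // Only call the inner client when the key is not cached yet.
        return $this->cache[$key] ??= $this->inner->complete($prompt);
    }
}

$logging = new SketchLoggingClient(new FakeBaseClient());
$client = new SketchCachingClient($logging);

$first = $client->complete('extract title');
$second = $client->complete('extract title'); // served from the cache

echo count($logging->log), PHP_EOL; // prints 1: only the cache miss was logged
```

Swapping the two decorators would log every call, including cache hits, which is presumably why the cache sits on the outside.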

Usage

1. Define Your Extraction DTO

use LlmHtmlExtractor\SymfonyBundle\Attribute\AsLlmExtractableProperty;

class ArticleExtractionResult
{
    public function __construct(
        #[AsLlmExtractableProperty('Extract the article title')]
        public string $title,

        #[AsLlmExtractableProperty('Extract the author name')]
        public string $author,

        #[AsLlmExtractableProperty('Extract publication date in YYYY-MM-DD format')]
        public string $publishedAt,

        #[AsLlmExtractableProperty('Extract the main article content')]
        public string $content,
    ) {}
}
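To make the attribute-based schema idea concrete, here is a minimal sketch of how per-property instructions could be collected with PHP reflection. It is an illustration only: SketchExtractable, SketchArticle, and collectInstructions() are hypothetical names, and the bundle's own handling of AsLlmExtractableProperty may work differently.

```php
<?php
declare(strict_types=1);

// Hypothetical attribute standing in for AsLlmExtractableProperty.
#[Attribute(Attribute::TARGET_PROPERTY | Attribute::TARGET_PARAMETER)]
final class SketchExtractable
{
    public function __construct(public string $instruction) {}
}

// Hypothetical DTO mirroring the shape of ArticleExtractionResult.
final class SketchArticle
{
    public function __construct(
        #[SketchExtractable('Extract the article title')]
        public string $title,

        #[SketchExtractable('Extract the author name')]
        public string $author,
    ) {}
}

/**
 * Collects "property name => instruction" for every annotated property.
 *
 * @param class-string $className
 * @return array<string, string>
 */
function collectInstructions(string $className): array
{
    $instructions = [];
    foreach ((new ReflectionClass($className))->getProperties() as $property) {
        foreach ($property->getAttributes(SketchExtractable::class) as $attribute) {
            // newInstance() builds the attribute object from its declared arguments.
            $instructions[$property->getName()] = $attribute->newInstance()->instruction;
        }
    }

    return $instructions;
}

$schema = collectInstructions(SketchArticle::class);
print_r($schema);
```

A map like this is one plausible input for building the prompt or JSON schema sent to the LLM.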

2. Use the Extraction Handler

use LlmHtmlExtractor\SymfonyBundle\Extractor\ExtractionHandler;

class ArticleScraper
{
    public function __construct(
        private ExtractionHandler $extractionHandler,
    ) {}

    public function scrape(string $html): ArticleExtractionResult
    {
        return $this->extractionHandler->handle(
            ArticleExtractionResult::class,
            $html
        );
    }
}

3. Create Custom Extractors (Optional)

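For specific extraction needs, implement FromHtmlExtractorInterface. The example below targets a pdfUrls property, which is assumed to exist on ArticleExtractionResult even though the DTO in step 1 does not declare it: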

use LlmHtmlExtractor\SymfonyBundle\Extractor\FromHtmlExtractorInterface;
use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\DependencyInjection\Attribute\AutoconfigureTag;

#[AutoconfigureTag('llm_extractor.extractor', ['priority' => 50])]
class CustomPdfUrlExtractor implements FromHtmlExtractorInterface
{
    public function extract(string $html, array $context = []): mixed
    {
        // Parse the HTML and collect the href of every link that points to a PDF.
        $crawler = new Crawler($html);

        return $crawler->filterXPath('//a[contains(@href, ".pdf")]')
            ->each(fn (Crawler $node) => $node->attr('href'));
    }

    public function supports(string $className, string $propertyName): bool
    {
        // Handle only the pdfUrls property of ArticleExtractionResult.
        return $className === ArticleExtractionResult::class
            && $propertyName === 'pdfUrls';
    }
}
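The combination of a tag priority and a supports() check suggests a "first matching extractor wins" lookup. The sketch below shows that pattern in isolation. Everything in it is an assumption made for illustration: the SketchExtractor interface is a reduced, hypothetical contract, the fallback extractor merely stands in for "leave this field to the LLM", and the bundle's actual resolution logic is not documented here. Keying the array by priority is also a simplification that does not allow two extractors with the same priority.

```php
<?php
declare(strict_types=1);

// Hypothetical, reduced extractor contract for this sketch only.
interface SketchExtractor
{
    public function supports(string $className, string $propertyName): bool;

    public function extract(string $html): mixed;
}

final class SketchPdfUrlExtractor implements SketchExtractor
{
    public function supports(string $className, string $propertyName): bool
    {
        return $propertyName === 'pdfUrls';
    }

    public function extract(string $html): mixed
    {
        // Simplified regex match; the example above uses DomCrawler and XPath instead.
        preg_match_all('/href="([^"]*\.pdf[^"]*)"/', $html, $matches);

        return $matches[1];
    }
}

final class SketchFallbackExtractor implements SketchExtractor
{
    public function supports(string $className, string $propertyName): bool
    {
        return true; // accepts every property
    }

    public function extract(string $html): mixed
    {
        return null; // stands in for "leave this field to the LLM"
    }
}

/**
 * Returns the highest-priority extractor that supports the given property.
 *
 * @param array<int, SketchExtractor> $extractorsByPriority priority => extractor
 */
function resolveExtractor(array $extractorsByPriority, string $className, string $propertyName): ?SketchExtractor
{
    krsort($extractorsByPriority); // highest priority first
    foreach ($extractorsByPriority as $extractor) {
        if ($extractor->supports($className, $propertyName)) {
            return $extractor;
        }
    }

    return null;
}

$extractors = [0 => new SketchFallbackExtractor(), 50 => new SketchPdfUrlExtractor()];
$html = '<a href="/files/report.pdf">Report</a><a href="/about">About</a>';

$pdfUrls = resolveExtractor($extractors, 'ArticleExtractionResult', 'pdfUrls')->extract($html);
$titleExtractor = resolveExtractor($extractors, 'ArticleExtractionResult', 'title');
```

Here the pdfUrls property is handled by the priority-50 extractor, while title falls through to the fallback.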

Supported LLM Providers

Currently supported:

Jina Reader (jinaai/readerlm-v2, jinaai/readerlm-v1.5)
- Uses the OpenAI-compatible chat completions endpoint exposed by vLLM (/openai/v1/chat/completions)
- Tested with Runpod serverless deployments
- Should work with any vLLM deployment that follows the OpenAI API standard
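For orientation, a request body in the OpenAI chat completions format looks roughly like the one built below. The field names (model, messages, temperature, max_tokens) are part of that format; the prompt text and the buildChatCompletionBody() helper are assumptions made for this sketch, and the bundle's actual prompt may differ.

```php
<?php
declare(strict_types=1);

/**
 * Builds a JSON request body in the OpenAI chat completions format.
 * Hypothetical helper for illustration; not part of the bundle.
 */
function buildChatCompletionBody(string $model, string $html, float $temperature, int $maxTokens): string
{
    return json_encode([
        'model' => $model,
        'messages' => [
            // Assumed prompt shape; the bundle's real prompt is not documented here.
            ['role' => 'user', 'content' => "Extract the main content from this HTML:\n" . $html],
        ],
        'temperature' => $temperature,
        'max_tokens' => $maxTokens,
    ], JSON_THROW_ON_ERROR | JSON_PRESERVE_ZERO_FRACTION);
}

// Values taken from the configuration example earlier in this README.
$body = buildChatCompletionBody('jinaai/readerlm-v2', '<h1>Hello</h1>', 0.0, 64000);
$decoded = json_decode($body, true, 512, JSON_THROW_ON_ERROR);

echo $body, PHP_EOL;
```

The body would be sent as a POST request to the chat completions path of the deployment, with the API key in an Authorization header.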

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
