aisk/mistral-tokenizer
Latest stable version: 0.2.0
Composer install command:
composer require aisk/mistral-tokenizer
Package description
PHP implementation of Mistral's tokenizers
README
A PHP implementation of the Mistral tokenizers, focusing on the Tekken tokenizer.
Developed by Aisk.
Installation
composer require aisk/mistral-tokenizer
Features
- Text tokenization and detokenization
- Token counting
- Batch processing
- Special token handling
- Unicode character support
Usage
Basic Usage
<?php
require_once 'vendor/autoload.php';

use Aisk\Tokenizer\TokenizerFactory;

// Create a tokenizer factory
$factory = new TokenizerFactory();

// Get the Tekken tokenizer
$tokenizer = $factory->getTekkenTokenizer('240911');

// Encode a string
$text = 'Hello, world!';
$tokens = $tokenizer->encode($text);
echo "Encoded '{$text}' to " . count($tokens) . " tokens: " . implode(', ', $tokens) . PHP_EOL;

// Count tokens
$count = $tokenizer->countTokens($text);
echo "'{$text}' has {$count} tokens" . PHP_EOL;

// Decode tokens back to text
$decoded = $tokenizer->decode($tokens);
echo "Decoded tokens back to: '{$decoded}'" . PHP_EOL;
Available Tokenizers
- Tekken Tokenizer: A byte-pair encoding tokenizer optimized for Mistral AI models
- Supports both '240718' and '240911' versions
Batch Processing
<?php
$texts = [
    'Hello, world!',
    'How are you?',
    'Tokenizer test.'
];

// Encode in batch
$tokensBatch = $tokenizer->encodeBatch($texts);

// Decode in batch
$decodedTexts = $tokenizer->decodeBatch($tokensBatch);
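The README does not say how the batch methods relate to the single-item ones. A reasonable assumption is that they map `encode` and `decode` over the input array, so a batch round trip returns the original texts. The sketch below illustrates that property with a hypothetical stand-in class, `ToyByteTokenizer`, which emits one token per byte; it is not the package's tokenizer and its token IDs will not match Tekken's.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in, NOT the package's class: one token per byte.
 * Only the method names mirror those shown in the README.
 */
final class ToyByteTokenizer
{
    /** @return int[] Byte values of $text. */
    public function encode(string $text): array
    {
        return $text === '' ? [] : array_map('ord', str_split($text));
    }

    /** @param int[] $tokens Byte values to join back into a string. */
    public function decode(array $tokens): string
    {
        return implode('', array_map('chr', $tokens));
    }

    /**
     * Assumed batch semantics: apply encode() to each text independently.
     *
     * @param string[] $texts
     * @return int[][]
     */
    public function encodeBatch(array $texts): array
    {
        return array_map(fn (string $t): array => $this->encode($t), $texts);
    }

    /**
     * Assumed batch semantics: apply decode() to each token list independently.
     *
     * @param int[][] $tokensBatch
     * @return string[]
     */
    public function decodeBatch(array $tokensBatch): array
    {
        return array_map(fn (array $t): string => $this->decode($t), $tokensBatch);
    }
}

$toy = new ToyByteTokenizer();
$texts = ['Hello, world!', 'How are you?', 'Tokenizer test.'];

$tokensBatch = $toy->encodeBatch($texts);
$roundTrip = $toy->decodeBatch($tokensBatch);

// The batch round trip should reproduce the input texts.
var_dump($roundTrip === $texts);
```

If the real tokenizer behaves the same way, comparing `decodeBatch(encodeBatch($texts))` with `$texts` makes a quick sanity check after installation.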
Command Line Tool
The package includes a command line tool for tokenizing text:
# Tokenize text from the command line
./vendor/bin/tokenize "Hello, world!"

# Specify a tokenizer version
./vendor/bin/tokenize -v 240718 "Hello, world!"

# Tokenize text from a file
./vendor/bin/tokenize -f input.txt

# Only show the token count
./vendor/bin/tokenize -c "Hello, world!"
Implementation Notes
This implementation provides a simplified tokenizer for Mistral models. The current version uses character-based tokenization for testing purposes. For production use, you would need to implement the full BPE algorithm with the model's merges.
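To give a sense of what the full BPE step involves, here is a minimal sketch of a rank-based byte-pair merge loop. It is not this package's code. The function name `bpeEncode` and the `$toyRanks` vocabulary are hypothetical, and the real Tekken merge table is not reproduced here. The sketch also skips steps a production tokenizer would typically need, such as pre-splitting the text and reserving IDs for special tokens.

```php
<?php
declare(strict_types=1);

/**
 * Toy byte-level BPE encoder (illustrative only).
 *
 * @param string             $text  Input text, treated as raw bytes.
 * @param array<string, int> $ranks Hypothetical vocabulary: byte string => rank.
 *                                  A lower rank is merged earlier; the rank
 *                                  doubles as the token ID.
 * @return int[] Token IDs.
 */
function bpeEncode(string $text, array $ranks): array
{
    // Start with one symbol per byte.
    $symbols = $text === '' ? [] : str_split($text);

    while (count($symbols) > 1) {
        $bestRank = null;
        $bestIndex = null;

        // Find the adjacent pair whose concatenation has the lowest rank.
        for ($i = 0; $i < count($symbols) - 1; $i++) {
            $candidate = $symbols[$i] . $symbols[$i + 1];
            if (isset($ranks[$candidate])
                && ($bestRank === null || $ranks[$candidate] < $bestRank)) {
                $bestRank = $ranks[$candidate];
                $bestIndex = $i;
            }
        }

        if ($bestIndex === null) {
            break; // No mergeable pair is left.
        }

        // Replace the two symbols at $bestIndex with their concatenation.
        $merged = $symbols[$bestIndex] . $symbols[$bestIndex + 1];
        array_splice($symbols, $bestIndex, 2, [$merged]);
    }

    // Map each remaining symbol to its ID.
    $ids = [];
    foreach ($symbols as $symbol) {
        if (!isset($ranks[$symbol])) {
            throw new InvalidArgumentException(
                'Symbol not in toy vocabulary: ' . bin2hex($symbol)
            );
        }
        $ids[] = $ranks[$symbol];
    }

    return $ids;
}

// Hypothetical toy vocabulary: five single bytes plus three merges.
$toyRanks = [
    'l' => 0, 'o' => 1, 'w' => 2, 'e' => 3, 'r' => 4,
    'lo' => 5, 'low' => 6, 'er' => 7,
];

echo implode(', ', bpeEncode('lower', $toyRanks)) . PHP_EOL; // 6, 7
```

With this vocabulary, `lower` starts as five single-byte symbols and is merged to `lo w e r`, then `low e r`, then `low er`, so two tokens come out where a character-based tokenizer would produce five.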
Testing
Run the tests with PHPUnit:
./vendor/bin/phpunit
License
MIT
Statistics
- Total downloads: 7.81k
- Monthly downloads: 0
- Daily downloads: 0
- Favorites: 1
- Views: 1
- Dependents: 0
- Suggesters: 0
Other information
- License: MIT
- Last updated: 2025-04-22