承接 aisk/mistral-tokenizer 相关项目开发

从需求分析到上线部署,全程专人跟进,保证项目质量与交付效率

邮箱:yvsm@zunyunkeji.com | QQ:316430983 | 微信:yvsm316

aisk/mistral-tokenizer

最新稳定版本:0.2.0

Composer 安装命令:

composer require aisk/mistral-tokenizer

包简介

PHP implementation of Mistral's tokenizers

README 文档

README

A PHP implementation of the Mistral tokenizers, focusing on the Tekken tokenizer.

Developed by Aisk.

Installation

composer require aisk/mistral-tokenizer

Features

  • Text tokenization and detokenization
  • Token counting
  • Batch processing
  • Special token handling
  • Unicode character support

Usage

Basic Usage

<?php

require_once 'vendor/autoload.php';

use Aisk\Tokenizer\TokenizerFactory;

// Create a tokenizer factory
$factory = new TokenizerFactory();

// Get the Tekken tokenizer
$tokenizer = $factory->getTekkenTokenizer('240911');

// Encode a string
$text = 'Hello, world!';
$tokens = $tokenizer->encode($text);
echo "Encoded '{$text}' to " . count($tokens) . " tokens: " . implode(', ', $tokens) . PHP_EOL;

// Count tokens
$count = $tokenizer->countTokens($text);
echo "'{$text}' has {$count} tokens" . PHP_EOL;

// Decode tokens back to text
$decoded = $tokenizer->decode($tokens);
echo "Decoded tokens back to: '{$decoded}'" . PHP_EOL;

Available Tokenizers

  • Tekken Tokenizer: A byte-pair encoding tokenizer optimized for Mistral AI models
    • Supports both '240718' and '240911' versions

Batch Processing

<?php

$texts = [
    'Hello, world!',
    'How are you?',
    'Tokenizer test.'
];

// Encode in batch
$tokensBatch = $tokenizer->encodeBatch($texts);

// Decode in batch
$decodedTexts = $tokenizer->decodeBatch($tokensBatch);

Command Line Tool

The package includes a command line tool for tokenizing text:

# Tokenize text from the command line
./vendor/bin/tokenize "Hello, world!"

# Specify a tokenizer version
./vendor/bin/tokenize -v 240718 "Hello, world!"

# Tokenize text from a file
./vendor/bin/tokenize -f input.txt

# Only show the token count
./vendor/bin/tokenize -c "Hello, world!"

Implementation Notes

This implementation provides a simplified tokenizer for Mistral models. The current version uses a character-based tokenization for testing purposes. In a production environment, you would need to implement the full BPE algorithm with the model's merges.

Testing

Run the tests with PHPUnit:

./vendor/bin/phpunit

License

MIT

统计信息

  • 总下载量: 7.81k
  • 月度下载量: 0
  • 日度下载量: 0
  • 收藏数: 1
  • 点击次数: 1
  • 依赖项目数: 0
  • 推荐数: 0

GitHub 信息

  • Stars: 1
  • Watchers: 0
  • Forks: 0
  • 开发语言: PHP

其他信息

  • 授权协议: MIT
  • 更新时间: 2025-04-22