webrek/tokenizers 问题修复 & 功能扩展

解决BUG、新增功能、兼容多环境部署,快速响应你的开发需求

邮箱:yvsm@zunyunkeji.com | QQ:316430983 | 微信:yvsm316

webrek/tokenizers

Composer 安装命令:

pie install webrek/tokenizers

包简介

Native byte-level BPE (tiktoken-compatible: cl100k_base/o200k_base) plus WordPiece and Unigram tokenizers for PHP 8.3+, with a process-shared vocab cache. Loads HuggingFace tokenizer.json models and counts Claude/Gemini tokens via their APIs.

README 文档

README

A native PHP extension that counts, encodes, and decodes LLM tokens — byte-exact with the reference tokenizers, fast, and with no Rust toolchain. Plus a pure-PHP companion that counts Claude and Gemini tokens through their official APIs.

CI docs version php thread safety license

Think of it as a scale for AI text: before you send a prompt to an LLM, weigh it in tokens so you know what it will cost and whether it fits the model's context window — all from PHP, exactly, without rebuilding a 26 MB vocabulary on every request.

use Tokenizers\Encoding;

$enc = Encoding::load('cl100k_base');                 // GPT-4 / GPT-4o-class encoding
echo $enc->countTokens('Hello, world! 🎉');           // 7

Why this extension?

Property Pure-PHP tiktoken port This extension
Memory per worker ~26 MB vocab rebuilt every request Loaded once per worker process
Worst-case latency O(n²) per pre-token piece O(n log n) heap-based merge
Install Pure PHP Single .sono Rust, no ffi.enable
Accuracy Approximate Byte-exact vs the reference (tiktoken / BERT / T5)
Models OpenAI only OpenAI + BERT + T5 + Llama/Mistral/Qwen… + Claude/Gemini via API

The wins are memory, worst-case latency, accuracy, and installability — not raw throughput on tiny inputs. For prompts that fit in a tweet, a pure-PHP port can be faster (no extension-call overhead). This extension is for workloads where the 26 MB-per-worker overhead, adversarial inputs, or byte-exactness actually matter.

Supported models

Locally, byte-exact

Algorithm Models How to load
BPE (tiktoken) GPT-4, GPT-4o (text), o1, o3 Encoding::load('cl100k_base')
BPE (tiktoken) GPT-4o (multimodal), o1 mini/pro Encoding::load('o200k_base')
BPE (HuggingFace) GPT-2, RoBERTa, Llama 3, Mistral (tekken), Qwen, DeepSeek Encoding::fromHuggingFace('tokenizer.json')
WordPiece BERT family (bert-base-uncased, …) Encoding::fromHuggingFace('tokenizer.json')
Unigram T5, ALBERT (SentencePiece) Encoding::fromHuggingFace('tokenizer.json')

Conformance is verified byte-for-byte against Python tiktoken, HuggingFace BertTokenizerFast, and t5-small. Any diff against the committed fixtures fails CI. See Status & conformance.

Remotely (no public tokenizer)

Claude 3+ and Gemini do not publish their tokenizers — there is no local vocabulary to load. The pure-PHP companion (Tokenizers\Remote\Anthropic, Tokenizers\Remote\Gemini, Tokenizers\TokenCounter) counts their tokens through the providers' official count_tokens endpoints. It works without building the C extension. See the Remote providers guide.

Install

phpize && ./configure && make && make install

Then enable it in your php.ini:

extension=tokenizers

Verify:

php -m | grep tokenizers          # → tokenizers

pecl install tokenizers and pie install webrek/tokenizers are also supported. Full instructions, requirements (libpcre2), and troubleshooting are in Getting Started.

Quick start

use Tokenizers\Encoding;

// OpenAI encoding (vocab downloads + caches on first use)
$enc = Encoding::load('cl100k_base');
$n   = $enc->countTokens($prompt);     // count without allocating the id array
$ids = $enc->encode($prompt);          // int[]
$str = $enc->decode($ids);             // round-trips for plain text

// HuggingFace model — returns Bpe | WordPiece | Unigram by model type
$bert = Encoding::fromHuggingFace('/path/to/bert/tokenizer.json');
$t5   = Encoding::fromHuggingFace('/path/to/t5/tokenizer.json');

// One facade for local + remote, routed by model name
use Tokenizers\TokenCounter;
$tc = new TokenCounter();
$tc->count('cl100k_base',     $text);  // local BPE, no key
$tc->count('claude-opus-4-8', $text);  // remote Anthropic (needs ANTHROPIC_API_KEY)
$tc->count('gemini-1.5-flash',$text);  // remote Gemini   (needs GEMINI_API_KEY)

Documentation

Guide What it covers
Getting Started Install, enable, verify, first tokenization, troubleshooting
Loading models OpenAI / HuggingFace BPE, WordPiece, Unigram, options, the cache
Estimating LLM costs Budget spend, fit context windows, track usage per client
Remote providers (Claude / Gemini) Counting tokens via the provider APIs
API reference Every class, method, and function
Status & limitations What's verified, conformance results, honest limits, roadmap

Project status

v0.1.0, early but functional. All three planned phases are complete and merged:

  • BPE (cl100k_base, o200k_base, HuggingFace BPE) — byte-exact, O(n log n) merge.
  • WordPiece (BERT) and Unigram (T5/SentencePiece) — byte-exact.
  • Claude / Gemini API companion — pure PHP, standalone.

Honest caveats live in Status & limitations (normalization scope, PIE install not yet verified end-to-end, remote counting needs a network call + key).

License

Apache-2.0 — see LICENSE. Vocabulary files for built-in encodings are downloaded from OpenAI's public CDN at runtime, checksum-verified, and not redistributed with the extension.

统计信息

  • 总下载量: 0
  • 月度下载量: 0
  • 日度下载量: 0
  • 收藏数: 0
  • 点击次数: 2
  • 依赖项目数: 0
  • 推荐数: 0

GitHub 信息

  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • 开发语言: C

其他信息

  • 授权协议: Apache-2.0
  • 更新时间: 2026-06-29