ze/tokenizer-gpt3
Latest stable version: v1.1
Composer install command:
composer require ze/tokenizer-gpt3
Package description
PHP package for Byte Pair Encoding (BPE) used by GPT-3.
README
This project is a fork of Gioni06/GPT3Tokenizer, with a few changes for PHP 7.4 compatibility.
This is a PHP port of the GPT-3 tokenizer. It is based on the original Python implementation and the Node.js implementation.
GPT-2 and GPT-3 use a technique called byte pair encoding (BPE) to convert text into a sequence of integers, which are then used as input for the model. When you interact with the OpenAI API, you may find it useful to calculate the number of tokens in a given text before sending it to the API.
If you want to learn more, read the Summary of the tokenizers from Hugging Face.
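To build intuition for what byte pair encoding does, here is a minimal, self-contained sketch of a single BPE merge step in plain PHP: count adjacent symbol pairs, pick the most frequent one, and merge its occurrences. This is purely illustrative and is not the package's actual implementation (which uses a pretrained vocabulary and merge table).

```php
<?php
// Illustrative BPE merge step (not the package's implementation):
// find the most frequent adjacent symbol pair, then merge it.

function mostFrequentPair(array $symbols): ?array
{
    $counts = [];
    for ($i = 0; $i < count($symbols) - 1; $i++) {
        // Use "\0" as a separator that cannot appear inside a symbol here.
        $key = $symbols[$i] . "\0" . $symbols[$i + 1];
        $counts[$key] = ($counts[$key] ?? 0) + 1;
    }
    if (!$counts) {
        return null;
    }
    arsort($counts); // sort by frequency, highest first
    return explode("\0", array_key_first($counts));
}

function mergePair(array $symbols, array $pair): array
{
    $out = [];
    $i = 0;
    while ($i < count($symbols)) {
        if ($i < count($symbols) - 1
            && $symbols[$i] === $pair[0]
            && $symbols[$i + 1] === $pair[1]) {
            $out[] = $pair[0] . $pair[1]; // merge the pair into one symbol
            $i += 2;
        } else {
            $out[] = $symbols[$i];
            $i++;
        }
    }
    return $out;
}

$symbols = str_split('aaab');            // ['a', 'a', 'a', 'b']
$pair = mostFrequentPair($symbols);      // ['a', 'a'] occurs twice
$merged = mergePair($symbols, $pair);    // ['aa', 'a', 'b']
```

A real tokenizer repeats this step many times during training and then replays the learned merges, in order, when encoding new text.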
Installation
composer require ze/tokenizer-gpt3
Use the configuration class
$defaultConfig = new Gpt3TokenizerConfig();

$customConfig = new Gpt3TokenizerConfig();
$customConfig
    ->vocabPath('custom_vocab.json')
    ->mergesPath('custom_merges.txt')
    ->useCache(false);
A note on caching
The tokenizer will try to use apcu for caching, if that is not available it will use a plain PHP array.
You will see slightly better performance for long texts when using the cache. The cache is enabled by default.
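If you want to know which cache backend will be used on your setup, you can check for the apcu extension with standard PHP functions (this check is plain PHP, not part of the package's API):

```php
<?php
// If the apcu extension is loaded and enabled, the tokenizer can use it;
// otherwise it falls back to a plain PHP array cache.
$apcuAvailable = extension_loaded('apcu') && apcu_enabled();

echo $apcuAvailable
    ? "apcu cache available\n"
    : "falling back to plain PHP array cache\n";
```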
Encode a text
$config = new Gpt3TokenizerConfig();
$tokenizer = new Gpt3Tokenizer($config);
$text = "This is some text";
$tokens = $tokenizer->encode($text); // [1212, 318, 617, 2420]
Decode a text
$config = new Gpt3TokenizerConfig();
$tokenizer = new Gpt3Tokenizer($config);
$tokens = [1212, 318, 617, 2420];
$text = $tokenizer->decode($tokens); // "This is some text"
Count the number of tokens in a text
$config = new Gpt3TokenizerConfig();
$tokenizer = new Gpt3Tokenizer($config);
$text = "This is some text";
$numberOfTokens = $tokenizer->count($text); // 4
License
This project is licensed under the Apache License 2.0. See the LICENSE file for more information.
Statistics
- Total downloads: 6.51k
- Monthly downloads: 0
- Daily downloads: 0
- Favorites: 1
- Views: 1
- Dependent projects: 0
- Suggesters: 0
Other information
- License: Apache-2.0
- Last updated: 2023-02-28