定制 yooper/php-text-analysis 二次开发

按需修改功能、优化性能、对接业务系统,提供一站式技术支持

邮箱:yvsm@zunyunkeji.com | QQ:316430983 | 微信:yvsm316

yooper/php-text-analysis

最新稳定版本:1.9.2

Composer 安装命令:

composer require yooper/php-text-analysis

包简介

PHP Text Analysis is a library for performing Information Retrieval (IR) and Natural Language Processing (NLP) tasks using the PHP language

README 文档

README

alt text

Latest Stable Version

Total Downloads

PHP Text Analysis is a library for performing Information Retrieval (IR) and Natural Language Processing (NLP) tasks using the PHP language. There are tools in this library that can perform:

  • document classification
  • sentiment analysis
  • compare documents
  • frequency analysis
  • tokenization
  • stemming
  • collocations with Pointwise Mutual Information
  • lexical diversity
  • corpus analysis
  • text summarization

All the documentation for this project can be found in the book and wiki.

PHP Text Analysis Book & Wiki

A book is in the works and your contributions are needed. You can find the book at https://github.com/yooper/php-text-analysis-book

Also, documentation for the library resides in the wiki, too. https://github.com/yooper/php-text-analysis/wiki

Installation Instructions

Add PHP Text Analysis to your project

composer require yooper/php-text-analysis

Tokenization

$tokens = tokenize($text);

You can customize which type of tokenizer to tokenize with by passing in the name of the tokenizer class

$tokens = tokenize($text, \TextAnalysis\Tokenizers\PennTreeBankTokenizer::class);

The default tokenizer is \TextAnalysis\Tokenizers\GeneralTokenizer::class . Some tokenizers require parameters to be set upon instantiation.

Normalization

By default, normalize_tokens uses the function strtolower to lowercase all the tokens. To customize the normalize function, pass in either a function or a string to be used by array_map.

$normalizedTokens = normalize_tokens(array $tokens); 
$normalizedTokens = normalize_tokens(array $tokens, 'mb_strtolower');

$normalizedTokens = normalize_tokens(array $tokens, function($token){ return mb_strtoupper($token); });

Frequency Distributions

The call to freq_dist returns a FreqDist instance.

$freqDist = freq_dist(tokenize($text));

Ngram Generation

By default bigrams are generated.

$bigrams = ngrams($tokens);

Customize the ngrams

// create trigrams with a pipe delimiter in between each word
$trigrams = ngrams($tokens,3, '|');

Stemming

By default stem method uses the Porter Stemmer.

$stemmedTokens = stem($tokens);

You can customize which type of stemmer to use by passing in the name of the stemmer class name

$stemmedTokens = stem($tokens, \TextAnalysis\Stemmers\MorphStemmer::class);

Keyword Extract with Rake

There is a short cut method for using the Rake algorithm. You will need to clean your data prior to using. Second parameter is the ngram size of your keywords to extract.

$rake = rake($tokens, 3);
$results = $rake->getKeywordScores();

Sentiment Analysis with Vader

Need Sentiment Analysis with PHP Use Vader, https://github.com/cjhutto/vaderSentiment . The PHP implementation can be invoked easily. Just normalize your data before hand.

$sentimentScores = vader($tokens);

Document Classification with Naive Bayes

Need to do some document classification with PHP, trying using the Naive Bayes implementation. An example of classifying movie reviews can be found in the unit tests

$nb = naive_bayes();
$nb->train('mexican', tokenize('taco nacho enchilada burrito'));        
$nb->train('american', tokenize('hamburger burger fries pop'));  
$nb->predict(tokenize('my favorite food is a burrito'));

统计信息

  • 总下载量: 370.64k
  • 月度下载量: 0
  • 日度下载量: 0
  • 收藏数: 539
  • 点击次数: 1
  • 依赖项目数: 3
  • 推荐数: 0

GitHub 信息

  • Stars: 531
  • Watchers: 40
  • Forks: 92
  • 开发语言: PHP

其他信息

  • 授权协议: MIT
  • 更新时间: 2016-04-12