Helpers
yooper edited this page Oct 28, 2017
·
5 revisions
Helper functions simplify common text analysis tasks. To split a string into tokens, call tokenize:
$tokens = tokenize($text);
To use a different tokenizer, pass the name of the tokenizer class as the second argument:
$tokens = tokenize($text, \TextAnalysis\Tokenizers\PennTreeBankTokenizer::class);
The default tokenizer is \TextAnalysis\Tokenizers\GeneralTokenizer::class. Some tokenizers require parameters to be set upon instantiation.
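To illustrate the idea of selecting a tokenizer by class name, here is a minimal, self-contained sketch. It does not use the library: `WhitespaceSketchTokenizer`, `CommaSketchTokenizer`, and `sketch_tokenize` are hypothetical names invented for this example, and the library's own helper may be implemented differently.

```php
<?php
// Hypothetical stand-in tokenizers; these are NOT classes from the library.
class WhitespaceSketchTokenizer
{
    public function tokenize(string $text): array
    {
        // Split on runs of whitespace and drop empty strings.
        return preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    }
}

class CommaSketchTokenizer
{
    public function tokenize(string $text): array
    {
        // Split on commas, trim each piece, and drop empty pieces.
        $parts = array_map('trim', explode(',', $text));
        return array_values(array_filter($parts, 'strlen'));
    }
}

// Hypothetical helper: instantiate the named class and delegate to it.
function sketch_tokenize(string $text, string $tokenizerClass = WhitespaceSketchTokenizer::class): array
{
    $tokenizer = new $tokenizerClass();
    return $tokenizer->tokenize($text);
}

$words  = sketch_tokenize("The quick  brown fox");
$fields = sketch_tokenize("a, b,, c", CommaSketchTokenizer::class);
```

In this sketch the helper can only build tokenizers whose constructors take no required arguments; a tokenizer that needs parameters would have to be instantiated directly and called through its own tokenize method.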
By default, normalize_tokens lowercases every token with strtolower. To use a different normalizer, pass either a function name as a string or an anonymous function as the second argument; it is applied to each token with array_map.
$normalizedTokens = normalize_tokens($tokens);
$normalizedTokens = normalize_tokens($tokens, 'mb_strtolower');
$normalizedTokens = normalize_tokens($tokens, function($token){ return mb_strtoupper($token); });
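The behavior described above can be approximated in plain PHP. This is only a sketch of the array_map idea: `sketch_normalize_tokens` is a hypothetical name, not the library's function, and the punctuation-stripping normalizer is an invented example.

```php
<?php
// Hypothetical sketch; not the library's implementation.
// $normalizer may be a function name (string) or any other callable.
function sketch_normalize_tokens(array $tokens, $normalizer = 'strtolower'): array
{
    // Apply the normalizer to every token and return the new array.
    return array_map($normalizer, $tokens);
}

$tokens = ['Hello,', 'WORLD!'];

// Default: lowercase each token.
$lower = sketch_normalize_tokens($tokens);

// Function name passed as a string.
$upper = sketch_normalize_tokens($tokens, 'strtoupper');

// Anonymous function: lowercase and strip surrounding punctuation.
$clean = sketch_normalize_tokens($tokens, function ($token) {
    return strtolower(trim($token, '.,!?'));
});
```

Because the normalizer is just a callable, several steps (case folding, trimming, and so on) can be combined in one anonymous function instead of making several passes over the tokens.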
The call to freq_dist returns a FreqDist instance.
$freqDist = freq_dist(tokenize($text));
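For readers unfamiliar with the term, a frequency distribution counts how often each token occurs. The sketch below shows the concept with built-in PHP functions only; it does not use or reproduce the FreqDist class, and the variable names are illustrative.

```php
<?php
$tokens = ['the', 'cat', 'saw', 'the', 'dog', 'the', 'cat'];

// Count the occurrences of each distinct token.
$counts = array_count_values($tokens);

// Sort by count, highest first, keeping the token keys.
arsort($counts);

// Convert raw counts to relative frequencies (count / total tokens).
$total = count($tokens);
$weights = [];
foreach ($counts as $token => $count) {
    $weights[$token] = $count / $total;
}
```

Here 'the' occurs 3 times out of 7 tokens, so it sorts first and has a relative frequency of 3/7.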
By default, ngrams generates bigrams.
$bigrams = ngrams($tokens);
To change the n-gram size or the delimiter, pass them as the second and third arguments:
// create trigrams with a pipe delimiter in between each word
$trigrams = ngrams($tokens, 3, '|');
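To make the output shape concrete, here is a minimal sliding-window sketch of n-gram generation in plain PHP. `sketch_ngrams` is a hypothetical name and not the library's implementation; the single-space default delimiter is an assumption, since the page does not state the library's default.

```php
<?php
// Hypothetical sketch; not the library's implementation.
// The ' ' default separator is an assumption made for this example.
function sketch_ngrams(array $tokens, int $n = 2, string $separator = ' '): array
{
    $ngrams = [];
    // Index of the last window start; negative when there are fewer than $n tokens.
    $last = count($tokens) - $n;
    for ($i = 0; $i <= $last; $i++) {
        // Take $n consecutive tokens and join them with the separator.
        $ngrams[] = implode($separator, array_slice($tokens, $i, $n));
    }
    return $ngrams;
}

$tokens = ['the', 'quick', 'brown', 'fox'];

$bigrams  = sketch_ngrams($tokens);          // ['the quick', 'quick brown', 'brown fox']
$trigrams = sketch_ngrams($tokens, 3, '|');  // ['the|quick|brown', 'quick|brown|fox']
```

A list of k tokens yields k - n + 1 n-grams, and an empty array when k is smaller than n.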