# Text
## count_tokens

Counts the number of tokens in the given text using the specified model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | str | The text to count tokens in. | required |
| model | str \| None | The model to use for token counting. If not provided, the default model is used. | None |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| int | int | The number of tokens in the text. |
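A minimal usage sketch. The import path below is a placeholder for the module this page documents, not a confirmed location:

```python
# Placeholder import path; substitute the module this page documents.
from my_library.text import count_tokens

sentence = "The quick brown fox jumps over the lazy dog."

# With no model argument, the default model's tokenizer is used.
print(count_tokens(sentence))

# An explicit model name selects that model's tokenizer instead.
print(count_tokens(sentence, model="gpt-4o"))
```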
## detokenize

Detokenizes the given tokens using the specified model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| tokens | list[int] | The tokens to detokenize. | required |
| model | str \| None | The model to use for detokenization. If not provided, the default model is used. | None |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| str | str | The detokenized text. |
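A sketch of the round trip with `tokenize` (documented below); the import path is again a placeholder:

```python
# Placeholder import path; substitute the module this page documents.
from my_library.text import tokenize, detokenize

tokens = tokenize("hello world", model="gpt-4o")   # list[int]
text = detokenize(tokens, model="gpt-4o")          # back to a str

assert text == "hello world"
```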
## extract_keywords

Extract keywords from the given text using the yake library.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | str | The text to extract keywords from. | required |

Returns:

| Type | Description |
| --- | --- |
| list[str] | The keywords extracted from the text. |

Raises:

| Type | Description |
| --- | --- |
| ImportError | If yake is not installed. |
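A sketch, assuming the optional yake dependency is installed (otherwise the call raises ImportError); the import path is a placeholder:

```python
# Placeholder import path; substitute the module this page documents.
from my_library.text import extract_keywords

article = (
    "Large language models are trained on large text corpora and are often "
    "fine-tuned for tasks such as summarization and classification."
)

keywords = extract_keywords(article)   # list[str]
print(keywords)                        # e.g. ["language models", "text corpora", ...]
```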
## get_encoding_for_model

Get the tiktoken encoding for the specified model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model | str \| None | The model to get the encoding for. If not provided, the default chat completions model is used. | None |

Returns:

| Type | Description |
| --- | --- |
| tiktoken.Encoding | The encoding for the specified model. |
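The returned object is a standard tiktoken.Encoding, so the usual tiktoken methods apply. A sketch with a placeholder import path:

```python
# Placeholder import path; substitute the module this page documents.
from my_library.text import get_encoding_for_model

encoding = get_encoding_for_model("gpt-4o")     # tiktoken.Encoding

token_ids = encoding.encode("hello world")      # tiktoken's own encode()
print(token_ids)
print(encoding.decode(token_ids))               # "hello world"
```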
## hash_text

`cached`

Hash the given text using the xxhash algorithm.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | str | The text to hash. | () |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| str | str | The hash of the text. |
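A sketch (placeholder import path); because the function is cached, repeated calls with the same input return the same digest without recomputing it:

```python
# Placeholder import path; substitute the module this page documents.
from my_library.text import hash_text

digest = hash_text("some content to fingerprint")

# Deterministic: the same input always yields the same xxhash digest,
# and the cache means repeat calls skip the hashing work entirely.
assert digest == hash_text("some content to fingerprint")
print(digest)
```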
## slice_tokens

Slices the given text to the specified number of tokens.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | str | The text to slice. | required |
| n_tokens | int | The number of tokens to slice the text to. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| str | str | The sliced text. |
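A sketch (placeholder import path) showing a text trimmed to a token budget:

```python
# Placeholder import path; substitute the module this page documents.
from my_library.text import slice_tokens, count_tokens

text = "one two three four five six seven eight nine ten"

head = slice_tokens(text, n_tokens=5)
print(head)                 # roughly the first five tokens' worth of text
print(count_tokens(head))   # should not exceed 5
```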
## split_text

Split a text into a list of strings. Chunks are split by tokens.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | str | The text to split. | required |
| chunk_size | int | The number of tokens in each chunk. | required |
| chunk_overlap | float \| None | The fraction of overlap between chunks. | None |
| last_chunk_threshold | float \| None | If the last chunk is less than this fraction of the chunk_size, it will be added to the prior chunk. | None |

Returns:

| Type | Description |
| --- | --- |
| list[str] | The list of chunks. |
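A sketch (placeholder import path); the overlap and last-chunk threshold are expressed as fractions of chunk_size:

```python
# Placeholder import path; substitute the module this page documents.
from my_library.text import split_text

document = "word " * 500  # stand-in for a real document

chunks = split_text(
    document,
    chunk_size=100,             # tokens per chunk
    chunk_overlap=0.1,          # 10% token overlap between consecutive chunks
    last_chunk_threshold=0.25,  # fold a trailing chunk under 25% of chunk_size into the previous one
)

print(len(chunks))
```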
## tokenize

Tokenizes the given text using the specified model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | str | The text to tokenize. | required |
| model | str \| None | The model to use for tokenization. If not provided, the default model is used. | None |

Returns:

| Type | Description |
| --- | --- |
| list[int] | The tokenized text as a list of integers. |
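A sketch (placeholder import path); the length of the returned token list is expected to match what count_tokens reports for the same text and model:

```python
# Placeholder import path; substitute the module this page documents.
from my_library.text import tokenize, count_tokens

tokens = tokenize("hello world", model="gpt-4o")
print(tokens)   # list[int] of token ids

# Expected to agree with count_tokens for the same text and model.
print(len(tokens) == count_tokens("hello world", model="gpt-4o"))
```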