# raggy.utilities.text

## count_tokens

Counts the number of tokens in the given text using the specified model.

**Parameters:**

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text` | `str` | The text to count tokens in. | *required* |
| `model` | `str \| None` | The model to use for token counting. If not provided, the default model is used. | `None` |

**Returns:**

| Name | Type | Description |
| --- | --- | --- |
| `int` | `int` | The number of tokens in the text. |
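The contract is that the count equals the length of the token list produced for the chosen model. A minimal sketch of that contract, using a naive whitespace tokenizer as a stand-in (a real model encoding, such as tiktoken's, would generally yield a different count):

```python
def ws_count_tokens(text: str) -> int:
    # Stand-in for count_tokens: the count is just the length of the token list.
    # Here a "token" is a whitespace-delimited piece, purely for illustration.
    return len(text.split())


print(ws_count_tokens("This is a sample text."))  # 5
```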

## detokenize

Detokenizes the given tokens using the specified model.

**Parameters:**

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `tokens` | `list[int]` | The tokens to detokenize. | *required* |
| `model` | `str \| None` | The model to use for detokenization. If not provided, the default model is used. | `None` |

**Returns:**

| Name | Type | Description |
| --- | --- | --- |
| `str` | `str` | The detokenized text. |
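Detokenization inverts tokenization for the same model: `detokenize(tokenize(text))` round-trips the text. A minimal sketch of that round-trip, using a hypothetical whitespace tokenizer with an explicit vocabulary rather than raggy's model-backed encodings:

```python
def ws_tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    # Assign an integer id to each whitespace-delimited piece; repeats reuse their id.
    return [vocab.setdefault(piece, len(vocab)) for piece in text.split()]


def ws_detokenize(tokens: list[int], vocab: dict[str, int]) -> str:
    # Invert the id mapping and rejoin the pieces.
    reverse = {i: piece for piece, i in vocab.items()}
    return " ".join(reverse[t] for t in tokens)


vocab: dict[str, int] = {}
tokens = ws_tokenize("the cat sat on the mat", vocab)
print(tokens)                        # [0, 1, 2, 3, 0, 4]
print(ws_detokenize(tokens, vocab))  # the cat sat on the mat
```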

## extract_keywords

Extract keywords from the given text using the `yake` library.

**Parameters:**

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text` | `str` | The text to extract keywords from. | *required* |

**Returns:**

| Type | Description |
| --- | --- |
| `list[str]` | The keywords extracted from the text. |

**Raises:**

| Type | Description |
| --- | --- |
| `ImportError` | If `yake` is not installed. |

**Example:**

Extract keywords from a text:

```python
from raggy.utilities.text import extract_keywords

text = "This is a sample text from which we will extract keywords."
keywords = extract_keywords(text)
print(keywords)  # ['keywords', 'sample', 'text', 'extract']
```

## get_encoding_for_model

Get the tiktoken encoding for the specified model.

**Parameters:**

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model` | `str \| None` | The model to get the encoding for. If not provided, the default chat completions model is used (as specified in `raggy.settings`). If an invalid model is provided, `'gpt-3.5-turbo'` is used. | `None` |

**Returns:**

| Type | Description |
| --- | --- |
| `tiktoken.Encoding` | The encoding for the specified model. |

**Example:**

Get the encoding for the default chat completions model:

```python
from raggy.utilities.text import get_encoding_for_model

encoding = get_encoding_for_model()  # 'gpt-3.5-turbo' by default
```

## hash_text `cached`

Hash the given text using the xxhash algorithm.

**Parameters:**

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text` | `str` | The text to hash. | `()` |

**Returns:**

| Name | Type | Description |
| --- | --- | --- |
| `str` | `str` | The hash of the text. |

**Example:**

Hash a single text:

```python
from raggy.utilities.text import hash_text

text = "This is a sample text."
hash_ = hash_text(text)
print(hash_)  # 4a2db845d20188ce069196726a065a09
```

## slice_tokens

Slices the given text to the specified number of tokens.

**Parameters:**

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text` | `str` | The text to slice. | *required* |
| `n_tokens` | `int` | The number of tokens to slice the text to. | *required* |

**Returns:**

| Name | Type | Description |
| --- | --- | --- |
| `str` | `str` | The sliced text. |

**Example:**

Slice a text to the first 50 tokens:

```python
from raggy.utilities.text import slice_tokens

text = "This is a sample text." * 100
sliced_text = slice_tokens(text, 5)
print(sliced_text)  # 'This is a sample text.'
```
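The operation is presumably encode, slice, decode. A rough sketch of that pipeline, with a whitespace "tokenizer" standing in for the model encoding (an assumption for illustration only):

```python
def ws_slice_tokens(text: str, n_tokens: int) -> str:
    # Stand-in for slice_tokens: tokenize, keep the first n tokens, detokenize.
    words = text.split()
    return " ".join(words[:n_tokens])


print(ws_slice_tokens("This is a sample text. " * 100, 5))  # This is a sample text.
```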

## split_text

Split a text into a list of strings. Chunks are split by tokens.

**Parameters:**

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text` | `str` | The text to split. | *required* |
| `chunk_size` | `int` | The number of tokens in each chunk. | *required* |
| `chunk_overlap` | `float \| None` | The fraction of overlap between chunks. | `None` |
| `last_chunk_threshold` | `float \| None` | If the last chunk is smaller than this fraction of `chunk_size`, it is merged into the prior chunk. | `None` |

**Returns:**

| Type | Description |
| --- | --- |
| `list[str]` | The list of chunks. |

**Example:**

Split a text into chunks of 5 tokens with 10% overlap:

```python
from raggy.utilities.text import split_text

text = "This is a sample text." * 3
chunks = split_text(text, 5, 0.1)
print(chunks)  # ['This is a sample text', '.This is a sample text', '.This is a sample text.']
```
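The interplay of `chunk_size`, `chunk_overlap`, and `last_chunk_threshold` can be sketched over a raw token list. This `chunk_tokens` helper is a hypothetical re-implementation inferred from the parameter descriptions above, not raggy's actual code:

```python
def chunk_tokens(tokens: list[int], chunk_size: int,
                 chunk_overlap: float = 0.0,
                 last_chunk_threshold: float = 0.25) -> list[list[int]]:
    overlap = int(chunk_size * chunk_overlap)
    stride = chunk_size - overlap  # how far each new chunk advances
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), stride)]
    # Merge a too-small final chunk into the prior one, skipping the overlapped prefix.
    if len(chunks) > 1 and len(chunks[-1]) < chunk_size * last_chunk_threshold:
        tail = chunks.pop()
        chunks[-1] = chunks[-1] + tail[overlap:]
    return chunks


print(chunk_tokens(list(range(12)), 5))
# [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11]]
```

With `chunk_overlap=0.4`, each chunk advances only 3 tokens, so consecutive chunks share 2 tokens; raising `last_chunk_threshold` folds a short trailing chunk back into its predecessor.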

## tokenize

Tokenizes the given text using the specified model.

**Parameters:**

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text` | `str` | The text to tokenize. | *required* |
| `model` | `str \| None` | The model to use for tokenization. If not provided, the default model is used. | `None` |

**Returns:**

| Type | Description |
| --- | --- |
| `list[int]` | The tokenized text as a list of integers. |
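Tokenization maps text to a list of integer ids, deterministically for a given model. A toy sketch of the id-assignment idea using a whitespace tokenizer (an assumption for illustration; the real function uses the model's encoding, so actual ids and boundaries differ):

```python
def ws_tokenize(text: str) -> list[int]:
    # Toy tokenizer: ids assigned in order of first appearance; repeats reuse their id.
    vocab: dict[str, int] = {}
    return [vocab.setdefault(piece, len(vocab)) for piece in text.split()]


print(ws_tokenize("the cat sat on the mat"))  # [0, 1, 2, 3, 0, 4]
```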