# raggy.utilities.text

## count_tokens

Counts the number of tokens in the given text using the specified model.

**Parameters:**

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text` | `str` | The text to count tokens in. | *required* |
| `model` | `str \| None` | The model to use for token counting. If not provided, the default model is used. | `None` |

**Returns:**

| Name | Type | Description |
| --- | --- | --- |
| `int` | `int` | The number of tokens in the text. |
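The contract is that the count equals the length of the token list produced for the chosen model. A minimal sketch of that contract, using a naive whitespace tokenizer as a stand-in (a real model encoding, such as tiktoken's, would generally yield a different count):

```python
def ws_count_tokens(text: str) -> int:
    # Stand-in for count_tokens: the count is just the length of the token list.
    # Here a "token" is a whitespace-delimited piece, purely for illustration.
    return len(text.split())


print(ws_count_tokens("This is a sample text."))  # 5
```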

## detokenize

Detokenizes the given tokens using the specified model.

**Parameters:**

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `tokens` | `list[int]` | The tokens to detokenize. | *required* |
| `model` | `str \| None` | The model to use for detokenization. If not provided, the default model is used. | `None` |

**Returns:**

| Name | Type | Description |
| --- | --- | --- |
| `str` | `str` | The detokenized text. |
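Detokenization inverts tokenization for the same model: `detokenize(tokenize(text))` round-trips the text. A minimal sketch of that round-trip, using a hypothetical whitespace tokenizer with an explicit vocabulary rather than raggy's model-backed encodings:

```python
def ws_tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    # Assign an integer id to each whitespace-delimited piece; repeats reuse their id.
    return [vocab.setdefault(piece, len(vocab)) for piece in text.split()]


def ws_detokenize(tokens: list[int], vocab: dict[str, int]) -> str:
    # Invert the id mapping and rejoin the pieces.
    reverse = {i: piece for piece, i in vocab.items()}
    return " ".join(reverse[t] for t in tokens)


vocab: dict[str, int] = {}
tokens = ws_tokenize("the cat sat on the mat", vocab)
print(tokens)                        # [0, 1, 2, 3, 0, 4]
print(ws_detokenize(tokens, vocab))  # the cat sat on the mat
```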

## extract_keywords

Extract keywords from the given text using the `yake` library.

**Parameters:**

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text` | `str` | The text to extract keywords from. | *required* |

**Returns:**

| Type | Description |
| --- | --- |
| `list[str]` | The keywords extracted from the text. |

**Raises:**

| Type | Description |
| --- | --- |
| `ImportError` | If `yake` is not installed. |

**Example:**

Extract keywords from a text:

```python
from raggy.utilities.text import extract_keywords

text = "This is a sample text from which we will extract keywords."
keywords = extract_keywords(text)
print(keywords)  # ['keywords', 'sample', 'text', 'extract']
```

## get_encoding_for_model

Get the tiktoken encoding for the specified model.

**Parameters:**

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model` | `str \| None` | The model to get the encoding for. If not provided, the default chat completions model is used (as specified in `raggy.settings`). If an invalid model is provided, `'gpt-3.5-turbo'` is used. | `None` |

**Returns:**

| Type | Description |
| --- | --- |
| `tiktoken.Encoding` | The encoding for the specified model. |

**Example:**

Get the encoding for the default chat completions model:

```python
from raggy.utilities.text import get_encoding_for_model

encoding = get_encoding_for_model()  # 'gpt-3.5-turbo' by default
```

## hash_text `cached`

Hash the given text using the xxhash algorithm.

**Parameters:**

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text` | `str` | The text to hash. | `()` |

**Returns:**

| Name | Type | Description |
| --- | --- | --- |
| `str` | `str` | The hash of the text. |

**Example:**

Hash a single text:

```python
from raggy.utilities.text import hash_text

text = "This is a sample text."
hash_ = hash_text(text)
print(hash_)  # 4a2db845d20188ce069196726a065a09
```

## slice_tokens

Slices the given text to the specified number of tokens.

**Parameters:**

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text` | `str` | The text to slice. | *required* |
| `n_tokens` | `int` | The number of tokens to slice the text to. | *required* |

**Returns:**

| Name | Type | Description |
| --- | --- | --- |
| `str` | `str` | The sliced text. |

**Example:**

Slice a text to the first 50 tokens:

```python
from raggy.utilities.text import slice_tokens

text = "This is a sample text." * 100
sliced_text = slice_tokens(text, 5)
print(sliced_text)  # 'This is a sample text.'
```
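The operation is presumably encode, slice, decode. A rough sketch of that pipeline, with a whitespace "tokenizer" standing in for the model encoding (an assumption for illustration only):

```python
def ws_slice_tokens(text: str, n_tokens: int) -> str:
    # Stand-in for slice_tokens: tokenize, keep the first n tokens, detokenize.
    words = text.split()
    return " ".join(words[:n_tokens])


print(ws_slice_tokens("This is a sample text. " * 100, 5))  # This is a sample text.
```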

## split_text

Split a text into a list of strings. Chunks are split by tokens.

**Parameters:**

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text` | `str` | The text to split. | *required* |
| `chunk_size` | `int` | The number of tokens in each chunk. | *required* |
| `chunk_overlap` | `float \| None` | The fraction of overlap between chunks. | `None` |
| `last_chunk_threshold` | `float \| None` | If the last chunk is smaller than this fraction of `chunk_size`, it is merged into the prior chunk. | `None` |

**Returns:**

| Type | Description |
| --- | --- |
| `list[str]` | The list of chunks. |

**Example:**

Split a text into chunks of 5 tokens with 10% overlap:

```python
from raggy.utilities.text import split_text

text = "This is a sample text." * 3
chunks = split_text(text, 5, 0.1)
print(chunks)  # ['This is a sample text', '.This is a sample text', '.This is a sample text.']
```
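The interplay of `chunk_size`, `chunk_overlap`, and `last_chunk_threshold` can be sketched over a raw token list. This `chunk_tokens` helper is a hypothetical re-implementation inferred from the parameter descriptions above, not raggy's actual code:

```python
def chunk_tokens(tokens: list[int], chunk_size: int,
                 chunk_overlap: float = 0.0,
                 last_chunk_threshold: float = 0.25) -> list[list[int]]:
    overlap = int(chunk_size * chunk_overlap)
    stride = chunk_size - overlap  # how far each new chunk advances
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), stride)]
    # Merge a too-small final chunk into the prior one, skipping the overlapped prefix.
    if len(chunks) > 1 and len(chunks[-1]) < chunk_size * last_chunk_threshold:
        tail = chunks.pop()
        chunks[-1] = chunks[-1] + tail[overlap:]
    return chunks


print(chunk_tokens(list(range(12)), 5))
# [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11]]
```

With `chunk_overlap=0.4`, each chunk advances only 3 tokens, so consecutive chunks share 2 tokens; raising `last_chunk_threshold` folds a short trailing chunk back into its predecessor.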

## tokenize

Tokenizes the given text using the specified model.

**Parameters:**

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text` | `str` | The text to tokenize. | *required* |
| `model` | `str \| None` | The model to use for tokenization. If not provided, the default model is used. | `None` |

**Returns:**

| Type | Description |
| --- | --- |
| `list[int]` | The tokenized text as a list of integers. |
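Tokenization maps text to a list of integer ids, deterministically for a given model. A toy sketch of the id-assignment idea using a whitespace tokenizer (an assumption for illustration; the real function uses the model's encoding, so actual ids and boundaries differ):

```python
def ws_tokenize(text: str) -> list[int]:
    # Toy tokenizer: ids assigned in order of first appearance; repeats reuse their id.
    vocab: dict[str, int] = {}
    return [vocab.setdefault(piece, len(vocab)) for piece in text.split()]


print(ws_tokenize("the cat sat on the mat"))  # [0, 1, 2, 3, 0, 4]
```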