Module token_cleaner

Utility routines for sanitizing code before chunking/embedding.

Functions

clean_and_redact
Cleans and redacts code in the proper order:
clean_code
Cleans code by removing comments, imports, console logs, and excessive whitespace.
count_tokens
Counts tokens using a simple word-split approach, returning the number of space-separated tokens. This is a basic implementation; for production use, consider integrating a real tokenizer such as the Llama 3 tokenizer.
redact_secrets
Redacts sensitive information from code (API keys, tokens, passwords, etc.).
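
The word-split counting described for count_tokens can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's actual source: the signature `fn count_tokens(text: &str) -> usize` is hypothetical, and the sketch splits on any Unicode whitespace via `str::split_whitespace` rather than on single spaces only.

```rust
/// Hypothetical sketch of a word-split token counter.
/// Assumption: the real `count_tokens` may use a different signature
/// and may split strictly on spaces instead of on all whitespace.
pub fn count_tokens(text: &str) -> usize {
    // `split_whitespace` skips leading, trailing, and repeated whitespace,
    // so empty strings and runs of blanks never yield empty tokens.
    text.split_whitespace().count()
}
```

Under this sketch, `count_tokens("fn main() {}")` yields 3, because `fn`, `main()`, and `{}` are the whitespace-separated pieces. A model tokenizer would usually split such input into more, smaller units, which is why the count should be treated as an estimate.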