Utility routines for sanitizing code before chunking/embedding.
Functions
- clean_and_redact - Cleans and redacts code in the proper order:
- clean_code - Cleans code by removing comments, imports, console logs, and excessive whitespace.
- count_tokens - Counts tokens using a simple word-split approach. This basic implementation counts space-separated tokens. For production use, consider integrating an actual tokenizer, such as the one used by Llama 3.
- redact_secrets - Redacts sensitive information from code (API keys, tokens, passwords, etc.).
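
To make the word-split counting and the redaction idea concrete, here is a minimal sketch. This page does not show the real signatures or implementations, so the function names `count_tokens_sketch` and `redact_secrets_sketch`, their parameters, the key list, and the `[REDACTED]` placeholder are all assumptions made for illustration. The counting sketch splits on any whitespace, which is slightly broader than splitting on spaces alone.

```rust
/// Hypothetical stand-in for `count_tokens` (real signature not shown on this page).
/// Estimates the token count by counting whitespace-separated words.
fn count_tokens_sketch(text: &str) -> usize {
    // `split_whitespace` skips empty pieces, so runs of spaces, tabs,
    // and newlines do not inflate the count.
    text.split_whitespace().count()
}

/// Hypothetical stand-in for `redact_secrets` (real rules not shown on this page).
/// On each line, if the text before the first `=` or `:` mentions a sensitive
/// key name, everything after that separator is replaced with a placeholder.
fn redact_secrets_sketch(code: &str) -> String {
    // Assumed key names; a real implementation would likely use richer patterns.
    const SENSITIVE_KEYS: [&str; 4] = ["api_key", "token", "password", "secret"];

    code.lines()
        .map(|line| {
            // ASCII lowercasing keeps byte offsets aligned with `line`.
            let lower = line.to_ascii_lowercase();
            match line.find(|c: char| c == '=' || c == ':') {
                Some(idx) if SENSITIVE_KEYS.iter().any(|key| lower[..idx].contains(*key)) => {
                    // Keep the name and separator, drop the value.
                    format!("{} [REDACTED]", &line[..=idx])
                }
                _ => line.to_string(),
            }
        })
        .collect::<Vec<String>>()
        // Note: a trailing newline in the input is not preserved.
        .join("\n")
}
```

A word-split count is only a rough estimate: subword tokenizers usually produce more tokens than there are words, so a chunk sized with this count may still exceed a model's limit. The line-based redaction is likewise deliberately naive and will miss secrets that are not written as a `name = value` or `name: value` pair.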