Automatically compress and clean your code for efficient use with LLMs

As you know, large language models (LLMs) have a limited context window. When asking a question about a project, it is often impossible to paste in the entire source text, so the relevant code from different files has to be combined in one place and kept as compact as possible.

To help with this, I have developed a script that minimizes the source code of a project by stripping tabs, leading and trailing whitespace, blank lines, comment lines, and test code (everything in a file after a stop word such as #[cfg(test)]). The script collects all or only selected project files in one place.
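
To see what this per-line filtering does in isolation, the following sketch applies the same trimming, stop-word, and comment rules to a small inline sample instead of real files. The function name minify_stream and the sample input are hypothetical; the rules are meant to mirror the ones used by the scripts in this post.

```shell
#!/bin/bash

comment_pattern='#|//|/\*'      # the comment prefixes "#", "//", "/*" as a regex
stop_words=("#[cfg(test)]")

# Read lines from stdin and print the minified result to stdout.
minify_stream() {
    local line stop_word
    while IFS= read -r line || [ -n "$line" ]; do
        line="${line//$'\t'/}"                      # remove tabs
        line="${line#"${line%%[![:space:]]*}"}"     # trim leading whitespace
        line="${line%"${line##*[![:space:]]}"}"     # trim trailing whitespace
        if [ -z "$line" ]; then
            continue                                # skip blank lines
        fi
        for stop_word in "${stop_words[@]}"; do
            if [ "$line" = "$stop_word" ]; then
                return 0                            # ignore the rest of the input
            fi
        done
        if [[ "$line" =~ ^($comment_pattern) ]]; then
            continue                                # skip comment lines
        fi
        printf '%s\n' "$line"
    done
}

minify_stream <<'EOF'
// adds two numbers
fn add(a: i32, b: i32) -> i32 {
    a + b
}

#[cfg(test)]
mod tests {}
EOF
```

The sample prints only the three lines of the add function, without indentation. Note that with "#" in the comment list, Rust attribute lines such as #[derive(Debug)] are dropped as well, because they also start with "#".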

To use it, simply run the script in your project directory. It generates a minified out.txt file containing the optimized code, ready for use with large language models.
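
Once out.txt exists, it can help to check roughly how much of the context window it will occupy. The sketch below assumes about four characters per token, which is only a rule of thumb and not a property of any particular model. The function name estimate_tokens and the 8000-token budget are hypothetical.

```shell
#!/bin/bash

# Rough token estimate for a file, assuming ~4 characters per token.
# This ratio is a heuristic, not an exact tokenizer count.
estimate_tokens() {
    local file="$1"
    local bytes
    bytes=$(wc -c < "$file")
    echo $(( (bytes + 3) / 4 ))   # integer division, rounded up
}

# Hypothetical budget; replace it with your model's context size.
budget=8000

if [ -f out.txt ]; then
    tokens=$(estimate_tokens out.txt)
    if [ "$tokens" -gt "$budget" ]; then
        echo "out.txt is about $tokens tokens, over the budget of $budget"
    else
        echo "out.txt is about $tokens tokens, within the budget of $budget"
    fi
fi
```

If the estimate is over budget, narrowing filenames_to_search to the files that matter for the question is usually the simplest fix.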

Before running the script, edit the following arrays to suit your project's needs: folders_to_ignore, extensions_to_search, filenames_to_search, comment_chars, and stop_words.

Example configuration for a Rust project (include all *.rs files in out.txt):

folders_to_ignore=("target" ".git" ".github" ".gitignore" ".idea")  # Folders to ignore
extensions_to_search=("rs")                                         # File extensions to search for
filenames_to_search=("Cargo.toml")                                  # Filenames to search for
comment_chars=("#" "//" "/*")                                       # Characters that denote comments
stop_words=("#[cfg(test)]")                                         # Stop words after which to ignore the remaining lines in the file

Example configuration for a Rust project (include only certain files in out.txt):

folders_to_ignore=("target" ".git" ".github" ".gitignore" ".idea")  # Folders to ignore
extensions_to_search=()                                             # File extensions to search for
filenames_to_search=("Cargo.toml" "lib.rs" "core.rs")               # Filenames to search for
comment_chars=("#" "//" "/*")                                       # Characters that denote comments
stop_words=("#[cfg(test)]")                                         # Stop words after which to ignore the remaining lines in the file
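
The same arrays can be adapted to other languages. As an illustration only, a Python project might use something like the following. All values here are assumptions rather than settings from the original scripts, and stopping at the __main__ guard is a choice you may not want.

```shell
# Hypothetical configuration for a Python project; adjust it to your layout.
folders_to_ignore=(".git" ".venv" "tests")      # Folders to ignore
extensions_to_search=("py")                     # File extensions to search for
filenames_to_search=("pyproject.toml")          # Filenames to search for
comment_chars=("#")                             # Characters that denote comments
stop_words=('if __name__ == "__main__":')       # Stop words after which to ignore the remaining lines in the file
```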

Bash version of the script:

#!/bin/bash

# Remove existing out.txt if it exists
rm -f out.txt

# Arrays
folders_to_ignore=("target" ".git" ".github" ".gitignore" ".idea")  # Folders to ignore
extensions_to_search=("rs")                                         # File extensions to search for
filenames_to_search=("Cargo.toml" "core.rs" "text.rs" "json.rs")    # Filenames to search for
comment_chars=("#" "//" "/*")                                       # Characters that denote comments
stop_words=("#[cfg(test)]")                                         # Stop words after which to ignore the remaining lines in the file

# Build the 'find' command

# Start with the basic 'find' command
find_cmd="find ."

# Add folders to ignore
if [ ${#folders_to_ignore[@]} -gt 0 ]; then
    ignore_dir_expr=""
    for dir in "${folders_to_ignore[@]}"; do
        if [ -n "$ignore_dir_expr" ]; then
            ignore_dir_expr+=" -o "
        fi
        ignore_dir_expr+="-path './$dir' -prune"
    done
    find_cmd+=" \\( $ignore_dir_expr \\) -o"
fi

# Add conditions to search for files
find_cmd+=" \\( "

name_patterns=()

# Add file extensions
for ext in "${extensions_to_search[@]}"; do
    name_patterns+=("-name '*.$ext'")
done

# Add filenames
for fname in "${filenames_to_search[@]}"; do
    name_patterns+=("-name '$fname'")
done

# Combine all patterns using -o
for ((i=0; i<${#name_patterns[@]}; i++)); do
    find_cmd+=" ${name_patterns[$i]}"
    if [ $i -lt $((${#name_patterns[@]} - 1)) ]; then
        find_cmd+=" -o"
    fi
done

find_cmd+=" \\) -type f -print"

# Print the final command for debugging (you can comment out this line)
# echo "Running command: $find_cmd"

# Build the regular expression for comments
comment_pattern=""
for ((i=0; i<${#comment_chars[@]}; i++)); do
    # Escape special characters in comment characters
    escaped_char=$(printf '%s\n' "${comment_chars[$i]}" | sed 's/[][(){}.*+?^$\\|/]/\\&/g')
    if [ $i -eq 0 ]; then
        comment_pattern="$escaped_char"
    else
        comment_pattern="$comment_pattern|$escaped_char"
    fi
done

# Execute the 'find' command and process the results
while IFS= read -r filepath; do
    echo -e "\n#### $filepath ####" >> out.txt
    stop=false
    # Process the file line by line (the || clause keeps a final line that has no trailing newline)
    while IFS= read -r line || [ -n "$line" ]; do
        if [ "$stop" = true ]; then
            break
        fi
        # Remove tabs
        line="${line//$'\t'/}"
        # Remove leading spaces
        line="${line#"${line%%[![:space:]]*}"}"
        # Remove trailing spaces
        line="${line%"${line##*[![:space:]]}"}"
        # Skip lines that are empty or contain only spaces
        if [[ -z "$line" ]]; then
            continue
        fi
        # Check for stop words
        for stop_word in "${stop_words[@]}"; do
            if [[ "$line" == "$stop_word" ]]; then
                stop=true
                break
            fi
        done
        if [ "$stop" = true ]; then
            break
        fi
        # Skip lines that are comments
        if [[ "$line" =~ ^($comment_pattern) ]]; then
            continue
        fi
        # Write the processed line to out.txt
        echo "$line" >> out.txt
    done < "$filepath"
done < <(eval "$find_cmd")
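
Assembling the find command as a string and running it through eval works for the simple names used here, but it can break on names that contain quotes or spaces. As an alternative sketch (not the author's implementation), the same expression can be built as a Bash array and passed to find directly. The function name list_files and its optional root-directory argument are hypothetical.

```shell
#!/bin/bash

folders_to_ignore=("target" ".git")
extensions_to_search=("rs")
filenames_to_search=("Cargo.toml")

# Print matching files under the directory given as $1 (default: .),
# building the find expression as an array instead of a string for eval.
list_files() {
    local root="${1:-.}"
    local prune=() names=() dir ext fname

    for dir in "${folders_to_ignore[@]}"; do
        if [ ${#prune[@]} -gt 0 ]; then prune+=(-o); fi
        prune+=(-path "$root/$dir")
    done
    for ext in "${extensions_to_search[@]}"; do
        if [ ${#names[@]} -gt 0 ]; then names+=(-o); fi
        names+=(-name "*.$ext")
    done
    for fname in "${filenames_to_search[@]}"; do
        if [ ${#names[@]} -gt 0 ]; then names+=(-o); fi
        names+=(-name "$fname")
    done

    # Nothing to search for: avoid passing an empty \( \) group to find.
    if [ ${#names[@]} -eq 0 ]; then return 0; fi

    if [ ${#prune[@]} -gt 0 ]; then
        find "$root" \( "${prune[@]}" \) -prune -o \( "${names[@]}" \) -type f -print
    else
        find "$root" \( "${names[@]}" \) -type f -print
    fi
}
```

The processing loop can then read from it with done < <(list_files) in place of the eval call. Like the original, this prunes only the listed folders directly under the root directory.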

PowerShell version of the script:

# Remove existing out.txt if it exists
if (Test-Path -Path "out.txt") {
    Remove-Item -Path "out.txt" -Force
}

# Define arrays

# Folders and files to ignore during the search
$foldersToIgnore = @("target", ".git", ".github", ".gitignore", ".idea")

# File extensions to search for
$extensionsToSearch = @("rs")

# Specific filenames to search for
$filenamesToSearch = @("Cargo.toml", "core.rs", "text.rs", "json.rs")

# Characters that denote comments in the files
$commentChars = @("#", "//", "/*")

# Words that, when encountered, will stop processing the current file
$stopWords = @("#[cfg(test)]")

# Function to build file filtering based on provided criteria
function Get-FilteredFiles {
    param (
        [string[]]$IgnoreFolders,
        [string[]]$Extensions,
        [string[]]$Filenames
    )

    # Build a regex pattern for ignored folders
    if ($IgnoreFolders.Count -gt 0) {
        $ignorePattern = ($IgnoreFolders | ForEach-Object { [regex]::Escape($_) }) -join '|'
    } else {
        $ignorePattern = ""
    }

    # Build a list of filters for extensions and filenames
    $nameFilters = @()
    foreach ($ext in $Extensions) {
        $nameFilters += "*.$ext"
    }
    foreach ($fname in $Filenames) {
        $nameFilters += $fname
    }

    # Get all files with the specified extensions or filenames
    Get-ChildItem -Path . -Recurse -File -Include $nameFilters | Where-Object {
        if ($ignorePattern) {
            # Check if the full path contains any of the ignored folders (either path separator)
            -not ($_.FullName -match "[\\/]($ignorePattern)[\\/]")
        } else {
            $true
        }
    }
}

# Build a regex pattern for comments
$escapedCommentChars = $commentChars | ForEach-Object { [regex]::Escape($_) }
$commentPattern = $escapedCommentChars -join '|'

# Get the list of files to process
$files = Get-FilteredFiles -IgnoreFolders $foldersToIgnore -Extensions $extensionsToSearch -Filenames $filenamesToSearch

# Process each file
foreach ($file in $files) {
    # Add file header to out.txt
    "`n#### $($file.FullName) ####" | Out-File -FilePath "out.txt" -Append -Encoding utf8

    $stop = $false

    # Read the file line by line
    Get-Content -Path $file.FullName | ForEach-Object {
        if ($stop) {
            return
        }

        $line = $_

        # Remove tabs
        $line = $line -replace "`t", ""

        # Trim leading and trailing spaces
        $line = $line.Trim()

        # Skip empty lines
        if ([string]::IsNullOrWhiteSpace($line)) {
            return
        }

        # Check for stop words
        foreach ($stopWord in $stopWords) {
            if ($line -eq $stopWord) {
                $stop = $true
                break
            }
        }
        if ($stop) {
            return
        }

        # Skip lines that are comments
        if ($line -match "^($commentPattern)") {
            return
        }

        # Write the processed line to out.txt
        $line | Out-File -FilePath "out.txt" -Append -Encoding utf8
    }
}

P.S.

The contents of the out.txt file should be copied to the clipboard and pasted as text into the LLM input window. Do not attach out.txt to the question as a file. For optimization reasons, an LLM interface often processes attached files by extracting a summary from them and then answers the question based on that summary. If you instead paste the contents of out.txt into the input window and then ask your question, the model answers based on the entire contents of the file.
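
The copy step can also be done from the command line. The sketch below uses whichever common clipboard tool it finds first (pbcopy on macOS, wl-copy or xclip on Linux, clip.exe on Windows). Which of these are installed on your system is an assumption, and the function name copy_to_clipboard is hypothetical.

```shell
#!/bin/bash

# Copy a file's contents to the system clipboard using the first
# available tool. Returns 1 if no known clipboard tool is installed.
copy_to_clipboard() {
    local file="$1"
    if command -v pbcopy >/dev/null 2>&1; then
        pbcopy < "$file"                          # macOS
    elif command -v wl-copy >/dev/null 2>&1; then
        wl-copy < "$file"                         # Linux, Wayland
    elif command -v xclip >/dev/null 2>&1; then
        xclip -selection clipboard < "$file"      # Linux, X11
    elif command -v clip.exe >/dev/null 2>&1; then
        clip.exe < "$file"                        # Windows (WSL, Git Bash)
    else
        echo "No clipboard tool found; open $file and copy it manually." >&2
        return 1
    fi
}
```

After generating the file, call it as copy_to_clipboard out.txt and paste the result into the input window.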

The source code of the scripts is available on GitHub. If you have any improvements, please make a pull request.
