This is the scrapbook.

Here is where random information goes. Most of it consists of little snippets of information, stories, quick fixes, or references to further scrapbook-style material: short notes that were helpful to me and might be useful to someone else. Of course, everything here is meant only as a suggestion, without any guarantee or assurance of function or feasibility. If you would like my professional technical support, please contact me at https://c-7.de.

AI: Offline Quality Image-to-Text runs on Laptop

by Claus | Dec 8, 2023 | Software, Tech Corner

Image Management with Quantized LLMs: A Leap in Efficiency, Accessibility and Privacy

The challenge of managing extensive digital image libraries is a universal one: efficiently organizing and retrieving images from large collections is a task that transcends professional boundaries. The advent of quantized Large Language Models (LLMs), particularly the LLaMA model, has introduced a solution for “Image to Text” that goes far beyond keywording and is both efficient and accessible, even on standard computing hardware like a MacBook Pro.

Keeping the whole process in-house has significant privacy and confidentiality benefits.

Another often-overlooked aspect of digital image management, particularly crucial for website design and content creation, is (web) accessibility for the visually impaired. Image captions, which provide a textual description of the visual content, are essential for making content more inclusive.

The Universal Challenge of Image Management

Recognising image contents and turning them into searchable, interpretable text (Image to Text) is the next leap. The need for an automated, efficient, and privacy-conscious solution is widely felt. Until now, however, the resource requirements of large language models were often a limiting factor: either considerable in-house investment was required, or data had to be entrusted to external service providers.

The Power of Quantized LLMs in Image Processing

Quantized LLMs, such as the LLaMA model, can represent a significant advancement for digital asset management. Model quantization is a technique that reduces the size of large neural networks by lowering the precision of their weights. The weights are converted from higher-precision data types (like float32) to lower-precision ones (like INT4), which shrinks the model and makes it feasible to run on less powerful hardware, even on a laptop such as the MacBook Pro with 16GB of memory that was used for this demonstration.
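
To get a feel for the numbers, the storage needed for the weights is roughly the parameter count times the bits per weight. The sketch below assumes a hypothetical 7-billion-parameter model (the model size is not stated in this post) and ignores quantization overhead such as scaling factors, the image projector, and the context cache, so a real footprint will be somewhat larger. The function name `weights_mb` is made up for illustration.

```shell
#!/bin/bash
# Rough weight-storage estimate: parameters * bits per weight / 8 bytes.
# PARAMS is an assumed, illustrative value, not the model used in this post.
PARAMS=7000000000

# Print the approximate weight size in megabytes (1 MB = 1,000,000 bytes)
weights_mb() {
    local params="$1" bits="$2"
    echo $(( params * bits / 8 / 1000000 ))
}

echo "float32: $(weights_mb "$PARAMS" 32) MB"
echo "INT4:    $(weights_mb "$PARAMS" 4) MB"
```

Under these assumptions the float32 weights alone (about 28 GB) would not fit into 16GB of memory, while the 4-bit version (about 3.5 GB) would.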

Key Benefits of Quantization for Image Management

Reduced Hardware Demands: By lowering the precision of the model’s weights, quantization allows the LLaMA model to run efficiently on commonly available hardware, making this technology more accessible.
Maintained Performance: Despite the reduction in size, quantized models like LLaMA retain most of their accuracy and capability, which is crucial for detailed image description and organization.
Enhanced Privacy: Local processing of images with quantized LLMs ensures that sensitive data remains within the user’s system, addressing major privacy concerns.
Time Efficiency: The script processes images in about 15 seconds each, a testament to the efficiency of quantized models in handling complex tasks quickly.
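
As a back-of-the-envelope check, the per-image time can be scaled up to a whole library. The sketch below assumes strictly sequential processing at a constant number of seconds per image, which is an idealisation; the function name `batch_duration` is made up for illustration.

```shell
#!/bin/bash
# Estimate total run time for a batch, assuming sequential processing
# at a constant number of seconds per image.
batch_duration() {
    local images="$1" secs_per_image="$2"
    local total=$(( images * secs_per_image ))
    printf '%dh %dm %ds\n' $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

batch_duration 1000 15   # 1000 images at 15 seconds each
```

At 15 seconds per image, a library of 1000 images is therefore a job of a little over four hours rather than an interactive task.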

Practical Application and Efficiency

A script has been developed that uses a Large Language Model to automatically generate detailed descriptions of images from various sources, including files, directories, and URLs, and to embed those descriptions into the images. The tool processes both RAW and standard image formats, converting them as needed, and stores the AI-generated descriptions both in text files and in image metadata (XMP files) for better content recognition and management. Running the script on a MacBook Pro demonstrates the efficiency of quantized LLMs: the balance between performance and resource requirements means that advanced image processing and organization are now more accessible than ever. Batch processing of 1000 local image files took approximately 15 seconds per image.
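
Once the descriptions exist as plain `.txt` files next to the images, ordinary text tools are enough to search a library by content. The following is a minimal sketch, assuming the image name with a `.txt` extension as the naming scheme; the function name `find_images_by_text` and the demo descriptions are hypothetical.

```shell
#!/bin/bash
# List description files under a directory that mention a search term
# (case-insensitive). Each hit is the .txt sidecar of a matching image.
find_images_by_text() {
    local dir="$1" term="$2"
    grep -r -i -l --include='*.txt' -- "$term" "$dir"
}

# Self-contained demo with two made-up descriptions in a temporary directory
# (remove the directory with rm -rf "$demo_dir" when done)
demo_dir=$(mktemp -d)
echo "A train station with a large train parked on the tracks." > "$demo_dir/station.txt"
echo "A statue of a cat in front of trees." > "$demo_dir/cat.txt"

find_images_by_text "$demo_dir" "train"   # lists only station.txt
```

Replacing the `.txt` suffix of each hit with the image extension leads back to the picture itself.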

Script utilising llama.cpp for image library management

#!/bin/bash

# Enhanced script to describe an image and handle various input/output methods
# Input: a single file, a directory of files, or a URL
# Requires exiftool and llama.cpp (LLaVA CLI); URL mode also uses curl and pbcopy (macOS)
# User should set these paths before running the script
LLAVA_BIN="YOUR_PATH_TO_LLAVA_CLI"
MODELS_DIR="YOUR_PATH_TO_MODELS_DIR"
MODEL="YOUR_MODEL_NAME"
MMPROJ="YOUR_MMPROJ_NAME"

TOKENS=256
THREADS=8
MTEMP=0.1
MPROMPT="Describe the image in as much detail as possible."
MCONTEXT=2048
GPULAYERS=50

# Function to process an image file
process_image() {
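    local image_file=$1
    # Use the explicit second argument if given, else the global set by the main logic
    local input_source=${2:-$input_source}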
    local output_file="${image_file%.*}.txt"
    local xmp_file="${image_file%.*}.xmp"

    # Quote every path and value so that directories containing spaces work
    OUTPUT="$("${LLAVA_BIN}" -m "${MODELS_DIR}/${MODEL}" --mmproj "${MODELS_DIR}/${MMPROJ}" --threads "${THREADS}" --temp "${MTEMP}" --prompt "${MPROMPT}" --image "${image_file}" --n-gpu-layers "${GPULAYERS}" --ctx-size "${MCONTEXT}" --n-predict "${TOKENS}")"
    RES=$(echo "$OUTPUT" | awk '/ per image patch\)/{p=1;next} p')

    # Remove leading and trailing whitespace
    RES="${RES#"${RES%%[![:space:]]*}"}"
    RES="${RES%"${RES##*[![:space:]]}"}"

    # Output handling
    if [[ $input_source == "file" ]]; then
        echo "$RES" > "$output_file"
        # Check if XMP file exists, if not create it
        if [[ ! -f "$xmp_file" ]]; then
            exiftool -xmp -o "$xmp_file" "$image_file"
        fi
        # Write the description to the XMP file
        if [[ -f "$xmp_file" ]]; then
            exiftool -XMP-dc:Description="$RES" "$xmp_file"
        else
            exiftool -XMP-dc:Description="$RES" "$image_file"
        fi
    elif [[ $input_source == "url" ]]; then
        echo "$RES" | pbcopy
    fi
}

# Export the function so it's available in subshells
export -f process_image

# Function to process a directory
process_directory() {
    local dir=$1
    # Check once, before the loop, that MODELS_DIR exists
    if [[ ! -d "$MODELS_DIR" ]]; then
        echo "Error: MODELS_DIR ($MODELS_DIR) not found. Exiting." >&2
        exit 1
    fi
    while IFS= read -r -d '' file; do
        process_image "$file" "file"
    done < <(find "$dir" -type f \( -iname "*.jpg" -o -iname "*.png" -o -iname "*.gif" -o -iname "*.tiff" \) -print0)
}

# Function to download and process an image from a URL
process_url() {
    local url=$1
    local temp_dir=$(mktemp -d)
    local temp_file="${temp_dir}/image"

    # -f: fail on HTTP errors, -L: follow redirects, -sS: quiet but still report errors
    curl -fsSL -o "$temp_file" "$url" && process_image "$temp_file" "url"
    rm -rf "$temp_dir"
}

# Main script logic
input=$1

if [[ -f $input ]]; then
    input_source="file"
    process_image "$input"
elif [[ -d $input ]]; then
    input_source="file"
    process_directory "$input"
elif [[ $input =~ ^https?:// ]]; then
    input_source="url"
    process_url "$input"
else
    echo "Invalid input. Please provide a valid file, directory, or URL."
    exit 1
fi
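
Two of the less obvious lines in the script rely on shell parameter expansion: deriving the sidecar names from the image path, and trimming whitespace from the model output. The sketch below isolates both so they can be tried without a model. The sample path and text are made up, and the helper name `trim` is not part of the script.

```shell
#!/bin/bash
# 1) Sidecar names: "${var%.*}" removes the shortest trailing ".suffix"
image_file="photos/2023/station.final.jpg"   # hypothetical path
echo "${image_file%.*}.txt"                  # photos/2023/station.final.txt
echo "${image_file%.*}.xmp"                  # photos/2023/station.final.xmp

# 2) Trimming: strip leading, then trailing whitespace (spaces, tabs, newlines)
trim() {
    local s="$1"
    s="${s#"${s%%[![:space:]]*}"}"   # drop the leading run of whitespace
    s="${s%"${s##*[![:space:]]}"}"   # drop the trailing run of whitespace
    printf '%s\n' "$s"
}

trim "   The image features a train station.   "
```

One caveat of the `%.*` pattern: for a file without an extension inside a directory whose name contains a dot, the cut happens at the directory's dot, so the sidecar path would be wrong. The directory mode of the script only picks up files with known image extensions, so this mainly matters for single-file input.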

Examples

Example Image 1:

Model Generated Description: The image features a train station with a large train parked on the tracks. The train spans a significant portion of the scene, stretching from the left to the right side. There are several other trains visible in the background, indicating a busy train station.
In addition to the trains, there are multiple people scattered throughout the scene, likely waiting for their trains or attending to other tasks at the station. Some of these individuals are closer to the foreground, while others are further back in the scene.
The overall atmosphere of the image suggests a bustling and active train station, with various trains and people contributing to the busy environment.

Example Image 2:

Model Generated Description: The image features a large statue of a cat holding a fork and knife in its paws. The cat statue is positioned in the middle of the scene, with the fork and knife held prominently in its hands. The statue appears to be made of metal, giving it a unique and artistic appearance. In the background, there are trees visible, adding a natural element to the scene. The combination of the cat statue and the surrounding environment creates an interesting and visually appealing composition.

Remark: The output is not error-free. There is no knife in the paw; what the model took for one is a fold of the coat.

Conclusion

The integration of quantized LLMs like LLaMA in image management is a transformative development, making advanced digital asset organization accessible on standard hardware and ensuring data privacy. This approach represents a significant leap forward, offering a powerful, user-friendly solution for a wide range of users and scenarios.

*Note: For those interested in exploring this solution further or seeking assistance with similar challenges, consultancy services are available. These services provide expertise in integrating and customizing such technologies to suit a variety of needs and preferences. Feel free to contact me at hello∂c-7.de. Claus Siebeneicher *