WebGPU crash on Android Chrome running SmolVLM-256M-Instruct #1205

Open
1 of 5 tasks
sbrzz opened this issue Feb 22, 2025 · 5 comments
Labels
bug Something isn't working

Comments

@sbrzz

sbrzz commented Feb 22, 2025

System Info

transformers.js 3.3.3 (via https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.3.3)
Platform Android 13
Chrome for Android 133.0.6943.50

webgpureport attached

webgpureport-2025-02-22T06-59-48-900Z.txt

Environment/Platform

- [x] Website/web-app
- [ ] Browser extension
- [ ] Server-side (e.g., Node.js, Deno, Bun)
- [ ] Desktop app (e.g., Electron)
- [ ] Other (e.g., VSCode extension)

Description

I successfully ran SmolVLM-256M-Instruct on my development machine (WebGPU enabled) and saw the usual ~10x improvement over WASM.
However, I get an error when I run the same code on the target device (Android).

I tried embed_tokens: "fp16" without success (the target device doesn't support it), then switched to embed_tokens: "fp32".

Chrome console output:

WebGL: CONTEXT_LOST_WEBGL: loseContext: context lost
A valid external Instance reference no longer exists.
Uncaught (in promise) AbortError: Failed to execute 'mapAsync' on 'GPUBuffer': A valid external Instance reference no longer exists.

I assumed this problem had been fixed, based on #943.

Any ideas?

Reproduction

Code to Reproduce

import { 
    AutoProcessor,
    AutoModelForVision2Seq,
    load_image,
} from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.3.3';

console.log("vlm.js");

const DEBUG_MODE = true;

globalThis.whatsInTheImage = async function (imagePath) {
    console.log(imagePath);

    // Track execution times
    const timings = {};
    function logTime(label) {
        if (DEBUG_MODE){
            const now = performance.now();
            if (!timings[label]) {
                timings[label] = now;
            } else {
                console.log(`${label} took ${(now - timings[label]).toFixed(2)}ms`);
                delete timings[label];
            }
        }
    }

    // Load image
    logTime("Image Loading");
    const image1 = await load_image(imagePath);
    logTime("Image Loading");

    // Load processor and model
    const model_id = "HuggingFaceTB/SmolVLM-256M-Instruct";

    logTime("Processor Loading");
    const processor = await AutoProcessor.from_pretrained(model_id);
    logTime("Processor Loading");

    logTime("Model Loading");
    const model = await AutoModelForVision2Seq.from_pretrained(model_id, {
        dtype: {
            embed_tokens: "fp32", 
            vision_encoder: "q4", 
            decoder_model_merged: "q4", 
        },
        device: "webgpu",
    });
    logTime("Model Loading");

    // Prepare input messages
    const messages = [
        {
            role: "user",
            content: [
                { type: "image" },
                { type: "text", text: "Can you describe this artistic image?" },
            ],
        },
    ];

    // Process text
    logTime("Text Processing");
    const text = processor.apply_chat_template(messages, { add_generation_prompt: true });
    logTime("Text Processing");

    logTime("Processor Apply");
    const inputs = await processor(text, [image1], {
        do_image_splitting: false,
    });
    logTime("Processor Apply");

    // Generate output
    logTime("Model Generation");
    const generated_ids = await model.generate({
        ...inputs,
        max_new_tokens: 500,
    });
    logTime("Model Generation");

    logTime("Batch Decoding");
    const generated_texts = processor.batch_decode(
        generated_ids.slice(null, [inputs.input_ids.dims.at(-1), null]), 
        { skip_special_tokens: true },
    );
    logTime("Batch Decoding");

    return generated_texts[0];
};
sbrzz added the bug label on Feb 22, 2025
@sbrzz
Author

sbrzz commented Feb 23, 2025

I tried embed_tokens: "fp16" without success (the target device doesn't support it), then switched to embed_tokens: "fp32".

Just a note on this topic: I saw that webgpureport on the Android platform is missing the shader-f16 feature.
The same feature is listed on the development machine.
I suppose this is the source of the error at runtime.

It would make sense to check the availability of this feature before it breaks the code.
Check this: https://developer.chrome.com/blog/new-in-webgpu-120

const adapter = await navigator.gpu.requestAdapter();
if (!adapter.features.has("shader-f16")) {
  throw new Error("16-bit floating-point value support is not available");
}
// Explicitly request 16-bit floating-point value support.
const device = await adapter.requestDevice({
  requiredFeatures: ["shader-f16"],
});

const code = `
  enable f16;

  @compute @workgroup_size(1)
  fn main() {
    const c : vec3h = vec3<f16>(1.0h, 2.0h, 3.0h);
  }
`;

const shaderModule = device.createShaderModule({ code });
// Create a compute pipeline with this shader module
// and run the shader on the GPU...
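
Applied to the reproduction above, a minimal sketch (untested, just the idea) could gate the embed_tokens dtype on the adapter's features and fall back to fp32:

import { AutoModelForVision2Seq } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.3.3';

// Untested sketch: choose the embed_tokens dtype from the adapter's features
// instead of hard-coding fp16/fp32.
const adapter = await navigator.gpu?.requestAdapter();
const hasF16 = !!adapter && adapter.features.has("shader-f16");

const model = await AutoModelForVision2Seq.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct", {
  dtype: {
    embed_tokens: hasF16 ? "fp16" : "fp32",
    vision_encoder: "q4",
    decoder_model_merged: "q4",
  },
  device: "webgpu",
});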

@hacktronics

hacktronics commented Feb 23, 2025

Hitting the same issue (Failed to execute 'mapAsync' on 'GPUBuffer': A valid external Instance reference no longer exists.) with the Florence 2 example as well, in the browser with @huggingface/transformers.

@sbrzz
Author

sbrzz commented Feb 25, 2025

Update: I made it work by skipping the layer provided by transformers.js (apart from preprocessing).
I used onnxruntime-web directly, even though the generated tokens make no sense (help in understanding that bug is appreciated).

I definitely think this points to a bug in transformers.js; let's see if there is time to find it and patch it with a PR.

This is the working code:

import { 
  AutoProcessor,
  load_image,
  AutoConfig
} from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.3.3';

class SmolVLMInference {
    constructor(config) {
      // Model configuration
      this.modelId = "HuggingFaceTB/SmolVLM-256M-Instruct";
      this.config = {
        text_config: {
          num_key_value_heads: config.text_config.num_key_value_heads,
          head_dim: config.text_config.head_dim,
          num_hidden_layers: config.text_config.num_hidden_layers,
          eos_token_id: config.text_config.eos_token_id
        },
        image_token_id: config.image_token_id
      };
      
      // Initialize sessions and processor
      this.visionSession = null;
      this.embedSession = null;
      this.decoderSession = null;
      this.processor = null;
      
      // Model parameters from config
      this.numKeyValueHeads = this.config.text_config.num_key_value_heads;
      this.headDim = this.config.text_config.head_dim;
      this.numHiddenLayers = this.config.text_config.num_hidden_layers;
      this.eosTokenId = this.config.text_config.eos_token_id;
      this.imageTokenId = this.config.image_token_id;
    }
  
    // Initialize ONNX sessions
    async loadModels() {
      try {
        console.log("Loading ONNX models...");
        
        // Load all three models in parallel
        [this.visionSession, this.embedSession, this.decoderSession] = await Promise.all([
          ort.InferenceSession.create('./vision_encoder_q4.onnx', { executionProviders: ['webgpu'] }),
          ort.InferenceSession.create('./embed_tokens_q4.onnx', { executionProviders: ['webgpu'] }),
          ort.InferenceSession.create('./decoder_model_merged_q4.onnx', { executionProviders: ['webgpu'] })
        ]);
        
        console.log("Models loaded successfully!");
        return true;
      } catch (error) {
        console.error("Error loading models:", error);
        return false;
      }
    }
  
    // Simplified token decoder
    decodeTokens(tokens) {
      // This is a very simplified decoder
      return tokens.map(t => String.fromCharCode(97 + (Number(t) % 26))).join("");
    }
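
    // Hypothetical alternative to the placeholder above: decode with the real
    // tokenizer. AutoTokenizer.from_pretrained and tokenizer.decode are standard
    // transformers.js APIs; the lazy dynamic import here is just a sketch.
    async decodeWithTokenizer(tokens) {
      if (!this.tokenizer) {
        const { AutoTokenizer } = await import('https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.3.3');
        this.tokenizer = await AutoTokenizer.from_pretrained(this.modelId);
      }
      return this.tokenizer.decode(tokens, { skip_special_tokens: true });
    }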

    async officialPreproc(imageUrl, question){

      const image1 = await load_image(imageUrl);

      // Load processor and model
      const model_id = "HuggingFaceTB/SmolVLM-256M-Instruct";

      const processor = await AutoProcessor.from_pretrained(model_id);

      const messages = [
          {
              role: "user",
              content: [
                  { type: "image" },
                  { type: "text", text: question },
              ],
          },
      ];
      const text = processor.apply_chat_template(messages, { add_generation_prompt: true });
      const inputs = await processor(text, [image1], {
        do_image_splitting: false,
      });

      return inputs;
    }
  
    // Main inference function
    async generateText(imageUrl, question, maxNewTokens = 1024) {
      try {

        const officialInputProcessing = await this.officialPreproc(imageUrl, question);
        
        // Prepare decoder inputs
        const batchSize = 1;
        let pastKeyValues = {};
        for (let layer = 0; layer < this.numHiddenLayers; layer++) {
          for (let kv of ['key', 'value']) {
            pastKeyValues[`past_key_values.${layer}.${kv}`] = new ort.Tensor(
              'float32', 
              new Float32Array(0), 
              [batchSize, this.numKeyValueHeads, 0, this.headDim]
            );
          }
        }
        
        let imageFeatures = null;
        let inputIds = officialInputProcessing.input_ids;
        let attentionMask = officialInputProcessing.attention_mask;
        
        // Calculate position IDs
        let positionIds = this.calculatePositionIds(attentionMask);
        
        // Generation loop
        let generatedTokens = [];
        let outputText = "";
        
        console.log("Starting generation...");
        
        for (let i = 0; i < maxNewTokens; i++) {
          // Get token embeddings
          const inputIdsArray = this.getTensorData(inputIds);
          const embedFeed = { 'input_ids': inputIds };
          const embedResult = await this.embedSession.run(embedFeed);
          const inputsEmbeds = embedResult.inputs_embeds; // embedding output tensor is named 'inputs_embeds'
          
          // Process image if needed
          if (imageFeatures === null) {
            const visionFeed = {
              'pixel_values': officialInputProcessing.pixel_values,
              'pixel_attention_mask': officialInputProcessing.pixel_attention_mask
            };
            
            const visionResult = await this.visionSession.run(visionFeed);
            imageFeatures = visionResult.image_features;
            
            // Replace image token embeddings with image features
            // This would need a more complex implementation to find and replace the correct embeddings
            // For now, just a placeholder showing the concept
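            // Hypothetical sketch of that merge (tensor layout assumed, not
            // verified): copy one row of image_features into inputs_embeds at
            // every position whose input id equals the image token id.
            const hiddenSize = inputsEmbeds.dims[inputsEmbeds.dims.length - 1];
            const embedData = inputsEmbeds.data;     // assumed Float32Array of shape [1, seq_len, hidden]
            const featureData = imageFeatures.data;  // assumed Float32Array with one row per image token
            const promptIds = this.getTensorData(officialInputProcessing.input_ids);
            let featureRow = 0;
            for (let pos = 0; pos < promptIds.length; pos++) {
              if (Number(promptIds[pos]) === this.imageTokenId) {
                embedData.set(featureData.subarray(featureRow * hiddenSize, (featureRow + 1) * hiddenSize), pos * hiddenSize);
                featureRow++;
              }
            }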
          }
          
          // Run decoder model
          const decoderFeeds = {
            'inputs_embeds': inputsEmbeds,
            'attention_mask': attentionMask,
            'position_ids': positionIds,
            ...pastKeyValues
          };
          
          const decoderResults = await this.decoderSession.run(decoderFeeds);
          const logits = decoderResults.logits;
          const presentKeyValues = decoderResults.present_key_values || [];
          
          // Get next token (argmax of last logits)
          const nextToken = this.getNextToken(logits);
          
          // Update for next iteration
          inputIds = new ort.Tensor('int64', new BigInt64Array([BigInt(nextToken)]), [1, 1]);
          attentionMask = new ort.Tensor('int64', new BigInt64Array([1n]), [1, 1]);
          // Advance from the last (highest) previous position, not the first one
          const prevPositions = this.getTensorData(positionIds);
          positionIds = new ort.Tensor('int64', new BigInt64Array([prevPositions[prevPositions.length - 1] + 1n]), [1, 1]);
          
          // Update past key values
          // This would need proper handling of the present key values structure
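          // Hypothetical sketch of that handling, assuming the merged decoder
          // names its cache outputs `present.<layer>.<key|value>` (the usual
          // convention for these exports; not verified here):
          for (let layer = 0; layer < this.numHiddenLayers; layer++) {
            for (const kv of ['key', 'value']) {
              const presentKV = decoderResults[`present.${layer}.${kv}`];
              if (presentKV !== undefined) {
                pastKeyValues[`past_key_values.${layer}.${kv}`] = presentKV;
              }
            }
          }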
          
          // Add token to generated sequence
          generatedTokens.push(nextToken);
          
          // Decode token and add to output text
          const tokenText = this.decodeTokens([nextToken]);
          outputText += tokenText;
          
          // Optional streaming output
          if (i % 5 === 0) {
            console.log("Generation progress:", outputText);
          }
          
          // Check for EOS token
          if (nextToken === this.eosTokenId) {
            break;
          }
        }
        
        console.log("Generation complete!");
        return outputText;
      } catch (error) {
        console.error("Error in generation:", error);
        return "An error occurred during text generation.";
      }
    }
  
    // Helper to calculate position IDs from attention mask
    calculatePositionIds(attentionMask) {
      const attentionArray = this.getTensorData(attentionMask);
      const positionArray = new BigInt64Array(attentionArray.length);
      
      let position = 0n;
      for (let i = 0; i < attentionArray.length; i++) {
        if (attentionArray[i] === 1n) {
          positionArray[i] = BigInt(position);
          position++;
        } else {
          positionArray[i] = 0n;
        }
      }
      
      return new ort.Tensor('int64', positionArray, attentionMask.dims);
    }
  
    // Helper to get next token from logits
    getNextToken(logits) {
      // Get the last token's logits
      const lastLogits = Array.from(this.getTensorData(logits).slice(-logits.dims[2]));
      
      // Find the index of the maximum value (argmax)
      let maxIndex = 0;
      let maxValue = lastLogits[0];
      
      for (let i = 1; i < lastLogits.length; i++) {
        if (lastLogits[i] > maxValue) {
          maxValue = lastLogits[i];
          maxIndex = i;
        }
      }
      
      return maxIndex;
    }
  
    // Helper to get tensor data as array
    getTensorData(tensor) {
      return tensor.data;
    }
  }
  
  // Usage example
  async function runSmolVLM() {

    let model_id = "HuggingFaceTB/SmolVLM-256M-Instruct";
    const config = await AutoConfig.from_pretrained(model_id);
    const inferenceEngine = new SmolVLMInference(config);
    
    // Step 1: Load models
    const modelsLoaded = await inferenceEngine.loadModels();
    if (!modelsLoaded) {
      console.error("Failed to load models");
      return;
    }
    
    // Step 2: Run inference
    const imageUrl = "./Statue-of-Liberty-Island-New-York-Bay.jpg";
    const question = "Can you describe this image?";
    
    console.log("Running inference on image:", imageUrl);
    console.log("Question:", question);
    
    const result = await inferenceEngine.generateText(imageUrl, question);
    
    // Step 3: Show results
    console.log("Generated text:");
    console.log(result);
    
    // Display in UI if needed
    if (document.getElementById('result')) {
      document.getElementById('result').textContent = result;
    }
  }

// Add this at the bottom of your smolvlm.js file
export { SmolVLMInference, runSmolVLM };

And the HTML page:

<!DOCTYPE html>
<html>
<head>
  <title>SmolVLM Demo</title>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/onnxruntime-web/1.20.1/ort.webgpu.min.js"></script>
  <script type="module" src="smolvlm.js"></script>
</head>
<body>
    <h1>SmolVLM Image Captioning</h1>
    <button id="runButton">Run Model</button>
    <div id="result"></div>

    <script type="module">
    // Import the function from your module
    import { runSmolVLM } from './smolvlm.js';
    
    // Add event listener to button
    document.getElementById('runButton').addEventListener('click', async () => {
        try {
            await runSmolVLM();
        } catch (error) {
            console.error("Error running SmolVLM:", error);
            document.getElementById('result').textContent = "Error: " + error.message;
        }
    });
    </script>
</body>
</html>

@xenova
Collaborator

xenova commented Feb 25, 2025

One "optimization" which transformers.js adds is to use preferredOutputLocation to keep the kv cache on GPU between forward passes: https://onnxruntime.ai/docs/api/js/interfaces/InferenceSession.SessionOptions.html#preferredOutputLocation

Maybe try adding that to your sample code to see whether this is the cause of the issue?
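
Something along these lines, for example (just a sketch of the idea, not the exact transformers.js internals; the present.<layer>.<key|value> output names are an assumption based on the usual merged-decoder export):

// Sketch: keep only the decoder's KV-cache outputs on the GPU, so `logits`
// stays on the CPU and can still be read directly. Assumes `ort` is available
// globally (ort.webgpu.min.js) as in the demo page above.
import { AutoConfig } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.3.3';

const config = await AutoConfig.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct");
const preferredOutputLocation = {};
for (let layer = 0; layer < config.text_config.num_hidden_layers; layer++) {
  preferredOutputLocation[`present.${layer}.key`] = 'gpu-buffer';
  preferredOutputLocation[`present.${layer}.value`] = 'gpu-buffer';
}

const decoderSession = await ort.InferenceSession.create('./decoder_model_merged_q4.onnx', {
  executionProviders: ['webgpu'],
  preferredOutputLocation,
});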

@sbrzz
Author

sbrzz commented Feb 25, 2025

@xenova I suppose you are talking about this:

preferredOutputLocation[key] = 'gpu-buffer';

I tried setting gpu-buffer globally, like this:

// Load all three models in parallel
[this.visionSession, this.embedSession, this.decoderSession] = await Promise.all([
  ort.InferenceSession.create('./vision_encoder_q4.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: "gpu-buffer" }),
  ort.InferenceSession.create('./embed_tokens_q4.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: "gpu-buffer" }),
  ort.InferenceSession.create('./decoder_model_merged_q4.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: "gpu-buffer" }),
]);

An error is raised because I try to access the logits data while it is still on the GPU rather than the CPU; it happens here:

const lastLogits = Array.from(this.getTensorData(logits).slice(-logits.dims[2]));

Is the same issue hitting transformers.js?

Error in generation: Error: The data is not on CPU. Use `getData()` to download GPU data to CPU, or use `texture` or `gpuBuffer` property to access the GPU data directly.
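
If the global gpu-buffer setting is kept, I suppose the read would have to go through getData() instead, something like this (untested sketch):

// Untested sketch: download the logits to the CPU before reading them,
// since preferredOutputLocation: "gpu-buffer" leaves them in a GPU buffer.
const logitsData = await logits.getData();
const lastLogits = Array.from(logitsData.slice(-logits.dims[2]));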
