WebGPU crash on Android Chrome running SmolVLM-256M-Instruct #1205

Open
1 of 5 tasks
sbrzz opened this issue Feb 22, 2025 · 5 comments
Labels
bug Something isn't working

Comments

@sbrzz

sbrzz commented Feb 22, 2025

System Info

transformers.js 3.3.3 (via https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.3.3)
Platform Android 13
Chrome for Android 133.0.6943.50

webgpureport attached

webgpureport-2025-02-22T06-59-48-900Z.txt

Environment/Platform

- [x] Website/web-app
- [ ] Browser extension
- [ ] Server-side (e.g., Node.js, Deno, Bun)
- [ ] Desktop app (e.g., Electron)
- [ ] Other (e.g., VSCode extension)

Description

I successfully ran SmolVLM-256M-Instruct on my development machine (WebGPU enabled) and saw the usual ~10x improvement over WASM.
However, I get an error when I run the same code on the target device (Android).

I tried embed_tokens: "fp16" without success (the target device doesn't support it), then switched to embed_tokens: "fp32".

Chrome console output:

WebGL: CONTEXT_LOST_WEBGL: loseContext: context lost
A valid external Instance reference no longer exists.
Uncaught (in promise) AbortError: Failed to execute 'mapAsync' on 'GPUBuffer': A valid external Instance reference no longer exists.

I assumed this problem had been fixed, based on #943.

Any ideas?

Reproduction

Code to Reproduce

import { 
    AutoProcessor,
    AutoModelForVision2Seq,
    load_image,
} from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.3.3';

console.log("vlm.js");

const DEBUG_MODE = true;

globalThis.whatsInTheImage = async function (imagePath) {
    console.log(imagePath);

    // Track execution times
    const timings = {};
    function logTime(label) {
        if (DEBUG_MODE){
            const now = performance.now();
            if (!timings[label]) {
                timings[label] = now;
            } else {
                console.log(`${label} took ${(now - timings[label]).toFixed(2)}ms`);
                delete timings[label];
            }
        }
    }

    // Load image
    logTime("Image Loading");
    const image1 = await load_image(imagePath);
    logTime("Image Loading");

    // Load processor and model
    const model_id = "HuggingFaceTB/SmolVLM-256M-Instruct";

    logTime("Processor Loading");
    const processor = await AutoProcessor.from_pretrained(model_id);
    logTime("Processor Loading");

    logTime("Model Loading");
    const model = await AutoModelForVision2Seq.from_pretrained(model_id, {
        dtype: {
            embed_tokens: "fp32", 
            vision_encoder: "q4", 
            decoder_model_merged: "q4", 
        },
        device: "webgpu",
    });
    logTime("Model Loading");

    // Prepare input messages
    const messages = [
        {
            role: "user",
            content: [
                { type: "image" },
                { type: "text", text: "Can you describe this artistic image?" },
            ],
        },
    ];

    // Process text
    logTime("Text Processing");
    const text = processor.apply_chat_template(messages, { add_generation_prompt: true });
    logTime("Text Processing");

    logTime("Processor Apply");
    const inputs = await processor(text, [image1], {
        do_image_splitting: false,
    });
    logTime("Processor Apply");

    // Generate output
    logTime("Model Generation");
    const generated_ids = await model.generate({
        ...inputs,
        max_new_tokens: 500,
    });
    logTime("Model Generation");

    logTime("Batch Decoding");
    const generated_texts = processor.batch_decode(
        generated_ids.slice(null, [inputs.input_ids.dims.at(-1), null]), 
        { skip_special_tokens: true },
    );
    logTime("Batch Decoding");

    return generated_texts[0];
};
sbrzz added the bug label on Feb 22, 2025
@sbrzz
Author

sbrzz commented Feb 23, 2025

I tried embed_tokens: "fp16" without success (the target device doesn't support it), then switched to embed_tokens: "fp32".

Just a note on this topic: I saw that webgpureport on the Android platform is missing the shader-f16 feature.
The same feature is listed on the development machine.
I suppose this is the source of the error at runtime.

It would make sense to check the availability of this feature before it breaks the code.
Check this: https://developer.chrome.com/blog/new-in-webgpu-120

const adapter = await navigator.gpu.requestAdapter();
if (!adapter.features.has("shader-f16")) {
  throw new Error("16-bit floating-point value support is not available");
}
// Explicitly request 16-bit floating-point value support.
const device = await adapter.requestDevice({
  requiredFeatures: ["shader-f16"],
});

const code = `
  enable f16;

  @compute @workgroup_size(1)
  fn main() {
    const c : vec3h = vec3<f16>(1.0h, 2.0h, 3.0h);
  }
`;

const shaderModule = device.createShaderModule({ code });
// Create a compute pipeline with this shader module
// and run the shader on the GPU...
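
Applied to the reproduction above, a minimal sketch (untested, just the idea) could gate the embed_tokens dtype on the adapter's features and fall back to fp32:

import { AutoModelForVision2Seq } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.3.3';

// Untested sketch: choose the embed_tokens dtype from the adapter's features
// instead of hard-coding fp16/fp32.
const adapter = await navigator.gpu?.requestAdapter();
const hasF16 = !!adapter && adapter.features.has("shader-f16");

const model = await AutoModelForVision2Seq.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct", {
  dtype: {
    embed_tokens: hasF16 ? "fp16" : "fp32",
    vision_encoder: "q4",
    decoder_model_merged: "q4",
  },
  device: "webgpu",
});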

@hacktronics

hacktronics commented Feb 23, 2025

Hitting the same issue (Failed to execute 'mapAsync' on 'GPUBuffer': A valid external Instance reference no longer exists.) with the Florence 2 example as well, in the browser with @huggingface/transformers.

@sbrzz
Author

sbrzz commented Feb 25, 2025

Update: I made it work by skipping the layer provided by transformers.js (apart from preprocessing).
I used onnxruntime-web directly, even though the generated tokens make no sense (help in understanding that bug is appreciated).

I definitely think this points to a bug in transformers.js; let's see if there is time to find it and patch it with a PR.

This is the working code:

import { 
  AutoProcessor,
  load_image,
  AutoConfig
} from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.3.3';

class SmolVLMInference {
    constructor(config) {
      // Model configuration
      this.modelId = "HuggingFaceTB/SmolVLM-256M-Instruct";
      this.config = {
        text_config: {
          num_key_value_heads: config.text_config.num_key_value_heads,
          head_dim: config.text_config.head_dim,
          num_hidden_layers: config.text_config.num_hidden_layers,
          eos_token_id: config.text_config.eos_token_id
        },
        image_token_id: config.image_token_id
      };
      
      // Initialize sessions and processor
      this.visionSession = null;
      this.embedSession = null;
      this.decoderSession = null;
      this.processor = null;
      
      // Model parameters from config
      this.numKeyValueHeads = this.config.text_config.num_key_value_heads;
      this.headDim = this.config.text_config.head_dim;
      this.numHiddenLayers = this.config.text_config.num_hidden_layers;
      this.eosTokenId = this.config.text_config.eos_token_id;
      this.imageTokenId = this.config.image_token_id;
    }
  
    // Initialize ONNX sessions
    async loadModels() {
      try {
        console.log("Loading ONNX models...");
        
        // Load all three models in parallel
        [this.visionSession, this.embedSession, this.decoderSession] = await Promise.all([
          ort.InferenceSession.create('./vision_encoder_q4.onnx', { executionProviders: ['webgpu'] }),
          ort.InferenceSession.create('./embed_tokens_q4.onnx', { executionProviders: ['webgpu'] }),
          ort.InferenceSession.create('./decoder_model_merged_q4.onnx', { executionProviders: ['webgpu'] })
        ]);
        
        console.log("Models loaded successfully!");
        return true;
      } catch (error) {
        console.error("Error loading models:", error);
        return false;
      }
    }
  
    // Simplified token decoder
    decodeTokens(tokens) {
      // This is a very simplified decoder
      return tokens.map(t => String.fromCharCode(97 + (Number(t) % 26))).join("");
    }
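
    // Hypothetical alternative to the placeholder above: decode with the real
    // tokenizer. AutoTokenizer.from_pretrained and tokenizer.decode are standard
    // transformers.js APIs; the lazy dynamic import here is just a sketch.
    async decodeWithTokenizer(tokens) {
      if (!this.tokenizer) {
        const { AutoTokenizer } = await import('https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.3.3');
        this.tokenizer = await AutoTokenizer.from_pretrained(this.modelId);
      }
      return this.tokenizer.decode(tokens, { skip_special_tokens: true });
    }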

    async officialPreproc(imageUrl, question){

      const image1 = await load_image(imageUrl);

      // Load processor and model
      const model_id = "HuggingFaceTB/SmolVLM-256M-Instruct";

      const processor = await AutoProcessor.from_pretrained(model_id);

      const messages = [
          {
              role: "user",
              content: [
                  { type: "image" },
                  { type: "text", text: question },
              ],
          },
      ];
      const text = processor.apply_chat_template(messages, { add_generation_prompt: true });
      const inputs = await processor(text, [image1], {
        do_image_splitting: false,
      });

      return inputs;
    }
  
    // Main inference function
    async generateText(imageUrl, question, maxNewTokens = 1024) {
      try {

        const officialInputProcessing = await this.officialPreproc(imageUrl, question);
        
        // Prepare decoder inputs
        const batchSize = 1;
        let pastKeyValues = {};
        for (let layer = 0; layer < this.numHiddenLayers; layer++) {
          for (let kv of ['key', 'value']) {
            pastKeyValues[`past_key_values.${layer}.${kv}`] = new ort.Tensor(
              'float32', 
              new Float32Array(0), 
              [batchSize, this.numKeyValueHeads, 0, this.headDim]
            );
          }
        }
        
        let imageFeatures = null;
        let inputIds = officialInputProcessing.input_ids;
        let attentionMask = officialInputProcessing.attention_mask;
        
        // Calculate position IDs
        let positionIds = this.calculatePositionIds(attentionMask);
        
        // Generation loop
        let generatedTokens = [];
        let outputText = "";
        
        console.log("Starting generation...");
        
        for (let i = 0; i < maxNewTokens; i++) {
          // Get token embeddings
          const inputIdsArray = this.getTensorData(inputIds);
          const embedFeed = { 'input_ids': inputIds };
          const embedResult = await this.embedSession.run(embedFeed);
          const inputsEmbeds = embedResult.inputs_embeds; // embedding output tensor is named 'inputs_embeds'
          
          // Process image if needed
          if (imageFeatures === null) {
            const visionFeed = {
              'pixel_values': officialInputProcessing.pixel_values,
              'pixel_attention_mask': officialInputProcessing.pixel_attention_mask
            };
            
            const visionResult = await this.visionSession.run(visionFeed);
            imageFeatures = visionResult.image_features;
            
            // Replace image token embeddings with image features
            // This would need a more complex implementation to find and replace the correct embeddings
            // For now, just a placeholder showing the concept
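            // Hypothetical sketch of that merge (tensor layout assumed, not
            // verified): copy one row of image_features into inputs_embeds at
            // every position whose input id equals the image token id.
            const hiddenSize = inputsEmbeds.dims[inputsEmbeds.dims.length - 1];
            const embedData = inputsEmbeds.data;     // assumed Float32Array of shape [1, seq_len, hidden]
            const featureData = imageFeatures.data;  // assumed Float32Array with one row per image token
            const promptIds = this.getTensorData(officialInputProcessing.input_ids);
            let featureRow = 0;
            for (let pos = 0; pos < promptIds.length; pos++) {
              if (Number(promptIds[pos]) === this.imageTokenId) {
                embedData.set(featureData.subarray(featureRow * hiddenSize, (featureRow + 1) * hiddenSize), pos * hiddenSize);
                featureRow++;
              }
            }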
          }
          
          // Run decoder model
          const decoderFeeds = {
            'inputs_embeds': inputsEmbeds,
            'attention_mask': attentionMask,
            'position_ids': positionIds,
            ...pastKeyValues
          };
          
          const decoderResults = await this.decoderSession.run(decoderFeeds);
          const logits = decoderResults.logits;
          const presentKeyValues = decoderResults.present_key_values || [];
          
          // Get next token (argmax of last logits)
          const nextToken = this.getNextToken(logits);
          
          // Update for next iteration
          inputIds = new ort.Tensor('int64', new BigInt64Array([BigInt(nextToken)]), [1, 1]);
          attentionMask = new ort.Tensor('int64', new BigInt64Array([1n]), [1, 1]);
          // Advance from the last (highest) previous position, not the first one
          const prevPositions = this.getTensorData(positionIds);
          positionIds = new ort.Tensor('int64', new BigInt64Array([prevPositions[prevPositions.length - 1] + 1n]), [1, 1]);
          
          // Update past key values
          // This would need proper handling of the present key values structure
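          // Hypothetical sketch of that handling, assuming the merged decoder
          // names its cache outputs `present.<layer>.<key|value>` (the usual
          // convention for these exports; not verified here):
          for (let layer = 0; layer < this.numHiddenLayers; layer++) {
            for (const kv of ['key', 'value']) {
              const presentKV = decoderResults[`present.${layer}.${kv}`];
              if (presentKV !== undefined) {
                pastKeyValues[`past_key_values.${layer}.${kv}`] = presentKV;
              }
            }
          }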
          
          // Add token to generated sequence
          generatedTokens.push(nextToken);
          
          // Decode token and add to output text
          const tokenText = this.decodeTokens([nextToken]);
          outputText += tokenText;
          
          // Optional streaming output
          if (i % 5 === 0) {
            console.log("Generation progress:", outputText);
          }
          
          // Check for EOS token
          if (nextToken === this.eosTokenId) {
            break;
          }
        }
        
        console.log("Generation complete!");
        return outputText;
      } catch (error) {
        console.error("Error in generation:", error);
        return "An error occurred during text generation.";
      }
    }
  
    // Helper to calculate position IDs from attention mask
    calculatePositionIds(attentionMask) {
      const attentionArray = this.getTensorData(attentionMask);
      const positionArray = new BigInt64Array(attentionArray.length);
      
      let position = 0n;
      for (let i = 0; i < attentionArray.length; i++) {
        if (attentionArray[i] === 1n) {
          positionArray[i] = BigInt(position);
          position++;
        } else {
          positionArray[i] = 0n;
        }
      }
      
      return new ort.Tensor('int64', positionArray, attentionMask.dims);
    }
  
    // Helper to get next token from logits
    getNextToken(logits) {
      // Get the last token's logits
      const lastLogits = Array.from(this.getTensorData(logits).slice(-logits.dims[2]));
      
      // Find the index of the maximum value (argmax)
      let maxIndex = 0;
      let maxValue = lastLogits[0];
      
      for (let i = 1; i < lastLogits.length; i++) {
        if (lastLogits[i] > maxValue) {
          maxValue = lastLogits[i];
          maxIndex = i;
        }
      }
      
      return maxIndex;
    }
  
    // Helper to get tensor data as array
    getTensorData(tensor) {
      return tensor.data;
    }
  }
  
  // Usage example
  async function runSmolVLM() {

    let model_id = "HuggingFaceTB/SmolVLM-256M-Instruct";
    const config = await AutoConfig.from_pretrained(model_id);
    const inferenceEngine = new SmolVLMInference(config);
    
    // Step 1: Load models
    const modelsLoaded = await inferenceEngine.loadModels();
    if (!modelsLoaded) {
      console.error("Failed to load models");
      return;
    }
    
    // Step 2: Run inference
    const imageUrl = "./Statue-of-Liberty-Island-New-York-Bay.jpg";
    const question = "Can you describe this image?";
    
    console.log("Running inference on image:", imageUrl);
    console.log("Question:", question);
    
    const result = await inferenceEngine.generateText(imageUrl, question);
    
    // Step 3: Show results
    console.log("Generated text:");
    console.log(result);
    
    // Display in UI if needed
    if (document.getElementById('result')) {
      document.getElementById('result').textContent = result;
    }
  }

// Add this at the bottom of your smolvlm.js file
export { SmolVLMInference, runSmolVLM };

And the HTML page:

<!DOCTYPE html>
<html>
<head>
  <title>SmolVLM Demo</title>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/onnxruntime-web/1.20.1/ort.webgpu.min.js"></script>
  <script type="module" src="smolvlm.js"></script>
</head>
<body>
    <h1>SmolVLM Image Captioning</h1>
    <button id="runButton">Run Model</button>
    <div id="result"></div>

    <script type="module">
    // Import the function from your module
    import { runSmolVLM } from './smolvlm.js';
    
    // Add event listener to button
    document.getElementById('runButton').addEventListener('click', async () => {
        try {
            await runSmolVLM();
        } catch (error) {
            console.error("Error running SmolVLM:", error);
            document.getElementById('result').textContent = "Error: " + error.message;
        }
    });
    </script>
</body>
</html>

@xenova
Collaborator

xenova commented Feb 25, 2025

One "optimization" which transformers.js adds is to use preferredOutputLocation to keep the kv cache on GPU between forward passes: https://onnxruntime.ai/docs/api/js/interfaces/InferenceSession.SessionOptions.html#preferredOutputLocation

Maybe try adding that to your sample code to see whether this is the cause of the issue?
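
Something along these lines, for example (just a sketch of the idea, not the exact transformers.js internals; the present.<layer>.<key|value> output names are an assumption based on the usual merged-decoder export):

// Sketch: keep only the decoder's KV-cache outputs on the GPU, so `logits`
// stays on the CPU and can still be read directly. Assumes `ort` is available
// globally (ort.webgpu.min.js) as in the demo page above.
import { AutoConfig } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.3.3';

const config = await AutoConfig.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct");
const preferredOutputLocation = {};
for (let layer = 0; layer < config.text_config.num_hidden_layers; layer++) {
  preferredOutputLocation[`present.${layer}.key`] = 'gpu-buffer';
  preferredOutputLocation[`present.${layer}.value`] = 'gpu-buffer';
}

const decoderSession = await ort.InferenceSession.create('./decoder_model_merged_q4.onnx', {
  executionProviders: ['webgpu'],
  preferredOutputLocation,
});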

@sbrzz
Author

sbrzz commented Feb 25, 2025

@xenova I suppose you are talking about this:

preferredOutputLocation[key] = 'gpu-buffer';

I tried setting gpu-buffer globally, like this:

// Load all three models in parallel
[this.visionSession, this.embedSession, this.decoderSession] = await Promise.all([
  ort.InferenceSession.create('./vision_encoder_q4.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: "gpu-buffer" }),
  ort.InferenceSession.create('./embed_tokens_q4.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: "gpu-buffer" }),
  ort.InferenceSession.create('./decoder_model_merged_q4.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: "gpu-buffer" }),
]);

An error is raised because I try to access the logits data while it is still on the GPU rather than the CPU; it happens here:

const lastLogits = Array.from(this.getTensorData(logits).slice(-logits.dims[2]));

Is the same issue hitting transformers.js?

Error in generation: Error: The data is not on CPU. Use `getData()` to download GPU data to CPU, or use `texture` or `gpuBuffer` property to access the GPU data directly.
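
If the global gpu-buffer setting is kept, I suppose the read would have to go through getData() instead, something like this (untested sketch):

// Untested sketch: download the logits to the CPU before reading them,
// since preferredOutputLocation: "gpu-buffer" leaves them in a GPU buffer.
const logitsData = await logits.getData();
const lastLogits = Array.from(logitsData.slice(-logits.dims[2]));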
