WebGPU crash on Android Chrome running SmolVLM-256M-Instruct #1205
Comments
Just a note on this topic: I saw that webgpureport on the Android platform is missing the `shader-f16` feature. It could make sense to check the availability of this option before breaking the code:
const adapter = await navigator.gpu.requestAdapter();
if (!adapter.features.has("shader-f16")) {
throw new Error("16-bit floating-point value support is not available");
}
// Explicitly request 16-bit floating-point value support.
const device = await adapter.requestDevice({
requiredFeatures: ["shader-f16"],
});
const code = `
enable f16;
@compute @workgroup_size(1)
fn main() {
const c : vec3h = vec3<f16>(1.0h, 2.0h, 3.0h);
}
`;
const shaderModule = device.createShaderModule({ code });
// Create a compute pipeline with this shader module
// and run the shader on the GPU...
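For completeness, here is a variant that degrades gracefully instead of throwing (a sketch, assuming the app can fall back to fp32 weights when `shader-f16` is unavailable; the `dtype` switch at the end is hypothetical):
const adapter = await navigator.gpu.requestAdapter();
const hasF16 = adapter !== null && adapter.features.has("shader-f16");
// Only request shader-f16 when the adapter actually advertises it.
const device = await adapter.requestDevice({
  requiredFeatures: hasF16 ? ["shader-f16"] : [],
});
// Pick model weights accordingly, e.g. for the transformers.js dtype option:
const dtype = hasF16 ? "fp16" : "fp32";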
Hitting the same issue (`Failed to execute 'mapAsync' on 'GPUBuffer': A valid external Instance reference no longer exists.`) with the Florence 2 example, also in the browser with [email protected].
Update: I made it work by skipping the layer provided by transformers.js (apart from preprocessing). I definitely think this is proof of a bug in transformers.js; let's see if there is time to find it and patch it with a PR. This is the working code:
import {
AutoProcessor,
load_image,
AutoConfig
} from 'https://cdn.jsdelivr.net/npm/@huggingface/[email protected]';
class SmolVLMInference {
constructor(config) {
// Model configuration
this.modelId = "HuggingFaceTB/SmolVLM-256M-Instruct";
this.config = {
text_config: {
num_key_value_heads: config.text_config.num_key_value_heads,
head_dim: config.text_config.head_dim,
num_hidden_layers: config.text_config.num_hidden_layers,
eos_token_id: config.text_config.eos_token_id
},
image_token_id: config.image_token_id
};
// Initialize sessions and processor
this.visionSession = null;
this.embedSession = null;
this.decoderSession = null;
this.processor = null;
// Model parameters from config
this.numKeyValueHeads = this.config.text_config.num_key_value_heads;
this.headDim = this.config.text_config.head_dim;
this.numHiddenLayers = this.config.text_config.num_hidden_layers;
this.eosTokenId = this.config.text_config.eos_token_id;
this.imageTokenId = this.config.image_token_id;
}
// Initialize ONNX sessions
async loadModels() {
try {
console.log("Loading ONNX models...");
// Load all three models in parallel
[this.visionSession, this.embedSession, this.decoderSession] = await Promise.all([
ort.InferenceSession.create('./vision_encoder_q4.onnx', { executionProviders: ['webgpu'] }),
ort.InferenceSession.create('./embed_tokens_q4.onnx', { executionProviders: ['webgpu'] }),
ort.InferenceSession.create('./decoder_model_merged_q4.onnx', { executionProviders: ['webgpu'] })
]);
console.log("Models loaded successfully!");
return true;
} catch (error) {
console.error("Error loading models:", error);
return false;
}
}
// Simplified token decoder
decodeTokens(tokens) {
// This is a very simplified decoder
return tokens.map(t => String.fromCharCode(97 + (Number(t) % 26))).join("");
}
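// A more realistic sketch (an assumption, not part of the code above): reuse
// the tokenizer that transformers.js loads alongside the processor instead of
// the simplified mapping, e.g.:
//   const { tokenizer } = await AutoProcessor.from_pretrained(this.modelId);
//   return tokenizer.decode(tokens, { skip_special_tokens: true });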
async officialPreproc(imageUrl, question){
const image1 = await load_image(imageUrl);
// Load processor and model
const model_id = "HuggingFaceTB/SmolVLM-256M-Instruct";
const processor = await AutoProcessor.from_pretrained(model_id);
const messages = [
{
role: "user",
content: [
{ type: "image" },
{ type: "text", text: question },
],
},
];
const text = processor.apply_chat_template(messages, { add_generation_prompt: true });
const inputs = await processor(text, [image1], {
do_image_splitting: false,
});
return inputs;
}
// Main inference function
async generateText(imageUrl, question, maxNewTokens = 1024) {
try {
const officialInputProcessing = await this.officialPreproc(imageUrl, question);
// Prepare decoder inputs
const batchSize = 1;
let pastKeyValues = {};
for (let layer = 0; layer < this.numHiddenLayers; layer++) {
for (let kv of ['key', 'value']) {
pastKeyValues[`past_key_values.${layer}.${kv}`] = new ort.Tensor(
'float32',
new Float32Array(0),
[batchSize, this.numKeyValueHeads, 0, this.headDim]
);
}
}
let imageFeatures = null;
let inputIds = officialInputProcessing.input_ids;
let attentionMask = officialInputProcessing.attention_mask;
// Calculate position IDs
let positionIds = this.calculatePositionIds(attentionMask);
// Generation loop
let generatedTokens = [];
let outputText = "";
console.log("Starting generation...");
for (let i = 0; i < maxNewTokens; i++) {
// Get token embeddings
const inputIdsArray = this.getTensorData(inputIds);
const embedFeed = { 'input_ids': inputIds };
const embedResult = await this.embedSession.run(embedFeed);
const inputsEmbeds = embedResult.inputs_embeds; // Assumes the output tensor is named 'inputs_embeds'
// Process image if needed
if (imageFeatures === null) {
const visionFeed = {
'pixel_values': officialInputProcessing.pixel_values,
'pixel_attention_mask': officialInputProcessing.pixel_attention_mask
};
const visionResult = await this.visionSession.run(visionFeed);
imageFeatures = visionResult.image_features;
// Replace image token embeddings with image features
// This would need a more complex implementation to find and replace the correct embeddings
// For now, just a placeholder showing the concept
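// A minimal sketch of that replacement (an assumption, not the final code):
// copy each vision feature row into inputs_embeds wherever input_ids equals
// the image token id. Assumes inputs_embeds is [1, seqLen, hidden] and
// image_features is [1, numImageTokens, hidden].
const ids = this.getTensorData(inputIds);
const hidden = inputsEmbeds.dims[2];
let featRow = 0;
for (let pos = 0; pos < ids.length; pos++) {
  if (Number(ids[pos]) === this.imageTokenId) {
    const src = featRow * hidden;
    inputsEmbeds.data.set(imageFeatures.data.subarray(src, src + hidden), pos * hidden);
    featRow++;
  }
}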
}
// Run decoder model
const decoderFeeds = {
'inputs_embeds': inputsEmbeds,
'attention_mask': attentionMask,
'position_ids': positionIds,
...pastKeyValues
};
const decoderResults = await this.decoderSession.run(decoderFeeds);
const logits = decoderResults.logits;
const presentKeyValues = decoderResults.present_key_values || []; // NOTE: unused; merged decoders usually emit per-layer 'present.*' outputs instead
// Get next token (argmax of last logits)
const nextToken = this.getNextToken(logits);
// Update for next iteration
inputIds = new ort.Tensor('int64', new BigInt64Array([BigInt(nextToken)]), [1, 1]);
attentionMask = new ort.Tensor('int64', new BigInt64Array([1n]), [1, 1]); // NOTE: most decoders expect the mask to cover past tokens as well, not just the new one
const prevPositions = this.getTensorData(positionIds);
positionIds = new ort.Tensor('int64', new BigInt64Array([prevPositions[prevPositions.length - 1] + 1n]), [1, 1]); // last position + 1, not positionIds[0]
// Update past key values
// This would need proper handling of the present key values structure
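// A minimal sketch (an assumption about the output names, which merged
// decoders typically expose as 'present.{layer}.key' / 'present.{layer}.value'):
for (const name of Object.keys(decoderResults)) {
  if (name.startsWith('present.')) {
    pastKeyValues[name.replace('present', 'past_key_values')] = decoderResults[name];
  }
}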
// Add token to generated sequence
generatedTokens.push(nextToken);
// Decode token and add to output text
const tokenText = this.decodeTokens([nextToken]);
outputText += tokenText;
// Optional streaming output
if (i % 5 === 0) {
console.log("Generation progress:", outputText);
}
// Check for EOS token
if (nextToken === this.eosTokenId) {
break;
}
}
console.log("Generation complete!");
return outputText;
} catch (error) {
console.error("Error in generation:", error);
return "An error occurred during text generation.";
}
}
// Helper to calculate position IDs from attention mask
calculatePositionIds(attentionMask) {
const attentionArray = this.getTensorData(attentionMask);
const positionArray = new BigInt64Array(attentionArray.length);
let position = 0n;
for (let i = 0; i < attentionArray.length; i++) {
if (attentionArray[i] === 1n) {
positionArray[i] = BigInt(position);
position++;
} else {
positionArray[i] = 0n;
}
}
return new ort.Tensor('int64', positionArray, attentionMask.dims);
}
// Helper to get next token from logits
getNextToken(logits) {
// Get the last token's logits
const lastLogits = Array.from(this.getTensorData(logits).slice(-logits.dims[2]));
// Find the index of the maximum value (argmax)
let maxIndex = 0;
let maxValue = lastLogits[0];
for (let i = 1; i < lastLogits.length; i++) {
if (lastLogits[i] > maxValue) {
maxValue = lastLogits[i];
maxIndex = i;
}
}
return maxIndex;
}
// Helper to get tensor data as array
getTensorData(tensor) {
return tensor.data;
}
}
// Usage example
async function runSmolVLM() {
let model_id = "HuggingFaceTB/SmolVLM-256M-Instruct";
const config = await AutoConfig.from_pretrained(model_id);
const inferenceEngine = new SmolVLMInference(config);
// Step 1: Load models
const modelsLoaded = await inferenceEngine.loadModels();
if (!modelsLoaded) {
console.error("Failed to load models");
return;
}
// Step 2: Run inference
const imageUrl = "./Statue-of-Liberty-Island-New-York-Bay.jpg";
const question = "Can you describe this image?";
console.log("Running inference on image:", imageUrl);
console.log("Question:", question);
const result = await inferenceEngine.generateText(imageUrl, question);
// Step 3: Show results
console.log("Generated text:");
console.log(result);
// Display in UI if needed
if (document.getElementById('result')) {
document.getElementById('result').textContent = result;
}
}
// Add this at the bottom of your smolvlm.js file
export { SmolVLMInference, runSmolVLM };

<!DOCTYPE html>
<html>
<head>
<title>SmolVLM Demo</title>
<script src="https://cdnjs.cloudflare.com/ajax/libs/onnxruntime-web/1.20.1/ort.webgpu.min.js"></script>
<script type="module" src="smolvlm.js"></script>
</head>
<body>
<h1>SmolVLM Image Captioning</h1>
<button id="runButton">Run Model</button>
<div id="result"></div>
<script type="module">
// Import the function from your module
import { runSmolVLM } from './smolvlm.js';
// Add event listener to button
document.getElementById('runButton').addEventListener('click', async () => {
try {
await runSmolVLM();
} catch (error) {
console.error("Error running SmolVLM:", error);
document.getElementById('result').textContent = "Error: " + error.message;
}
});
</script>
</body>
</html>
One "optimization" which transformers.js adds is to use Maybe try add that to your sample code to see whether this is the cause of the issue? |
@xenova I suppose you mean this: Line 297 in c2ab81a
I tried to globally set `preferredOutputLocation: "gpu-buffer"`:
// Load all three models in parallel
[this.visionSession, this.embedSession, this.decoderSession] = await Promise.all([
ort.InferenceSession.create('./vision_encoder_q4.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: "gpu-buffer" }),
ort.InferenceSession.create('./embed_tokens_q4.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: "gpu-buffer" }),
ort.InferenceSession.create('./decoder_model_merged_q4.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: "gpu-buffer" }),
]);
An issue is raised because I then try to access the output tensor's data on the CPU. Is the same issue hitting transformers.js?
Error in generation: Error: The data is not on CPU. Use `getData()` to download GPU data to CPU, or use `texture` or `gpuBuffer` property to access the GPU data directly.
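For reference, a minimal sketch of reading back a GPU-resident output with the ONNX Runtime Web tensor API (the location check is an assumption about how outputs are returned when preferredOutputLocation is set):
const results = await this.decoderSession.run(decoderFeeds);
const logits = results.logits;
// With preferredOutputLocation: "gpu-buffer", reading tensor.data throws;
// getData() downloads the GPU buffer to the CPU first.
const logitsData = logits.location === 'gpu-buffer'
  ? await logits.getData()
  : logits.data;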
System Info
transformers.js 3.3.3 (via https://cdn.jsdelivr.net/npm/@huggingface/[email protected])
Platform Android 13
Chrome for Android 133.0.6943.50
webgpureport attached
webgpureport-2025-02-22T06-59-48-900Z.txt
Environment/Platform
Description
I successfully ran SmolVLM-256M-Instruct on my development machine (WebGPU enabled) and saw a classic 10x improvement over WASM.
However, I get an error when I run the same code on the target device (Android).
I tried `embed_tokens: "fp16"` without success (no support on the target device), then switched to `embed_tokens: "fp32"`.
Chrome console output:
WebGL: CONTEXT_LOST_WEBGL: loseContext: context lost
A valid external Instance reference no longer exists.
Uncaught (in promise) AbortError: Failed to execute 'mapAsync' on 'GPUBuffer': A valid external Instance reference no longer exists.
I assumed the problem had been fixed, based on #943.
Any idea?
Reproduction
Code to Reproduce