failures during experimental feature parallel compile #6250

Open
vladmandic opened this issue Mar 19, 2022 · 6 comments · May be fixed by #8510
@vladmandic
Contributor

I'm testing the new experimental feature from PR #5826 "Add functions for parallel compilation",
which was recently merged into the main branch.

I'm loading a number of small models and attempting to run pre-compile, and I'm getting errors on all attempts.

Here I've documented three different failures:

  • Compile fails on some models (while working fine on others) with a seemingly unrelated message such as:

    Uncaught (in promise) Error: Pass at least one tensor to tf.stack

  • Compile completes without errors, but later actual code execution in JS fails
    (the same code works just fine if there is no pre-compile):

    Uncaught (in promise) TypeError: Cannot read properties of null (reading 'A')
      at tfjs.esm.js:47772:27
      at Array.forEach (<anonymous>)
      at runProgram (tfjs.esm.js:47770:10)
      at _MathBackendWebGL.runWebGLProgram (tfjs.esm.js:49796:7)
      at _MathBackendWebGL.uploadToGPU (tfjs.esm.js:49916:40)
    

    This happens in a trivial function that runs tf.image.resizeBilinear followed by tf.div to normalize the input tensor (see the sketch after this list).

  • Compile completes without errors, but later model inference fails with the same error as above.
    The actual backtrace shows that it happens during the execute call, and the kernel op in the model that triggers the error is a simple sub
    (the same model executes without issues if there is no pre-compile).
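
For reference, a minimal sketch of that trivial preprocessing function (the input type, target size, and normalization constant here are illustrative, not the actual code):

import * as tf from '@tensorflow/tfjs';

// Minimal sketch of the preprocessing step described above: resize the
// input with resizeBilinear, then divide to normalize pixel values.
function preprocess(input: tf.Tensor3D): tf.Tensor {
  return tf.tidy(() => {
    const resized = tf.image.resizeBilinear(input, [64, 64]); // resize to the model input size
    return tf.div(resized, 255); // normalize input tensor to [0, 1]
  });
}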

My function that runs pre-compile on all models is:

type Models = Record<string, GraphModel>;

async function runCompile(allModels: Models) {
  const backendType = tf.getBackend();
  const webGLBackend = tf.backend();
  if ((backendType !== 'webgl') || (!webGLBackend || !webGLBackend.checkCompileCompletion)) {
    log('compile pass: skip');
    return;
  }
  const models = Object.values(allModels).filter((m) => m !== null) as GraphModel[];
  tf.env().set('ENGINE_COMPILE_ONLY', true); // compile shader programs without executing them
  const numTensorsStart = tf.engine().state.numTensors; // snapshot for leak detection below
  for (const model of models) {
    const shape = (model.inputs && model.inputs[0] && model.inputs[0].shape) ? [...model.inputs[0].shape] : [1, 64, 64, 3];
    const dtype = (model.inputs && model.inputs[0] && model.inputs[0].dtype) ? model.inputs[0].dtype : 'float32';
    for (let dim = 0; dim < shape.length; dim++) {
      if (shape[dim] === -1) shape[dim] = dim === 0 ? 1 : 64; // override batch number and any dynamic dimensions
    }
    const tensor = tf.zeros(shape, dtype);
    const res = await model.executeAsync(tensor);
    if (Array.isArray(res)) res.forEach((t) => tf.dispose(t));
    else tf.dispose(res);
    tf.dispose(tensor);
  }
  const kernels = await webGLBackend.checkCompileCompletionAsync(); // same errors if check is moved inside per-model loop
  webGLBackend.getUniformLocations();
  log('compile pass kernels:', kernels.length); // getting a reasonable value here
  tf.env().set('ENGINE_COMPILE_ONLY', false);
  const numTensorsEnd = tf.engine().state.numTensors;
  if ((numTensorsEnd - numTensorsStart) > 0) log('tensor leak:', numTensorsEnd - numTensorsStart); // no leaks
}
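
A hypothetical call site for the function above (the model name and URL are illustrative):

async function main() {
  const allModels: Models = {
    facedetect: await tf.loadGraphModel('/models/facedetect/model.json'), // illustrative model URL
  };
  await runCompile(allModels);
}
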
@lina128
Collaborator

lina128 commented Apr 14, 2022

Hi @vladmandic, thank you for reporting this. For the parallel compilation experimental feature, we only tested with a single model; with multiple models, the state may get messed up because of the async call (this line: const res = await model.executeAsync(tensor)). Maybe try using model.execute() instead; we'd like to know whether that works. In any case, we are working on some infra improvements that will allow us to track state for each model, and once that is done we will be able to support multiple models.
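
A minimal sketch of the suggested change as a standalone warmup helper, assuming the model supports synchronous execution (models containing dynamic ops will throw here and still require executeAsync):

import * as tf from '@tensorflow/tfjs';
import type { GraphModel } from '@tensorflow/tfjs';

// Per-model warmup with the synchronous execute() call, which avoids the
// state issues caused by awaiting executeAsync between models.
function warmupSync(model: GraphModel, shape: number[]) {
  const tensor = tf.zeros(shape);
  const res = model.execute(tensor); // synchronous execution
  if (Array.isArray(res)) res.forEach((t) => tf.dispose(t));
  else tf.dispose(res);
  tf.dispose(tensor);
}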

@vladmandic
Contributor Author

Yup, that does the trick!

And pre-compile definitely speeds up time to first inference: roughly 30% in my tests using simple models.
That is VERY useful for webapps where time-to-interactive is critical.

I do wish there was a way to determine ahead of time if a model can be executed synchronously,
instead of wrapping the block in try...catch (I do have an open feature request for that); a sketch of that workaround follows.
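
A minimal sketch of that try...catch workaround (the helper name is hypothetical; there is no public API to query this ahead of time, hence the probe):

import * as tf from '@tensorflow/tfjs';
import type { GraphModel } from '@tensorflow/tfjs';

// Attempt synchronous execution first and fall back to executeAsync when
// the model requires the async path (e.g. it contains dynamic ops).
async function runModel(model: GraphModel, input: tf.Tensor): Promise<tf.Tensor | tf.Tensor[]> {
  try {
    return model.execute(input); // throws for models that require async execution
  } catch {
    return model.executeAsync(input); // fallback for dynamic-op models
  }
}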

for reference:

with the additional ENGINE_COMPILE_ONLY step:

loaded models: 9
compile fail model: handtrack
compile pass models: (8) ['centernet', 'emotion', 'facedetect', 'faceiris', 'facemesh', 'faceres', 'handskeleton', 'movenet']
compile pass kernels: 306
warmup full 2781 ms

without the pre-compile step:

loaded models: 9
warmup full 3920 ms

@vladmandic
Contributor Author

Any update on supporting models that require async execution?
Or on how to detect in advance whether a model requires async execution in the first place?

@SangbumChoi

Any progress update on this parallel compilation feature?

@gaikwadrahul8
Contributor

Hi, @vladmandic

Apologies for the delayed response. We're revisiting our older issues and checking whether they have been resolved, so may I ask whether you are still looking for a solution, or has your issue been resolved?

If the issue still persists with the latest version of TFJS, please let us know and include an error log and a code snippet so we can replicate the issue on our end.

Could you please confirm whether this issue is resolved for you? Please feel free to close the issue if it is. Thank you!

@vladmandic
Contributor Author

Yes, this issue is still valid, and there have been no updates from the TFJS team.
