This repository was archived by the owner on Oct 31, 2023. It is now read-only.

CLEVR Question Generation

CLEVR questions are generated using the script, which is expected to be run from the question_generation directory.

This script reads a JSON file containing information about scenes (as produced by and outputs a JSON file containing questions, functional programs, and answers for those images. In most cases the script will be invoked like this:

python --input_scene_file $INPUT_FILE --output_questions_file $OUTPUT_FILE

Question generation has no dependencies other than Python itself. The code was developed on Python 3.5, but should also work on Python 2.7.

Questions are generated by instantiating question templates; the question templates used for our CVPR paper can be found in the directory CLEVR_1.0_templates. Each file in this directory contains several related templates.

Selecting input scenes

By default the script generates questions for all images in the input file. However, you can generate questions for only a subset of images using the --scene_start_idx and --num_scenes flags: the former gives the index at which to start generating questions, and the latter gives the number of images for which questions should be generated. These flags are useful for distributing question generation among many workers.
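As a minimal sketch of how work might be split among workers, the helper below computes one (--scene_start_idx, --num_scenes) pair per worker. The function name `shard_scenes` and the scene and worker counts are hypothetical and not part of the repository.

```python
def shard_scenes(total_scenes, num_workers):
    """Split [0, total_scenes) into contiguous (start_idx, num_scenes) chunks."""
    base, extra = divmod(total_scenes, num_workers)
    shards = []
    start = 0
    for worker in range(num_workers):
        # The first `extra` workers take one additional scene each.
        count = base + (1 if worker < extra else 0)
        if count > 0:
            shards.append((start, count))
        start += count
    return shards


# Print the flag values each worker would pass (hypothetical sizes).
for start, count in shard_scenes(total_scenes=10, num_workers=3):
    print("--scene_start_idx {} --num_scenes {}".format(start, count))
```

Each worker would then append its pair of flags to the usual invocation and write to its own output file.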

Controlling questions per image

The flag --templates_per_image (default 10) is the number of templates that we will aim to instantiate for every image, and the flag --instances_per_template gives the number of instantiations we will try to find per template. In total the number of questions per image will be the product of --templates_per_image and --instances_per_template; however some images may have slightly fewer questions if no valid template instantiations can be found.

Question Templates

Each question template consists of four components:

  1. One or more parameters, each with a type and a name. Instantiating the template amounts to choosing a value for each of these parameters; parameters may be given a NULL value.
  2. One or more text templates that give a natural-language representation of the question.
  3. A program template consisting of a sequence of nodes; each node in the program template may expand to multiple functions in the final program instantiated from the template.
  4. Zero or more constraints restricting the values that the parameters are allowed to take.

Here is an example template:

{
  "params": [
    {"type": "Size", "name": "<Z>"},
    {"type": "Color", "name": "<C>"},
    {"type": "Material", "name": "<M>"},
    {"type": "Shape", "name": "<S>"},
    {"type": "Relation", "name": "<R>"},
    {"type": "Size", "name": "<Z2>"},
    {"type": "Color", "name": "<C2>"},
    {"type": "Material", "name": "<M2>"},
    {"type": "Shape", "name": "<S2>"}
  ],
  "text": [
    "What size is the <Z2> <C2> <M2> <S2> [that is] <R> the <Z> <C> <M> <S>?",
    "What is the size of the <Z2> <C2> <M2> <S2> [that is] <R> the <Z> <C> <M> <S>?",
    "How big is the <Z2> <C2> <M2> <S2> [that is] <R> the <Z> <C> <M> <S>?",
    "There is a <Z2> <C2> <M2> <S2> [that is] <R> the <Z> <C> <M> <S>; what size is it?",
    "There is a <Z2> <C2> <M2> <S2> [that is] <R> the <Z> <C> <M> <S>; how big is it?",
    "There is a <Z2> <C2> <M2> <S2> [that is] <R> the <Z> <C> <M> <S>; what is its size?"
  ],
  "nodes": [
    {"type": "scene", "inputs": []},
    {"type": "filter_unique", "inputs": [0], "side_inputs": ["<Z>", "<C>", "<M>", "<S>"]},
    {"type": "relate_filter_unique", "inputs": [1], "side_inputs": ["<R>", "<Z2>", "<C2>", "<M2>", "<S2>"]},
    {"type": "query_size", "inputs": [2]}
  ],
  "constraints": [
    {"type": "NULL", "params": ["<Z2>"]}
  ]
}


The special file metadata.json defines the simple functional programming language used to construct programs and program templates.

Template Parameters

Each template parameter has a type and a name; the allowed types are Size, Color, Material, Shape, and Relation. The allowed values for each of these types are stored in metadata.json; in addition to the values defined there, each non-Relation template parameter may also be assigned the value NULL.

By convention, Size parameters are called <Z>, <Z2>, <Z3>, etc.; similarly Color parameters are called <C>, Material parameters are called <M>, Shape parameters are called <S>, and Relation parameters are called <R>.

Text Templates

Each question template defines one or more text templates, which give different ways of expressing the question in natural language. Text templates must use all of the template parameters. After values have been chosen for all template parameters, a natural-language version of the question is generated by randomly choosing one of the text templates and replacing the parameter names with their values. Parameters whose value is NULL are replaced with the empty string, unless the parameter has type Shape, in which case its textual value is "thing".
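The substitution rule can be sketched roughly as follows. This is an illustration rather than the repository's implementation: the function name `fill_text_template` is hypothetical, NULL is represented here by Python's `None`, and the whitespace cleanup at the end is an assumption about how the gaps left by empty values are tidied.

```python
def fill_text_template(text, params, values):
    """Replace each parameter name in `text` with its chosen value.

    `params` is a template's "params" list; `values` maps parameter names
    (e.g. "<Z>") to a chosen value, with None (or a missing key) meaning NULL.
    """
    for param in params:
        name = param["name"]
        value = values.get(name)
        if value is None:
            # NULL: a Shape becomes "thing", every other type becomes empty.
            value = "thing" if param["type"] == "Shape" else ""
        text = text.replace(name, value)
    # Collapse the repeated spaces left behind by empty replacements.
    return " ".join(text.split())


params = [
    {"type": "Size", "name": "<Z>"},
    {"type": "Color", "name": "<C>"},
    {"type": "Material", "name": "<M>"},
    {"type": "Shape", "name": "<S>"},
]
print(fill_text_template("What size is the <Z> <C> <M> <S>?", params, {"<C>": "red"}))
# prints: What size is the red thing?
```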

To increase linguistic diversity, the file synonyms.json defines a set of synonyms for template parameter values, e.g. "ball" is a synonym for "sphere". When instantiating templates, values are randomly replaced by synonyms.
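A rough sketch of synonym replacement is shown below. The dictionary layout (each value mapped to a list of interchangeable words, including itself) is an assumption made for illustration and may not match the actual structure of synonyms.json; the name `pick_synonym` is likewise hypothetical.

```python
import random

# Hypothetical excerpt; the real synonyms.json may be laid out differently.
SYNONYMS = {
    "sphere": ["sphere", "ball"],
}


def pick_synonym(value, synonyms, rng=random):
    """Return a random synonym for `value`, or `value` itself if none are listed."""
    return rng.choice(synonyms.get(value, [value]))


print(pick_synonym("sphere", SYNONYMS))  # either "sphere" or "ball"
print(pick_synonym("cube", SYNONYMS))    # no entry, so the value is unchanged
```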

Text templates can also have optional segments; any text surrounded by square brackets will be removed with probability 0.5 during template instantiation. In the example above, the substring "that is" is optional in all text templates.

Finally, there are some special-case heuristics that replace the word "other" with "another", "a", or the empty string in some circumstances to try to minimize ambiguity.
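One way to picture the handling of optional segments is the sketch below, which keeps or drops each bracketed span independently. The function name `drop_optional_segments` and the regular-expression approach are assumptions for illustration; the actual script may process brackets differently.

```python
import random
import re


def drop_optional_segments(text, rng=random):
    """Keep or remove each [bracketed] span, each with probability 0.5."""
    def choose(match):
        # Keep the inner text half of the time, otherwise drop it entirely.
        return match.group(1) if rng.random() < 0.5 else ""

    text = re.sub(r"\[([^\[\]]*)\]", choose, text)
    # Tidy the doubled spaces left behind when a segment is dropped.
    return " ".join(text.split())


print(drop_optional_segments("How big is the cube [that is] left of the sphere?"))
```

Passing a custom `rng` object with a `random()` method makes the choice deterministic, which is convenient when testing.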

Program Templates

A program template is defined as a sequence of nodes. Each node receives input from zero or more other nodes and produces an output; the sequence is expected to be topologically sorted in the template. The inputs to each node are identified by the inputs field of the node, which is a list of integers indexing into the node sequence. A node in a program template may expand to more than one node in the program instantiated from the template.

Each node has a type, such as scene or filter_color; the metadata.json file defines the full list of available node types, as well as the input and output types for each node type.
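The ordering requirement can be checked with a few lines of code. The sketch below is not taken from the repository; `check_topological` is a hypothetical helper that only verifies that every input index refers to an earlier node.

```python
def check_topological(nodes):
    """Raise ValueError if any node takes input from itself or a later node."""
    for idx, node in enumerate(nodes):
        for src in node["inputs"]:
            if not 0 <= src < idx:
                raise ValueError(
                    "node {} ({}) has invalid input index {}".format(
                        idx, node["type"], src))


# The "nodes" list from the example template above.
nodes = [
    {"type": "scene", "inputs": []},
    {"type": "filter_unique", "inputs": [0],
     "side_inputs": ["<Z>", "<C>", "<M>", "<S>"]},
    {"type": "relate_filter_unique", "inputs": [1],
     "side_inputs": ["<R>", "<Z2>", "<C2>", "<M2>", "<S2>"]},
    {"type": "query_size", "inputs": [2]},
]
check_topological(nodes)  # passes silently for a well-ordered template
```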

In addition to receiving inputs from earlier nodes, some nodes also receive side inputs (also called value inputs in some places); these are literal values of some type. The number and types of expected side inputs for all node types are also listed in the metadata.json file.

As a concrete example, in the template above the first node has type scene; the metadata.json file gives us the following information about this node type:

// From metadata.json
{
  "name": "scene",
  "inputs": [],
  "output": "ObjectSet",
  "terminal": false
}

This indicates that scene nodes receive no inputs, and output an ObjectSet; scene nodes receive no side inputs, and cannot be the final node in a fully instantiated program since they are not terminal.

The next node in the sequence above has type filter_unique; since its inputs field is [0], it receives as input the output of the preceding scene node. The metadata.json file gives us the following information about this node type:

// From metadata.json
{
  "name": "filter_unique",
  "inputs": ["ObjectSet"],
  "side_inputs": ["Size", "Color", "Material", "Shape"],
  "output": "Object",
  "terminal": false,
  "template_only": true
}

Thus nodes of type filter_unique receive one input of type ObjectSet and four side inputs of type Size, Color, Material, and Shape (corresponding to parameters <Z>, <C>, <M>, <S> in the side_inputs field of the template node), and produce an output of type Object. Again, this node is not terminal, so it cannot be the final node of a fully instantiated program. This node type is marked as template_only, indicating that it is only valid as part of a program template and cannot be used in a fully instantiated program; during instantiation, template nodes of type filter_unique will be replaced by a subsequence of filter_size, filter_color, filter_material, filter_shape, followed by a unique node. The use of special template-only nodes like this leads to more expressive templates, and also allows us to more easily prune the search space during template instantiation.
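To make the expansion concrete, here is a hedged sketch of how a filter_unique node could be rewritten. It assumes that a filter node is emitted only for side inputs whose value is not NULL (represented by `None`), and that instantiated nodes reuse the "type", "inputs", and "side_inputs" fields of template nodes; the function name and both assumptions are illustrative, not confirmed by the text above.

```python
FILTER_ORDER = ["filter_size", "filter_color", "filter_material", "filter_shape"]


def expand_filter_unique(input_idx, side_values, first_new_idx):
    """Expand one filter_unique node into filter_* nodes plus a unique node.

    input_idx:     index of the node feeding the filter_unique node
    side_values:   [size, color, material, shape], None meaning NULL
    first_new_idx: index the first emitted node will have in the program
    """
    nodes = []
    prev = input_idx
    for node_type, value in zip(FILTER_ORDER, side_values):
        if value is None:
            continue  # assumed: NULL parameters produce no filter node
        nodes.append({"type": node_type, "inputs": [prev], "side_inputs": [value]})
        prev = first_new_idx + len(nodes) - 1
    nodes.append({"type": "unique", "inputs": [prev]})
    return nodes


for node in expand_filter_unique(0, [None, "red", None, "cube"], first_new_idx=1):
    print(node["type"], node["inputs"])
```

With only a color and a shape chosen, the sketch yields filter_color, filter_shape, and unique, chained one after another from the scene node at index 0.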

Continuing with the example template above, the output from the filter_unique node is passed to a node of type relate_filter_unique, which takes an input of type Object and five side inputs, and produces an output of type Object. This is another special template-only node type, which will expand into a relate node followed by some subsequence of filter_size, filter_color, filter_material, filter_shape, followed by a unique node. The output of the relate_filter_unique node is then passed to a node of type query_size, which takes an Object as input and produces an output of type Size. This node type is terminal and is not template-only, so it will be the final node of both the program template and all programs instantiated from that template.
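The type flow walked through above can be verified mechanically. The sketch below uses a hand-written `SPECS` table that mirrors the fields described for metadata.json; the table itself, the helper name `output_types`, and the "terminal" value given for relate_filter_unique are assumptions for illustration.

```python
# Hypothetical specs mirroring the fields described for metadata.json.
SPECS = {
    "scene": {"inputs": [], "output": "ObjectSet", "terminal": False},
    "filter_unique": {"inputs": ["ObjectSet"], "output": "Object", "terminal": False},
    # "terminal" for relate_filter_unique is assumed, not stated in the text.
    "relate_filter_unique": {"inputs": ["Object"], "output": "Object", "terminal": False},
    "query_size": {"inputs": ["Object"], "output": "Size", "terminal": True},
}


def output_types(nodes, specs):
    """Return each node's output type, checking input types along the way."""
    types = []
    for idx, node in enumerate(nodes):
        spec = specs[node["type"]]
        received = [types[src] for src in node["inputs"]]
        if received != spec["inputs"]:
            raise TypeError("node {} ({}) expects {} but receives {}".format(
                idx, node["type"], spec["inputs"], received))
        types.append(spec["output"])
    if not specs[nodes[-1]["type"]]["terminal"]:
        raise TypeError("the final node must be terminal")
    return types


template_nodes = [
    {"type": "scene", "inputs": []},
    {"type": "filter_unique", "inputs": [0]},
    {"type": "relate_filter_unique", "inputs": [1]},
    {"type": "query_size", "inputs": [2]},
]
print(output_types(template_nodes, SPECS))
# prints: ['ObjectSet', 'Object', 'Object', 'Size']
```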


Constraints

Templates can define constraints on the values that template parameters are allowed to take; constraints can be necessary to ensure that the question does not give away its answer. The example template above includes a constraint that the parameter <Z2> must be NULL; without this constraint the template could produce questions such as "What size is the big thing left of the sphere?" which can be trivially answered from the text of the question.

The following two constraint types are supported:

  • NULL: The parameter must take the value NULL, as in the example above.
  • OUT_NEQ: The outputs of the two specified nodes must have different values when the instantiated program is run. This is used for templates like "Are there an equal number of <Z> <C> <M> <S>s and <Z2> <C2> <M2> <S2>s?" to ensure that the two question subparts refer to different sets of objects, which avoids trivial questions like "Are there an equal number of spheres and balls?".
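The two constraint types could be checked along the following lines. This is a sketch under stated assumptions: NULL is represented by `None`, an OUT_NEQ constraint is assumed to list two node indices in its "params" field, and `outputs` is assumed to hold the value each node produced when the program was run. The function name `constraints_ok` is hypothetical.

```python
def constraints_ok(constraints, values, outputs):
    """Return True if a candidate instantiation satisfies every constraint.

    values:  maps parameter names (e.g. "<Z2>") to a value, None meaning NULL
    outputs: outputs[i] is the value produced by node i of the program
    """
    for constraint in constraints:
        if constraint["type"] == "NULL":
            # Every listed parameter must have been left as NULL.
            if any(values.get(name) is not None for name in constraint["params"]):
                return False
        elif constraint["type"] == "OUT_NEQ":
            # Assumed format: "params" holds the indices of the two nodes.
            first, second = constraint["params"]
            if outputs[first] == outputs[second]:
                return False
        else:
            raise ValueError("unknown constraint type: " + constraint["type"])
    return True


null_constraint = [{"type": "NULL", "params": ["<Z2>"]}]
print(constraints_ok(null_constraint, {"<Z2>": None}, outputs=[]))   # True
print(constraints_ok(null_constraint, {"<Z2>": "big"}, outputs=[]))  # False
```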