Following what CLIP did in their repo, we also share some of our manually crafted prompts and the motivations behind them here.
classes = [
"0",
"1",
"2",
"3",
"4",
"5",
"6",
"7",
"8",
"9",
]
template = "a photo of a digital number {}"
The SUN397 dataset is a popular image recognition benchmark composed of 108,754 images in 397 categories. The images are very diverse and often contain complex scenes with several objects (more than 10 per image) and/or contextual details. The dataset is divided into 3 subsets: 50,000 for training, 5,000 for validation and 53,754 for testing. See here for the full list of classes.
template = "a photo of a {}"
The PatchCamelyon (PCam) benchmark is a new and challenging image classification dataset. It consists of 327,680 color images (96 x 96 px) extracted from histopathologic scans of lymph node sections. Each image is annotated with a binary label indicating the presence of metastatic tissue. PCam provides a new benchmark for machine learning models: bigger than CIFAR10, smaller than ImageNet, and trainable on a single GPU.
classes = [
"normal lymph node tissue",
"lymph node metastasis"
]
template = "a photo of {}"
The RESISC45 dataset contains 31,500 images in 45 classes, with 700 images per class. The dataset is divided into 3 subsets: 25,000 for training, 1,000 for validation and 5,500 for testing. The images are in high resolution (256x256 pixels) in .tif format. The dataset is a subset of the UC Merced Land Use Dataset.
classes = [
"airplane",
"airport",
"baseball_diamond",
"basketball_court",
"beach",
"bridge",
"chaparral",
"church",
"circular_farmland",
"cloud",
"commercial_area",
"dense_residential",
"desert",
"forest",
"freeway",
"golf_course",
"ground_track_field",
"harbor",
"industrial_area",
"intersection",
"island",
"lake",
"meadow",
"medium_residential",
"mobile_home_park",
"mountain",
"overpass",
"palace",
"parking_lot",
"railway",
"railway_station",
"rectangular_farmland",
"river",
"roundabout",
"runway",
"sea_ice",
"ship",
"snowberg",
"sparse_residential",
"stadium",
"storage_tank",
"tennis_court",
"terrace",
"thermal_power_station",
"wetland"
]
template = "satellite imagery of {}"
The Dmlab dataset consists of 360x480 color images in 6 classes. The classes are the combinations {close, far, very far} x {positive reward, negative reward}.
classes = [
"a close object with positive reward",
"a close object with negative reward",
"a far object with positive reward",
"a far object with negative reward",
"a very far object with positive reward",
"a very far object with negative reward"
]
template = "a photo of {}"
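Since the six Dmlab classes are a Cartesian product of distance and reward, the list can also be generated instead of typed by hand, which avoids typos and missing commas. The snippet below is only a sketch: the names `distances`, `rewards`, `dmlab_classes` and `dmlab_prompts` are illustrative and not part of the original code.

```python
from itertools import product

# Illustrative names; the two factors come from the dataset description.
distances = ["close", "far", "very far"]
rewards = ["positive", "negative"]

# product() iterates the last factor fastest, which reproduces the
# hand-written order: close/positive, close/negative, far/positive, ...
dmlab_classes = [
    f"a {distance} object with {reward} reward"
    for distance, reward in product(distances, rewards)
]

template = "a photo of {}"
dmlab_prompts = [template.format(name) for name in dmlab_classes]

for prompt in dmlab_prompts:
    print(prompt)
```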
The KITTI dataset contains 7,481 training images annotated with 3D bounding boxes. A full description of the annotations can be found in the readme of the object development kit on the KITTI homepage.
classes = [
"a car on my left or right side.",
"a car nearby.",
"a car in the distance.",
"no car."
]
template = "a photo with {}"
Created as part of the VTAB benchmark. The task is to count the number of objects in the image.
classes = [
"three things",
"four things",
"five things",
"six things",
"seven things",
"eight things",
"nine things",
"ten things"
]
template = "there are {} objects in the image"
Created as part of the VTAB benchmark. The task is to predict the distance to the closest object in the image.
classes = [
"0-8.0M",
"8.0-8.5M",
"8.5-9.0M",
"9.0-9.5M",
"9.5-10.0M",
"10.0-100M",
]
template = "the closest object is {} away from the camera"
classes = [
"0",
"20",
"40",
"60",
"80",
"100",
"120",
"140",
"160",
"180",
"200",
"220",
"240",
"260",
"280",
"300",
"320",
"340",
]
template = "the camera azimuth is {} degrees"
classes = [
"30",
"35",
"40",
"45",
"50",
"55",
"60",
"65",
"70",
]
template = "the camera elevation is {} degrees"
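The azimuth and elevation class lists above are evenly spaced (0 to 340 in steps of 20, and 30 to 70 in steps of 5), so they can be produced with `range` rather than written out. This is a minimal sketch; the variable names are illustrative and do not come from the original code.

```python
# Azimuth bins: 0, 20, ..., 340 degrees (18 classes).
azimuth_classes = [str(angle) for angle in range(0, 360, 20)]

# Elevation bins: 30, 35, ..., 70 degrees (9 classes).
# The stop value 75 is exclusive, so 70 is the last entry.
elevation_classes = [str(angle) for angle in range(30, 75, 5)]

azimuth_template = "the camera azimuth is {} degrees"
elevation_template = "the camera elevation is {} degrees"

azimuth_prompts = [azimuth_template.format(a) for a in azimuth_classes]
elevation_prompts = [elevation_template.format(e) for e in elevation_classes]
```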
_DATASET_TEMPLATE = {
"vtab-oxford_flowers": "a photo of a {}",
"vtab-caltech101": "a photo of a {}",
"vtab-cifar100": "a photo of a {}",
"vtab-dtd": "a photo of a {}",
"vtab-eurosat": "a photo of a {}",
"vtab-oxford_pet": "a photo of a {}",
"vtab-pcam": "a photo of a {}",
"vtab-svhncropped": "a photo of a digital number {}",
"vtab-sun397": "a photo of a {}",
"vtab-clevr_count": "there are {} objects in the image",
"vtab-clevr_distance": "the closest object is {} away from the camera",
"vtab-dmlab": "a photo of {}",
"vtab-kitti": "a photo with {}",
"vtab-smallnorb_azimuth": "the camera azimuth is {} degrees",
"vtab-smallnorb_elevation": "the camera elevation is {} degrees",
"vtab-dSprites_location": "a photo of a {}",
"vtab-dSprites_orientation": "a photo of a {}",
"vtab-resisc45": "satellite imagery of {}",
}
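To illustrate how a template and a class list combine into text prompts, here is a small sketch. The `build_prompts` helper and the two-entry `DATASET_TEMPLATE` table are hypothetical stand-ins written for this example (the entries are copied from the table above); they are not the repository's actual API.

```python
# Hypothetical two-entry copy of the template table, for illustration only.
DATASET_TEMPLATE = {
    "vtab-svhncropped": "a photo of a digital number {}",
    "vtab-resisc45": "satellite imagery of {}",
}


def build_prompts(dataset, classes):
    """Fill the dataset's template with each class name (hypothetical helper)."""
    template = DATASET_TEMPLATE[dataset]
    return [template.format(name) for name in classes]


# SVHN uses the digits 0-9 as class names.
svhn_prompts = build_prompts("vtab-svhncropped", [str(d) for d in range(10)])

# Two RESISC45 class names, used exactly as listed.
resisc_prompts = build_prompts("vtab-resisc45", ["airplane", "sea_ice"])

print(svhn_prompts[3])
print(resisc_prompts)
```

Each resulting string would then be passed to the text encoder, one prompt per class.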