Automated evaluation of text-to-image (T2I) generative models using description logic. The main T2I models to be evaluated are Stable Diffusion V1.4 and Stable Diffusion V2.1.
![image](https://private-user-images.githubusercontent.com/77603231/366168004-f2a19458-812c-4365-bc06-89549a12cb05.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkxNDk1ODksIm5iZiI6MTczOTE0OTI4OSwicGF0aCI6Ii83NzYwMzIzMS8zNjYxNjgwMDQtZjJhMTk0NTgtODEyYy00MzY1LWJjMDYtODk1NDlhMTJjYjA1LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEwVDAxMDEyOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTIxYmY1ZDhhNjk5NWY4MjFiYjZlYzEwODNlMmFjMDEwOTlkMzczMGU1ZjlhMzYwZmE0OTYwZDk1OTU4MzlhMWEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.lwsd53cspKGvb2A504-IkiNNqAWmwaSflie-X0bwJ9c)
There are multiple evaluation methods. Evaluation can be automated in two parts:
- Creating a pipeline to generate a diverse set of prompts
- Designing the evaluation procedure to check generated images
There are a few challenges associated with this task:
- Bias within the evaluation data (for example, an apple is always associated with red and green colors)
- Can we create a better evaluation dataset?
- Other kinds of bias: an apple is always evaluated on the basis of color, but not size. Can Stable Diffusion generate a big apple the size of an elephant?
- Hallucination: if we ask the model to generate “A”, it generates “A + B” instead.
- How can such hallucinations be detected?
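One possible way to frame the hallucination check is as a set comparison between the objects named in the prompt and the objects found in the generated image. The sketch below is only illustrative: it assumes that some object detector or captioning model (not specified in this README) has already produced a list of labels for the image, and the function names `find_hallucinations` and `find_omissions` are hypothetical.

```python
def find_hallucinations(requested, detected):
    """Return objects present in the image that the prompt never asked for."""
    return sorted(set(detected) - set(requested))


def find_omissions(requested, detected):
    """Return objects the prompt asked for that are missing from the image."""
    return sorted(set(requested) - set(detected))


# Hypothetical example: the prompt asked for "A", the image contains "A + B".
requested_objects = ["apple"]
detected_objects = ["apple", "table"]  # assumed output of an external detector

extra = find_hallucinations(requested_objects, detected_objects)
missing = find_omissions(requested_objects, detected_objects)
print("hallucinated:", extra)
print("missing:", missing)
```

A set difference only catches extra or missing object classes; attribute errors (wrong color or size) would need a separate check.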
Prompt generation will take the following format and expand on it to form more complicated prompts:
C = Color = {Red, Green, Black}
D = Fruit = {Banana, Apple}
F = Furniture = {Chair, Table}
R = Relation = {“on top of”, “and”}
![image](https://private-user-images.githubusercontent.com/77603231/366169171-6ce5cb7c-be0f-495b-9f79-253c5d2c1b9a.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkxNDk1ODksIm5iZiI6MTczOTE0OTI4OSwicGF0aCI6Ii83NzYwMzIzMS8zNjYxNjkxNzEtNmNlNWNiN2MtYmUwZi00OTViLTlmNzktMjUzYzVkMmMxYjlhLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEwVDAxMDEyOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTkwODQ0MDg0MjhkNDRkZjJhOGU0ZGRkYWM4MTE0YjVlMGJlNTM4ZDQ4NDI0OGIyZDZhOWQ5MzExNzU2ZDUyZmEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.f8zN-M7TkvZAfN2onX4Jdz7QX14UPKucWoB5sHkNmgM)
C union D = {“red banana”, “black apple”}
R((C union D), F) = {“black apple on top of chair”}
R((C union D), (C union D)) = {“black apple and red banana”}
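The composition rules above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual pipeline: it reads "C union D" as pairing every color with every fruit, and the helper names `colored_fruits` and `relate` are hypothetical.

```python
from itertools import product

# Concept sets copied from the definitions above.
COLORS = ["red", "green", "black"]
FRUITS = ["banana", "apple"]
FURNITURE = ["chair", "table"]
RELATIONS = ["on top of", "and"]


def colored_fruits():
    """Pair every color with every fruit, e.g. 'black apple'."""
    return [f"{color} {fruit}" for color, fruit in product(COLORS, FRUITS)]


def relate(subjects, objects, relations=RELATIONS):
    """Build '<subject> <relation> <object>' prompts, skipping identical pairs."""
    return [
        f"{s} {r} {o}"
        for s, r, o in product(subjects, relations, objects)
        if s != o
    ]


cd = colored_fruits()                       # "C union D" in the notation above
fruit_on_furniture = relate(cd, FURNITURE)  # R((C union D), F)
fruit_with_fruit = relate(cd, cd)           # R((C union D), (C union D))

print(len(cd), len(fruit_on_furniture), len(fruit_with_fruit))
```

Enumerating the full product makes it easy to include unusual combinations such as "black apple", which is the kind of prompt a biased evaluation set would tend to miss.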
| Estimated Duration | Tasks |
|---|---|
| 2 weeks | Learning description logics |
| | Playing with Stable Diffusion (and understanding where it is failing) |
| | Reading and analyzing the existing T2I evaluation strategies: DALL-Eval and HRS-Benchmark |
| 2 weeks | Defining the description logic rules (i.e., knowledge graph) |
| | Creating a small diverse set of prompts using automated strategies |
| | Evaluating several T2I models |
| 3 weeks | Scaling the description logic rules |
| | Performing automated evaluations of T2I models |
| 1 week | Summarizing and report writing |
Check out our detailed report for further details - here