My model is based on two papers:
- AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks
- A Generative Adversarial Approach for Zero-Shot Learning from Noisy Texts
The main idea is to use the pretrained encoder from the second paper while training the architecture from the first. More details can be seen in the image below.
The rationale is that the ZSL GAN takes noisy text descriptions of an unseen class as input and generates synthesized visual features for that class. Following this idea, we can train the ZSL GAN first and then use its embedder (`_netG` in the code) while training AttnGAN. This allows us to generate images from noisy descriptions. To improve results, I also added a DenseEncoder after the output of `_netG`, which made training more stable.
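For clarity, here is a minimal PyTorch sketch of this wiring. `ZSLGenerator` is only a stand-in for the pretrained `_netG`; all class names, layer sizes, embedding dimensions, and the commented checkpoint path are illustrative assumptions, not the actual code from either repository.

```python
import torch
import torch.nn as nn

class ZSLGenerator(nn.Module):
    """Stand-in for the pretrained ZSL GAN generator (_netG):
    maps a noisy text embedding plus noise to synthesized visual
    features. All dimensions here are illustrative assumptions."""
    def __init__(self, text_dim=300, noise_dim=100, feat_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + noise_dim, 1024),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(1024, feat_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, text_emb, noise):
        return self.net(torch.cat([text_emb, noise], dim=1))

class DenseEncoder(nn.Module):
    """Extra fully connected encoder placed after _netG's output,
    added to stabilize AttnGAN training (hypothetical layer sizes)."""
    def __init__(self, feat_dim=2048, cond_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 1024),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(1024, cond_dim),
        )

    def forward(self, feats):
        return self.net(feats)

# Step 1: train the ZSL GAN, then freeze its generator.
_netG = ZSLGenerator()
# _netG.load_state_dict(torch.load("zsl_netG.pth"))  # hypothetical checkpoint
_netG.eval()
for p in _netG.parameters():
    p.requires_grad = False

dense_encoder = DenseEncoder()

# Step 2: during AttnGAN training, condition the generator on features
# synthesized from the noisy description instead of the usual embedding.
text_emb = torch.randn(4, 300)  # batch of noisy text embeddings
noise = torch.randn(4, 100)
with torch.no_grad():
    visual_feats = _netG(text_emb, noise)
condition = dense_encoder(visual_feats)  # fed to AttnGAN's generator
print(condition.shape)  # torch.Size([4, 256])
```

In this sketch only the DenseEncoder receives gradients from the AttnGAN objective, while the frozen `_netG` supplies the synthesized class features, which matches the training order described above.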