Alt Text Generation - For Images

Image data is everywhere. It has become one of the hallmarks of digital representation and data exchange over the internet, and its importance is easy to see in everything from memes to infographics. But with the prevalence of image data comes the need to identify and analyze images during data handling.

Alternative texts (alt texts) identify an image in a far more general way than scouring through its metadata. They can also be used for accessibility and curation.

The Philippines




Content and Data Management Company



Our client is a well-known publishing company with worldwide operations. It covers many fields of literature, especially academic and fiction. We were tasked with creating a model to extract and generate alt texts for its textbooks.

These alt texts are important for the variety of services our client provides within its textbooks. Most importantly, they can facilitate read-along tools that help people with reading disorders and other disabilities.



The model had to tackle several challenges, listed below:

  • Image Pre-processing - For a neural network to process images, they have to be pre-processed and made ready for easy data extraction. This usually means organizing and orienting the raw data and selecting the methodologies for Convolutional Neural Network (CNN) analysis.

  • Image Clarity and Cohesion - Images with better contrast, resolution, and clarity tend to yield better results after analysis. Since most illustrations are artist renders and stock images, this problem can be effectively rectified.

  • Coherent Caption Generation - Our model has to tackle the problem of multicontextuality and generate captions seamlessly with little human intervention. One image can mean many things, and this can be a major hurdle for the machine in understanding what the image actually shows. With sufficient training and good scoring, this can be prevented.
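
As an illustration of the clarity point above, here is a minimal sketch of one common pre-processing step, linear contrast stretching, written in plain Python. It is not the pipeline used in this project; the function name `stretch_contrast` and the tiny sample image are hypothetical and exist only for the example.

```python
def stretch_contrast(pixels, out_min=0, out_max=255):
    """Linearly rescale grayscale values so they span [out_min, out_max]."""
    flat = [value for row in pixels for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A uniform image has no contrast to stretch; return an unchanged copy.
        return [list(row) for row in pixels]
    scale = (out_max - out_min) / (hi - lo)
    return [[round(out_min + (value - lo) * scale) for value in row]
            for row in pixels]


# Hypothetical 2x2 grayscale image whose values sit in a narrow band.
low_contrast = [[100, 110], [120, 150]]
stretched = stretch_contrast(low_contrast)
print(stretched)
```

After stretching, the darkest pixel maps to 0 and the brightest to 255, which gives a downstream network a wider range of values to work with.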



With the project requirements and our own research into these challenges in mind, we arrived at the following solution:

  • The images are collected and organized into a coherent dataset for further processing. The content is preprocessed and refined to make data extraction as easy as possible.

  • A Convolutional Neural Network encoder module is used as the framework for encoding an image into its features - i.e., identifying the crucial elements of an image. These elements are the bedrock of our model. Our CNN module contains pre-configured encoders, but it can also extract information and develop encoders in real time.

  • A word embedding model, based on a natural language processing technique that groups together similar words and terms, is applied next. These groupings are extracted when the information from the image is decoded, and they express the details observed by the CNN in simpler, human-parsable terms. This module is also coupled with an operation that provides context to the text, maintaining a structured result rather than a mere grouping of words. This yields a series of relational and dependent captions, which are automatically scored by preference and totality.

  • Finally, a CNN decoder model - the inverse of the first encoder model - matches the image vectors (and subsequently the images) with the most probable and accurate caption, as determined by scoring.

  • For deep interpretation of illustrations and other objects in the document, a custom OCR is used for detection instead of a pre-defined, out-of-the-box OCR method.
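
The caption-selection step described above can be sketched roughly as follows. This is an assumption-laden illustration, not the scoring method used in the project: it ranks candidate captions by length-normalized log-probability, a common heuristic, and the names `score_caption` and `best_caption` as well as the per-token probabilities are invented for the example.

```python
import math


def score_caption(token_probs):
    """Average log-probability per token (length-normalized score)."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)


def best_caption(candidates):
    """Return the caption text with the highest length-normalized score."""
    text, _ = max(candidates, key=lambda item: score_caption(item[1]))
    return text


# Hypothetical candidates: (caption, probability the decoder gave each token).
candidates = [
    ("a bar chart of rainfall by month", [0.9, 0.8, 0.9, 0.7, 0.8, 0.9, 0.8]),
    ("a picture", [0.9, 0.3]),
]
chosen = best_caption(candidates)
print(chosen)
```

Normalizing by length keeps the ranking from automatically favoring very short captions, which would otherwise win simply because they multiply fewer probabilities together.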



An undertaking of this kind rapidly produces alt texts and captions that can either be used as they are or modified. A model like this can not only tag illustrations and images but also generate the captions needed for graphs and serve various other important functions. When these images are hosted digitally, the alt texts can simply be reused with ease. This modularity is essential and could be very helpful to our client.

Alt texts also provide an extra layer of accessibility, especially for people with visual and communicative disabilities - supplying text for read-along devices, screen readers, and braille content generation. For an even deeper analysis, our model can identify and interpret the types of graphs, models, and illustrations present in various documents.
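
To show how a generated alt text might be reused once an image is hosted digitally, here is a small sketch that builds an HTML `img` tag with the standard `alt` attribute. The helper `img_tag`, the file path, and the sample caption are hypothetical; only `html.escape` from the Python standard library is a real API.

```python
from html import escape


def img_tag(src, alt_text):
    """Build an <img> tag, escaping both values for use in HTML attributes."""
    return f'<img src="{escape(src, quote=True)}" alt="{escape(alt_text, quote=True)}">'


# Hypothetical image path and generated caption.
tag = img_tag("figures/cell.png", 'Diagram of a plant cell labeled "Figure 2"')
print(tag)
```

Escaping matters here because generated captions may contain quotes or angle brackets, which would otherwise break the surrounding markup that screen readers rely on.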
