Image caption annotation, also known as image description annotation, attaches natural-language descriptions to images (for example, annotating an image with the sentence "a puppy running on the grass"). It establishes an association between an image's visual content and natural-language text, and the resulting image-text pairs are used to train image captioning and cross-modal understanding models.
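To make the idea of pairing an image with its description concrete, the following is a minimal sketch of how one caption annotation record might be represented and flattened into (image, text) training pairs. The class name `CaptionAnnotation`, its fields, the file name `puppy_001.jpg`, and the second caption are all hypothetical and chosen for illustration; real annotation tools and datasets define their own schemas.

```python
from dataclasses import dataclass, field


@dataclass
class CaptionAnnotation:
    """One image paired with one or more natural-language captions.

    Hypothetical schema for illustration only, not a standard format.
    """
    image_path: str
    captions: list[str] = field(default_factory=list)

    def add_caption(self, text: str) -> None:
        # Collapse runs of whitespace and strip the ends so that
        # accidental spacing from annotators does not create near-duplicates.
        cleaned = " ".join(text.split())
        if not cleaned:
            raise ValueError("caption must not be empty")
        self.captions.append(cleaned)

    def to_pairs(self) -> list[tuple[str, str]]:
        # Flatten to (image, caption) pairs, the form a captioning or
        # cross-modal model would typically consume during training.
        return [(self.image_path, caption) for caption in self.captions]


# Hypothetical image file; the first caption is the example from the text.
record = CaptionAnnotation("puppy_001.jpg")
record.add_caption("a puppy running on the grass")
record.add_caption("  a small dog   runs across a lawn ")
pairs = record.to_pairs()
```

Allowing several captions per image is a design choice made here because different annotators often describe the same image differently; a single-caption schema would work equally well for the definition above.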