Please use this identifier to cite or link to this item: http://localhost/handle/Hannan/128745
Title: Generating image descriptions with multidirectional 2D long short-term memory
Authors: Shuohao Li;Jun Zhang;Qiang Guo;Jun Lei;Dan Tu
Year: 2017
Publisher: IET
Abstract: Connecting visual imagery with descriptive language is a challenge for computer vision and machine translation. To approach this problem, the authors propose a novel end-to-end model for generating descriptions of images. Some early works used a convolutional neural network–long short-term memory (CNN-LSTM) model to describe an image, where a CNN encodes the input image into a feature vector and an LSTM decodes the feature vector into a description. Since the two-dimensional LSTM (2DLSTM) has the property of translation invariance and can encode the relationships between regions in an image, the authors not only apply a CNN to extract global features of an image, but also use a multidirectional 2DLSTM to encode the feature maps extracted by the CNN into structural local features. Their model is trained by maximising the likelihood of the target description sentence over the training dataset. Experiments on two challenging datasets demonstrate the accuracy of the model and the fluency of the language it learns. They compare the bilingual evaluation understudy (BLEU) scores and retrieval metrics of their results with current state-of-the-art scores and show improvements on Flickr30k and MS COCO.
URI: http://localhost/handle/Hannan/128745
volume: 11
issue: 1
More Information: 104-111
Appears in Collections:2017

Files in This Item:
File	Size	Format
7826797.pdf	3.9 MB	Adobe PDF
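The abstract describes the general CNN-LSTM captioning pipeline: a CNN encodes the image into a feature vector, and an LSTM decodes that vector into a word sequence. A minimal sketch of that generic pipeline is given below, assuming random weights and illustrative dimensions (`feat_dim`, `hidden`, `vocab` are made-up names); it does not implement the authors' multidirectional 2DLSTM component.

```python
import numpy as np

# Hedged sketch of the generic CNN-LSTM captioning pipeline: an image
# feature vector (standing in for CNN output) seeds an LSTM decoder that
# greedily emits word indices. All weights are random placeholders.
rng = np.random.default_rng(0)

feat_dim, hidden, vocab, max_len = 16, 32, 10, 5  # illustrative sizes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    # One LSTM step: gates computed from [input; previous hidden state].
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c + i * g          # update cell state
    h = o * np.tanh(c)         # emit hidden state
    return h, c

# Random parameters standing in for trained weights.
W = rng.normal(scale=0.1, size=(4 * hidden, feat_dim + hidden))
b = np.zeros(4 * hidden)
W_out = rng.normal(scale=0.1, size=(vocab, hidden))   # hidden -> word scores
embed = rng.normal(scale=0.1, size=(vocab, feat_dim)) # word -> input vector

def caption(image_feature):
    # "Encode": the CNN feature vector is the decoder's first input.
    h, c = np.zeros(hidden), np.zeros(hidden)
    x, words = image_feature, []
    for _ in range(max_len):
        h, c = lstm_step(x, h, c, W, b)
        word = int(np.argmax(W_out @ h))  # greedy decoding
        words.append(word)
        x = embed[word]                   # feed the chosen word back in
    return words

tokens = caption(rng.normal(size=feat_dim))
print(tokens)
```

In the trained model the abstract describes, the decoder would instead maximise the likelihood of the reference sentence during training, and the authors additionally feed structural local features from the multidirectional 2DLSTM rather than only a global CNN vector.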