Welcome to P K Kelkar Library, Online Public Access Catalogue (OPAC)


Computational methods for integrating vision and language /

By: Barnard, Kobus [author.].
Material type: materialTypeLabelBookSeries: Synthesis digital library of engineering and computer science: ; Synthesis lectures on computer vision: # 7.Publisher: San Rafael, California (1537 Fourth Street, San Rafael, CA 94901 USA) : Morgan & Claypool, 2016.Description: 1 PDF (xvi, 211 pages) : illustrations.Content type: text Media type: electronic Carrier type: online resourceISBN: 9781608451135.Subject(s): Computer vision -- Mathematical models | Information visualization | Closed captioning -- Technological innovations | Keyword searching -- Technological innovations | Natural language processing (Computer science) | Multimodal user interfaces (Computer systems) | vision | language | loosely labeled data | correspondence ambiguity | auto-annotation | region labeling | multimodal translation | cross-modal disambiguation | image captioning | video captioning | affective visual attributes | aligning visual and linguistic data | auto-illustration | visual question answeringDDC classification: 006.37 Online resources: Abstract with links to resource Also available in print.
Contents:
1. Introduction -- 1.1 Redundant, complementary, and orthogonal multimodal data -- 1.1.1 Multimodal mutual information -- 1.1.2 Complementary multimodal information -- 1.2 Computational tasks -- 1.2.1 Multimodal translation -- 1.2.2 Integrating complementary multimodal data and cross-modal disambiguation -- 1.2.3 Grounding language with sensory data -- 1.3 Multimodal modeling -- 1.3.1 Discriminative methods -- 1.4 Multimodal inference, applications to computational tasks -- 1.4.1 Region labeling with a concept model -- 1.4.2 Cross-modal disambiguation, region labeling with image keywords -- 1.4.3 Cross-modal disambiguation, word sense disambiguation with images -- 1.5 Learning from redundant representations in loosely labeled multimodal data -- 1.5.1 Resolving region-label correspondence ambiguity -- 1.5.2 Data variation and semantic grouping -- 1.5.3 Simultaneously learning models and reducing correspondence ambiguity --
2. The semantics of images and associated text -- 2.1 Lessons from image search -- 2.1.1 Content-based image retrieval (CBIR) -- 2.2 Images and text as evidence about the world -- 2.3 Affective attributes of images and video -- 2.3.1 Emotion induction from images and video -- 2.3.2 Inferring emotion in people depicted in images and videos --
3. Sources of data for linking visual and linguistic information -- 3.1 WordNet for building semantic visual-linguistic data sets -- 3.2 Visual data with a single objective label -- 3.3 Visual data with a single subjective label -- 3.4 Visual data with keywords or object labels -- 3.4.1 Localized labels -- 3.4.2 Semantic segmentations with labels -- 3.5 Visual data with descriptions -- 3.6 Image data with questions and answers --
4. Extracting and representing visual information -- 4.1 Low-level features -- 4.1.1 Color -- 4.1.2 Edges -- 4.1.3 Texture -- 4.1.4 Characterizing neighborhoods using histograms of oriented gradients -- 4.2 Segmentation for low-level spatial grouping -- 4.3 Representation of regions and patches -- 4.3.1 Visual word representations -- 4.4 Mid-level representations for images -- 4.4.1 Artificial neural network representations -- 4.5 Object category recognition and detection --
5. Text and speech processing -- 5.1 Text associated with audiovisual data -- 5.2 Text embedded within visual data -- 5.3 Basic natural language processing -- 5.4 Word sense disambiguation -- 5.5 Online lexical resources for vision and language integration -- 5.5.1 WordNet -- 5.5.2 Representing words by vectors --
6. Modeling images and keywords -- 6.1 Scene semantics: keywords for entire images -- 6.2 Localized semantics: keywords for regions -- 6.3 Generative models with independent multi-modal concepts -- 6.3.1 Notational preliminaries -- 6.3.2 Semantic concepts with multi-modal evidence -- 6.3.3 Joint modeling of images and keywords (PWRM and IRCM) -- 6.3.4 Inferring image keywords and region labels -- 6.3.5 Learning multi-modal concept models from loosely labeled data -- 6.3.6 Evaluation of region labeling and image annotation -- 6.4 Translation models -- 6.4.1 Notational preliminaries (continuing 6.3.1) -- 6.4.2 A simple region translation model (RTM) -- 6.4.3 Visual translation models for broadcast video -- 6.4.4 A word translation model (WTM) -- 6.4.5 Supervised multiclass labeling (SML) -- 6.4.6 Discriminative models for translation -- 6.5 Image clustering and interdependencies among concepts -- 6.5.1 Region concepts with image categories (CIRCM) -- 6.5.2 Latent Dirichlet allocation (LDA) -- 6.5.3 Multiclass supervised LDA (sLDA) with annotations -- 6.6 Segmentation, region grouping, and spatial context -- 6.6.1 Notational preliminaries (continuing 6.3.1 and 6.4.1) -- 6.6.2 Random fields for representing image semantics -- 6.6.3 Joint learning of translation and spatial relationships -- 6.6.4 Multistage learning and inference -- 6.6.5 Dense CRFs for general context -- 6.6.6 Dense CRFs for multiple pairwise relationships -- 6.6.7 Multiscale CRF (mCRF) -- 6.6.8 Relative location prior with CRFs -- 6.6.9 Encoding spatial patterns into the unary potentials with texture-layout features -- 6.6.10 Discriminative region labeling with spatial and scene information -- 6.6.11 Holistic integration of appearance, object detection, and scene type -- 6.7 Image annotation without localization -- 6.7.1 Nonparametric generative models -- 6.7.2 Label propagation --
7. Beyond simple nouns -- 7.1 Reasoning with proper nouns -- 7.1.1 Names and faces in the news -- 7.1.2 Linking action verbs to pose: who is doing what? -- 7.1.3 Learning structured appearance for named objects -- 7.2 Learning and using adjectives and attributes -- 7.2.1 Learning visual attributes for color names -- 7.2.2 Learning complex visual attributes for specific domains -- 7.2.3 Inferring emotional attributes for images -- 7.2.4 Inferring emotional attributes for video clips -- 7.2.5 Sentiment analysis in consumer photographs and videos -- 7.2.6 Extracting aesthetic attributes for images -- 7.2.7 Addressing subjectivity -- 7.3 Noun-noun relationships: spatial prepositions and comparative adjectives -- 7.3.1 Learning about preposition use in natural language -- 7.4 Linking visual data to verbs -- 7.5 Vision helping language understanding -- 7.5.1 Using vision to improve word sense disambiguation -- 7.5.2 Using vision to improve coreference resolution -- 7.5.3 Discovering visual-semantic senses -- 7.6 Using associated text to improve visual understanding -- 7.6.1 Using captions to improve semantic image parsing (cardinality and prepositions) -- 7.7 Using world knowledge from text sources for visual understanding -- 7.7.1 Seeing what cannot be seen? -- 7.7.2 World knowledge for training large-scale fine-grained visual models --
8. Sequential structure -- 8.1 Automated image and video captioning -- 8.1.1 Captioning by reusing existing sentences and fragments -- 8.1.2 Captioning using templates, schemas, or simple grammars -- 8.1.3 Captioning video using storyline models -- 8.1.4 Captioning with learned sentence generators -- 8.2 Aligning sentences with images and video -- 8.3 Automatic illustration of text documents -- 8.4 Visual question answering --
A. Additional definitions and derivations -- Basic definitions from probability and information theory -- Additional considerations for multimodal evidence for a concept -- Loosely labeled vs. strongly labeled data -- Pedantic derivation of equation (6.13) -- Derivation of the EM equations for the image region concept model (IRCM) -- Bibliography -- Author's biography.
Abstract: Modeling data from visual and linguistic modalities together creates opportunities for better understanding of both, and supports many useful applications. Examples of dual visual-linguistic data include images with keywords, video with narrative, and figures in documents. We consider two key task-driven themes: translating from one modality to another (e.g., inferring annotations for images) and understanding the data using all modalities, where one modality can help disambiguate information in another. The multiple modalities can be either essentially semantically redundant (e.g., keywords provided by a person looking at the image) or largely complementary (e.g., metadata such as the camera used). Redundancy and complementarity are two endpoints of a scale, and we observe that good performance on translation requires some redundancy, and that joint inference is most useful where some information is complementary. Computational methods discussed are broadly organized into ones for simple keywords, ones going beyond keywords toward natural language, and ones considering sequential aspects of natural language. Methods for keywords are further organized based on localization of semantics, going from words about the scene taken as a whole, to words that apply to specific parts of the scene, to relationships between parts. Methods going beyond keywords are organized by the linguistic roles that are learned, exploited, or generated. These include proper nouns, adjectives, spatial and comparative prepositions, and verbs. More recent developments in dealing with sequential structure include automated captioning of scenes and video, alignment of video and text, and automated answering of questions about scenes depicted in images.
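
The redundancy-complementarity scale described in the abstract (and treated in Section 1.1.1, "Multimodal mutual information") can be made concrete with a small numerical sketch. The Python snippet below is illustrative only and not taken from the book: the keyword "tiger", the region and flash labels, and all co-occurrence counts are hypothetical, chosen so that one keyword/visual pairing is strongly redundant (high mutual information) and one keyword/metadata pairing is essentially complementary (mutual information near zero).

    # Illustrative sketch (not from the book): estimate the mutual information
    # between a text keyword and a visual or metadata observation. All labels
    # and counts below are hypothetical.
    from collections import Counter
    from math import log2

    def mutual_information(pairs):
        """Estimate I(X; Y) in bits from a list of (x, y) observations."""
        n = len(pairs)
        joint = Counter(pairs)                      # counts of (x, y) pairs
        px = Counter(x for x, _ in pairs)           # marginal counts of x
        py = Counter(y for _, y in pairs)           # marginal counts of y
        return sum(
            (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
            for (x, y), c in joint.items()
        )

    # Redundant modalities: the keyword "tiger" nearly always co-occurs with a
    # tiger-like image region, so one modality largely predicts the other.
    redundant = ([("tiger", "tiger_region")] * 45 + [("none", "background")] * 50
                 + [("tiger", "background")] * 3 + [("none", "tiger_region")] * 2)

    # Complementary modalities: camera metadata (flash on/off) is roughly
    # independent of the keyword, so the mutual information is near zero.
    complementary = ([("tiger", "flash_on")] * 12 + [("tiger", "flash_off")] * 13
                     + [("none", "flash_on")] * 13 + [("none", "flash_off")] * 12)

    print(f"redundant pairing:     {mutual_information(redundant):.3f} bits")
    print(f"complementary pairing: {mutual_information(complementary):.3f} bits")

On these made-up counts the redundant pairing comes out at roughly 0.7 bits while the complementary pairing is close to 0.001 bits, reflecting the abstract's observation that translation-style tasks need some redundancy, whereas complementary information is what makes joint inference worthwhile.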
Item type: E books
Current location: PK Kelkar Library, IIT Kanpur
Status: Available
Barcode: EBKE709
Total holds: 0

Mode of access: World Wide Web.

System requirements: Adobe Acrobat Reader.

Part of: Synthesis digital library of engineering and computer science.

Includes bibliographical references (pages 155-210).

Abstract freely available; full-text restricted to subscribers or individual document purchasers.

Compendex | INSPEC | Google Scholar | Google Book Search

Title from PDF title page (viewed on May 13, 2016).
