Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing
Generative Negative Text Replay for Continual Vision-Language Pretraining
Video Graph Transformer for Video Question Answering
Trace Controlled Text to Image Generation
Video Question Answering with Iterative Video-Text Co-Tokenization
Rethinking Data Augmentation for Robust Visual Question Answering
Explicit Image Caption Editing
Can Shuffling Video Benefit Temporal Bias Problem: A Novel Training Framework for Temporal Grounding
Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly
GRIT: Faster and Better Image Captioning Transformer Using Dual Visual Features
Selective Query-Guided Debiasing for Video Corpus Moment Retrieval
Spatial and Visual Perspective-Taking via View Rotation and Relation Reasoning for Embodied Reference Understanding
Object-Centric Unsupervised Image Captioning
Contrastive Vision-Language Pre-training with Limited Resources
Learning Linguistic Association towards Efficient Text-Video Retrieval
ASSISTER: Assistive Navigation via Conditional Instruction Generation
X-DETR: A Versatile Architecture for Instance-Wise Vision-Language Tasks
Learning Disentanglement with Decoupled Labels for Vision-Language Navigation
Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input
Word-Level Fine-Grained Story Visualization
Unifying Event Detection and Captioning as Sequence Generation via Pre-training
Multimodal Transformer with Variable-Length Memory for Vision-and-Language Navigation
Fine-Grained Visual Entailment
Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds
New Datasets and Models for Contextual Reasoning in Visual Dialog
VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection
Classification-Regression for Chart Comprehension
AssistQ: Affordance-Centric Question-Driven Task Completion for Egocentric Assistant
FindIt: Generalized Localization with Natural Language Queries
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
Scaling Open-Vocabulary Image Segmentation with Image-Level Labels
The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning
Speaker-Adaptive Lip Reading with User-Dependent Padding
TISE: Bag of Metrics for Text-to-Image Synthesis Evaluation
SemAug: Semantically Meaningful Image Augmentations for Object Detection through Language Grounding
Referring Object Manipulation of Natural Images with Conditional Classifier-Free Guidance
NewsStories: Illustrating Articles with Visual Summaries
Webly Supervised Concept Expansion for General Purpose Vision Models
FedVLN: Privacy-Preserving Federated Vision-and-Language Navigation
CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval
Language-Driven Artistic Style Transfer
Single-Stream Multi-level Alignment for Vision-Language Pretraining
