UNIVTG: TOWARDS UNIFIED VIDEO-LANGUAGE TEMPORAL GROUNDING

MOST METHODS IN THIS DIRECTION DEVELOP TASK-SPECIFIC MODELS THAT ARE TRAINED WITH TYPE-SPECIFIC LABELS, SUCH AS MOMENT RETRIEVAL (TIME INTERVAL) AND HIGHLIGHT DETECTION (WORTHINESS CURVE), WHICH LIMITS THEIR ABILITIES TO GENERALIZE TO VARIOUS VTG TASKS AND LABELS.

MEMORY TRANSFORMER

ADDING TRAINABLE MEMORY TO SELECTIVELY STORE LOCAL AS WELL AS GLOBAL REPRESENTATIONS OF A SEQUENCE IS A PROMISING DIRECTION TO IMPROVE THE TRANSFORMER MODEL.