TO EFFECTIVELY FUSE LANGUAGE AND VISION MODALITIES, WE CONCEPTUALLY DIVIDE A CLOSED-SET DETECTOR INTO THREE PHASES AND PROPOSE A TIGHT FUSION SOLUTION, WHICH INCLUDES A FEATURE ENHANCER, A LANGUAGE-GUIDED QUERY SELECTION, AND A CROSS-MODALITY DECODER FOR CROSS-MODALITY FUSION.
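
As a rough illustration of the second phase, language-guided query selection can be read as picking the image tokens that best match the text prompt and using them to initialize the decoder queries. The sketch below is an assumption, not a method stated in this passage: it scores each image token by its maximum dot-product similarity to any text token and keeps the top-scoring indices. The names `select_queries`, `image_feats`, `text_feats`, and `num_queries` are hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def select_queries(
    image_feats: List[Sequence[float]],
    text_feats: List[Sequence[float]],
    num_queries: int,
) -> List[int]:
    """Return indices of the image tokens most similar to any text token.

    Assumption: similarity is a plain dot product, and each image token is
    scored by its best match over all text tokens.
    """
    # Score each image token by its maximum similarity to the text tokens.
    scores = [max(dot(img, txt) for txt in text_feats) for img in image_feats]
    # Rank image-token indices by score, highest first, and keep the top ones.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:num_queries]


# Toy example: four 2-D image tokens, two 2-D text tokens.
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
text_feats = [[1.0, 0.0], [0.9, 0.1]]
selected = select_queries(image_feats, text_feats, num_queries=2)
```

In this toy setup, the selected indices would identify the image tokens passed on as queries to the cross-modality decoder; a real detector would operate on batched tensors and learned features rather than Python lists.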