Building Vision Language Models On Solid Foundations With Masked Distillation