Building Vision Language Models As Knowledge