How Proper Names Behave in Text Embedding Space

etoud 7 hours ago

I was debugging a RAG system and noticed that “semantic” dense retrievers were oddly good at author names, even though hybrid retrieval clearly worked better overall. This post builds a small diagnostic around synthetic (author, topic) queries and shows that, in embedding space, proper names carry about half as much separation power as the topic. I then systematically “break” the names (masks, gibberish IDs, small edit-distance corruptions, formatting and layout changes) to see what survives. Most of the signal turns out to come from surface form and exact-match bias rather than any deep notion of identity.
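To make the shape of that diagnostic concrete, here is a minimal sketch in Python. It is not the code from the post. Everything in it is an assumption made for illustration: the author names and topics are made up, the query template is hypothetical, and toy_embed is a hashed character-trigram stand-in so the example runs without a model. The idea is to embed every (author, topic) query, compare the mean cosine similarity of pairs that share only the author or only the topic against pairs that share neither, and then repeat with the names rewritten.

```python
import hashlib
import itertools
import math
import random


def toy_embed(text, dim=1024):
    """Stand-in embedder (assumption): hashed character trigrams, L2-normalized."""
    vec = [0.0] * dim
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        digest = hashlib.md5(padded[i:i + 3].encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    # Inputs are already unit-length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


def corrupt_name(name, mode, rng):
    """Hypothetical name corruptions, loosely mirroring the ones listed above."""
    if mode == "mask":
        return "[AUTHOR]"
    if mode == "gibberish":
        # Stable per-author ID with no lexical overlap with the real name.
        return "ID-" + hashlib.md5(name.encode("utf-8")).hexdigest()[:8]
    if mode == "typo":
        # Delete one character (edit distance 1).
        i = rng.randrange(len(name))
        return name[:i] + name[i + 1:]
    if mode == "format":
        # "First Last" -> "LAST, F."
        first, last = name.split(" ", 1)
        return f"{last.upper()}, {first[0]}."
    raise ValueError(f"unknown mode: {mode}")


def separation(embed, authors, topics, rewrite=lambda n: n,
               template="papers by {author} on {topic}"):
    """Mean similarity gain from sharing the author or the topic,
    relative to query pairs that share neither."""
    vecs = {(a, t): embed(template.format(author=rewrite(a), topic=t))
            for a in authors for t in topics}
    buckets = {"author": [], "topic": [], "neither": []}
    for (k1, v1), (k2, v2) in itertools.combinations(vecs.items(), 2):
        if k1[0] == k2[0]:
            key = "author"      # same author, different topic
        elif k1[1] == k2[1]:
            key = "topic"       # same topic, different author
        else:
            key = "neither"
        buckets[key].append(cosine(v1, v2))
    mean = lambda xs: sum(xs) / len(xs)
    base = mean(buckets["neither"])
    return {"author": mean(buckets["author"]) - base,
            "topic": mean(buckets["topic"]) - base}


if __name__ == "__main__":
    rng = random.Random(0)
    authors = ["Alice Moreau", "Bogdan Ilic", "Chen Wei"]  # made-up names
    topics = ["graph sparsification", "protein folding", "monetary policy"]
    for mode in [None, "mask", "gibberish", "typo", "format"]:
        # Corrupt each author once so the same author keeps the same string.
        mapping = {a: a if mode is None else corrupt_name(a, mode, rng)
                   for a in authors}
        print(mode, separation(toy_embed, authors, topics, rewrite=mapping.get))
```

To run this against a real retriever, swap toy_embed for any function that maps a string to a unit-length vector. The trigram stand-in is purely lexical by construction, so its numbers say nothing about how a dense model behaves. One built-in sanity check: with every name replaced by the same mask, the author gap should drop to zero, because same-author and different-author pairs become indistinguishable.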