Semantic leakage - quick notes
The canonical example from “Does Liking Yellow Imply Driving a School Bus?”:
Prompt: He likes yellow. He works as a
GPT4o: school bus driver.
does not seem to reproduce with current models - I tested models from OpenAI, Anthropic (Claude), Llama 3.2 (1B and 3B) and Mistral (7B), and I never got that answer. Maybe it has been trained out by now. But it still looks like a plausible failure mode.
Taken alone, this answer is not wrong. But if it is overrepresented in the LLM's answers, then it points to a bias.
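One way to check for that overrepresentation is to sample the completion many times and compare how often “school bus” shows up against a control prompt without the color. A minimal sketch, assuming the openai Python package and an API key in the environment (the control prompt and the keyword match are my own choices, not the paper's setup):

```python
# Rough overrepresentation check: sample completions repeatedly and count
# how often a keyword appears, with and without the "yellow" cue.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def completion_rate(prompt: str, keyword: str, n: int = 50) -> float:
    """Fraction of sampled completions that contain `keyword`."""
    hits = 0
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=10,
            temperature=1.0,
        )
        if keyword in resp.choices[0].message.content.lower():
            hits += 1
    return hits / n

# Compare the "yellow" prompt against a control prompt without the color.
leaky = completion_rate("He likes yellow. He works as a", "school bus")
control = completion_rate("He works as a", "school bus")
print(f"with 'yellow': {leaky:.0%}, control: {control:.0%}")
```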
The second example from that paper:
Prompt: He likes ants. His favorite food is
GPT4o: ant-covered chocolate, a unique delicacy that combines the crunch of ants with the sweetness of chocolate
It does not seem fair to qualify this as a failure - the LLM just treats all the input information as relevant to the question. This is more similar to how LLMs get distracted by irrelevant information in the math questions from “GSM Symbolic” that I analyzed in my previous post. But now I wonder whether the first example is similar to that too - I think humans would also often finish that sentence with “school bus driver”.
It seems to be caused by attention not being selective enough.
Can we increase the selectivity of attention in some way? Would that reduce semantic leakage?
How does training impact the selectivity of attention?
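One toy illustration of what “more selective” could mean (my own sketch, not something from the paper): sharpen the softmax over the attention scores with a temperature below 1, which makes the weight distribution peakier and lowers its entropy.

```python
# Toy sketch: scaled dot-product attention where a temperature < 1
# sharpens the attention weights, i.e. makes them more selective.
import torch
import torch.nn.functional as F

def attention(q, k, v, temperature: float = 1.0):
    """Scaled dot-product attention; temperature < 1 sharpens the weights."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / (d ** 0.5)
    weights = F.softmax(scores / temperature, dim=-1)
    return weights @ v, weights

q = torch.randn(1, 4, 8)   # 4 query positions, head dim 8
k = torch.randn(1, 4, 8)
v = torch.randn(1, 4, 8)

_, w_default = attention(q, k, v, temperature=1.0)
_, w_sharp = attention(q, k, v, temperature=0.5)

# Entropy per query position: lower entropy means more selective attention.
entropy = lambda w: -(w * w.clamp_min(1e-9).log()).sum(-1)
print("default:  ", entropy(w_default).mean().item())
print("sharpened:", entropy(w_sharp).mean().item())
```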
It would be interesting to compute how strongly “school bus” correlates with “yellow” in the training data and compare that with the amount of semantic leakage in LLM replies.
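We don't have the actual training data, but the kind of statistic I mean is something like pointwise mutual information between “yellow” and “school bus” over a corpus sample. A rough sketch on toy documents:

```python
# Rough sketch: document-level PMI between two terms in a text corpus.
# The documents below are toy stand-ins for a real corpus sample.
import math
import re

def pmi(docs, a: str, b: str) -> float:
    """PMI of terms a and b co-occurring in the same document."""
    n = len(docs)
    has = lambda term, doc: re.search(r"\b" + re.escape(term) + r"\b", doc.lower()) is not None
    p_a = sum(has(a, d) for d in docs) / n
    p_b = sum(has(b, d) for d in docs) / n
    p_ab = sum(has(a, d) and has(b, d) for d in docs) / n
    if p_ab == 0 or p_a == 0 or p_b == 0:
        return float("-inf")
    return math.log(p_ab / (p_a * p_b))

docs = [
    "The yellow school bus stopped at the corner.",
    "She painted the fence yellow last summer.",
    "The school bus was late again today.",
    "He drives a truck for a living.",
    "Ants covered the picnic blanket.",
]
print(pmi(docs, "yellow", "school bus"))  # positive PMI = terms co-occur more than chance
```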
How much semantic leakage would we get if we tested humans? Certainly not zero. It would also be very context dependent.