Semantic Projection

A hierarchy of value represents how humans assign importance or priority to elements in their lives. This is predicated on the fact that we cannot treat everything with equal significance - some things inevitably matter more to us than others. We form our behavioral patterns and life choices based on this hierarchy.

Semantic Projection uncovers the hierarchy of value in Large Language Models (LLMs) by uncovering the relative importance of different actions and outcomes.

Approach

The final hidden layer in a LLM is a representation of how that model calculates the semantic coherence of words. Words that are semantically similar like “truck” and “car” are found closer together in this semantic space, whereas dissimilar words like “truck” and “sandwich” tend to be farther apart.

Subtracting the coordinates of one word in this semantic space from another word produces a new vector that represents the shortest path between those words. This path represents a continuous line between those two words in semantic space.

For instance, if you subtract the coordinates for "small" from the coordinates for "big", you effectively draw a line in the semantic space that represents a dimension of size. The position of words along this line (determined by the dot product) represents their position on that dimension.

Application

The hierarchy of value was probed for two LLMs: DeciLM-7B and Llama-3-8B. Both of these models are top performers in their respective classes as of April 2024.

Thirty-two items from the Moral Foundations Questionnaire were encoded in each LLM. The items of this questionnaire represent a collection of common moral values. Semantic Projection was conducted by calculating the position of each item along a Desirable–Undesirable dimension.

In the Llama-3 semantic space, the distribution of items reflected the degree to which an action causes harm. For instance, the top three most undesirable actions were: Killing a human being, Someone was cruel, Someone acted unfairly.

In contrast, the DeciLM semantic space organized items in terms of their social transgressions. In this semantic space, the top three most undesirable actions were: Someone violated standards of purity and decency, Someone did something disgusting, Someone acted unfairly. Killing a human being was the 13th most undesirable action in this space.

Why this matters for AI research

A hierarchy of value determines the hierarchy of appropriate action. As AI systems become more autonomous, we must ensure that the values of these systems will foster the development of humanity on our own terms.

A system that is prone to value purity over the sanctity of life is obviously a danger to humanity. Semantic Projection can help to identify and rectify such bias in AI systems, thereby preventing unforeseen negative impacts and ensuring that AI systems are contributing positively to our society

Understanding ourselves

In LLMs that align with human values – or when convergence is found among LLM value hierarchies – we can use the resulting maps to better understand our collective value structure.

Because LLMs can represent an externalization of our collective priorities as encoded in conversation, Semantic Projection can reveal answers to our deepest moral questions. What is the highest good? What do we owe future generations? Are there limits to personal liberty?

Techniques

Determine how AI models conceptualize the world.

Uncover the motivational first principles of AI models.

Map the hierarchy of value contained within AI models.