• Kogasa@programming.dev
    link
    fedilink
    arrow-up
    7
    ·
    11 months ago

    Semantic embeddings are a thing. LLMs “work with tokens” but they associate them with semantic models internally. You can externalize it via semantic embeddings so that the same semantic models can be shared between LLMs.

    • Lvxferre@lemmy.ml
      link
      fedilink
      arrow-up
      4
      ·
      edit-2
      11 months ago

      The source that I’ve linked mentions semantic embedding; so does further literature on the internet. However, the operations are still being performed with the vectors resulting from the tokens themselves, with said embedding playing a secondary role.

      This is evident for example through excerpts like

      The token embeddings map a token ID to a fixed-size vector with some semantic meaning of the tokens. These brings some interesting properties: similar tokens will have a similar embedding (in other words, calculating the cosine similarity between two embeddings will give us a good idea of how similar the tokens are).

      Emphasis mine. A similar conclusion (that the LLM is still handling the tokens, not their meaning) can be reached by analysing the hallucinations that your typical LLM bot outputs, and asking why that hallu is there.

      What I’m proposing is deeper than that. It’s to use the input tokens (i.e. morphemes) only to retrieve the sememes (units of meaning; further info here) that they’re conveying, then discard the tokens themselves, and perform the operations solely on the sememes. Then for the output you translate the sememes obtained by the transformer into morphemes=tokens again.

      I believe that this would have two big benefits:

      1. The amount of data necessary to “train” the LLM will decrease. Perhaps by orders of magnitude.
      2. A major type of hallucination will go away: self-contradiction (for example: states that A exists, then that A doesn’t exist).

      And it might be an additional layer, but the whole approach is considerably simpler than what’s being done currently - pretending that the tokens themselves have some intrinsic value, then playing whack-a-mole with situations where the token and the contextually assigned value (by the human using the LLM) differ.

      [This could even go deeper, handling a pragmatic layer beyond the tokens/morphemes and the units of meaning/sememes. It would be closer to what @[email protected] understood from my other comment, as it would then deal with the intent of the utterance.]