m-ric posted an update Mar 26, 2024
𝗛𝗼𝘄 𝗱𝗼𝗲𝘀 𝗯𝗲𝗮𝗺 𝘀𝗲𝗮𝗿𝗰𝗵 𝗱𝗲𝗰𝗼𝗱𝗶𝗻𝗴 𝘄𝗼𝗿𝗸? ➡️ 𝙉𝙚𝙬 𝙫𝙞𝙨𝙪𝙖𝙡𝙞𝙯𝙖𝙩𝙞𝙤𝙣 𝙩𝙤𝙤𝙡! 👀

In decoder-type LLMs like GPT-4 or Mistral Large, the output is generated one token (= word part) at a time. That's part of why they're nicknamed "stochastic parrots": the "thinking" process only happens one step at a time, so it can seem really myopic.

𝐒𝐨 𝐡𝐨𝐰 𝐢𝐬 𝐭𝐡𝐞 𝐧𝐞𝐱𝐭 𝐭𝐨𝐤𝐞𝐧 𝐬𝐞𝐥𝐞𝐜𝐭𝐞𝐝?

📊 Given an input sentence like "𝘞𝘩𝘢𝘵 𝘪𝘴 𝘵𝘩𝘦 7𝘵𝘩 𝘍𝘪𝘣𝘰𝘯𝘢𝘤𝘤𝘪 𝘯𝘶𝘮𝘣𝘦𝘳? 𝘛𝘩𝘦 7𝘵𝘩 𝘍𝘪𝘣𝘰𝘯𝘢𝘤𝘤𝘪 𝘯𝘶𝘮𝘣𝘦𝘳", the decoder LLM generates, for each token in its vocabulary, a score that represents that token's probability of coming next.
For instance: "𝙞𝙨" gets score 0.56, and "𝙘𝙖𝙣" gets score 0.35.
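
Here's a minimal sketch of how to inspect these scores yourself with the 🤗 transformers library (using "gpt2" purely as a stand-in model, so the actual tokens and probabilities will differ from the illustrative 0.56 / 0.35 above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "What is the 7th Fibonacci number? The 7th Fibonacci number"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits  # shape: (batch, seq_len, vocab_size)

# The last position holds the scores for the *next* token;
# softmax turns them into probabilities over the whole vocabulary.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top_probs.tolist(), top_ids.tolist()):
    print(f"{tokenizer.decode([token_id])!r}: {prob:.3f}")
```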

🤑 𝐆𝐫𝐞𝐞𝐝𝐲 𝐝𝐞𝐜𝐨𝐝𝐢𝐧𝐠 is the naive option where you simply take the most probable next token at each step. But this creates paths that maximize very short-term rewards and may overlook better paths for the long term (like that time you played FIFA all evening and arrived unprepared for your school exam the next day).
In our example, the highest-scoring next token might be "𝙞𝙨", but this strongly biases the LLM towards giving a hasty response. By contrast, starting with "𝙘𝙖𝙣" could have been completed with "𝘣𝘦 𝘰𝘣𝘵𝘢𝘪𝘯𝘦𝘥 𝘧𝘳𝘰𝘮 𝘤𝘰𝘮𝘱𝘶𝘵𝘪𝘯𝘨 𝘱𝘳𝘦𝘷𝘪𝘰𝘶𝘴 𝘍𝘪𝘣𝘰𝘯𝘢𝘤𝘤𝘪 𝘯𝘶𝘮𝘣𝘦𝘳𝘴 𝘧𝘪𝘳𝘴𝘵", which steers the LLM towards correct reasoning!
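
To make greedy decoding concrete, here's a hand-rolled loop (continuing from the snippet above) that always takes the argmax - equivalent to calling model.generate(input_ids, do_sample=False, max_new_tokens=20):

```python
# Greedy decoding: at every step, keep only the single most probable token.
generated = input_ids
for _ in range(20):
    with torch.no_grad():
        logits = model(generated).logits
    next_id = logits[0, -1].argmax()  # short-term best choice, nothing else
    if next_id.item() == tokenizer.eos_token_id:
        break
    generated = torch.cat([generated, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(generated[0]))
```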

🗺️ 𝐁𝐞𝐚𝐦 𝐬𝐞𝐚𝐫𝐜𝐡 improves on greedy decoding by keeping several paths - called beams - alive at each step instead of just one. This lets the generation explore a much larger space and thus find better completions. In our example, both the "𝙞𝙨" and the "𝙘𝙖𝙣" completions can be tested. ✅
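
With transformers, beam search is one generate() call away - again a sketch, reusing the stand-in model from above:

```python
# Beam search: keep num_beams candidate paths alive at each step and
# return the highest-scoring finished sequences.
beam_outputs = model.generate(
    input_ids,
    num_beams=5,             # number of parallel paths ("beams") to explore
    num_return_sequences=3,  # show the top 3 finished beams
    max_new_tokens=20,
    early_stopping=True,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
)
for seq in beam_outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```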

👉 I've created a tool to let you visualize it - thank you @joaogante for your great help!
𝙏𝙧𝙮 𝙞𝙩 𝙝𝙚𝙧𝙚: m-ric/beam_search_visualizer