
It really can learn Mahjong: Claude 3 is far ahead of GPT-4 on the LLM-RGB benchmark

  Bazingawang · 273 days ago · 1779 views
    This topic was created 273 days ago; the information in it may have developed or changed since then.

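    Yesterday Anthropic released its latest Claude 3 models, drawing wide attention. In LLM-RGB, the open-source evaluation project from Babel.cloud, Claude 3 scored 97.6 in a single test run, well ahead of GPT-4 Turbo, making it the current leader in LLM capability.

    Response details: https://llm-rgb.babel.run/view/testId/a581e4a9-ce1e-4b2f-8f45-980889913b58

    For reference, the scores of the major models as of January 24:

    [Image: score chart of the major models]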

    Teaching Claude 3 to play Mahjong

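    One result is worth singling out. 015_simple_mahjong is one of the most complex test cases in LLM-RGB: the prompt teaches the model a simplified set of Mahjong rules, gives worked examples, and then asks the model to choose a discard in a specific situation. Models have rarely gotten this case right in past runs. Claude 3 Opus, however, gave the optimal solution 20% of the time and the second-best solution 80% of the time. This suggests its multi-step reasoning is far ahead of other models: it can quickly pick up knowledge from a limited context and then apply it. That opens up uses for Claude 3 well beyond simple customer service and text generation, in domains with longer engineering processes.

    The prompt is given in the appendix for easy testing.

    On speed: Claude 3 answers so quickly that it frequently triggered rate limits, which complicated the evaluation itself. I had to run it alongside GPT-4 Turbo to lower the request frequency. On stability: Claude 3's scores were very consistent across multiple runs, and apart from 015_simple_mahjong its answers rarely varied.

    Claude 3's better-than-expected result does not mean Anthropic has overtaken OpenAI across the board. Claude 3 is clearly stronger than GPT-4, but OpenAI may well already have GPT-5 in hand.

    Still, the arrival of Claude 3 shows that the LLM field is no longer dominated by a single player, and that there is no "core magic" only OpenAI can produce. The lead comes down more to engineering capability and resource investment. A field of competing foundation models gives application developers more choice and will inevitably bring lower prices. Seen that way, it is hard to overstate the industry value and social impact of Claude 3's success.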

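    About LLM-RGB

    LLM-RGB is a collection of test cases designed specifically to evaluate the reasoning and generation abilities of LLMs in complex scenarios. Compared with chat or simple generation, these scenarios mainly examine three things:

    1. Effective context length.
    2. Reasoning depth: producing the answer may require multi-step reasoning.
    3. Instruction compliance: the LLM must produce its response in a specific format rather than in natural language.

    Scores for other models are on the LLM-RGB project site: https://llm-rgb.babel.run/ . The open-source repository is at https://github.com/babelcloud/LLM-RGB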
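
To make the third criterion concrete, here is a minimal sketch of what a format check could look like for the DECISION block used by the 015_simple_mahjong case in the appendix. This is not LLM-RGB's actual scoring code; `parse_decision` and the returned dictionary keys are hypothetical names chosen for illustration, and the accepted field names are taken from the examples in the prompt.

```python
import re

# Field names as they appear in the DECISION blocks of the appendix prompt.
FIELDS = ("Thought", "Target Winning Pattern", "Winning Tile(s)", "Action")
PATTERNS = {"mixed-win", "straights-win", "bumps-win"}
TILE = re.compile(r"[BCD][1-9]")


def parse_decision(text):
    """Return the parsed decision, or None if the reply breaks the format.

    Hypothetical helper: an illustration of instruction compliance,
    not the project's real grader.
    """
    values = {}
    for line in text.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in FIELDS:
            values[key.strip()] = value.strip()

    # All four fields must be present.
    if set(values) != set(FIELDS):
        return None

    pattern = values["Target Winning Pattern"].lower()
    if pattern not in PATTERNS:
        return None

    # The action is either "None" (already won) or "Discard <tile>".
    action = values["Action"]
    discard = None
    if action.lower() != "none":
        match = re.fullmatch(r"[Dd]iscard\s+([BCD][1-9])", action)
        if match is None:
            return None
        discard = match.group(1)

    return {
        "pattern": pattern,
        "winning_tiles": TILE.findall(values["Winning Tile(s)"]),
        "discard": discard,
    }
```

A free-form answer such as "I would discard C1 because it is isolated" returns None here even though the move is right, which is the point of the criterion: the answer has to be machine-readable as well as correct.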

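    About Babel

    Babel is a startup working on building Agent Teams that construct complex software. LLM-RGB is the basis on which it chooses its underlying LLM (see "LLM-RGB: Systematically Evaluating How LLMs Handle Complex Problems"). Until Claude 3 appeared, GPT-4 Turbo had long held the top spot on the leaderboard.

    Appendix

    Here is the prompt for 015_simple_mahjong, for anyone who wants to test it: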

    You are a Mahjong game AI. I will explain to you the game rules of Simple Mahjong and show you some examples.
    === Simple Mahjong Rules ===
    1. Simple Mahjong is a board game with four participants.
    2. Simple Mahjong has three types of tiles, named "Dots", "Bamboo", "Character". There is no relationship between different types of tiles.
    3. Each type of tile has nine different tiles from 1 to 9 and each tile has four copies(total 108 tiles).
       - Bamboos: B1 B2 ... B9, each with four identical tiles
       - Characters: C1 C2 ... C9, each with four identical tiles
       - Dots: D1 D2 ... D9, each with four identical tiles
    4. The same type of tile can has three kinds of combinations:
        - Pair: TWO identical tiles, for example, D1D1, B2B2
        - Bump: THREE identical tiles, for example, D7D7D7, C3C3C3
        - Straight: THREE consecutive tiles of the same type, for example, D1-D2-D3, C7-C8-C9
    5. At the beginning of the game, each player has 13 random tiles in hand.
    6. The rest of the tiles face down on the table, which we call the tile wall.
    7. Players play the game clockwise.
    8. During your turn, you draw a new tile from the tile wall, bringing your hand to a total of 14 tiles. If these 14 tiles match a winning pattern, then you win. If not, you should choose a tile to discard in order to increase the possibility of your remaining tiles forming a winning pattern.
    9. Winning pattern: 
        - Straights-win: the 14 tiles are in FOUR straights and ONE pair, for example, D1-D2-D3 C2-C3-C4 D5-D6-D7 D6-D7-D8 C9C9
        - Bumps-win: the 14 tiles are in FOUR bumps and ONE pair,for example, B1B1B1 B2B2B2 C1C1C1 C6C6C6 D9D9
        - Mixed-win: the 14 tiles are mixed with bumps, straights and ONE pair, for example, B1B1B1 C1C2C3 C6-C7-C8 D4-D5-D6 D7D7
    
    === End Rules ===
    === Examples ===
    GAME INFO:
    Tiles Discarded in Previous Rounds: C1 D4 D3 B3
    Observe: Drew B6
    Current Tiles: C2C2C2 C3C3 D9D9D9 B2 B7 B8 D6 D8 B6(just drew)
    
    DECISION:
    Thought: The tiles could potentially form a Mixed-win. I should keep the 'B6' as it could form a straight B6 B7 B8. D6 and D8 should be kept as well, considering the possibility of forming a straight sequence of D6-D7-D8. However, 'B2' is isolated and should be discarded.
    Target Winning Pattern: Mixed-win
    Winning Tile(s): D7 
    Action: Discard B2
    ---
    GAME INFO:
    Tiles Discarded in Previous Rounds: B1 B1 B2 B2 B5 B6 B9 C2 C6 D5 D5 D8 D8 D9
    Observe: Drew C3
    Current Tiles: B1-B2-B3 B6B6B6 B7B7B7 C1C1 C2 C3C3(just drew)
    
    DECISION:
    Thought: My current hand of tiles is nearing a Mixed-win, with two bumps and one straight. If I discard C2, I'll have two pairs, C1C1 and C3C3. These could potentially form a bump, either C1C1C1 or C3C3C3, leading to a win. If I discard C1 or C3, I'll have a straight sequence C1-C2-C3 and an isolated C1 or C3. I would then need to wait for another C1 or C3 to form a pair for a win. Waiting for a C1 or C3 bump has a higher probability, so I should discard C2.
    Target Winning Pattern: Mixed-win
    Winning Tile(s): C1 C3
    Action: Discard C2
    ---
    GAME INFO:
    Tiles Discarded in Previous Rounds: B1 B1 B1 B2 B3 B5 B9 C2 C3 C3 D1 D1 D1 D1 D4 D4 D4
    Observe: Drew D2
    Current Tiles: B3B3B3 B6B6B6 B8B8B8 C3C3 D2 D3 D2(just drew)
    
    DECISION:
    Thought: My hand of tiles is nearing a Bumps-win. If I discard D3, I'll have two choices, C3 or D2, to form a winning pattern. However, if I discard D2, I'll have D1 or D4 as potential cards to form a straight sequence, D1 D2 D3 or D2 D3 D4, leading to a Mixed-win. Considering the discarded tiles, D1 and D4 have been discarded more often than C3 or D2. This reduces the likelihood of drawing D1 or D4 from the tile wall. Therefore, I should aim for a Bumps-win pattern and discard D3.
    Target Winning Pattern: Bumps-win
    Winning Tile(s): D2
    Action: Discard D3
    ---
    GAME INFO:
    Tiles Discarded in Previous Rounds: B1B1 B2 C7 C8 D5
    Observe: Drew D5
    Current Tiles: B3-B4-B5 B4-B5-B6 C7-C8-C9 B9B9 D2 D3 D5(just drew)
    
    DECISION:
    Thought: The tiles are close to a Straights-win pattern. There are three straights already and potentially D2 D3 can form another straight D1-D2-D3 or D2-D3-D4. Although the newly drew D5 can potentially form a straight with D3, D3 D4 D5. But waiting for D4 has lower chance than waiting for D1 or D4. Thus I should keep current tiles and discard the newly drew D5.
    Target Winning Pattern: Straights-win
    Winning Tile(s): D1 D4
    Action: Discard D5
    ---
    GAME INFO:
    Tiles Discarded in Previous Rounds: B6 B7 B8 C7 C9 D2 D2 D5 D5 D5 D8
    Observe: Drew D4
    Current Tiles: B3B3B3 B9B9B9 C7C7C7 D4D4 D5 D6 D4(just drew)
    
    DECISION:
    Thought:The tiles are Mixed-Win pattern.The newly drew D4 can form a Straights D4-D5-D6
    Target Winning Pattern: Mixed-win
    Winning Tile(s): D4(just drew)
    Action:None
    === End Examples ===
    
    GAME INFO:
    Tiles Discarded in Previous Rounds: B1 B3 C1 C1 D8 D9
    Observe: Drew B8
    Current Tiles: C5C5C5 C8C8C8 C7-C8-C9 D1-D2-D3 C1 B8(just drew)
    
    DECISION:
    

    Optimal solution

    Thought:
    Target Winning Pattern: mixed-win
    Winning Tile(s): B8
    Action: discard C1
    

    Second-best solution

    Thought:
    Target Winning Pattern: mixed-win
    Winning Tile(s): C1
    Action: discard B8
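
Why discarding C1 beats discarding B8 can be checked mechanically. The sketch below implements the win rule from the prompt (four bumps or straights plus one pair) and counts how many copies of each winning tile are still unseen. The function names `is_win`, `waits`, and `rank_discards` are hypothetical and not part of LLM-RGB. "Unseen" is an assumption here: it means not in my hand and not in the discard pile, so it is an upper bound on what remains in the tile wall, since other players may hold some copies.

```python
from collections import Counter

SUITS = "BCD"
ALL_TILES = [f"{s}{n}" for s in SUITS for n in range(1, 10)]


def can_form_sets(counts):
    """True if the remaining tiles split entirely into bumps and straights."""
    tile = next((t for t in ALL_TILES if counts[t] > 0), None)
    if tile is None:
        return True
    suit, num = tile[0], int(tile[1])
    # The lowest remaining tile must be part of a bump ...
    if counts[tile] >= 3:
        counts[tile] -= 3
        ok = can_form_sets(counts)
        counts[tile] += 3
        if ok:
            return True
    # ... or the start of a straight in the same suit.
    if num <= 7:
        run = (tile, f"{suit}{num + 1}", f"{suit}{num + 2}")
        if all(counts[t] > 0 for t in run):
            for t in run:
                counts[t] -= 1
            ok = can_form_sets(counts)
            for t in run:
                counts[t] += 1
            if ok:
                return True
    return False


def is_win(tiles):
    """True if 14 tiles form four bumps/straights plus one pair."""
    if len(tiles) != 14:
        return False
    counts = Counter(tiles)
    for t in list(counts):
        if counts[t] >= 2:
            counts[t] -= 2
            ok = can_form_sets(counts)
            counts[t] += 2
            if ok:
                return True
    return False


def waits(hand13, discarded):
    """Map each winning tile to the number of copies not yet seen."""
    seen = Counter(hand13) + Counter(discarded)
    result = {}
    for t in ALL_TILES:
        left = 4 - seen[t]
        if left > 0 and is_win(hand13 + [t]):
            result[t] = left
    return result


def rank_discards(hand14, discarded):
    """Rank each possible discard by the total unseen winning tiles it leaves."""
    ranked = []
    for tile in sorted(set(hand14)):
        hand13 = list(hand14)
        hand13.remove(tile)
        ranked.append((tile, waits(hand13, discarded + [tile])))
    ranked.sort(key=lambda item: sum(item[1].values()), reverse=True)
    return ranked


# The test question from the prompt above.
hand14 = "C5 C5 C5 C8 C8 C8 C7 C8 C9 D1 D2 D3 C1 B8".split()
discarded = "B1 B3 C1 C1 D8 D9".split()
best, second = rank_discards(hand14, discarded)[:2]
print(best)    # ('C1', {'B8': 3})
print(second)  # ('B8', {'C1': 1})
```

Both discards leave a single-tile wait for the pair, but two C1 tiles are already in the discard pile, so only one C1 is unseen against three B8. Every other discard leaves both C1 and B8 isolated and has no winning tile at all.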
    
    4 replies · 2024-03-06 13:51:54 +08:00

    #1 luckybearops · 273 days ago via iPhone

    #2 uses090 · 272 days ago via iPhone
    Fair enough, but why compare against GPT-4 Turbo rather than GPT-4?

    #3 zhaoyeye · 272 days ago via Android
    Account bans. I hadn't even said anything and got banned. No idea what that company is thinking.

    #4 Bazingawang (OP) · 272 days ago
    @uses090 Because GPT-4 Turbo is stronger.