MTG Bench: Testing how well LLMs can play Magic

CallumFerg • about 13 hours ago • 22 comments • Article on mtgautodeck.com

Discussion (22 Comments) • Original on Hacker News

derac•about 2 hours ago

I think running them against each other with a rules engine would be more interesting. Count up illegal moves and wins/unfinished games. I think llm grading is too unreliable.
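A minimal sketch of the tally suggested here, assuming each game against a rules engine yields a winner (or none if it never finished) and a per-model count of rejected moves. `GameResult`, `summarize`, and the model names are hypothetical and not taken from the benchmark.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GameResult:
    """One engine-refereed game (hypothetical record shape)."""
    winner: Optional[str]  # None means the game never finished
    illegal_moves: Dict[str, int] = field(default_factory=dict)


def summarize(results: List[GameResult]) -> dict:
    """Count wins, unfinished games, and illegal moves per model."""
    wins: Counter = Counter()
    illegal: Counter = Counter()
    unfinished = 0
    for result in results:
        if result.winner is None:
            unfinished += 1
        else:
            wins[result.winner] += 1
        # Counter.update adds the counts from a mapping.
        illegal.update(result.illegal_moves)
    return {
        "wins": dict(wins),
        "illegal_moves": dict(illegal),
        "unfinished": unfinished,
    }


games = [
    GameResult("model_a", {"model_a": 0, "model_b": 2}),
    GameResult("model_b", {"model_a": 1, "model_b": 0}),
    GameResult(None, {"model_a": 3, "model_b": 1}),
]
summary = summarize(games)
```

Every number here comes from the engine's verdicts, so no LLM grader is involved.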

alasdair_•41 minutes ago

I wrote a rules engine in Rust, along with a system that combines reinforcement learning with MCTS to play decks against each other. It can handle aggro decks well enough, but complex combo decks like Amulet Titan are tough to get working without expert demos or reward hacking.

josh_p•about 4 hours ago

I know the author specifically did not use a rules engine in their simulation because of uncertainty on how it would affect it.

I do still wonder if adapting something like card forge for llm use would result in engaging gameplay with an llm.

https://github.com/Card-Forge/forge

CallumFerg•about 4 hours ago

I actually considered using card forge when I started this. I mostly didn't end up using it because of how much more work it would have been.

But also, with a rules engine, you have to manually go through every step and pass priority after every action.

I think it makes more sense to let an LLM play magic like a person would. On early turns it is acceptable to say "I play a land and pass" without going through every phase. And you can say "I tap all my land and play this card" without having to use a tool call and agent turn for every land tap.

Also card forge would not let you goldfish a deck. You must have opponents.
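As a rough illustration of the batching idea, here is a sketch in which the model submits one shorthand turn and a thin layer expands it into the atomic steps a rules engine would expect. The action types, step strings, and `expand_turn` are made up for this example and do not correspond to card forge or to the benchmark's actual code.

```python
from typing import Dict, List


def expand_turn(shorthand: List[Dict[str, str]], untapped_lands: List[str]) -> List[str]:
    """Expand one batched turn into atomic engine steps (hypothetical step names)."""
    steps: List[str] = []
    for action in shorthand:
        kind = action["type"]
        if kind == "play_land":
            # Playing a land does not use the stack, so no priority pass follows.
            steps.append(f"play land {action['card']}")
        elif kind == "tap_all_lands":
            # "I tap all my land" becomes one step per untapped land.
            steps.extend(f"tap {land}" for land in untapped_lands)
        elif kind == "cast":
            steps.append(f"cast {action['card']}")
            steps.append("pass priority")
        else:
            raise ValueError(f"unknown action type: {kind}")
    return steps


# One message from the model instead of one tool call per land tap.
turn = [{"type": "tap_all_lands"}, {"type": "cast", "card": "Grizzly Bears"}]
steps = expand_turn(turn, ["Forest", "Forest"])
```

The model sends a single message, while the engine still sees every individual step.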

fc417fc802•about 2 hours ago

Those things sound less like general problems with rules engines and more like deficiencies of card forge IMO.

veqq•18 minutes ago

MTG: Arena uses CLIPS as its rules engine (an s-expr expert system based on the Rete algorithm). An acquaintance wrote a course on it: https://ryjo.codes/tour-of-clips.html and even a declarative chat server: https://ryjo.codes/articles/a-simple-tcp-server-written-in-g...

    (defrule connection
      (connection ?id)
      =>
      (println "User " ?id " connected")
      (printout ?id "Welcome to the chatroom from CLIPS!" crlf)
      (do-for-all-facts ((?f connection)) (neq ?id (nth$ 1 ?f:implied))
          (printout (nth$ 1 ?f:implied) "User " ?id " connected" crlf)))
    
    (defrule say
      (connection ?id)
      ?f <- (message-buffered ?id)
      ?ff <- (message ?id ~/me ?message)
      =>
      (retract ?f ?ff)
      (printout ?id "You: " ?message crlf)
      (do-for-all-facts ((?f connection)) (neq ?id (nth$ 1 ?f:implied))
        (printout (nth$ 1 ?f:implied)
         ?id ": " ?message crlf)))

fc417fc802•about 2 hours ago

> because of uncertainty on how it would affect it.

Have the LLM submit a proposed move and either advance the game state or reply "permission denied, try again". Probably also log the number of times it happens, since attempted violations seem like a valuable signal as well.
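A minimal sketch of that loop, assuming a toy state where the only rule is "one land per turn". `ToyState`, `is_legal`, `apply_move`, and `run_turn` are hypothetical stand-ins; a real rules engine would replace the legality check, and `propose` would wrap an LLM call.

```python
from dataclasses import dataclass
from typing import Callable

DENIED = "permission denied, try again"


@dataclass
class ToyState:
    """Hypothetical stand-in for a real game state."""
    lands_in_hand: int = 2
    lands_played_this_turn: int = 0


def is_legal(state: ToyState, move: str) -> bool:
    """Toy legality check; a real rules engine would go here."""
    if move == "pass":
        return True
    if move == "play land":
        return state.lands_in_hand > 0 and state.lands_played_this_turn == 0
    return False


def apply_move(state: ToyState, move: str) -> None:
    """Advance the toy state for an already-validated move."""
    if move == "play land":
        state.lands_in_hand -= 1
        state.lands_played_this_turn += 1


def run_turn(
    state: ToyState,
    propose: Callable[[ToyState, str], str],
    max_violations: int = 3,
) -> int:
    """Run one turn and return how many proposals were rejected."""
    violations = 0
    feedback = ""
    while True:
        move = propose(state, feedback)
        if not is_legal(state, move):
            violations += 1  # logged as a signal in its own right
            if violations > max_violations:
                return violations  # give up on this turn
            feedback = DENIED
            continue
        feedback = ""
        if move == "pass":
            return violations
        apply_move(state, move)
```

The returned count could be stored per model alongside wins, so rule violations are scored separately from the quality of play.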

OsrsNeedsf2P•about 4 hours ago

I love obscure benchmarks, and I feel like I can trust their results a lot more. After all, they (probably) weren't benchmaxxed. RuneBench[0] is another good example (how well LLMs can play RuneScape).

[0] https://maxbittker.github.io/runebench/

thurn•about 3 hours ago

To clarify, the more accurate description would be "Testing how well LLMs can follow the rules of Magic", right? There is no actual evaluation of how "well" they are playing?

purple-leafy•about 3 hours ago

Benchmarks like this are onto something. Next frontier of LLM benchmarking.

jmccaf•about 4 hours ago

Awesome! Does this use https://mage-bench.com/, or is it a separate project? I ran 4 local models in a tournament recently with mage-bench on an RTX 5090; Qwen 3.6 27B won narrowly over Gemma 4.

CallumFerg•about 4 hours ago

No, I was not aware of that project when I made this.

I'll have to look into that project, but I also have an RTX 5090 and did a lot of testing with Qwen3.6 27B and Gemma 4 31B. I was not able to get them to play legal turns consistently. I had to keep expanding the system prompt and adding rules for edge cases. By the end, the prompt was over 10k tokens, and while they mostly made legal turns, they did not make good turns. And all the heuristics in the prompt degraded the performance and increased the cost for frontier models.

OwenCR•about 4 hours ago

Sadly this benchmark removes the part of MTG that is most interesting: the opponent(s). Without opponents you simply don't have a game. You just have a rules engine - quite boring!

I think I object more to the decks used in testing than the machines' decisions. I do have nitpicks though: this hand is quite poor and should be mulliganed: https://app.mtgautodeck.com/public/benchmarks/4bd9955b-ebe1-.... The poor runout reinforces this decision.

This project is cool though, props for making it!

comex•2 minutes ago

Gotta walk before you can run.

CallumFerg•about 4 hours ago

Admittedly, the mulligan phase system prompt is the weakest part of the project. I had to add heuristics to stop the LLMs from mulliganing down to just a few cards looking for a perfect hand. The scoring for the benchmark is mostly based on whether the LLM could complete legal turns, not good turns.

https://github.com/CallumFerguson/mtg-auto-deck/blob/a877c08...

danbrooks•about 4 hours ago

Very cool. I’ve been daydreaming about whether LLMs can be used to reason through gaming decisions.

pilord314•about 3 hours ago

They should randomize games of judge tower and see who wins:

https://mtg.fandom.com/wiki/Judge_Tower

TZubiri•about 3 hours ago

Looking forward to this metric being Goodhart lawed.

Like how the strawberry example was overtrained for, or how the pelican on a bike started being used in official release posts.

gravitronic•about 3 hours ago

Magic is complicated. I looked at doing something like this, but the game is tremendously open-ended: one specific card can completely change the rules, trigger a series of follow-up events, or require modifications to the rules engine at hand.

8note•about 3 hours ago

Or that certain cards, when played together, make an infinite loop, and so cannot be played/insta-die.

fc417fc802•about 2 hours ago

You misspelled insta-win. Infinite turn combos are the best.

akoboldfrying•about 2 hours ago

I was wondering how complicated it could really be, and it turns out that some people showed in 2019 that it's Turing-complete, meaning that any conceivable computation can be simulated by an MTG game, indeed a game in which every move by every player is forced: https://arxiv.org/abs/1904.09828

IOW, it's as complicated as possible.