how did we make deepseek outperform opus 4.7?

https://x.com/MrAhmadAwais/status/205...

i've been thinking about why "open model bad at tool calling" is almost always a harness problem, not a model problem, especially as i build https://commandcode.ai

context: spent the last two days looking at billions of tokens in @CommandCodeAI (tb open source ai cli) using deepseek. i ended up writing a tool-input repair layer. the trigger was watching deepseek-flash fail on the simplest /review run: every shellCommand and readFile call bounced back with a raw zod issues blob, and the model couldn't recover because the error wasn't in a form it could read. by the end, deepseek v4 pro was beating opus 4.7 6/10 times on our internal evals.

a few things i learned that feel general:

1/ the failure modes aren't random; they're a small, finite, compositional set. across deepseek-flash, deepseek v4 pro, glm, and qwen, the same four mistakes repeat almost exactly:

- sending `null` for an optional field instead of omitting it
- emitting `["a","b"]` as a json string instead of an actual array
- wrapping a single arg in `{}` where the schema expected an array (an "empty placeholder")
- passing a bare string where an array was expected (`"foo"` instead of `["foo"]`)

four repairs, ~30-100 lines each, ordered carefully (json-array-parse must run before bare-string-wrap, or `'["a","b"]'` becomes `['["a","b"]']`). that is the whole catalogue. when i hear "this open source model can't do tool calls" i now assume it's one of those four, and so far that's been right ~90% of the time.

2/ the funniest failure mode is also the most revealing.

3/ the design choice that mattered was inverting preprocess-then-validate to validate-then-repair.

4/ shape invariants and relational invariants need different fixes.

zoom out: a lot of what looks like model capability is actually contract design.
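to make 1/ and 3/ concrete, here is a minimal sketch of what a validate-then-repair pass over the four mistakes could look like. this is an illustration under assumptions, not the actual Command Code implementation: `FieldSpec`, `validate`, `repairs`, `OMIT`, and `repairInput` are hypothetical names, the tiny hand-rolled validator stands in for a real zod schema so the snippet has no dependencies, and mapping a `{}` placeholder to `[]` is my guess at a reasonable repair.

```typescript
// Hypothetical sketch: every name here is illustrative, not Command Code's API.
type FieldSpec = { type: "string" | "string[]"; optional?: boolean };
type Schema = Record<string, FieldSpec>;
type Issue = { field: string; message: string };
type RepairResult =
  | { ok: true; value: Record<string, unknown> }
  | { ok: false; error: string };

// Minimal stand-in for a strict validator (a real harness would use zod here).
function validate(schema: Schema, input: Record<string, unknown>): Issue[] {
  const issues: Issue[] = [];
  for (const [field, spec] of Object.entries(schema)) {
    const value = input[field];
    if (value === undefined) {
      if (!spec.optional) issues.push({ field, message: "required" });
    } else if (spec.type === "string") {
      if (typeof value !== "string") issues.push({ field, message: "expected string" });
    } else if (!Array.isArray(value) || !value.every((v) => typeof v === "string")) {
      issues.push({ field, message: "expected array of strings" });
    }
  }
  return issues;
}

const OMIT = Symbol("omit"); // sentinel: drop the field entirely
type Repair = (value: unknown, spec: FieldSpec) => unknown;

// Order matters: json-array-parse (2) must run before bare-string-wrap (4).
const repairs: Repair[] = [
  // 1. null for an optional field -> omit the field
  (v, spec) => (v === null && spec.optional ? OMIT : v),
  // 2. json-encoded array string -> real array
  (v, spec) => {
    if (spec.type !== "string[]" || typeof v !== "string" || !v.trim().startsWith("[")) return v;
    try {
      const parsed: unknown = JSON.parse(v);
      return Array.isArray(parsed) ? parsed : v;
    } catch (_err) {
      return v; // not valid json, leave it for a later repair
    }
  },
  // 3. empty {} placeholder where an array was expected -> [] (assumed repair)
  (v, spec) =>
    spec.type === "string[]" && typeof v === "object" && v !== null &&
    !Array.isArray(v) && Object.keys(v).length === 0 ? [] : v,
  // 4. bare string where an array was expected -> single-element array
  (v, spec) => (spec.type === "string[]" && typeof v === "string" ? [v] : v),
];

function repairInput(schema: Schema, input: Record<string, unknown>): RepairResult {
  // Validate first: well-formed calls pass through untouched.
  let issues = validate(schema, input);
  if (issues.length === 0) return { ok: true, value: input };

  // Repair only the fields that failed, then validate again.
  const repaired: Record<string, unknown> = { ...input };
  for (const { field } of issues) {
    const spec = schema[field];
    if (!spec || !(field in repaired)) continue; // a missing field can't be repaired here
    let value: unknown = repaired[field];
    for (const repair of repairs) value = repair(value, spec);
    if (value === OMIT) delete repaired[field];
    else repaired[field] = value;
  }

  issues = validate(schema, repaired);
  if (issues.length === 0) return { ok: true, value: repaired };
  // Give the model an error it can read instead of a raw issues blob.
  return { ok: false, error: issues.map((i) => `${i.field}: ${i.message}`).join("; ") };
}
```

with a schema like `{ paths: { type: "string[]" }, cwd: { type: "string", optional: true } }`, a call carrying `paths: '["a","b"]'` and `cwd: null` comes out as `{ paths: ["a", "b"] }`, while an already valid call is returned unchanged. the point of validating first is that the repairs only ever touch fields the strict schema has already rejected, so they can't quietly rewrite input that was fine.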
a strict schema is a choice with a cost: it filters out noise, but it also filters out recoverable noise from any model that hasn't memorized the exact json contract you happened to pick. the largest commercial models eat that cost invisibly and are lenient on tool calling because they've seen enough of every contract during pretraining; open models pay it loudly and get dismissed for it. the harness is where you mediate between distributions.

four small repairs (i'm sure more will follow, as we have three more merging today), two regex lines for auto-links, one relational default, one prefix change. the model didn't change. the contract got more forgiving in exactly the places it needed to be. deepseek v4 pro now beats opus 4.7 6/10 times on our internal evals.

imo "skill issue" applies to the harness more often than the model.

Follow on 𝕏: https://x.com/MrAhmadAwais
Command Code: https://commandcode.ai
GitHub: https://github.com/AhmadAwais
LinkedIn: / mrahmadawais
YouTube: https://YouTube/AhmadAwais
Dev Blog: https://AhmadAwais.com

If you like my work, feel free to share it, like it, and subscribe to my YouTube channel. Let's connect on X @ https://x.com/MrAhmadAwais

Use your code for good. Peace! ⌘