how did we make deepseek outperform opus 4.7?

https://x.com/MrAhmadAwais/status/205...

i've been thinking about why "open model bad at tool calling" is almost always a harness problem, not a model problem, especially as i build https://commandcode.ai

context: i spent two days looking at billions of tokens in @CommandCodeAI (tb open source ai cli) using deepseek, and ended up writing a tool-input repair layer. the trigger was watching deepseek-flash fail on the simplest /review run: every shellCommand and readFile call bounced back with a raw zod issues blob, and the model couldn't recover because the error wasn't in a form it could read. by the end, deepseek v4 pro was beating opus 4.7 6/10 times on our internal evals.

a few things i learned that feel general:

1/ the failure modes aren't random. they're a small, finite, compositional set. across deepseek-flash, deepseek v4 pro, glm, and qwen, the same four mistakes repeat almost exactly:

- sending `null` for an optional field instead of omitting it
- emitting `["a","b"]` as a json string instead of an actual array
- wrapping a single arg in `{}` where the schema expected an array (an "empty placeholder")
- passing a bare string where an array was expected (`"foo"` instead of `["foo"]`)

four repairs, ~30-100 lines each, ordered carefully (json-array-parse must run before bare-string-wrap, or `'["a","b"]'` becomes `['["a","b"]']`). that is the whole catalogue. when i hear "this open source model can't do tool calls" i now assume it's one of those four, and so far that's been right ~90% of the time.

2/ the funniest failure mode is also the most revealing.

3/ the design choice that mattered was inverting preprocess-then-validate to validate-then-repair.

4/ shape invariants and relational invariants need different fixes.

zoom out: a lot of what looks like model capability is actually contract design.
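
to make 1/ and 3/ concrete, here is a minimal sketch of a validate-then-repair pass with the four repairs in the order described above. it is not the Command Code implementation: the tiny `validate` function stands in for a real validator such as zod, and every name here (`FieldSpec`, `validateThenRepair`, the repair functions, the error wording) is hypothetical. turning `{}` into `[]` is one assumed reading of the "empty placeholder" case.

```typescript
// Hypothetical mini-schema standing in for a real validator such as zod.
type FieldSpec = { type: "string" | "array"; optional?: boolean };
type Schema = Record<string, FieldSpec>;
type Args = Record<string, unknown>;
type Issue = { path: string; expected: FieldSpec["type"]; received: string };

function kindOf(v: unknown): string {
  if (v === null) return "null";
  return Array.isArray(v) ? "array" : typeof v;
}

function validate(schema: Schema, args: Args): Issue[] {
  const issues: Issue[] = [];
  for (const [path, spec] of Object.entries(schema)) {
    if (!(path in args)) {
      if (!spec.optional) issues.push({ path, expected: spec.type, received: "missing" });
      continue;
    }
    const received = kindOf(args[path]);
    if (received !== spec.type) issues.push({ path, expected: spec.type, received });
  }
  return issues;
}

// A repair returns a fix, or null when it does not apply to this value.
type Fix = { omit: true } | { omit: false; value: unknown } | null;
type Repair = (value: unknown, spec: FieldSpec) => Fix;

const nullToOmitted: Repair = (v, spec) =>
  v === null && spec.optional ? { omit: true } : null;

const jsonStringToArray: Repair = (v, spec) => {
  if (spec.type !== "array" || typeof v !== "string") return null;
  try {
    const parsed: unknown = JSON.parse(v);
    return Array.isArray(parsed) ? { omit: false, value: parsed } : null;
  } catch {
    return null; // not JSON at all, leave it for bareStringToArray
  }
};

// Assumption: an empty `{}` placeholder is treated as an empty array.
const emptyObjectToArray: Repair = (v, spec) => {
  const isEmptyObject =
    typeof v === "object" && v !== null && !Array.isArray(v) && Object.keys(v).length === 0;
  return spec.type === "array" && isEmptyObject ? { omit: false, value: [] } : null;
};

const bareStringToArray: Repair = (v, spec) =>
  spec.type === "array" && typeof v === "string" ? { omit: false, value: [v] } : null;

// Order matters: jsonStringToArray must run before bareStringToArray.
const REPAIRS: Repair[] = [nullToOmitted, jsonStringToArray, emptyObjectToArray, bareStringToArray];

type Result = { ok: true; args: Args } | { ok: false; message: string };

function validateThenRepair(schema: Schema, input: Args): Result {
  const firstPass = validate(schema, input);
  if (firstPass.length === 0) return { ok: true, args: input }; // valid input is never touched

  const args: Args = { ...input };
  for (const issue of firstPass) {
    if (!(issue.path in args)) continue; // a missing required field cannot be repaired
    for (const repair of REPAIRS) {
      const fix = repair(args[issue.path], schema[issue.path]);
      if (fix === null) continue;
      if (fix.omit) delete args[issue.path];
      else args[issue.path] = fix.value;
      break; // first applicable repair wins
    }
  }

  const remaining = validate(schema, args);
  if (remaining.length === 0) return { ok: true, args };
  // Return a sentence the model can read instead of a raw issues blob.
  const message = remaining
    .map((i) => `Argument "${i.path}": expected ${i.expected}, received ${i.received}.`)
    .join(" ");
  return { ok: false, message };
}

// Usage: a JSON-encoded array plus a null optional field.
const readFileSchema: Schema = {
  paths: { type: "array" },
  encoding: { type: "string", optional: true },
};
const repaired = validateThenRepair(readFileSchema, { paths: '["a","b"]', encoding: null });
console.log(JSON.stringify(repaired)); // {"ok":true,"args":{"paths":["a","b"]}}
```

because validation runs first, input that already matches the contract passes through unchanged, and repairs only touch the fields that actually failed. anything still broken after the repair pass goes back to the model as a plain sentence it can act on.
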
a strict schema is a choice with a cost: it filters out noise, but it also filters out recoverable noise from any model that hasn't memorized the exact json contract you happened to pick. the largest commercial models eat that cost invisibly and are lenient on tool calling because they've seen enough of every contract during pretraining; open models pay it loudly and get dismissed for it. the harness is where you mediate between distributions.

four small repairs (i'm sure more will follow, as we have three more merging today), two regex lines for auto-links, one relational default, one prefix change. the model didn't change. the contract got more forgiving in exactly the places it needed to be. deepseek v4 pro now beats opus 4.7 6/10 times on our internal evals.

imo "skill issue" applies to the harness more often than the model.

Follow on 𝕏: https://x.com/MrAhmadAwais
Command Code: https://commandcode.ai
GitHub: https://github.com/AhmadAwais
LinkedIn: / mrahmadawais
YouTube: https://YouTube/AhmadAwais
Dev Blog: https://AhmadAwais.com

If you like my work, feel free to share it, like it, and subscribe to my YouTube channel. Let's connect on X @ https://x.com/MrAhmadAwais

Use your code for good. Peace! ⌘
