The unconventional path to stumbling on the best seed architecture for recursive self-improvement
The race for recursive self-improvement is on. How we combined long-horizon planning with coding agents to outpace the biggest frontier labs in AI.
There are dollars at stake, but that isn’t what drives the propensity to race. We want to be the guy (or gal, in the case of Mira) who gets there first. It reminds me of the moon landing, or the Age of Exploration with Columbus, Magellan, and the like. It was the concept of getting there first that drove people to race, not necessarily pure necessity. Especially with AI, it’s not like we need it to save us; if all tech innovation stopped today, the world would function just fine as it is.
When we look at coding in particular, the race is amplified. Software itself has been the biggest driver of economic value over the last couple of decades with the internet and social media, mainly because of its unlimited digital portability. It’s only natural that this would be the F1 league of AI, where fields like legal and medicine are F2, and the speed of the cars and stakes are highest by far. If you win the Constructors' Championship of AI, it’s much easier to win in every other category because you haven't just built a car—you’ve built a car that can transform and replicate the fastest. We’ve already seen this as Claude Code is being used far beyond coding as the primary driver to write word docs, build presentations, and execute sales and marketing motions.
At the point in the race we are now, it takes humans in the loop to evolve the system, while optimizations like RL refine the details. Once a more advanced version of the system exists that can reliably evolve the underlying architecture without a human in the loop—not just refine its performance—it’s basically like racing boxcars. You give it a push, it gains momentum, and then it follows an intense acceleration curve.
The trillion-dollar question is: which human entity can give their self-evolving system the seed architecture that gives the cellular automata the best chance of evolving at the fastest rate? Claude Code and Codex are not there. Cursor is not there. Factory and Devin are not there yet. What’s the bottleneck? Sure, they can self-improve as is, but are there fundamental architectural constraints that are limiting the current generation of harnesses from accelerating at the fastest pace in existence?
This is where what we are doing at predev comes into play, and how our specific story has led us to a unique position in the race. We do not, by any means, have the most resources. But the starvation we faced and our unconventional path to our current seed architecture have defined our potential and our claim to accelerate the fastest in the coding agent race. Sure, Claude Code and Cognition's Devin could eventually figure out what we have and rebuild it—but will it be too late by the time they learn the lessons we already have? I like to think so. Like a good lawyer in a courtroom, I’m going to make a claim, prove it, and then let the jury decide for themselves.
How far ahead are we? We already proved a step-function gain in performance on TBench 2.0, the most relevant benchmark for basic short-horizon terminal agents. We broke the model-cost continuum on Anthropic’s own models compared to Claude Code and OpenCode to prove the supremacy of our basic coding agent harness. There were many fine details and optimizations in architecture, memory, context management, and orchestration that led to this result, but it was actually not the most difficult thing to prove because of what we learned before we even started building a coding agent. It took us about three months to build the native agent harness that jumped a whole model tier on Terminal Bench, with Haiku 4.5 beating Sonnet 4.5, and Sonnet 4.6 beating Opus 4.5.
We started predev about three years ago. So, what were we up to for the other two years and nine months? Why was it so fast for a tiny team of two engineers to lap a massive frontier lab on the most important coding benchmark for the most critical initiative in their entire company? Sure, they beat us in revenue by billions, but does that really matter at all in a performance-based race? Potential energy is the most important driver here, because revenue is temporary and fickle in a fast-changing AI market.
So let’s rewind to May 2023, when predev was first conceived. I had just spent the last six years working on a stealth AI hedge fund, doing AI research, training my own models, and deploying a full agent infrastructure and continuous learning pipeline. My architecture was an autonomous, time-series-based portfolio management system that allowed you to input a universe of assets, dial leverage and instruments, and then the agent would wake up every minute to make portfolio-balancing decisions in either long or short directions. Building this taught me the fundamentals of AI and how to build resilient agentic systems that were highly sensitive and had to function while trading large sums of money in seconds. I trained a time-series-based transformer architecture as the underlying prediction mechanism, then a real-time RL pipeline optimized actions taken by the agent on top of the bigger network.
Over one year of trading, it recorded a 343% return, but I was forced to shut it down when regulations over perpetual futures became unclear for US customers. I sat there, exhausted and hopeless, after six years of grinding to build this.
After being dormant for six months, only able to trade with fake money to measure performance, I decided to go back to my bread and butter: reviving my software development shop to take on a few projects to pay the bills. I emailed a few group chats and finally got in front of a customer who wanted to build a startup platform. In that first meeting, I quickly realized how much of a pain it is to sit there and listen to a non-technical founder’s massive idea dump and information overload. On top of that, when I gave them an estimate, they acted offended. I was like, "Wait, but I mapped out exactly what you said in the call to the hour: user stories, task breakdowns with cost, timelines, architecture diagrams, wireframes." I spent my whole weekend doing the work of an entire team just to get immediately rejected as if I were overcharging.
The mismatch in expectations was apparent. Either I had to charge way less for work that was undoubtedly going to take longer than what they were paying for, or I had to continue searching for clients who understood development a little better and wouldn’t complain about the price. Either way, I felt extreme frustration doing more work without getting paid. Even if I happened to take on the project for less than I was worth, that client was going to have a mismatch in expectations through the entire lifetime of the product—which could literally be years. It was a massive headache waiting to happen.
At the same time this was happening, GPT-3.5 dropped. The first time using it, I was blown away that transformers similar to my finance algos were able to produce sentient-like behavior in LLMs. Because the dev client situation was so fresh, I naturally thought: how could I automate this annoying experience of dealing with planning a development project and estimating the cost, so I could get a quicker yes or no, start coding, and get money in the bank? One thing I learned from working with the agentic portfolio management system was that predicting ahead makes all the difference. You can do pretty well by running an agent that analyzes things in real time, but if you have something that reliably forecasts prices, then the RL agent running in real time and maximizing its reward has a huge advantage over a vanilla agent just seeing raw OHLCV data. Everything my system did was reliant on the predictions of the planning network and the RL agent optimizing decisions based on current market conditions, while using a long-horizon predictor network as a sort of compass. Either model in isolation would produce solid returns, but when combined, the returns took off. The long-horizon predictions looked ahead for 4 hours at 5-minute intervals, helping the RL agent avoid big dips and tap into a deeper signal to make winning trades. The forecasts were never perfect, but their presence was crucial to decision-making.
This was fresh in my head when I was thinking about the problem of scoping in software development. It’s a form of pricing that’s based on predicting ahead how the project and budget will evolve over time. Naturally, I figured that a planner + execution agent architecture could be incredibly interesting when applied to software development projects.
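To make the "forecast as compass" idea concrete, here is a minimal sketch of how a long-horizon forecast could be appended to what a real-time agent sees. It is not the original trading system: the `Bar` type, `naive_forecast`, and `build_observation` are hypothetical names, and the linear-drift predictor is only a stand-in for a trained model. The one detail taken from the text is the horizon, 4 hours at 5-minute intervals, which works out to 48 forecast points.

```python
from dataclasses import dataclass
from typing import List

HORIZON_MINUTES = 4 * 60                   # from the text: 4 hours ahead
STEP_MINUTES = 5                           # from the text: 5-minute intervals
N_STEPS = HORIZON_MINUTES // STEP_MINUTES  # 48 forecast points


@dataclass
class Bar:
    """One OHLCV bar (hypothetical container for illustration)."""
    open: float
    high: float
    low: float
    close: float
    volume: float


def naive_forecast(closes: List[float], n_steps: int = N_STEPS) -> List[float]:
    """Placeholder predictor (assumption): extend the last one-step drift linearly.

    A real system would call a trained forecasting model here.
    """
    drift = closes[-1] - closes[-2] if len(closes) >= 2 else 0.0
    return [closes[-1] + drift * (i + 1) for i in range(n_steps)]


def build_observation(bars: List[Bar], forecast: List[float]) -> List[float]:
    """Concatenate the latest raw bar with forecast returns relative to its close."""
    last = bars[-1]
    raw = [last.open, last.high, last.low, last.close, last.volume]
    relative = [(price - last.close) / last.close for price in forecast]
    return raw + relative
```

A vanilla agent would see only the five raw OHLCV values; in this sketch the agent also sees 48 forward-looking values, which is the extra signal the paragraph above describes.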
predev Alpha
The first demo I created was not much different from how it works today, though the underlying mechanisms have developed as AI has become more powerful. You input a prompt of what you want to build, it maps the requirements, maps the system architecture components (frontend, backend, DB, services, etc.), and then uses that representation to predict the actual tasks and their dependencies. With all that information, you can estimate complexity per node and calculate project costs. This was the planning layer that predev started with.
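The last two steps of that pipeline, going from tasks and dependencies to a cost estimate, can be sketched as follows. This is an illustration under stated assumptions, not predev's implementation: the `Task` fields, the `estimate_project` function, the flat hourly rate, and the critical-path figure are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Dict, List


@dataclass
class Task:
    """One node in the task graph (hypothetical shape for illustration)."""
    name: str
    component: str                      # e.g. "frontend", "backend", "db"
    hours: float                        # estimated effort for this node
    depends_on: List[str] = field(default_factory=list)


def estimate_project(tasks: List[Task], hourly_rate: float) -> Dict[str, object]:
    """Order tasks by dependency, then total the effort and cost."""
    by_name = {t.name: t for t in tasks}
    graph = {t.name: set(t.depends_on) for t in tasks}
    # static_order() yields every task after all of its dependencies.
    order = list(TopologicalSorter(graph).static_order())

    # Earliest finish time per task, assuming unlimited parallelism.
    finish: Dict[str, float] = {}
    for name in order:
        task = by_name[name]
        start = max((finish[d] for d in task.depends_on), default=0.0)
        finish[name] = start + task.hours

    total_hours = sum(t.hours for t in tasks)
    return {
        "order": order,
        "total_hours": total_hours,
        "cost": total_hours * hourly_rate,
        "critical_path_hours": max(finish.values(), default=0.0),
    }
```

Total hours drive the budget, while the critical path gives a lower bound on calendar time when independent tasks run in parallel. Keeping the two separate is one way to show a client why cost and timeline do not scale together.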
We didn’t start out like GPT Engineer by Anton Osika, which at the time was a Claude Code equivalent coding agent operating in the terminal. I tried GPT Engineer and knew it just wasn’t good enough yet to automate writing code. Also, I was still very much focused on my own problem of trying to pay my bills by acquiring clients more efficiently for my dev consulting business. The tasks that were mapped out were tasks I myself would be doing, so I wanted them to be clean, crisp estimates.
I quickly noticed that the quality, depth, and coverage of the Statement of Work (SOW) produced by the alpha demo of predev surpassed what I was able to do manually over a full weekend (>15 hours)—and it did it in 5 minutes. It asked 5 discovery questions that adapted to the user, and that was more than enough input to capture full, complex architectures for most startup founder-esque ideas. Compare this to an hour-long call where I had to listen to a founder’s cryptic brain dump and piece together 1,000+ mixed-up puzzle pieces. When I saw that output and sent the first SOW to a client, I felt a new wave of inspiration. I was burned out from six years of heavy lifting that got killed prematurely, but somehow, I found a second wind.
I showed my early work to my friend Michael, who had worked at Cognizant and did this stuff at scale. He was intrigued and helped me ideate early on. Then, in August of that year, I realized I wanted to actually run with this. ChatGPT was being discovered by more than just techies, a new AI revolution was starting, and I already had an incredible demo in a domain no one else was thinking of. I was using AI to generate graphs and forecast software costs rather than simple back-and-forth chat. In those early days, people hardly knew what ChatGPT was, so it was really hard for me to explain my idea and the need for it. I did research and found all these McKinsey studies about scoping and scope creep: how 50% of software budgets die because of mis-scoping, and the rest get killed because of scope creep or tech debt down the line. My research showed that the single biggest point of failure in software occurs before code is ever written. And thus, predev was born.
I was sick of dealing with regulations and uncertainty and ready for some good old-fashioned SaaS. I gave Arjun a call. He was the only 30+ year old who could match my coding output and who I knew had the balls to go all-in on something. I had to impress him, but I knew the demo would do the magic itself.
I said, “Pick an idea—any idea.” He picked a workout tracking app, and I watched him interact with the demo. His eyes lit up when he saw the intuitive follow-up questions the LLM asked him. Then, he finished the 5-question flow, and a simple overlay with a loading icon appeared saying “generating architecture.” He was intrigued and confused simultaneously, asking what was happening. Then, 2 to 3 minutes later, the magic moment happened: a 3D force graph representation of the frontend and backend architecture appeared, and a 10-page SOW toggled on the next tab. He paused in absolute awe.
He was sold. I didn’t have to say anything else. I just said, "Let’s build the biggest software company in AI. Are you in?"
From that one moment forward, Arjun and I have been more than all-in. I say "more than all-in" because it’s taken more energy and resilience than we could have ever imagined. We woke up three years later, working 7 days a week with bags under our eyes, and found ourselves in a race we didn’t even know we had entered.
So back to my original question: how did our seed harness architecture put us in a position to recursively self-improve with a head start on the frontier labs? What did we learn over those 2 years and 9 months prior to building the coding agent?
Well, our first attempt at scaling was to apply the planning layer to real human development projects. We faced many challenges along the way—namely, the stakes are incredibly high for development shops trying to put food on the table. There’s not a 1:1 match for scoping to reality, and dev shops are often incentivized to bloat the budget. To bloat the budget, they have to be vague with their scoping; if they shared exact details, they would be leaving money on the table because maybe a project only takes 4 months realistically, but they can squeeze out another 50% in revenue by saying it takes 6.
We learned that dev shops do not actually want more accurate scoping. They want to scope things their own way. If it’s faster, cool, but our scopes were “too detailed.” We called it the "mechanic problem"—essentially, when you send someone who isn't car-savvy to the mechanic, they often get ripped off for a basic oil change with a bunch of unknown costs due to the technical communication gap.
What we did learn is that by being transparent in the first 2 minutes—giving clients an estimate with all the bells and whistles accounted for—we closed clients faster, or at least didn’t waste time on clients who would have rejected the budget anyway. Some dev shops got it. We built an API (Architect API) and an embeddable widget that agencies could use with their customers. The agencies that used us closed more clients faster, and their dev projects went smoother with the accurate accounting done upfront. We heard the "why not just use ChatGPT?" line like a broken record millions of times, but we knew that GPT was not specialized in this domain and could not produce complex architecture and SOWs deterministically and consistently. We also had data from real projects that helped us calibrate our predictor.
The Floor is Lava
By November 2025, we were deeply ingrained in using Claude Code and Cursor to develop much faster. We had always known since the early days that one day, when LLMs were good enough, coding agents would adopt our planning layer in the exact same way real-life engineering teams do. As a result, we launched our Architect MCP and developed content comparing vanilla Cursor and Claude to Cursor equipped with our MCP. For building a Perplexity clone in one shot, we had a video go viral because Cursor with our MCP created a fully functional version in a single prompt—complete with streaming and real citations/search—compared to the vanilla Cursor plan mode, which produced a hollow shell of a prototype.
When December rolled around and Opus 4.5 dropped, we noticed a real jump in productivity, which was the signal to start connecting the Architect layer to a coding agent. We figured the traditional dev shops might be screwed because we alone were able to ship massively complex code with just two people. We experimented with the Claude Code SDK and OpenCode SDK in December, and had our first version of the end-to-end self-driving agent live in January on the Claude Code SDK.
Unfortunately, in the wild, we hit a bunch of scaling issues early on because our agents had to run for much longer since we projected full roadmaps of hundreds of stories and tasks. The Claude Agent SDK locked us into Claude models, which were some of the most expensive on the market. An agent run would work fine for a milestone and then just crash the 2GB sandbox.
Over the next few months, I realized we needed to build our own. The current agent frameworks were not designed for long-horizon work—they were clearly vibe-coded pieces of trash. We needed to be cost-effective because we were running these agents on the web with real users and real constraints. Unit economics were real, and we learned quickly that token usage, sandboxing, and infrastructure costs add up fast. People wanted everything to work perfectly, and users got highly frustrated when it didn’t one-shot their app seamlessly. Part of the blame lies with vibe-coding tools like Lovable and Replit for conditioning users into thinking it’s much easier than it actually is to build software that works.
It’s a difficult architecture stack to balance from the top down, and small mistakes compound quickly and are brutal to recover from. Building a truly agentic coding system that operated on the web without the tight environment constraints of a closed ecosystem like Lovable or Replit was a massive engineering problem. We had the theory down—we deeply believed planning was the key to scaling coding agents—but making it happen while the ecosystem was still young was incredibly difficult.
But we did it. Then, we built a CLI to test it on Terminal Bench and validated that we had built an incredible base harness. Using superior planning techniques and advanced agent memory architectures was the key to us beating Claude Code on the benchmark.
One thing is certain: this very much feels like a race. It is soul-draining with constant FOMO. Everyone in AI is racing to keep up, but we’re racing in the hardest tier, the Tour de France of AI. The days are long, and they don’t stop until twilight. We’ve sacrificed time with loved ones and every single weekend we have; you can see the lines developing on our faces as if we just served a presidency. It’s not healthy, but we think we are at the point where we’ve given the system enough of a push that it has a genuine chance to slingshot into orbit.
Today
Predev is the only agent that natively plugs long-horizon planning into an efficient coding agent. This is the seed architecture I’ve been hinting at. We have a whole new dimension of self-improvement compared to Claude Code and Codex. We have given the ball a push down the hill and initiated the self-improvement process across multiple dimensions. It’s gaining momentum by the day, and as base LLMs get better, our lead only increases.
We aren’t doing this in private like Anthropic with Mythos, making bold, unsubstantiated claims. We are the only commercial agent company that has released our TBench harbor rewards and trajectories completely publicly. Furthermore, the exact same agent harness is available via web, SDK, CLI, and MCP to all our users starting at just $25.
There is no provider lock-in for our users; we handle model routing by default so you get the best token efficiency possible. We feature RLM capabilities, long-term agent memory, context management, native architecture awareness, multi-session git worktree and sandbox isolation, and seamless synchronization between local CLI and the cloud.
The exact same software pricing problems we uncovered with human engineering hours in the early iterations of predev are now resurfacing as agent hours—with token usage per hour becoming the biggest talking point in enterprise AI today.
The only remaining question: is our escape velocity sufficient to break free of the frontier technology curve?
If so, then what?