Our Ambitious Goal: The Path to 90% Code Generation
Navigating the Hype and Hard Realities of AI Code Generation
At PromptOwl, we're not just building software. We're fundamentally reimagining how software gets built. Our mission is ambitious: to reach 90% code generation. This isn't a vanity metric for us. It's the strategy that lets our scrappy little team punch above our weight and deliver exceptional value to our users faster than on any project any of us has worked on before.
Our initial foray into code generation, or vibe coding as some call it, focused on the user interface (UI). We saw immediate, encouraging results by feeding Figma designs to Vercel to generate the UI. The ability to rapidly generate UI components allowed us to iterate faster on our product, gather user feedback more quickly, and ultimately accelerate our development cycle.
This early success with UI generation provided the momentum and conviction we needed to pursue our ambitious 90% goal across our entire codebase. However, the backend was not as easy, at least not consistently.
In a recent discussion, the PromptOwl team explored our use of AI tools for code generation, reflecting on our experiences, challenges, and strategies. The conversation yielded valuable insights into the current state of AI in development workflows, highlighting both the potential and the limitations of these technologies.
So What Have We Learned?
Frontend vs. Backend: Not Even Close
UI generation consistently outperforms backend code generation by a significant margin. It is hard to be scientifically precise here, but we estimate roughly a 75% success rate for UI tasks versus roughly 40% for complex backend logic. Why? LLMs are basically pattern-matching machines, and UI follows more predictable patterns. Give an AI a Figma design and watch it spit out beautiful React components. Ask it to build a microservice with complex business logic and database interactions? That's when things get... interesting. Ask it to fix a bug, and you could be doing more harm than good.
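To illustrate what we mean by predictable patterns: most UI work reduces to the same props-in, markup-out shape. Here is a hypothetical sketch (not actual PromptOwl code) of the kind of component these tools reliably produce:

    // A representative, hypothetical example of the props-in, markup-out
    // pattern that makes UI generation tractable for LLMs.
    import * as React from "react";

    type StatCardProps = {
      label: string;
      value: string;
      trend?: "up" | "down";
    };

    export function StatCard({ label, value, trend }: StatCardProps) {
      return (
        <div className="rounded-lg border p-4">
          <p className="text-sm text-gray-500">{label}</p>
          <p className="text-2xl font-semibold">{value}</p>
          {trend && <span>{trend === "up" ? "▲" : "▼"}</span>}
        </div>
      );
    }

A backend task rarely decomposes this cleanly, which is a big part of why the success rates diverge.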
The Tool Landscape is a Hot Mess
The team has explored 22 different AI coding tools, yet only a few have demonstrated enough utility to merit serious use. The landscape is indeed a mess, filled with tools that promise much but often underdeliver, requiring significant manual intervention to achieve the desired outcomes.
Cursor: This tool stands out for its excellent IDE integration. It injects generated code into the right places within the development environment, which streamlines the workflow. However, Cursor has a notable drawback: it tends to "hallucinate" dependencies, meaning it sometimes generates code that relies on libraries or components that don't actually exist or aren't available in the project. This can create significant debugging overhead, as developers must manually correct these inaccuracies (a lightweight check for catching such imports is sketched after the tool rundown).
OpenHands: Formerly known as OpenDevin, this open-source tool is impressive in its capacity for code analysis. It can provide a summary of a large code base, which is invaluable for quickly understanding the overall structure and functionality of a project. It also has the ability to find and fix bugs and generate documentation, which can save developers a significant amount of time. However, OpenHands is not without its issues. The tool occasionally "goes down rabbit holes," a colloquial way of saying it can get sidetracked or fixated on irrelevant details, which can reduce its efficiency and require developers to redirect its focus.
GitHub Copilot: This is a popular tool, particularly in enterprise settings, due to its availability and the backing of Microsoft. It integrates well with GitHub and can be a secure option for larger organizations. However, in terms of capability, GitHub Copilot sometimes feels a generation behind other tools. While it can generate code snippets, it often requires developers to copy and paste these into their projects, which can become cumbersome, especially if the generated code requires further refinement.
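Hallucinated dependencies are usually easy to catch once you look for them. As a purely illustrative sketch (not a feature of any of the tools above), here is a small TypeScript check that flags imports in generated code that were never declared in package.json:

    // Flag bare imports in generated source that aren't declared in
    // package.json. A simplified heuristic, for illustration only.
    import { readFileSync } from "node:fs";

    const pkg = JSON.parse(readFileSync("package.json", "utf8"));
    const declared = new Set([
      ...Object.keys(pkg.dependencies ?? {}),
      ...Object.keys(pkg.devDependencies ?? {}),
    ]);

    export function findUndeclaredImports(source: string): string[] {
      // Match module specifiers in `import ... from "pkg"` statements,
      // skipping relative paths (which start with "." or "/").
      const importRe = /from\s+["']([^."'/][^"']*)["']/g;
      const flagged: string[] = [];
      for (const match of source.matchAll(importRe)) {
        const name = match[1].startsWith("@")
          ? match[1].split("/").slice(0, 2).join("/") // scoped package
          : match[1].split("/")[0];
        if (!declared.has(name) && !name.startsWith("node:")) {
          flagged.push(name);
        }
      }
      return flagged;
    }

Anything this returns is either a hallucinated package or a missing install; either way, it is worth a look before the generated code lands in a branch.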
The team's experience with these tools illustrates the current state of AI-assisted coding: promising, but still requiring careful oversight and significant manual effort.
Models Matter (Until They Don't)
The team's exploration of AI-driven code generation involves extensive experimentation with various large language models (LLMs). After running hundreds of identical prompts across different models, the team's experience has been consistent: Claude models generally outperform the others on code-related tasks. This underscores how much model selection matters for getting good results from AI-assisted coding.
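Conceptually, that kind of side-by-side replay is nothing fancy. Here is a minimal sketch, with a hypothetical ModelClient abstraction standing in for the vendor SDKs (this is not our actual tooling):

    // Replay one prompt across several models and collect the outputs
    // side by side. `ModelClient` is a hypothetical abstraction; real
    // clients would wrap each vendor's SDK.
    interface ModelClient {
      name: string;
      generate(prompt: string): Promise<string>;
    }

    async function comparePrompt(
      prompt: string,
      clients: ModelClient[],
    ): Promise<{ model: string; output: string }[]> {
      return Promise.all(
        clients.map(async (client) => ({
          model: client.name,
          output: await client.generate(prompt),
        })),
      );
    }

The interesting work is in scoring the outputs, but even eyeballing identical prompts side by side makes model differences obvious quickly.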
However, the landscape is nuanced. OpenAI's o3-mini, for instance, offers surprising capability for its size, demonstrating that even smaller models can be effective for specific coding needs. Certain Gemini variants also excel at particular programming languages, suggesting that model selection can be further tuned to the languages used in a project.
Despite these findings, the team has encountered a significant challenge: model performance and behavior can change rapidly. Rankings shift unexpectedly; a model that leads one week may be throttled, or updated with behaviors that disrupt established workflows, the next. This inconsistency forces continuous adaptation and recalibration of the team's strategies.
The team also notes that the models themselves can introduce variability. Claude, for example, a powerful model for code generation, recently changed its behavior and error-message formatting; it seems to have spontaneously regressed around the time Anthropic released the new 3.7 model. These changes hurt the reliability and predictability of its output, at times making it almost unusable and forcing our engineers back to other, previously discarded models.
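One pragmatic response, sketched below as an illustration rather than a description of our production setup, is to keep a small fixed suite of canary prompts with cheap deterministic checks and re-run it whenever a model updates, so a regression shows up as a drop in pass rate instead of as mysterious workflow breakage:

    // Re-run a fixed prompt suite and compare the pass rate to a stored
    // baseline. The 0.1 tolerance is an arbitrary illustrative threshold.
    type Check = { prompt: string; passes: (output: string) => boolean };

    async function driftCheck(
      generate: (prompt: string) => Promise<string>,
      suite: Check[],
      baselinePassRate: number,
    ): Promise<{ passRate: number; regressed: boolean }> {
      let passed = 0;
      for (const check of suite) {
        // Each check is deterministic, e.g. "does the output contain the
        // expected function signature" or "does it parse as valid JSON".
        const output = await generate(check.prompt);
        if (check.passes(output)) passed++;
      }
      const passRate = passed / suite.length;
      return { passRate, regressed: passRate < baselinePassRate - 0.1 };
    }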
No patterns in AI development seem to be stable for long.
AI as Developer Multiplier, Not Replacement
Despite the buzz about software companies shedding parts of their development staff, this is not happening because AI can wholesale replace a developer. It's because AI can raise the productivity of your core developers enough that the team size can shrink.
AI should be seen as a tool that enhances developer productivity, not a replacement for human engineers: its primary role is to make humans more productive. For now, this is especially true for junior engineers, whom AI can "level up," making them more effective and efficient.
The team also notes that while AI can be beneficial for less experienced engineers, it is most effective when used in conjunction with skilled developers. This suggests that AI can help bridge the gap between engineers with varying levels of experience, but it cannot replace the expertise and problem-solving abilities of experienced developers. The value of all software projects still lies in the creativity and ingenuity of the humans behind them.
Lowering the Barriers to Entry, Unlocking More Projects
Interestingly, AI is lowering the barrier to entry for software projects, enabling individuals to bootstrap ideas and bring them to fruition more affordably and more rapidly. This highlights AI's potential to democratize innovation and empower a wider range of people to create and develop software.
The Road Ahead
When are we going to hit our goal for 90% code generation?
I don't know, honestly. But I do know we're going to learn a ton trying to get there. The goal isn't to eliminate human developers—that's missing the point entirely. The goal is finding that sweet spot where AI handles the grunt work while humans focus on the creative, complex, and quintessentially human aspects of building great software.
This isn't just PromptOwl's journey—it's a preview of where the entire industry is heading. Some companies will resist it. Others will embrace it halfheartedly. We're going all-in, documenting what works, what breaks, and sharing it all as we go.
Want to follow along? We'll be publishing our learnings, failures (plenty of those!), and breakthroughs as we continue on our Path to 90%. Because whatever percentage we ultimately reach, the insights along the way might just change how you think about building software too.
Let's learn together! Share your best AI coding tips and tricks in the comments and help us on our Path to 90%.