Writing Software: LLMs Can't Do It Alone
Luis Galeas
Jul 17, 2025
4 Mins Read



LLMs are useful; that is indisputable. But in the context of building software, how useful are they? Like most software questions, the answer is "it depends". In practice, the impact of LLMs ranges from actively harmful to transformative [1][2][3].
Different levels of productivity
So what makes the difference? Why do some engineers or teams get better results than others? We have been speaking to engineers, team leads, and CTOs, and we've noticed a pattern. Can you spot it below?
Worse Off Overall (AI = slower development)
- Security vulnerabilities from unchecked AI code
- Mountains of spaghetti from "vibe coding"
- Technical debt through broken third-party integrations
- Debugging suggestions that cost more time than reading documentation

Modest Gains (10-50% faster)
- Code autocomplete and boilerplate
- Syntax fixes and variable renaming
- Database query suggestions
- Dependency upgrades

Real Impact (2-3x productivity, and beyond)
- Test generation from natural language specs (see the sketch after this list)
- Documentation that updates with code changes
- Meeting summaries → draft specifications
- Drafts for full-scale language migrations
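To make the last category concrete, here is a minimal sketch of test generation from a natural-language spec. The spec, prompt, and `call_llm` helper are illustrative assumptions, not a prescribed implementation; in practice you would swap in whichever LLM API your team uses.

```python
# Sketch: turning a natural-language spec into a draft test suite.
# `call_llm` is a stand-in for your LLM provider's API; here it returns a
# canned response so the example runs end to end.

SPEC = """
Users can reset their password via an emailed one-time link.
The link expires after 30 minutes and can only be used once.
"""

PROMPT_TEMPLATE = """You are generating pytest tests.
Spec:
{spec}

Write focused test functions covering the behaviour above.
Return only Python code.
"""


def call_llm(prompt: str) -> str:
    # Replace with a real API call (OpenAI, Anthropic, a local model, etc.).
    return (
        "def test_reset_link_expires_after_30_minutes():\n"
        "    ...\n\n"
        "def test_reset_link_is_single_use():\n"
        "    ...\n"
    )


def draft_tests(spec: str) -> str:
    return call_llm(PROMPT_TEMPLATE.format(spec=spec))


if __name__ == "__main__":
    # The output is a draft: a human still reviews it for edge cases
    # before it lands in the test suite.
    print(draft_tests(SPEC))
```

The point is the division of labour: the model produces the tedious first draft, and an engineer reviews and hardens it.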
How to be productive with LLMs?
So what makes the difference? If you want to be more productive, keep humans in the loop. Don't use LLMs blindly; use them for the tasks they excel at. For example, LLMs excel at generating draft code that follows existing patterns without getting tired, while humans are better at reviewing that code and catching edge cases. Humans shine at translating vague product requirements into concrete specifications; LLMs are better at reviewing code when the job is checking it against a long list of internal naming conventions. In other words, match the tool to the task.
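As one concrete example of the second kind of task, here is a minimal sketch of using an LLM to check a change against a team's naming conventions. The conventions, the code under review, and the `call_llm` helper are all illustrative assumptions; swap in your own rules and your provider's API.

```python
# Sketch: LLM-assisted review against internal naming conventions.
# `call_llm` is again a stand-in for your provider's API and returns a
# canned answer here so the example runs on its own.

NAMING_CONVENTIONS = [
    "Functions use snake_case.",
    "Boolean variables start with is_, has_, or should_.",
    "Database tables are plural nouns.",
]

CODE_UNDER_REVIEW = """
def FetchUser(active):
    return db.query("SELECT * FROM user WHERE active = ?", active)
"""


def call_llm(prompt: str) -> str:
    # Replace with a real API call; the canned reply mirrors the kind of
    # output a model typically returns for this prompt.
    return (
        "- `FetchUser` should be `fetch_user` (functions use snake_case)\n"
        "- `active` should be `is_active` (boolean naming)\n"
        "- table `user` should be `users` (plural table names)\n"
    )


def review_against_conventions(code: str, conventions: list[str]) -> str:
    prompt = (
        "Check the following code against each convention and list violations.\n\n"
        "Conventions:\n" + "\n".join(f"- {c}" for c in conventions)
        + "\n\nCode:\n" + code
    )
    return call_llm(prompt)


if __name__ == "__main__":
    print(review_against_conventions(CODE_UNDER_REVIEW, NAMING_CONVENTIONS))
```

A human still decides which of the flags matter; the model just does the tireless part of walking every line against every rule.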
Will we always need humans in the loop?
Part of the software community argues that humans will eventually be removed from the equation. After all, we've made real progress since the landmark paper "Attention is all you need" [4]. Extended thinking, longer context windows, benchmark-crushing models—it's all impressive. But the data tells a different story. Even with extended thinking, multi-LLM voting, and many other clever techniques, there are problems where humans or traditional computation perform better [5][6][7][8].
Humans will not be displaced in a few months when foundation models receive new optimizations or additional GPUs. Despite optimizations appearing every month [9][10], the attention-based architecture of LLMs still requires computation proportional to the square of the input length (i.e., a lot). That architecture also means humans keep an advantage in novel situations not found in the training data, and novel situations occur constantly when writing software. It means humans keep an advantage when a solution requires reasoning about relationships among more than two concepts: LLMs are good at relating two ideas, but struggle with three or more. Thus, tasks such as understanding complex relationships, managing large codebases, handling ambiguous requirements, and advising stakeholders will remain the domain of humans for a long time.
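To make "proportional to the square of the input length" concrete, here is a back-of-the-envelope sketch. The token counts are arbitrary examples, and real systems apply the kinds of optimizations cited above, so treat this as intuition rather than a measurement.

```python
# Back-of-the-envelope: self-attention work scales with the square of the
# context length, so doubling the input roughly quadruples attention compute.

def relative_attention_cost(tokens: int, baseline: int = 8_000) -> float:
    """Cost of attending over `tokens` relative to an 8k-token baseline."""
    return (tokens / baseline) ** 2

for n in (8_000, 16_000, 32_000, 128_000):
    print(f"{n:>7} tokens -> {relative_attention_cost(n):6.0f}x the baseline attention compute")
```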
And to those who dispute the data behind these assertions [11][12]: how soon would you let an LLM fully take on the job of your president, or of your CEO? Probably never.
What does this mean for tech leaders?
If you want transformative productivity gains, get good at distinguishing where humans shine and where LLMs excel. This proposition isn't romanticizing humans—it's being realistic about our tools.
Practice making that distinction, not just yourself but across your tech team. Build intuition for where LLMs will improve rapidly and where they'll struggle for years. Use LLMs where they create real leverage: generating drafts, summarizing conversations, producing boilerplate, checking code against established patterns, and drafting test suites. Keep humans where they matter: in code review, understanding business context, and making architecture decisions.
We're doing this and more at Ambar; our productivity gains go beyond 10x. If you'd like to learn how we do it, get in touch.
References
[1] C. Ebert and P. Louridas, 'Generative AI for Software Practitioners', IEEE Softw., vol. 40, no. 4, pp. 30–38, Jul. 2023.
[2] M. Coutinho, L. Marques, A. Santos, M. Dahia, C. Franca, and R. de S. Santos, 'The role of generative AI in software development productivity: A pilot case study', arXiv [cs.SE], 2024.
[3] M. Alenezi and M. Akour, 'AI-driven innovations in software engineering: A review of current practices and future directions', Appl. Sci. (Basel), vol. 15, no. 3, p. 1344, Jan. 2025.
[4] A. Vaswani et al., 'Attention is all you need', arXiv [cs.CL], 2017.
[5] P. Shojaee, I. Mirzadeh, K. Alizadeh, M. Horton, S. Bengio, and M. Farajtabar, 'The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity', arXiv [cs.AI], 2025.
[6] K. Huang et al., 'MATH-perturb: Benchmarking LLMs' math reasoning abilities against hard perturbations', arXiv [cs.LG], 2025.
[7] Z. Yuan, H. Yuan, C. Tan, W. Wang, and S. Huang, 'How well do Large Language Models perform in Arithmetic tasks?', arXiv [cs.CL], 2023.
[8] L. Chen et al., 'Are more LLM calls all you need? Towards scaling laws of compound inference systems', arXiv [cs.LG], 2024.
[9] G. Xiao et al., 'DuoAttention: Efficient long-context LLM inference with retrieval and Streaming Heads', arXiv [cs.CL], 2024.
[10] A. Kiruluta, P. Raju, and P. Burity, 'Breaking quadratic barriers: A non-attention LLM for ultra-long context horizons', arXiv [cs.LG], 2025.
[11] A. Lawsen, 'Comment on the illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity', arXiv [cs.AI], 2025.
[12] S. Khan, S. Madhavan, and K. Natarajan, 'A comment on "The Illusion of thinking": Reframing the reasoning cliff as an agentic gap', arXiv [cs.AI], 2025.