Artificial intelligence (AI) is here, and here to stay. Generative AI (GenAI), powered by Large Language Models (LLMs), is rapidly transforming software engineering, from code generation to test automation. Tools that promise to boost productivity and reduce development costs are abundant. It is clear that AI is increasingly reshaping many aspects of daily life, not least the work of software engineers. Was the Dutch newspaper Volkskrant right when it stated in a recent headline: “Programming has had its day: 'If you program by hand in a year, you'll be a dinosaur'”?

GenAI: Changing the Scene, Supporting Productivity Boosts

In recent years, AI has left a clear mark on software development. With AI, developers can streamline routine tasks, improving both productivity and efficiency. For instance, AI-powered solutions can generate code snippets, perform code refactoring, and aid in identifying and resolving bugs. These capabilities save valuable time and allow teams to focus on the more complex parts of development. AI is also reshaping DevOps practices and continuous integration/continuous delivery (CI/CD) pipelines. By analyzing code changes, test results, and production data, AI techniques yield insights into performance, quality, and potential issues, refining the software development lifecycle, deployment procedures, and overall product quality.

While the continued evolution and improvement of these technologies seem inevitable, current GenAI tools are not without limitations. Beneath the surface lies a complex reality: questions around code quality, software architecture, security and other non-functional requirements, and ultimately the future role of human developers.

Code Generation: Fast, but Not (Yet) Perfect

LLMs can generate source code from natural language prompts, making them powerful assistants for software developers. These models are trained on vast amounts of data, including source code from publicly available sources such as public repositories on GitHub. As such, they can be regarded as a representation of the authors behind that source code – including their mistakes.

Benchmarks like HumanEval and SWE-bench have been developed to measure the (functional) correctness of code generated by LLMs. Recent studies show that correctness is far from guaranteed. While most recent models achieve 96%+ on the base HumanEval tests, scores on SWE-bench – which features more complex tasks – drop to around 75%, even for top-performing models like OpenAI’s GPT-5, xAI’s Grok 4 and Anthropic’s Claude Opus 4.1. This does not indicate model failure; rather, it highlights the inherent complexity of real-world programming tasks, where edge cases and unexpected inputs are the norm rather than the exception.
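
To make concrete what such a benchmark actually measures, the sketch below shows, in highly simplified form, how a HumanEval-style harness evaluates functional correctness: a model-generated candidate solution is executed against a set of hidden unit tests, and the task only counts as solved if every test passes. The problem, candidate function, and tests are invented for illustration and are not taken from the benchmark itself.

  # Minimal, illustrative HumanEval-style check. running_max stands in for a
  # model-generated candidate; the test cases play the role of the hidden
  # unit tests that decide whether the task counts as solved.
  def running_max(numbers):
      """Return a list where element i is the maximum of numbers[:i+1]."""
      result, current = [], float("-inf")
      for n in numbers:
          current = max(current, n)
          result.append(current)
      return result

  def check_candidate(candidate):
      tests = [
          ([1, 3, 2, 5, 4], [1, 3, 3, 5, 5]),
          ([], []),
          ([-2, -5, -1], [-2, -2, -1]),
      ]
      try:
          return all(candidate(inp) == expected for inp, expected in tests)
      except Exception:
          return False  # a crash counts as a failed task

  print("solved" if check_candidate(running_max) else "not solved")

Benchmarks such as HumanEval then report the share of problems for which at least one generated candidate passes all tests, commonly expressed as pass@k.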

The Konwinski Prize: A Reality Check for AI Coding

Perplexity co-founder Andy Konwinski recognized a flaw in current benchmarks: because LLMs are trained on existing data, such as code fragments, they may also have been trained on the very problems contained in those benchmarks, much as a student could study for a test they already have the answers to. To test the real-world capabilities of AI in software engineering, the Konwinski Prize (K-Prize) was launched in 2025 by the Laude Institute, Databricks, and Konwinski. Unlike traditional benchmarks, the K-Prize uses a ‘contamination-free’ methodology, sourcing GitHub issues flagged only after models have been submitted, preventing them from leaking into the training data.

The results were sobering: the winning entry solved just 7.5% of the coding challenges. This stands in stark contrast to the inflated scores on older benchmarks like SWE-bench and HumanEval. Konwinski summed it up:

If we can’t even get more than 10% on a contamination-free SWE-bench, that’s the reality check for me.

While the K-Prize results highlight the current limitations of AI in tackling novel, real-world coding challenges, they also provide valuable insights that drive future innovation. Each benchmark, even those with modest scores, helps the community identify gaps and accelerate progress. The journey toward robust AI coding assistants is ongoing, and every challenge overcome brings us closer to that goal. For now, human developers remain indispensable in this area.

Beyond Functionality: The Hidden Costs of AI-Generated Code

Maintainable code is not just about whether the code works today; it is about whether it can be understood, extended, and debugged months or years from now by the original author or someone else (or an AI agent). While LLMs can generate functional code quickly, maintainability often suffers, especially when code is produced without sufficient human oversight (also known as ‘vibe coding’). Without safeguards in place, the generated code tends to exhibit issues such as:

  • Inconsistent structure and style: LLMs are trained on a vast and diverse corpus of code, which means they may mix paradigms, naming conventions, or formatting styles within a single output. This inconsistency can make the codebase harder to read and unify, especially in teams with established standards.
  • Poor alignment with architectural principles: LLMs operate at the level of local code generation. They do not have a holistic view of the system architecture. As a result, they may introduce tightly coupled components, violate separation of concerns, or bypass established design patterns, leading to architectural erosion.
  • Opaque logic and ‘black box’ behavior: Generated code may be syntactically correct but semantically unclear. Developers may find that the code executes correctly but lacks transparency or clear logic, increasing the risk of bugs, complicating debugging, and undermining trust in the codebase.
  • Lack of documentation and rationale: Unless explicitly prompted, LLMs rarely include meaningful comments or explanations. Even when they do, the comments may be superficial or incorrect. This absence of context makes it difficult for future developers to understand the intent behind the code (see the sketch after this list).
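
To illustrate the last two points, the sketch below contrasts a fabricated but functionally correct snippet, written in the terse style generated code often takes, with an equivalent version that a maintainer can actually work with. Neither version comes from a specific model; both are purely illustrative.

  # Functionally correct but opaque: terse names, a magic number, no stated intent.
  def proc(d, t=86400):
      return [x for x in d if x[1] > t]

  # The same logic made maintainable: naming, units, and intent are explicit.
  SECONDS_PER_DAY = 86_400

  def sessions_longer_than(sessions, threshold_seconds=SECONDS_PER_DAY):
      """Return the (session_id, duration_seconds) tuples exceeding the threshold."""
      return [s for s in sessions if s[1] > threshold_seconds]

  sessions = [("a", 120), ("b", 90_000)]
  assert proc(sessions) == sessions_longer_than(sessions) == [("b", 90_000)]

Both functions return the same result; only the second tells the next developer (or reviewing agent) why.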

Security remains one of the most pressing concerns in the use of LLMs for code generation. According to the 2025 Veracode GenAI Code Security Report, nearly 45% of AI-generated code samples contained known vulnerabilities, including SQL injection flaws, insecure cryptographic implementations, and improper input validation. These are not edge cases – they are foundational security issues that can expose systems to serious threats.

These vulnerabilities stem from how LLMs are trained: they learn from vast public code repositories that contain both good and bad practices. Without a built-in understanding of secure coding principles, LLMs may replicate outdated or risky patterns. Moreover, because the generated code often appears syntactically correct and well-structured, developers may overlook subtle security flaws during review.
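
To make the most common finding in that report concrete, the sketch below contrasts the string-built SQL statement that generated code frequently reproduces with the parameterized alternative that secure-coding prompts and reviews are meant to enforce. It uses Python's standard sqlite3 module purely as an example.

  import sqlite3

  # Toy in-memory database, purely for demonstration.
  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
  conn.execute("INSERT INTO users VALUES ('alice', 'admin'), ('bob', 'user')")

  user_input = "' OR '1'='1"  # attacker-controlled value

  # Vulnerable pattern often seen in generated code: the input is concatenated
  # into the SQL string, so the attacker can rewrite the WHERE clause.
  unsafe = f"SELECT name FROM users WHERE name = '{user_input}'"
  print(conn.execute(unsafe).fetchall())               # returns every user

  # Parameterized query: the database driver treats the input strictly as data.
  safe = "SELECT name FROM users WHERE name = ?"
  print(conn.execute(safe, (user_input,)).fetchall())  # returns no rows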

Ensuring Quality in LLM-Generated Code

It is overly simplistic to claim that LLMs inherently produce unmaintainable and insecure code. In practice, the quality of the output depends heavily on how the model is prompted. Developers who provide clear instructions, enforce coding standards through prompt engineering, and use structured templates can guide LLMs to produce cleaner, more maintainable code. Prompting with explicit instructions such as ‘use parameterized queries’ or ‘follow OWASP guidelines’ can significantly improve the security posture of the generated code.

Moreover, when integrated into a workflow that includes code reviews, automated testing, and linters, GenAI can complement – not compromise – software quality. Organizations should integrate GenAI into a secure software development lifecycle (SDLC), ensuring that AI-generated code passes the same security gates as human-written code, such as static analysis, dependency scanning, and secure code reviews.

One more recent development in this area is ‘swarm coding’, a design pattern in which complex tasks are broken down and assigned to specialized agents that generate code and critique one another’s output, mirroring how human development teams operate. Early research into this setup shows quality improvements, but also highlights important caveats: if the critiquing agents are themselves imperfect, they may miss issues or introduce new ones, and coordinating multiple agents and their interactions can be complex and computationally expensive. While the technology continues to evolve rapidly, human oversight remains essential.
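
As an illustration, below is a minimal sketch of the generator-critic loop behind swarm coding. The two agent functions are trivial stand-ins rather than calls to a real model or agent framework; what matters is the control flow, in which a coder agent revises its output until the reviewer agent raises no further objections or a round limit is reached.

  # Illustrative generator/critic loop behind 'swarm coding'. The agent
  # functions below are placeholders, not a real model or vendor API.
  def coder_agent(task: str, feedback: list[str]) -> str:
      # A real implementation would call an LLM with the task description
      # plus the reviewers' earlier objections.
      return f"# solution for: {task} (revision {len(feedback)})"

  def reviewer_agent(task: str, code: str) -> list[str]:
      # A real implementation would have one or more critic agents inspect
      # the code and return concrete objections (empty list = approved).
      return [] if "revision 1" in code else ["please address the review comments"]

  def swarm_code(task: str, max_rounds: int = 3) -> str:
      feedback: list[str] = []
      code = ""
      for _ in range(max_rounds):
          code = coder_agent(task, feedback)
          feedback = reviewer_agent(task, code)
          if not feedback:  # no remaining objections from the critics
              break
      return code  # human review and the usual quality gates still apply

  print(swarm_code("parse a CSV file of orders"))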

Professional developers must embrace the rise of GenAI. Equipping them with training in (secure) prompt engineering and GenAI-aware threat modeling is essential. Developers must understand not only how to use these tools, but how to question and validate their outputs. Junior developers may be most vulnerable: LLMs can automate many entry-level tasks, but without mentorship, juniors risk developing shallow skills and producing brittle code. Nonetheless, organizations must continue investing in junior talent to ensure a sustainable pipeline of skilled developers.

Mid-level developers will need to evolve toward AI orchestration, system integration, and design. Their value lies in guiding AI tools and ensuring alignment with business goals. Senior developers will lead in governance, architecture, and mentoring, using GenAI to accelerate delivery while maintaining quality and security.

While these practices are crucial for professional developers, it is equally important to consider the increasing role of citizen developers in the software development landscape. As with any powerful tool (including a car), responsible use requires training and oversight (a driver’s license) before granting full autonomy. While the democratization of development through GenAI can accelerate innovation, it also introduces (hidden) risks around security, maintainability, and compliance. Organizations must ensure that these users are supported with guardrails, training, and oversight to prevent unintended consequences and shadow IT.

The golden rule may be to use GenAI to speed up implementation, but to leave core design decisions, such as architectural choices, to human experts. Tasks like writing boilerplate code, refactoring, and developing prototypes or proofs of concept benefit most from the efficiency that AI brings to software engineering. In contrast, system architecture, security-critical design decisions, and complex business logic are often best left to human engineers. This approach ensures that while AI enhances productivity, critical aspects of software development remain under the careful scrutiny of experienced professionals.

Will AI Replace Developers?

Rather than eliminating jobs, GenAI is expected to reshape roles, with a growing demand for ‘AI engineers’ who combine software development, data science, and machine learning expertise. Ultimately, LLMs are tools. Their effectiveness depends not just on their capabilities, but on how thoughtfully they are used.

For C-level executives, the message is clear: GenAI is not a replacement for human talent; it is a catalyst for transformation. And as with every transformation, AI in software engineering requires investment: in upskilling across all levels of engineering, in establishing governance frameworks for AI-generated code, and in (re)defining roles to include AI fluency and oversight.

With the right technical, procedural, and educational safeguards, GenAI will be a secure and scalable asset in modern software engineering. In a way, Volkskrant was correct: the future of software engineering will be AI-augmented, not AI-dominated. Human creativity, judgment, and collaboration remain irreplaceable. As Harvard professor Karim Lakhani put it back in 2023: “AI Won’t Replace Humans – But Humans With AI Will Replace Humans Without AI”.

If you are interested in exploring how GenAI can transform your software development processes, please feel free to reach out to Dennis Stam. At KPMG's Digital Advisory, we are committed to helping you navigate the complexities of AI integration and achieve your business goals. Let's innovate together!