A study by Scale AI and the Center for AI Safety has found that modern AI systems cannot yet fully replace specialists in design, programming, and analytics, The Washington Post reports.
The researchers tested ChatGPT, Gemini, and Claude on hundreds of real-world freelance projects. The assignments ranged widely, from creating 3D animations and web games to formatting scientific articles and building analytical dashboards.
The results were modest. The best-performing model completed only 2.5% of the tasks to an acceptable standard. Almost half of the projects were delivered at low quality, and about a third were left unfinished. In many cases, the AI produced corrupted files or ignored key client requirements. Even results that looked plausible turned out, on closer inspection, to contain critical errors.
Problems also showed up in specific fields. In interior design tests, the AI produced a realistic-looking floor plan that was technically incorrect and lacked the necessary detail. In data analysis, it mixed up colors, overlaid text on graphs, and omitted entire countries from visualizations. In game development, the system built a working product but completely ignored the assigned theme: instead of a game about brewing, it produced an abstract project.
One of the study's authors, Jason Hausenloy, attributes these results to two key limitations. First, modern chatbots lack long-term memory, so they do not learn from their mistakes over the course of long projects. Second, they struggle with visual understanding: when creating 3D models, they work mainly through code rather than through a full visual interface.
At the same time, the researchers note gradual progress. In November 2025, for example, the Gemini 3 Pro model completed 1.3% of the tasks, while its predecessor managed only 0.8%.
Despite growing AI autonomy, a full replacement of human specialists remains unlikely in the near future. The economic appeal is obvious: having a human create a game cost about $1,485, while running Claude Sonnet on the task cost less than $30. Even so, the gap in quality still makes human labor indispensable.

