Automated web development with artificial intelligence?

[Image: Front-end development with AI]
Hernán Sosa at Digital Jump

Author: Hernán Ariel

Web Developer

Day by day we see how artificial intelligence is sweeping the world. Its problem-solving capabilities keep growing, to the point that it can carry out complex processes that were once thought impossible. Of course, these technologies are still far from perfect. As we saw in the article “The power of artificial intelligence and human creativity in content generation”, human input is still important for making adjustments.

But as with everything related to technology, what we said yesterday may no longer hold today. In this article we will see whether that is the case.

A paper titled “How far are we from automating front-end engineering?” was recently published on arXiv. It studies how effectively GPT-4V, compared with other generative models, can build websites automatically. According to the paper, for simple websites, the pages generated by GPT-4V could replace the original hand-coded ones in 49% of cases and were considered better than the originals in 64% of cases.

Method Overview

To measure performance on the task, the authors selected 484 diverse real-world web pages. They developed a set of automatic evaluation metrics that capture both high-level visual similarity (using CLIP embeddings) and low-level element matching (bounding-box matches, text content, and the position and color of matching visual elements).

The same prompt was used throughout, as shown below:

[Image: AI prompt]

Regarding performance measurement, the authors use the following metrics:

High-level visual similarity:

  • CLIP Similarity: Measures the similarity between the screenshot of the reference web page and the screenshot of the generated web page.
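
As a rough illustration of how two screenshot embeddings can be compared, here is a minimal sketch in Python. It assumes the similarity is the cosine similarity between the two embedding vectors, which is the usual way to compare CLIP embeddings but is not spelled out in this article. The function name and the toy vectors are hypothetical; real CLIP embeddings have hundreds of dimensions and come from a CLIP image encoder, which is not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for real CLIP image embeddings
# of the reference screenshot and the generated screenshot.
reference_embedding = [0.2, 0.7, 0.1]
generated_embedding = [0.25, 0.65, 0.05]

score = cosine_similarity(reference_embedding, generated_embedding)
print(f"CLIP-style similarity: {score:.3f}")
```

A score close to 1 means the two screenshots point in nearly the same direction in embedding space, that is, they look alike at a high level.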

Low-level element matching:

  • Block Matching: Measures the total size of the blocks of matching visual elements between the reference and generated web pages, relative to the total size of all blocks (matched and unmatched). This evaluates whether all visual elements are reproduced correctly, with no important elements missing.
  • Text: Measures the character-level similarity between the text content of the matching blocks in the reference and generated web pages.
  • Position: Calculates how close the matching blocks are in position by comparing the normalized coordinates of their centers.
  • Color: Uses the CIEDE2000 formula to evaluate the perceptual difference between the colors of the matching text blocks in the reference and generated web pages.
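
To make the first three element-matching metrics more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Block` class, the function names, and the sample blocks are hypothetical, and the exact formulas are not given in this article. For the text score we assume a Sørensen-Dice coefficient over character sets, and for the position score we assume one minus the largest difference between the normalized center coordinates. The color score is left out because the CIEDE2000 formula is too long for a short example.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A visual element; coordinates are normalized to the range [0, 1]."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height

    @property
    def area(self):
        return self.w * self.h

    @property
    def center(self):
        return (self.x + self.w / 2, self.y + self.h / 2)


def block_match_score(matched_pairs, unmatched_blocks):
    """Area of matched blocks divided by the area of all blocks."""
    matched_area = sum(ref.area + gen.area for ref, gen in matched_pairs)
    total_area = matched_area + sum(b.area for b in unmatched_blocks)
    return matched_area / total_area if total_area else 0.0


def text_similarity(ref_text, gen_text):
    """Assumed formula: Sørensen-Dice coefficient over character sets."""
    a, b = set(ref_text), set(gen_text)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def position_similarity(ref, gen):
    """Assumed formula: 1 minus the largest center offset on either axis."""
    (rx, ry), (gx, gy) = ref.center, gen.center
    return 1 - max(abs(rx - gx), abs(ry - gy))


# Hypothetical example: the header was reproduced, the footer was not.
header_ref = Block("Welcome", 0.0, 0.0, 1.0, 0.1)
header_gen = Block("Welcome!", 0.0, 0.0, 1.0, 0.1)
footer_ref = Block("Contact", 0.0, 0.9, 1.0, 0.1)

block_score = block_match_score([(header_ref, header_gen)], [footer_ref])
text_score = text_similarity(header_ref.text, header_gen.text)
position_score = position_similarity(header_ref, header_gen)
print(block_score, text_score, position_score)
```

In this toy case the missing footer lowers the block-matching score, while the text and position scores of the matched header stay high. That is the kind of per-dimension breakdown these metrics are meant to give.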

The authors intentionally do not combine these metrics into an aggregate score, as they are designed as detailed diagnostic scores and, ideally, models should score well on all dimensions. CLIP Similarity captures high-level visual resemblance, while the element-matching metrics provide a detailed breakdown of performance across different aspects of web page generation.


According to these metrics, GPT-4V performs better on this task than the other generative AI models tested.

[Image: Benchmark of the AI models]

In the human evaluation, the authors find that in 49% of cases, web pages generated by GPT-4V can replace the original reference web pages in terms of visual appearance and content. Surprisingly, in 64% of cases, web pages generated by GPT-4V are considered better than the original reference web pages.

What conclusions do we draw from this article?

First of all, we must bear in mind that, by our criteria, the reference websites are not necessarily of medium or high complexity.

[Image: Comparison of the reference vs. the AI's final result]

When we analyze some of these cases, we see that they are very simple websites that do not necessarily pose a serious challenge for an artificial intelligence, much less for an experienced developer.

[Image: AI reference models]

Although GPT-4V's ability to carry out this task is notable, we believe we are still far from having to consider these advances a “threat.” To make the study more informative, perhaps tests should be run on more complex sites that really challenge GPT-4V, so that we can see whether it is truly capable of replacing a front-end engineer.

And you, what do you think? Will the day come when artificial intelligence replaces us? We would love to read your thoughts.