Design2Code：前端离失业还有多远

最近，就看了不少关于AI做前端还原的一些文章，之前写过一个URL就把别人网址复制了的这种耸人听闻的文章，根据里面的原理介绍，想必读过的人也知道，这种方式的弊端。那就是copy别人的网站虽然是容易的，但是AI写的代码是非常缺乏维护性的，就连最基本的列表，他都不是list.map(it⇒item)的方式去写，而是呆板的一个一个去写。

自动化前端工程是否出现了新的转机

今天看了一个github上开源的工程，Design2Code：https://github.com/NoviScl/Design2Code，那么，这个方式实现的自动化前端工程，是否是我们的前端开发小伙伴们的诺亚方舟(坟墓)呢？让我们一起来揭开他神秘的面纱吧。

这个开源的项目相关联的是这篇论文 https://arxiv.org/pdf/2403.03163.pdf ，他是这篇论文中的代码实践部分，因此我么通常可以直接看论文去了解他的原理，和他实现的效果，他既然敢公布测试的代码，那说明这篇论文里面的数据是比较可信的。

这篇论文的作者是4个大佬，分别是：

他们研究的主要目标就是，根据网页设计的屏幕截图自动生成能够渲染出该网页的HTML/CSS代码。他们的主要工作和贡献如下:

正式定义了Design2Code任务,并构建了包含484个真实网页的测试集作为评测基准。论文详细介绍了数据集的构建过程。
开发了一套自动评价指标,包括高层次的视觉相似度(CLIP相似度)以及细粒度的元素匹配(如文本匹配、布局匹配等)。
提出了多模态提示增强方法,如文本增强提示和自修订提示,用于提高商业大模型(GPT-4V、Gemini)在该任务上的表现。
在开源模型CogAgent-18B基础上,进行了专门的微调,得到Design2Code-18B模型,其性能可以与商业Gemini模型相媲美。
通过人工评估和自动指标,发现GPT-4V在该任务上表现最佳。人工评估显示GPT-4V生成的49%网页足以替代原始参考网页,且64%网页的设计被评为比原始参考更好。
细粒度分析表明,开源模型在召回输入网页的视觉元素和生成正确布局设计方面还有待提高,而文本内容和色彩等方面可通过微调得到极大改善。

下面，我们就看看他论文里的一些数据了

基准性能：Automatic Metrics

对于自动评估，他们考虑了高级视觉相似性（CLIP）和低级元素匹配（块匹配、文本、位置、颜色）。沿着这些不同的维度比较了所有基准模型。

可以发现，GPT-4V依然是遥遥领先的，不过，他们训练的模型倒是比Gemini Pro要略微强那么一些。

基准性能：Human Evaluation(人工评估)

那么这几种方式的实现代码是怎么样的呢？其实我们通过了解prompt就ok了，一下三个是从代码仓库中找到的，源码在这里：

https://github.com/NoviScl/Design2Code/blob/main/Design2Code/prompting/gpt4v.py

直接提示法（Direct Prompting）

python 复制代码

def direct_prompting(openai_client, image_file):
	'''
	{original input image + prompt} -> {output html}
	'''

	## the prompt 
	direct_prompt = ""
	direct_prompt += "You are an expert web developer who specializes in HTML and CSS.\n"
	direct_prompt += "A user will provide you with a screenshot of a webpage.\n"
	direct_prompt += "You need to return a single html file that uses HTML and CSS to reproduce the given website.\n"
	direct_prompt += "Include all CSS code in the HTML file itself.\n"
	direct_prompt += "If it involves any images, use \"rick.jpg\" as the placeholder.\n"
	direct_prompt += "Some images on the webpage are replaced with a blue rectangle as the placeholder, use \"rick.jpg\" for those as well.\n"
	direct_prompt += "Do not hallucinate any dependencies to external files. You do not need to include JavaScript scripts for dynamic interactions.\n"
	direct_prompt += "Pay attention to things like size, text, position, and color of all the elements, as well as the overall layout.\n"
	direct_prompt += "Respond with the content of the HTML+CSS file:\n"
	
	## encode image 
	base64_image = encode_image(image_file)

	## call GPT-4V
	html, prompt_tokens, completion_tokens, cost = gpt4v_call(openai_client, base64_image, direct_prompt)

	return html, prompt_tokens, completion_tokens, cost

文本增强提示法（Text Augmented Prompting）

python 复制代码

def text_augmented_prompting(openai_client, image_file):
	'''
	{original input image + extracted text + prompt} -> {output html}
	'''

	## extract all texts from the webpage 
	with open(image_file.replace(".png", ".html"), "r") as f:
		html_content = f.read()
	texts = "\n".join(extract_text_from_html(html_content))

	## the prompt
	text_augmented_prompt = ""
	text_augmented_prompt += "You are an expert web developer who specializes in HTML and CSS.\n"
	text_augmented_prompt += "A user will provide you with a screenshot of a webpage, along with all texts that they want to put on the webpage.\n"
	text_augmented_prompt += "The text elements are:\n" + texts + "\n"
	text_augmented_prompt += "You should generate the correct layout structure for the webpage, and put the texts in the correct places so that the resultant webpage will look the same as the given one.\n"
	text_augmented_prompt += "You need to return a single html file that uses HTML and CSS to reproduce the given website.\n"
	text_augmented_prompt += "Include all CSS code in the HTML file itself.\n"
	text_augmented_prompt += "If it involves any images, use \"rick.jpg\" as the placeholder.\n"
	text_augmented_prompt += "Some images on the webpage are replaced with a blue rectangle as the placeholder, use \"rick.jpg\" for those as well.\n"
	text_augmented_prompt += "Do not hallucinate any dependencies to external files. You do not need to include JavaScript scripts for dynamic interactions.\n"
	text_augmented_prompt += "Pay attention to things like size, text, position, and color of all the elements, as well as the overall layout.\n"
	text_augmented_prompt += "Respond with the content of the HTML+CSS file (directly start with the code, do not add any additional explanation):\n"

	## encode image 
	base64_image = encode_image(image_file)

	## call GPT-4V
	html, prompt_tokens, completion_tokens, cost = gpt4v_call(openai_client, base64_image, text_augmented_prompt)

	return html, prompt_tokens, completion_tokens, cost

视觉修订提示法（Visual Revision Prompting）

python 复制代码

def visual_revision_prompting(openai_client, input_image_file, original_output_image):
	'''
	{input image + initial output image + initial output html + oracle extracted text} -> {revised output html}
	'''

	## load the original output
	with open(original_output_image.replace(".png", ".html"), "r") as f:
		original_output_html = f.read()

	## encode the image 
	input_image = encode_image(input_image_file)
	original_output_image = encode_image(original_output_image)

	## extract all texts from the webpage 
	with open(input_image_file.replace(".png", ".html"), "r") as f:
		html_content = f.read()
	texts = "\n".join(extract_text_from_html(html_content))

	prompt = ""
	prompt += "You are an expert web developer who specializes in HTML and CSS.\n"
	prompt += "I have an HTML file for implementing a webpage but it has some missing or wrong elements that are different from the original webpage. The current implementation I have is:\n" + original_output_html + "\n\n"
	prompt += "I will provide the reference webpage that I want to build as well as the rendered webpage of the current implementation.\n"
	prompt += "I also provide you all the texts that I want to include in the webpage here:\n"
	prompt += "\n".join(texts) + "\n\n"
	prompt += "Please compare the two webpages and refer to the provided text elements to be included, and revise the original HTML implementation to make it look exactly like the reference webpage. Make sure the code is syntactically correct and can render into a well-formed webpage. You can use \"rick.jpg\" as the placeholder image file.\n"
	prompt += "Pay attention to things like size, text, position, and color of all the elements, as well as the overall layout.\n"
	prompt += "Respond directly with the content of the new revised and improved HTML file without any extra explanations:\n"

	html, prompt_tokens, completion_tokens, cost = gpt4v_revision_call(openai_client, input_image, original_output_image, prompt)

	return html, prompt_tokens, completion_tokens, cost

那么，这几种方式各有什么样的特点呢？

直接提示法（Direct Prompting）:

这种方法直接使用用户提供的网页截图作为输入，然后根据截图生成HTML+CSS代码。
优点是操作简单，用户只需要提供一张截图。
缺点是因为只依赖于图像信息，可能在文本提取、元素辨识上不够准确，特别是当截图质量不高或者元素细节较多时。

文本增强提示法（Text Augmented Prompting）:

这种方法在直接提示法的基础上增加了从网页中提取的所有文本信息。
用户提供网页截图和相应的所有文本内容，系统根据这些信息生成HTML+CSS代码。
优点是通过添加文本信息可以提高生成代码的准确度，尤其是在处理文本内容和布局时更为精确。
缺点是需要额外的步骤来提取网页文本

视觉修订提示法（Visual Revision Prompting）:

这种方法用于修正已有的HTML实现。它不仅使用了原始输入图像，还使用了初始的输出图像和HTML代码以及从参考网页提取的文本。
用户提供原始网页截图、当前的HTML实现（可能有误的）、以及这个HTML实现渲染的网页截图，系统根据这些信息进行修订和改进。
优点是能够对已有的实现进行针对性的修正，特别适合调整和完善细节，提高最终实现的质量。
缺点是需要更多的输入信息，包括初始的HTML代码和渲染后的截图，操作复杂度较高。

从结果上来看，GPT-4V Self-Revision Prompting的方式效果会更好一些：效果如下图

从图中，我们可以看到，还原度上，是绝对不能说100%的，甚至80%可能多有些勉强了，这对于像素眼的视觉设计师来讲，是万万不能接受的。因此，拿到这份AI自动转化的代码，可能还是需要很多的精力来做调整，谁能保证，比手工自己来写，然后配合copilot"结对编程（哈哈）"更加高效呢？我想那些经验十足的前端开发者们，已经迫不及待想和前端代码自动生成的各种模型来大干一架，让像素眼的设计师们评判一下，到底谁还是这个领域的王者。

总结

虽然，这篇论文中，我们需要肯定了Design2Code的意义，他可以降低前端开发的门槛，但我不认同他可以在短期内就取代前端开发，论文中也对各模型的细粒度表现进行了分析,指出了开源模型的不足之处,如召回输入元素、生成布局等方面有待提高。这个也基本上决定了在自动化前端工程方面，也承认了前端工程自动化还有比较远的路需要走，但是好在，一步一步的看清了方向，就像，10年前，谁会相信GPT这么霸道呢？

未来已来，每一步深入都不孤单。关注微信公众号 "老码沉思录" ，和我一起了解最新的技术