I. Overview
**DeepSeek Janus-Pro** is an advanced multimodal understanding and generation model designed for high-quality text-to-image generation and multimodal understanding. Developed by the DeepSeek team as an upgrade of the earlier Janus model, it handles both text and images: it can understand image content as well as generate images.
Technical highlights
- Decoupled dual-path visual encoding: Janus-Pro uses a dual-path visual encoding design (a conceptual sketch follows this list). The understanding path uses a SigLIP-L encoder to extract high-level semantic features from the image, suited to tasks such as question answering and classification; the generation path uses a VQ tokenizer to turn the image into a sequence of discrete tokens, focusing on fine texture detail and supporting text-to-image generation.
- Unified Transformer architecture: the feature sequences for both tasks are processed by a single shared Transformer, enabling knowledge fusion and task synergy while simplifying the model structure.
- Multimodal alignment: real text-to-image data replaces ImageNet, improving training efficiency and generation quality.
- Expanded training data: newly added data covering complex scenarios such as image captions and tables/charts strengthens the model's generalization.
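As a rough mental model of the decoupled design, here is a purely illustrative sketch; every class and method name in it is hypothetical and does not correspond to the real Janus API:

```python
# Hypothetical sketch of "decoupled visual encoding, shared Transformer".
# None of these names exist in the Janus codebase; they only mirror the idea above.
class JanusLikeModel:
    def __init__(self, siglip_encoder, vq_decoder, shared_transformer):
        self.siglip_encoder = siglip_encoder    # understanding path: continuous semantic features
        self.vq_decoder = vq_decoder            # generation path: discrete VQ tokens -> pixels
        self.transformer = shared_transformer   # one Transformer serves both tasks

    def understand(self, image, question_tokens):
        # Understanding: SigLIP-L image features plus text go through the shared Transformer
        image_features = self.siglip_encoder(image)
        return self.transformer(question_tokens, image_features)

    def generate(self, prompt_tokens):
        # Generation: the shared Transformer predicts discrete image tokens,
        # which the VQ decoder turns back into an image
        image_tokens = self.transformer.predict_image_tokens(prompt_tokens)
        return self.vq_decoder(image_tokens)
```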
Performance
On the MMBench benchmark Janus-Pro scores 79.2, surpassing models such as LLaVA and MetaMorph; on GenEval it scores 0.80, beating DALL-E 3 (0.67) and Stable Diffusion 3 (0.74), and it stands out particularly in detail and aesthetic quality.
Application scenarios
Janus-Pro can accurately identify landmarks (such as the Three Pools Mirroring the Moon at West Lake in Hangzhou) and explain their cultural background, and it can generate images that follow complex instructions (such as illustrations in a specific style or scene designs). It can also describe a picture, recognize landmark sites, read text in an image, and explain the knowledge a picture contains.
Janus GitHub page: https://github.com/deepseek-ai/Janus
II. Hardware Requirements for Running the Janus Models
| Task | Janus-Pro-1B (VRAM) | Janus-Pro-7B (VRAM) |
|------|---------------------|---------------------|
| Image understanding | 5 GB (RTX 3060) | 15 GB (RTX 4080) |
| Image generation | 14 GB (RTX 4080) | 40 GB (RTX 3090/4090 ×2) |
The machine used in this article is the free entry-level instance shown below; it can only handle image understanding with the 7B model and image generation with the 1B model. The 1B model is said to perform poorly, but since this is just a learning exercise, quality is beside the point for now.
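To confirm what the instance actually provides, a quick check of the GPU name and total VRAM (plain PyTorch calls, usable once torch is installed later in this guide) looks like this:

```python
import torch

# Print the GPU model and its total memory so we know which row of the table applies
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GiB")
else:
    print("No CUDA-capable GPU visible")
```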

III. Downloading the Janus Model and Deploying the Project
1. Download the source code
cd /workspace
git clone https://github.com/deepseek-ai/Janus.git
# With the tariff war on, is GitHub next on the chopping block? I have my doubts, but it was painfully slow and the clone never succeeded, so I simply uploaded the code myself.
cd Janus
(base) root@VM-0-80-ubuntu:/workspace# cd Janus
(base) root@VM-0-80-ubuntu:/workspace/Janus# ls
LICENSE-CODE generation_inference.py janus_pro_tech_report.pdf
LICENSE-MODEL images pyproject.toml
Makefile inference.py requirements.txt
demo janus
2. Create the virtual environment and install dependencies
conda create -n janus python=3.10
conda init
source ~/.bashrc
conda activate janus
cd /workspace/Janus
# note the trailing dot in the next command
pip install -e .
pip install flash-attn
# install Jupyter support
conda install ipykernel
conda install ipywidgets
python -m ipykernel install --user --name janus --display-name "Python (janus)"
In Jupyter, select the janus environment (the "Python (janus)" kernel).
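A quick sanity check inside the new kernel (generic Python/PyTorch calls, nothing Janus-specific) confirms the notebook really runs in the janus environment and can see the GPU:

```python
import sys
import torch

print(sys.executable)             # should point inside the janus conda environment
print(torch.__version__)          # torch is pulled in by `pip install -e .`
print(torch.cuda.is_available())  # True if the GPU and driver are visible
```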

3. Download the pretrained models
We create folders under the project's main directory to hold the Janus-Pro-1B and Janus-Pro-7B model weights. Given network conditions in mainland China, downloading the weights directly from ModelScope is recommended.
- Install modelscope:
pip install modelscope
- Create the model folders:
mkdir -p Janus-Pro-1B
mkdir -p Janus-Pro-7B
- Download the Janus-Pro-1B model weights:
modelscope download --model deepseek-ai/Janus-Pro-1B --local_dir ./Janus-Pro-1B
- Download the Janus-Pro-7B model weights:
modelscope download --model deepseek-ai/Janus-Pro-7B --local_dir ./Janus-Pro-7B
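If the CLI is inconvenient, ModelScope also has a Python download API; a minimal sketch, assuming a recent modelscope release where `snapshot_download` accepts `local_dir`:

```python
from modelscope import snapshot_download

# Same effect as the CLI command above: fetch the 1B weights into ./Janus-Pro-1B
snapshot_download("deepseek-ai/Janus-Pro-1B", local_dir="./Janus-Pro-1B")
```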
IV. Calling Janus Locally
1. The Janus-Pro-7B model
Create Janus-Pro-7B.ipynb at /workspace/Janus/Janus-Pro-7B.ipynb:
```python
import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images

# specify the path to the model
model_path = "./Janus-Pro-7B"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

conversation = [
    {
        "role": "User",
        "content": "<image_placeholder>\nConvert the formula into latex code.",
        "images": ["images/equation.png"],
    },
    {"role": "Assistant", "content": ""},
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)

# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
```
Conclusion: my environment (16 GB of VRAM plus 32 GB of RAM) cannot run it, even with the precision already set to float16.
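That outcome matches a back-of-envelope estimate: roughly 7 billion parameters at 2 bytes each in bf16/fp16 is about 13 GiB for the weights alone, before activations and the KV cache, which leaves a 16 GB card with essentially no headroom. If the model does finish loading, the footprint can be checked directly (standard transformers/PyTorch calls, offered as a suggestion rather than part of the original notebook):

```python
# Weight-only estimate: ~7e9 params * 2 bytes ≈ 13 GiB in bf16/fp16
print(f"estimated weights: {7e9 * 2 / 1024**3:.1f} GiB")

# After a successful from_pretrained(), transformers can report the real footprint
print(f"actual footprint:  {vl_gpt.get_memory_footprint() / 1024**3:.1f} GiB")
```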
2. The Janus-Pro-1B model
The image to recognize:

Create Janus-Pro-1B.ipynb at /workspace/Janus/Janus-Pro-1B.ipynb:
```python
import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images

# specify the path to the model
model_path = "./Janus-Pro-1B"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

conversation = [
    {
        "role": "User",
        "content": "<image_placeholder>\nConvert the formula into latex code.",
        "images": ["images/doge.png"],
    },
    {"role": "Assistant", "content": ""},
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)

# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
```
Python version is above 3.10, patching the collections module.
/root/miniforge3/envs/janus/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:602: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Some kwargs in processor config are unused and will not have any effect: add_special_token, image_tag, sft_format, num_image_tokens, mask_prompt, ignore_id.
You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language. User: <image_placeholder> Convert the formula into latex code. Assistant: \begin{equation} A_n = a_0 \begin{bmatrix} 1 & + \frac{3}{4} \sum_{k=1}^{n} \begin{bmatrix} 4 & \\ 9 & \end{bmatrix}^k \end{bmatrix} \end{equation} The LaTeX code for the given formula is: \begin{equation} A_n = a_0 \begin{bmatrix} 1 & + \frac{3}{4} \sum_{k=1}^{n} \begin{bmatrix} 4 & \\ 9 & \end{bmatrix}^k \end{bmatrix} \end{equation}
Image recognition: change the image in the code to the picture below.


Some kwargs in processor config are unused and will not have any effect: add_special_token, image_tag, sft_format, num_image_tokens, mask_prompt, ignore_id.
You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language. User: <image_placeholder> Convert the formula into latex code. Assistant: Sure, I can help you with that. Here's the formula in LaTeX code: \[ \text{Decoupling Visual Encoding} = \text{A} \times \text{B} \] Where: - \( A \) is the number of features in the visual representation. - \( B \) is the number of features in the textual representation. This formula represents the relationship between visual and textual features in visual encoding.
(janus) root@VM-0-80-ubuntu:/workspace/Janus# nvidia-smi
Fri Apr 4 10:26:39 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:09.0 Off | 0 |
| N/A 50C P0 29W / 70W | 7866MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Why is it using more VRAM than what people report online? VRAM, it always comes down to VRAM.
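To see where the memory goes, PyTorch's allocator statistics separate what tensors actually occupy from what nvidia-smi reports (which also counts the CUDA context and the caching allocator's reserved blocks); these are standard torch.cuda calls, offered here as a suggestion:

```python
import torch

print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")    # live tensors
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GiB")     # allocator cache
print(f"peak:      {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
# nvidia-smi adds CUDA context overhead on top of the reserved number,
# which is part of why its figure is higher than expected.
```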
V. The Janus-Pro-7B-8bit Model
Not getting to run the 7B model left me unsatisfied, so I downloaded the Janus-Pro-7B-8bit model from the ModelScope community.
1. Download the model
# download the model
modelscope download --model yh123556/Janus-Pro-7B-8bit --local_dir ./Janus-Pro-7B-8bit
Downloading [model-00001-of-00002.safetensors]: 100%|██████████| 4.64G/4.64G [15:18<00:00, 5.42MB/s]
Processing 11 items: 100%|██████████| 11.0/11.0 [15:18<00:00, 83.5s/it]
2. The code
Create Janus-Pro-7B-8bit.ipynb at /workspace/Janus/Janus-Pro-7B-8bit.ipynb:
```python
import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images

torch.cuda.empty_cache()

# specify the path to the model
model_path = "./Janus-Pro-7B-8bit"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, load_in_8bit=True
)
vl_gpt = vl_gpt.eval()

conversation = [
    {
        "role": "User",
        "content": "<image_placeholder>\nConvert the formula into latex code.",
        "images": ["images/doge.png"],
    },
    {"role": "Assistant", "content": ""},
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device, dtype=torch.float16)

# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
```
The following error appears:
ImportError: Using `bitsandbytes` 8-bit quantization requires the latest version of bitsandbytes: `pip install -U bitsandbytes`
Fix:
pip install -U bitsandbytes
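Note that the output below also warns that `load_in_8bit=True` is deprecated; the currently recommended transformers style is to pass a `BitsAndBytesConfig`, roughly like this (a sketch, not what the notebook above uses):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Replaces the deprecated load_in_8bit=True flag with an explicit quantization config
quant_config = BitsAndBytesConfig(load_in_8bit=True)
vl_gpt = AutoModelForCausalLM.from_pretrained(
    "./Janus-Pro-7B-8bit",
    trust_remote_code=True,
    quantization_config=quant_config,
)
```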
Output:
Python version is above 3.10, patching the collections module.
/root/miniforge3/envs/janus/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:602: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Some kwargs in processor config are unused and will not have any effect: num_image_tokens, add_special_token, image_tag, sft_format, ignore_id, mask_prompt.
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
/root/miniforge3/envs/janus/lib/python3.10/site-packages/transformers/quantizers/auto.py:212: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.
`low_cpu_mem_usage` was None, now default to True since model is quantized.
Loading checkpoint shards: 100% 2/2 [00:09<00:00, 4.48s/it]
You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language. User: <image_placeholder> Convert the formula into latex code. Assistant: Here is the LaTeX code for the given image: ```latex \begin{figure}[h] \centering \includegraphics[width=0.5\textwidth]{image.png} \caption{Decoupling Visual Encoding and Single Visual Encoder} \label{fig:decoupling} \end{figure} ``` This code will generate a figure with the same layout as the provided image.
VI. Calling the Model Through a Gradio Front-End UI
1. Set up the environment
conda create -n janus python=3.9
conda init
source ~/.bashrc
conda activate janus
cd /workspace/Janus
# note the trailing dot in the next command
pip install -e .[gradio]
pip install flash-attn modelscope
pip install -U bitsandbytes
# download the model (skip if already downloaded)
modelscope download --model yh123556/Janus-Pro-7B-8bit --local_dir ./Janus-Pro-7B-8bit
2. Create app_januspro-7b-8bit.py and run it
Modifications relative to the original demo/app_januspro.py:
model_path = "./Janus-Pro-7B-8bit"
vl_gpt = AutoModelForCausalLM.from_pretrained(model_path,
language_config=language_config,
trust_remote_code=True,
load_in_8bit=True)
#全部注释
#if torch.cuda.is_available():
# vl_gpt = vl_gpt.to(torch.bfloat16).cuda()
#else:
# vl_gpt = vl_gpt.to(torch.float16)
prepare_inputs = vl_chat_processor(
conversations=conversation, images=pil_images, force_batchify=True
).to(cuda_device, dtype=torch.float16 if torch.cuda.is_available() else torch.float16)
Full code:
```python
import gradio as gr
import torch
from transformers import AutoConfig, AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images
from PIL import Image

import numpy as np
import os
import time
# import spaces  # Import spaces for ZeroGPU compatibility


# Load model and processor
model_path = "./Janus-Pro-7B-8bit"
config = AutoConfig.from_pretrained(model_path)
language_config = config.language_config
language_config._attn_implementation = 'eager'
vl_gpt = AutoModelForCausalLM.from_pretrained(model_path,
                                              language_config=language_config,
                                              trust_remote_code=True,
                                              load_in_8bit=True)
# if torch.cuda.is_available():
#     vl_gpt = vl_gpt.to(torch.bfloat16).cuda()
#     vl_gpt = vl_gpt.to(torch.float16).cuda()
# else:
#     vl_gpt = vl_gpt.to(torch.float16)

vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
cuda_device = 'cuda' if torch.cuda.is_available() else 'cpu'


@torch.inference_mode()
# @spaces.GPU(duration=120)
# Multimodal Understanding function
def multimodal_understanding(image, question, seed, top_p, temperature):
    # Clear CUDA cache before generating
    torch.cuda.empty_cache()

    # set seed
    torch.manual_seed(seed)
    np.random.seed(seed)
    torch.cuda.manual_seed(seed)

    conversation = [
        {
            "role": "<|User|>",
            "content": f"<image_placeholder>\n{question}",
            "images": [image],
        },
        {"role": "<|Assistant|>", "content": ""},
    ]

    pil_images = [Image.fromarray(image)]
    prepare_inputs = vl_chat_processor(
        conversations=conversation, images=pil_images, force_batchify=True
    ).to(cuda_device, dtype=torch.float16 if torch.cuda.is_available() else torch.float16)

    inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

    outputs = vl_gpt.language_model.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=prepare_inputs.attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=512,
        do_sample=False if temperature == 0 else True,
        use_cache=True,
        temperature=temperature,
        top_p=top_p,
    )

    answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
    return answer


def generate(input_ids,
             width,
             height,
             temperature: float = 1,
             parallel_size: int = 5,
             cfg_weight: float = 5,
             image_token_num_per_image: int = 576,
             patch_size: int = 16):
    # Clear CUDA cache before generating
    torch.cuda.empty_cache()

    tokens = torch.zeros((parallel_size * 2, len(input_ids)), dtype=torch.int).to(cuda_device)
    for i in range(parallel_size * 2):
        tokens[i, :] = input_ids
        if i % 2 != 0:
            tokens[i, 1:-1] = vl_chat_processor.pad_id
    inputs_embeds = vl_gpt.language_model.get_input_embeddings()(tokens)
    generated_tokens = torch.zeros((parallel_size, image_token_num_per_image), dtype=torch.int).to(cuda_device)

    pkv = None
    for i in range(image_token_num_per_image):
        with torch.no_grad():
            outputs = vl_gpt.language_model.model(inputs_embeds=inputs_embeds,
                                                  use_cache=True,
                                                  past_key_values=pkv)
            pkv = outputs.past_key_values
            hidden_states = outputs.last_hidden_state
            logits = vl_gpt.gen_head(hidden_states[:, -1, :])
            logit_cond = logits[0::2, :]
            logit_uncond = logits[1::2, :]
            logits = logit_uncond + cfg_weight * (logit_cond - logit_uncond)
            probs = torch.softmax(logits / temperature, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            generated_tokens[:, i] = next_token.squeeze(dim=-1)
            next_token = torch.cat([next_token.unsqueeze(dim=1), next_token.unsqueeze(dim=1)], dim=1).view(-1)

            img_embeds = vl_gpt.prepare_gen_img_embeds(next_token)
            inputs_embeds = img_embeds.unsqueeze(dim=1)

    patches = vl_gpt.gen_vision_model.decode_code(generated_tokens.to(dtype=torch.int),
                                                  shape=[parallel_size, 8, width // patch_size, height // patch_size])

    return generated_tokens.to(dtype=torch.int), patches


def unpack(dec, width, height, parallel_size=5):
    dec = dec.to(torch.float32).cpu().numpy().transpose(0, 2, 3, 1)
    dec = np.clip((dec + 1) / 2 * 255, 0, 255)

    visual_img = np.zeros((parallel_size, width, height, 3), dtype=np.uint8)
    visual_img[:, :, :] = dec

    return visual_img


@torch.inference_mode()
# @spaces.GPU(duration=120)  # Specify a duration to avoid timeout
def generate_image(prompt,
                   seed=None,
                   guidance=5,
                   t2i_temperature=1.0):
    # Clear CUDA cache and avoid tracking gradients
    torch.cuda.empty_cache()
    # Set the seed for reproducible results
    if seed is not None:
        torch.manual_seed(seed)
        torch.cuda.manual_seed(seed)
        np.random.seed(seed)
    width = 384
    height = 384
    parallel_size = 5

    with torch.no_grad():
        messages = [{'role': '<|User|>', 'content': prompt},
                    {'role': '<|Assistant|>', 'content': ''}]
        text = vl_chat_processor.apply_sft_template_for_multi_turn_prompts(conversations=messages,
                                                                           sft_format=vl_chat_processor.sft_format,
                                                                           system_prompt='')
        text = text + vl_chat_processor.image_start_tag

        input_ids = torch.LongTensor(tokenizer.encode(text))
        output, patches = generate(input_ids,
                                   width // 16 * 16,
                                   height // 16 * 16,
                                   cfg_weight=guidance,
                                   parallel_size=parallel_size,
                                   temperature=t2i_temperature)
        images = unpack(patches,
                        width // 16 * 16,
                        height // 16 * 16,
                        parallel_size=parallel_size)

        return [Image.fromarray(images[i]).resize((768, 768), Image.LANCZOS) for i in range(parallel_size)]


# Gradio interface
with gr.Blocks() as deepseek:
    gr.Markdown(value="# Multimodal Understanding")
    with gr.Row():
        image_input = gr.Image()
        with gr.Column():
            question_input = gr.Textbox(label="Question")
            und_seed_input = gr.Number(label="Seed", precision=0, value=42)
            top_p = gr.Slider(minimum=0, maximum=1, value=0.95, step=0.05, label="top_p")
            temperature = gr.Slider(minimum=0, maximum=1, value=0.1, step=0.05, label="temperature")

    understanding_button = gr.Button("Chat")
    understanding_output = gr.Textbox(label="Response")

    examples_inpainting = gr.Examples(
        label="Multimodal Understanding examples",
        examples=[
            [
                "explain this meme",
                "images/doge.png",
            ],
            [
                "Convert the formula into latex code.",
                "images/equation.png",
            ],
        ],
        inputs=[question_input, image_input],
    )

    gr.Markdown(value="# Text-to-Image Generation")

    with gr.Row():
        cfg_weight_input = gr.Slider(minimum=1, maximum=10, value=5, step=0.5, label="CFG Weight")
        t2i_temperature = gr.Slider(minimum=0, maximum=1, value=1.0, step=0.05, label="temperature")

    prompt_input = gr.Textbox(label="Prompt. (Prompt in more detail can help produce better images!)")
    seed_input = gr.Number(label="Seed (Optional)", precision=0, value=12345)

    generation_button = gr.Button("Generate Images")

    image_output = gr.Gallery(label="Generated Images", columns=2, rows=2, height=300)

    examples_t2i = gr.Examples(
        label="Text to image generation examples.",
        examples=[
            "Master shifu racoon wearing drip attire as a street gangster.",
            "The face of a beautiful girl",
            "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
            "A glass of red wine on a reflective surface.",
            "A cute and adorable baby fox with big brown eyes, autumn leaves in the background enchanting,immortal,fluffy, shiny mane,Petals,fairyism,unreal engine 5 and Octane Render,highly detailed, photorealistic, cinematic, natural colors.",
            "The image features an intricately designed eye set against a circular backdrop adorned with ornate swirl patterns that evoke both realism and surrealism. At the center of attention is a strikingly vivid blue iris surrounded by delicate veins radiating outward from the pupil to create depth and intensity. The eyelashes are long and dark, casting subtle shadows on the skin around them which appears smooth yet slightly textured as if aged or weathered over time.\n\nAbove the eye, there's a stone-like structure resembling part of classical architecture, adding layers of mystery and timeless elegance to the composition. This architectural element contrasts sharply but harmoniously with the organic curves surrounding it. Below the eye lies another decorative motif reminiscent of baroque artistry, further enhancing the overall sense of eternity encapsulated within each meticulously crafted detail. \n\nOverall, the atmosphere exudes a mysterious aura intertwined seamlessly with elements suggesting timelessness, achieved through the juxtaposition of realistic textures and surreal artistic flourishes. Each component\u2014from the intricate designs framing the eye to the ancient-looking stone piece above\u2014contributes uniquely towards creating a visually captivating tableau imbued with enigmatic allure.",
        ],
        inputs=prompt_input,
    )

    understanding_button.click(
        multimodal_understanding,
        inputs=[image_input, question_input, und_seed_input, top_p, temperature],
        outputs=understanding_output,
    )

    generation_button.click(
        fn=generate_image,
        inputs=[prompt_input, seed_input, cfg_weight_input, t2i_temperature],
        outputs=image_output,
    )

deepseek.launch(share=True, server_name="0.0.0.0")
# deepseek.queue(concurrency_count=1, max_size=10).launch(server_name="0.0.0.0", server_port=37906, root_path="/path")
```
Run it. Gradio's public share link relies on the frpc tunnel binary, so set that up first.
Before running:
Download: https://download.csdn.net/download/jiangkp/90567145
mv frpc_linux_amd64_v0.2 /root/miniforge3/envs/janus/lib/python3.9/site-packages/gradio
chmod +x /root/miniforge3/envs/janus/lib/python3.9/site-packages/gradio/frpc_linux_amd64_v0.2
Then run:
python demo/app_januspro-7b-8bit.py

Running on local URL: http://0.0.0.0:7860
Running on public URL: https://d35019aaf88c9b8c69.gradio.live

Image understanding:
Text-to-image generation:
It mostly comes down to being poor; the GPU memory must have overflowed.
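One knob worth trying before giving up on 7B text-to-image (an untested suggestion based on the code above): `generate_image` hardcodes `parallel_size = 5`, i.e. five images per prompt plus the doubled CFG batch, so lowering it shrinks the generation batch accordingly:

```python
# In generate_image() of demo/app_januspro-7b-8bit.py:
parallel_size = 1  # was 5; one image per prompt, so the 2*parallel_size CFG batch drops from 10 to 2
```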
Switching to the 1B model.
3. Download the 1B model
modelscope download --model deepseek-ai/Janus-Pro-1B --local_dir ./Janus-Pro-1B
4. Create app_januspro-1b.py and run it
/workspace/Janus/demo/app_januspro-1b.py
Just edit the original app_januspro.py that ships with the downloaded source, not the file we modified above:
model_path = "/workspace/Janus/Janus-Pro-1B"
Before running: if you already set up the frpc tunnel earlier, there is no need to do it again.
Download: https://download.csdn.net/download/jiangkp/90567145
mv frpc_linux_amd64_v0.2 /root/miniforge3/envs/janus/lib/python3.9/site-packages/gradio
chmod +x /root/miniforge3/envs/janus/lib/python3.9/site-packages/gradio/frpc_linux_amd64_v0.2
Then run:
python demo/app_januspro-1b.py
Image understanding:

Text-to-image generation:

One more. This is the 1B model; getting anything out at all is already decent.

Where is the girl? It probably does not understand Chinese prompts.
