AWS Tutorial: Setting Up a Runtime Environment

Create an AWS account. (5 mins)

After logging in to AWS, search for EC2 and use it to create an Instance (Launch instance).

Billing

You are billed for the time the instance spends running, so remember to terminate instances you no longer need.

State        Meaning                           Cost
Running      Instance is running               Billed per hour (GPU high, CPU low)
Stopped      Instance is shut down             No compute charge, but attached EBS storage is still billed (a small amount)
Terminated   Instance is permanently deleted   No charge, but every environment and file on the instance is wiped
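To get a feel for how these charges add up, here is a rough cost sketch in Python. The hourly and storage rates below are illustrative placeholders, not AWS's actual prices; check the AWS pricing pages for real numbers.

```python
# Rough EC2 cost sketch. All rates are made-up placeholders.
def estimate_monthly_cost(hours_running, hourly_rate, ebs_gb, ebs_rate_per_gb_month):
    compute = hours_running * hourly_rate      # billed only while Running
    storage = ebs_gb * ebs_rate_per_gb_month   # billed even while Stopped
    return compute + storage

# e.g. 10 hours of compute at $0.10/h plus a 30 GB EBS volume at $0.08/GB-month
print(estimate_monthly_cost(10, 0.10, 30, 0.08))
```

Note that the storage term is charged even while the instance is stopped; only termination (and deleting the volume) stops it.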

If you terminate a GPU instance, its root volume is deleted and every environment and file on it is gone.

❗ In other words, the next time you create a GPU instance you will need to:

  1. Install Miniconda / Anaconda again

  2. Recreate the virtual environment

  3. Reinstall the required Python packages

This is normal AWS behavior, unlike a local machine: GPU instances are expensive, so AWS does not keep temporary environments around for you.

Launch instance

Note:

New users may have a limited vCPU quota.

Choose Ubuntu as the operating system.

Choose a suitable instance type.

The instance type matters because it determines which GPU (if any) you get. Free-plan users may not be able to use a GPU.

https://aws.amazon.com/ec2/instance-types/

I use m7i-flex.large.

m7i-flex.large specifications

Parameter      Value
vCPU           2
Memory (RAM)   8 GB
GPU            ❌ None (CPU-only instance)
Storage        On-demand EBS

Create a key pair for login.

aws.pem

Save this file; you will need it every time you connect to the instance.

Configure storage

Set a suitable storage volume size.

How to Reduce the Cost and Hassle of Rebuilding the Environment

There are several ways to restore your environment quickly.

Method A: Save an AMI (Amazon Machine Image)

  1. Install Anaconda + the Python environment + packages on the GPU instance

  2. Create an AMI (Amazon Machine Image)

  3. Next time, launch the GPU instance from this AMI → the environment and packages are all preserved

✅ Pros: no need to reinstall the environment every time

❌ Cons: creating and managing AMIs has a small learning curve
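The AMI flow above can also be driven from the AWS CLI, assuming it is installed and configured with your credentials. The instance ID, AMI ID, instance type, and names below are placeholders, not real resources:

```shell
# Create an AMI from a configured instance (placeholder instance ID)
aws ec2 create-image \
    --instance-id i-0123456789abcdef0 \
    --name "research-env" \
    --description "Miniconda + PyTorch environment"

# Later, launch a new instance from that AMI (placeholder AMI ID)
aws ec2 run-instances \
    --image-id ami-0123456789abcdef0 \
    --instance-type g4dn.xlarge \
    --key-name aws
```

You can do the same from the console via Actions → Image and templates → Create image.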


Method B: Keep the Environment on an EBS Data Volume

  1. Create a separate EBS volume and attach it to the instance

  2. Store Anaconda / data / models on that volume

  3. After you terminate the instance, the EBS volume survives

  4. Next time, create an instance → attach the same EBS volume → pick up where you left off

✅ Pros: saves the time of building an AMI

❌ Cons: you have to manage the EBS volume yourself (slightly more complex)
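On the instance, using a data volume looks roughly like this. The device name and mount point are assumptions (on Nitro-based instances the volume may show up as /dev/nvme1n1 instead; check with lsblk):

```shell
# First time only: create a filesystem on the freshly attached volume
sudo mkfs -t ext4 /dev/xvdf

# Every time: mount the volume
sudo mkdir -p /data
sudo mount /dev/xvdf /data

# Install Miniconda onto the persistent volume instead of the root disk
bash Miniconda3-latest-Linux-x86_64.sh -p /data/miniconda3
```

Because the environment lives under /data, terminating the instance no longer destroys it; re-attach and re-mount the volume on the next instance.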


Method C: Use Containers (Docker / Singularity)

  • Package Python + your packages into a container image

  • When you start a GPU instance, run the container directly

  • Independent of the host environment, quick to get going

✅ Pros: very professional; common among research users

❌ Cons: requires learning basic Docker operations
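A minimal sketch of this approach, assuming Docker is installed on the instance. The image name and package list are illustrative, matching the CPU environment used later in this tutorial:

```shell
# Write a minimal Dockerfile for a CPU-only research environment
cat > Dockerfile <<'EOF'
FROM python:3.10-slim
RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu && \
    pip install --no-cache-dir transformers accelerate datasets peft
EOF

# Build once, then reuse the image on any instance
docker build -t research-env .
docker run --rm research-env python -c "import torch; print(torch.__version__)"
```

Push the image to a registry (Docker Hub or Amazon ECR) so a fresh instance only needs a docker pull.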

AWS Login & Anaconda Setup

Open a terminal

chmod 400 aws.pem

ssh -i aws.pem ubuntu@x.xx.x.xxx   # replace with your instance's public IP

Now, install Anaconda and build the environment.

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

bash Miniconda3-latest-Linux-x86_64.sh
# Answer "yes" when the installer prompts you

source ~/.bashrc

conda --version
# If a version number is printed, the installation succeeded

conda create -n research python=3.10
conda activate research

Build a simple virtual env

# Install PyTorch (CPU build)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

pip install transformers accelerate
pip install datasets peft

pip install bitsandbytes

# Check that the installation succeeded
python -c "import torch; import transformers; print(torch.__version__)"

LLM Inference

from huggingface_hub import login
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# -----------------------------
# Log in to Hugging Face
# -----------------------------
hf_token = "hf_g........"  # your Hugging Face access token
login(hf_token)  # only needs to run once

# -----------------------------
# Model configuration
# -----------------------------
model_name = "EleutherAI/gpt-neo-125M"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the model on the CPU
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cpu")

# -----------------------------
# Test inference
# -----------------------------
prompt = "Hello, this is a test of GPT-Neo on CPU."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))

LLM Training & Testing

train_ft.py

import torch
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling
)

from peft import LoraConfig, get_peft_model

# -------------------------
# 1. Load tokenizer/model
# -------------------------
model_name = "EleutherAI/gpt-neo-125M"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Fix padding token issue
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name)

# -------------------------
# 2. Apply LoRA (lightweight FT)
# -------------------------
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj","v_proj"],  # safe default
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

# -------------------------
# 3. Tiny example dataset
# Replace with yours
# -------------------------
texts = [
    "Deep learning is a subset of machine learning.",
    "Transformers are powerful models for NLP.",
    "Large language models can generate text."
]

dataset = Dataset.from_dict({"text": texts})

# -------------------------
# 4. Tokenization
# -------------------------
def tokenize(example):
    tokens = tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=64
    )

    # Critical: labels = input_ids
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

dataset = dataset.map(tokenize)

# -------------------------
# 5. Data collator
# -------------------------
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

# -------------------------
# 6. Training args (CPU safe)
# -------------------------
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    logging_steps=1,
    save_steps=10,
    learning_rate=2e-4,
    fp16=False,  # CPU -> False
    report_to="none"
)

# -------------------------
# 7. Trainer
# -------------------------
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator,
)

# -------------------------
# 8. Train
# -------------------------
trainer.train()

# -------------------------
# 9. Save LoRA model
# -------------------------
model.save_pretrained("./lora_model")

print("Training complete!")
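The LoRA configuration in train_ft.py freezes the base weights and learns only a low-rank update to the targeted projection matrices. The core idea can be sketched in plain Python with toy matrices (sizes and values below are made up for illustration):

```python
# LoRA in one line of math: W_eff = W + (alpha / r) * (B @ A),
# where A is (r x in_features) and B is (out_features x r), with r small.
def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_effective_weight(W, A, B, r, alpha):
    scale = alpha / r
    BA = matmul(B, A)  # low-rank update, same shape as W
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

# Toy 2x2 weight with rank r = 1: only 4 adapter numbers instead of 4 full weights
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]       # 1 x 2
B = [[1.0], [0.0]]     # 2 x 1
print(lora_effective_weight(W, A, B, r=1, alpha=2))  # [[3.0, 2.0], [0.0, 1.0]]
```

This is why save_pretrained("./lora_model") writes only the small A/B adapter matrices, not a full copy of GPT-Neo.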

test_ft.py

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# -------------------------
# 1. Paths
# -------------------------
base_model_name = "EleutherAI/gpt-neo-125M"
lora_path = "./lora_model"

# -------------------------
# 2. Tokenizer
# -------------------------
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# -------------------------
# 3. Base model
# -------------------------
base_model = AutoModelForCausalLM.from_pretrained(base_model_name)

# -------------------------
# 4. Load LoRA
# -------------------------
model = PeftModel.from_pretrained(base_model, lora_path)

model.eval()

# -------------------------
# 5. Test
# -------------------------
print("Testing the fine-tuned model:\n")
prompt = "Medical Informatics is a subject "


inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=50,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
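The generate() call above samples with temperature and top_p. As a rough illustration of what those parameters do, here is nucleus (top-p) filtering in plain Python; the logits are toy values, and the real implementation lives inside transformers:

```python
import math

def top_p_candidates(logits, temperature=0.7, top_p=0.9):
    # Temperature rescales the logits, then softmax turns them into probabilities
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    # Keep the smallest set of top tokens whose cumulative probability >= top_p
    kept, cum = [], 0.0
    for p, i in ranked:
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize; generation then samples only from this "nucleus"
    z = sum(p for _, p in kept)
    return {i: p / z for i, p in kept}

# With one dominant logit, the low-probability tail is filtered out
print(sorted(top_p_candidates([2.0, 1.0, 0.1]).keys()))  # [0, 1]
```

Lower temperature sharpens the distribution (fewer candidates survive); lower top_p trims the tail more aggressively.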

File Upload & Download

# Download a single file
scp -i aws.pem ubuntu@18.xxx.xxx.xxx:/home/ubuntu/me/test.py .

# Upload a single file
scp -i aws.pem test.py ubuntu@18.xxx.xxx.xxx:/home/ubuntu/me/

If you transfer files often:

✅ Use a GUI tool (drag and drop)

Recommended:

  • WinSCP (excellent on Windows)

  • FileZilla

Enter:

Host: 18.xxx.xxx.xxx

User: ubuntu

Key: aws.pem

Then just drag and drop files.

✅ VSCode can also edit files on the server directly (via the Remote - SSH extension).
