Basic Python Usage (numpy, pandas, matplotlib)

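1. numpy

numpy (short for Numerical Python) is an extension library for Python that supports multidimensional arrays and matrix operations, and provides a large collection of mathematical functions that operate on those arrays. Its main features are:

N-dimensional array object (ndarray): a multidimensional array that stores elements of a single data type.

Fast element-wise operations: addition, subtraction, multiplication, and so on, applied to whole arrays without explicit Python loops.

Broadcasting: a mechanism that lets arrays of different but compatible shapes take part in the same arithmetic operation.

Linear algebra, statistics, and Fourier transforms: a wide range of higher-level mathematical functions.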
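
To make the broadcasting feature above concrete, here is a minimal sketch. The array names (`matrix`, `row`, `col`) are illustrative and not from the original text. It shows a 1-D array and a column vector each being stretched to match a 2-D array.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])      # shape (2, 3)
row = np.array([10, 20, 30])        # shape (3,)
col = np.array([[100], [200]])      # shape (2, 1)

# The (3,) array is treated as (1, 3) and repeated for every row
plus_row = matrix + row

# The (2, 1) array is repeated across every column
plus_col = matrix + col

print(plus_row)  # [[11 22 33] [14 25 36]]
print(plus_col)  # [[101 102 103] [204 205 206]]
```

Shapes are compared from the trailing dimension backwards. Two dimensions are compatible when they are equal or when one of them is 1.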

Common code:

The following are some commonly used numpy code examples:

1. Import numpy

```python
import numpy as np
```

2. Create arrays

```python
# One-dimensional array
arr1d = np.array([1, 2, 3, 4, 5])

# Two-dimensional array (matrix)
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Create arrays of a given shape with zeros, ones, empty
zeros_arr = np.zeros((3, 3))
ones_arr = np.ones((2, 4))
empty_arr = np.empty((2, 3))  # Note: contents are uninitialized and may hold arbitrary values

# Create one-dimensional arrays with arange, linspace
arange_arr = np.arange(0, 10, 2)  # Start at 0, stop before 10, step 2 -> [0, 2, 4, 6, 8]
linspace_arr = np.linspace(0, 1, 5)  # 5 evenly spaced values from 0 to 1, both endpoints included

# Create random arrays
random_arr = np.random.rand(3, 3)  # Uniform random floats in [0, 1)
randint_arr = np.random.randint(0, 10, (3, 3))  # Random integers from 0 to 9 (upper bound excluded)
```

3. Array operations

```python
# Element-wise arithmetic
result = arr1d + arr1d  # Add corresponding elements
result = arr2d * 2  # Multiply every element by 2

# Indexing and slicing
element = arr2d[0, 0]  # Element at row 0, column 0
row = arr2d[1, :]  # Second row
col = arr2d[:, 1]  # Second column

# Shape and size
shape = arr2d.shape  # (3, 3)
size = arr2d.size  # Total number of elements, here 9

# Data type (dtype)
dtype = arr1d.dtype  # Typically dtype('int64'); the default integer type is platform-dependent

# Sorting
sorted_arr = np.sort(arr1d)  # Returns a sorted copy; arr1d is unchanged

# Boolean (conditional) selection
mask = arr1d > 3
selected_elements = arr1d[mask]  # array([4, 5])

# Reshape
reshaped_arr = arr1d.reshape((1, 5))

# Concatenate
concat_arr = np.concatenate((arr1d, [6, 7]))

# Transpose
transposed_arr = arr2d.T

# Matrix multiplication (for 2-D arrays, equivalent to arr2d @ arr2d.T)
dot_product = np.dot(arr2d, arr2d.T)
```
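
Element-wise multiplication and matrix multiplication are easy to mix up, so here is a small sketch contrasting the two. The 2x2 arrays `a` and `b` are made up for illustration.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

# `*` multiplies matching positions
elementwise = a * b

# `@` (or np.dot for 2-D arrays) is the row-by-column matrix product
matmul = a @ b

print(elementwise)  # [[ 5 12] [21 32]]
print(matmul)       # [[19 22] [43 50]]
```

For 2-D inputs, `np.dot(a, b)` and `a @ b` give the same result. The operator form is usually easier to read in longer expressions.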

4. Statistics and aggregation

```python
# Minimum, maximum, mean, median
min_val = np.min(arr1d)
max_val = np.max(arr1d)
mean_val = np.mean(arr1d)
median_val = np.median(arr1d)

# Standard deviation and variance (population formulas by default, ddof=0)
std_dev = np.std(arr1d)
variance = np.var(arr1d)

# Sum along a given axis
sum_axis0 = np.sum(arr2d, axis=0)  # Column sums: [12, 15, 18]
sum_axis1 = np.sum(arr2d, axis=1)  # Row sums: [6, 15, 24]
```
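
The `axis` argument is a common source of confusion, so the sketch below spells it out on a small 2x3 array. The names `grid` and `centered` are illustrative. `axis=0` collapses the rows (one result per column), and `axis=1` collapses the columns (one result per row).

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

col_sums = grid.sum(axis=0)   # one value per column
row_sums = grid.sum(axis=1)   # one value per row
total = grid.sum()            # no axis: sum of every element

# keepdims=True keeps the reduced axis as length 1, so the result
# broadcasts back against the original array
row_means = grid.mean(axis=1, keepdims=True)   # shape (2, 1)
centered = grid - row_means

print(col_sums)  # [5 7 9]
print(row_sums)  # [ 6 15]
print(total)     # 21
```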

5. Finding and searching

```python
# Indices of the non-zero elements (returned as a tuple of index arrays)
nonzero_indices = np.nonzero(arr1d)

# Positions of a specific value
positions = np.where(arr1d == 3)  # (array([2]),)

# Unique values and their counts
unique_values, counts = np.unique(arr1d, return_counts=True)
```
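
The return values of these search functions are not obvious from the calls alone. The sketch below uses a made-up array `values` that contains zeros and repeats, so each function returns something non-trivial.

```python
import numpy as np

values = np.array([3, 0, 1, 3, 0, 3])

# np.nonzero and one-argument np.where return a tuple with one index
# array per dimension; [0] extracts the indices for a 1-D input
nonzero_idx = np.nonzero(values)[0]
positions = np.where(values == 3)[0]

# Three-argument np.where picks between two alternatives element by element
replaced = np.where(values == 0, -1, values)

# Unique values come back sorted, with a matching array of counts
unique_values, counts = np.unique(values, return_counts=True)

print(nonzero_idx)            # [0 2 3 5]
print(positions)              # [0 3 5]
print(replaced)               # [ 3 -1  1  3 -1  3]
print(unique_values, counts)  # [0 1 3] [2 1 3]
```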

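2. pandas

pandas is a powerful data analysis toolkit that provides data structures and tools for processing and analyzing large amounts of data. Its main features include:

DataFrame: a two-dimensional, size-mutable tabular data structure whose columns can hold different data types.

Series: a one-dimensional labeled array that can hold any data type, together with an associated set of labels (the index).

Data input/output: reads from and writes to a variety of formats, such as CSV, Excel, and SQL databases.

Data processing: cleaning, transformation, merging, reshaping, and more.

Statistical analysis: a range of statistical functions and methods.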
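
The role of the index is easiest to see in arithmetic. In this minimal sketch, with two made-up Series `s1` and `s2`, pandas aligns values by label rather than by position, and fills labels that appear on only one side with NaN.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Values are matched by label: 'b' and 'c' exist in both Series,
# while 'a' and 'd' have no partner and become NaN
total = s1 + s2

print(total)
# a     NaN
# b    12.0
# c    23.0
# d     NaN
```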

Common pandas code

The following are some commonly used pandas code examples:

1. Import pandas

```python
import pandas as pd
```

2. Create a DataFrame

```python
# Create a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles']
}
df = pd.DataFrame(data)

# Read a DataFrame from a CSV file (this replaces the df created above)
df = pd.read_csv('data.csv')

# Read a DataFrame from a SQL database
# Requires sqlalchemy and a database driver (such as pymysql);
# the connection string and table name below are placeholders
from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://user:password@localhost:3306/dbname')
df = pd.read_sql_table('table_name', engine)
```

3. Inspect a DataFrame

```python
# Show the first rows (5 by default)
print(df.head())

# Show the last rows (5 by default)
print(df.tail())

# Show the structure (column names, data types, non-null counts)
# info() prints its report itself and returns None, so it is not wrapped in print()
df.info()

# Show the data type of each column
print(df.dtypes)

# Show descriptive statistics for the numeric columns
print(df.describe())
```

4. Select data

```python
# Select a column
print(df['Age'])

# Select multiple columns
print(df[['Name', 'Age']])

# Select rows with loc and iloc
print(df.loc[0])  # Row whose index label is 0 (the first row under the default index)
print(df.iloc[0])  # First row by integer position, whatever the index labels are

# Select rows by condition
print(df[df['Age'] > 30])
```
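
With the default 0, 1, 2, ... index, `loc` and `iloc` return the same rows, which hides the difference between them. The sketch below uses a DataFrame with a deliberately non-default index (labels 10, 20, 30, chosen only for illustration) to separate the two.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]},
    index=[10, 20, 30],
)

by_label = df.loc[10]     # label-based: the row labeled 10
by_position = df.iloc[0]  # position-based: the first row

# There is no row labeled 0, so loc raises KeyError here
try:
    df.loc[0]
    loc_zero_failed = False
except KeyError:
    loc_zero_failed = True

label_slice = df.loc[10:20]   # label slices include both endpoints
position_slice = df.iloc[0:1] # position slices exclude the stop

# loc also accepts a boolean mask plus a column selection
older_names = df.loc[df['Age'] > 28, 'Name']

print(by_label['Name'], by_position['Name'])  # Alice Alice
print(len(label_slice), len(position_slice))  # 2 1
print(list(older_names))                      # ['Bob', 'Charlie']
```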

5. Data cleaning and transformation

```python
# Handle missing values
df.fillna(0, inplace=True)  # Replace missing values with 0

# Rename a column
df.rename(columns={'Age': 'Age_Years'}, inplace=True)

# Drop a column
df.drop('City', axis=1, inplace=True)

# Drop rows (here: rows where Age_Years is below 30)
df.drop(df[df['Age_Years'] < 30].index, inplace=True)

# Convert a column's data type
df['Age_Years'] = df['Age_Years'].astype(int)

# String operations (for example, convert to uppercase)
df['Name'] = df['Name'].str.upper()

# Apply a function to each element of a column
df['Age_Squared'] = df['Age_Years'].apply(lambda x: x**2)
```
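
The cleaning steps above modify `df` in place. Another common style, sketched below, chains methods that each return a new DataFrame and leave the input untouched. The `raw` data and the choice to fill missing ages with the median are assumptions made for this example, not part of the original.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    'Name': ['alice', 'bob', None],
    'Age': [25, np.nan, 35],
})

# Count missing values per column before deciding how to handle them
missing_per_column = raw.isna().sum()

cleaned = (
    raw.dropna(subset=['Name'])                   # drop rows with no name
       .fillna({'Age': raw['Age'].median()})      # fill missing ages with the median (30.0)
       .astype({'Age': int})                      # safe now that no NaN remains
       .rename(columns={'Age': 'Age_Years'})
)

print(cleaned['Age_Years'].tolist())  # [25, 30]
print(int(raw['Age'].isna().sum()))   # 1 (raw is unchanged)
```

Because none of the chained methods mutate their input, `raw` can still be inspected or reprocessed afterwards.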

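6. Grouping and aggregation

```python
# Step 5 dropped the 'City' column, so rebuild a small DataFrame that still has it
df = pd.DataFrame({
    'Name': ['ALICE', 'BOB', 'CHARLIE'],
    'Age_Years': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles']
})

# Group with groupby
grouped = df.groupby('City')

# Aggregate the grouped data (for example, the mean age per city)
agg_result = grouped['Age_Years'].mean()

# Multiple aggregations on the same column
agg_result = grouped.agg({'Age_Years': ['mean', 'count']})
```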
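
Grouping is easier to follow when a group has more than one row. The sketch below uses a made-up `people` table with repeated cities, and also shows named aggregation, which gives flat, readable column names instead of the two-level columns that a dict of lists produces.

```python
import pandas as pd

people = pd.DataFrame({
    'City': ['Paris', 'Paris', 'London', 'London', 'London'],
    'Age': [30, 40, 20, 25, 45],
})

# One aggregate per group; the result is a Series indexed by City
mean_age = people.groupby('City')['Age'].mean()

# Named aggregation: output_column=(input_column, function)
summary = people.groupby('City').agg(
    mean_age=('Age', 'mean'),
    n_people=('Age', 'count'),
)

print(mean_age['London'], mean_age['Paris'])  # 30.0 35.0
print(summary)
```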

7. Sorting

```python
# Sort by one column (assumes df has an 'Age_Years' column)
df_sorted = df.sort_values(by='Age_Years')

# Sort by multiple columns (assumes df also has a 'City' column)
df_sorted_multi = df.sort_values(by=['City', 'Age_Years'])
```

8. Save to a file

```python
# Save to a CSV file
df.to_csv('output.csv', index=False)

# Save to an Excel file (requires an Excel writer package such as openpyxl)
df.to_excel('output.xlsx', index=False)

# Save to a SQL database (uses the engine created in step 2; replaces the table if it exists)
df.to_sql('table_name', engine, if_exists='replace', index=False)
```

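3. matplotlib

matplotlib is a Python plotting library, primarily for 2D graphics. It provides a MATLAB-like plotting interface and can produce static, animated, and interactive visualizations. Its main features include:

Simple plotting syntax: plotting commands similar to MATLAB's, easy to pick up.

Rich set of chart types: line plots, scatter plots, bar charts, pie charts, and more.

Fine-grained control: colors, line styles, axis labels, and other details can be customized.

Interactivity: with an interactive backend, figures can be zoomed and panned.

Integration: works with numpy, pandas, and other libraries for data analysis and visualization.

The following are some commonly used matplotlib code examples: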

1. Import matplotlib

```python
import matplotlib.pyplot as plt
```

2. Line plot

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

plt.plot(x, y)
plt.title('Line Plot')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.show()
```

3. Scatter plot

```python
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]

plt.scatter(x, y)
plt.title('Scatter Plot')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.show()
```

4. Bar chart

```python
x = ['A', 'B', 'C', 'D', 'E']
y = [2, 4, 6, 8, 10]

plt.bar(x, y)
plt.title('Bar Plot')
plt.xlabel('Category')
plt.ylabel('Value')
plt.show()
```

5. Pie chart

```python
labels = ['A', 'B', 'C', 'D', 'E']
sizes = [15, 30, 45, 10, 5]

plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
```

6. Histogram

```python
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]

plt.hist(data, bins=5, edgecolor='black')
plt.title('Histogram')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
```

7. Multiple subplots

```python
# Define the data explicitly so this block does not depend on earlier examples
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

plt.figure(figsize=(10, 6))

plt.subplot(2, 2, 1)  # 2 rows, 2 columns, first plot
plt.plot(x, y)
plt.title('First Plot')

plt.subplot(2, 2, 2)  # second plot
plt.scatter(x, y)
plt.title('Second Plot')

plt.subplot(2, 2, 3)  # third plot (the fourth cell of the grid stays empty)
plt.bar(x, y)
plt.title('Third Plot')

plt.tight_layout()  # Adjusts spacing between subplots
plt.show()
```
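
The same kind of layout can also be written with matplotlib's object-oriented interface, where `plt.subplots` returns the figure and its axes up front and each axes object is drawn on directly. This is a sketch under two assumptions that are not in the original: it selects the non-interactive `Agg` backend so that it runs without a display, and it writes the image to an in-memory buffer instead of calling `plt.show()`.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line for interactive use
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# One row, three columns; axes is an array of three Axes objects
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

axes[0].plot(x, y)
axes[0].set_title('Line')

axes[1].scatter(x, y)
axes[1].set_title('Scatter')

axes[2].bar(x, y)
axes[2].set_title('Bar')

fig.tight_layout()

# Render to an in-memory PNG; a file path would also work here
buffer = io.BytesIO()
fig.savefig(buffer, format='png')
plt.close(fig)
```

Holding explicit `fig` and `axes` references avoids relying on the "current axes" state that `plt.subplot` switches behind the scenes, which tends to matter more as figures grow.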

8. Add a legend

```python
x = [1, 2, 3, 4, 5]
y1 = [2, 4, 6, 8, 10]
y2 = [3, 5, 7, 9, 11]

plt.plot(x, y1, label='Line 1')
plt.plot(x, y2, label='Line 2')
plt.legend()  # Uses the label= text passed to each plot call
plt.title('Line Plot with Legend')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.show()
```

9. Customize colors, line styles, and markers

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

plt.plot(x, y, color='red', linestyle='--', marker='o')
plt.title('Customized Line Plot')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.show()
```

pandas: additional examples

```python
import pandas as pd

# --- Create a DataFrame and a Series ---

# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'Paris', 'London']}
df = pd.DataFrame(data)

# Create a Series
s = pd.Series([1, 2, 3, 4], name='Numbers')

# --- Read and write data ---

# Read a CSV file (replaces df; the code below assumes it has Name, Age, and City columns)
df = pd.read_csv('data.csv')

# Write a CSV file
df.to_csv('output.csv', index=False)

# Read an Excel file (requires an Excel reader package such as openpyxl)
df = pd.read_excel('data.xlsx')

# Write an Excel file
df.to_excel('output.xlsx', index=False)

# --- Select data ---

# Select a column
ages = df['Age']

# Select multiple columns
info = df[['Name', 'Age']]

# Select rows
first_row = df.iloc[0]  # By integer position
bob_row = df[df['Name'] == 'Bob']  # By condition

# Select specific rows and columns
selected_data = df.loc[df['Age'] > 30, ['Name', 'City']]

# --- Process data ---

# Derive a new column with a vectorized operation
df['AgeSquared'] = df['Age'] ** 2

# Replace values
df.replace({'City': {'New York': 'NYC'}}, inplace=True)

# Drop a column
df.drop('AgeSquared', axis=1, inplace=True)

# Drop rows by condition (keeps only rows where Age is above 20)
df = df[df['Age'] > 20]

# Sort
df_sorted = df.sort_values(by='Age')

# Group and aggregate
grouped = df.groupby('City')['Age'].mean()
```

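```python
# --- Merge and concatenate ---

# Merge two DataFrames on the 'key' column (an inner join by default)
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'key': ['K0', 'K0', 'K1', 'K1'],
                    'C': ['C0', 'C1', 'C2', 'C3']})

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'key': ['K0', 'K1', 'K0', 'K1'],
                    'D': ['D0', 'D1', 'D2', 'D3']})

# Each key appears twice on both sides, so this many-to-many merge yields 8 rows;
# the overlapping columns A and B are suffixed as A_x, A_y, B_x, B_y
merged = pd.merge(df1, df2, on='key')

# Concatenate two DataFrames row-wise; ignore_index=True renumbers the rows 0..n-1
concatenated = pd.concat([df1, df2], ignore_index=True)
```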
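
Since `merge` and `concat` are often confused, the sketch below contrasts them on two small made-up tables, `left` and `right`. It also shows the `how`, `suffixes`, and `indicator` parameters of `pd.merge`. The table contents are illustrative only.

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K3'], 'A': ['B0', 'B1', 'B3']})

# Inner join: only keys present in both tables (K0, K1)
inner = pd.merge(left, right, on='key', suffixes=('_left', '_right'))

# Outer join: every key from either table; indicator=True adds a
# '_merge' column saying where each row came from
outer = pd.merge(left, right, on='key', how='outer', indicator=True)

# concat does no key matching: it stacks the rows of both tables
stacked = pd.concat([left, right], ignore_index=True)

print(list(inner.columns))   # ['key', 'A_left', 'A_right']
print(len(inner), len(outer), len(stacked))  # 2 4 6
```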

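```python
# --- Statistics ---

# Descriptive statistics
stats = df.describe()

# Count how often each value occurs
unique_counts = df['City'].value_counts()

# Count missing values per column
null_counts = df.isnull().sum()

# --- Visualization ---

# Plot a histogram (pandas plotting requires matplotlib)
import matplotlib.pyplot as plt
df['Age'].plot(kind='hist', bins=20)
plt.show()

# Use seaborn for more elaborate plots (requires the seaborn package)
import seaborn as sns

sns.barplot(x='City', y='Age', data=df)
plt.show()
```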
