Contents
- Introduction
- 1. Basic data structures
  - 1.1. Initializing a Series and simple operations
  - 1.2. Initializing a DataFrame and simple operations
    - 1.2.1. Initialization and persistence
    - 1.2.2. Reading and inspecting
    - 1.2.3. Row operations
    - 1.2.4. Column operations
    - 1.2.5. Selection and filtering
- 2. Data preprocessing
  - 2.0. Generating a sample table
  - 2.1. Handling missing values
  - 2.2. Type conversion and sorting
  - 2.3. Statistical analysis
- 3. Pivoting data
  - 3.0. Generating a sample table
  - 3.1. Building a pivot table
- 4. Reshaping data
  - 4.1. Hierarchical indexing
    - 4.1.1. Series with a two-level index
    - 4.1.2. DataFrame with a two-level index
  - 4.2. Discretization
    - 4.2.1. Grouped operations
    - 4.2.2. Binned labels
  - 4.3. Merging datasets
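Introduction
Pandas (Python Data Analysis Library) is a data analysis toolkit built on NumPy. It bundles a large number of libraries and some standard data models, and provides the tools needed to work efficiently with large datasets.
In the descriptions of Pandas library functions below, any parameter shown in a signature is given with its default value, and functions without a return value are not demonstrated in assignment form.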
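1. Basic data structures
1.1. Initializing a Series and simple operations
A Pandas `Series`, like a NumPy `array` or Python's built-in `list`, is a one-dimensional array that can hold values of different data types (mixed values are stored with dtype `object`).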
python
import pandas as pd
import numpy as np
# Build a Series with the default row labels, then inspect its index and values
s1 = pd.Series([-1, 0.7, False, np.nan])
'''
0 -1
1 0.7
2 False
3 NaN
dtype: object
'''
print(s1.index) # RangeIndex(start=0, stop=4, step=1)
print(s1.values) # [-1 0.7 False nan]
# Set the row labels, the Series name, and the index name
s2 = pd.Series([-1, 0.7, False, np.nan], index=list('abcd'), name='demo')
s2.index.name = 'index'
'''
index
a -1
b 0.7
c False
d NaN
Name: demo, dtype: object
'''
print(s2.index) # Index(['a', 'b', 'c', 'd'], dtype='object', name='index')
print(s2.values) # [-1 0.7 False nan]
# Look up values
print(s1[3]) # nan
print(s2['d'])
# Slicing: positional slices exclude the stop position; label slices include both endpoints
print(s1[::2])
'''
0 -1
2 False
dtype: object
'''
print(s2['a':'c'])
'''
a -1
b 0.7
c False
Name: demo, dtype: object
'''
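The difference between label-based and position-based slicing is easier to see when both are written out explicitly with `.loc` and `.iloc`. The following is a minimal sketch; the Series `s` and its values are illustrative and not part of the example above.

```python
import pandas as pd

# Illustrative Series with string labels (hypothetical data)
s = pd.Series([10, 20, 30, 40], index=list('abcd'))

# Label-based slicing with .loc includes both endpoints
by_label = s.loc['a':'c']

# Position-based slicing with .iloc excludes the stop position
by_position = s.iloc[0:2]

print(by_label.tolist())     # [10, 20, 30]
print(by_position.tolist())  # [10, 20]
```

Writing `.loc` or `.iloc` explicitly makes it clear which rule applies, which helps most when the index itself consists of integers.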
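1.2. Initializing a DataFrame and simple operations
A Pandas `DataFrame`, like R's `data.frame`, is a two-dimensional tabular data structure that acts as a container of `Series`.
1.2.1. Initialization and persistence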
python
import pandas as pd
import numpy as np
# 1. Build a table from a dict
df = pd.DataFrame({'A': pd.Timestamp('20250110'), 'B': pd.Series(0, index=list(range(4))), 'C': np.array([1]*4), 'D': 2,
'E': pd.Categorical(['test', 'train', 'test', 'train'])})
'''
A B C D E
0 2025-01-10 0 1 2 test
1 2025-01-10 0 1 2 train
2 2025-01-10 0 1 2 test
3 2025-01-10 0 1 2 train
'''
# Generate a date sequence
date = pd.date_range('20250110', periods=5)
'''
DatetimeIndex(['2025-01-10', '2025-01-11', '2025-01-12', '2025-01-13',
'2025-01-14'],
dtype='datetime64[ns]', freq='D')
'''
# 2. Build a table indexed by the date sequence
df = pd.DataFrame(np.random.randn(5, 3), index=date, columns=list('xyz'))
'''
x y z
2025-01-10 -0.274766 -0.593336 0.724735
2025-01-11 1.552149 -0.300292 0.061253
2025-01-12 0.411908 -0.470191 -0.893243
2025-01-13 -1.328169 -0.999390 -0.081419
2025-01-14 0.683950 -0.483677 -2.019955
'''
# Save the table; the date index is written as the first CSV column
df.to_csv(r'E:\Pycharm_Python\course\demo.csv')
1.2.2. Reading and inspecting
python
# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)
# View the first and last rows (5 rows by default)
print(df.head(2))
'''
x y z
2025-01-10 -0.274766 -0.593336 0.724735
2025-01-11 1.552149 -0.300292 0.061253
'''
print(df.tail(1))
'''
x y z
2025-01-14 0.68395 -0.483677 -2.019955
'''
# Inspect the column dtypes
print(df.dtypes)
'''
x float64
y float64
z float64
dtype: object
'''
# Inspect the row labels
print(df.index)
'''Index(['2025-01-10', '2025-01-11', '2025-01-12', '2025-01-13', '2025-01-14'], dtype='object')'''
# Inspect the column labels
print(df.columns)
'''Index(['x', 'y', 'z'], dtype='object')'''
# Inspect the underlying data
print(df.values)
'''
[[-0.27476573 -0.59333579 0.72473541]
[ 1.55214904 -0.30029235 0.06125304]
[ 0.41190756 -0.47019098 -0.89324306]
[-1.32816905 -0.99938983 -0.08141872]
[ 0.6839496 -0.48367661 -2.01995517]]
'''
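After the round trip through CSV, the row labels come back as plain strings rather than dates, as the `df.index` output above shows. One way to recover a `DatetimeIndex` is the `parse_dates` argument of `read_csv`. The sketch below uses an in-memory buffer instead of the file path from the text so that it is self-contained; the table contents are illustrative.

```python
import io

import numpy as np
import pandas as pd

# Illustrative table with a date index (hypothetical data)
dates = pd.date_range('20250110', periods=3)
df = pd.DataFrame(np.arange(6.0).reshape(3, 2), index=dates, columns=['x', 'y'])

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer)

# Read back without date parsing: the index holds strings
buffer.seek(0)
as_text = pd.read_csv(buffer, index_col=0)

# Read back with parse_dates=True: the index is parsed into dates
buffer.seek(0)
as_dates = pd.read_csv(buffer, index_col=0, parse_dates=True)

print(type(as_text.index).__name__)   # Index
print(type(as_dates.index).__name__)  # DatetimeIndex
```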
1.2.3. Row operations
python
# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)
# Select a single row
print(df.iloc[1])
'''
x 1.552149
y -0.300292
z 0.061253
Name: 2025-01-11, dtype: float64
'''
# Select multiple rows (after read_csv the row labels are strings such as '2025-01-13')
# print(df.loc['2025-01-13':'2025-01-14'])
print(df.iloc[3:5])
'''
x y z
2025-01-13 -1.328169 -0.999390 -0.081419
2025-01-14 0.683950 -0.483677 -2.019955
'''
# Append a single row (pd.concat is the public API; the private _append is avoided)
df = pd.concat([df, pd.Series(dict(zip('xyz', np.random.randn(3))), name='2025-01-15').to_frame().T])
'''
x y z
2025-01-10 -0.274766 -0.593336 0.724735
2025-01-11 1.552149 -0.300292 0.061253
2025-01-12 0.411908 -0.470191 -0.893243
2025-01-13 -1.328169 -0.999390 -0.081419
2025-01-14 0.683950 -0.483677 -2.019955
2025-01-15 1.205855 0.841471 0.843053
'''
# Delete a single row
df = df.drop(['2025-01-15'])
'''
x y z
2025-01-10 -0.274766 -0.593336 0.724735
2025-01-11 1.552149 -0.300292 0.061253
2025-01-12 0.411908 -0.470191 -0.893243
2025-01-13 -1.328169 -0.999390 -0.081419
2025-01-14 0.683950 -0.483677 -2.019955
'''
1.2.4. Column operations
python
# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)
# Select a single column
print(df['x'])
'''
2025-01-10 -0.274766
2025-01-11 1.552149
2025-01-12 0.411908
2025-01-13 -1.328169
2025-01-14 0.683950
Name: x, dtype: float64
'''
# Select multiple columns
print(df[['x', 'z']])
'''
x z
2025-01-10 -0.274766 0.724735
2025-01-11 1.552149 0.061253
2025-01-12 0.411908 -0.893243
2025-01-13 -1.328169 -0.081419
2025-01-14 0.683950 -2.019955
'''
# Add a single column
df['p'] = np.random.rand(5)
'''
x y z p
2025-01-10 -0.274766 -0.593336 0.724735 0.070785
2025-01-11 1.552149 -0.300292 0.061253 0.034027
2025-01-12 0.411908 -0.470191 -0.893243 0.446612
2025-01-13 -1.328169 -0.999390 -0.081419 0.545531
2025-01-14 0.683950 -0.483677 -2.019955 0.261958
'''
# Delete a single column
df = df.drop('p', axis=1)
print(df)
'''
x y z
2025-01-10 -0.274766 -0.593336 0.724735
2025-01-11 1.552149 -0.300292 0.061253
2025-01-12 0.411908 -0.470191 -0.893243
2025-01-13 -1.328169 -0.999390 -0.081419
2025-01-14 0.683950 -0.483677 -2.019955
'''
1.2.5. Selection and filtering
python
# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)
# Select a single value
print(df.loc['2025-01-12', 'y'])
'''-0.47019098075848'''
# Select a region
# print(df.loc[['2025-01-12', '2025-01-13'], ['x', 'z']])
print(df[['x', 'z']][2:4])
'''
x z
2025-01-12 0.411908 -0.893243
2025-01-13 -1.328169 -0.081419
'''
# Evaluate a condition
print(df['x'] > 0)
'''
2025-01-10 False
2025-01-11 True
2025-01-12 True
2025-01-13 False
2025-01-14 True
Name: x, dtype: bool
'''
# Filter by a condition
print(df[(df['x'] > 0)&(df['z'] > 0)])
'''
x y z
2025-01-11 1.552149 -0.300292 0.061253
'''
# Filter by a condition, then slice the result by position
print(df[df['x'] > 0][1:4])
'''
x y z
2025-01-12 0.411908 -0.470191 -0.893243
2025-01-14 0.683950 -0.483677 -2.019955
'''
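Filtering rows and picking columns can also be done in one `.loc` call, which avoids chaining two indexing steps and works for assignment as well as for reading. This is a small sketch on illustrative data; the frame and its labels `r1` to `r3` are assumptions made for the example.

```python
import pandas as pd

# Illustrative table (hypothetical data)
df = pd.DataFrame(
    {'x': [-1.0, 2.0, 3.0], 'y': [0.5, -0.5, 1.5], 'z': [1.0, 1.0, -1.0]},
    index=['r1', 'r2', 'r3'],
)

# Combine two conditions into one boolean mask
mask = (df['x'] > 0) & (df['z'] > 0)

# Select rows by the mask and columns by label in one step
subset = df.loc[mask, ['x', 'y']]
print(subset)

# The same form can be used to assign to the selected cells
df.loc[mask, 'y'] = 0.0
print(df.loc['r2', 'y'])  # 0.0
```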
2. Data preprocessing
2.0. Generating a sample table
python
import pandas as pd
import numpy as np
# Generate a date sequence
date = pd.date_range('20250110', periods=5)
# Build the table
df = pd.DataFrame(np.random.randn(5, 3), index=date, columns=list('xyz'))
# Generate a random boolean mask
mask = np.random.randint(0, 2, df.shape, dtype='bool')
# Set the masked cells to missing values
df[pd.DataFrame(mask, index=df.index, columns=df.columns)] = np.nan
# Save the table
df.to_csv(r'E:\Pycharm_Python\course\demo_nan.csv')
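2.1. Handling missing values
- `isnull()`: returns a boolean table of the same shape as the original, with `True` at every position where the original holds a missing value and `False` elsewhere.
- `fillna(value, inplace=False)`: returns the table with its missing values filled with `value`; with `inplace=True` the original table is replaced by the output (the same applies below).
- `replace(to_replace, value, inplace=False)`: returns the table with every occurrence of `to_replace` replaced by `value`; passing lists as the first two arguments performs a batch replacement.
- `dropna(axis=0, inplace=False)`: returns the table with the records along the given axis that contain missing values removed.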
python
# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo_nan.csv', index_col=0)
'''
x y z
2025-01-10 0.606495 NaN 0.456811
2025-01-11 NaN 0.743876 NaN
2025-01-12 0.024458 0.733735 NaN
2025-01-13 0.306332 NaN -0.586894
2025-01-14 NaN NaN NaN
'''
# Detect missing values
print(df.isnull())
'''
x y z
2025-01-10 False True False
2025-01-11 True False True
2025-01-12 False False True
2025-01-13 False True False
2025-01-14 True True True
'''
# Fill missing values
print(df['x'].fillna(df['x'].mean()))
'''
2025-01-10 0.606495
2025-01-11 0.312428
2025-01-12 0.024458
2025-01-13 0.306332
2025-01-14 0.312428
Name: x, dtype: float64
'''
# inplace=True modifies df itself, so print df rather than the call's return value
df.fillna(0, inplace=True)
print(df)
'''
x y z
2025-01-10 0.606495 0.000000 0.456811
2025-01-11 0.000000 0.743876 0.000000
2025-01-12 0.024458 0.733735 0.000000
2025-01-13 0.306332 0.000000 -0.586894
2025-01-14 0.000000 0.000000 0.000000
'''
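# Replace missing values (df was already filled above, so nothing is left to replace)
print(df.replace(np.nan, 0))
# Drop rows that contain missing values (likewise a no-op at this point)
print(df.dropna())
# Save the table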
df.to_csv(r'E:\Pycharm_Python\course\demo.csv')
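To see the effect of each function described in this section on data that still contains gaps, the following sketch applies them to a small illustrative table. The values are assumptions chosen so that the results are easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Illustrative table with two missing cells (hypothetical data)
df = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [np.nan, 5.0, 6.0]})

# Count missing cells via the boolean table from isnull()
n_missing = int(df.isnull().sum().sum())

# fillna returns a new table by default; df itself is unchanged
filled = df.fillna(0)

# axis=0 drops rows containing a gap, axis=1 drops columns containing a gap
dropped_rows = df.dropna(axis=0)
dropped_cols = df.dropna(axis=1)

print(n_missing)                    # 2
print(dropped_rows.index.tolist())  # [2]
print(dropped_cols.shape)           # (3, 0)

# With inplace=True the original table itself is modified
df.fillna(0, inplace=True)
print(int(df.isnull().sum().sum()))  # 0
```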
2.2. Type conversion and sorting
python
# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)
# Type conversion
print(df.astype(bool))
'''
x y z
2025-01-10 True False True
2025-01-11 False True False
2025-01-12 True True False
2025-01-13 True False True
2025-01-14 False False False
'''
# Sort in descending order
print(df.sort_values(by='x', ascending=False))
'''
x y z
2025-01-10 0.606495 0.000000 0.456811
2025-01-13 0.306332 0.000000 -0.586894
2025-01-12 0.024458 0.733735 0.000000
2025-01-11 0.000000 0.743876 0.000000
2025-01-14 0.000000 0.000000 0.000000
'''
# Sort in ascending order by several keys, in priority order
print(df.sort_values(by=['z', 'y'], ascending=True))
'''
x y z
2025-01-13 0.306332 0.000000 -0.586894
2025-01-14 0.000000 0.000000 0.000000
2025-01-12 0.024458 0.733735 0.000000
2025-01-11 0.000000 0.743876 0.000000
2025-01-10 0.606495 0.000000 0.456811
'''
2.3. Statistical analysis
python
# Descriptive statistics
print(df.describe())
'''
x y z
count 5.000000 5.000000 5.000000
mean 0.187457 0.295522 -0.026017
std 0.267662 0.404676 0.370721
min 0.000000 0.000000 -0.586894
25% 0.000000 0.000000 0.000000
50% 0.024458 0.000000 0.000000
75% 0.306332 0.733735 0.000000
max 0.606495 0.743876 0.456811
'''
# Maximum
print(df['x'].max()) # 0.606494812593188
# Minimum
print(df['x'].min()) # 0.0
# Mean
print(df['x'].mean()) # 0.18745690178195
# Median
print(df['x'].median()) # 0.0244579357029649
# Variance
print(df['x'].var()) # 0.07164321143837143
# Standard deviation
print(df['x'].std()) # 0.26766249538994336
# Count of non-missing values
print(df['z'].count()) # 5
# Unique values
print(df['z'].unique()) # [ 0.45681124 0. -0.58689448]
# Count of each distinct value
print(df['z'].value_counts())
'''
z
0.000000 3
0.456811 1
-0.586894 1
'''
# Sum
print(df.sum())
'''
x 0.937285
y 1.477611
z -0.130083
dtype: float64
'''
# Correlation coefficients
print(df.corr())
'''
x y z
x 1.000000 -0.597883 0.306501
y -0.597883 1.000000 0.064061
z 0.306501 0.064061 1.000000
'''
# Covariance
print(df.cov())
'''
x y z
x 0.071643 -0.064761 0.030414
y -0.064761 0.163763 0.009611
z 0.030414 0.009611 0.137434
'''
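One detail worth keeping in mind when reading the variance and standard deviation above: Pandas computes the sample statistic by default (`ddof=1`), whereas NumPy's `np.var` defaults to the population statistic (`ddof=0`). A short sketch on illustrative values:

```python
import numpy as np
import pandas as pd

# Illustrative values (hypothetical data); the mean is 2.5
s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Pandas default: sample variance, divides by n - 1
sample_var = s.var()

# Population variance, divides by n
population_var = s.var(ddof=0)

# NumPy default matches the population variance
numpy_var = np.var(s.to_numpy())

print(sample_var)      # sum of squared deviations 5.0 divided by 3
print(population_var)  # 1.25
print(numpy_var)       # 1.25
```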
3. Pivoting data
3.0. Generating a sample table
python
import pandas as pd
import numpy as np
# Generate the data
hour = np.random.randint(0, 24, (1000, 1))
area = np.random.randint(0, 10, 1000)
displacement = np.random.randn(1000, 3)
# Concatenate the arrays into one table
a = np.concatenate((hour, displacement), axis=1)
df = pd.DataFrame(a, index=area, columns=['hour', 'x', 'y', 'z'])
df.index.name = 'area'
# Type conversion
df['hour'] = df['hour'].astype('int64')
# Save the table
df.to_csv(r'E:\Pycharm_Python\course\demo.csv')
'''
hour x y z
area
9 18 1.453873 -0.452853 0.126672
5 20 -0.541874 -0.798552 0.209252
9 12 0.848762 -0.734806 0.124415
1 13 0.794053 1.838139 -0.268814
8 2 -0.115496 2.054565 0.860301
... ... ... ... ...
9 21 -0.212381 0.355993 -1.124492
1 20 -0.010173 0.408953 -0.275197
2 15 0.334253 0.231890 3.557654
0 3 -0.383228 -0.562431 2.418784
8 12 -1.004758 -0.539583 1.589166
[1000 rows x 4 columns]
'''
3.1. Building a pivot table
python
# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)
# Set display limits
pd.set_option('display.max_columns', 4)
pd.set_option('display.max_rows', 10)
# Build the pivot table: fill missing values with 0 and add an 'All' margins row with the overall aggregates
pt = pd.pivot_table(df, index=['area', 'hour'], values=['x', 'y'], aggfunc=['sum', 'mean'], fill_value=0, margins=True)
'''
sum mean
x y x y
area hour
0 0 4.372290 0.019988 1.093072 0.004997
1 -5.463018 3.510755 -1.092604 0.702151
2 0.429444 -2.022444 0.429444 -2.022444
3 1.954055 -1.683926 0.488514 -0.420981
4 -2.226930 -3.827011 -1.113465 -1.913506
... ... ... ... ...
9 20 -1.422801 0.262971 -0.237134 0.043829
21 -2.720063 1.411410 -0.544013 0.282282
22 -0.342656 -0.502878 -0.171328 -0.251439
23 -1.024892 -0.385198 -0.170815 -0.064200
All -24.847056 19.244009 -0.024847 0.019244
[239 rows x 4 columns]
'''
# Use a dict to choose the aggregation function for each column
pt = pd.pivot_table(df, index=['area', 'hour'], aggfunc={'x': 'sum', 'y': 'mean'})
'''
x y
area hour
0 0 4.372290 0.004997
1 -5.463018 0.702151
2 0.429444 -2.022444
3 1.954055 -0.420981
4 -2.226930 -1.913506
... ... ...
9 19 1.744798 -0.405821
20 -1.422801 0.043829
21 -2.720063 0.282282
22 -0.342656 -0.251439
23 -1.024892 -0.064200
[238 rows x 2 columns]
'''
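A pivot table with only row keys is closely related to a grouped aggregation. The sketch below compares `pd.pivot_table` with `groupby` on a small illustrative table; the column names mirror the ones used in this section, but the values are assumptions picked for easy checking.

```python
import pandas as pd

# Illustrative table (hypothetical data)
df = pd.DataFrame({
    'area': [0, 0, 0, 1, 1],
    'hour': [1, 1, 2, 1, 2],
    'x': [1.0, 3.0, 5.0, 7.0, 9.0],
})

# Pivot table keyed on (area, hour), summing x within each group
pt = pd.pivot_table(df, index=['area', 'hour'], values=['x'], aggfunc='sum')

# The same aggregation written as a groupby
gb = df.groupby(['area', 'hour'])[['x']].sum()

print(pt['x'].tolist())  # [4.0, 5.0, 7.0, 9.0]
print(gb['x'].tolist())  # [4.0, 5.0, 7.0, 9.0]
```

`pivot_table` becomes the more convenient form once column keys, several aggregation functions, `fill_value`, or `margins` are needed.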
4. Reshaping data
4.1. Hierarchical indexing
4.1.1. Series with a two-level index
python
import pandas as pd
import numpy as np
# Two-level index (the second level holds integers)
index = pd.MultiIndex.from_arrays([list('aaabbccdd'), list(map(int, '123121212'))], names=('area', 'numbers'))
'''MultiIndex([('a', 1),
('a', 2),
('a', 3),
('b', 1),
('b', 2),
('c', 1),
('c', 2),
('d', 1),
('d', 2)],
names=['area', 'numbers'])
'''
# Initialize
s = pd.Series(np.random.randn(9), index=index)
print(s)
'''
area numbers
a 1 0.417328
2 0.168057
3 1.252186
b 1 -1.835490
2 0.951358
c 1 -1.903762
2 -0.075067
d 1 0.782123
2 0.355078
dtype: float64
'''
# Select a single element
print(s['a', 1]) # 0.417328381875337
# Select by the outer level
print(s['a'])
'''
numbers
1 0.417328
2 0.168057
3 1.252186
dtype: float64
'''
# Slice on a single level
print(s['a':'b'])
'''
area numbers
a 1 0.417328
2 0.168057
3 1.252186
b 1 -1.835490
2 0.951358
dtype: float64
'''
print(s[:, 2])
'''
area
a 0.168057
b 0.951358
c -0.075067
d 0.355078
dtype: float64
'''
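A two-level Series can be reshaped into a table by moving one index level into the columns with `unstack`. The following sketch uses a smaller illustrative Series with the same level names as above; the values are assumptions.

```python
import pandas as pd

# Illustrative two-level Series (hypothetical data)
index = pd.MultiIndex.from_arrays(
    [list('aabb'), [1, 2, 1, 2]], names=('area', 'numbers'))
s = pd.Series([0.1, 0.2, 0.3, 0.4], index=index)

# Move the 'numbers' level into the columns: rows are areas
wide = s.unstack('numbers')

# Move the 'area' level into the columns instead: rows are numbers
tall = s.unstack('area')

print(wide.loc['b', 2])  # 0.4
print(tall.loc[2, 'b'])  # 0.4
```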
4.1.2. DataFrame with a two-level index
python
# Set display limits
pd.set_option('display.max_columns', 4)
pd.set_option('display.max_rows', 8)
# Two-level labels for both rows and columns
index = pd.MultiIndex.from_arrays([list('aabbccdd'), list(map(int, '12121212'))], names=('area', 'numbers'))
columns = pd.MultiIndex.from_tuples([('t1', 'x'), ('t1', 'y'), ('t2', 'x'), ('t2', 'y')])
df = pd.DataFrame(np.random.randn(8, 4), index=index, columns=columns)
'''
t1 t2
x y x y
area numbers
a 1 -0.125867 -1.722040 -0.266579 0.910084
2 0.060483 0.750894 -0.479338 0.608312
b 1 0.345995 1.470237 1.763323 -0.336475
2 -1.977062 0.071204 0.000797 -0.323753
c 1 0.963804 0.186688 0.443276 0.615650
2 1.729371 -0.775489 1.663172 -0.657688
d 1 0.376276 0.693671 0.982811 -0.393840
2 -0.632945 -2.046240 0.865305 1.150940
'''
# Select a single value
print(df.loc[('a', 1), ('t1', 'x')]) # -0.12586716795606423
# Select rows by the outer level
print(df.loc['a'])
'''
t1 t2
x y x y
numbers
1 -0.125867 -1.722040 -0.266579 0.910084
2 0.060483 0.750894 -0.479338 0.608312
'''
# Select columns by the outer level
print(df['t1'])
'''
x y
area numbers
a 1 -0.125867 -1.722040
2 0.060483 0.750894
b 1 0.345995 1.470237
2 -1.977062 0.071204
c 1 0.963804 0.186688
2 1.729371 -0.775489
d 1 0.376276 0.693671
2 -0.632945 -2.046240
'''
# Swap the index levels
df = df.swaplevel('area', 'numbers')
'''
t1 t2
x y x y
numbers area
1 a -0.125867 -1.722040 -0.266579 0.910084
2 a 0.060483 0.750894 -0.479338 0.608312
1 b 0.345995 1.470237 1.763323 -0.336475
2 b -1.977062 0.071204 0.000797 -0.323753
1 c 0.963804 0.186688 0.443276 0.615650
2 c 1.729371 -0.775489 1.663172 -0.657688
1 d 0.376276 0.693671 0.982811 -0.393840
2 d -0.632945 -2.046240 0.865305 1.150940
'''
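As the output above shows, `swaplevel` only reorders the levels; the rows stay in their original order. A common follow-up is `sort_index`, and `xs` offers a way to select on the inner level without swapping at all. A minimal sketch on an illustrative frame (the values are assumptions):

```python
import numpy as np
import pandas as pd

# Illustrative frame with a two-level row index (hypothetical data)
index = pd.MultiIndex.from_arrays(
    [list('aabb'), [1, 2, 1, 2]], names=('area', 'numbers'))
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=['x', 'y'])

# Swapping levels keeps the original row order
swapped = df.swaplevel('area', 'numbers')
print(swapped.index.tolist())  # [(1, 'a'), (2, 'a'), (1, 'b'), (2, 'b')]

# Sorting groups the rows by the new outer level
sorted_df = swapped.sort_index()
print(sorted_df['x'].tolist())  # [0, 4, 2, 6]

# Cross-section: all rows whose 'numbers' level equals 2
cross = df.xs(2, level='numbers')
print(cross['x'].tolist())  # [2, 6]
```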
4.2. Discretization
4.2.1. Grouped operations
python
# Set display limits
pd.set_option('display.max_rows', 6)
# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)
'''
hour x y z
area
9 18 1.453873 -0.452853 0.126672
5 20 -0.541874 -0.798552 0.209252
9 12 0.848762 -0.734806 0.124415
... ... ... ... ...
2 15 0.334253 0.231890 3.557654
0 3 -0.383228 -0.562431 2.418784
8 12 -1.004758 -0.539583 1.589166
[1000 rows x 4 columns]
'''
# Group the whole table by its index and a column, then aggregate
print(df.groupby([df.index, df['hour']]).mean())
'''
x y z
area hour
0 0 1.093072 0.004997 -0.183101
1 -1.092604 0.702151 -0.556797
2 0.429444 -2.022444 -0.346545
... ... ... ...
9 21 -0.544013 0.282282 -0.191250
22 -0.171328 -0.251439 -0.022121
23 -0.170815 -0.064200 -0.244528
[238 rows x 3 columns]
'''
# Group a single column by the index, then aggregate
print(df['x'].groupby(df.index).sum())
'''
area
0 -10.897082
1 -4.915652
2 -13.841750
...
7 -4.954806
8 -6.258694
9 2.828964
Name: x, Length: 10, dtype: float64
'''
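When several statistics are needed per group, named aggregation with `agg` computes them in one pass and gives each result column an explicit name. The sketch below assumes a small illustrative table with an `area` index like the one in this section; the output column names `x_sum`, `x_mean`, and `n` are made up for the example.

```python
import pandas as pd

# Illustrative table indexed by area (hypothetical data)
df = pd.DataFrame(
    {'hour': [1, 1, 2, 2], 'x': [1.0, 3.0, 5.0, 7.0]},
    index=pd.Index([0, 0, 0, 1], name='area'),
)

# Group by the index level name and compute several statistics at once
summary = df.groupby('area').agg(
    x_sum=('x', 'sum'),
    x_mean=('x', 'mean'),
    n=('x', 'size'),
)
print(summary)
```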
4.2.2. Binned labels
python
# Set display limits
pd.set_option('display.max_rows', 6)
# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)
'''
hour x y z
area
9 18 1.453873 -0.452853 0.126672
5 20 -0.541874 -0.798552 0.209252
9 12 0.848762 -0.734806 0.124415
... ... ... ... ...
2 15 0.334253 0.231890 3.557654
0 3 -0.383228 -0.562431 2.418784
8 12 -1.004758 -0.539583 1.589166
[1000 rows x 4 columns]
'''
# Bin by fixed numeric intervals
bins = [0, 0.1, 0.4, 0.8, 1.6, 3.2, 4.0]
labels = ['E', 'D', 'C', 'B', 'A', 'S']
df['rank_x'] = pd.cut(df['x'].abs(), bins, labels=labels)
'''
hour x y z rank_x
area
9 18 1.453873 -0.452853 0.126672 B
5 20 -0.541874 -0.798552 0.209252 C
9 12 0.848762 -0.734806 0.124415 B
... ... ... ... ... ...
2 15 0.334253 0.231890 3.557654 D
0 3 -0.383228 -0.562431 2.418784 D
8 12 -1.004758 -0.539583 1.589166 B
[1000 rows x 5 columns]
'''
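# Bin by percentile intervals
# Note: the edges below are percentiles of the signed x values, while the values being binned are |x|;
# to bin |x| by its own percentiles, compute the edges from df['x'].abs() instead
bins = np.percentile(df['x'], [0, 25, 50, 70, 85, 95, 100])
labels = ['E', 'D', 'C', 'B', 'A', 'S']
df['rank_x'] = pd.cut(df['x'].abs(), bins, labels=labels)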
print(df)
'''
hour x y z rank_x
area
9 18 1.453873 -0.452853 0.126672 A
5 20 -0.541874 -0.798552 0.209252 B
9 12 0.848762 -0.734806 0.124415 B
... ... ... ... ... ...
2 15 0.334253 0.231890 3.557654 C
0 3 -0.383228 -0.562431 2.418784 C
8 12 -1.004758 -0.539583 1.589166 A
[1000 rows x 5 columns]
'''
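For percentile-based bins, `pd.qcut` computes the edges from the data being binned, so the edges and the binned values cannot come from different series. The sketch below is one possible alternative to the `np.percentile` plus `pd.cut` combination, shown on illustrative values; the labels `low` and `high` are assumptions.

```python
import pandas as pd

# Illustrative signed values (hypothetical data)
values = pd.Series([-4.0, -3.0, -2.0, -1.0, 1.0, 2.0, 3.0, 4.0])
magnitude = values.abs()

# Split the magnitudes at their own median (2.5 for this data)
ranks = pd.qcut(magnitude, q=[0, 0.5, 1.0], labels=['low', 'high'])

print(ranks.tolist())
# ['high', 'high', 'low', 'low', 'low', 'low', 'high', 'high']
```

Unlike `pd.cut` with explicit edges, `qcut` includes the smallest value in the first bin, so no value is left unlabeled at the lower edge.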
4.3. Merging datasets
python
# Set display limits
pd.set_option('display.max_rows', 6)
# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)
'''
hour x y z
area
9 18 1.453873 -0.452853 0.126672
5 20 -0.541874 -0.798552 0.209252
9 12 0.848762 -0.734806 0.124415
... ... ... ... ...
2 15 0.334253 0.231890 3.557654
0 3 -0.383228 -0.562431 2.418784
8 12 -1.004758 -0.539583 1.589166
[1000 rows x 4 columns]
'''
# Append one table to another, and concatenate several tables (pd.concat replaces the private _append)
print(pd.concat([df.iloc[:3], df.iloc[3:6]]))
print(pd.concat([df.iloc[:2], df.iloc[2:4], df.iloc[4:6]], axis=0))
'''
hour x y z
area
9 18 1.453873 -0.452853 0.126672
5 20 -0.541874 -0.798552 0.209252
9 12 0.848762 -0.734806 0.124415
1 13 0.794053 1.838139 -0.268814
8 2 -0.115496 2.054565 0.860301
9 9 1.235167 1.030952 -0.517618
'''
# Merge
# Sample tables (only the selected columns are kept, so 'hour' is not included)
df1 = df.iloc[2:5][['x', 'y']]
'''
x y
area
9 0.848762 -0.734806
1 0.794053 1.838139
8 -0.115496 2.054565
'''
df2 = df.iloc[2:5][['x', 'z']].sample(frac=1)
'''
x z
area
8 -0.115496 0.860301
1 0.794053 -0.268814
9 0.848762 0.124415
'''
print(pd.merge(df1, df2, on='area'))
'''
x_x y x_y z
area
9 0.848762 -0.734806 0.848762 0.124415
1 0.794053 1.838139 0.794053 -0.268814
8 -0.115496 2.054565 -0.115496 0.860301
'''
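The merge above keeps only keys present in both tables and uses the default `_x` and `_y` suffixes for the overlapping column. The sketch below shows how the `how`, `suffixes`, and `indicator` arguments of `pd.merge` change that behavior. The two tables are illustrative, with `area` as an ordinary column rather than the index.

```python
import pandas as pd

# Illustrative tables that share some, but not all, area keys (hypothetical data)
left = pd.DataFrame({'area': [9, 1, 8], 'x': [0.8, 0.7, -0.1]})
right = pd.DataFrame({'area': [8, 1, 5], 'x': [-0.1, 0.7, 0.3], 'z': [0.9, -0.3, 0.2]})

# Inner merge (the default) with custom suffixes for the overlapping column
inner = pd.merge(left, right, on='area', suffixes=('_left', '_right'))
print(inner.columns.tolist())  # ['area', 'x_left', 'x_right', 'z']

# Outer merge keeps unmatched keys; indicator=True records where each row came from
outer = pd.merge(left, right, on='area', how='outer', indicator=True)
print(outer[['area', '_merge']])
```

Rows that exist in only one table get `NaN` in the columns that come from the other table, and the `_merge` column marks them as `left_only` or `right_only`.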