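How to Concatenate Data in pandas
Concatenating data in pandas comes up fairly often. For example, when two data sets contain duplicate records and need to be combined, taking their intersection or union is a good option. In the spirit of "you can never have too many skills", let's take a quick look;
Before moving on to data transformation, we first cover some basics of data concatenation.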
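1 join(): joining
merge is not covered here. I may write a separate article on it later, because it works much like table joins in SQL and cannot be explained in a few lines;
join() combines 2 DataFrames into one, aligning rows on the index. By default the two DataFrames must not share any column names;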
    # -*- coding: utf-8 -*-
    import pandas as pd
    import numpy as np

    data1 = {
        'user': ['zszxz', 'craler', 'rose'],
        'price': [100, 200, 300],
        'hobby': ['reading', 'running', 'hiking']
    }
    index1 = ['user1', 'user2', 'user3']
    frame1 = pd.DataFrame(data1, index=index1)

    data2 = {
        'person': ['zszxz', 'craler', 'rose'],
        'number': [100, 2000, 3000],
        'activity': ['swing', 'riding', 'climbing']
    }
    index2 = ['user1', 'user2', 'user3']
    frame2 = pd.DataFrame(data2, index=index2)

    # Rows are matched on the shared index labels user1..user3
    joined = frame1.join(frame2)
    print(joined)
Output
             user  price    hobby  person  number  activity
    user1   zszxz    100  reading   zszxz     100     swing
    user2  craler    200  running  craler    2000    riding
    user3    rose    300   hiking    rose    3000  climbing
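The "no shared column names" condition can be worked around with suffixes. The following is a minimal sketch, not part of the original example: the frames `left` and `right` and the suffixes `_l` / `_r` are made up for illustration. It shows join() raising a ValueError on an overlapping column, and the same join succeeding once `lsuffix` and `rsuffix` are supplied.

```python
import pandas as pd

# Hypothetical frames that both contain a column named 'price'
left = pd.DataFrame({'price': [100, 200]}, index=['user1', 'user2'])
right = pd.DataFrame({'price': [110, 210]}, index=['user1', 'user2'])

try:
    left.join(right)
    overlap_error = None
except ValueError as exc:
    # pandas refuses to join when column names collide and no suffix is given
    overlap_error = exc

# Suffixes disambiguate the overlapping column names
joined = left.join(right, lsuffix='_l', rsuffix='_r')
print(list(joined.columns))
```

Whether suffixes or a rename beforehand reads better depends on the data; suffixes are simply the quickest fix.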
2 concat(): concatenating
The concat() function combines 2 Series into one. By default it concatenates along rows (axis=0);
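    import numpy as np
    import pandas as pd

    # np.nan is used because the np.NaN alias was removed in NumPy 2.0
    ser1 = pd.Series(['111', '222', np.nan])
    ser2 = pd.Series(['333', '444', np.nan])
    # Concatenate along rows (the default)
    print(pd.concat([ser1, ser2]))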
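One detail of row-wise concatenation is easy to miss: the original index labels are kept, so they repeat in the result. A small sketch of this, reusing the two Series from above (the variable names `combined` and `renumbered` are only illustrative), together with `ignore_index=True`, which renumbers the result:

```python
import numpy as np
import pandas as pd

ser1 = pd.Series(['111', '222', np.nan])
ser2 = pd.Series(['333', '444', np.nan])

# Default: each Series keeps its own labels 0, 1, 2
combined = pd.concat([ser1, ser2])
print(list(combined.index))

# ignore_index=True discards the old labels and numbers the rows from 0
renumbered = pd.concat([ser1, ser2], ignore_index=True)
print(list(renumbered.index))
```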
To concatenate along columns, set axis=1
    import numpy as np
    import pandas as pd

    ser1 = pd.Series(['111', '222', np.nan])
    ser2 = pd.Series(['333', '444', np.nan])
    # Concatenate along columns
    print(pd.concat([ser1, ser2], axis=1))
Output
         0    1
    0  111  333
    1  222  444
    2  NaN  NaN
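Going one step further, the keys parameter labels the columns of the result (with axis=1 the result is already a DataFrame)
    import numpy as np
    import pandas as pd

    ser1 = pd.Series(['111', '222', np.nan])
    ser2 = pd.Series(['333', '444', np.nan])
    # Concatenate along columns and name each column
    data = pd.concat([ser1, ser2], axis=1, keys=['zszxz', 'rzxx'])
    print(data)
Output
       zszxz  rzxx
    0    111   333
    1    222   444
    2    NaN   NaN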
Note: concat() works on DataFrames in much the same way as on Series;
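To make the note about DataFrames concrete, here is a short sketch. The frames `df1` and `df2` and the key names `first` / `second` are invented for illustration. Row-wise concatenation stacks the rows, while axis=1 aligns on the index and leaves NaN where one side has no matching label.

```python
import pandas as pd

# Hypothetical frames with the same columns but different lengths
df1 = pd.DataFrame({'user': ['zszxz', 'craler'], 'price': [100, 200]})
df2 = pd.DataFrame({'user': ['rose'], 'price': [300]})

# Row-wise: three rows, renumbered 0..2
rows = pd.concat([df1, df2], ignore_index=True)
print(rows)

# Column-wise: keys become an outer column level; df2 has no row 1, so NaN appears
cols = pd.concat([df1, df2], axis=1, keys=['first', 'second'])
print(cols)
```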
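3 combine_first(): combining
When two objects have overlapping index labels, combine_first() can be used: it keeps the caller's values and fills its missing values and missing labels from the other object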
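    import numpy as np
    import pandas as pd

    ser1 = pd.Series(['111', '222', np.nan], index=[1, 2, 3])
    ser2 = pd.Series(['333', '444', np.nan, '555'], index=[1, 2, 3, 4])
    # ser1 is the base; label 4 exists only in ser2 and is taken from there
    data = ser1.combine_first(ser2)
    print(data)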
Output
    1    111
    2    222
    3    NaN
    4    555
    dtype: object
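Swap the two Series and ser2 becomes the base, so its values take precedence;
    import numpy as np
    import pandas as pd

    ser1 = pd.Series(['111', '222', np.nan], index=[1, 2, 3])
    ser2 = pd.Series(['333', '444', np.nan, '555'], index=[1, 2, 3, 4])
    # ser2 is now the base
    data = ser2.combine_first(ser1)
    print(data)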
Output
    1    333
    2    444
    3    NaN
    4    555
    dtype: object
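In both examples above, label 3 is missing in both Series, so the filling behaviour at an overlapping label is not visible. A minimal sketch with made-up numeric data (the names `primary` and `backup` are hypothetical) shows a gap in the base Series being filled from the other one:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'primary' has a gap at label 2 that 'backup' can fill
primary = pd.Series([1.0, np.nan, 3.0], index=[1, 2, 3])
backup = pd.Series([10.0, 20.0, 30.0, 40.0], index=[1, 2, 3, 4])

# Existing values in 'primary' win; only the gap and the extra label come from 'backup'
result = primary.combine_first(backup)
print(result)
```

Labels 1 and 3 keep the values from `primary`, label 2 is filled with 20.0, and label 4 is added from `backup`.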
4 Axis conversion
Sample data
    # -*- coding: utf-8 -*-
    import pandas as pd
    import numpy as np

    data = {
        'user': ['zszxz', 'craler', 'rose'],
        'price': [100, 200, 300],
        'hobby': ['reading', 'running', 'hiking']
    }
    index = ['user1', 'user2', 'user3']
    frame = pd.DataFrame(data, index=index)
    print(frame)
Output
             user  price    hobby
    user1   zszxz    100  reading
    user2  craler    200  running
    user3    rose    300   hiking
stack() pivots the columns into rows; the column labels become an inner index level
    # -*- coding: utf-8 -*-
    import pandas as pd
    import numpy as np

    data = {
        'user': ['zszxz', 'craler', 'rose'],
        'price': [100, 200, 300],
        'hobby': ['reading', 'running', 'hiking']
    }
    index = ['user1', 'user2', 'user3']
    frame = pd.DataFrame(data, index=index)
    # The result is a Series with a two-level index (row label, column label)
    print(frame.stack())
Output
    user1  user       zszxz
           price        100
           hobby    reading
    user2  user      craler
           price        200
           hobby    running
    user3  user        rose
           price        300
           hobby     hiking
    dtype: object
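The stacked result is a Series whose index has two levels, which changes how values are looked up. A short sketch using the same sample data (the variable name `stacked` is only illustrative):

```python
import pandas as pd

data = {
    'user': ['zszxz', 'craler', 'rose'],
    'price': [100, 200, 300],
    'hobby': ['reading', 'running', 'hiking'],
}
frame = pd.DataFrame(data, index=['user1', 'user2', 'user3'])
stacked = frame.stack()

# Two index levels: the original row label and the former column label
print(stacked.index.nlevels)

# Selecting on the outer level returns all former columns of that row
print(stacked.loc['user2'])

# A (row, column) tuple selects a single value
print(stacked.loc[('user3', 'hobby')])
```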
unstack() turns the stacked data back into the original structure
    # -*- coding: utf-8 -*-
    import pandas as pd
    import numpy as np

    data = {
        'user': ['zszxz', 'craler', 'rose'],
        'price': [100, 200, 300],
        'hobby': ['reading', 'running', 'hiking']
    }
    index = ['user1', 'user2', 'user3']
    frame = pd.DataFrame(data, index=index)
    sta = frame.stack()
    # The inner index level moves back to the columns
    print(sta.unstack())
Output
             user  price    hobby
    user1   zszxz    100  reading
    user2  craler    200  running
    user3    rose    300   hiking
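The printed result looks identical to the original frame, but the round trip is not fully lossless for mixed column types: stacking text and integer columns together produces an object Series, so after unstack() the price column is object as well. A small sketch of checking this and asking pandas to re-infer the types with infer_objects() (the names `restored` and `fixed` are illustrative):

```python
import pandas as pd

data = {
    'user': ['zszxz', 'craler', 'rose'],
    'price': [100, 200, 300],
    'hobby': ['reading', 'running', 'hiking'],
}
frame = pd.DataFrame(data, index=['user1', 'user2', 'user3'])

restored = frame.stack().unstack()
print(restored.dtypes)   # 'price' is no longer an integer column

# infer_objects() re-detects a more specific dtype for each object column
fixed = restored.infer_objects()
print(fixed.dtypes)
```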