所有 Notebook 的源文件都可以在 GitHub 上访问到呀~

Pivot Tables 透视表

  • pivot_table 是一种为了特定目标而对dataframe中的数据进行汇总的一种方式
  • 频繁使用聚合功能aggregation
  • pivot_table本身是一个dataframe,行代表你感兴趣的一个变量,列代表另一个变量,以及单元格的一些aggregate值
  • pivot_table还会包括一些margina value,也就是每一行、列的总和,这对于观察两个变量之间的关系非常方便
In [6]:
import pandas as pd
import numpy as np
In [7]:
#数据来源于政府公开数据集,其中包含可以购买的不同电动车的效能
df = pd.read_csv("cars.csv")
In [8]:
#通过head可以看到,有model\year\size\statistics
df.head()
Out[8]:
YEAR Make Model Size (kW) Unnamed: 5 TYPE CITY (kWh/100 km) HWY (kWh/100 km) COMB (kWh/100 km) CITY (Le/100 km) HWY (Le/100 km) COMB (Le/100 km) (g/km) RATING (km) TIME (h)
0 2012 MITSUBISHI i-MiEV SUBCOMPACT 49 A1 B 16.9 21.4 18.7 1.9 2.4 2.1 0 NaN 100 7
1 2012 NISSAN LEAF MID-SIZE 80 A1 B 19.3 23.0 21.1 2.2 2.6 2.4 0 NaN 117 7
2 2013 FORD FOCUS ELECTRIC COMPACT 107 A1 B 19.0 21.1 20.0 2.1 2.4 2.2 0 NaN 122 4
3 2013 MITSUBISHI i-MiEV SUBCOMPACT 49 A1 B 16.9 21.4 18.7 1.9 2.4 2.1 0 NaN 100 7
4 2013 NISSAN LEAF MID-SIZE 80 A1 B 19.3 23.0 21.1 2.2 2.6 2.4 0 NaN 117 7
In [9]:
#pivot table可以将其中的一列透视到一个新的column header中,并和其它列对比
#目的:比较电动车的品牌、年份、电池容量
#传入values为(kW),index为YEAR,columns为Make,指定aggregation函数为np.mean
df.pivot_table(values = "(kW)", index = "YEAR", columns = "Make", aggfunc = np.mean)
Out[9]:
Make BMW CHEVROLET FORD KIA MITSUBISHI NISSAN SMART TESLA
YEAR
2012 NaN NaN NaN NaN 49.0 80.0 NaN NaN
2013 NaN NaN 107.0 NaN 49.0 80.0 35.0 280.000000
2014 NaN 104.0 107.0 NaN 49.0 80.0 35.0 268.333333
2015 125.0 104.0 107.0 81.0 49.0 80.0 35.0 320.666667
2016 125.0 104.0 107.0 81.0 49.0 80.0 35.0 409.700000
  • 从结果中可以看到,其中有一些NaN,这是因为有一些品牌如FORD没有2012的条目
  • 还可以看出,大部分品牌随着时间推移,电池容量并没有改变,除了TESLA

Test

Suppose we have a DataFrame with price and ratings for different bikes, broken down by manufacturer and type of bicycle. Creat a pivot table that shows the mean price and mean rating for every "Manufacturer" / "Bike Type" combination

answer

print(pd.pivot_table(Bikes, index=["Manufacturer","Bike Type"]))

mysolution

Bikes.pivot_table(values = ["Price","Rating"],index=["Manufacturer","Bike Type"],aggfunc={"Price" : np.mean, "Bike Type":np.mean})
In [5]:
#aggfunc不仅仅可以是一个,还可以传递多个,,使用list/dict
df.pivot_table(values = "(kW)", index = "YEAR", columns = "Make", aggfunc = [np.mean, np.min], margins = True)
Out[5]:
mean amin
Make BMW CHEVROLET FORD KIA MITSUBISHI NISSAN SMART TESLA All BMW CHEVROLET FORD KIA MITSUBISHI NISSAN SMART TESLA All
YEAR
2012 NaN NaN NaN NaN 49.0 80.0 NaN NaN 64.500000 NaN NaN NaN NaN 49.0 80.0 NaN NaN 49
2013 NaN NaN 107.0 NaN 49.0 80.0 35.0 280.000000 158.444444 NaN NaN 107.0 NaN 49.0 80.0 35.0 270.0 35
2014 NaN 104.0 107.0 NaN 49.0 80.0 35.0 268.333333 135.000000 NaN 104.0 107.0 NaN 49.0 80.0 35.0 225.0 35
2015 125.0 104.0 107.0 81.0 49.0 80.0 35.0 320.666667 181.428571 125.0 104.0 107.0 81.0 49.0 80.0 35.0 280.0 35
2016 125.0 104.0 107.0 81.0 49.0 80.0 35.0 409.700000 252.263158 125.0 104.0 107.0 81.0 49.0 80.0 35.0 283.0 35
All 125.0 104.0 107.0 81.0 49.0 80.0 35.0 345.478261 190.622642 125.0 104.0 107.0 81.0 49.0 80.0 35.0 225.0 35
In [ ]:

In [ ]: