The source files for all of the notebooks are available on GitHub.

Advanced Python Pandas - 2

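  • Python programmers often say that the language offers many ways to solve a particular problem
  • Some of those ways, however, are more appropriate than others
  • The best solutions are called idiomatic Python (roughly, code that follows Python's design philosophy)
  • An idiomatic solution usually combines high performance with high readability, though this is not always the case. As a Python library, pandas has its own set of idioms
  • For example, use vectorization wherever possible and avoid iterative loops unless they are really needed
  • In the pandas community, some developers and users call these idioms pandorable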
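To make the vectorization advice concrete, here is a minimal sketch that computes the same result with a Python loop and with a single column-wise operation. The DataFrame `toy` and its column names are made up for illustration and are not part of the census data used below.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for this sketch.
toy = pd.DataFrame({"pop_2010": [100, 250, 75], "pop_2015": [110, 240, 90]})

# Iterative version: visit the rows one at a time in Python.
growth_loop = []
for _, row in toy.iterrows():
    growth_loop.append(row["pop_2015"] - row["pop_2010"])

# Vectorized version: one subtraction over whole columns.
growth_vec = toy["pop_2015"] - toy["pop_2010"]

print(growth_vec.tolist())
```

Both versions give the same numbers; the vectorized one is shorter and leaves the looping to pandas.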

Idiomatic Pandas : Making Code Pandorable

What are the key features that make your code pandorable?

Method chaining

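  • When querying a DataFrame earlier, we saw that pandas indexing calls can be chained together, for example to select rows by index value and then pick only certain columns. df.loc["Washtenaw"]["Total Population"] is one such chained index
  • In general this is bad practice because, depending on what the underlying NumPy library does, pandas may return either a copy or a view of the DataFrame. Chained indexing is a code smell
  • Tom Augspurger: "If you see back-to-back square brackets, think carefully about whether you are doing chained indexing"
  • Method chaining is different. The idea behind it is that every method called on an object returns a reference to an object of the same kind (in pandas, usually a new DataFrame), so many different operations on a DataFrame can be condensed into one line, or at least one statement, of code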
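The difference between the two ideas can be sketched on a tiny table. The DataFrame `counties` and its values are hypothetical; only the "Washtenaw" and "Total Population" labels come from the text above.

```python
import pandas as pd

# Hypothetical miniature county table; the numbers are made up.
counties = pd.DataFrame(
    {"Total Population": [1000, 2000], "Area": [10, 20]},
    index=["Washtenaw", "Wayne"],
)

# Chained indexing: two separate indexing operations, back to back.
chained = counties.loc["Washtenaw"]["Total Population"]

# Single .loc call: row and column labels are resolved in one step.
direct = counties.loc["Washtenaw", "Total Population"]

# Method chaining: each call returns a DataFrame that the next call works on.
result = (
    counties
    .rename(columns={"Total Population": "Population"})
    .sort_values("Population", ascending=False)
)
print(result)
```

Reading gives the same value either way. The single `.loc` call is the safer habit because it does not depend on whether the intermediate object is a copy or a view.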
In [3]:
import pandas as pd
df = pd.read_csv("census.csv")
df.head()
Out[3]:
SUMLEV REGION DIVISION STATE COUNTY STNAME CTYNAME CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010 ... RDOMESTICMIG2011 RDOMESTICMIG2012 RDOMESTICMIG2013 RDOMESTICMIG2014 RDOMESTICMIG2015 RNETMIG2011 RNETMIG2012 RNETMIG2013 RNETMIG2014 RNETMIG2015
0 40 3 6 1 0 Alabama Alabama 4779736 4780127 4785161 ... 0.002295 -0.193196 0.381066 0.582002 -0.467369 1.030015 0.826644 1.383282 1.724718 0.712594
1 50 3 6 1 1 Alabama Autauga County 54571 54571 54660 ... 7.242091 -2.915927 -3.012349 2.265971 -2.530799 7.606016 -2.626146 -2.722002 2.592270 -2.187333
2 50 3 6 1 3 Alabama Baldwin County 182265 182265 183193 ... 14.832960 17.647293 21.845705 19.243287 17.197872 15.844176 18.559627 22.727626 20.317142 18.293499
3 50 3 6 1 5 Alabama Barbour County 27457 27457 27341 ... -4.728132 -2.500690 -7.056824 -3.904217 -10.543299 -4.874741 -2.758113 -7.167664 -3.978583 -10.543299
4 50 3 6 1 7 Alabama Bibb County 22915 22919 22861 ... -5.527043 -5.068871 -6.201001 -0.177537 0.177258 -5.088389 -4.363636 -5.403729 0.754533 1.107861

5 rows × 100 columns

  • Pandorable method chaining: the code below first runs the where query, then dropna, then set_index, and finally rename
In [4]:
(df.where(df["SUMLEV"] == 50)
   .dropna()
   .set_index(["STNAME", "CTYNAME"])
   .rename(columns={"ESTIMATESBASE2010": "Estimates Base 2010"}))
Out[4]:
SUMLEV REGION DIVISION STATE COUNTY CENSUS2010POP Estimates Base 2010 POPESTIMATE2010 POPESTIMATE2011 POPESTIMATE2012 ... RDOMESTICMIG2011 RDOMESTICMIG2012 RDOMESTICMIG2013 RDOMESTICMIG2014 RDOMESTICMIG2015 RNETMIG2011 RNETMIG2012 RNETMIG2013 RNETMIG2014 RNETMIG2015
STNAME CTYNAME
Alabama Autauga County 50.0 3.0 6.0 1.0 1.0 54571.0 54571.0 54660.0 55253.0 55175.0 ... 7.242091 -2.915927 -3.012349 2.265971 -2.530799 7.606016 -2.626146 -2.722002 2.592270 -2.187333
Baldwin County 50.0 3.0 6.0 1.0 3.0 182265.0 182265.0 183193.0 186659.0 190396.0 ... 14.832960 17.647293 21.845705 19.243287 17.197872 15.844176 18.559627 22.727626 20.317142 18.293499
Barbour County 50.0 3.0 6.0 1.0 5.0 27457.0 27457.0 27341.0 27226.0 27159.0 ... -4.728132 -2.500690 -7.056824 -3.904217 -10.543299 -4.874741 -2.758113 -7.167664 -3.978583 -10.543299
Bibb County 50.0 3.0 6.0 1.0 7.0 22915.0 22919.0 22861.0 22733.0 22642.0 ... -5.527043 -5.068871 -6.201001 -0.177537 0.177258 -5.088389 -4.363636 -5.403729 0.754533 1.107861
Blount County 50.0 3.0 6.0 1.0 9.0 57322.0 57322.0 57373.0 57711.0 57776.0 ... 1.807375 -1.177622 -1.748766 -2.062535 -1.369970 1.859511 -0.848580 -1.402476 -1.577232 -0.884411
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Wyoming Sweetwater County 50.0 4.0 8.0 56.0 37.0 43806.0 43806.0 43593.0 44041.0 45104.0 ... 1.072643 16.243199 -5.339774 -14.252889 -14.248864 1.255221 16.243199 -5.295460 -14.075283 -14.070195
Teton County 50.0 4.0 8.0 56.0 39.0 21294.0 21294.0 21297.0 21482.0 21697.0 ... -1.589565 0.972695 19.525929 14.143021 -0.564849 0.654527 2.408578 21.160658 16.308671 1.520747
Uinta County 50.0 4.0 8.0 56.0 41.0 21118.0 21118.0 21102.0 20912.0 20989.0 ... -17.755986 -4.916350 -6.902954 -14.215862 -12.127022 -18.136812 -5.536861 -7.521840 -14.740608 -12.606351
Washakie County 50.0 4.0 8.0 56.0 43.0 8533.0 8533.0 8545.0 8469.0 8443.0 ... -11.637475 -0.827815 -2.013502 -17.781491 1.682288 -11.990126 -1.182592 -2.250385 -18.020168 1.441961
Weston County 50.0 4.0 8.0 56.0 45.0 7208.0 7208.0 7181.0 7114.0 7065.0 ... -11.752361 -8.040059 12.372583 1.533635 6.935294 -12.032179 -8.040059 12.372583 1.533635 6.935294

3142 rows × 98 columns
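The output above shows values such as 50.0 rather than 50. A likely reason is that where() fills the non-matching rows with NaN before dropna() removes them, which upcasts integer columns to float. The sketch below reproduces this on a hypothetical stand-in for census.csv and shows one possible alternative, a callable inside .loc, that filters inside the chain without the upcast. The frame `census` and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for census.csv with only a few columns.
census = pd.DataFrame({
    "SUMLEV": [40, 50, 50],
    "STNAME": ["Alabama", "Alabama", "Alabama"],
    "CTYNAME": ["Alabama", "Autauga County", "Baldwin County"],
    "ESTIMATESBASE2010": [4780127, 54571, 182265],
})

# where() keeps the shape and fills non-matching rows with NaN,
# so integer columns become float before dropna() drops those rows.
via_where = (census.where(census["SUMLEV"] == 50)
             .dropna()
             .set_index(["STNAME", "CTYNAME"]))

# A callable inside .loc filters rows directly and keeps integer dtypes.
via_loc = (census.loc[lambda d: d["SUMLEV"] == 50]
           .set_index(["STNAME", "CTYNAME"]))

print(via_where["SUMLEV"].dtype, via_loc["SUMLEV"].dtype)
```

Note that dropna() also removes any row that had a missing value in the original data, so the two chains are equivalent only when the source has no NaN of its own.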

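  • The traditional way of writing this. It is not wrong functionally; it is just less elegant than the previous example. Timing the two versions, however, shows that the traditional approach is faster, so this is a case where speed and readability pull in opposite directions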
In [5]:
df = df[df["SUMLEV"] == 50]
df.set_index(["STNAME", "CTYNAME"], inplace=True)
# The key must match the column name exactly; a misspelled key is silently ignored
df.rename(columns={"ESTIMATESBASE2010": "Estimates Base 2010"})
Out[5]:
SUMLEV REGION DIVISION STATE COUNTY CENSUS2010POP Estimates Base 2010 POPESTIMATE2010 POPESTIMATE2011 POPESTIMATE2012 ... RDOMESTICMIG2011 RDOMESTICMIG2012 RDOMESTICMIG2013 RDOMESTICMIG2014 RDOMESTICMIG2015 RNETMIG2011 RNETMIG2012 RNETMIG2013 RNETMIG2014 RNETMIG2015
STNAME CTYNAME
Alabama Autauga County 50 3 6 1 1 54571 54571 54660 55253 55175 ... 7.242091 -2.915927 -3.012349 2.265971 -2.530799 7.606016 -2.626146 -2.722002 2.592270 -2.187333
Baldwin County 50 3 6 1 3 182265 182265 183193 186659 190396 ... 14.832960 17.647293 21.845705 19.243287 17.197872 15.844176 18.559627 22.727626 20.317142 18.293499
Barbour County 50 3 6 1 5 27457 27457 27341 27226 27159 ... -4.728132 -2.500690 -7.056824 -3.904217 -10.543299 -4.874741 -2.758113 -7.167664 -3.978583 -10.543299
Bibb County 50 3 6 1 7 22915 22919 22861 22733 22642 ... -5.527043 -5.068871 -6.201001 -0.177537 0.177258 -5.088389 -4.363636 -5.403729 0.754533 1.107861
Blount County 50 3 6 1 9 57322 57322 57373 57711 57776 ... 1.807375 -1.177622 -1.748766 -2.062535 -1.369970 1.859511 -0.848580 -1.402476 -1.577232 -0.884411
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Wyoming Sweetwater County 50 4 8 56 37 43806 43806 43593 44041 45104 ... 1.072643 16.243199 -5.339774 -14.252889 -14.248864 1.255221 16.243199 -5.295460 -14.075283 -14.070195
Teton County 50 4 8 56 39 21294 21294 21297 21482 21697 ... -1.589565 0.972695 19.525929 14.143021 -0.564849 0.654527 2.408578 21.160658 16.308671 1.520747
Uinta County 50 4 8 56 41 21118 21118 21102 20912 20989 ... -17.755986 -4.916350 -6.902954 -14.215862 -12.127022 -18.136812 -5.536861 -7.521840 -14.740608 -12.606351
Washakie County 50 4 8 56 43 8533 8533 8545 8469 8443 ... -11.637475 -0.827815 -2.013502 -17.781491 1.682288 -11.990126 -1.182592 -2.250385 -18.020168 1.441961
Weston County 50 4 8 56 45 7208 7208 7181 7114 7065 ... -11.752361 -8.040059 12.372583 1.533635 6.935294 -12.032179 -8.040059 12.372583 1.533635 6.935294

3142 rows × 98 columns
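The speed claim can be checked with the standard-library timeit module. The sketch below uses synthetic data in place of census.csv, so the frame `data`, its size, and the helper names `chained` and `stepwise` are all assumptions. Absolute timings depend on the machine and the pandas version, so no result is shown.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data standing in for census.csv; size and values are made up.
rng = np.random.default_rng(0)
n = 10_000
data = pd.DataFrame({
    "SUMLEV": rng.choice([40, 50], size=n),
    "STNAME": rng.choice(["Alabama", "Wyoming"], size=n),
    "CTYNAME": [f"County {i}" for i in range(n)],
    "ESTIMATESBASE2010": rng.integers(1_000, 100_000, size=n),
})

def chained(frame):
    """Method-chaining version: where, dropna, set_index, rename."""
    return (frame.where(frame["SUMLEV"] == 50)
            .dropna()
            .set_index(["STNAME", "CTYNAME"])
            .rename(columns={"ESTIMATESBASE2010": "Estimates Base 2010"}))

def stepwise(frame):
    """Traditional version: boolean mask, then one step per statement."""
    out = frame[frame["SUMLEV"] == 50]
    out = out.set_index(["STNAME", "CTYNAME"])
    return out.rename(columns={"ESTIMATESBASE2010": "Estimates Base 2010"})

# Run each version 10 times and report the total elapsed time.
t_chained = timeit.timeit(lambda: chained(data), number=10)
t_stepwise = timeit.timeit(lambda: stepwise(data), number=10)
print(f"chained: {t_chained:.4f}s  stepwise: {t_stepwise:.4f}s")
```

In a notebook, the %%timeit cell magic does the same job with less code.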

Test

Suppose we are working on a DataFrame that holds information on our equipment for an upcoming backpacking trip. Can you use method chaining to modify the DataFrame df in one statement to drop any entries where 'Quantity' is 0 and rename the column 'Weight' to 'Weight (oz.)'?

print(df.drop(df[df['Quantity'] == 0].index).rename(columns={'Weight': 'Weight (oz.)'}))
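To try the answer above without the course data, here is a small self-contained sketch. The packing list `gear` is hypothetical; only the 'Quantity' and 'Weight' column names come from the exercise.

```python
import pandas as pd

# Hypothetical packing list; item names and numbers are made up.
gear = pd.DataFrame(
    {"Quantity": [1, 0, 2], "Weight": [48.0, 16.0, 3.5]},
    index=["Tent", "Lantern", "Spork"],
)

# One statement: drop rows whose Quantity is 0, then rename the column.
packed = (gear.drop(gear[gear["Quantity"] == 0].index)
          .rename(columns={"Weight": "Weight (oz.)"}))
print(packed)
```

The original `gear` frame is left unchanged because both drop and rename return new objects by default.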

In [6]:
import numpy as np
def min_max(row):
    data = row[["POPESTIMATE2010",
                "POPESTIMATE2011",
                "POPESTIMATE2012",
                "POPESTIMATE2013",
                "POPESTIMATE2014",
                "POPESTIMATE2015"]]
    return pd.Series({"min" : np.min(data), "max" : np.max(data)})
In [7]:
# axis=1 applies the function to each row
df.apply(min_max, axis = 1)
Out[7]:
min max
STNAME CTYNAME
Alabama Autauga County 54660.0 55347.0
Baldwin County 183193.0 203709.0
Barbour County 26489.0 27341.0
Bibb County 22512.0 22861.0
Blount County 57373.0 57776.0
... ... ... ...
Wyoming Sweetwater County 43593.0 45162.0
Teton County 21297.0 23125.0
Uinta County 20822.0 21102.0
Washakie County 8316.0 8545.0
Weston County 7065.0 7234.0

3142 rows × 2 columns

In [9]:
def min_max(row):
    data = row[["POPESTIMATE2010",
                "POPESTIMATE2011",
                "POPESTIMATE2012",
                "POPESTIMATE2013",
                "POPESTIMATE2014",
                "POPESTIMATE2015"]]
    row["max"] = np.max(data)
    row["min"] = np.min(data)
    return row
df.apply(min_max, axis = 1)
Out[9]:
SUMLEV REGION DIVISION STATE COUNTY CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010 POPESTIMATE2011 POPESTIMATE2012 ... RDOMESTICMIG2013 RDOMESTICMIG2014 RDOMESTICMIG2015 RNETMIG2011 RNETMIG2012 RNETMIG2013 RNETMIG2014 RNETMIG2015 max min
STNAME CTYNAME
Alabama Autauga County 50.0 3.0 6.0 1.0 1.0 54571.0 54571.0 54660.0 55253.0 55175.0 ... -3.012349 2.265971 -2.530799 7.606016 -2.626146 -2.722002 2.592270 -2.187333 55347.0 54660.0
Baldwin County 50.0 3.0 6.0 1.0 3.0 182265.0 182265.0 183193.0 186659.0 190396.0 ... 21.845705 19.243287 17.197872 15.844176 18.559627 22.727626 20.317142 18.293499 203709.0 183193.0
Barbour County 50.0 3.0 6.0 1.0 5.0 27457.0 27457.0 27341.0 27226.0 27159.0 ... -7.056824 -3.904217 -10.543299 -4.874741 -2.758113 -7.167664 -3.978583 -10.543299 27341.0 26489.0
Bibb County 50.0 3.0 6.0 1.0 7.0 22915.0 22919.0 22861.0 22733.0 22642.0 ... -6.201001 -0.177537 0.177258 -5.088389 -4.363636 -5.403729 0.754533 1.107861 22861.0 22512.0
Blount County 50.0 3.0 6.0 1.0 9.0 57322.0 57322.0 57373.0 57711.0 57776.0 ... -1.748766 -2.062535 -1.369970 1.859511 -0.848580 -1.402476 -1.577232 -0.884411 57776.0 57373.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Wyoming Sweetwater County 50.0 4.0 8.0 56.0 37.0 43806.0 43806.0 43593.0 44041.0 45104.0 ... -5.339774 -14.252889 -14.248864 1.255221 16.243199 -5.295460 -14.075283 -14.070195 45162.0 43593.0
Teton County 50.0 4.0 8.0 56.0 39.0 21294.0 21294.0 21297.0 21482.0 21697.0 ... 19.525929 14.143021 -0.564849 0.654527 2.408578 21.160658 16.308671 1.520747 23125.0 21297.0
Uinta County 50.0 4.0 8.0 56.0 41.0 21118.0 21118.0 21102.0 20912.0 20989.0 ... -6.902954 -14.215862 -12.127022 -18.136812 -5.536861 -7.521840 -14.740608 -12.606351 21102.0 20822.0
Washakie County 50.0 4.0 8.0 56.0 43.0 8533.0 8533.0 8545.0 8469.0 8443.0 ... -2.013502 -17.781491 1.682288 -11.990126 -1.182592 -2.250385 -18.020168 1.441961 8545.0 8316.0
Weston County 50.0 4.0 8.0 56.0 45.0 7208.0 7208.0 7181.0 7114.0 7065.0 ... 12.372583 1.533635 6.935294 -12.032179 -8.040059 12.372583 1.533635 6.935294 7234.0 7065.0

3142 rows × 100 columns

In [10]:
# These are column labels, so the list is named cols
cols = ["POPESTIMATE2010",
        "POPESTIMATE2011",
        "POPESTIMATE2012",
        "POPESTIMATE2013",
        "POPESTIMATE2014",
        "POPESTIMATE2015"]
df.apply(lambda x: np.max(x[cols]), axis=1)
Out[10]:
STNAME   CTYNAME          
Alabama  Autauga County        55347.0
         Baldwin County       203709.0
         Barbour County        27341.0
         Bibb County           22861.0
         Blount County         57776.0
                                ...   
Wyoming  Sweetwater County     45162.0
         Teton County          23125.0
         Uinta County          21102.0
         Washakie County        8545.0
         Weston County          7234.0
Length: 3142, dtype: float64
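Since the notes above recommend vectorization where possible, it is worth noting that a row-wise maximum or minimum does not strictly need apply. The sketch below compares the two on a hypothetical two-county sample; the frame `sample` and its numbers are made up, and only three of the POPESTIMATE columns are used to keep it short.

```python
import pandas as pd

cols = ["POPESTIMATE2010", "POPESTIMATE2011", "POPESTIMATE2012"]

# Hypothetical two-county sample; the numbers are made up.
sample = pd.DataFrame(
    [[100, 120, 110], [300, 280, 290]],
    columns=cols,
    index=["County A", "County B"],
)

# Row-wise apply: the lambda is called once per row.
max_apply = sample.apply(lambda row: row[cols].max(), axis=1)

# Vectorized equivalents: reduce across the selected columns in one call.
max_vec = sample[cols].max(axis=1)
min_vec = sample[cols].min(axis=1)

# assign() adds both columns and returns a new frame.
summary = sample.assign(min=min_vec, max=max_vec)
print(summary)
```

apply remains useful when the per-row logic has no built-in vectorized form.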