所有 Notebook 的源文件都可以在 GitHub 上访问到呀~

Advanced Python Pandas - 1

回顾

  • Pandas 的两种不同的数据结构: Series(一维) 以及 DataFrame(二维)
  • 查询这两种数据结构是通过不同的方法
    • 可以使用 iloc 或 loc 基于行查询
    • 可以使用 [] 基于列查询
    • 更重要的是可以通过布尔过滤boolean masking进行查询 boolean masking 是一种强大的过滤方法,允许我们使用广播去决定哪些数据应该被保留在分析中

本系列内容

  • 更详细地介绍如何操作 DataFrame
  • 如何使用GroupByApply减少和处理数据
  • 如何将多文件中的数据集连接起来形成一个数据集
  • Pandas 对于传统统计分析和机器学习有用的一些特征

合并数据集 Merging Dataframes

  • 如何为已经存在的 DataFrame 添加数据?
  • 对一个新的列名使用 [] 添加新的值,只要共享一个index,就会添加数据
  • 如果没有共享索引,而传入的是一个标量值(单一的值:整数/字符串),那么新增加的值列以标量值作为默认值(每一行的新增数据都是一样的)
  • 如何给每一行分配一个不同值?
In [4]:
import pandas as pd
In [5]:
#df 的index是一个商场列表,columns是购买数据
df = pd.DataFrame([{"Name" : "Chris","Item Purchased" : "Sponge", "Cost" : 22.50},
                  {"Name" : "Kevyn","Item Purchased" : "Kitty Litter", "Cost" : 2.50},
                  {"Name" : "Filip","Item Purchased" : "Spoon", "Cost" : 5.00}],
                 index = ["Store1", "Store1", "Store2"])
df
Out[5]:
Name Item Purchased Cost
Store1 Chris Sponge 22.5
Store1 Kevyn Kitty Litter 2.5
Store2 Filip Spoon 5.0
In [3]:
#添加一个新的列,名为Date,只需要使用 [] 操作符就可以了
#只要这个list和其他记录长度一样就可以
df["Date"] = ["December 1", "January 1", "mid-May"]
df
Out[3]:
Name Item Purchased Cost Date
Store1 Chris Sponge 22.5 December 1
Store1 Kevyn Kitty Litter 2.5 January 1
Store2 Filip Spoon 5.0 mid-May
In [6]:
#如果想添加新的字段,可能是传递True/False,也很容易,因为这是一个标量值
df["Delivered"] = True
df
Out[6]:
Name Item Purchased Cost Delivered
Store1 Chris Sponge 22.5 True
Store1 Kevyn Kitty Litter 2.5 True
Store2 Filip Spoon 5.0 True
In [7]:
#如果只想添加一部分字段就会出现问题,
#为了达到这个目的,需要为pandas提供足够长的DataFrame列表,以便使得每一行都被填充
#这也就意味着需要自己输入None值
df["Feedback"] = ["Positive", None, "Negative"]
df
Out[7]:
Name Item Purchased Cost Delivered Feedback
Store1 Chris Sponge 22.5 True Positive
Store1 Kevyn Kitty Litter 2.5 True None
Store2 Filip Spoon 5.0 True Negative
In [9]:
#如果每一行都有一个唯一的index,那么就可以只是将列标识符分配给该Series
#可以重新设置索引,从0到2,接着就可以用这些标签创建一个新的Series
#这种方法的好处就是,可以忽略那些不知道的项,pandas会将其设为NaN
adf = df.reset_index()
adf["Date"] = pd.Series({0 : "December 1", 2 : "mid-May"})
adf
Out[9]:
index Name Item Purchased Cost Delivered Feedback Date
0 Store1 Chris Sponge 22.5 True Positive December 1
1 Store1 Kevyn Kitty Litter 2.5 True None NaN
2 Store2 Filip Spoon 5.0 True Negative mid-May
In [10]:
#将两个DataFrame连接起来,这两个DataFrame有重叠的部分
#设置两个DataFrame以Name为index
staff_df = pd.DataFrame([{"Name" : "Kelly", "Role" : "Director of HR"},
                        {"Name" : "Sally", "Role" : "Course liasion"},
                        {"Name" : "James", "Role" : "Grader"}])
staff_df = staff_df.set_index("Name")
student_df = pd.DataFrame([{"Name" : "James", "School" : "Business"},
                          {"Name" : "Mike", "School" : "Law"},
                          {"Name" : "Sally", "School" : "Engineering"}])
student_df = student_df.set_index("Name")
print("staff_df is \n",staff_df.head())
print("student_df is \n", student_df.head())
staff_df is 
                  Role
Name                 
Kelly  Director of HR
Sally  Course liasion
James          Grader
student_df is 
             School
Name              
James     Business
Mike           Law
Sally  Engineering
In [11]:
pd.merge(staff_df, student_df, how = "outer", left_index = True, right_index = True)
Out[11]:
Role School
Name
James Grader Business
Kelly Director of HR NaN
Mike NaN Law
Sally Course liasion Engineering
In [ ]:

In [12]:
pd.merge(staff_df, student_df, how = "inner", left_index = True, right_index = True)
Out[12]:
Role School
Name
Sally Course liasion Engineering
James Grader Business
In [13]:
pd.merge(staff_df, student_df, how = "left", left_index = True, right_index = True)
Out[13]:
Role School
Name
Kelly Director of HR NaN
Sally Course liasion Engineering
James Grader Business
In [14]:
pd.merge(staff_df, student_df, how = "right", left_index = True, right_index = True)
Out[14]:
Role School
Name
James Grader Business
Mike NaN Law
Sally Course liasion Engineering
  • 多index、多column的情况
  • 有可能学生的名是相同的,但是性不同,这种情况下,使用多个index/column的list作为left_on/right_on的key
In [15]:
staff_df = pd.DataFrame([{"Name" : "Kelly", "Role" : "Director of HR", "Location" : "State Street"},
                        {"Name" : "Sally", "Role" : "Course liasion", "Location" : "Washington Avenue"},
                        {"Name" : "James", "Role" : "Grader", "Location" : "Washington Avenue"}])
student_df = pd.DataFrame([{"Name" : "James", "School" : "Business", "Location" : "1024 Billiard Avenue"},
                          {"Name" : "Mike", "School" : "Law", "Location" : "Fraternity House #22"},
                          {"Name" : "Sally", "School" : "Engineering", "Location" : "512 Wilson Crescent"}])
pd.merge(staff_df, student_df, how = "left", left_on = "Name", right_on = "Name")
Out[15]:
Name Role Location_x School Location_y
0 Kelly Director of HR State Street NaN NaN
1 Sally Course liasion Washington Avenue Engineering 512 Wilson Crescent
2 James Grader Washington Avenue Business 1024 Billiard Avenue
In [19]:
staff_df = pd.DataFrame([{"First Name" : "Kelly", "Last Name" : "Desjardins", "Role" : "Director of HR"},
                        {"First Name" : "Sally", "Last Name" : "Brooks", "Role" : "Course liasion"},
                        {"First Name" : "James", "Last Name" : "Wilde", "Role" : "Grader"}])
student_df = pd.DataFrame([{"First Name" : "James", "Last Name" : "Hammond", "School" : "Business"},
                          {"First Name" : "Mike", "Last Name" : "Smith",  "School" : "Law"},
                          {"First Name" : "Sally", "Last Name" : "Brooks",  "School" : "Engineering"}])
staff_df
student_df
pd.merge(staff_df, student_df, how = "inner", left_on = ["First Name", "Last Name"], right_on = ["First Name", "Last Name"])
Out[19]:
First Name Last Name Role School
0 Sally Brooks Course liasion Engineering

Test1

Here are two DataFrames, products and invoices. The product DataFrame has an identifier and a sticker price. The invoices DataFrame lists the people, product identifiers, and quantity. Assuming that we want to generate totals, how do we join these two DataFrames together so that we have one which lists all of the information we need? products DataFrame:

In [26]:
products = pd.DataFrame([{"ProductID" : 4109, "Price" : 5.0,"Product" : "Sushi Roll"},
                       {"ProductID" : 1412, "Price" : 0.5,"Product" : "Egg"},
                       {"ProductID" : 8931, "Price" : 1.5,"Product" : "Bagel"}])
products = products.set_index("ProductID")
products
Out[26]:
Price Product
ProductID
4109 5.0 Sushi Roll
1412 0.5 Egg
8931 1.5 Bagel
In [32]:
invoices = pd.DataFrame([{"Customer" : "Ali", "ProductID" : 4109, "Quantity" : 1},
                       {"Customer" : "Eric", "ProductID" : 1412, "Quantity" : 12},
                       {"Customer" : "Ande", "ProductID" : 8931, "Quantity" : 6},
                       {"Customer" : "Sam", "ProductID" : 4109, "Quantity" : 2}])
invoices
Out[32]:
Customer ProductID Quantity
0 Ali 4109 1
1 Eric 1412 12
2 Ande 8931 6
3 Sam 4109 2
In [33]:
answer = pd.merge(products, invoices,left_index = True,right_on = "ProductID")
answer
Out[33]:
Price Product Customer ProductID Quantity
0 5.0 Sushi Roll Ali 4109 1
3 5.0 Sushi Roll Sam 4109 2
1 0.5 Egg Eric 1412 12
2 1.5 Bagel Ande 8931 6
In [ ]: