所有 Notebook 的源文件都可以在 GitHub 上访问到呀~

Assignment 3

Question 1

step1

  • 下载 Energy+Indicators.xls,该文件是联合国 2013 年能源供应和可再生电力生产的指标清单
  • 应将其放入变量名为 energy 的 DataFrame 中
    • 注意,这是一个 Excel 文件,而不是 CSV 文件
    • 请确保将数据文件中的页眉、页脚信息排除在外
    • 前两列是不需要的,所以应该去掉
    • 应该改变列标签,使列是 ["Country", "Energy Supply", "Energy Supply per Capita", "% Renewable"]
  • 将 "Energy Supply" 转换为千兆焦耳gigajoules(1,000,000 gigajoules = 1 petajoule)
  • 对所有数据缺失的地区(例如:带有...字样的数据),确保其反映为 np.NaN 值
  • 给下面清单中的地区重命名(供之后的问题使用):

    "Republic of Korea": "South Korea",

    "United States of America": "United States",

    "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",

    "China, Hong Kong Special Administrative Region": "Hong Kong"

  • 还有一些地区的名称中带有数字/括号,请删除这些数字/括号

    • 'Bolivia (Plurinational State of)'应改为'Bolivia'

    • 'Switzerland17'应改为'Switzerland'

step2

  • 下载 world_bank.csv ,从中获取 GDP 数据,该文件是一个 CSV 文件,包含了从世界银行获取的 1960-2015 年各个地区的 GDP
  • 将其放入变量名为 GDP 的 DataFrame 中
  • 务必跳过标题
  • 重命名以下地区

    "Korea, Rep.": "South Korea",

    "Iran, Islamic Rep.": "Iran",

    "Hong Kong SAR, China": "Hong Kong"

step3

  • 下载 scimagojr-3.xlsx,从中获取能源工程与动力技术的 Scimago 期刊和地区排名数据,该数据根据各国在上述领域的期刊贡献进行排名
  • 将其放入变量名为 ScimEn 的 DataFrame 中
  • 将 GDP、Energy、ScimEn组成一个新的数据集,使用地区名称作为交叉点
  • 只使用过去 10 年(2006-2015 年)的 GDP 数据,并且只使用 Scimagojr 排名前 15 名的地区
  • 这个 DataFrame 的索引应该是地区的名称,列应该是 ['Rank', 'Documents', 'Citable documents', 'Citations', 'Self-citations', 'Citations per document', 'H index', 'Energy Supply', 'Energy Supply per Capita', '% Renewable', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']

def answer_one()

这个函数应该返回一个有 20 列和 15 个条目的 DataFrame

In [29]:
import pandas as pd
import numpy as np
In [30]:
def answer_one():
    #读取energy
    #去掉不需要的信息
    energy = pd.read_excel("Energy+Indicators.xls")
    energy = energy.iloc[17:244,2:]
    #改变colunms label
    energy.columns = ["Country", "Energy Supply", "Energy Supply per Capita", "% Renewable"]
    #转换单位
    energy["Energy Supply"] *= 1000000
    #缺失值处理
    #df.where(cond, other=nan,inplace=False,axis=None,level=None,errors="raise",try_cast=False)
    #df.where 将条件为False的部分替换
    #cond参数,当cond为True,保持原有值,当cond为False,替换为other值
    #other参数,当cond为False时替换为other
    energy.where(energy != "...", np.NaN)
    #删除数字、括号
    energy.loc[:,"Country"].replace(r" \(.*\)", "", regex = True, inplace = True)
    energy.loc[:,"Country"].replace(r"\d", "", regex = True, inplace = True)
    #重命名
    energy.loc[:,"Country"].replace({
    "Republic of Korea": "South Korea",
    "United States of America": "United States",
    "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
    "China, Hong Kong Special Administrative Region": "Hong Kong"}
    ,inplace = True)
    #读取GDP
    GDP = pd.read_csv("world_bank.csv", skiprows = 4)
    #重命名
    GDP.loc[:, "Country Name"].replace({
        "Korea, Rep.": "South Korea",
        "Iran, Islamic Rep.": "Iran",
        "Hong Kong SAR, China": "Hong Kong"
        }, inplace = True)

    #选取部分column并修改Country Name为Country
    GDP = GDP[["Country Name", '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']]
    GDP.columns = ["Country", '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']    
    #读取ScimEn
    ScimEn = pd.read_excel("scimagojr-3.xlsx")
    #将Rank前15的存入ScimEn1
    ScimEn1 = ScimEn[:15]
    #合并ScimEn1、energy、GDP
    df0 = pd.merge(ScimEn1, energy, on = "Country")
    df = pd.merge(df0, GDP, on = "Country")
    df = df.set_index("Country")
    return df
answer_one()
Out[30]:
Rank Documents Citable documents Citations Self-citations Citations per document H index Energy Supply Energy Supply per Capita % Renewable 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
Country
China 1 127050 126767 597237 411683 4.70 138 127191000000 93 19.7549 3.992331e+12 4.559041e+12 4.997775e+12 5.459247e+12 6.039659e+12 6.612490e+12 7.124978e+12 7.672448e+12 8.230121e+12 8.797999e+12
United States 2 96661 94747 792274 265436 8.20 230 90838000000 286 11.571 1.479230e+13 1.505540e+13 1.501149e+13 1.459484e+13 1.496437e+13 1.520402e+13 1.554216e+13 1.577367e+13 1.615662e+13 1.654857e+13
Japan 3 30504 30287 223024 61554 7.31 134 18984000000 149 10.2328 5.496542e+12 5.617036e+12 5.558527e+12 5.251308e+12 5.498718e+12 5.473738e+12 5.569102e+12 5.644659e+12 5.642884e+12 5.669563e+12
United Kingdom 4 20944 20357 206091 37874 9.84 139 7920000000 124 10.6005 2.419631e+12 2.482203e+12 2.470614e+12 2.367048e+12 2.403504e+12 2.450911e+12 2.479809e+12 2.533370e+12 2.605643e+12 2.666333e+12
Russian Federation 5 18534 18301 34266 12422 1.85 57 30709000000 214 17.2887 1.385793e+12 1.504071e+12 1.583004e+12 1.459199e+12 1.524917e+12 1.589943e+12 1.645876e+12 1.666934e+12 1.678709e+12 1.616149e+12
Canada 6 17899 17620 215003 40930 12.01 149 10431000000 296 61.9454 1.564469e+12 1.596740e+12 1.612713e+12 1.565145e+12 1.613406e+12 1.664087e+12 1.693133e+12 1.730688e+12 1.773486e+12 1.792609e+12
Germany 7 17027 16831 140566 27426 8.26 126 13261000000 165 17.9015 3.332891e+12 3.441561e+12 3.478809e+12 3.283340e+12 3.417298e+12 3.542371e+12 3.556724e+12 3.567317e+12 3.624386e+12 3.685556e+12
India 8 15005 14841 128763 37209 8.58 115 33195000000 26 14.9691 1.265894e+12 1.374865e+12 1.428361e+12 1.549483e+12 1.708459e+12 1.821872e+12 1.924235e+12 2.051982e+12 2.200617e+12 2.367206e+12
France 9 13153 12973 130632 28601 9.93 114 10597000000 166 17.0203 2.607840e+12 2.669424e+12 2.674637e+12 2.595967e+12 2.646995e+12 2.702032e+12 2.706968e+12 2.722567e+12 2.729632e+12 2.761185e+12
South Korea 10 11983 11923 114675 22595 9.57 104 11007000000 221 2.27935 9.410199e+11 9.924316e+11 1.020510e+12 1.027730e+12 1.094499e+12 1.134796e+12 1.160809e+12 1.194429e+12 1.234340e+12 1.266580e+12
Italy 11 10964 10794 111850 26661 10.20 106 6530000000 109 33.6672 2.202170e+12 2.234627e+12 2.211154e+12 2.089938e+12 2.125185e+12 2.137439e+12 2.077184e+12 2.040871e+12 2.033868e+12 2.049316e+12
Spain 12 9428 9330 123336 23964 13.08 115 4923000000 106 37.9686 1.414823e+12 1.468146e+12 1.484530e+12 1.431475e+12 1.431673e+12 1.417355e+12 1.380216e+12 1.357139e+12 1.375605e+12 1.419821e+12
Iran 13 8896 8819 57470 19125 6.46 72 9172000000 119 5.70772 3.895523e+11 4.250646e+11 4.289909e+11 4.389208e+11 4.677902e+11 4.853309e+11 4.532569e+11 4.445926e+11 4.639027e+11 NaN
Australia 14 8831 8725 90765 15606 10.28 107 5386000000 231 11.8108 1.021939e+12 1.060340e+12 1.099644e+12 1.119654e+12 1.142251e+12 1.169431e+12 1.211913e+12 1.241484e+12 1.272520e+12 1.301251e+12
Brazil 15 8668 8596 60702 14396 7.00 86 12149000000 59 69.648 1.845080e+12 1.957118e+12 2.056809e+12 2.054215e+12 2.208872e+12 2.295245e+12 2.339209e+12 2.409740e+12 2.412231e+12 2.319423e+12

Question2

Question1 将三个数据集连接在一起,并减少至15条数据,当连接这三个数据集但还没有减少至15条数据的时候,我们失去了多少数据?

def answer_two()

这个函数返回一个数值

In [31]:
def answer_two():
    energy = pd.read_excel("Energy+Indicators.xls")
    energy = energy.iloc[17:244,2:]
    energy.columns = ["Country", "Energy Supply", "Energy Supply per Capita", "% Renewable"]
    energy["Energy Supply"] *= 1000000
    energy.where(energy != "...", np.NaN)
    energy.loc[:,"Country"].replace(r" \(.*\)", "", regex = True, inplace = True)
    energy.loc[:,"Country"].replace(r"\d", "", regex = True, inplace = True)
    energy.loc[:,"Country"].replace({
    "Republic of Korea": "South Korea",
    "United States of America": "United States",
    "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
    "China, Hong Kong Special Administrative Region": "Hong Kong"}
    ,inplace = True)
    
    GDP = pd.read_csv("world_bank.csv", skiprows = 4)
    GDP.loc[:, "Country Name"].replace({
        "Korea, Rep.": "South Korea",
        "Iran, Islamic Rep.": "Iran",
        "Hong Kong SAR, China": "Hong Kong"
        }, inplace = True)
    GDP = GDP[["Country Name", '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']]
    GDP.columns = ["Country", '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']    
   
    ScimEn = pd.read_excel("scimagojr-3.xlsx")
    #失去的数据即为 outer 方法merge,与inner方法merge的差值
    df1 = pd.merge(ScimEn, energy, on = "Country")
    df = pd.merge(df1, GDP, on = "Country")
    df3 = df.set_index("Country")
    df2 = pd.merge(ScimEn, energy, how = "outer", on = "Country")
    df = pd.merge(df2, GDP, how = "outer", on = "Country")
    df4 = df.set_index("Country")
    lose = df4.shape[0] - df3.shape[0]
    return lose
answer_two()
Out[31]:
156

接下来的问题调用answer_one的结果

Question3

过去10年,每个地区的平均GDP是多少?

def answer_three()

这个函数需要返回一个Series,名为avgGDP,有15个地区和各自的平均GDP值,并按照降序排序

In [32]:
#DataFrame.mean(axis=None, skipna=None,level=None,numeric_only=None,**kwargs)返回传入的axis的平均值
#axis参数,0表示index,1表示columns

def answer_three():
    top = answer_one()
    avgGDP = top[['2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']].mean(axis = 1).rename("avgGDP")
    avgGDP = avgGDP.sort_values(axis = 0, ascending = False)
    return avgGDP
answer_three()
Out[32]:
Country
United States         1.536434e+13
China                 6.348609e+12
Japan                 5.542208e+12
Germany               3.493025e+12
France                2.681725e+12
United Kingdom        2.487907e+12
Brazil                2.189794e+12
Italy                 2.120175e+12
India                 1.769297e+12
Canada                1.660647e+12
Russian Federation    1.565459e+12
Spain                 1.418078e+12
Australia             1.164043e+12
South Korea           1.106715e+12
Iran                  4.441558e+11
Name: avgGDP, dtype: float64

Question4

对于平均GDP值排名第6的地区,他的GDP在过去十年变化了多少?

def answer_four()

这个函数返回一个数值

In [33]:
#即2015年GDP与2006年GDP的差值
def answer_four():
    avgGDP = answer_three()
    top = answer_one()
    sixth = top.loc[avgGDP.index[5]]
    change = sixth.loc["2015"] - sixth.loc["2006"]
    return change
answer_four()
Out[33]:
246702696075.3999

Question5

Energy Supply per Capita的平均值是多少?

def answer_five()

这个函数返回一个数值

In [34]:
def answer_five():
    top = answer_one()
    supply_mean = top.loc[:,"Energy Supply per Capita"].mean()
    return supply_mean
answer_five()
Out[34]:
157.6

Question6

哪个地区的% Renewable 值最大,比例是多少?

def answer_six()

这个函数返回一个tuple,包含地区的名称以及比例

In [35]:
#使用sort+index时间复杂度较高,sort时间复杂度是nlogn
def answer_six():
    top = answer_one()
    renew_max = top.loc[:,"% Renewable"].sort_values(axis = 0,ascending = False)
    return (renew_max.index[0], renew_max[0])
#answer_six()
##直接遍历一遍,时间复杂度为n
def answer_six():
    top = answer_one()
    max_index = top.index[0]
    max_value = top.loc[max_index,"% Renewable"]
    #itterrows返回一个tuple,即(index,Series)
    for x in top.iterrows():
        if x[1].iloc[9] > max_value:
            max_value = x[1].iloc[9]
            max_index = x[0]
    return (max_index, max_value)
answer_six()
Out[35]:
('Brazil', 69.64803)

Question7

创建一个新的column,即为自引率self-citations/total citations,这个column的最大值是什么?是哪个地区?

def answer_seven()

这个函数返回一个tuple,包含地区的名称以及ratio

In [36]:
def answer_seven():
    top = answer_one()
    top["ratio"] = top["Self-citations"] / top["Citations"]
    max_ratio_index = top.index[0]
    max_ratio_value = top.loc[max_ratio_index, "ratio"]
    for x in top.iterrows():
        if x[1].loc["ratio"] > max_ratio_value:
            max_ratio_value = x[1].loc["ratio"]
            max_ratio_index = x[0]
    return (max_ratio_index, max_ratio_value)
answer_seven()
Out[36]:
('China', 0.6893126179389422)

Question8

创建一个新的column,使用Energy Supply 和 Energy Supply per capita估计人数,哪个地区是第三人口大国?

def answer_eight()

这个函数返回一个string

In [37]:
def answer_eight():
    top = answer_one()
    top["population"] = top["Energy Supply"] / top["Energy Supply per Capita"]
    third = top.sort_values("population", axis = 0, ascending = False).iloc[2].name
    return third
answer_eight()
Out[37]:
'United States'

Question9

创建一个新的column估计每个人可引用的文章数量,在每个地区可以引用的文章数量和每个地区energy supply有什么关系?(使用.corr()方法,皮尔逊相关)

def answer_nine()

这个函数返回一个数值

In [38]:
#先利用population计算citable documents per capita
#在计算corr的时候,如果直接计算会发现top.corr()的结果中并不包含需要的两项,可以查看他们的type为object,将其改为float既可
def answer_nine():
    top = answer_one()
    top["population"] = top["Energy Supply"] / top["Energy Supply per Capita"]
    top["citable documents per Capita"] = top["Citable documents"] / top["population"]
    top["citable documents per Capita"] = top["citable documents per Capita"].astype(float)
    top["Energy Supply per Capita"] = top["Energy Supply per Capita"].astype(float)
    cor = top["Energy Supply per Capita"].corr(top["citable documents per Capita"])
    return cor
answer_nine()
Out[38]:
0.7940010435442947