pandas in Practice with Python
Published: 2019-06-18


A walkthrough of pandas in Python:

#!/usr/bin/env python
# encoding: utf-8
__author__ = 'Administrator'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Note: the DataFrames below are filled with random numbers, and the sample
# outputs in the comments were captured from separate runs, so the values
# differ between listings and will differ from yours.

# I. Creating objects
# 1. Create a Series by passing a list; pandas creates a default integer index:
s = pd.Series([1, 3, 4, np.nan, 6, 8])
print(s)
# 0    1.0
# 1    3.0
# 2    4.0
# 3    NaN
# 4    6.0
# 5    8.0
# dtype: float64

# 2. Create a DataFrame by passing a NumPy array, a datetime index, and column labels:
dates = pd.date_range('20180301', periods=6)
print(dates)
# DatetimeIndex(['2018-03-01', '2018-03-02', '2018-03-03', '2018-03-04',
#                '2018-03-05', '2018-03-06'],
#               dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# numpy.random.randn(d0, d1, ..., dn) returns one or more samples from the
# standard normal distribution (values may be negative).
# numpy.random.rand(d0, d1, ..., dn) returns samples drawn uniformly from [0, 1).
# P = numpy.random.rand(N, K)  # generates a random N-by-K matrix
print(df)
#                    A         B         C         D
# 2018-03-01 -0.451506 -0.884044 -0.916664 -0.763684
# 2018-03-02 -0.463568  0.340688 -0.077484 -0.237660
# 2018-03-03 -1.533427  0.301283  0.268640 -0.011027
# 2018-03-04  1.036050  0.402203  0.485365  2.086525
# 2018-03-05  0.221578 -0.821756 -0.265241  0.277563
# 2018-03-06  1.774195 -0.288553  1.527936  0.119153

# 3. Create a DataFrame by passing a dict of objects that can be converted to series-like structures:
df2 = pd.DataFrame({
    'A': 1.,
    'B': pd.Timestamp('20180301'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': 'foo'})
print(df2)
#      A          B    C  D      E    F
# 0  1.0 2018-03-01  1.0  3   test  foo
# 1  1.0 2018-03-01  1.0  3  train  foo
# 2  1.0 2018-03-01  1.0  3   test  foo
# 3  1.0 2018-03-01  1.0  3  train  foo

# 4. Inspect the data type of each column:
print(df2.dtypes)
# A           float64
# B    datetime64[ns]
# C           float32
# D             int32
# E          category
# F            object
# dtype: object

# II. Viewing data
# 1. View the first and last rows of a DataFrame:
print(df.head())
#                    A         B         C         D
# 2018-03-01 -0.250132 -1.403066  1.234990 -3.077763
# 2018-03-02  0.387496 -0.389183  0.186663  1.124608
# 2018-03-03 -0.105463 -0.230739 -0.227575  0.308565
# 2018-03-04 -1.703507  0.194876  1.790366 -0.561566
# 2018-03-05 -0.511609  0.695915  0.398392  0.107062
print(df.tail(3))
#                    A         B         C         D
# 2018-03-04  0.704065  0.492649  0.533961 -1.518723
# 2018-03-05  2.192819 -0.508099 -0.173966 -0.401864
# 2018-03-06 -0.839634 -0.314676 -0.808266 -0.578229

# 2. Display the index, the columns, and the underlying NumPy data:
print(df.index)
# DatetimeIndex(['2018-03-01', '2018-03-02', '2018-03-03', '2018-03-04',
#                '2018-03-05', '2018-03-06'],
#               dtype='datetime64[ns]', freq='D')
print(df.columns)
# Index(['A', 'B', 'C', 'D'], dtype='object')
print(df.values)
# [[ 1.65612186 -0.47932887  0.9673593  -0.63872414]
#  [ 0.12229686  0.08831358  1.07344126 -0.12742276]
#  [ 0.54654075  0.77281164 -0.6396787   0.1585142 ]
#  [-0.70695944 -2.12273423 -0.24549759 -0.09530991]
#  [ 2.66920788  0.6520858   1.72857641 -1.34418643]
#  [ 1.87333346 -0.42716996  0.49558928 -1.47606701]]

# 3. describe() gives a quick statistical summary of the data:
print(df.describe())
#               A         B         C         D
# count  6.000000  6.000000  6.000000  6.000000
# mean   0.399068  0.339270  0.755588 -0.459344
# std    0.890360  1.011113  0.851783  1.759264
# min   -1.002101 -0.806772 -0.333761 -2.411582
# 25%   -0.087757 -0.400563  0.338822 -1.782221
# 50%    0.577418  0.244011  0.502612 -0.622453
# 75%    1.096592  0.941454  1.376095  0.433235
# max    1.281508  1.795854  1.910586  2.284103

# 4. Transpose the data:
print(df.T)
#    2018-03-01  2018-03-02  2018-03-03  2018-03-04  2018-03-05  2018-03-06
# A    0.843347   -0.906826   -0.528945    1.186650   -1.839152   -0.508169
# B   -0.105481    2.084689   -1.106710    0.521137    0.741946    0.399700
# C   -0.786144    0.269116   -0.180710    3.345385    1.310786   -0.204216
# D    0.453731   -0.243617    0.701440    2.541094    1.337923   -0.673128

# 5. Sort by an axis.
# axis=0 sorts by the row index, axis=1 sorts by the column labels;
# ascending=False sorts in descending order, otherwise ascending.
print(df.sort_index(axis=1, ascending=False))
#                    D         C         B         A
# 2018-03-01  0.389294 -0.227394  0.649234  0.639820
# 2018-03-02  0.680265  0.466626 -1.940228  0.843753
# 2018-03-03  1.520800  0.570192  1.244427 -0.715080
# 2018-03-04  0.309068 -0.224222 -0.226254  1.416381
# 2018-03-05 -1.854131 -0.403245 -0.017054  0.840840
# 2018-03-06 -1.991173  1.275825  0.913996  1.561550

# 6. Sort by values.
# df.sort(column='B') raises AttributeError: 'DataFrame' object has no attribute 'sort';
# DataFrame.sort() was removed, and sort_values() is its replacement.
print(df.sort_values(by='B'))

# III. Selection
# Standard Python/NumPy selection and assignment expressions work directly,
# but for production code the optimized pandas accessors are recommended:
# .at, .iat, .loc and .iloc (.ix, listed in older tutorials, was removed in pandas 1.0).
# (1) Getting:
# 1. Select a single column, which returns a Series; equivalent to df.A:
print(df['A'])
# 2018-03-01    0.156236
# 2018-03-02   -0.041257
# 2018-03-03   -0.970551
# 2018-03-04   -1.751839
# 2018-03-05    1.521352
# 2018-03-06    0.828690
# Freq: D, Name: A, dtype: float64

# 2. Select with [], which slices the rows:
print(df[0:3])
#                    A         B         C         D
# 2018-03-01 -0.432011  0.697033 -3.028116 -0.217882
# 2018-03-02 -1.744071  0.647694  1.031179 -1.043985
# 2018-03-03 -0.673125  0.689913  0.648986 -1.471825
print(df['20180302':'20180304'])
#                    A         B         C         D
# 2018-03-02 -0.803947  0.147807 -0.248534  0.496719
# 2018-03-03 -1.518123  0.376390 -0.793349  0.612074
# 2018-03-04  0.146634  0.506102  1.316693 -0.801691

# (2) Selection by label:
# 1. Get a cross-section using a label:
print(df.loc[dates[0]])
# A   -1.593039
# B    0.400735
# C   -0.870638
# D   -0.551766
# Name: 2018-03-01 00:00:00, dtype: float64

# 2. Select on multiple axes by label:
print(df.loc[:, ['A', 'B']])
#                    A         B
# 2018-03-01  0.326446  0.633246
# 2018-03-02  0.169674  0.892832
# 2018-03-03 -0.755691 -2.028912
# 2018-03-04 -1.005360  0.529193
# 2018-03-05 -0.457140  0.842211
# 2018-03-06  0.343157  0.879763

# 3. Label slicing (both endpoints are included):
print(df.loc['20180302':'20180304', ['A', 'B']])
#                    A         B
# 2018-03-02  0.197173  0.040377
# 2018-03-03  2.064367  1.112152
# 2018-03-04  0.888216 -0.591129

# 4. Reduce the dimensions of the returned object:
print(df.loc['20180302', ['A', 'B']])
# A   -0.259955
# B   -0.019266
# Name: 2018-03-02 00:00:00, dtype: float64

# 5. Get a scalar value:
print(df.loc[dates[0], 'A'])  # -0.313259346223

# 6. Fast access to a scalar (equivalent to the previous method):
print(df.at[dates[0], 'A'])  # -0.313259346223

# (3) Selection by position:
# 1. Select by passing an integer position (this selects a row):
print(df.iloc[3])
# A    1.661488
# B   -1.175748
# C    0.642823
# D   -0.491914
# Name: 2018-03-04 00:00:00, dtype: float64

# 2. Slice by integer positions, as in NumPy/Python:
print(df.iloc[3:5, 0:2])  # rows at positions 3 and 4, columns at positions 0 and 1
#                    A         B
# 2018-03-04  0.492426  0.412712
# 2018-03-05  0.541252 -0.009380

# 3. Select by lists of integer positions, as in NumPy/Python:
print(df.iloc[[1, 2, 4], [0, 2]])
#                    A         C
# 2018-03-02 -0.638074  1.794516
# 2018-03-03 -0.403471 -0.934373
# 2018-03-05 -1.309320  1.353276

# 4. Slice rows:
print(df.iloc[1:3, :])
#                    A         B         C         D
# 2018-03-02  1.980513 -0.218688  2.627449  1.314947
# 2018-03-03 -0.532379  1.382092 -1.270961  0.722475

# 5. Slice columns:
print(df.iloc[:, 1:3])
#                    B         C
# 2018-03-01  0.332228 -1.682811
# 2018-03-02 -0.533398 -0.254960
# 2018-03-03 -0.926688  0.890513
# 2018-03-04 -0.448742  0.763850
# 2018-03-05 -0.841622  0.514873
# 2018-03-06 -1.346557  1.516414

# 6. Get a specific value:
print(df.iloc[1, 1])  # 0.481882236461
print(df.iat[1, 1])   # 0.481882236461

# (4) Boolean indexing:
# 1. Use the values of a single column to select data:
print(df[df.A > 0])
#                    A         B         C         D
# 2018-03-02  0.566243  1.510954 -0.898180  0.856439
# 2018-03-03  1.008447 -1.597226 -0.665134 -0.287472
# 2018-03-05  0.952498 -0.144979  0.620468 -0.830652

# 2. Use a where operation to select data (values failing the condition become NaN):
print(df[df > 0])
#                    A         B         C         D
# 2018-03-01  0.892660       NaN       NaN       NaN
# 2018-03-02  1.512600       NaN       NaN  1.375527
# 2018-03-03  0.970026  1.184603  1.182990       NaN
# 2018-03-04  1.913993       NaN  0.914778  0.137170
# 2018-03-05  0.482589       NaN       NaN  0.668817
# 2018-03-06       NaN  0.539344  0.142892       NaN

# 3. Filter with the isin() method:
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
print(df2)
#                    A         B         C         D      E
# 2018-03-01 -1.138724  0.566583  0.338254  2.072839    one
# 2018-03-02 -0.366949  0.335546  1.653024  1.445071    one
# 2018-03-03  0.724615  1.715933 -0.754757 -1.452252    two
# 2018-03-04 -0.881962 -0.173858 -0.340868 -0.556665  three
# 2018-03-05 -2.126513 -0.113010 -0.796566  0.210673   four
# 2018-03-06  0.716490  0.223395 -1.428238  0.328406  three
print(df2[df2['E'].isin(['two', 'four'])])
#                    A         B         C         D     E
# 2018-03-03 -0.737833 -1.161520  0.897204 -0.029158   two
# 2018-03-05  1.072054  1.234587  0.935680 -1.284542  four

# (5) Setting:
# 1. Set a new column (values are aligned on the index automatically):
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20180302', periods=6))
print(s1)
# 2018-03-02    1
# 2018-03-03    2
# 2018-03-04    3
# 2018-03-05    4
# 2018-03-06    5
# 2018-03-07    6
# Freq: D, dtype: int64
df['F'] = s1
print(df)
#                    A         B         C         D    F
# 2018-03-01  2.413592 -0.336264  0.165597  2.143270  NaN
# 2018-03-02 -1.921596 -2.100707 -0.454461  0.563247  1.0
# 2018-03-03 -0.235034 -0.517009 -2.409731 -0.711854  2.0
# 2018-03-04  0.667604 -0.838737 -0.425916 -0.238519  3.0
# 2018-03-05  1.057415  1.457143  0.440690  0.948613  4.0
# 2018-03-06  0.539187 -0.952633  0.316752  0.422146  5.0

# 2. Set a new value by label:
df.at[dates[0], 'A'] = 0

# 3. Set a new value by position:
df.iat[0, 1] = 0

# 4. Set a group of new values from a NumPy array:
df.loc[:, 'D'] = np.array([5] * len(df))
print(df)
#                    A         B         C  D    F
# 2018-03-01  0.000000  0.000000  0.164267  5  NaN
# 2018-03-02  0.614534 -0.865975 -0.977389  5  1.0
# 2018-03-03 -0.253095 -1.451951  2.360233  5  2.0
# 2018-03-04  0.143115  0.363544  1.587648  5  3.0
# 2018-03-05  0.010932  0.802590 -1.701589  5  4.0
# 2018-03-06 -0.354579  0.830066  0.404646  5  5.0

# 5. Set new values with a where operation (negate every positive value):
df2 = df.copy()
df2[df2 > 0] = -df2
print(df2)
#                    A         B         C  D    F
# 2018-03-01  0.000000  0.000000 -1.385454 -5  NaN
# 2018-03-02 -0.773506 -0.444692 -0.620307 -5 -1.0
# 2018-03-03 -0.506590 -2.445527 -0.664229 -5 -2.0
# 2018-03-04 -0.568711 -0.709224 -2.582502 -5 -3.0
# 2018-03-05 -1.074985 -2.480905 -0.537869 -5 -4.0
# 2018-03-06 -2.659346 -1.055430 -0.379758 -5 -5.0

# IV. Missing data
# pandas uses np.nan to represent missing data; by default these values are
# excluded from computations. See the Missing Data section of the pandas
# documentation for details.
# 1. reindex() changes, adds, or deletes the index on a given axis and returns a copy of the data:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1
print(df1)
#                    A         B         C         D    E
# 2018-03-01 -0.275255 -0.290044  0.707118  1.094318  1.0
# 2018-03-02 -1.340747  0.633546 -0.911210 -0.275105  1.0
# 2018-03-03 -1.044219  0.659945  1.370910  0.262282  NaN
# 2018-03-04 -0.015582  1.540852 -0.792882 -0.380751  NaN

# 2. Drop the rows that contain missing values
# (left commented out so that the next step still has NaN values to fill):
# df1 = df1.dropna(how='any')
# print(df1)
# #                    A         B         C         D    E
# # 2018-03-01 -0.914568  0.784980 -1.698139 -0.096874  1.0
# # 2018-03-02 -0.410249 -0.494166  0.932946 -0.467547  1.0

# 3. Fill in missing values:
df1 = df1.fillna(value=5)
print(df1)
#                    A         B         C         D    E
# 2018-03-01 -1.265605  0.778767 -0.947968 -1.330982  1.0
# 2018-03-02  1.778973 -1.428542  1.257860  0.362724  1.0
# 2018-03-03 -1.589094 -0.517478 -0.164942 -0.507224  5.0
# 2018-03-04  2.363145  2.089114 -0.081683 -0.184851  5.0

# 4. Get a boolean mask showing where values are missing
# (all False here because the NaN values were just filled):
df1 = pd.isnull(df1)
print(df1)
#                 A      B      C      D      E
# 2018-03-01  False  False  False  False  False
# 2018-03-02  False  False  False  False  False
# 2018-03-03  False  False  False  False  False
# 2018-03-04  False  False  False  False  False

# V. Operations
# (1) Statistics (operations generally exclude missing values)
# 1. Compute a descriptive statistic:
print(df.mean())
# A   -0.066441
# B    0.154609
# C   -0.154372
# D   -0.155221
# dtype: float64

# 2. Perform the same operation on the other axis:
print(df.mean(axis=1))
# 2018-03-01   -0.138352
# 2018-03-02   -0.226558
# 2018-03-03    0.121705
# 2018-03-04    0.855662
# 2018-03-05   -0.892621
# 2018-03-06    0.062726
# Freq: D, dtype: float64

# 3. When operating on objects with different dimensions that need alignment,
#    pandas automatically broadcasts along the specified dimension.

# (2) Apply
# 1. Apply functions to the data:
print(df)
print(df.apply(np.cumsum))
#                    A         B         C         D
# 2018-03-01 -0.381460 -0.296346  1.229803 -1.300226
# 2018-03-02  0.365891  0.974026  1.570268 -2.572981
# 2018-03-03  0.624070  0.211935  0.635084 -1.110378
# 2018-03-04  2.945062 -0.406832 -0.043918 -0.470773
# 2018-03-05  3.542080  0.092974 -1.585544 -0.658267
# 2018-03-06  3.440084  0.448828 -2.400617 -0.734055
print(df.apply(lambda x: x.max() - x.min()))
# A    2.702452
# B    2.032463
# C    2.771429
# D    2.762828
# dtype: float64

# (3) Histogramming
s = pd.Series(np.random.randint(0, 7, size=10))
print(s)
# 0    2
# 1    6
# 2    6
# 3    3
# 4    3
# 5    4
# 6    4
# 7    6
# 8    6
# 9    2
# dtype: int32
print(s.value_counts())
# 6    4
# 4    2
# 3    2
# 2    2
# dtype: int64

# (4) String methods
# A Series exposes a set of string-processing methods through its str attribute,
# which makes it easy to operate on each element of the array, as shown below.
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
print(s.str.lower())
# 0       a
# 1       b
# 2       c
# 3    aaba
# 4    baca
# 5     NaN
# 6    caba
# 7     dog
# 8     cat
# dtype: object

# VI. Merging
# pandas provides many ways to combine Series and DataFrame objects with
# various kinds of set logic (Panel, mentioned in older tutorials, was removed in pandas 0.25).
# 1. Concat
df = pd.DataFrame(np.random.randn(10, 4))
print(df)
#           0         1         2         3
# 0  0.620744 -0.921194  0.130483 -0.305914
# 1  0.311699 -0.085041  0.638297 -0.077868
# 2  0.327473 -0.732598 -0.134463  0.498805
# 3 -0.622715 -0.819375 -0.473504 -0.379117
# 4 -1.309207 -0.794917 -1.284665  0.830677
# 5 -1.170121 -2.063048 -0.836381  0.925829
# 6 -0.766342  0.454018 -0.181846 -1.052607
# 7 -0.996856  0.189226  0.428375 -1.149523
# 8  1.080517  1.884718 -0.065141 -0.781686
# 9  0.087353  0.209678 -1.333989  0.863220

# Break it into pieces, then concatenate them back together:
pieces = [df[:3], df[3:7], df[7:]]
print(pieces)
print(pd.concat(pieces))
#           0         1         2         3
# 0  1.187009 -0.493550  0.777065  1.494107
# 1 -0.915190  1.228669  0.216910  1.610432
# 2 -0.647737  1.961472  1.369682 -1.195257
# 3  1.474973  1.968576  1.282678 -1.798167
# 4  1.449858 -1.828631 -0.217424  0.992141
# 5 -1.056223  0.464964  0.135468  0.181781
# 6 -1.677772  1.456419  0.642563 -0.895238
# 7  0.123780  0.030988  1.960217  0.140918
# 8  1.071418  1.737486 -0.170948  0.859271
# 9 -0.056640 -1.439686 -0.358960 -1.765060

# 2. Join: SQL-style merges.
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
print(left)
#    key  lval
# 0  foo     1
# 1  foo     2
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
print(right)
#    key  rval
# 0  foo     4
# 1  foo     5
pd1 = pd.merge(left, right, on='key')
print(pd1)
#    key  lval  rval
# 0  foo     1     4
# 1  foo     1     5
# 2  foo     2     4
# 3  foo     2     5

# 3. Append: add a row to a DataFrame.
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
print(df)
#           A         B         C         D
# 0  0.205671 -1.236797 -1.127111  1.422836
# 1  0.646151  0.202197 -0.160218 -0.839145
# 2  1.479783 -0.678455  0.649959 -1.085791
# 3 -0.851987 -0.821248  0.125836  0.819543
# 4 -1.312988 -0.898903 -0.420592  1.672173
# 5  0.240516 -0.711331 -0.717536  0.620066
# 6 -0.442280  0.539277 -1.428910  1.060193
# 7  0.257239 -2.034086  1.121833  1.518571
s = df.iloc[3]
# DataFrame.append() was removed in pandas 2.0; pd.concat() does the same job.
# s.to_frame().T turns the row Series back into a one-row DataFrame.
df1 = pd.concat([df, s.to_frame().T], ignore_index=True)
print(df1)
#           A         B         C         D
# 0  0.205671 -1.236797 -1.127111  1.422836
# 1  0.646151  0.202197 -0.160218 -0.839145
# 2  1.479783 -0.678455  0.649959 -1.085791
# 3 -0.851987 -0.821248  0.125836  0.819543
# 4 -1.312988 -0.898903 -0.420592  1.672173
# 5  0.240516 -0.711331 -0.717536  0.620066
# 6 -0.442280  0.539277 -1.428910  1.060193
# 7  0.257239 -2.034086  1.121833  1.518571
# 8 -0.851987 -0.821248  0.125836  0.819543

# VII. Grouping
# A "group by" operation refers to one or more of the following steps:
# * (splitting) split the data into groups according to some criteria;
# * (applying) apply a function to each group independently;
# * (combining) combine the results into a data structure.

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': np.random.randn(8),
    'D': np.random.randn(8)})
print(df)
#      A      B         C         D
# 0  foo    one  0.792610  0.153922
# 1  bar    one  1.497661  0.548711
# 2  foo    two  0.038679  1.100214
# 3  bar  three -1.074874  0.238335
# 4  foo    two  1.176477  1.260415
# 5  bar    two -0.629367 -1.098556
# 6  foo    one  0.015918 -1.646855
# 7  foo  three -0.486434 -0.930165

# 1. Group and apply sum() to each group.
# The numeric columns are selected explicitly; otherwise recent pandas versions
# also aggregate the string column B.
dfg = df.groupby('A')[['C', 'D']].sum()
print(dfg)
#            C         D
# A
# bar -0.20658 -0.311509
# foo  1.53725 -0.062469

# 2. Group by multiple columns to form a hierarchical index, then apply the function:
dfg2 = df.groupby(['A', 'B']).sum()
print(dfg2)
#                   C         D
# A   B
# bar one    1.497661  0.548711
#     three -1.074874  0.238335
#     two   -0.629367 -1.098556
# foo one    0.808528 -1.492933
#     three -0.486434 -0.930165
#     two    1.215156  2.360629

# VIII. Reshaping
# 1. Stack
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'quz', 'quz'],
                    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
print(df2)
#                      A         B
# first second
# bar   one     1.146806  0.413660
#       two    -0.241280 -0.756498
# baz   one    -0.429149 -1.598932
#       two     0.103805 -2.092773

# stack() moves the column labels into the innermost index level:
stacked = df2.stack()
print(stacked)
# first  second
# bar    one     A   -0.671894
#                B    0.488440
#        two     A   -0.085894
#                B   -0.888060
# baz    one     A   -0.647487
#                B   -1.573074
#        two     A    0.084324
#                B   -0.216785
# dtype: float64

# unstack() reverses stack(); by default it unstacks the last level:
stacked0 = stacked.unstack()
print(stacked0)
#                      A         B
# first second
# bar   one    -2.281352  0.683124
#       two    -2.555841  0.020481
# baz   one     1.007699 -0.605463
#       two     1.177308  0.833826
stacked1 = stacked.unstack(1)
print(stacked1)
# second        one       two
# first
# bar   A -2.281352 -2.555841
#       B  0.683124  0.020481
# baz   A  1.007699  1.177308
#       B -0.605463  0.833826
stacked2 = stacked.unstack(0)
print(stacked2)
# first          bar       baz
# second
# one    A -0.279379  0.011654
#        B  0.713347  0.482510
# two    A -0.980093  0.536366
#        B -0.378279 -1.023949

# 2. Pivot tables
df = pd.DataFrame({
    'A': ['one', 'one', 'two', 'three'] * 3,
    'B': ['A', 'B', 'C'] * 4,
    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
    'D': np.random.randn(12),
    'E': np.random.randn(12)})
print(df)
#         A  B    C         D         E
# 0     one  A  foo -1.037929 -0.967839
# 1     one  B  foo  0.143201  1.936801
# 2     two  C  foo -1.108452  1.350176
# 3   three  A  bar  0.696497  0.578974
# 4     one  B  bar -1.206393  1.218049
# 5     one  C  bar -0.814728  0.440277
# 6     two  A  foo -2.039865 -1.298114
# 7   three  B  foo -0.155810 -0.249138
# 8     one  C  foo -0.436593  0.548266
# 9     one  A  bar -2.236853 -1.218478
# 10    two  B  bar -0.542738 -1.018322
# 11  three  C  bar -0.657995 -0.772053
# A pivot table can be produced from this data easily:
pdtable = pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
print(pdtable)
# C             bar       foo
# A     B
# one   A  0.878124  0.739554
#       B  1.508778 -0.261956
#       C  0.452780  0.850025
# three A -0.616593       NaN
#       B       NaN -0.924248
#       C -0.778909       NaN
# two   A       NaN -0.249317
#       B  0.341066       NaN
#       C       NaN  0.706030

# IX. Time series
# pandas has simple, powerful, and efficient functionality for resampling during
# frequency conversion (for example, converting data sampled every second into
# 5-minute data). This is very common in financial applications.
# The how= argument of resample() no longer exists; call the aggregation as a method instead.
# rng = pd.date_range('1/1/2018', periods=100, freq='s')
# ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
# ts0 = ts.resample('5min').sum()
# ........
# ........

# X. Categorical
# Since version 0.15, pandas supports Categorical data in a DataFrame.

# 1. Convert the raw grades to a Categorical data type:
# ........
# ........

# XI. Plotting
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2018', periods=1000))
ts = ts.cumsum()
ts.plot()
# plt.show()  # uncomment to display the figure when running as a script
# ........
# ........

# XII. Reading and writing data
# (1) CSV
# 1. Write to a CSV file:
df.to_csv('foo.csv')
# 2. Read from a CSV file:
pd.read_csv('foo.csv')

# (2) HDF5
# 1.
# ........
# ........

# (3) Excel
# 1. Write to an Excel file:
df.to_excel('foo.xlsx', sheet_name='Sheet1')
# 2. Read from an Excel file:
pd.read_excel('foo.xlsx', sheet_name='Sheet1', index_col=None, na_values=['NA'])
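The statistics part of the listing says that pandas broadcasts automatically when objects with different dimensions need alignment, but it shows no code for it. Here is a minimal sketch of what that might look like. The fixed values (instead of random ones) and the names `s` and `result` are assumptions made for illustration, not part of the original post.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180301', periods=6)
# Deterministic values (0..23) instead of random ones, so the result is easy to check.
df = pd.DataFrame(np.arange(24, dtype='float64').reshape(6, 4),
                  index=dates, columns=list('ABCD'))

# A Series on the same dates, shifted down two rows: the first two entries become NaN.
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)

# Subtract s from every column; axis='index' aligns s with the row labels,
# so each row of df has the matching entry of s subtracted from it.
result = df.sub(s, axis='index')
print(result)
```

Rows whose entry in `s` is NaN come out as all NaN, which matches the note in the listing that missing values are not filled in automatically.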

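The time-series part of the listing describes converting data sampled every second into 5-minute data. The sketch below shows one way to do that, assuming a recent pandas version in which the aggregation is called as a method on the resampler. The constant series of ones is an assumption chosen so that the expected sums are obvious.

```python
import numpy as np
import pandas as pd

# 600 one-second samples, each equal to 1, covering exactly ten minutes.
rng = pd.date_range('2018-01-01', periods=600, freq='s')
ts = pd.Series(np.ones(len(rng), dtype='int64'), index=rng)

# Downsample into 5-minute bins and sum the samples that fall into each bin.
ts_5min = ts.resample('5min').sum()
print(ts_5min)
```

Each 5-minute bin holds 300 one-second samples, so both bins should sum to 300. Swapping `.sum()` for `.mean()` or `.max()` changes the aggregation without touching the binning.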
 
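The Categorical part of the listing mentions converting raw grades to a Categorical type but elides the code. The following is only a hypothetical sketch of that idea: the columns `id` and `raw_grade`, the grade letters, and the category names are all made up for illustration and are not taken from the original post.

```python
import pandas as pd

# Hypothetical sample data: six records with letter grades.
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6],
                   'raw_grade': ['a', 'b', 'b', 'a', 'a', 'e']})

# Convert the raw grades to a categorical column.
df['grade'] = df['raw_grade'].astype('category')

# Give the categories more meaningful names.
df['grade'] = df['grade'].cat.rename_categories(
    {'a': 'very good', 'b': 'good', 'e': 'very bad'})

# Fix the category order and add categories that do not occur in the data.
df['grade'] = df['grade'].cat.set_categories(
    ['very bad', 'bad', 'medium', 'good', 'very good'])

# Sorting follows the category order, not alphabetical order.
sorted_df = df.sort_values(by='grade')
print(sorted_df)

# Grouping with observed=False also reports the empty categories.
counts = df.groupby('grade', observed=False).size()
print(counts)
```

The point of the category order is that `sort_values` puts "very bad" before "good", which plain string sorting would not do.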

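The grouping part of the listing describes "group by" as splitting, applying, and combining. To make those three steps concrete, the sketch below performs them by hand and compares the result with the built-in call. The small fixed DataFrame and the names `groups`, `sums`, `manual`, and `builtin` are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

# Splitting: iterating over a groupby yields (key, sub-DataFrame) pairs.
groups = {key: part for key, part in df.groupby('A')}

# Applying: compute the sum of column C within each group.
sums = {key: part['C'].sum() for key, part in groups.items()}

# Combining: collect the per-group results into one Series.
manual = pd.Series(sums, name='C').sort_index()

# The built-in call does all three steps at once.
builtin = df.groupby('A')['C'].sum()

print(manual)
print(builtin)
```

In practice the built-in form is the one to use; the manual version is only there to show what happens at each step.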
 

Reposted from: http://ognzx.baihongyu.com/
