Backup diff of Introduction to pandas (No.5) - 闘うITエンジニアの覚え書き

#author("2019-01-08T09:40:38+00:00","","")
#author("2019-01-13T04:41:56+00:00","","")
[[Python]] &gt;
* Introduction to pandas [#y658b3b9]
#setlinebreak(on);

#TODO

#contents
-- References
--- https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.Series.html
--- https://deepage.net/features/pandas-series.html
-- Related
--- [[Python]]
--- [[Python覚え書き]]

// https://deepage.net/features/pandas-series.html
// https://deepage.net/features/pandas-dataframe.html
// https://deepage.net/features/pandas-numpy.html

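** What is pandas [#f2c87a29]
#html(<div style="padding-left:10px;">)

pandas is a data analysis library for Python.
It can handle one-dimensional and two-dimensional data in a variety of formats such as CSV and text files.
Besides the basic data operations (reading, adding, updating, deleting), it provides many other features such as aggregation, grouping, and time series handling.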
#html(</div>)

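** pandas and numpy [#lc680a3a]
#html(<div style="padding-left:10px;">)

numpy is a library specialized mainly in handling numeric data in multidimensional arrays.
It is geared almost entirely toward numeric data, and in exchange it runs fast.

pandas uses numpy internally and makes it easier to use.
In pandas, abstraction and function wrappers make various operations more convenient, but numeric computation is generally faster with plain numpy.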
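
As a rough illustration of this relationship, the sketch below pulls the underlying numpy array out of a Series through the values attribute and runs the same sum both ways. The variable names are made up for this example, and no timing is measured here.

```python
import numpy as np
import pandas as pd

series = pd.Series([1.5, 2.5, 3.0])

# The data inside a Series is held as a numpy array
array = series.values
print(type(array))

# The same sum, once through pandas and once through plain numpy
pandas_total = series.sum()
numpy_total = np.sum(array)
print(pandas_total, numpy_total)
```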

#html(</div>)

** Installation [#x9fa32ea]
#html(<div style="padding-left:10px;">)

Just run pip install as usual.
#myterm2(){{
pip install pandas
}}
Note: other required libraries such as numpy are installed along with it.

#html(</div>)

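** Working with one-dimensional data [#df8c38f8]
#html(<div style="padding-left:10px;">)


One-dimensional data is handled with Series, the most basic pandas object.
Note: a Series can also be created from a numpy ndarray, although that is not covered below.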
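
For reference, creating a Series from an ndarray might look like the following minimal sketch. The values and index labels are arbitrary sample data chosen for this example.

```python
import numpy as np
import pandas as pd

# A one-dimensional ndarray (arbitrary sample values)
array = np.array([10, 20, 30])

# Pass it to pd.Series just like a list; the index argument is optional
series = pd.Series(array, index=["a", "b", "c"])
print(series)
```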

*** Creating a Series from a list [#yebbdb7a]
#html(<div style="padding-left:10px;">)

#mycode2(){{
import pandas as pd

series = pd.Series([1,2,3,4,5])
print(series)
}}

Result
#myterm2(){{
0     1
1     2
2     3
3     4
4     5
dtype: int64
}}

The index can also be something other than sequential numbers.
#mycode2(){{
import pandas as pd

series = pd.Series([1,2,3,4,5], index=['one','two','three','four','five'])
print(series)
}}

Result
#myterm2(){{
one      1
two      2
three    3
four     4
five     5
dtype: int64
}}

The index can also be assigned afterwards.
#mycode2(){{
import pandas as pd

series = pd.Series([1,2,3,4,5])
series.index = ['one','two','three','four','five']
print(series)
}}

Result
#myterm2(){{
one      1
two      2
three    3
four     4
five     5
dtype: int64
}}
#html(</div>)

*** Creating a Series from a dict [#i71803b9]
#html(<div style="padding-left:10px;">)

Let's turn a dict holding search engine market shares into a Series.
http://gs.statcounter.com/search-engine-market-share

#mycode2(){{
import pandas as pd

# Create from a dict
series = pd.Series({"google": 92.31, "yahoo": 2.51, "bing": 2.27})
print(series)
}}

Result
#myterm2(){{
google    92.31
yahoo      2.51
bing       2.27
dtype: float64
}}
#html(</div>)

*** Attributes and methods provided by Series [#ue306291]
#html(<div style="padding-left:10px;">)

There are a great many of them.
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.Series.html

#mycode2(){{
import pandas as pd

series = pd.Series([1,2,3,4,5,4,6,1,3,4])

print(series.size)  # number of elements
print(series.values)  # values as a numpy array
print(series.sum())  # sum
print(series.mean())  # mean
print(series.gt(2))  # bool Series: True where the value is greater than 2
print(series.head(2))  # first 2 elements
print(series.max())  # maximum
print(series.to_json())  # convert to JSON
print(series.filter(regex="3"))  # keep entries whose index label matches the regex
print(series.drop_duplicates())  # remove duplicate values
}}
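
Note that gt(2) returns a bool Series rather than the matching elements themselves. A minimal sketch of using that bool Series to pick out the elements greater than 2 (the names mask and greater are chosen only for this example):

```python
import pandas as pd

series = pd.Series([1,2,3,4,5,4,6,1,3,4])

mask = series.gt(2)      # bool Series, equivalent to series > 2
greater = series[mask]   # keep only the elements where mask is True
print(greater)
```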

#html(</div>)


#html(</div>)

** Working with two-dimensional data [#n193eef1]
#html(<div style="padding-left:10px;">)

Two-dimensional data is handled with DataFrame.

*** Creating a DataFrame from a list [#l44c7837]
#html(<div style="padding-left:10px;">)
#mycode2(){{
# coding: utf-8

import pandas as pd

df = pd.DataFrame([[1,2,3],[4,5,6]])
print(df)
}}

Result
#myterm2(){{
   0  1  2
0  1  2  3
1  4  5  6
}}

#html(</div>)

*** Creating a DataFrame from multiple Series [#wd55864c]
#html(<div style="padding-left:10px;">)

#mycode2(){{
import pandas as pd

search_engine = pd.Series(["google", "yahoo", "bing"])
share = pd.Series([92.31, 2.51, 2.27])
df = pd.DataFrame({
    "engine" : search_engine,
    "share": share
})
print(df)
}}

Result
#myterm2(){{
   engine  share
0  google  92.31
1   yahoo   2.51
2    bing   2.27
}}
#html(</div>)

*** Creating a DataFrame from multiple dicts [#wd55864c]
#html(<div style="padding-left:10px;">)

#mycode2(){{
import pandas as pd

# Outer keys become column names, inner keys become row labels
data = {
    "share": {"google": 92.31, "yahoo": 2.51, "bing": 2.27}
}
df = pd.DataFrame(data)
print(df)
}}

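Result
#myterm2(){{
        share
bing     2.27
google  92.31
yahoo    2.51
}}
Note: this output comes from an older pandas, which sorts the row labels. Newer versions may keep the dict's insertion order (google, yahoo, bing) instead.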
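
The example above uses only one inner dict. A minimal sketch with several inner dicts, assuming one dict per month (the values are taken from the monthly table used later on this page):

```python
import pandas as pd

# One inner dict per month; the layout is an assumption made for this example
data = {
    "2018-11": {"google": 92.37, "yahoo": 2.37, "bing": 2.25},
    "2018-12": {"google": 92.25, "yahoo": 2.41, "bing": 2.07},
}
df = pd.DataFrame(data)  # outer keys -> columns, inner keys -> row labels
print(df)
```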
#html(</div>)

*** Creating a DataFrame from a list of dicts [#a196780b]
#html(<div style="padding-left:10px;">)

#mycode2(){{
import pandas as pd

data = [
    {"google": 92.31, "yahoo": 2.51, "bing": 2.27}
]
df = pd.DataFrame(data, index=["share"])
print(df)
}}

Result
#myterm2(){{
       bing  google  yahoo
share  2.27   92.31   2.51
}}
#html(</div>)

*** Creating a DataFrame from a list of lists [#wd55864c]
#html(<div style="padding-left:10px;">)

#mycode2(){{
import pandas as pd

df = pd.DataFrame([
        [92.31, 2.27, 2.51],
        [92.74, 2.17, 2.32],
        [92.37, 2.37, 2.25],
        [92.25, 2.41, 2.07]
    ],  
    index=["2018-09", "2018-10", "2018-11", "2018-12"],
    columns=["google", "yahoo", "bing"]
)

print(df)

print("-- google --")
print(df["google"])

print("-- google(list) --")
print(list(df["google"]))

print("-- google(dict) --")
print(dict(df["google"]))

print("-- 2018-09 --")
print(df.loc["2018-09"])

print("-- 2018-09(list) --")
print(list(df.loc["2018-09"]))

print("-- 2018-09(dict) --")
print(dict(df.loc["2018-09"]))
}}

Result
#myterm2(){{
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10   92.74   2.17  2.32
2018-11   92.37   2.37  2.25
2018-12   92.25   2.41  2.07
-- google --
2018-09    92.31
2018-10    92.74
2018-11    92.37
2018-12    92.25
Name: google, dtype: float64
-- google(list) --
[92.31, 92.74, 92.37, 92.25]
-- google(dict) --
{'2018-09': 92.31, '2018-10': 92.74, '2018-11': 92.37, '2018-12': 92.25}
-- 2018-09 --
google    92.31
yahoo      2.27
bing       2.51
Name: 2018-09, dtype: float64
-- 2018-09(list) --
[92.31, 2.27, 2.51]
-- 2018-09(dict) --
{'google': 92.31, 'yahoo': 2.27, 'bing': 2.51}
}}
#html(</div>)

*** Adding columns [#z5fee21f]
#html(<div style="padding-left:10px;">)

Let's add columns that show whether each value rose from the previous month.

#mycode2(){{
import pandas as pd

df = pd.DataFrame([
        [92.31, 2.27, 2.51],
        [92.74, 2.17, 2.32],
        [92.37, 2.37, 2.25],
        [92.25, 2.41, 2.07]
    ],  
    index=["2018-09", "2018-10", "2018-11", "2018-12"],
    columns=["google", "yahoo", "bing"]
)

# Work on a copy so that the original df stays unchanged
df2 = df.copy()
for col in ["google", "yahoo", "bing"]:
    values = list(df[col])
    # The first month has no previous month, so it gets '-';
    # every other row is True if the value rose from the previous month
    df2[col + "(up)"] = ["-"] + [prev < cur for prev, cur in zip(values, values[1:])]

print("-- data(before) --")
print(df)

print("-- data(after) --")
print(df2)
}}

Result
#myterm2(){{
-- data(before) --
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10   92.74   2.17  2.32
2018-11   92.37   2.37  2.25
2018-12   92.25   2.41  2.07
-- data(after) --
         google  yahoo  bing google(up) yahoo(up) bing(up)
2018-09   92.31   2.27  2.51          -         -        -
2018-10   92.74   2.17  2.32       True     False    False
2018-11   92.37   2.37  2.25      False      True    False
2018-12   92.25   2.41  2.07      False      True    False
}}
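
As an alternative to comparing list elements by hand, the change from the previous row can also be computed with diff(). This is only a sketch for the google column. The column name google(diff) is made up for this example, and unlike the table above the first row ends up as NaN / False rather than '-'.

```python
import pandas as pd

df = pd.DataFrame([
        [92.31, 2.27, 2.51],
        [92.74, 2.17, 2.32],
        [92.37, 2.37, 2.25],
        [92.25, 2.41, 2.07]
    ],
    index=["2018-09", "2018-10", "2018-11", "2018-12"],
    columns=["google", "yahoo", "bing"]
)

df2 = df.copy()
# diff() subtracts the previous row from each row; the first row becomes NaN
df2["google(diff)"] = df["google"].diff()
# gt(0) is True where the value rose (NaN compares as False)
df2["google(up)"] = df2["google(diff)"].gt(0)
print(df2)
```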

#html(</div>)

** Other basic methods [#jb08f6a8]
#html(</div>)

** Various ways to read data [#j8eecd4c]
#html(<div style="padding-left:10px;">)

*** Reading CSV data [#v51d31d5]
#html(<div style="padding-left:10px;">)

CSV data can be read as a DataFrame with read_csv.

search_engine_share.csv
#mycode2(){{
,Google,bing,Yahoo
2018-09,92.31,2.27,2.51
2018-10,92.74,2.17,2.32
2018-11,92.37,2.37,2.25
2018-12,92.25,2.41,2.07
}}

read_csv.py
#mycode2(){{
import pandas as pd

df = pd.read_csv("search_engine_share.csv", index_col=0) # use column 0 as the row labels (index)
#df = pd.read_csv("search_engine_share.csv", sep='\t')    # for tab-separated data
#df = pd.read_csv("search_engine_share.csv", header=0)    # row number of the header (by default the first row)
#df = pd.read_csv("search_engine_share_noheader.csv", names=["google", "yahoo", "bing"])    # supply the header yourself

print(df)
}}

Result
#myterm2(){{
         Google  bing  Yahoo
2018-09   92.31  2.27   2.51
2018-10   92.74  2.17   2.32
2018-11   92.37  2.37   2.25
2018-12   92.25  2.41   2.07
}}

read_csv options
| Parameter | Description | Example | Notes |h
| index_col | Column to use as the row labels (index), given by column number | index_col=0 | |
| sep | Delimiter | sep='\t' | |
| header | Row number of the header | header=0 | By default the first row is used. If the data has no header, pass None or supply the names yourself with names |
| names | Column names to use as the header | names=["google", "yahoo", "bing"] | |
| usecols | Columns to read | usecols=[1, 3] | |

There are many more besides these.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
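
Several of these options can be combined. The sketch below reads headerless CSV text from a StringIO instead of a file so that it runs on its own. The column name month and the choice of columns are assumptions made for this example.

```python
from io import StringIO

import pandas as pd

# Headerless CSV text standing in for a file (sample rows from the CSV above)
csv_text = """2018-09,92.31,2.27,2.51
2018-10,92.74,2.17,2.32
"""

df = pd.read_csv(
    StringIO(csv_text),
    header=None,                                  # the data has no header row
    names=["month", "google", "bing", "yahoo"],   # so supply the column names
    usecols=[0, 1, 3],                            # read only month, google, yahoo
    index_col="month",                            # use month as the row labels
)
print(df)
```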

#html(</div>)

*** Reading JSON [#b7067fd1]
#html(<div style="padding-left:10px;">)

JSON can be read as a DataFrame with read_json.

#mycode2(){{
import pandas as pd
import json
from io import StringIO

json_text = json.dumps({
    "google": {"2018-09": 92.31, "2018-10": 92.74, "2018-11": 92.37, "2018-12": 92.25},
    "yahoo": {"2018-09": 2.27, "2018-10": 2.17, "2018-11": 2.37, "2018-12": 2.41},
    "bing": {"2018-09": 2.51, "2018-10": 2.32, "2018-11": 2.25, "2018-12": 2.07},
})

# Wrap the string in StringIO: passing a literal JSON string directly is deprecated in newer pandas
df = pd.read_json(StringIO(json_text), convert_axes=False)
#df = pd.read_json("search_engine_share.json", convert_axes=False) # reading from a file also works

print(df)
}}

Result
#myterm2(){{
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10   92.74   2.17  2.32
2018-11   92.37   2.37  2.25
2018-12   92.25   2.41  2.07
}}

read_json options
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html

#html(</div>)

*** Others [#jc04073e]
#html(<div style="padding-left:10px;">)

Data can also be read from Excel, databases, and HTML.
https://pandas.pydata.org/pandas-docs/stable/api.html#input-output
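
As one example of reading from a database, the sketch below uses an in-memory SQLite database from the standard library together with read_sql_query. The table name share and its columns are hypothetical and exist only for this illustration.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database with a hypothetical table for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE share (month TEXT, google REAL)")
conn.executemany(
    "INSERT INTO share VALUES (?, ?)",
    [("2018-11", 92.37), ("2018-12", 92.25)],
)

# read_sql_query runs the SQL and returns the rows as a DataFrame
df = pd.read_sql_query(
    "SELECT month, google FROM share ORDER BY month",
    conn,
    index_col="month",
)
conn.close()
print(df)
```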

#html(</div>)

** Extracting data [#qd4e95ce]
#html(<div style="padding-left:10px;">)
#TODO

*** Extracting by column name [#d1d129f1]
#html(<div style="padding-left:10px;">)

#mycode2(){{
import pandas as pd
import json
from io import StringIO

json_text = json.dumps({
    "google": {"2018-09": 92.31, "2018-10": 92.74, "2018-11": 92.37, "2018-12": 92.25},
    "yahoo": {"2018-09": 2.27, "2018-10": 2.17, "2018-11": 2.37, "2018-12": 2.41},
    "bing": {"2018-09": 2.51, "2018-10": 2.32, "2018-11": 2.25, "2018-12": 2.07},
})

# Wrap the string in StringIO: passing a literal JSON string directly is deprecated in newer pandas
df = pd.read_json(StringIO(json_text), convert_axes=False)
print("### All data ###")
print(df)

print("### Extract only yahoo ###")
print(df["yahoo"])
}}

Result
#myterm2(){{
### All data ###
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10   92.74   2.17  2.32
2018-11   92.37   2.37  2.25
2018-12   92.25   2.41  2.07
### Extract only yahoo ###
2018-09    2.27
2018-10    2.17
2018-11    2.37
2018-12    2.41
Name: yahoo, dtype: float64
}}

#html(</div>)

*** Extracting by row [#i0a8e671]
#html(<div style="padding-left:10px;">)

Rows can be extracted by passing a list (or Series) of bool values as the DataFrame subscript. (The rows that are True are extracted.)

#mycode2(){{
import pandas as pd
import json
from io import StringIO

json_text = json.dumps({
    "google": {"2018-09": 92.31, "2018-10": 92.74, "2018-11": 92.37, "2018-12": 92.25},
    "yahoo": {"2018-09": 2.27, "2018-10": 2.17, "2018-11": 2.37, "2018-12": 2.41},
    "bing": {"2018-09": 2.51, "2018-10": 2.32, "2018-11": 2.25, "2018-12": 2.07},
})

# Wrap the string in StringIO: passing a literal JSON string directly is deprecated in newer pandas
df = pd.read_json(StringIO(json_text), convert_axes=False)
print("### All data ###")
print(df)

print("### Extract only rows 2 and 4 ###")
print(df[[False, True, False, True]])
}}

Result
#myterm2(){{
### All data ###
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10   92.74   2.17  2.32
2018-11   92.37   2.37  2.25
2018-12   92.25   2.41  2.07
### Extract only rows 2 and 4 ###
         google  yahoo  bing
2018-10   92.74   2.17  2.32
2018-12   92.25   2.41  2.07
}}
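
Besides a bool list, rows can also be picked by label with loc or by position with iloc. A minimal sketch on a cut-down version of the same data (only the google column is kept to stay short):

```python
import pandas as pd

df = pd.DataFrame(
    {"google": [92.31, 92.74, 92.37, 92.25]},
    index=["2018-09", "2018-10", "2018-11", "2018-12"],
)

by_label = df.loc[["2018-10", "2018-12"]]   # select rows by index label
by_position = df.iloc[[1, 3]]               # select rows by 0-based position
print(by_label)
print(by_position)
```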
#html(</div>)

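*** Extracting by condition [#faa3076a]
#html(<div style="padding-left:10px;">)

As shown in [[Extracting by row>#i0a8e671]], passing a list (or Series) of bool values as the DataFrame subscript extracts the rows that are True.
A comparison on a column, such as df["yahoo"] > 2.3, yields exactly such a bool Series, so a condition can be written directly in the subscript.

#mycode2(){{
import pandas as pd
import json
from io import StringIO

json_text = json.dumps({
    "google": {"2018-09": 92.31, "2018-10": 92.74, "2018-11": 92.37, "2018-12": 92.25},
    "yahoo": {"2018-09": 2.27, "2018-10": 2.17, "2018-11": 2.37, "2018-12": 2.41},
    "bing": {"2018-09": 2.51, "2018-10": 2.32, "2018-11": 2.25, "2018-12": 2.07},
})

# Wrap the string in StringIO: passing a literal JSON string directly is deprecated in newer pandas
df = pd.read_json(StringIO(json_text), convert_axes=False)
print("### All data ###")
print(df)

print("### Extract only rows 2 and 4 ###")
print(df[[False, True, False, True]])
}}

Result
#myterm2(){{
### All data ###
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10   92.74   2.17  2.32
2018-11   92.37   2.37  2.25
2018-12   92.25   2.41  2.07
### Extract only rows 2 and 4 ###
         google  yahoo  bing
2018-10   92.74   2.17  2.32
2018-12   92.25   2.41  2.07
}}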
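
A condition-based extraction could look like the following sketch. It builds the same data directly as a DataFrame, and the thresholds 2.3 and 2.1 are arbitrary values picked for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "google": [92.31, 92.74, 92.37, 92.25],
        "yahoo": [2.27, 2.17, 2.37, 2.41],
        "bing": [2.51, 2.32, 2.25, 2.07],
    },
    index=["2018-09", "2018-10", "2018-11", "2018-12"],
)

# A comparison on a column gives a bool Series with one value per row
condition = df["yahoo"] > 2.3
print(condition)

# Using it as the subscript keeps only the rows where it is True
matched = df[condition]
print(matched)

# Conditions are combined with & (and) or | (or); each side needs parentheses
both = df[(df["yahoo"] > 2.3) & (df["bing"] < 2.1)]
print(both)
```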
#html(</div>)


#html(</div>)

** Processing data [#yddac7e5]
#html(<div style="padding-left:10px;">)
#TODO
#html(</div>)


#html(</div>)