Introduction to pandas - 闘うITエンジニアの覚え書き

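What is pandas?

pandas is a data analysis library for Python.
It handles one- and two-dimensional data in many formats, such as CSV and other text files.
Beyond the basic operations (reading, adding, updating, deleting), it offers aggregation, grouping, time-series handling, and more.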

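Contents

What is pandas?
pandas and numpy
Installation
Working with one-dimensional data
Working with two-dimensional data
Reading data from various sources
Extracting data
Transforming data
Grouping
Sorting data
- Sorting by value
- Sorting by data name (label/index)
Implementation notes
- Extracting consecutive data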

References
- https://pandas.pydata.org/pandas-docs/stable/index.html

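pandas and numpy

numpy is a library specialized for numeric data held in multidimensional arrays.
It handles little besides numeric data, but in exchange it runs fast.

pandas uses numpy internally and wraps it to make it easier to use.
Its abstractions and function wrappers make many operations more convenient, but plain numpy is generally faster for raw numeric computation.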
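
The relationship between the two libraries can be sketched as follows. This is a minimal illustration, not a benchmark: the sample values and variable names are made up for the example. It shows that a Series exposes its values as a plain numpy ndarray, so either library can perform the same reduction.

```python
import numpy as np
import pandas as pd

series = pd.Series([1.5, 2.5, 3.0], index=["a", "b", "c"])

# to_numpy() exposes the values as a plain ndarray (the labels are dropped).
values = series.to_numpy()

# The same reduction, once through pandas and once through numpy.
total_pandas = series.sum()
total_numpy = np.sum(values)

print(type(values).__name__)
print(total_pandas, total_numpy)
```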

Installation

Just run pip install as usual.

pip install pandas

Note: required dependencies such as numpy are installed along with it.

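Working with one-dimensional data

One-dimensional data is handled with Series, the most basic pandas object.
Note: a Series can also be created from a numpy ndarray, although the examples that follow do not cover it.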
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html#pandas.Series
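
For completeness, creating a Series from an ndarray might look like the sketch below. The array contents and index labels are arbitrary values chosen for illustration.

```python
import numpy as np
import pandas as pd

# An ndarray can be passed to pd.Series just like a list.
array = np.array([10, 20, 30])
series_from_array = pd.Series(array, index=["a", "b", "c"])
print(series_from_array)
```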

Creating a Series from a list

import pandas as pd

series = pd.Series([1,2,3,4,5])
print(series)

Result

0     1
1     2
2     3
3     4
4     5
dtype: int64

An index other than sequential numbers can also be specified.

import pandas as pd

series = pd.Series([1,2,3,4,5], index=['one','two','three','four','five'])
print(series)

Result

one      1
two      2
three    3
four     4
five     5
dtype: int64

The index can also be assigned afterwards.

import pandas as pd

series = pd.Series([1,2,3,4,5])
series.index = ['one','two','three','four','five']
print(series)

Result

one      1
two      2
three    3
four     4
five     5
dtype: int64

Creating a Series from a dictionary

Here a dict holding search engine market shares is turned into a Series.
http://gs.statcounter.com/search-engine-market-share

import pandas as pd

# Create from a dictionary
series = pd.Series({"google": 92.31, "yahoo": 2.51, "bing": 2.27})
print(series)

Result

google    92.31
yahoo      2.51
bing       2.27
dtype: float64

Attributes and methods provided by Series

There are a great many of them.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html#pandas.Series

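import pandas as pd

series = pd.Series([1,2,3,4,5,4,6,1,3,4])

print(series.size)  # number of elements
print(series.values)  # values as an ndarray
print(series.sum())  # sum
print(series.mean())  # mean
print(series.gt(2))  # element-wise test "greater than 2" (returns a bool Series)
print(series.head(2))  # first two elements
print(series.max())  # maximum
print(series.to_json())  # convert to JSON
print(series.filter(regex="3"))  # keep entries whose index label matches the regex
print(series.drop_duplicates())  # drop duplicate values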

Working with two-dimensional data

Two-dimensional data is handled with DataFrame.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html#pandas.DataFrame

Creating a DataFrame from a list

# coding: utf-8

import pandas as pd

df = pd.DataFrame([[1,2,3],[4,5,6]])
print(df)

Result

   0  1  2
0  1  2  3
1  4  5  6

Creating a DataFrame from multiple Series

import pandas as pd

search_engine = pd.Series(["google", "yahoo", "bing"])
share = pd.Series([92.31, 2.51, 2.27])
df = pd.DataFrame({
    "engine" : search_engine,
    "share": share
})
print(df)

Result

   engine  share
0  google  92.31
1   yahoo   2.51
2    bing   2.27

Creating a DataFrame from nested dictionaries

import pandas as pd

data = {
    "share": {"google": 92.31, "yahoo": 2.51, "bing": 2.27}
}
df = pd.DataFrame(data)
print(df)

Result

        share
bing     2.27
google  92.31
yahoo    2.51

Creating a DataFrame from a list of dictionaries

import pandas as pd

data = [
    {"google": 92.31, "yahoo": 2.51, "bing": 2.27}
]
df = pd.DataFrame(data, index=["share"])
print(df)

Result

       bing  google  yahoo
share  2.27   92.31   2.51

Creating a DataFrame from a list of lists

import pandas as pd

df = pd.DataFrame([
        [92.31, 2.27, 2.51],
        [92.74, 2.17, 2.32],
        [92.37, 2.37, 2.25],
        [92.25, 2.41, 2.07]
    ],  
    index=["2018-09", "2018-10", "2018-11", "2018-12"],
    columns=["google", "yahoo", "bing"]
)

print(df)

print("-- google --")
print(df["google"])

print("-- google(list) --")
print(list(df["google"]))

print("-- google(dict) --")
print(dict(df["google"]))

print("-- 2018-09 --")
print(df.loc["2018-09"])

print("-- 2018-09(list) --")
print(list(df.loc["2018-09"]))

print("-- 2018-09(dict) --")
print(dict(df.loc["2018-09"]))

Result

         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10   92.74   2.17  2.32
2018-11   92.37   2.37  2.25
2018-12   92.25   2.41  2.07
-- google --
2018-09    92.31
2018-10    92.74
2018-11    92.37
2018-12    92.25
Name: google, dtype: float64
-- google(list) --
[92.31, 92.74, 92.37, 92.25]
-- google(dict) --
{'2018-09': 92.31, '2018-10': 92.74, '2018-11': 92.37, '2018-12': 92.25}
-- 2018-09 --
google    92.31
yahoo      2.27
bing       2.51
Name: 2018-09, dtype: float64
-- 2018-09(list) --
[92.31, 2.27, 2.51]
-- 2018-09(dict) --
{'google': 92.31, 'yahoo': 2.27, 'bing': 2.51}

Adding columns

Here we add columns indicating whether each value rose from the previous month.

import pandas as pd

df = pd.DataFrame([
        [92.31, 2.27, 2.51],
        [92.74, 2.17, 2.32],
        [92.37, 2.37, 2.25],
        [92.25, 2.41, 2.07]
    ],  
    index=["2018-09", "2018-10", "2018-11", "2018-12"],
    columns=["google", "yahoo", "bing"]
)

df2 = pd.DataFrame(df, copy=True)
google_list = list(df2["google"])
yahoo_list = list(df2["yahoo"])
bing_list = list(df2["bing"])
df2["google(up)"] = [google_list[i-1] < google_list[i] if i > 0 else '-' for i, val in enumerate(google_list)]
df2["yahoo(up)"] = [yahoo_list[i-1] < yahoo_list[i] if i > 0 else '-' for i, val in enumerate(yahoo_list)]
df2["bing(up)"] = [bing_list[i-1] < bing_list[i] if i > 0 else '-' for i, val in enumerate(bing_list)]

print("-- data(before) --")
print(df)

print("-- data(after) --")
print(df2)

Result

-- data(before) --
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10   92.74   2.17  2.32
2018-11   92.37   2.37  2.25
2018-12   92.25   2.41  2.07
-- data(after) --
         google  yahoo  bing google(up) yahoo(up) bing(up)
2018-09   92.31   2.27  2.51          -         -        -
2018-10   92.74   2.17  2.32       True     False    False
2018-11   92.37   2.37  2.25      False      True    False
2018-12   92.25   2.41  2.07      False      True    False
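
The month-over-month comparison can also be expressed without intermediate lists. The sketch below is an alternative using DataFrame.diff, not the approach used above. One difference to be aware of: the first row has no previous month, so it comes out as False here rather than '-'.

```python
import pandas as pd

df = pd.DataFrame(
    [[92.31, 2.27, 2.51], [92.74, 2.17, 2.32], [92.37, 2.37, 2.25], [92.25, 2.41, 2.07]],
    index=["2018-09", "2018-10", "2018-11", "2018-12"],
    columns=["google", "yahoo", "bing"],
)

# diff() subtracts the previous row; the first row has no predecessor and becomes NaN.
# NaN > 0 evaluates to False, so the first row is False in every column.
went_up = (df.diff() > 0).add_suffix("(up)")

# Append the new bool columns to a copy of the original data.
df2 = pd.concat([df, went_up], axis=1)
print(df2)
```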

Reading data from various sources

Reading CSV data

read_csv reads CSV data into a DataFrame.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

search_engine_share.csv

,Google,bing,Yahoo
2018-09,92.31,2.27,2.51
2018-10,92.74,2.17,2.32
2018-11,92.37,2.37,2.25
2018-12,92.25,2.41,2.07

read_csv.py

import io
import pandas as pd

df = pd.read_csv("search_engine_share.csv", index_col=0) # use column 0 as the row labels (index)
#df = pd.read_csv("search_engine_share.csv", sep='\t')    # tab-separated input
#df = pd.read_csv("search_engine_share.csv", header=0)    # row number of the header (this is also what the default 'infer' resolves to)
#df = pd.read_csv("search_engine_share_noheader.csv", names=["google", "yahoo", "bing"])    # supply the column names yourself
#df = pd.read_csv(io.StringIO(csv_text))    # read from a string

print(df)

Result

         Google  bing  Yahoo
2018-09   92.31  2.27   2.51
2018-10   92.74  2.17   2.32
2018-11   92.37  2.37   2.25
2018-12   92.25  2.41   2.07

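read_csv options

Parameter	Description	Example	Notes
index_col	Column number to use as the row labels (index)	index_col=0
sep	Delimiter	sep='\t'
header	Row number of the header	header=0	Default: 'infer' (behaves like 0 when names is not given). If the data has no header row, pass None or supply the names yourself with names
names	Column names to use	names=["google", "yahoo", "bing"]
usecols	Columns to read	usecols=[1, 3]

There are many more besides these.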
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
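
A few of these options combined, as a minimal sketch. The CSV text and the column name "month" are hypothetical, made up for this illustration. The data is read from a string via io.StringIO so that no file is needed.

```python
import io
import pandas as pd

# Hypothetical header-less CSV text, used only for this illustration.
csv_text = """2018-09,92.31,2.27,2.51
2018-10,92.74,2.17,2.32
"""
column_names = ["month", "google", "bing", "yahoo"]

# header=None: the data has no header row, so the names are supplied via names.
df_named = pd.read_csv(io.StringIO(csv_text), header=None, names=column_names, index_col=0)
print(df_named)

# usecols restricts the columns that are parsed (here: month and yahoo).
df_subset = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=column_names,
    usecols=["month", "yahoo"],
    index_col="month",
)
print(df_subset)
```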

Reading JSON

read_json reads JSON into a DataFrame.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html

import io
import json
import pandas as pd

json_text = json.dumps({
    "google": {"2018-09": 92.31, "2018-10": 92.74, "2018-11": 92.37, "2018-12": 92.25},
    "yahoo": {"2018-09": 2.27, "2018-10": 2.17, "2018-11": 2.37, "2018-12": 2.41},
    "bing": {"2018-09": 2.51, "2018-10": 2.32, "2018-11": 2.25, "2018-12": 2.07},
})

# Wrap the JSON text in StringIO; recent pandas versions deprecate passing a literal JSON string.
df = pd.read_json(io.StringIO(json_text), convert_axes=False)
#df = pd.read_json("search_engine_share.json", convert_axes=False) # reading from a file also works

print(df)

read_json options
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html

Result

         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10   92.74   2.17  2.32
2018-11   92.37   2.37  2.25
2018-12   92.25   2.41  2.07

Other sources

Data can also be read from Excel, databases, and HTML.
https://pandas.pydata.org/pandas-docs/stable/api.html#input-output
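
As one example of a non-file source, the sketch below reads from a database with pd.read_sql. The in-memory SQLite table and its name ("share") are hypothetical, created only for this illustration.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory table, created only for this illustration.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE share (month TEXT, google REAL)")
connection.executemany(
    "INSERT INTO share VALUES (?, ?)",
    [("2018-09", 92.31), ("2018-10", 92.74)],
)

# read_sql runs the query and returns the result set as a DataFrame.
df_sql = pd.read_sql(
    "SELECT month, google FROM share ORDER BY month",
    connection,
    index_col="month",
)
connection.close()
print(df_sql)
```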

Extracting data

Extracting by column name

import io
import json
import pandas as pd

json_text = json.dumps({
    "google": {"2018-09": 92.31, "2018-10": 92.74, "2018-11": 92.37, "2018-12": 92.25},
    "yahoo": {"2018-09": 2.27, "2018-10": 2.17, "2018-11": 2.37, "2018-12": 2.41},
    "bing": {"2018-09": 2.51, "2018-10": 2.32, "2018-11": 2.25, "2018-12": 2.07},
})

# Wrap the JSON text in StringIO; recent pandas versions deprecate passing a literal JSON string.
df = pd.read_json(io.StringIO(json_text), convert_axes=False)
print("### All data ###")
print(df)

print("### Only yahoo ###")
print(df["yahoo"])

Result

### All data ###
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10   92.74   2.17  2.32
2018-11   92.37   2.37  2.25
2018-12   92.25   2.41  2.07
### Only yahoo ###
2018-09    2.27
2018-10    2.17
2018-11    2.37
2018-12    2.41
Name: yahoo, dtype: float64

Extracting specific rows

Passing a list (or Series) of bool values as the DataFrame subscript extracts the rows marked True.

import io
import json
import pandas as pd

json_text = json.dumps({
    "google": {"2018-09": 92.31, "2018-10": 92.74, "2018-11": 92.37, "2018-12": 92.25},
    "yahoo": {"2018-09": 2.27, "2018-10": 2.17, "2018-11": 2.37, "2018-12": 2.41},
    "bing": {"2018-09": 2.51, "2018-10": 2.32, "2018-11": 2.25, "2018-12": 2.07},
})

# Wrap the JSON text in StringIO; recent pandas versions deprecate passing a literal JSON string.
df = pd.read_json(io.StringIO(json_text), convert_axes=False)
print("### All data ###")
print(df)

print("### Only rows 2 and 4 ###")
print(df[[False, True, False, True]])

Result

### All data ###
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10   92.74   2.17  2.32
2018-11   92.37   2.37  2.25
2018-12   92.25   2.41  2.07
### Only rows 2 and 4 ###
         google  yahoo  bing
2018-10   92.74   2.17  2.32
2018-12   92.25   2.41  2.07

Extracting a range of rows

print("### Rows 2 to 3 ###")
print(df[1:3])

print("### Rows up to the 3rd ###")
print(df[:3])

Result

### Rows 2 to 3 ###
         google  yahoo  bing
2018-10   92.74   2.17  2.32
2018-11   92.37   2.37  2.25
### Rows up to the 3rd ###
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10   92.74   2.17  2.32
2018-11   92.37   2.37  2.25

Extracting by condition

A comparison on a column yields a bool Series indicating whether each row satisfies the condition.
Passing that Series as the subscript of df returns the matching rows. This is the same mechanism as in "Extracting specific rows".

print("### Test which rows have a yahoo share of 2.3 or more (bool Series) ###")
condition = df.yahoo >= 2.3
print(condition)
print("### Only rows with a yahoo share of 2.3 or more ###")
print(df[condition])

Result

### Test which rows have a yahoo share of 2.3 or more (bool Series) ###
2018-09    False
2018-10    False
2018-11     True
2018-12     True
Name: yahoo, dtype: bool
### Only rows with a yahoo share of 2.3 or more ###
         google  yahoo  bing
2018-11   92.37   2.37  2.25
2018-12   92.25   2.41  2.07
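
A bool Series can also be combined with a column selection through loc. The sketch below is an illustration under the same sample data, rebuilt inline so the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "google": [92.31, 92.74, 92.37, 92.25],
        "yahoo": [2.27, 2.17, 2.37, 2.41],
        "bing": [2.51, 2.32, 2.25, 2.07],
    },
    index=["2018-09", "2018-10", "2018-11", "2018-12"],
)

# Rows where yahoo >= 2.3, restricted to the google and yahoo columns.
selected = df.loc[df["yahoo"] >= 2.3, ["google", "yahoo"]]
print(selected)
```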

Extracting by multiple conditions

Use & for an "and" condition and | for an "or" condition.

print("### Only months where yahoo >= 2.3 and bing >= 2.2 ###")
print(df[(df.yahoo >= 2.3) & (df.bing >= 2.2)])

print("### Only months where yahoo >= 2.4 or bing >= 2.3 ###")
print(df[(df.yahoo >= 2.4) | (df.bing >= 2.3)])

Result

### Only months where yahoo >= 2.3 and bing >= 2.2 ###
         google  yahoo  bing
2018-11   92.37   2.37  2.25
### Only months where yahoo >= 2.4 or bing >= 2.3 ###
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10   92.74   2.17  2.32
2018-12   92.25   2.41  2.07

Extracting with query

The conditional extraction above can also be written with query.
Installing numexpr is optional: query uses it when it is available and otherwise falls back to the Python engine.

pip install numexpr

Extracting data with query
Note: the condition is passed to query as a string.

print("### Only months where yahoo >= 2.3 and bing >= 2.2 ###")
print(df.query('yahoo >= 2.3 & bing >= 2.2'))

print("### Only months where yahoo >= 2.4 or bing >= 2.3 ###")
print(df.query('yahoo >= 2.4 | bing >= 2.3'))

Result

### Only months where yahoo >= 2.3 and bing >= 2.2 ###
         google  yahoo  bing
2018-11   92.37   2.37  2.25
### Only months where yahoo >= 2.4 or bing >= 2.3 ###
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10   92.74   2.17  2.32
2018-12   92.25   2.41  2.07
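
Because the condition is a string, Python variables are referenced inside it with an @ prefix. A minimal sketch, where the variable name threshold is chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "google": [92.31, 92.74, 92.37, 92.25],
        "yahoo": [2.27, 2.17, 2.37, 2.41],
        "bing": [2.51, 2.32, 2.25, 2.07],
    },
    index=["2018-09", "2018-10", "2018-11", "2018-12"],
)

# @threshold refers to the Python variable defined in the surrounding scope.
threshold = 2.3
above = df.query("yahoo >= @threshold")
print(above)
```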

Transforming data

Rewriting elements that satisfy a condition (1)

Writing df[condition] = value rewrites the elements that match the condition.

import pandas as pd

data = { 
    "google": {"2018-09": 92.31, "2018-10": 92.74, "2018-11": 92.37, "2018-12": 92.25},
    "yahoo": {"2018-09": 2.27, "2018-10": 2.17, "2018-11": 2.37, "2018-12": 2.41},
    "bing": {"2018-09": 2.51, "2018-10": 2.32, "2018-11": 2.25, "2018-12": 2.07},
}
df = pd.DataFrame(data)

print("## Before replacement ##")
print(df)

print("## After replacement ##")
df[df <= 2.2] = False
print(df)

Result

## Before replacement ##
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10   92.74   2.17  2.32
2018-11   92.37   2.37  2.25
2018-12   92.25   2.41  2.07
## After replacement ##
         google  yahoo   bing
2018-09   92.31   2.27   2.51
2018-10   92.74  False   2.32
2018-11   92.37   2.37   2.25
2018-12   92.25   2.41  False

Rewriting elements that satisfy a condition (2)

Specific values can also be rewritten with the replace method.

import pandas as pd

data = { 
    "google": {"2018-09": 92.31, "2018-10": 92.74, "2018-11": 92.37, "2018-12": 92.25},
    "yahoo": {"2018-09": 2.27, "2018-10": 2.17, "2018-11": 2.37, "2018-12": 2.41},
    "bing": {"2018-09": 2.51, "2018-10": 2.32, "2018-11": 2.25, "2018-12": 2.07},
}
df = pd.DataFrame(data)

print("## Before replacement ##")
print(df)

print("## After replacement ##")
df[df <= 2.2] = False
# replace returns a new DataFrame; df itself still holds False
print(df.replace({False: "unknown"}))

Result

## Before replacement ##
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10   92.74   2.17  2.32
2018-11   92.37   2.37  2.25
2018-12   92.25   2.41  2.07
## After replacement ##
         google    yahoo     bing
2018-09   92.31     2.27     2.51
2018-10   92.74  unknown     2.32
2018-11   92.37     2.37     2.25
2018-12   92.25     2.41  unknown

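Setting NaN on rows that do not satisfy a condition

Use where when rows that fail the condition should become missing values while the row count stays the same as in the original data.
Every element of a row that fails the condition is set to NaN.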
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html#pandas.DataFrame.where

print("### Keep only months where yahoo >= 2.4 or bing >= 2.3 ###")
print(df.where((df.yahoo >= 2.4) | (df.bing >= 2.3)))

Result

### Keep only months where yahoo >= 2.4 or bing >= 2.3 ###
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10   92.74   2.17  2.32
2018-11     NaN    NaN   NaN
2018-12   92.25   2.41  2.07

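Setting NaN on elements that do not satisfy a condition

To turn non-matching data into missing values element by element, apply the condition to the whole DataFrame without naming a column.
Note: if the condition names a column, the non-matching rows are dropped instead.
Note: fillna can then convert the missing values to a specific value.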

import pandas as pd

data = { 
    "google": {"2018-09": 92.31, "2018-10": 92.74, "2018-11": 92.37, "2018-12": 92.25},
    "yahoo": {"2018-09": 2.27, "2018-10": 2.17, "2018-11": 2.37, "2018-12": 2.41},
    "bing": {"2018-09": 2.51, "2018-10": 2.32, "2018-11": 2.25, "2018-12": 2.07},
}
df = pd.DataFrame(data)

print("## Before replacement ##")
print(df)

print("## After replacement (values below 90 become NaN) ##")
print(df[df >= 90])

print("## After replacement (values below 90 become 'min') ##")
print(df[df >= 90].fillna('min'))

Result

## Before replacement ##
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10   92.74   2.17  2.32
2018-11   92.37   2.37  2.25
2018-12   92.25   2.41  2.07
## After replacement (values below 90 become NaN) ##
         google  yahoo  bing
2018-09   92.31    NaN   NaN
2018-10   92.74    NaN   NaN
2018-11   92.37    NaN   NaN
2018-12   92.25    NaN   NaN
## After replacement (values below 90 become 'min') ##
         google yahoo bing
2018-09   92.31   min  min
2018-10   92.74   min  min
2018-11   92.37   min  min
2018-12   92.25   min  min

Dropping missing values

Use dropna to drop rows that contain missing values.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html

print("### Months where yahoo >= 2.4 or bing >= 2.3 (missing values kept) ###")
print(df.where((df.yahoo >= 2.4) | (df.bing >= 2.3)))
print("### Months where yahoo >= 2.4 or bing >= 2.3 (missing values dropped) ###")
print(df.where((df.yahoo >= 2.4) | (df.bing >= 2.3)).dropna())
#print(df.where((df.yahoo >= 2.4) | (df.bing >= 2.3)).dropna(subset=["google"]))  # use the subset option to drop rows only when specific columns are missing

Result

### Months where yahoo >= 2.4 or bing >= 2.3 (missing values kept) ###
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10   92.74   2.17  2.32
2018-11     NaN    NaN   NaN
2018-12   92.25   2.41  2.07
### Months where yahoo >= 2.4 or bing >= 2.3 (missing values dropped) ###
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10   92.74   2.17  2.32
2018-12   92.25   2.41  2.07

Replacing missing values

fillna replaces missing values.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html

import pandas as pd

data = { 
    "google": {"2018-09": 92.31, "2018-10": None, "2018-11": 92.37, "2018-12": 92.25},
    "yahoo": {"2018-09": 2.27, "2018-10": None, "2018-11": None, "2018-12": 2.41},
    "bing": {"2018-09": 2.51, "2018-10": 2.32, "2018-11": 2.25, "2018-12": None},
}
df = pd.DataFrame(data)

print("## Before replacement ##")
print(df)

print("## After replacement ##")
print(df.fillna(0))

Note: passing a dict as the value option sets a different replacement value per column.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html

Result

## Before replacement ##
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10     NaN    NaN  2.32
2018-11   92.37    NaN  2.25
2018-12   92.25   2.41   NaN
## After replacement ##
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10    0.00   0.00  2.32
2018-11   92.37   0.00  2.25
2018-12   92.25   2.41  0.00
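
Per-column replacement with the value option might look like the sketch below. The replacement values (0 and -1) are arbitrary choices for illustration. Columns not listed in the dict keep their NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "google": [92.31, np.nan, 92.37, 92.25],
        "yahoo": [2.27, np.nan, np.nan, 2.41],
        "bing": [2.51, 2.32, 2.25, np.nan],
    },
    index=["2018-09", "2018-10", "2018-11", "2018-12"],
)

# google gets 0, yahoo gets -1; bing is not listed, so its NaN is left as-is.
filled = df.fillna(value={"google": 0, "yahoo": -1})
print(filled)
```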

Carrying the previous value forward into missing values

import pandas as pd

df = pd.DataFrame(
    { 
        "google": {"2018-09": 92.31, "2018-10": None, "2018-11": 92.37, "2018-12": 92.25},
        "yahoo": {"2018-09": 2.27, "2018-10": None, "2018-11": None, "2018-12": 2.41},
        "bing": {"2018-09": 2.51, "2018-10": 2.32, "2018-11": 2.25, "2018-12": None},
    }
)
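print("## Before replacement ##")
print(df)
print("## After replacement ##")
# ffill() is equivalent to fillna(method='ffill'), which recent pandas versions deprecate
print(df.ffill())

Result

## Before replacement ##
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10     NaN    NaN  2.32
2018-11   92.37    NaN  2.25
2018-12   92.25   2.41   NaN
## After replacement ##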
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10   92.31   2.27  2.32
2018-11   92.37   2.27  2.25
2018-12   92.25   2.41  2.25
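
Forward filling can be capped with the limit option, so that a long gap is only partially filled. A minimal sketch on a single column, with limit=1 chosen for illustration:

```python
import numpy as np
import pandas as pd

yahoo = pd.Series(
    [2.27, np.nan, np.nan, 2.41],
    index=["2018-09", "2018-10", "2018-11", "2018-12"],
)

# limit=1 fills at most one consecutive NaN after each valid value.
capped = yahoo.ffill(limit=1)
print(capped)
```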

Replacing specific values

replace substitutes specific values.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html

import pandas as pd

data = { 
    "google": {"2018-09": 92.31, "2018-10": 92.74, "2018-11": 92.37, "2018-12": 92.25},
    "yahoo": {"2018-09": 2.27, "2018-10": 2.17, "2018-11": 2.37, "2018-12": 2.41},
    "bing": {"2018-09": 2.51, "2018-10": 2.32, "2018-11": 2.25, "2018-12": 2.07},
}
df = pd.DataFrame(data)

google_list = list(df["google"])
yahoo_list = list(df["yahoo"])
bing_list = list(df["bing"])
df["google(up)"] = [google_list[i-1] < google_list[i] if i > 0 else '-' for i, val in enumerate(google_list)]
df["yahoo(up)"] = [yahoo_list[i-1] < yahoo_list[i] if i > 0 else '-' for i, val in enumerate(yahoo_list)]
df["bing(up)"] = [bing_list[i-1] < bing_list[i] if i > 0 else '-' for i, val in enumerate(bing_list)]

print("## Before replacement ##")
print(df)

print("## After replacement ##")
print(df.replace({True: "up", False: 'down'}))

Result

## Before replacement ##
         google  yahoo  bing google(up) yahoo(up) bing(up)
2018-09   92.31   2.27  2.51          -         -        -
2018-10   92.74   2.17  2.32       True     False    False
2018-11   92.37   2.37  2.25      False      True    False
2018-12   92.25   2.41  2.07      False      True    False
## After replacement ##
         google  yahoo  bing google(up) yahoo(up) bing(up)
2018-09   92.31   2.27  2.51          -         -        -
2018-10   92.74   2.17  2.32         up      down     down
2018-11   92.37   2.37  2.25       down        up     down
2018-12   92.25   2.41  2.07       down        up     down

Removing duplicate data

drop_duplicates removes duplicate rows.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html

import pandas as pd

data = { 
    "google": {"2018-09": 92.31, "2018-10": 92.31, "2018-11": 92.37, "2018-12": 92.25},
    "yahoo": {"2018-09": 2.27, "2018-10": 2.27, "2018-11": 2.37, "2018-12": 2.41},
    "bing": {"2018-09": 2.51, "2018-10": 2.51, "2018-11": 2.25, "2018-12": 2.25},
}
df = pd.DataFrame(data)

print("## Before removing duplicates ##")
print(df)

print("## After removing duplicates ##")
print(df.drop_duplicates())

print("## After removing duplicates (keep the last occurrence) ##")
print(df.drop_duplicates(keep='last'))

print("## After removing duplicates (duplicates in bing only) ##")
print(df.drop_duplicates(subset=["bing"], keep='first'))

Result

## Before removing duplicates ##
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10   92.31   2.27  2.51
2018-11   92.37   2.37  2.25
2018-12   92.25   2.41  2.25
## After removing duplicates ##
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-11   92.37   2.37  2.25
2018-12   92.25   2.41  2.25
## After removing duplicates (keep the last occurrence) ##
         google  yahoo  bing
2018-10   92.31   2.27  2.51
2018-11   92.37   2.37  2.25
2018-12   92.25   2.41  2.25
## After removing duplicates (duplicates in bing only) ##
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-11   92.37   2.37  2.25

Grouping

The simplest grouping

import pandas as pd
import numpy as np

cities = ["Osaka", "Tokyo", "Osaka", "Osaka", "Nagoya", "Osaka", "Tokyo", "Tokyo", "Nagoya", "Tokyo"]
ages = [20, 37, 41, 25, 31, 28, 32, 44, 29, 23]
heights = [170.5, 175.4, 171.2, 173.6, 169.8, 165.2, 178.1, 165.5, 173.2, 172.9]
df = pd.DataFrame({"city": cities, "age": ages, "height": heights})

df.groupby(["city"]).mean()

Result

         age   height
city                 
Nagoya  30.0  171.500
Osaka   28.5  170.125
Tokyo   34.0  172.975

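Applying a different aggregation to each column

# Per city: the age of the oldest person and the mean height
# (string names are used because recent pandas versions deprecate passing np.max / np.mean to agg)
df.groupby(["city"]).agg({"age": "max", "height": "mean"})

Result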

        age   height
city                
Nagoya   31  171.500
Osaka    41  170.125
Tokyo    44  172.975

Getting the grouping key as a column instead of a label

df.groupby(["city"], as_index=False).agg({"age": "max", "height": "mean"})

Result

     city  age   height
0  Nagoya   31  171.500
1   Osaka   41  170.125
2   Tokyo   44  172.975

Getting several aggregates of a single column

tmp_df = df.copy()  # copy so that the helper columns are not added to df itself
tmp_df["age(max)"] = tmp_df["age"]
tmp_df["age(min)"] = tmp_df["age"]
tmp_df["age(avg)"] = tmp_df["age"]

tmp_df.groupby(["city"], as_index=False).agg({"age(max)": "max", "age(min)": "min", "age(avg)": "mean"})

Result

     city  age(max)  age(min)  age(avg)
0  Nagoya        31        29      30.0
1   Osaka        41        20      28.5
2   Tokyo        44        23      34.0
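
The same table can be produced without duplicating the column by using named aggregation, where each keyword argument is an output column and its value is a (source column, function) pair. This is an alternative sketch; the output column names (age_max and so on) are chosen for illustration.

```python
import pandas as pd

cities = ["Osaka", "Tokyo", "Osaka", "Osaka", "Nagoya", "Osaka", "Tokyo", "Tokyo", "Nagoya", "Tokyo"]
ages = [20, 37, 41, 25, 31, 28, 32, 44, 29, 23]
df = pd.DataFrame({"city": cities, "age": ages})

# Each keyword names an output column: (source column, aggregation function).
summary = df.groupby("city", as_index=False).agg(
    age_max=("age", "max"),
    age_min=("age", "min"),
    age_avg=("age", "mean"),
)
print(summary)
```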

Sorting data

Sorting by value

sort_values sorts the data by value.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html

import pandas as pd

data = { 
    "google": {"2018-09": 92.31, "2018-10": 92.74, "2018-11": 92.37, "2018-12": 92.25},
    "yahoo": {"2018-09": 2.27, "2018-10": 2.17, "2018-11": 2.37, "2018-12": 2.41},
    "bing": {"2018-09": 2.51, "2018-10": 2.32, "2018-11": 2.25, "2018-12": 2.07},
}
df = pd.DataFrame(data)

print("## Original data ##")
print(df)

print("## Sorted by google in descending order ##")
print(df.sort_values('google', ascending=False))
#print(df.sort_values(by=['google', 'yahoo'], ascending=False))  # several columns can be passed to the by option

Result

## Original data ##
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10   92.74   2.17  2.32
2018-11   92.37   2.37  2.25
2018-12   92.25   2.41  2.07
## Sorted by google in descending order ##
         google  yahoo  bing
2018-10   92.74   2.17  2.32
2018-11   92.37   2.37  2.25
2018-09   92.31   2.27  2.51
2018-12   92.25   2.41  2.07
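
When sorting by several columns, ascending also accepts a list so that each column gets its own direction. A minimal sketch with made-up data, where ties in the first column make the second sort key visible:

```python
import pandas as pd

# Made-up data for illustration only.
people = pd.DataFrame({
    "city": ["Osaka", "Tokyo", "Osaka", "Tokyo"],
    "age": [41, 23, 20, 44],
})

# city ascending, then age descending within each city.
sorted_people = people.sort_values(by=["city", "age"], ascending=[True, False])
print(sorted_people)
```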

Sorting by data name (label/index)

sort_index sorts the data by label.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_index.html

import pandas as pd

data = { 
    "google": {"2018-09": 92.31, "2018-10": 92.74, "2018-11": 92.37, "2018-12": 92.25},
    "yahoo": {"2018-09": 2.27, "2018-10": 2.17, "2018-11": 2.37, "2018-12": 2.41},
    "bing": {"2018-09": 2.51, "2018-10": 2.32, "2018-11": 2.25, "2018-12": 2.07},
}

df = pd.DataFrame(data)

print("## Original data ##")
print(df)

print("## Sorted by data name (year-month) in descending order ##")
print(df.sort_index(ascending=False))

Result

## Original data ##
         google  yahoo  bing
2018-09   92.31   2.27  2.51
2018-10   92.74   2.17  2.32
2018-11   92.37   2.37  2.25
2018-12   92.25   2.41  2.07
## Sorted by data name (year-month) in descending order ##
         google  yahoo  bing
2018-12   92.25   2.41  2.07
2018-11   92.37   2.37  2.25
2018-10   92.74   2.17  2.32
2018-09   92.31   2.27  2.51

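Implementation notes

Extracting consecutive data

A pattern that comes up often when working with time-series data.

This sample extracts only runs of N or more consecutive records.
Here a record counts as consecutive only if it falls within 0.2 seconds of the previous record.

Creating the test data

#
# Create the test data
#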
import datetime
import numpy as np
import pandas as pd

DATA_SIZE = 100

times = []
current_time = datetime.datetime.strptime('2020-01-01', '%Y-%m-%d')
for i in range(DATA_SIZE):
    diff_sec = np.round(np.random.uniform(0.1, 0.5), 1)
    delta = datetime.timedelta(seconds=diff_sec)
    current_time = current_time + delta
    times.append(current_time)

keys    = range(DATA_SIZE)
values = np.round(np.random.uniform(1, 10, DATA_SIZE), 1)
df = pd.DataFrame({'time': times, 'key': keys, 'value': values})
df.head(10)

Extract only runs of 5 or more consecutive records.

#
# Extract consecutive records
# (a record is consecutive only if it is within 0.2 seconds of the previous one)
#
df2 = df.copy()

# total_seconds() gives the whole gap; .microseconds alone would drop any full-second part
df2["time_diff"] = df2["time"].diff(1).dt.total_seconds().fillna(0)

# Mark each record that starts a new run with its position, then forward-fill to label the run
df2["tmp_index"] = np.arange(df.shape[0])
df2["tmp_index"] = df2["tmp_index"][df2["time_diff"] > 0.2]
df2.loc[df2.index[0],"tmp_index"] = 0
df2["group"] = df2["tmp_index"].ffill().astype(int)
del df2["tmp_index"]

# Keep only runs of 5 or more consecutive records
group_sizes = df2.groupby("group").size()
target_groups = group_sizes[group_sizes >= 5].index
result_df = df2[df2["group"].isin(target_groups)]

result_df
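
Since the test data above is random, the result changes on every run. The sketch below applies the same idea to a small fixed data set so the runs can be checked by eye. It is a compact variant, not the code above: run ids come from cumsum over the "gap too large" flags, and the minimum run length of 3 and the timestamps are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical offsets in seconds from the first record, chosen so the runs are easy to verify.
offsets = [0.0, 0.1, 0.2, 0.7, 0.8, 1.5, 1.6, 1.7, 1.8]
base = pd.Timestamp("2020-01-01")
df = pd.DataFrame({"time": [base + pd.Timedelta(seconds=s) for s in offsets]})

# A gap larger than 0.2 s starts a new run; cumsum() turns the run starts into run ids.
gap = df["time"].diff().dt.total_seconds().fillna(0)
df["group"] = (gap > 0.2).cumsum()

# transform("size") broadcasts each run's length back onto its rows.
run_length = df.groupby("group")["group"].transform("size")

# Keep only runs of 3 or more consecutive records.
result = df[run_length >= 3]
print(result)
```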