Exercise: Generating graph images with Python from data read out of Excel

It has been two months*1 since I started using Python on a daily basis.

I've gotten fairly comfortable with it, so I'm going to try a practice task.

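Introduction

For me, Python has more or less settled in as an everyday tool. (Application development is still out of reach without more study.)

That said, I haven't yet had it do any practical work, so as an exercise I'll try this:
"Read data from multiple Excel files, plot graphs, and save them as images."

How the graphs are used here is fairly arbitrary.

Part of the reason for using Excel is a half-serious hope that Excel users will also come to think Python is handy.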

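Environment

Windows 7 64-bit
Python 3.5.1
Anaconda3 v4.0.0

Modules used

Version numbers are omitted.
The nice thing is that all of these are available once Anaconda is installed.

os, glob ⇒ mainly file path handling
collections ⇒ grouping (Counter)
xlrd (Python-Excel) ⇒ reading Excel files
matplotlib ⇒ plotting graphs (through writing them out to files)

There are places where NumPy or pandas would probably be a better fit, but I'm leaving them out this time.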

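Rough specification

For every *.xls and *.xlsx file in a given folder, create one graph image (PNG) per file from that file's data
The image file name is derived from the Excel file name
The graph is drawn as the type written in the data (bar / pie / scatter)

Details are in the implementation section.

Implementation

Imports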

import os
from glob import glob
from collections import Counter
import xlrd
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties

Entry point

Define a function that takes an input directory and an output directory as arguments.

def excel_to_plot(srcdir, dstdir):
    for file in excel_files(srcdir):
        create_plot_image_file(file, dstdir)

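The list returned by the "get the list of Excel files" function is iterated, and each entry is passed to the "create and save a graph" function.

Getting all Excel files from a given folder

Get the Excel files in srcdir as a list of path strings.
This uses the glob module.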

def excel_files(srcdir):
    return glob(srcdir + '/*.xls') + glob(srcdir + '/*.xlsx')
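
As a variation, the same lookup can be written with os.path.join and a sorted result, so that files are processed in a stable order. This is only a sketch: the name excel_files_sorted is hypothetical and not part of the original script, and the sorting is an assumption about what is convenient, not something the original does.

```python
import os
from glob import glob


def excel_files_sorted(srcdir):
    """Return the *.xls and *.xlsx paths directly under srcdir, sorted by path."""
    patterns = ('*.xls', '*.xlsx')
    found = []
    for pattern in patterns:
        # A glob pattern has to match the whole file name,
        # so '*.xls' does not also pick up '.xlsx' files.
        found.extend(glob(os.path.join(srcdir, pattern)))
    return sorted(found)
```

Because each pattern matches the full file name, the two glob calls never return the same file twice, which is why the results can simply be concatenated.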

Reading data from Excel

Use the xlrd module to read data from Excel.
The target is always the first sheet.
The graph type is written in cell A1, so read that first.

book = xlrd.open_workbook(excel_file_path)
sheet = book.sheet_by_index(0)
plot_type = sheet.cell(0, 0).value  # cell A1

The sheet name is used as the graph title.
For Japanese text, FontProperties has to be set.

matplotlibで日本語 - Qiita (Japanese text in matplotlib)
- http://qiita.com/canard0328/items/a859bffc9c9e11368f37

fp = FontProperties(fname=r'C:\Windows\Fonts\msgothic.ttc', size=24)
plt.title(sheet.name, fontproperties=fp)

xlrd returns lists of cell objects, so define an inner function that converts them to lists of values.

def to_values(cells):
    return [x.value for x in cells]

If the graph type in cell A1 is one of bar, pie, or scatter, the corresponding graph is drawn.
If it matches none of them, a warning message is printed and that file is skipped.

    if plot_type == 'bar':
        pass  # filled in later

    elif plot_type == 'pie':
        pass  # filled in later

    elif plot_type == 'scatter':
        pass  # filled in later

    else:
        print('warning: unknown plot type detected, type =', plot_type, 'in', excel_file_path)
        return

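For bar, columns A and B (from row 2 down) are read as kind and amount, and the amounts are summed per kind.
For pie, column A (from row 2 down) is read and the share of each value is computed.
For scatter, columns A, B, and C (from row 2 down) are read as the x axis, the y axis, and the point size; column D gives the point color.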

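Labels on each graph are omitted or kept to a minimum.

Incidentally, matplotlib.pyplot accepts most iterables as data.

Bar chart

The totals are accumulated by building a dictionary and adding to it in a loop.
The bars are sorted by kind name.

Reference page for bar charts: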

python - how to put gap between y axis and first bar in vertical barchart matplotlib - Stack Overflow
- http://stackoverflow.com/questions/6642482/how-to-put-gap-between-y-axis-and-first-bar-in-vertical-barchart-matplotlib

kinds = to_values(sheet.col_slice(0, 1))
amounts = to_values(sheet.col_slice(1, 1))
summary_dic = {x: 0 for x in kinds}
for kind, amount in zip(kinds, amounts):
    summary_dic[kind] += amount
summary = sorted(summary_dic.items(), key=lambda x: x[0], reverse=False)
labels, values = list(zip(*summary))  # unzip
plt.bar(range(len(values)), values, width=0.5, align='center')
plt.xticks(range(len(values)), labels)
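
The same per-kind totals can also be built with collections.defaultdict, which removes the need to pre-fill the dictionary with zeros. The following is a minimal sketch that uses the bar chart sample rows from the example run as plain Python lists instead of reading them through xlrd; the function name sum_by_kind is made up for illustration.

```python
from collections import defaultdict


def sum_by_kind(kinds, amounts):
    """Sum amounts per kind and return (labels, values) sorted by kind name."""
    totals = defaultdict(float)  # a kind seen for the first time starts at 0.0
    for kind, amount in zip(kinds, amounts):
        totals[kind] += amount
    summary = sorted(totals.items())  # keys are unique, so this sorts by kind name
    labels, values = zip(*summary)
    return list(labels), list(values)


labels, values = sum_by_kind(['A', 'B', 'C', 'A', 'B'], [3, 2, 5, 13, 34])
print(labels, values)
```

The resulting labels and values can be passed to plt.bar and plt.xticks in the same way as above.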

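Pie chart

This time the shares are computed with Counter from the collections module.

The data is ordered from the smallest share to the largest.
By default the first wedge starts at the 3 o'clock position, so specifying startangle=90 makes it start at 12 o'clock, like the pie charts we usually see.

Reference page for pie charts: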

pie_and_polar_charts example code: pie_demo_features.py — Matplotlib 1.5.1 documentation
- http://matplotlib.org/examples/pie_and_polar_charts/pie_demo_features.html

samples = to_values(sheet.col_slice(0, 1))
grouped = sorted(Counter(samples).items(), key=lambda x: x[1], reverse=False)
labels, values = list(zip(*grouped))  # unzip
plt.pie(values, labels=labels, startangle=90, shadow=True, colors=['red', 'green', 'blue', 'yellow'])
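
To see what the Counter step produces, here is a small sketch run on the pie chart sample values from the example run, given as a plain list. The tie-breaking by label is an assumption added here so that equal counts come out in a fixed order; the original sorts by count only. count_shares is a hypothetical name.

```python
from collections import Counter


def count_shares(samples):
    """Count occurrences and return (labels, counts), smallest count first."""
    counted = Counter(samples).items()
    # Sort by count, then by label so that ties have a fixed order.
    grouped = sorted(counted, key=lambda item: (item[1], item[0]))
    labels, counts = zip(*grouped)
    return list(labels), list(counts)


samples = ['AAA', 'BBB', 'AAA', 'CCC', 'BBB', 'AAA', 'AAA', 'AAA', 'BBB', 'X']
labels, counts = count_shares(samples)
print(labels, counts)
```

plt.pie scales the values it is given to fractions of their sum, so raw counts like these can be passed directly without converting them to percentages first.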

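Scatter plot

Just pass the arrays for the X axis and the Y axis; column C is passed as the point size.

I also added a color specification afterwards (column D).

Reference page for scatter plots: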

shapes_and_collections example code: scatter_demo.py — Matplotlib 1.5.1 documentation
- http://matplotlib.org/examples/shapes_and_collections/scatter_demo.html

xs = to_values(sheet.col_slice(0, 1))
ys = to_values(sheet.col_slice(1, 1))
zs = to_values(sheet.col_slice(2, 1))  # point size
colors = to_values(sheet.col_slice(3, 1))
plt.scatter(xs, ys, s=zs, c=colors)

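Generating the image file name from the Excel file name

The image format is determined automatically from the extension, so deciding the file name is all that's needed.

Replace the periods in the Excel file name with underscores (_) (this is optional) and append .png to the end.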

def to_plot_image_file_name(excel_file_path):
    return os.path.basename(excel_file_path).replace('.', '_') + '.png'
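
For comparison, here is an alternative sketch that drops the Excel extension instead of folding it into the name, using os.path.splitext. This is a different naming rule from the one above, not a replacement for it: data.xls and data.xlsx in the same folder would map to the same image name, which the underscore version avoids. to_png_name is a hypothetical name.

```python
import os


def to_png_name(excel_file_path):
    """Map 'dir/report.xlsx' to 'report.png' (directory dropped, extension replaced)."""
    stem, _ext = os.path.splitext(os.path.basename(excel_file_path))
    return stem + '.png'


print(to_png_name('src/report.xlsx'))
print(to_png_name('src/report.v2.xls'))
```

os.path.splitext only splits off the last extension, so a name such as report.v2.xls keeps the .v2 part.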

Saving the plotted graph as an image

Just call plt.savefig.

Finished version

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
from glob import glob
from collections import Counter
import xlrd
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties

fp = FontProperties(fname=r'C:\Windows\Fonts\msgothic.ttc', size=24)


def excel_to_plot(srcdir, dstdir):
    for file in excel_files(srcdir):
        create_plot_image_file(file, dstdir)


def excel_files(srcdir):
    return glob(srcdir + '/*.xls') + glob(srcdir + '/*.xlsx')


def to_plot_image_file_name(excel_file_path):
    return os.path.basename(excel_file_path).replace('.', '_') + '.png'


def create_plot_image_file(excel_file_path, dstdir):
    def to_values(cells):
        return [x.value for x in cells]

    book = xlrd.open_workbook(excel_file_path)
    sheet = book.sheet_by_index(0)
    plot_type = sheet.cell(0, 0).value  # cell A1

    if plot_type == 'bar':
        kinds = to_values(sheet.col_slice(0, 1))
        amounts = to_values(sheet.col_slice(1, 1))
        summary_dic = {x: 0 for x in kinds}
        for kind, amount in zip(kinds, amounts):
            summary_dic[kind] += amount
        summary = sorted(summary_dic.items(), key=lambda x: x[0], reverse=False)
        labels, values = list(zip(*summary))  # unzip
        plt.bar(range(len(values)), values, width=0.5, align='center')
        plt.xticks(range(len(values)), labels)

    elif plot_type == 'pie':
        samples = to_values(sheet.col_slice(0, 1))
        grouped = sorted(Counter(samples).items(), key=lambda x: x[1], reverse=False)
        labels, values = list(zip(*grouped))  # unzip
        plt.pie(values, labels=labels, startangle=90, shadow=True, colors=['red', 'green', 'blue', 'yellow'])

    elif plot_type == 'scatter':
        xs = to_values(sheet.col_slice(0, 1))
        ys = to_values(sheet.col_slice(1, 1))
        zs = to_values(sheet.col_slice(2, 1))
        colors = to_values(sheet.col_slice(3, 1))
        plt.scatter(xs, ys, s=zs, c=colors)

    else:
        print('warning: unknown plot type detected, type =', plot_type, 'in', excel_file_path)
        return

    plt.title(sheet.name, fontproperties=fp)

    dst_file_path = os.path.join(dstdir, to_plot_image_file_name(excel_file_path))
    plt.savefig(dst_file_path)
    plt.clf()  # clear the figure, just in case


excel_to_plot(r'C:\users\argius\src', r'C:\users\argius\dst')

Example run

Bar chart data

bar	
A	3
B	2
C	5
A	13
B	34

Pie chart data

pie
AAA
BBB
AAA
CCC
BBB
AAA
AAA
AAA
BBB
X

Scatter plot data

scatter			
3	2	500	#ff6600
4	2	450	green
5	1	100	red
6	3	350	blue

Output

[Image: the generated graph images]

* Shown at 50% of the original size.

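Closing

There was still a lot I didn't know, and looking things up as I wrote meant it took a whole day, but the tooling is so complete that it wasn't much of a chore.

I wrote this code in PyCharm, which is comfortable because it feels the same as working with Java plus an IDE.
For trying out code fragments I used Jupyter notebook, which is also handy.

All in all, this reminded me that Python is very easy to use as a tool.

Thank you for reading to the end.

(End)

*1: Counting from the article before last as the starting point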

argius note