Tabular Data

xskillscore can be used on tabular data such as that stored in a pandas.DataFrame.

It is most effective when evaluating predictions grouped by one or more fields (columns).

[1]:
import numpy as np
import pandas as pd
import xskillscore as xs
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error
np.random.seed(seed=42)

Boston house prices dataset

As a small example, take a dataset and evaluate a model over a single field (column).

Load the Boston house prices dataset:

[2]:
data = load_boston()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['y'] = pd.Series(data.target)
df
[2]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT y
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 0.06263 0.0 11.93 0.0 0.573 6.593 69.1 2.4786 1.0 273.0 21.0 391.99 9.67 22.4
502 0.04527 0.0 11.93 0.0 0.573 6.120 76.7 2.2875 1.0 273.0 21.0 396.90 9.08 20.6
503 0.06076 0.0 11.93 0.0 0.573 6.976 91.0 2.1675 1.0 273.0 21.0 396.90 5.64 23.9
504 0.10959 0.0 11.93 0.0 0.573 6.794 89.3 2.3889 1.0 273.0 21.0 393.45 6.48 22.0
505 0.04741 0.0 11.93 0.0 0.573 6.030 80.8 2.5050 1.0 273.0 21.0 396.90 7.88 11.9

506 rows × 14 columns
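
Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2. On newer versions you can fetch the same data from its original source; the sketch below is adapted from the replacement snippet in the scikit-learn deprecation notice, with the column names taken from the table above:

[ ]:
# load_boston was removed in scikit-learn 1.2; fetch the raw data instead
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# each record spans two physical rows: 11 features, then 2 features + target
features = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
df = pd.DataFrame(features, columns=columns)
df["y"] = raw_df.values[1::2, 2]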

Create a dummy prediction column by adding noise to y:

[3]:
noise = np.random.uniform(-1, 1, size=len(df['y']))
df['yhat'] = (df['y'] + (df['y'] * noise)).clip(lower=df["y"].min())

Evaluate the model over the field RAD using pandas.groupby.apply with mean_squared_error from scikit-learn:

[4]:
df.groupby('RAD').apply(lambda x: mean_squared_error(x["y"], x["yhat"]))
[4]:
RAD
1.0     161.237554
2.0     313.855750
3.0     307.220760
4.0     162.634430
5.0     221.852969
6.0     155.612978
7.0     214.375240
8.0     278.092560
24.0    148.840507
dtype: float64

You can do the same evaluation using xskillscore.

First, structure the pandas.DataFrame to keep the core fields when converting to an xarray object:

[5]:
min_df = df.reset_index().set_index(["index", "RAD"])[["y", "yhat"]]
min_df
[5]:
y yhat
index RAD
0 1.0 24.0 17.977926
1 2.0 21.6 41.070858
2 2.0 34.7 50.800380
3 3.0 33.4 39.990387
4 3.0 36.2 11.295750
... ... ... ...
501 1.0 22.4 24.017117
502 1.0 20.6 12.752538
503 1.0 23.9 38.899402
504 1.0 22.0 30.128172
505 1.0 11.9 5.000000

506 rows × 2 columns

Convert it to an xarray.Dataset using pandas.DataFrame.to_xarray. Note: this creates a dense array of index by RAD and fills the combinations that are not present in the data with NaN.

[6]:
ds = min_df.to_xarray()
ds
[6]:
<xarray.Dataset>
Dimensions:  (index: 506, RAD: 9)
Coordinates:
  * index    (index) int64 0 1 2 3 4 5 6 7 8 ... 498 499 500 501 502 503 504 505
  * RAD      (RAD) float64 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 24.0
Data variables:
    y        (index, RAD) float64 24.0 nan nan nan nan ... nan nan nan nan nan
    yhat     (index, RAD) float64 17.98 nan nan nan nan ... nan nan nan nan nan
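
Each house belongs to exactly one RAD value, so each index row of the padded array contains exactly one observed entry. A quick sanity check (a sketch, assuming the ds created above):

[ ]:
# every index row should contain exactly one non-NaN value across RAD
assert bool((ds["y"].notnull().sum("RAD") == 1).all())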

You can now apply any metric from xskillscore using the accessor method. Pass index to the dim argument, as we want to reduce this dimension and apply the metric over RAD. In addition, the xarray.Dataset contains NaNs, so you should set skipna=True:

[7]:
out = ds.xs.mse('y', 'yhat', dim="index", skipna=True)
out
[7]:
<xarray.DataArray (RAD: 9)>
array([161.23755363, 313.85575025, 307.22076012, 162.63442999,
       221.85296903, 155.6129776 , 214.37524005, 278.09256049,
       148.84050691])
Coordinates:
  * RAD      (RAD) float64 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 24.0
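
The accessor methods wrap xskillscore's functional API, so the same result can be obtained by calling xs.mse directly on the DataArrays (a minimal equivalent sketch):

[ ]:
# equivalent call through the functional API
xs.mse(ds["y"], ds["yhat"], dim="index", skipna=True)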

It makes sense to return the data in tabular form, so you can call xarray.DataArray.to_series to convert the result to a pandas.Series:

[8]:
out.to_series()
[8]:
RAD
1.0     161.237554
2.0     313.855750
3.0     307.220760
4.0     162.634430
5.0     221.852969
6.0     155.612978
7.0     214.375240
8.0     278.092560
24.0    148.840507
dtype: float64

Evaluating predictions over many columns

xskillscore is built upon xarray.apply_ufunc, which offers a speed-up by vectorizing operations. As a result, xskillscore can be faster than pandas.groupby.apply. This is especially true if there are many samples in the dataset and if the predictions have to be evaluated over many fields.

For this exercise we will create fake data for which the predictions have to be evaluated over three fields:

[9]:
stores = np.arange(100)
skus = np.arange(100)
dates = pd.date_range("1/1/2020", "1/10/2020", freq="D")

rows = []
for date in dates:
    for store in stores:
        for sku in skus:
            rows.append(
                {
                    "DATE": date,
                    "STORE": store,
                    "SKU": sku,
                    "y": np.random.randint(9) + 1,
                }
            )
df = pd.DataFrame(rows)

noise = np.random.uniform(-1, 1, size=len(df['y']))
df['yhat'] = (df['y'] + (df['y'] * noise)).clip(lower=df["y"].min())
df
[9]:
DATE STORE SKU y yhat
0 2020-01-01 0 0 3 4.617306
1 2020-01-01 0 1 6 1.000000
2 2020-01-01 0 2 2 3.039347
3 2020-01-01 0 3 3 5.102145
4 2020-01-01 0 4 5 3.563087
... ... ... ... ... ...
99995 2020-01-10 99 95 9 15.836256
99996 2020-01-10 99 96 5 7.515791
99997 2020-01-10 99 97 1 1.000000
99998 2020-01-10 99 98 6 6.676512
99999 2020-01-10 99 99 5 4.600985

100000 rows × 5 columns

Time the pandas.groupby.apply method:

[10]:
%%time
df.groupby(['STORE', 'SKU']).apply(lambda x: mean_squared_error(x["y"], x["yhat"]))
CPU times: user 3.52 s, sys: 4.37 ms, total: 3.52 s
Wall time: 3.52 s
[10]:
STORE  SKU
0      0       8.384360
       1       7.071648
       2      14.677462
       3      13.391239
       4      12.131033
                ...
99     95     18.473114
       96     10.154608
       97     11.743513
       98      8.406069
       99      7.098808
Length: 10000, dtype: float64

Time it using xskillscore:

[11]:
%%time
df.set_index(["DATE", "STORE", "SKU"]).to_xarray().xs.mse(
    "y", "yhat", dim="DATE"
).to_series()
CPU times: user 23.3 ms, sys: 0 ns, total: 23.3 ms
Wall time: 22.7 ms
[11]:
STORE  SKU
0      0       8.384360
       1       7.071648
       2      14.677462
       3      13.391239
       4      12.131033
                ...
99     95     18.473114
       96     10.154608
       97     11.743513
       98      8.406069
       99      7.098808
Length: 10000, dtype: float64
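
Once the data is an xarray.Dataset, other xskillscore metrics follow the same pattern. For example, a sketch reusing the converted Dataset to compute RMSE and MAE without re-grouping:

[ ]:
# convert once, then apply several metrics over the DATE dimension
ds = df.set_index(["DATE", "STORE", "SKU"]).to_xarray()
rmse = ds.xs.rmse("y", "yhat", dim="DATE").to_series()
mae = ds.xs.mae("y", "yhat", dim="DATE").to_series()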

See xskillscore-tutorial for further reading.
