Tabular Data

xskillscore can be used on tabular data such as that stored in a pandas.DataFrame.

It is most effective when evaluating predictions grouped by one or more fields (columns).

[1]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error

import xskillscore as xs

np.random.seed(seed=42)

California house prices dataset

A small example is to take a dataset and evaluate a model's predictions grouped by a single field (column).

Load the California house prices dataset:

[2]:
housing = fetch_california_housing(as_frame=True)
df = housing.frame
df["AveRooms"] = df["AveRooms"].round()
df = df.rename(columns={"MedHouseVal": "y"})
df
[2]:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude y
0 8.3252 41.0 7.0 1.023810 322.0 2.555556 37.88 -122.23 4.526
1 8.3014 21.0 6.0 0.971880 2401.0 2.109842 37.86 -122.22 3.585
2 7.2574 52.0 8.0 1.073446 496.0 2.802260 37.85 -122.24 3.521
3 5.6431 52.0 6.0 1.073059 558.0 2.547945 37.85 -122.25 3.413
4 3.8462 52.0 6.0 1.081081 565.0 2.181467 37.85 -122.25 3.422
... ... ... ... ... ... ... ... ... ...
20635 1.5603 25.0 5.0 1.133333 845.0 2.560606 39.48 -121.09 0.781
20636 2.5568 18.0 6.0 1.315789 356.0 3.122807 39.49 -121.21 0.771
20637 1.7000 17.0 5.0 1.120092 1007.0 2.325635 39.43 -121.22 0.923
20638 1.8672 18.0 5.0 1.171920 741.0 2.123209 39.43 -121.32 0.847
20639 2.3886 16.0 5.0 1.162264 1387.0 2.616981 39.37 -121.24 0.894

20640 rows × 9 columns

Create a dummy prediction column by adding noise to y:

[3]:
noise = np.random.uniform(-1, 1, size=len(df["y"]))
df["yhat"] = (df["y"] + (df["y"] * noise)).clip(lower=df["y"].min())

Evaluate the model over the field AveRooms using pandas.groupby.apply with mean_squared_error from scikit-learn:

[4]:
df.groupby("AveRooms").apply(lambda x: mean_squared_error(x["y"], x["yhat"])).head()
[4]:
AveRooms
1.0    1.789466
2.0    1.827004
3.0    1.492455
4.0    1.352848
5.0    1.384756
dtype: float64

You can achieve the same result using xskillscore.

First, structure the pandas.DataFrame to keep the core fields when converting to an xarray object:

[5]:
min_df = df.reset_index().set_index(["index", "AveRooms"])[["y", "yhat"]]
min_df
[5]:
y yhat
index AveRooms
0 7.0 4.526 3.390337
1 6.0 3.585 6.816622
2 8.0 3.521 5.154701
3 6.0 3.413 4.086443
4 6.0 3.422 1.067792
... ... ... ...
20635 5.0 0.781 0.611083
20636 6.0 0.771 1.497737
20637 5.0 0.923 0.648200
20638 5.0 0.847 1.470100
20639 5.0 0.894 0.166662

20640 rows × 2 columns

Convert it to an xarray.Dataset using pandas.DataFrame.to_xarray. Note: this creates an array of index by AveRooms and pads the combinations that do not exist with NaN.

[6]:
ds = min_df.to_xarray()
ds
[6]:
<xarray.Dataset>
Dimensions:   (index: 20640, AveRooms: 47)
Coordinates:
  * index     (index) int64 0 1 2 3 4 5 ... 20634 20635 20636 20637 20638 20639
  * AveRooms  (AveRooms) float64 1.0 2.0 3.0 4.0 5.0 ... 60.0 62.0 133.0 142.0
Data variables:
    y         (index, AveRooms) float64 nan nan nan nan nan ... nan nan nan nan
    yhat      (index, AveRooms) float64 nan nan nan nan nan ... nan nan nan nan
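
Because each index row has exactly one AveRooms value, the padded array is mostly missing. As a quick sketch, you can check the fraction of non-missing values (roughly 1/47 here):

# each row contributes one (index, AveRooms) pair, so ~1/47 of cells are filled
float(ds["y"].notnull().mean())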

You can now apply any metric from xskillscore using the accessor method. Pass index to the dim argument, as we want to reduce this dimension and apply the metric over AveRooms. In addition, because the xarray.Dataset contains NaNs, you should set skipna=True:

[7]:
out = ds.xs.mse("y", "yhat", dim="index", skipna=True)
out
[7]:
<xarray.DataArray (AveRooms: 47)>
array([1.78946554e+00, 1.82700352e+00, 1.49245536e+00, 1.35284843e+00,
       1.38475581e+00, 1.78975720e+00, 3.26525395e+00, 4.80747797e+00,
       5.16524336e+00, 3.13496890e+00, 1.23401728e+00, 1.12562885e+00,
       5.54888374e-01, 2.61824323e+00, 8.84529997e-01, 1.17865387e+00,
       8.96786588e-01, 6.93484341e-01, 8.44837355e-01, 9.50615751e-01,
       2.55912220e+00, 4.16548298e-01, 3.07284580e-01, 8.31537279e-01,
       4.06466713e+00, 8.79983025e-01, 1.09491040e-02, 1.12379707e+00,
       1.50188148e+00, 1.56069394e+00, 2.73330025e-02, 2.68438951e-01,
       4.63967683e-01, 1.47081770e+00, 3.28568563e+00, 4.86835859e-01,
       5.48064237e-04, 1.40563208e+00, 9.04093610e-01, 3.26459003e-01,
       1.48460982e-01, 3.39427104e+00, 4.19379397e+00, 1.74130396e-01,
       1.04411235e+00, 1.23495233e+00, 2.64087781e-01])
Coordinates:
  * AveRooms  (AveRooms) float64 1.0 2.0 3.0 4.0 5.0 ... 60.0 62.0 133.0 142.0
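
The accessor methods mirror xskillscore's top-level functions, so the same result can also be obtained by passing the DataArrays directly:

xs.mse(ds["y"], ds["yhat"], dim="index", skipna=True)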

Since the result is most useful in tabular form, call xarray.DataArray.to_series to convert it back to a pandas.Series:

[8]:
out.to_series().head()
[8]:
AveRooms
1.0    1.789466
2.0    1.827004
3.0    1.492455
4.0    1.352848
5.0    1.384756
dtype: float64
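
Putting these steps together, the whole xskillscore evaluation can be written as a single chain, equivalent to the cells above:

(
    df.reset_index()
    .set_index(["index", "AveRooms"])[["y", "yhat"]]
    .to_xarray()
    .xs.mse("y", "yhat", dim="index", skipna=True)
    .to_series()
    .head()
)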

Evaluating predictions over many columns

xskillscore is built upon xarray.apply_ufunc, which vectorizes operations for speed. As a result, xskillscore can be considerably faster than pandas.groupby.apply, especially when the dataset contains many samples and the predictions have to be evaluated over many fields.
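
To illustrate the idea, here is a minimal sketch of how a vectorized metric can be expressed with xarray.apply_ufunc, using the ds Dataset from the California example above; xskillscore's actual internals differ in detail:

import xarray as xr

def _mse(a, b, axis):
    # reduce the core dimension with a single vectorized NumPy call
    return np.nanmean((a - b) ** 2, axis=axis)

# "index" is moved to the last axis of each input and reduced away,
# leaving a result over AveRooms
xr.apply_ufunc(
    _mse,
    ds["y"],
    ds["yhat"],
    input_core_dims=[["index"], ["index"]],
    kwargs={"axis": -1},
)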

For this exercise we will create synthetic data for which the predictions have to be evaluated over three fields:

[9]:
stores = np.arange(100)
skus = np.arange(100)
dates = pd.date_range("1/1/2020", "1/10/2020", freq="D")

rows = []
for date in dates:
    for store in stores:
        for sku in skus:
            rows.append(
                {
                    "DATE": date,
                    "STORE": store,
                    "SKU": sku,
                    "y": np.random.randint(9) + 1,
                }
            )
df = pd.DataFrame(rows)

noise = np.random.uniform(-1, 1, size=len(df["y"]))
df["yhat"] = (df["y"] + (df["y"] * noise)).clip(lower=df["y"].min())
df
[9]:
DATE STORE SKU y yhat
0 2020-01-01 0 0 6 3.874272
1 2020-01-01 0 1 9 13.551266
2 2020-01-01 0 2 8 3.979884
3 2020-01-01 0 3 3 3.222543
4 2020-01-01 0 4 6 1.647346
... ... ... ... ... ...
99995 2020-01-10 99 95 1 1.000000
99996 2020-01-10 99 96 4 2.770135
99997 2020-01-10 99 97 7 5.820397
99998 2020-01-10 99 98 2 1.000000
99999 2020-01-10 99 99 2 1.000000

100000 rows × 5 columns

Time the pandas.groupby.apply method:

[10]:
%%time
df.groupby(["STORE", "SKU"]).apply(lambda x: mean_squared_error(x["y"], x["yhat"]))
CPU times: user 3.5 s, sys: 3.99 ms, total: 3.5 s
Wall time: 3.5 s
[10]:
STORE  SKU
0      0      10.968313
       1       5.465377
       2       2.546790
       3       4.274809
       4       8.443736
                ...
99     95      6.832711
       96      4.262613
       97     11.533266
       98     14.450065
       99      2.820765
Length: 10000, dtype: float64

Time it using xskillscore:

[11]:
%%time
df.set_index(["DATE", "STORE", "SKU"]).to_xarray().xs.mse(
    "y", "yhat", dim="DATE"
).to_series()
CPU times: user 20.2 ms, sys: 8 µs, total: 20.2 ms
Wall time: 19.7 ms
[11]:
STORE  SKU
0      0      10.968313
       1       5.465377
       2       2.546790
       3       4.274809
       4       8.443736
                ...
99     95      6.832711
       96      4.262613
       97     11.533266
       98     14.450065
       99      2.820765
Length: 10000, dtype: float64
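
As a sanity check, the two approaches should produce identical values:

pd_out = df.groupby(["STORE", "SKU"]).apply(
    lambda x: mean_squared_error(x["y"], x["yhat"])
)
xs_out = (
    df.set_index(["DATE", "STORE", "SKU"])
    .to_xarray()
    .xs.mse("y", "yhat", dim="DATE")
    .to_series()
)
# both Series are sorted by (STORE, SKU), so the values align
np.testing.assert_allclose(pd_out.values, xs_out.values)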

See xskillscore-tutorial for further reading.
