[1]:
# ruff: noqa
Tabular Data
xskillscore can be used on tabular data such as that stored in a pandas.DataFrame.
It can be used most effectively when evaluating predictions over different fields.
[2]:
import numpy as np
import pandas as pd
import xskillscore as xs
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
np.random.seed(seed=42)
California house prices dataset
A small example is to take a dataset and evaluate the model according to a field (column).
Load the California house prices dataset:
[3]:
housing = fetch_california_housing(as_frame=True)
df = housing.frame
df["AveRooms"] = df["AveRooms"].round()
df = df.rename(columns={"MedHouseVal": "y"})
df
[3]:
| MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | y | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 7.0 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.0 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.0 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 6.0 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.0 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | 1.5603 | 25.0 | 5.0 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 | 0.781 |
| 20636 | 2.5568 | 18.0 | 6.0 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 | 0.771 |
| 20637 | 1.7000 | 17.0 | 5.0 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 | 0.923 |
| 20638 | 1.8672 | 18.0 | 5.0 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 | 0.847 |
| 20639 | 2.3886 | 16.0 | 5.0 | 1.162264 | 1387.0 | 2.616981 | 39.37 | -121.24 | 0.894 |
20640 rows × 9 columns
Create a dummy prediction column by adding noise to y:
[4]:
noise = np.random.uniform(-1, 1, size=len(df["y"]))
df["yhat"] = (df["y"] + (df["y"] * noise)).clip(lower=df["y"].min())
Evaluate the model over the field AveRooms using pandas.groupby.apply with mean_squared_error from scikit-learn:
[5]:
df.groupby("AveRooms").apply(lambda x: mean_squared_error(x["y"], x["yhat"])).head()
[5]:
AveRooms
1.0 1.789466
2.0 1.827004
3.0 1.492455
4.0 1.352848
5.0 1.384756
dtype: float64
You could also do the following using xskillscore.
First, structure the pandas.DataFrame to keep the core fields when converting to an xarray object:
[6]:
min_df = df.reset_index().set_index(["index", "AveRooms"])[["y", "yhat"]]
min_df
[6]:
| y | yhat | ||
|---|---|---|---|
| index | AveRooms | ||
| 0 | 7.0 | 4.526 | 3.390337 |
| 1 | 6.0 | 3.585 | 6.816622 |
| 2 | 8.0 | 3.521 | 5.154701 |
| 3 | 6.0 | 3.413 | 4.086443 |
| 4 | 6.0 | 3.422 | 1.067792 |
| ... | ... | ... | ... |
| 20635 | 5.0 | 0.781 | 0.611083 |
| 20636 | 6.0 | 0.771 | 1.497737 |
| 20637 | 5.0 | 0.923 | 0.648200 |
| 20638 | 5.0 | 0.847 | 1.470100 |
| 20639 | 5.0 | 0.894 | 0.166662 |
20640 rows × 2 columns
Convert it to an xarray.Dataset using pandas.DataFrame.to_xarray. Note: This will create an array of index by AveRooms and pad the values that do not exist with nan.
[7]:
ds = min_df.to_xarray()
ds
[7]:
<xarray.Dataset> Size: 16MB
Dimensions: (index: 20640, AveRooms: 47)
Coordinates:
* index (index) int64 165kB 0 1 2 3 4 5 ... 20635 20636 20637 20638 20639
* AveRooms (AveRooms) float64 376B 1.0 2.0 3.0 4.0 ... 60.0 62.0 133.0 142.0
Data variables:
y (index, AveRooms) float64 8MB nan nan nan nan ... nan nan nan nan
yhat (index, AveRooms) float64 8MB nan nan nan nan ... nan nan nan nanYou call now apply any metric from xskillscore using the accessor method. The input for the dim argument is index as we want to reduce this dimension and apply the metric over AveRooms. In addition, there are nan’s in the xarray.Dataset so you should use skipna=True:
[8]:
out = ds.xs.mse("y", "yhat", dim="index", skipna=True)
out
[8]:
<xarray.DataArray (AveRooms: 47)> Size: 376B
array([1.78946554e+00, 1.82700352e+00, 1.49245536e+00, 1.35284843e+00,
1.38475581e+00, 1.78975720e+00, 3.26525395e+00, 4.80747797e+00,
5.16524336e+00, 3.13496890e+00, 1.23401728e+00, 1.12562885e+00,
5.54888374e-01, 2.61824323e+00, 8.84529997e-01, 1.17865387e+00,
8.96786588e-01, 6.93484341e-01, 8.44837355e-01, 9.50615751e-01,
2.55912220e+00, 4.16548298e-01, 3.07284580e-01, 8.31537279e-01,
4.06466713e+00, 8.79983025e-01, 1.09491040e-02, 1.12379707e+00,
1.50188148e+00, 1.56069394e+00, 2.73330025e-02, 2.68438951e-01,
4.63967683e-01, 1.47081770e+00, 3.28568563e+00, 4.86835859e-01,
5.48064237e-04, 1.40563208e+00, 9.04093610e-01, 3.26459003e-01,
1.48460982e-01, 3.39427104e+00, 4.19379397e+00, 1.74130396e-01,
1.04411235e+00, 1.23495233e+00, 2.64087781e-01])
Coordinates:
* AveRooms (AveRooms) float64 376B 1.0 2.0 3.0 4.0 ... 60.0 62.0 133.0 142.0It makes sense to return the data in tabular form hence you can call xarray.DataArray.to_series to convert it to a pandas.Series:
[9]:
out.to_series().head()
[9]:
AveRooms
1.0 1.789466
2.0 1.827004
3.0 1.492455
4.0 1.352848
5.0 1.384756
dtype: float64
Evaluating predictions over many columns
xskillscore is built upon xarray.apply_ufunc which offers speed-up by vectorizing operations. As a result xskillscore can be faster than pandas.groupby.apply. This is espicially true if there are many samples in the dataset and if the predictions have to be evaluated over many fields.
For this exercise we will create fake data for which the predictions have to be evaluated over three fields:
[10]:
stores = np.arange(100)
skus = np.arange(100)
dates = pd.date_range("1/1/2020", "1/10/2020", freq="D")
rows = []
for _, date in enumerate(dates):
for _, store in enumerate(stores):
for _, sku in enumerate(skus):
rows.append(
dict(
{
"DATE": date,
"STORE": store,
"SKU": sku,
"y": np.random.randint(9) + 1,
}
)
)
df = pd.DataFrame(rows)
noise = np.random.uniform(-1, 1, size=len(df["y"]))
df["yhat"] = (df["y"] + (df["y"] * noise)).clip(lower=df["y"].min())
df
[10]:
| DATE | STORE | SKU | y | yhat | |
|---|---|---|---|---|---|
| 0 | 2020-01-01 | 0 | 0 | 6 | 3.874272 |
| 1 | 2020-01-01 | 0 | 1 | 9 | 13.551266 |
| 2 | 2020-01-01 | 0 | 2 | 8 | 3.979884 |
| 3 | 2020-01-01 | 0 | 3 | 3 | 3.222543 |
| 4 | 2020-01-01 | 0 | 4 | 6 | 1.647346 |
| ... | ... | ... | ... | ... | ... |
| 99995 | 2020-01-10 | 99 | 95 | 1 | 1.000000 |
| 99996 | 2020-01-10 | 99 | 96 | 4 | 2.770135 |
| 99997 | 2020-01-10 | 99 | 97 | 7 | 5.820397 |
| 99998 | 2020-01-10 | 99 | 98 | 2 | 1.000000 |
| 99999 | 2020-01-10 | 99 | 99 | 2 | 1.000000 |
100000 rows × 5 columns
Time the pandas.groupby.apply method:
[11]:
%%time
df.groupby(["STORE", "SKU"]).apply(lambda x: mean_squared_error(x["y"], x["yhat"]))
CPU times: user 2.36 s, sys: 7.24 ms, total: 2.37 s
Wall time: 2.37 s
[11]:
STORE SKU
0 0 10.968313
1 5.465377
2 2.546790
3 4.274809
4 8.443736
...
99 95 6.832711
96 4.262613
97 11.533266
98 14.450065
99 2.820765
Length: 10000, dtype: float64
Time it using xskillscore:
[12]:
%%time
df.set_index(["DATE", "STORE", "SKU"]).to_xarray().xs.mse("y", "yhat", dim="DATE").to_series()
CPU times: user 9.74 ms, sys: 0 ns, total: 9.74 ms
Wall time: 9.61 ms
[12]:
STORE SKU
0 0 10.968313
1 5.465377
2 2.546790
3 4.274809
4 8.443736
...
99 95 6.832711
96 4.262613
97 11.533266
98 14.450065
99 2.820765
Length: 10000, dtype: float64
See xskillscore-tutorial for further reading.
[ ]: