{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# ruff: noqa" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Tabular Data\n", "\n", "`xskillscore` can be used on tabular data such as that stored in a `pandas.DataFrame`.\n", "\n", "It can be used most effectively when evaluating predictions over different fields." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import xskillscore as xs\n", "from sklearn.datasets import fetch_california_housing\n", "from sklearn.metrics import mean_squared_error\n", "\n", "np.random.seed(seed=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## California house prices dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A small example is to take a dataset and evaluate the model according to a field (column).\n", "\n", "Load the California house prices dataset:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MedIncHouseAgeAveRoomsAveBedrmsPopulationAveOccupLatitudeLongitudey
08.325241.07.01.023810322.02.55555637.88-122.234.526
18.301421.06.00.9718802401.02.10984237.86-122.223.585
27.257452.08.01.073446496.02.80226037.85-122.243.521
35.643152.06.01.073059558.02.54794537.85-122.253.413
43.846252.06.01.081081565.02.18146737.85-122.253.422
..............................
206351.560325.05.01.133333845.02.56060639.48-121.090.781
206362.556818.06.01.315789356.03.12280739.49-121.210.771
206371.700017.05.01.1200921007.02.32563539.43-121.220.923
206381.867218.05.01.171920741.02.12320939.43-121.320.847
206392.388616.05.01.1622641387.02.61698139.37-121.240.894
\n", "

20640 rows × 9 columns

\n", "
" ], "text/plain": [ " MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \\\n", "0 8.3252 41.0 7.0 1.023810 322.0 2.555556 37.88 \n", "1 8.3014 21.0 6.0 0.971880 2401.0 2.109842 37.86 \n", "2 7.2574 52.0 8.0 1.073446 496.0 2.802260 37.85 \n", "3 5.6431 52.0 6.0 1.073059 558.0 2.547945 37.85 \n", "4 3.8462 52.0 6.0 1.081081 565.0 2.181467 37.85 \n", "... ... ... ... ... ... ... ... \n", "20635 1.5603 25.0 5.0 1.133333 845.0 2.560606 39.48 \n", "20636 2.5568 18.0 6.0 1.315789 356.0 3.122807 39.49 \n", "20637 1.7000 17.0 5.0 1.120092 1007.0 2.325635 39.43 \n", "20638 1.8672 18.0 5.0 1.171920 741.0 2.123209 39.43 \n", "20639 2.3886 16.0 5.0 1.162264 1387.0 2.616981 39.37 \n", "\n", " Longitude y \n", "0 -122.23 4.526 \n", "1 -122.22 3.585 \n", "2 -122.24 3.521 \n", "3 -122.25 3.413 \n", "4 -122.25 3.422 \n", "... ... ... \n", "20635 -121.09 0.781 \n", "20636 -121.21 0.771 \n", "20637 -121.22 0.923 \n", "20638 -121.32 0.847 \n", "20639 -121.24 0.894 \n", "\n", "[20640 rows x 9 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "housing = fetch_california_housing(as_frame=True)\n", "df = housing.frame\n", "df[\"AveRooms\"] = df[\"AveRooms\"].round()\n", "df = df.rename(columns={\"MedHouseVal\": \"y\"})\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a dummy prediction column by adding noise to `y`:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "noise = np.random.uniform(-1, 1, size=len(df[\"y\"]))\n", "df[\"yhat\"] = (df[\"y\"] + (df[\"y\"] * noise)).clip(lower=df[\"y\"].min())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Evaluate the model over the field `AveRooms` using `pandas.groupby.apply` with `mean_squared_error` from `scikit-learn`:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/rf/26llfhwd68x7cftb1z3h000w0000gp/T/ipykernel_11510/120772475.py:1: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.\n", " df.groupby(\"AveRooms\").apply(lambda x: mean_squared_error(x[\"y\"], x[\"yhat\"])).head()\n" ] }, { "data": { "text/plain": [ "AveRooms\n", "1.0 1.789466\n", "2.0 1.827004\n", "3.0 1.492455\n", "4.0 1.352848\n", "5.0 1.384756\n", "dtype: float64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(\"AveRooms\").apply(lambda x: mean_squared_error(x[\"y\"], x[\"yhat\"])).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You could also do the following using `xskillscore`.\n", "\n", "First, structure the `pandas.DataFrame` to keep the core fields when converting to an `xarray` object:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yyhat
indexAveRooms
07.04.5263.390337
16.03.5856.816622
28.03.5215.154701
36.03.4134.086443
46.03.4221.067792
............
206355.00.7810.611083
206366.00.7711.497737
206375.00.9230.648200
206385.00.8471.470100
206395.00.8940.166662
\n", "

20640 rows × 2 columns

\n", "
" ], "text/plain": [ " y yhat\n", "index AveRooms \n", "0 7.0 4.526 3.390337\n", "1 6.0 3.585 6.816622\n", "2 8.0 3.521 5.154701\n", "3 6.0 3.413 4.086443\n", "4 6.0 3.422 1.067792\n", "... ... ...\n", "20635 5.0 0.781 0.611083\n", "20636 6.0 0.771 1.497737\n", "20637 5.0 0.923 0.648200\n", "20638 5.0 0.847 1.470100\n", "20639 5.0 0.894 0.166662\n", "\n", "[20640 rows x 2 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "min_df = df.reset_index().set_index([\"index\", \"AveRooms\"])[[\"y\", \"yhat\"]]\n", "min_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Convert it to an `xarray.Dataset` using `pandas.DataFrame.to_xarray`. Note: This will create an array of `index` by `AveRooms` and pad the values that do not exist with `nan`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset> Size: 16MB\n",
       "Dimensions:   (index: 20640, AveRooms: 47)\n",
       "Coordinates:\n",
       "  * index     (index) int64 165kB 0 1 2 3 4 5 ... 20635 20636 20637 20638 20639\n",
       "  * AveRooms  (AveRooms) float64 376B 1.0 2.0 3.0 4.0 ... 60.0 62.0 133.0 142.0\n",
       "Data variables:\n",
       "    y         (index, AveRooms) float64 8MB nan nan nan nan ... nan nan nan nan\n",
       "    yhat      (index, AveRooms) float64 8MB nan nan nan nan ... nan nan nan nan
" ], "text/plain": [ " Size: 16MB\n", "Dimensions: (index: 20640, AveRooms: 47)\n", "Coordinates:\n", " * index (index) int64 165kB 0 1 2 3 4 5 ... 20635 20636 20637 20638 20639\n", " * AveRooms (AveRooms) float64 376B 1.0 2.0 3.0 4.0 ... 60.0 62.0 133.0 142.0\n", "Data variables:\n", " y (index, AveRooms) float64 8MB nan nan nan nan ... nan nan nan nan\n", " yhat (index, AveRooms) float64 8MB nan nan nan nan ... nan nan nan nan" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds = min_df.to_xarray()\n", "ds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You call now apply any metric from `xskillscore` using the accessor method. The input for the `dim` argument is `index` as we want to reduce this dimension and apply the metric over `AveRooms`. In addition, there are `nan`'s in the `xarray.Dataset` so you should use `skipna=True`:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.DataArray (AveRooms: 47)> Size: 376B\n",
       "array([1.78946554e+00, 1.82700352e+00, 1.49245536e+00, 1.35284843e+00,\n",
       "       1.38475581e+00, 1.78975720e+00, 3.26525395e+00, 4.80747797e+00,\n",
       "       5.16524336e+00, 3.13496890e+00, 1.23401728e+00, 1.12562885e+00,\n",
       "       5.54888374e-01, 2.61824323e+00, 8.84529997e-01, 1.17865387e+00,\n",
       "       8.96786588e-01, 6.93484341e-01, 8.44837355e-01, 9.50615751e-01,\n",
       "       2.55912220e+00, 4.16548298e-01, 3.07284580e-01, 8.31537279e-01,\n",
       "       4.06466713e+00, 8.79983025e-01, 1.09491040e-02, 1.12379707e+00,\n",
       "       1.50188148e+00, 1.56069394e+00, 2.73330025e-02, 2.68438951e-01,\n",
       "       4.63967683e-01, 1.47081770e+00, 3.28568563e+00, 4.86835859e-01,\n",
       "       5.48064237e-04, 1.40563208e+00, 9.04093610e-01, 3.26459003e-01,\n",
       "       1.48460982e-01, 3.39427104e+00, 4.19379397e+00, 1.74130396e-01,\n",
       "       1.04411235e+00, 1.23495233e+00, 2.64087781e-01])\n",
       "Coordinates:\n",
       "  * AveRooms  (AveRooms) float64 376B 1.0 2.0 3.0 4.0 ... 60.0 62.0 133.0 142.0
" ], "text/plain": [ " Size: 376B\n", "array([1.78946554e+00, 1.82700352e+00, 1.49245536e+00, 1.35284843e+00,\n", " 1.38475581e+00, 1.78975720e+00, 3.26525395e+00, 4.80747797e+00,\n", " 5.16524336e+00, 3.13496890e+00, 1.23401728e+00, 1.12562885e+00,\n", " 5.54888374e-01, 2.61824323e+00, 8.84529997e-01, 1.17865387e+00,\n", " 8.96786588e-01, 6.93484341e-01, 8.44837355e-01, 9.50615751e-01,\n", " 2.55912220e+00, 4.16548298e-01, 3.07284580e-01, 8.31537279e-01,\n", " 4.06466713e+00, 8.79983025e-01, 1.09491040e-02, 1.12379707e+00,\n", " 1.50188148e+00, 1.56069394e+00, 2.73330025e-02, 2.68438951e-01,\n", " 4.63967683e-01, 1.47081770e+00, 3.28568563e+00, 4.86835859e-01,\n", " 5.48064237e-04, 1.40563208e+00, 9.04093610e-01, 3.26459003e-01,\n", " 1.48460982e-01, 3.39427104e+00, 4.19379397e+00, 1.74130396e-01,\n", " 1.04411235e+00, 1.23495233e+00, 2.64087781e-01])\n", "Coordinates:\n", " * AveRooms (AveRooms) float64 376B 1.0 2.0 3.0 4.0 ... 60.0 62.0 133.0 142.0" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "out = ds.xs.mse(\"y\", \"yhat\", dim=\"index\", skipna=True)\n", "out" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It makes sense to return the data in tabular form hence you can call `xarray.DataArray.to_series` to convert it to a `pandas.Series`:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AveRooms\n", "1.0 1.789466\n", "2.0 1.827004\n", "3.0 1.492455\n", "4.0 1.352848\n", "5.0 1.384756\n", "dtype: float64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "out.to_series().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluating predictions over many columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`xskillscore` is built upon `xarray.apply_ufunc` which offers speed-up by vectorizing operations. As a result `xskillscore` can be faster than `pandas.groupby.apply`. This is espicially true if there are many samples in the dataset and if the predictions have to be evaluated over many fields.\n", "\n", "For this exercise we will create fake data for which the predictions have to be evaluated over three fields:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DATESTORESKUyyhat
02020-01-010063.874272
12020-01-0101913.551266
22020-01-010283.979884
32020-01-010333.222543
42020-01-010461.647346
..................
999952020-01-10999511.000000
999962020-01-10999642.770135
999972020-01-10999775.820397
999982020-01-10999821.000000
999992020-01-10999921.000000
\n", "

100000 rows × 5 columns

\n", "
" ], "text/plain": [ " DATE STORE SKU y yhat\n", "0 2020-01-01 0 0 6 3.874272\n", "1 2020-01-01 0 1 9 13.551266\n", "2 2020-01-01 0 2 8 3.979884\n", "3 2020-01-01 0 3 3 3.222543\n", "4 2020-01-01 0 4 6 1.647346\n", "... ... ... ... .. ...\n", "99995 2020-01-10 99 95 1 1.000000\n", "99996 2020-01-10 99 96 4 2.770135\n", "99997 2020-01-10 99 97 7 5.820397\n", "99998 2020-01-10 99 98 2 1.000000\n", "99999 2020-01-10 99 99 2 1.000000\n", "\n", "[100000 rows x 5 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stores = np.arange(100)\n", "skus = np.arange(100)\n", "dates = pd.date_range(\"1/1/2020\", \"1/10/2020\", freq=\"D\")\n", "\n", "rows = []\n", "for _, date in enumerate(dates):\n", " for _, store in enumerate(stores):\n", " for _, sku in enumerate(skus):\n", " rows.append(\n", " dict(\n", " {\n", " \"DATE\": date,\n", " \"STORE\": store,\n", " \"SKU\": sku,\n", " \"y\": np.random.randint(9) + 1,\n", " }\n", " )\n", " )\n", "df = pd.DataFrame(rows)\n", "\n", "noise = np.random.uniform(-1, 1, size=len(df[\"y\"]))\n", "df[\"yhat\"] = (df[\"y\"] + (df[\"y\"] * noise)).clip(lower=df[\"y\"].min())\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Time the `pandas.groupby.apply` method:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.23 s, sys: 31.5 ms, total: 4.26 s\n", "Wall time: 4.29 s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ ":1: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.\n" ] }, { "data": { "text/plain": [ "STORE SKU\n", "0 0 10.968313\n", " 1 5.465377\n", " 2 2.546790\n", " 3 4.274809\n", " 4 8.443736\n", " ... \n", "99 95 6.832711\n", " 96 4.262613\n", " 97 11.533266\n", " 98 14.450065\n", " 99 2.820765\n", "Length: 10000, dtype: float64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "df.groupby([\"STORE\", \"SKU\"]).apply(lambda x: mean_squared_error(x[\"y\"], x[\"yhat\"]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Time it using `xskillscore`:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 20.1 ms, sys: 6.21 ms, total: 26.3 ms\n", "Wall time: 24.9 ms\n" ] }, { "data": { "text/plain": [ "STORE SKU\n", "0 0 10.968313\n", " 1 5.465377\n", " 2 2.546790\n", " 3 4.274809\n", " 4 8.443736\n", " ... \n", "99 95 6.832711\n", " 96 4.262613\n", " 97 11.533266\n", " 98 14.450065\n", " 99 2.820765\n", "Length: 10000, dtype: float64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "df.set_index([\"DATE\", \"STORE\", \"SKU\"]).to_xarray().xs.mse(\"y\", \"yhat\", dim=\"DATE\").to_series()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See [xskillscore-tutorial](https://github.com/raybellwaves/xskillscore-tutorial) for further reading." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "interpreter": { "hash": "e5607b67897ceeb4cb8d1a6f5e8f77cf995244d75ab9ff3b133e23bb37c07f75" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.8" } }, "nbformat": 4, "nbformat_minor": 4 }