Merge branch 'main' of github.com:computationalUncertaintyLab/TSID

thomas mcandrew · thomas mcandrew · commit 83d02333e016 · 2026-01-27T16:11:40.000-05:00
diff --git a/.DS_Store b/.DS_Store
diff --git a/.gitattributes b/.gitattributes
@@ -0,0 +1,4 @@
+
+*.ipynb	diff=jupyternotebook
+
+*.ipynb	merge=jupyternotebook
diff --git a/.ipynb_checkpoints/README-checkpoint.md b/.ipynb_checkpoints/README-checkpoint.md
@@ -0,0 +1 @@
+# TSID
diff --git a/.ipynb_checkpoints/ch1-checkpoint.ipynb b/.ipynb_checkpoints/ch1-checkpoint.ipynb
@@ -0,0 +1,268 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "21a6615c-b9eb-4081-a362-e0bbdf82590a",
+   "metadata": {},
+   "source": [
+    "# Chapter 1"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4812cc2a-428d-4eee-aa2a-406a4b4d871e",
+   "metadata": {},
+   "source": [
+    "## Time series data versus IID data \n",
+    "\n",
+    "A typical setup for statistical analysis assumes that a series of experiments generates observations that are independent and identically distributed (often abbreviated i.i.d.). \n",
+    "For example, \n",
+    "\n",
+    "\\begin{align}\n",
+    "    \\mathcal{D} &= ( y_{1}, y_{2}, \\cdots, y_{n}   ) \\\\ \n",
+    "    y_{i} &\\sim \\text{Poisson}(\\lambda)\n",
+    "\\end{align}\n",
+    "\n",
+    "where we use $\\mathcal{D}$ to represent a dataset, lower case letters to represent collected observations, capital letters to represent random variables, and Greek letters to represent parameters. \n",
+    "Because we assume that the above observations were generated from a sequence of i.i.d. Poisson random variables, we can simplify expressions that include the joint probability of $Y_{1}, Y_{2}, \\cdots$. \n",
+    "\n",
+    "\\begin{align}\n",
+    "    P(Y_{1}, Y_{2}) &= P(Y_{1}) \\cdot P(Y_{2}) \\\\ \n",
+    "    P(Y_{1}, Y_{2}, \\cdots, Y_{n}) &= P(Y_{1}) \\cdot P(Y_{2}) \\cdots P(Y_{n}) = \\prod_{i=1}^{n} P(Y_{i}) \\\\ \n",
+    "    & = \\lambda^{\\sum_{i=1}^{n} y_{i} } \\frac{e^{ -n\\lambda }}{ \\prod_{i=1}^{n} y_{i}! } \\propto e^{ -n\\lambda }\\lambda^{\\sum_{i=1}^{n} y_{i} }\n",
+    "\\end{align}\n",
+    "\n",
+    "The expression above is an (often good) approximation of the joint probability of observing all $n$ data points at once. \n",
+    "Unlike more traditional data collection mechanisms, time series data do not allow us to assume that the observations are i.i.d.\n",
+    "Instead, we assume that the observation at time $t$ may depend on all of the random variables before time $t$. \n",
+    "Then, we cannot simplify the joint probability of the first $t$ random variables to the product of their marginal probabilities. \n",
+    "\n",
+    "Recall the multiplication rule \n",
+    "\n",
+    "\\begin{align}\n",
+    "    P(A,B,C) &= P( B,C | A ) P(A) \\\\ \n",
+    "              &= P( C | B, A ) P(B|A) P(A)\n",
+    "\\end{align}\n",
+    "\n",
+    "We can still use the multiplication rule to assess the joint probability of a sequence of random variables.\n",
+    "Let's assume that we wish to model some time series process from time unit one up until time unit $T$. \n",
+    "Then we need to estimate probabilities like \n",
+    "\n",
+    "\\begin{align}\n",
+    "    P( Y_{1}, Y_{2}, \\cdots, Y_{T}  ) = P(Y_{1})\\cdot P(Y_{2} | Y_{1}) \\cdot P(Y_{3} | Y_{2},Y_{1}) \\cdots P(Y_{T} | Y_{T-1}, \\cdots, Y_{1})\n",
+    "\\end{align}\n",
+    "\n",
+    "The i.i.d assumption simplifies the above by assuming that each random variable is independent of all others. \n",
+    "For time series, we want to simplify the above but still keep the most important characteristic of the process: that observations in the future depend on the past. \n",
+    "\n",
+    "### Markov Assumption \n",
+    "\n",
+    "Given a series of random variables, the Markov assumption states that the probability of $Y_{t}$ depends only on the random variable at time $t-1$, or \n",
+    "\n",
+    "\\begin{align}\n",
+    "    P(Y_{t} | Y_{t-1}, Y_{t-2}, \\cdots Y_{1}) \\approx  P(Y_{t} | Y_{t-1})\n",
+    "\\end{align}\n",
+    "\n",
+    "The Markov assumption aims to capture the most basic attribute of a time series, that future values depend on the recent past, without the more demanding requirement of modeling dependence on **all** of the past. \n",
+    "\n",
+    "This considerably simplifies the factorization above \n",
+    "\n",
+    "\\begin{align}\n",
+    "    P( Y_{1}, Y_{2}, \\cdots, Y_{T}  ) &= P(Y_{1})\\cdot P(Y_{2} | Y_{1}) \\cdot P(Y_{3} | Y_{2},Y_{1}) \\cdots P(Y_{T} | Y_{T-1}, \\cdots, Y_{1}) \\\\ \n",
+    "    & \\approx P(Y_{1}) \\cdot P(Y_{2} | Y_{1}) \\cdot P(Y_{3} | Y_{2}) \\cdots P(Y_{T} | Y_{T-1})\n",
+    "\\end{align}\n",
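+    "\n",
+    "As a compact restatement (a sketch, assuming the series starts at $Y_{1}$ as on the left-hand side above), the Markov approximation can be collected into a single product, which mirrors the product form of the i.i.d. case: \n",
+    "\n",
+    "\\begin{align}\n",
+    "    P( Y_{1}, Y_{2}, \\cdots, Y_{T}  ) \\approx P(Y_{1}) \\prod_{t=2}^{T} P(Y_{t} | Y_{t-1})\n",
+    "\\end{align}\n",
+    "\n",
+    "Each factor now conditions on a single random variable, so only the initial distribution $P(Y_{1})$ and the one-step conditional $P(Y_{t} | Y_{t-1})$ need to be modeled. \n",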
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f2aca56d-50c3-4bd0-a087-81cca8ad4d6b",
+   "metadata": {},
+   "source": [
+    "## Influenza-like illness\n",
+    "\n",
+    "The Centers for Disease Control and Prevention collects a dataset about influenza-like illness, or ILI.\n",
+    "ILI is a non-specific syndrome defined as fever and cough and/or sore throat. It is used for flu surveillance worldwide. ILI can be caused by influenza virus infection and by infections with other respiratory viruses.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3f585c25-8d3e-43ee-a3b7-05f642670fcb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#--d \n",
+    "import pandas as pd \n",
+    "\n",
+    "d = pd.read_csv(\"./data/XXXXXXXX\")  #<--using pandas to import a dataset\n",
+    "\n",
+    "# plot time series for two states\n",
+    "#x is weeks\n",
+    "#y is percent ili (column_name = wILI)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0e4ed68f-6af4-4068-9612-52427e027ecd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#--d \n",
+    "import pandas as pd \n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7614c9a3-efc5-4037-b68d-9a9d97bef067",
+   "metadata": {},
+   "source": [
+    "## COVID Community mobility\n",
+    "\n",
+    "COVID Community Mobility Reports aim to provide insights into what changed in response to policies aimed at combating COVID-19. The reports charted movement trends over time by geography, across different categories of places such as retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3959d067-a105-4f97-b917-faa514116f36",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#--d \n",
+    "import pandas as pd \n",
+    "\n",
+    "d = pd.read_csv(\"./data/XXXXXXXX\")\n",
+    "\n",
+    "# a plot of one county time series for two activities\n",
+    "\n",
+    "# x is the day \n",
+    "# y - parks_percent_change_from_baseline (<-for example)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0ff18230-d314-4bd8-8033-99c15cd2636d",
+   "metadata": {},
+   "source": [
+    "## Mpox incidence"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "57dfd0e6-9805-4754-b774-738058c1fd2f",
+   "metadata": {},
+   "source": [
+    "## Correlation, Covariance, and the Correlogram"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "29b752f3-aacc-4f33-9e65-5434927cbfaf",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# For ILI we will want to plot the percent ILI at week t versus the percent ILI at week t+1"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "40c49a8b-547e-4d27-a64e-b92d6d187980",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# For COVID we will want to plot the behavior at week t versus the behavior at week t+1"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9dde1a78-0206-4272-a102-0c2295b6efd3",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "aa674e67-12ab-497d-bf4f-aa0c811b8e64",
+   "metadata": {},
+   "source": [
+    "## Smoothing methods"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "50c0eb5c-acb1-4099-8856-aafeadb90719",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e8d7b8b3-1680-4d53-ae30-4804f79d2868",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "915fd2f8-fb4a-4937-aac0-09a36bef5785",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "363ad686-aaf7-427d-9362-87cd366317c6",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8a31ece2-cbf6-499a-9cff-ae9061a08b56",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "899076c1-dd9b-43f9-80a8-9111d06f3626",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.14.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/ch1.ipynb b/ch1.ipynb
@@ -83,21 +83,33 @@
     "## Influenza-like illness\n",
     "\n",
     "The Centers for Disease Control and Prevention collects a dataset about influenza-like illness, or ILI.\n",
-    "ILI is a CCXZXXZXZXZ. \n",
+    "ILI is a non-specific syndrome defined as fever and cough and/or sore throat. It is used for flu surveillance worldwide. ILI can be caused by influenza virus infection and by infections with other respiratory viruses.\n",
     "\n"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 9,
    "id": "3f585c25-8d3e-43ee-a3b7-05f642670fcb",
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "ename": "ModuleNotFoundError",
+     "evalue": "No module named 'pandas'",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
+      "\u001b[31mModuleNotFoundError\u001b[39m                       Traceback (most recent call last)",
+      "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[9]\u001b[39m\u001b[32m, line 2\u001b[39m\n\u001b[32m      1\u001b[39m \u001b[38;5;66;03m#--d \u001b[39;00m\n\u001b[32m----> \u001b[39m\u001b[32m2\u001b[39m \u001b[38;5;28;01mimport\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01mpandas\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mas\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01mpd\u001b[39;00m \n\u001b[32m      4\u001b[39m d = pd.read_csv(\u001b[33m\"\u001b[39m\u001b[33m./data/ili_data.csv\u001b[39m\u001b[33m\"\u001b[39m)  \u001b[38;5;66;03m#<--using pandas to import a datset\u001b[39;00m\n\u001b[32m      6\u001b[39m \u001b[38;5;66;03m# plot time series for two state\u001b[39;00m\n\u001b[32m      7\u001b[39m \u001b[38;5;66;03m#x is weeks\u001b[39;00m\n\u001b[32m      8\u001b[39m \u001b[38;5;66;03m#y is percent ili (column_name = wILI)\u001b[39;00m\n",
+      "\u001b[31mModuleNotFoundError\u001b[39m: No module named 'pandas'"
+     ]
+    }
+   ],
    "source": [
     "#--d \n",
     "import pandas as pd \n",
     "\n",
-    "d = pd.read_csv(\"./data/XXXXXXXX\")  #<--using pandas to import a datset\n",
+    "d = pd.read_csv(\"./data/ili_data.csv\")  #<--using pandas to import a dataset\n",
     "\n",
     "# plot time series for two states\n",
     "#x is weeks\n",
@@ -122,21 +134,33 @@
    "source": [
     "## COVID Community mobility\n",
     "\n",
-    "Describe describe describe\n",
+    "COVID Community Mobility Reports aim to provide insights into what changed in response to policies aimed at combating COVID-19. The reports charted movement trends over time by geography, across different categories of places such as retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential.\n",
     "\n"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 1,
    "id": "3959d067-a105-4f97-b917-faa514116f36",
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "ename": "ModuleNotFoundError",
+     "evalue": "No module named 'pandas'",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
+      "\u001b[31mModuleNotFoundError\u001b[39m                       Traceback (most recent call last)",
+      "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[1]\u001b[39m\u001b[32m, line 2\u001b[39m\n\u001b[32m      1\u001b[39m \u001b[38;5;66;03m#--d \u001b[39;00m\n\u001b[32m----> \u001b[39m\u001b[32m2\u001b[39m \u001b[38;5;28;01mimport\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01mpandas\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mas\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01mpd\u001b[39;00m \n\u001b[32m      4\u001b[39m d = pd.read_csv(\u001b[33m\"\u001b[39m\u001b[33m./data/pa_covid.csv\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m      6\u001b[39m \u001b[38;5;66;03m# a plot of one county time seires for two activities\u001b[39;00m\n\u001b[32m      7\u001b[39m \n\u001b[32m      8\u001b[39m \u001b[38;5;66;03m# x is the day \u001b[39;00m\n\u001b[32m      9\u001b[39m \u001b[38;5;66;03m# y - parks_percent_change_from_baseline (<-for example)\u001b[39;00m\n",
+      "\u001b[31mModuleNotFoundError\u001b[39m: No module named 'pandas'"
+     ]
+    }
+   ],
    "source": [
     "#--d \n",
     "import pandas as pd \n",
     "\n",
-    "d = pd.read_csv(\"./data/XXXXXXXX\")\n",
+    "d = pd.read_csv(\"./data/pa_covid.csv\")\n",
     "\n",
     "# a plot of one county time series for two activities\n",
     "\n",
@@ -261,7 +285,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.13.3"
+   "version": "3.14.2"
   }
  },
  "nbformat": 4,
diff --git a/data/.ipynb_checkpoints/from_us_to_pa-checkpoint.py b/data/.ipynb_checkpoints/from_us_to_pa-checkpoint.py
@@ -0,0 +1,14 @@
+import pandas as pd
+
+if __name__ == "__main__":
+    d1 = pd.read_csv("2020_US_Region_Mobility_Report.csv")
+    d2 = pd.read_csv("2021_US_Region_Mobility_Report.csv")
+    d3 = pd.read_csv("2022_US_Region_Mobility_Report.csv")
+
+    def combine_covid(inputs, output):
+        d = pd.concat(inputs)  # stack the yearly reports row-wise
+        pa_covid = d.loc[d.sub_region_1 == "Pennsylvania"]  # keep Pennsylvania rows only
+        pa_covid.to_csv(output, index = False)
+    
+    output = "pa_covid.csv"
+    combine_covid([d1,d2,d3],output)
diff --git a/data/.ipynb_checkpoints/ili_data-checkpoint.py b/data/.ipynb_checkpoints/ili_data-checkpoint.py
@@ -0,0 +1,7 @@
+import pandas as pd
+from delphi_epidata import Epidata
+
+res = Epidata.fluview(["nat"], [201501, Epidata.range(201502, 202552)])
+
+d = pd.DataFrame(res["epidata"])
+d.to_csv("ili_data.csv",index=False)
diff --git a/data/ili_data.csv b/data/ili_data.csv
diff --git a/data/ili_data.py b/data/ili_data.py
