{"id":22814759,"url":"https://github.com/abstractmachines/r-stats-probability","last_synced_at":"2025-03-30T22:17:30.301Z","repository":{"id":263663412,"uuid":"891102518","full_name":"abstractmachines/r-stats-probability","owner":"abstractmachines","description":"RStudio projects - and theory - for probability and statistics","archived":false,"fork":false,"pushed_at":"2025-03-11T18:34:35.000Z","size":270,"stargazers_count":1,"open_issues_count":5,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-11T19:35:58.828Z","etag":null,"topics":["binomial-distribution","geometric-distribution","hypergeometric-distribution","poisson-distribution","probability","probability-distribution","r","statistics"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/abstractmachines.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-19T18:21:35.000Z","updated_at":"2025-03-11T18:34:38.000Z","dependencies_parsed_at":"2024-11-19T19:45:51.321Z","dependency_job_id":"6c5d511d-2c14-4994-af11-fd2fc85450f6","html_url":"https://github.com/abstractmachines/r-stats-probability","commit_stats":null,"previous_names":["abstractmachines/r-stats-probability-451","abstractmachines/r-stats-probability"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abstractmachines%2Fr-stats-probability","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abstractmachines%2Fr-stats-probability/tags
","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abstractmachines%2Fr-stats-probability/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abstractmachines%2Fr-stats-probability/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/abstractmachines","download_url":"https://codeload.github.com/abstractmachines/r-stats-probability/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246385415,"owners_count":20768672,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["binomial-distribution","geometric-distribution","hypergeometric-distribution","poisson-distribution","probability","probability-distribution","r","statistics"],"created_at":"2024-12-12T13:10:12.795Z","updated_at":"2025-03-30T22:17:30.282Z","avatar_url":"https://github.com/abstractmachines.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"# RStudio projects for probability and statistics\n\nAssumption: Set theory and proofs familiarity\n\nSource(s):\n\nMost of this material is derived from _\"Mathematical Statistics\"_ by Wackerly.\n\nSome of this material is also derived from _\"Probability and Statistics for Engineering and the Sciences\"_, by Jay Devore, but to a much lesser degree.\n\nThe Wackerly book has more formulas (instead of tables), introductions to the mn rule,\nand other important concepts of combinatorics and statistics.\n\n# Table of Contents\n\n-- Part 1: The Basics --\n\n1. 
[Probability Definition: Events, Sample Points and Sequencing Events Techniques](#probability-definition)\n2. [How to calculate probability: Combinations, Permutations, Bayes Theorem](#how-to-calculate-probability)\n3. [Expected Value, Variance, Standard Deviation, Quartiles](#expected-value-variance-standard-deviation-quartiles)\n4. [Discrete Random Variables](#discrete-random-variables)\n5. [Discrete Probability Distributions: Binomial](#binomial-probability-distribution)\n- 5b. [Bernoulli](#bernoulli-random-variables-and-distributions)\n6. [Discrete Probability Distributions: Geometric](#geometric-probability-distribution)\n7. [Discrete Probability Distributions: Hypergeometric](#hypergeometric-probability-distribution)\n8. [Discrete Probability Distributions: Negative Binomial](#negative-binomial-distribution)\n9. [Discrete Probability Distributions: Poisson](#poisson-distribution)\n10. [Continuous Random Variables](#continuous-random-variables)\n11. [Probability Distributions \"Distribution Functions\" for all types of variables](#distribution-functions---or-cumulative-distributions---are-for-any-type-of-variable)\n12. [What is Density? A Mathematician's Perspective (and prep for Density Functions)](#what-is-density-a-mathematicians-perspective)\n13. [Probability Density Functions: PDF](#probabilty-density-functions)\n14. [Expected Value for a Continuous Random Variable](#expected-value-and-variance-continuous-rv)\n15. [Cumulative Distribution Functions (CDFs)](#cumulative-distribution-function-cdf)\n16. [Uniform Probability Distribution](#uniform-probability-distribution)\n17. [Gamma and Exponential Distributions](#gamma-and-exponential-distributions)\n18. [Multivariate (Bivariate, Joint) Probability Distributions](#multivariate---bivariate-joint-probability-distributions)\n19. [Marginal and Conditional Probability Distributions](#marginal-and-conditional-probability-distributions)\n20. [Independent Random Variables](#independent-random-variables)\n21. 
[Expected Value of a Function of Random Variables](#expected-value-of-a-function-of-random-variables)\n22. [Covariance of Two Random Variables](#covariance-of-two-random-variables)\n\n-- Part Two: Estimation and Application --\n\n[Introduction](#part-two-introduction)\n1. [Normal Probability Distributions](#normal-probability-distribution)\n2. [Standard Normal Distribution](#standard-normal-distribution)\n- 2b. [Z Scores](#standard-scores-z-scores)\n- 2c. [Central Limit Theorem](#central-limit-theorem)\n3. [Moments](#moments)\n4. [Estimation: Statistical Inference (and Confidence Intervals)](#estimation-statistical-inference)\n\n-- Part Three: Significance / Hypothesis Testing --\n\n1. [Significance or Hypothesis Testing](#significance-or-hypothesis-testing)\n2. [Hypothesis Testing and Inferences Based on Two Populations: Means and Proportions](#hypothesis-testing-and-inferences-based-on-two-populations-means-and-proportions)\n\n-- Part Four: The Analysis of Variance --\n\n1. [ANOVA](#analysis-of-variance-anova)\n\n-- Part Five: Linear Regression -- \n\n1. 
[Linear Regression](#linear-regression)\n\n##  Probability Definition\n\n[Probability](https://en.wikipedia.org/wiki/Probability) is the likelihood that an event will occur.\n\n\u003e Events\nWhen all sample points are equally likely, the probability of an event `E` is the cardinality of the event `|E|` divided by the cardinality of the sample space `|S|` (the \"universe\" `S`)\nthat the event is in.\n\n$$P(E) = \\frac{|E|}{|S|}$$\n\n### Axioms of Probability\n\n• For any event, the probability is nonnegative.\n\n• Probability of entire sample space is $1$, or $100\\%$.\n\n• For pairwise mutually exclusive events, the probability that at least one of them occurs is the sum of their individual probabilities.\n\n### Laws of Probability\n\n\u003e Law of Total Probability:\n\n$P(B) = P(B|A)P(A) + P(B|A\\prime)P(A\\prime)$, and more generally, for events $A_1, ..., A_k$ that partition the sample space, $P(B) = \\sum_{i=1}^k P(B|A_i)P(A_i)$.\n\n\u003e Law of Conditional Probability:\n\n$P(A|B) = \\dfrac{P(A \\cap B)}{P(B)}$, for $P(B) \u003e 0$.\n\n\u003e Independent Events:\n\n$A$ and $B$ are independent if $P(A|B) = P(A)$, or equivalently, if $P(A \\cap B) = P(A)P(B)$.\n\nEvents where one outcome determines the other are dependent;\n\n\"negative number\" versus \"positive number\" are dependent events.\n\n\u003e Mutually Exclusive is not Independent\n\nTake the \"negative number\" versus \"positive number\" setup. \"if A, then not B\".\n\nHere, $A \\cap B = \\emptyset$, so $P(A \\cap B) = 0 \\neq P(A)P(B)$ whenever both events have positive probability: the events are dependent, and mutually exclusive. 
\n\n\u003e Multiplicative:\n\n$P(A \\cap B) = P(A) P(B|A) = P(B)P(A|B)$.\n\nIf A and B are independent, $P(A \\cap B) = P(A)P(B)$.\n\n\u003e Additive:\n\n$P(A \\cup B) = P(A) + P(B) - P(A \\cap B)$.\n\nIf A and B are mutually exclusive, $P(A \\cap B) = 0$ and $P(A \\cup B) = P(A) + P(B)$.\n\n### Some Probability Properties\n\n• $P(A) + P(A\\prime) = 1$, and by Complement Law, $P(A) = 1 - P(A\\prime)$.\n\n• The complement of \"at most one\" is \"at least two.\"\n\n• The complement of \"at least one of each type\" (with two types) is \"only one type.\"\n\n## Probability Technique: Sample Points\n\nThe Wackerly probability book is great, and describes the sample-point method for calculating probability.\n\nOne example is to toss a pair of dice. The sample space, via the `mn rule`, has $m \\times n = (6)(6) = 36$ sample points.\n\nThere will be a list of simple events such as $E_1$ = the event that the roll is $(1,1)$, event $E_2 = (1,2)$ and so on. These simple events are _equiprobable_, that is, equally likely. So each simple event $E_i$ has probability $P(E_i) = \\dfrac{1}{36}$, and any event $A$ has probability $P(A) = \\dfrac{N(A)}{N(S)}$.\n\nSee the Wackerly book for more details on this technique, as well as sequenced events.\n\n## Probability Technique: Sequenced Events\n\nAnother technique, after sample point technique, is sequenced events.\n\n## How to calculate probability\n\n### Counting Distinct Objects : Combinations and Permutations\n\n\u003e Ordering n items: $n!$ ways.\n\n\u003e Combinations: Order Doesn't Matter\n\n$$ C = \\frac{n!}{(n-r)!r!}$$\n\nExamples: Out of the set `S = {A, B, C}`, choosing `r = 2` items, the combinations are `AB`, `AC`, `BC`, and `AB = BA` because _order doesn't matter._ When order doesn't matter, you don't need to count as many things, e.g. if `AB` is equivalent to `BA`, then those items count as one element of the set, not two. Here $C = \\frac{3!}{1!2!} = 3$.\n\n\u003e Permutations: Order Matters\n\n$$ P = \\frac{n!}{(n-r)!}$$\n\nNote that the denominator is smaller than in combinations. 
Permutation counts are larger _because order matters_, so every ordering is counted separately.\n\n\nExamples: Out of the set `S= {A, B, C}`, choosing `r = 2` items, the permutations are `AB`, `BA`, `AC`, `CA`, `BC`, `CB`, and `AB != BA.` Here $P = \\frac{3!}{1!} = 6$.\n\n\u003e Bayes Theorem:\n\nUsually used for inversion techniques. \"Find probability of a cause, given effect.\"\n\nLet $A_1, ... A_k$ be mutually exclusive and exhaustive events (a partition of the sample space) with prior probabilities $P(A_i)$.\n\nThen, $P(A_i | B) = \\dfrac{P(A_i \\cap B)}{P(B)} = \\dfrac{P(B|A_i)P(A_i)}{\\sum_{j=1}^k P(B|A_j)P(A_j)}$\n\n\u003e Cardinality\n\n[Cardinality](https://en.wikipedia.org/wiki/Cardinality) is the number of elements in a Set.\n\n\n## Expected Value, Variance, Standard Deviation, Quartiles\n\n\u003e Expected Value, $\\mu$ or $E[Y]$: The average\n\nThe expected value, or mean, is computed differently depending on the probability distribution.\n\n\u003e Variance, $\\sigma^2$: Dispersion From the Mean\n\nVariance is a measure of how far a set of numbers \"spreads out\" from the mean or average value.\n\n\u003e Standard Deviation, $\\sigma$: Square root of the variance\n\nA low standard deviation means values are close to the mean; a high standard deviation means values are more spread out.\n\n\u003e Quartiles:\n\nA measure in statistics; we've heard \"upper quartile\", etc. There are three actual quartiles:\nthe first is the 25th percentile, then the 50th (median) and the 75th; the four quarters are just the groups of data \nthat fall between those cut points.\n\n## Discrete Random Variables\n\n\u003e Expected Value or Mean of a Discrete Random Variable\n\n$E(Y) = \\mu_y = \\sum y* p(y)$\n\n\u003e Variance of a Discrete Random Variable, $\\sigma^2$\n\n$V(Y) = \\sigma^2_y = E[(Y - \\mu)^2]$\n\n\u003e\u003e Hacking variance: $Var[Y] = E[Y^2] - [E(Y)]^2$\n\nA trick that's nice to know.\n\n\u003e Standard Deviation of a Discrete Random Variable, $\\sigma$\n\n$\\sigma = \\sqrt{\\sigma^2}$\n\n\nScalar, discrete values of probability. Stepwise functions. 
Best described via pmf.\n\n\u003e pmf: Probability \"mass\" function\n\nA pmf gives the probability that a discrete random variable takes a particular value.\n\nThis could be denoted as `P(Y = y)`, or more concretely, `P(Y = 1)` for example.\n\nProbability mass functions will depend on the particular problem you're trying to solve.\n\n\u003e Axioms of pmf's and discrete random variable probabilities:\n\n1. Each possible value of the random variable must be assigned a probability between `0` and `1`;\n2. All of the probabilities must sum to a total probability of `1`.\n\n\n## Binomial Probability Distribution\n\nThe binomial distribution models a fixed number $n$ of *identical, independent* trials. These are uniform experiments producing a series of failures and successes, for example $\\{F,F,F,F,S,F,S,F...\\}$; the random variable for the binomial distribution counts the number of successes in the $n$ trials.\n\n\u003e Distribution:\n\nUsing the binomial probability distribution formula, we know that for $n$ trials, \n\nthe pmf is:\n\n$b(x; n, p) = {n \\choose x}p^x(1-p)^{n-x}$\n\nOr, more canonically, let $q = (1-p)$ and\n\n$b(x; n, p) = {n \\choose x}p^xq^{n-x}$\n \nfor $x = 0,1,2,...,n$ (and $0$ otherwise).\n\n\u003e Mean, Variance, Std Deviation of Binomial:\n\n$\\mu = E(Y) = np$\n\n$\\sigma^2 = npq$\n\n$\\sigma = \\sqrt{npq}$\n\n## Bernoulli Random Variables and Distributions\n\nThe Bernoulli distribution is considered a special case of the binomial distribution, with $n = 1$.\n\nBernoulli random variables, or distributions, are considered the simplest. 
This is a binary random variable with probability of \"success\" denoted as `p` and probability of \"failure\" as `1-p`, or just `q` where `q = 1-p`.\n\n- PMF: $f(x;p) = p$ if success ($x = 1$), $f(x;p) = 1-p = q$ if failure ($x = 0$).\n\n- Mean: $\\mu = p$\n\n- Variance: $\\sigma^2 = p(1-p) = pq$\n\n## Geometric Probability Distribution\n\nThe geometric probability distribution is built on the binomial distribution idea, that of a series of identical, independent trials resulting in successes and failures. For a geometric random variable $Y$, the value $y$ in $P(Y=y)$ is the number of the trial on which the first success occurs.\n\nLooking at the sample space (Wackerly 3.5), we see that\n\n$E_1: S$ with success on first trial;\n\n$E_2: FS$ with success on second trial;\n\n...\n\n$E_k: F, F, F, ..., S$ with success on the $k$th trial;\n\nwhere there are $k-1$ failures, and the first $S$ on the $k$th trial.\n\nAs such, $P(Y = y)$ is the probability that there will be $y-1$ failures, and trial number $y$ is the first success. If we let the probability of a failure be $q$, that means that there are $y-1$ $q$'s, and one $p$, which describes the geometric distribution below. \n \n\u003e Geometric Probability Distribution:\n\n$p(y) = q^{y-1}p$, for $y = 1, 2, 3, ...$\n\n\u003e Mean, Variance, Std Deviation of Geometric Distribution:\n\n$\\mu = E[Y] = \\dfrac{1}{p}$\n\n$\\sigma^2 = \\dfrac{1-p}{p^2}$.\n\nProofs for these are in the Wackerly book chapter 3.5 and are interesting.\n\n\n## Hypergeometric Probability Distribution\n\n\u003e Distribution:\n\nFor random sampling of sample size $n$ without replacement on a finite population of size $N$, particularly in cases where the sample size approaches the population size.\n\nThe denominator: counting the number of ways to select a subset of $n$ elements from a population of $N$, or, $N$ choose $n$ for the denominator, i.e. the size of the sample space.\n\nThen for the numerator, we think of $N$ objects, $r$ of which are red, and $N-r$ of which are black. 
Then, choosing $y$ objects from $r$ and then remaining $n-y$ objects from remaining $N-r$, such that by the $mn$ rule we have $mn = {r \\choose y}  {N-r \\choose n -y}$. Putting this all together, we have:\n\n$$p(y) = h(y; n,r,N) = \\dfrac{{r \\choose y}{N-r \\choose n-y}}{{N \\choose n}}$$\n\n\u003e Mean, Variance, Std Deviation of Hypergeometric:\n\n$\\mu = E(Y) = \\dfrac{nr}{N}$\n\n$\\sigma^2 = (\\dfrac{nr}{N})(\\dfrac{N-r}{N})(\\dfrac{N-n}{N-1})$,\n\nThen if we define $p = \\frac{r}{N}$ and $q = 1 - p = \\frac{N-r}{N}$,\n\n$\\sigma^2 = npq(\\dfrac{N-n}{N-1})$, similarly to binomial random variable.\n\nNote the factor $\\dfrac{N-n}{N-1}$, often called the _\"finite population correction factor\"_. \n\nAs $N \\rightarrow \\infty$, $\\dfrac{N-n}{N-1} \\rightarrow 1$.\n\nSo for larger population sizes, the variance of the hypergeometric distribution is the same as binomial, e.g. $npq$.\n\nAs $n \\rightarrow N$, $\\dfrac{N-n}{N-1} \u003c 1$, so for more \"finite\" population sizes _where sample size approaches population size_,\n\nthen obviously the hypergeometric distribution variance is smaller than that of the binomial distribution, as we'd have variance of $npq$ multiplied by a factor of less than 1.\n\nHaving lesser variance can be a good thing, so we can see how the hypergeometric distribution is useful for cases where the sample size approaches the population size. \"For sampling from a finite population\" such as, quality control, genetic hypothesis testing, or statistical hypothesis testing.\n\n## Negative Binomial Distribution\n\nRecall the geometric distribution, which is finding the probability of the first success. The negative binomial distribution focuses on the use case for multiple successes occurring.\n\nDepending on the textbook you are using, this is either counting the number of failures, or counting the trial where the $r$th success occurs.\n\nThe \"rth success\". 
$p(y) = {y-1 \\choose r-1}p^rq^{y-r}$ for $y = r, r+1, ...$, where $y$ is the number of the trial on which the $r$th success occurs (Wackerly's convention; Devore instead counts the number of failures before the $r$th success, which changes the formula). $\\mu = \\dfrac{r}{p}$, $\\sigma^2 = \\dfrac{r(1-p)}{p^2}$.\n\n\u003e Distribution (TODO): (case 1, Wackerly)\n\n\u003e Distribution (TODO): Case 2, Devore\n\n## Poisson Distribution\n\nThe Poisson probability distribution, used for counting rare events over a period of time, is also used to approximate the binomial distribution, since the binomial distribution converges to the Poisson distribution as $n \\rightarrow \\infty$ with $\\lambda = np$ held fixed. The approximation works well for large $n$, small $p$, and $\\lambda = np$ of roughly $7$ or less.\n\nThe Poisson distribution's probability function is $p(y) = \\dfrac{\\lambda^y}{y!}e^{-\\lambda}$ for $y = 0, 1, 2, ...$, with $\\mu = \\lambda$, $\\sigma^2 = \\lambda$, and hence $\\sigma = \\sqrt{\\lambda}$.\n\n## Continuous Random Variables\n\nContinuous random variables are defined on a continuum, e.g. an interval.\n\nTake the real number line $x \\in \\mathbb{R}$. We know from Real Analysis that any interval on this line contains\nuncountably many points.\n\n\u003e Hence, axioms of probability for continuous variables cannot be similar to those of discrete.\n- If each possible value of the random variable must be assigned a positive probability,\n- And the possible values form an uncountable set within an interval,\n- Then those probabilities cannot sum to 1, as the sum would be infinite.\n- Therefore a new set of axioms for continuous random variables must be defined, as follows.\n\n## Distribution functions - or Cumulative Distributions - are for any type of variable\n\nFrom Wackerly 4.2, this is an important note about the definition of distribution functions,\nbecause _distribution functions, e.g. 
cumulative distributions or probability distributions,\ncan be for ANY random variable, whether discrete or continuous:_\n\n\u003e\u003e \"Before we can state a formal definition for a continuous random variable, we must define the distribution function (or cumulative distribution function) associated with a random variable.\"\n\n\u003e\u003e Let `Y` denote any random variable. Then, `F(y) = P(Y \u003c= y)`, for example, `P(Y \u003c= 2)`.\n\n\u003e\u003e The *nature* of the distribution function associated with a random variable, determines whether the variable is discrete or continuous.\n\n- Discrete random variables have a stepwise function.\n- Continuous random variables have a continuous function.\n- Continuous random variables have a smooth curve graph that is the result of histograms, or Riemann summations.\n\n### Axioms of continuous RV distributions\n\n- Variables are continuous if their distributions are, and, lots of real analysis continuity stuff,\nregarding \"absolute continuity.\" More importantly,\n\n- For a continuous random variable `Y`, then $\\forall y \\in \\mathbb{R}, P(Y = y) = 0$,\nthat is,\n\n\u003e Continuous random variables have a zero probability at discrete points.\n\nWackerly uses the example of daily rainfall; probability of exactly 2.312 inches, a discrete point, is quite unlikely;\nprobability of between 2 and 3 inches is quite likely; an interval.\n\n\u003e Semantics and Idioms of `R` language for probability distributions:\nConsidered separate from pure mathematical theory.\n\nNote in R, the \"density function,\" invoked via `dhyper(y, r, N-r, n)`, this function measures a discrete random variable's scalar value, such as our hypergeometric example in R; there's a bit of oddness here, since we've used this function for _discrete_ random variables.\n\nAlso in R, the \"probability distribution function\" is invoked via `phyper(4, r, N-r, n)`.\n\n## What Is Density? 
A Mathematician's Perspective\n\n\u003e _And a preparation for density functions in probability._\n\n_Note: This is often considered grad-student level Real Analysis work, and the real numbers\ncan arguably be constructed in various ways; the Dedekind cuts are merely my personal favorite._\n\n_I ran across this material with Jay Cummings' _Real Analysis_ book,\nthis is a book that's \\$20 on Amazon and used by the Wrath of Math (excellent Youtube math channel)._\n\n_If you'd prefer to have a social life, you can skip this section, but frankly, without density \nin Real Analysis, density functions in probability are a bit nonsensical to me._\n\nRecall Real Analysis, and that the real numbers can be constructed via Dedekind cuts of \nrational numbers [link](https://en.wikipedia.org/wiki/Construction_of_the_real_numbers); recall that\n\"rationals are dense in the reals,\" [stack exchange](https://math.stackexchange.com/questions/1027970/what-does-it-mean-for-rational-numbers-to-be-dense-in-the-reals), Wikipedia dense set and topology [here](https://en.wikipedia.org/wiki/Dense_set).\n\nWe could also say \"density of $\\mathbb{Q} \\in \\mathbb{R}$.\"\n\nBasically, there are a lot of \"density\" discussions with the real numbers, as such.\n\nTake any interval on the real number line. \"Subdivide\" that interval into many \"subdivisions.\"\n\nThere are \"infinite\" real numbers, or subdivisions, in that interval (arguably countable or uncountable).\n\nThe big picture is, they're infinite, or close enough to infinite that it doesn't matter.\n\nThis is what \"density\" looks like. 
(The articles above are about this, regarding the real numbers,\nas well as rational and irrational numbers, and constructing the real number line from a hybrid\nof rational and irrational numbers like Dedekind, which is very fun Real Analysis stuff).\n\nSo, that's what \"density\" is: take an interval on the real number line,\nsubdivide it quite a lot into infinite subdivisions,\nand hey, that's \"dense.\"\n\n## Probability Density Functions\n\nContinuous variables are analyzed on an _interval_, so we care about _density_ in that interval, as the previous section discusses.\n\n\u003e PDF: Probability Density Function\n\nA PDF is a function that provides a \"likelihood\" that a continuous random variable's\nvalue is _close to_ that of the value of a sample, or multiple samples.\n\nFor more on PDFs, see\n[Wikipedia PDF article](https://en.wikipedia.org/wiki/Probability_density_function).\n\n\u003e Probability density: Probability per unit length that RV is _near_ one or more samples.\n\n**Probability density is** the probability per unit length, while the absolute likelihood \nfor a continuous random variable to take on any particular value is 0 \n(since there is an infinite set of possible values to begin with), \nthe value of the PDF at two different samples can be used to infer, \nin any particular draw of the random variable, how much more likely it is that the \nrandom variable would be close to one sample compared to the other sample.\" [wikipedia](https://en.wikipedia.org/wiki/Probability_density_function)\n\n\u003e PDF formula: The PDF of continuous random var $Y$ is the function $f(y)$, such that\n\n\u003e for interval $[a,b], a \\leq b$,\n\n\u003e $P(a \\leq Y \\leq b) = \\int_a^b f(y) dy$. \n\nThat is, the probability that the continuous random variable is within an interval,\nis the area under the curve of the density function between $a$ and $b$.\n\n\u003e PDF Axioms:\n\n1. 
The _total area under the curve of $f(x)$ over $(-\\infty, \\infty)$ is $1$:_\n\nThat is, $\\int_{-\\infty}^{\\infty} f(x) dx = 1.$\n\nContinuous variables have a \"smooth curve\" graph $f(x)$ that looks like the \nresult of a histogram, or a result of Riemann sums.\n\nThis axiom is analogous to the discrete RV's having all probabilities sum to 1 discretely.\n\n2. $f(x) \\geq 0, \\forall x$. All values of the PDF are nonnegative; they are densities rather than probabilities, so they may exceed 1.\n\n\n## Expected Value and Variance: Continuous RV\n\n\u003e Mean or Expected Value of a continuous random variable:\n\n$E(Y) = \\int_{-\\infty}^{\\infty} y * f(y) dy$\n\nSimilarly, for $h(y)$, a function of $y$,\n\n$E[h(Y)] = \\int_{-\\infty}^{\\infty} h(y) * f(y) dy$\n\n\u003e Variance of a continuous random variable with PDF $f(x)$:\n\n$\\sigma^2 = \\int_{-\\infty}^{\\infty} (x-\\mu)^2 * f(x) dx = E[(X-\\mu)^2]$\n\n## Cumulative Distribution Function (CDF)\n\nThe CDF for a continuous random variable $X$ is:\n\n$F(x) = P(X \\leq x) = \\int_{-\\infty}^x f(y) dy$. 
For each $x$, $F(x)$ is the area under the density curve to the left of $x$.\n\n\u003e Using $F(x)$ to compute probabilities:\n\nLet $X$ be a continuous random variable with PDF = $f(x)$, CDF = $F(x)$.\n\nThen, $\\forall a, P(X \u003e a) = 1 - F(a)$, and\n\n$\\forall a, b, a \u003c b, P(a \\leq X \\leq b) = F(b) - F(a)$.\n\n\n\u003e Relating PDF and CDF via fundamental theorem of calculus:\n\nIf $X$ is a continuous random variable with PDF $f(x)$ and CDF $F(x)$,\n\nThen, $\\forall x$ where $F\\prime(x)$ exists, $F\\prime(x) = f(x)$.\n\n## Uniform Probability Distribution\n\nIn a uniform distribution, every possible outcome is equiprobable - for example, handing out a dollar to random passersby without discernment.\n\nUniform Distributions look like a \"block\" most of the time, where the probability (or probability density) is constant within an interval.\n\n\u003e Uniform Distributions for Discrete Random Variables\n\nThe probability of each outcome is 1, divided by the total number of outcomes.\n\nUse cases include the possible outcomes of rolling a 6-sided die,\n\nprobability of drawing a particular suit within a deck of cards,\n\nflipping a coin, etc.\n\nAll of these are equiprobable discrete cases.\n\n\u003e Uniform Distributions for Continuous Random Variables\n\nThis can include a random number generator, temperature ranges, and many use cases\nwith an infinite number of possible outcomes within an interval of measurement.\n\nFor the continuous random variables, we'll present the probability density function,\nthe cumulative distribution function, and mean and variance.\n\n\u003e PDF:\n\nPDF of uniform distributions is $f(y; a,b) = \\dfrac{1}{b-a}$ for $a \\leq y \\leq b$; 0 otherwise.\n\nIn the uniform distribution, the probability over a subinterval is proportional to the length of that subinterval.\n\n\u003e CDF:\n\n$F(x) = \\dfrac{x-a}{b-a}$ for $a \\leq x \\leq b$ (and $0$ for $x \u003c a$, $1$ for $x \u003e b$)\n\n\u003e $\\mu, \\sigma^2$:\n\n$\\mu = \\dfrac{a+b}{2}$\n\n$\\sigma^2 = \\dfrac{(b-a)^2}{12}$\n\n## Gamma and Exponential Distributions\n\nThe gamma distribution, like the Poisson, is 
often used for waiting times and other measurements during temporal intervals.\n\n\u003e Exponential Distribution:\n\nWith rate param $\\lambda$ (equivalently, scale $1/\\lambda$),\n\n- $\\mu = \\dfrac{1}{\\lambda}$, and $\\sigma^2 = \\dfrac{1}{\\lambda^2}$\n\n- PDF: $f(x; \\lambda) = \\lambda e^{-\\lambda x}, x \\geq 0$, else $0$\n\n- CDF: $F(x; \\lambda) = 1 - e^{-\\lambda x}, x \\geq 0$, else $0$\n\n\u003e Gamma Distribution\n\nWith shape and scale params $\\alpha, \\beta$,\n\n- PDF: $f(y; \\alpha, \\beta) = \\dfrac{y^{\\alpha - 1}e^{-y/\\beta}}{\\beta^{\\alpha}\\Gamma(\\alpha)}$, for $y \u003e 0$,\n\nwhere the gamma function is $\\Gamma(\\alpha) = \\int_0^{\\infty} y^{\\alpha - 1}e^{-y} dy$;\n\n- PDF, Standard Gamma Distribution ($\\beta = 1$): $f(y; \\alpha) = \\dfrac{y^{\\alpha - 1}e^{-y}}{\\Gamma(\\alpha)}$\n\n- CDF, Standard Gamma Distribution: $F(y; \\alpha) = \\int_0^{y} \\dfrac{t^{\\alpha - 1}e^{-t}}{\\Gamma(\\alpha)} dt$\n\n- $\\mu = \\alpha\\beta$\n\n- $\\sigma^2 = \\alpha\\beta^2$\n\n## Multivariate - Bivariate, Joint Probability Distributions\n\nUntil now we've seen univariate probability distributions. 
The same basic axioms\nand rules tend to apply to multivariate distributions.\n\n### Discrete PMFs (or joint PMFs/CDFs):\n\n\u003e\u003e Example: toss a pair of dice.\n\nThe sample space by the `mn` rule has $m \\times n = 6 \\times 6 = 36$ possible pairs of sample points,\n\nwith events such as $E_1 = (1,1)$ having the probability of $\\dfrac{1}{36}$.\n\nHence, the bivariate probability function is $p(y_1, y_2) = P(Y_1 = y_1, Y_2 = y_2) = \\dfrac{1}{36}$.\n\n\u003e The joint or bivariate CDF for two discrete random variables is a double sum of the joint PMF:\n\n$F(y_1,y_2) = P(Y_1 \\leq y_1, Y_2 \\leq y_2) = \\sum_{t_1 \\leq y_1} \\sum_{t_2 \\leq y_2}p(t_1,t_2)$.\n\n- Axioms: Probabilities are all nonnegative, and all probabilities sum to 1.\n\nExample: For tossing two dice, find $P(2 \\leq Y_1 \\leq 3, 1 \\leq Y_2 \\leq 2)$:\n\nSimply sum the probabilities:\n\n$p(2,1) + p(2,2) + p(3,1) + p(3,2) = \\frac{1}{36} + \\frac{1}{36} + \\frac{1}{36} + \\frac{1}{36} = \\frac{4}{36} = \\frac{1}{9}$.\n\n### Continuous CDFs:\n\n\u003e For two jointly continuous random variables, the probability of a rectangle is a double integral of the joint PDF:\n\n$P(a_1 \\leq Y_1 \\leq a_2, b_1 \\leq Y_2 \\leq b_2) = \\int_{b_1}^{b_2} \\int_{a_1}^{a_2} f(y_1,y_2) dy_1dy_2$. (Integrate the inside first, then the outside.)\n\nThe joint CDF is the special case $F(y_1, y_2) = \\int_{-\\infty}^{y_1} \\int_{-\\infty}^{y_2} f(t_1,t_2) dt_2dt_1$.\n\n## Marginal and Conditional Probability Distributions\n\n### Marginal\n\n\u003e  \"To find p1(y1), we sum p(y1, y2) over all values of y2 and hence accumulate the probabilities on the y1 axis (or margin).\" - Wackerly\n\nBivariate events such as $P(Y_1 = y_1, Y_2 = y_2)$ we've seen, and per Wackerly,\nit follows that a univariate event, e.g. 
$P(Y_1 = y_1)$ is the **union of bivariate events**\n$P(Y_1 = y_1, Y_2 = y_2)$ with the union taken \"over all possible values of $y_2$.\"\n\n\u003e Marginal Probability Functions: Fix one var, iterate (sum, integrate) over the other; accumulate.\n\n- Discrete PMF: $p_x(x) = \\sum_{\\forall y} p(x,y), \\forall x$.\n\n- Continuous PDF: $f_x(x) = \\int_{\\forall y} f(x,y) dy$.\n\n### Conditional\n\nWe know that bivariate or joint events such as $P(y_1, y_2)$ are the intersection\nof two univariate events, s.t. $P(y_1, y_2) = P(y_1 \\cap y_2)$.\n\nGenerally, $P(A|B) = \\dfrac{P(A \\cap B)}{P(B)} \\Rightarrow P(y_1 | y_2) = \\dfrac{P(y_1 \\cap y_2)}{P(y_2)}$.\n\nLess generally:\n\n\u003e Conditional: Discrete: $p(y_1 | y_2) = P(Y_1 = y_1 | Y_2 = y_2) = \\dfrac{p(y_1, y_2)}{p_2(y_2)}$, provided $p_2(y_2) \u003e 0$\n\n\u003e Conditional: Continuous: $F(y_1 | y_2) = P(Y_1 \\leq y_1 | Y_2 = y_2)$, with conditional density $f(y_1 | y_2) = \\dfrac{f(y_1, y_2)}{f_2(y_2)}$, provided $f_2(y_2) \u003e 0$\n\n## Independent Random Variables\n\nIf Y1 and Y2 are independent, the joint distribution function can be written as the product of the marginal distribution functions: $F(y_1, y_2) = F_1(y_1)F_2(y_2)$.\n\n## Expected Value of a Function of Random Variables\n\nThis is the same as in univariate situations: multiply the function's value by the joint (density/mass/PDF/pmf) function, then sum or integrate over all of the variables.\n\n## Covariance and Correlation of Two Random Variables\n\nCovariance and Correlation are measures of dependency. 
The larger the covariance in absolute value,\nthe stronger the linear dependence (zero covariance, zero correlation).\n\nIf $Y_1, Y_2$ are random variables with means $\\mu_1, \\mu_2$, the covariance \nis $Cov(Y_1, Y_2) = E[(Y_1 - \\mu_1)(Y_2 - \\mu_2)]$.\n\nAfter some algebra, we can see that's also $E[Y_1 Y_2] - E[Y_1]E[Y_2]$.\n\nPositive covariance indicates that the variables tend to increase together; negative covariance indicates that one tends to decrease as the other increases.\n\nSince covariance depends on the units of measurement and is hard to interpret, we often use the correlation coefficient instead:\n\n$\\rho = \\dfrac{Cov(Y_1,Y_2)}{\\sigma_1 \\sigma_2}$\n\n\n## Part Two: Introduction\n\nOr, \"why did we wait until now to talk about the normal distribution and z scores?\"\n\nThe answer is, because we use those things for estimation, and they belong best\ntogether in an introduction. The normal distribution is the most frequently used\nprobability distribution. We'll learn about that, and then about z scores and moments\nwhich feed into estimation.\n\nZ scores also help us with confidence intervals and estimation.\n\n## Normal Probability Distribution\n\nThis is the famous \"bell curve,\" the most widely used probability distribution,\nwhere the mean is at the center, and standard deviation depicts width around\nthat mean of the curve, indicating its variance - or, its _volatility._ This\nrelation to volatility helps us understand the bell curve's importance in\nmeasuring the relative stability of a metric.\n\nThe normal distribution is common in statistics, economics and finance.\n\nThe little underlying standard deviations from the mean create the bell shape.\n\n\u003e Normal Distribution for a continuous random variable has the PDF:\n\n$f(y; \\mu, \\sigma) = \\dfrac{1}{\\sigma\\sqrt{2\\pi}} e^{\\frac{-(y-\\mu)^2}{2\\sigma^2}}$.\n\n\u003e Parameters of the Normal Distribution: $\\mu, \\sigma$\n\nWe consider $\\mu$ a location parameter since its location centers the bell curve;\n\nwe consider $\\sigma$ a scale parameter since variance widens or narrows the curve,\nwithout changing its mean 
center location.\n\nThe notation $Y \\sim N(\\mu,\\sigma^2)$ means _\"the random variable Y is normally distributed, with params_ $\\mu, \\sigma^2$.\"\n\n\u003e Area under the normal density function from a to b:\n\n$\\int_a^b \\dfrac{1}{\\sigma\\sqrt{2\\pi}} e^{\\frac{-(y-\\mu)^2}{2\\sigma^2}} dy$\n\n\u003e R code: pnorm, qnorm\n\n$pnorm(y_0, \\mu, \\sigma) \\Rightarrow P(Y \\leq y_0)$\n\n$qnorm(p, \\mu, \\sigma) \\Rightarrow$ the pth quantile $\\phi_p$ such that $P(Y \\leq \\phi_p) = p$.\n\n\u003e Solving for the Normal Distribution in R\n\n```\ndnorm: density function of the normal distribution\n\npnorm: cumulative distribution function of the normal distribution\n\nqnorm: quantile function of the normal distribution\n\nrnorm: random sampling from the normal distribution\n```\n\n## Standard Normal Distribution\n\nThis is the normal distribution, with param values $\\mu = 0, \\sigma = 1$.\n\nThe PDF of a continuous random variable with standard normal distribution is:\n\n$f(z; \\mu = 0, \\sigma = 1) = \\dfrac{1}{\\sqrt{2\\pi}} e^{\\dfrac{-z^2}{2}}$.\n\n## Standard Scores (Z Scores)\n\n\"Z score / Z Value / Standard Score\"\n\n\u003e From the population (mostly theoretical or edge cases)\n\nZ Scores are called so many things, but they all mean the same thing: the \ndistance of an observed value from the statistical mean, measured in standard deviations. Theoretically, this \nwould be the population mean, although that is hard to measure as we will cover.\n\n$z = \\dfrac{y-\\mu}{\\sigma}$, \n\nwhere the observed value is $y$, the population mean is\n$\\mu$, and the population standard deviation is $\\sigma$. So \"z scores\" or \n\"standard scores\" are also in the business of converting \"raw scores\", e.g. \nobserved data or values (here, $y$), into standard scores.\n\nZ Scores represent how far an observed value is from the statistical mean (recall\nagain that population mean and std dev can be difficult to get to, so we say \n\"statistical mean\" to indicate this abstraction). 
If a Z Score is 1, that means\nthe observed raw value $y$ is one standard deviation above the statistical mean.\nIf a Z score is zero, that means the observed value $y$ is equal to the mean.\n\n\u003e From the sample (most actual practice)\n\nOutside of settings like \"standardized testing,\" where an entire population is measured (including\nits mean and standard deviation), the population mean and standard deviation\nare often unknown. In these cases, we use sample statistics instead of population parameters.\n\nUsing sample stats,\n\n$z = \\dfrac{y- \\overline{y}}{s}$, where $\\overline{y}$ is the sample mean, \nand $s$ is the sample standard deviation.\n\nUnfortunately many statisticians will not make clear the very important difference\nbetween population and sample statistics in their Z scores, making it confusing\nto figure out what they are talking about. You will often see $\\overline{y}$ \nobserved _sample_ data, mixed with $\\sigma$ _population_ data, in their work.\n\n\u003e Z Curve\n\nThe \"z-curve\" is the standard normal curve. \n\n\u003e Z-scores: How many std dev from the mean a value is; areas under the curve\n\n\u003e\u003e 68-95-99.7 rule:\n\nAbout 68% of the distribution is within one standard deviation; 95% within two; 99.7% within three.\n\nSo, \n\n• 68% of all scores: $-1 \u003c z \u003c 1$,\n\n• 95% of all scores: $-2 \u003c z \u003c 2$,\n\n• 99.7% of all scores: $-3 \u003c z \u003c 3$,\n\n• and 50% of all scores: $0 \u003c z$, since the mean is at zero; it's often convenient to\ncalculate only one side due to the symmetry of the normal distribution.\n\n\u003e Z-notation for z-critical values; percentiles\n\nThe critical value $Z_\\alpha$ is the $100(1-\\alpha)$-th percentile of the standard normal distribution;\nthis means the _\"area to the right\"_ of $Z_\\alpha$ is $\\alpha$. 
For example, we say that\n$Z_{0.05}$ is the $100(1-0.05)$-th, or just 95th, percentile of the standard normal distribution.\n\n\u003e Standardizing (nonstandard) distributions: $\\mu = 0, \\sigma = 1$\n\nRecall distance from the mean in standard deviations was $z = \\dfrac{y-\\mu}{\\sigma}$.\n\nThis is similar; the \"standardized variable Y\" is $\\dfrac{Y-\\mu}{\\sigma}$.\n\n• Subtracting $\\mu$ \"shifts the mean to zero\";\n\n• Dividing by $\\sigma$ scales the variable s.t. the std deviation is 1 instead of $\\sigma$.\n\n\u003e Standard normal distribution axioms:\n\n• $P(a \\leq X \\leq b) = P(\\dfrac{a-\\mu}{\\sigma} \\leq Z \\leq \\dfrac{b-\\mu}{\\sigma})$;\n\n\u003e\u003e Then, when we see $\\phi$, the standard normal CDF (often written $\\Phi$), that means to use probability distribution tables, or $pnorm()$ in R:\n\n$\\Rightarrow \\phi(\\dfrac{b-\\mu}{\\sigma}) - \\phi(\\dfrac{a-\\mu}{\\sigma})$.\n\n• $P(X \\leq a) = \\phi(\\dfrac{a-\\mu}{\\sigma})$.\n\n• $P(X \\geq b) = 1 - \\phi(\\dfrac{b-\\mu}{\\sigma})$.\n\n• The CDF of $Z = \\dfrac{X - \\mu}{\\sigma}$ is $P(Z \\leq z) = P(X \\leq \\sigma z + \\mu) = \\int_{-\\infty}^{\\sigma z + \\mu} f(x;\\mu,\\sigma) dx$.\n\n**Please see the normal distribution markdown file for an application\nof the axioms of std normal distribution, as that is the best way to learn.**\n\n\u003e Standard Normal Approximation of Binomial:\n\nAn interesting quality of the normal distribution is that its curve approximates the histogram of a binomial distribution (much as Riemann-sum rectangles approximate the area under a curve) when that histogram isn't \"too skewed\". 
For these cases, use the normal approximation.\n\n\u003e Normal approximation:\n\n$\\mu = np, \\sigma = \\sqrt{npq}$ like binomial (with $q = 1 - p$), and\n\n$P(X \\leq x) \\approx \\phi(\\dfrac{x + 0.5 - \\mu}{\\sigma})$.\n\nThis approximation is adequate if $np \\geq 10$, $nq \\geq 10$, as it gives enough symmetry in the underlying binomial distribution.\n\n## Central Limit Theorem\n\n\u003e \"For large enough n, things are normal.\"\n\nFor a large enough sample size n, usually $n \u003e 30$, the sample mean $\\overline{Y}$ is approximately normally distributed, whatever the distribution of the underlying random variable.\n\n\u003c!-- Random Sampling for any distribution: $E[\\overline{Y}]  = \\mu, V[\\overline{Y}] = \\dfrac{\\sigma^2}{n}$. --\u003e\n\nIf the sample size is large, $\\overline{Y}$ will have an approximately normal sampling distribution, so as $n \\rightarrow \\infty$, the distribution function of the standardized mean $\\dfrac{\\overline{Y} - \\mu}{\\sigma/\\sqrt{n}}$ will converge to the standard normal one.\n\n\n## Moments\n\nMoments of a probability distribution include:\n\n\u003e Moment 1: Expected value (mean)\n\n- The first population moment is $E(X) = \\mu$;\n- The first sample moment is $\\overline{X} = \\dfrac{1}{n} \\sum X_i$.\n\n   (This makes sense as an average of many points in the sample.)\n\n\u003e Moment 2: Variance\n\n- The second population moment is $E(X^2)$; the variance is the second moment about the mean, $\\sigma^2 = E(X^2) - \\mu^2$;\n- The second sample moment is $\\overline{X^2} = \\dfrac{1}{n} \\sum X_i^2$.\n\n\u003e Moment 3: Skewness (whether the data is skewed to the left or right of the mean), e.g. asymmetry about mean;\n\n\u003e Moment 4: Kurtosis (\"tail-ness\").\n\n\u003e Moment k: \n\n- The kth population moment is $E(X^k)$;\n\n- The kth sample moment is $\\overline{X^k} = \\dfrac{1}{n} \\sum X_i^k$.\n\nThese moments will be fundamental to techniques of estimation that follow.\n\n## Estimation: Statistical Inference\n\nThe purpose of statistics is to make inferences and draw conclusions from data. We make \ninferences about a population from its sample(s). 
Not all of the data is always known.\n\nSo, we use _parameters_ to pass into functions - joint probability functions,\nestimation functions, and so on, in order to estimate, and infer, data.\n\nWhen we create estimates they can be a scalar point estimate, or an interval estimate,\ne.g. a confidence interval.\n\n### Point Estimates\n\nNotation: $\\hat{\\theta}$ is the \"point estimator\" of the parameter $\\theta$.\n\nWe use $\\theta$ as an abstraction.\n\nNow, recall that we talked above in the \"z scores/standardization\" section about\nstatistics often using population or sample data, mixed, without much explanation.\nHere we see that in play:\n\n\"One example is $\\hat{\\mu}$, the point estimator of $\\mu$, is $\\overline{x}$, the sample mean.\"\n\nThat statement is really saying the following:\n- There is a _population_ mean $\\mu$.\n- By changing $\\mu$ into $\\hat{\\mu}$, we are asking, \"what is the _estimator_ of $\\mu$?\"\n- That question is answered on the RHS of the equation, by $\\overline{x}$, the sample mean.\n\nThis completely squares with what we discussed earlier, using sample data to estimate\n(often unavailable) population data.\n\nTODO finish this section\n\n\u003e Unbiased Estimator\n\nBias of a point estimator is $E(\\hat{\\theta}) - \\theta$. For an unbiased estimator, \nthis is $0$, and the expected value of the estimator is exactly the param value:\n\n$E[\\hat{\\theta}] = \\theta$.\n\n\u003e Good Estimator: Minimum Variance Unbiased Estimator (MVUE)\n\nA good estimator has minimal variance, and has a \"skinny\" scatter about the mean.\n\nMVUE: $\\hat{\\mu} = \\overline{x}$ (e.g. for a normal population). 
(Basically, minimal variance points us to a \npreferred estimator, all other things being equal with other estimators).\n\n\u003e Recall also variance derivations, often used in estimation techniques:\n\n$\\sigma^2 = Var[X] = E[(X-\\mu)^2] = E[(X-E[X])^2] = E[X^2] - (E[X])^2$.\n\n\u003e Method of Estimation: Method of Moments\n\n\u003e\u003e Typical problem statement: \"Use method of moments to obtain an estimator for $\\theta$.\"\n\nThe first population moment is $E(X) = \\mu$, and the first sample moment is $\\overline{X} = \\dfrac{1}{n} \\sum X_i$;\n\nThe second population moment is $E(X^2)$, and the second sample moment is $\\overline{X^2} = \\dfrac{1}{n} \\sum X_i^2$;\n\nLet the kth population moment be $E[X^k]$, and let the kth sample moment be $\\overline{X^k} = \\dfrac{1}{n} \\sum_{i=1}^n x_i^k$.\n\nAs covered in Devore 6.2, the method of moments estimator is obtained by equating the population moment $E(X^k)$ to the corresponding sample moment $\\overline{X^k}$, and solving for the unknown parameter(s).\n\nAs covered in Wackerly, the nth raw moment (about zero) of a random variable $X$\nwith density function $f(x)$, is:\n\n- $\u003cX^n\u003e = \\sum_i x_i^n * f(x_i)$ for a discrete distribution,\n\n   ... very similar to the expected value under a discrete PMF, $E[X] = \\sum_i x_i * f(x_i)$;\n\n- $\u003cX^n\u003e = \\int x^n * f(x) dx$ for a continuous distribution, similar to the expected value under a continuous PDF.\n\nThat's it. That is the \"method of moments\" technique for obtaining estimators for $\\theta$.\n\n\u003e Method of Estimation: Method of Maximum Likelihood\n\n\u003e\u003e Typical problem statement: \"Use method of maximum likelihood to obtain an estimator for $\\theta$.\"\n\nProcess:\n\n• (1) Take the joint PMF or PDF of the sample, viewed as a function of $\\theta$. This is called\nthe `likelihood function`.\n\nThis would be a joint PMF/PDF. 
Recall that joint probability looks like \n$P(A \\cap B) = P(A)P(B)$ when the events are independent. For an independent sample, the likelihood is likewise a product: $L(\\theta) = \\prod_{i=1}^n f(x_i; \\theta)$.\n\nNotably this joint probability is a product. This will matter below.\n\n• (2) Take its natural log. Why? Because it's easier due to logarithm rules as follows.\n\nRecall that the joint probability is a product. Logarithm rules apply,\n\n\"The log of a product $\\Rightarrow$ sum of logs:\"\n\n$log(A*B) = log(A) + log(B)$.\n\nNotation and flow for this step would generally be something like:\n\n$\\ln L(\\theta) = \\ln \\prod_{i=1}^n f(x_i; \\theta) = \\sum_{i=1}^n \\ln f(x_i; \\theta)$.\n\n• (3) Take its derivative with respect to $\\theta$ and set that equal to zero to find the \"maximum value.\"\n\nWe are taking the derivative of the log function in the last step. Set it to zero.\n\n• (4) Solve for $\\theta$ to find the value $\\hat{\\theta}$ under which the observed sample has the \nmaximum likelihood; that value is the maximum likelihood estimator for $\\theta$.\n\nThat's it.\n\n\n### Confidence Intervals\n\nConfidence intervals are another way to obtain estimates. The confidence interval,\nor \"interval estimator,\" is a rule by which we get the limits/endpoints. We desire:\n\n- A narrow interval;\n- That actually encloses the desired parameter, $\\theta$.\n\n\u003e Confidence coefficient: \"$1-\\alpha$\"\n\nProbability that a confidence interval will enclose the desired parameter $\\theta$.\n\nOr, \"the fraction of the time, with repeated sampling, that the interval contains\n $\\theta$\", per Wackerly.\n\nA high confidence coefficient means high confidence. 
Moving forward with sampling,\nwe can be confident that our resulting confidence interval contains $\\theta$.\n\n\u003e Two sided confidence interval:\n\nVery similarly to real analysis and delta-epsilon infimum and supremum, let \nthe lower limit $\\hat{\\theta}_L$ and upper limit $\\hat{\\theta}_U$ of the interval satisfy:\n\n$P(\\hat{\\theta}_L \\leq \\theta \\leq \\hat{\\theta}_U) = 1 - \\alpha$, with confidence coefficient $1 - \\alpha$.\n\n$[\\hat{\\theta}_L, \\hat{\\theta}_U]$ is the resulting two sided confidence interval.\n\n\u003e One sided confidence interval:\n\nLet $P(\\hat{\\theta}_L \\leq \\theta) = 1 - \\alpha$, with implied one-sided CI of $(\\hat{\\theta}_L, \\infty)$, or,\n\nless abstractly, $\\overline{x} - Z_{\\alpha}*\\dfrac{s}{\\sqrt{n}} \u003c \\mu$;\n\nLet $P(\\theta \\leq \\hat{\\theta}_U) = 1 - \\alpha$, with implied one-sided CI of $(-\\infty, \\hat{\\theta}_U)$, or,\n\nless abstractly, $\\mu \u003c \\overline{x} + Z_{\\alpha}*\\dfrac{s}{\\sqrt{n}}$.\n\n(A one-sided interval puts all of $\\alpha$ in one tail, so it uses $Z_{\\alpha}$ rather than $Z_{\\alpha/2}$.)\n\n\u003e Finding a confidence interval\n\nRecall the standard normal distribution axioms: distance from the mean in standard\ndeviations was $z = \\dfrac{y-\\mu}{\\sigma}$, and the \"standardized variable Y\" is $\\dfrac{Y-\\mu}{\\sigma}$.\n\nSubtracting $\\mu$ \"shifts the mean to zero\". Dividing by $\\sigma$ scales the variable\ns.t. the std deviation is 1 instead of $\\sigma$.\n\nThis will be similar. For large samples, the quantity $Z = \\dfrac{\\hat{\\theta} - \\theta}{\\sigma_{\\hat{\\theta}}}$ has approximately a standard normal distribution.\n\n\nLet's look at our probability, by selecting critical values that each cut off a tail area of $\\alpha/2$:\n$Z_{\\alpha/2}$ and $-Z_{\\alpha/2}$. 
Then,\n\n$P(-Z_{\\alpha/2} \\leq Z \\leq Z_{\\alpha/2}) = 1 - \\alpha$\n\n$\\Rightarrow P(-Z_{\\alpha/2} \\leq \\dfrac{\\hat{\\theta} - \\theta}{\\sigma_{\\hat{\\theta}}} \\leq Z_{\\alpha/2}) = 1 - \\alpha$\n\n$\\Rightarrow P(-Z_{\\alpha/2} * \\sigma_{\\hat{\\theta}} \\leq \\hat{\\theta} - \\theta \\leq Z_{\\alpha/2} * \\sigma_{\\hat{\\theta}}) = 1 - \\alpha$\n\n$\\Rightarrow P(\\hat{\\theta} -Z_{\\alpha/2} * \\sigma_{\\hat{\\theta}} \\leq \\theta \\leq \\hat{\\theta} + Z_{\\alpha/2} * \\sigma_{\\hat{\\theta}}) = 1 - \\alpha$\n\nwhere the endpoints of our interval are the LHS and RHS of the inequality. This is quite abstract,\nso let's note that $\\theta$ is the unknown \"population\" value, and $\\hat{\\theta}$\nis its \"estimator,\" computed from the sample, and let's apply this.\n\n[EX] Let the parameter of interest, $\\theta$, be population mean $\\mu$.\n\nWe know its estimator is the sample mean, $\\overline{x}$. That means $\\hat{\\theta} = \\overline{x}$.\n\nNow, regarding this: \"$\\sigma_{\\hat{\\theta}}$.\" This is the standard deviation\nof the estimator $\\hat{\\theta}$ itself, also called its standard error. For the sample mean, that is 
$\\dfrac{\\sigma}{\\sqrt{n}}$.\n\nPutting this all together, we have that\n\n$P(\\overline{x} -Z_{\\alpha/2} * \\dfrac{\\sigma}{\\sqrt{n}} \\leq \\mu \\leq \\overline{x} + Z_{\\alpha/2} * \\dfrac{\\sigma}{\\sqrt{n}}) = 1 - \\alpha$\n\n$\\Rightarrow$ the interval $\\overline{x} \\pm Z_{\\alpha/2} * \\dfrac{\\sigma}{\\sqrt{n}}$ contains $\\mu$ with probability $1 - \\alpha$.\n\n\u003e\u003e This is also called a $100(1-\\alpha)\\%$ confidence interval for $\\mu$.\n\n\n[Def] Confidence Interval and eqns\n---\n\n\u003e When $\\sigma$ is known, \n\na $100(1-\\alpha)\\%$ confidence interval for the mean $\\mu$ of a normal population is\n\nthe point estimate of $\\mu$, $\\pm$ (z critical value)(standard error of the mean), or\n\n$(\\overline{x} - Z_{\\alpha/2} * \\dfrac{\\sigma}{\\sqrt{n}}, \\overline{x} + Z_{\\alpha/2} * \\dfrac{\\sigma}{\\sqrt{n}})$, i.e.\n\n$\\overline{x} \\pm Z_{\\alpha/2} * \\dfrac{\\sigma}{\\sqrt{n}}$.\n\n**Remember that for large samples you can replace $\\sigma$ with $s$ to represent \"sample std deviation.\"**\nThis is very useful, since population standard deviations aren't always known, but\nsample ones almost always are. 
This is often called \"large sample confidence\" work,\nbecause it relies on $n\u003e30$ to work.\n\n\u003e When $\\sigma$ is unknown and $n \u003c 30$, use the t-distribution:\n\n$\\overline{x} \\pm t_{\\alpha/2} * \\dfrac{s}{\\sqrt{n}}$.\n\nInstead of \"z critical scores\" $Z_{\\alpha/2}$, we will use $t_{\\alpha/2}$.\n\nThe t-distribution is not a normal distribution; since it uses small $n$ and \nthe sample standard deviation or error, this introduces more uncertainty, and \nheavier tails.\n\nThe t-distribution is controlled by parameter \"degrees of freedom.\" This can be \nnotated $\\nu$ or $df$.\n\n- When $\\nu = 1$, the t-distribution becomes Cauchy with very heavy tails.\n- When $\\nu \\rightarrow \\infty$, the t-distribution converges to the standard\nnormal distribution, with much lighter tails.\n\nThese principles are also related to kurtosis.\n\n\u003e R code related to the t distribution:\n\n- $dt$ (PDF value),\n- $pt$ (CDF value), returns the area to the left, or to the right if $pt(x, df, lower.tail = FALSE)$\n- $qt$ t-distribution's quantile. $qt(p, df)$, e.g. \n\n    ```R\n    # find the t-score at the 99th percentile of the Student t distribution with df = 20\n    qt(.99, df = 20)\n    ```\n\n    Or better yet, for a two-sided interval, $qt(1 - \\dfrac{\\alpha}{2}, df = n-1)$.\n\n- $rt$ (return value is a vector of random draws).\n\nFor the t-distribution, we call these t-scores, not z-scores.\n\n---\n\n\u003e Confidence Interval Width: Why we sometimes choose $90\\%$ over $99\\%$\n\nWider intervals are more reliable, but less precise; we desire narrow intervals\nwhenever possible. 
For this reason, we often specify our desired confidence level and interval width,\nand our output is the required sample size for such a confidence level and interval width.\n\n(This is also reminiscent of the minimal variance theories of estimation).\n\n_The sample size required for the confidence interval to have width $w$ is:_\n\n$n = (2 Z_{\\alpha/2}*\\dfrac{\\sigma}{w})^2$.\n\nAlso, $w = \\hat{\\theta}_U - \\hat{\\theta}_L$. The width of the interval is the difference of its endpoints.\n\nAlso, width is twice the margin of error, e.g. if the estimate must be \"within 3 units\" of the parameter,\nthen your margin of error is 3, and width is 6.\n\n\u003e Final notes on Confidence Intervals\n\n- According to the relative frequency viewpoint of probability, the confidence level \ndescribes many repeated experiments: about $95\\%$ of the intervals built this way contain the parameter. \nWe can't just do one experiment and say that its interval contains the parameter with probability $95\\%$.\n\n\u003e Known critical values:\n\n$90\\%$ CI: Two sided CV 1.645, one sided 1.28\n\n$95\\%$ CI: Two sided CV 1.96, one sided 1.645\n\n$99\\%$ CI: Two sided CV 2.58, one sided 2.33\n\n## Significance or Hypothesis Testing\n\nThe hypothesis test is about the population mean $\\mu$, which in the calorie example represents the \"true\" average estimated calorie content for all individuals in the population (not just the people sampled).\n\nFor more information see hypothesis testing R code samples.\n\n$H_0$, the null hypothesis, is the conservative, \"tried and true\" value or finding. 
This is usually a claim about a population parameter, such as $\\mu$.\n\nThe alternative \"researcher's\" hypothesis is, for example, $H_a: \\mu \u003e 158$, and is generally\n\"what the researcher wants to prove,\" or a theory of sorts.\n\nThe significance level, denoted $\\alpha$, is the threshold that \"significance testing\" uses to tell whether a change has the intended consequence.\nThe most common significance levels are $0.05, 0.01, 0.001$.\n\nWe assume $\\alpha = 0.05$ here, a common threshold for \"strong evidence.\"\n\nThe general way to do hypothesis testing is:\n\n• Define hypotheses $H_0$, $H_a$;\n\n• Set up a threshold, significance level;\n\n• Take a sample;\n\n• That sample gives us **test statistics** from the sample $\\overline{x}, s$, etc.\n\nOur test statistic comes from either a z test (normally distributed) or a t test:\n\nRecall that z-test is for approximately normally distributed samples with $n \u003e 30$,\nand $\\sigma$ pop std dev known, e.g. the z-distribution;\n\nt-test is for $n \u003c 30$ and/or when $\\sigma$ unknown; t-distribution.\n\n$z = \\dfrac{\\overline{x} - \\mu}{\\dfrac{\\sigma}{\\sqrt{n}}}$ or $t = \\dfrac{\\overline{x} - \\mu}{\\dfrac{s}{\\sqrt{n}}}$, with $\\mu$ the value claimed under $H_0$.\n\n• Set up the rejection region for $H_0$ via critical values:\n\n($qnorm()$ for z distribution, $qt()$ for t distribution).\n\n• See if our **test statistic** hits inside or outside the rejection region for $H_0$.\n\nThis informs whether to reject the null hypothesis.\n\n• See if our **p-values** are greater or less than significance levels.\n\n($pnorm()$ for z distribution, $pt()$ for t distribution).\n\nA p-value is the probability that we would get statistics at least as extreme as the ones we got, if $H_0$ were true.\n\nFor example, if $H_0$ is true, how likely is a test statistic at least as extreme as ours?\n\nA high p-value would mean, very likely, so the sample gives little evidence against $H_0$. This means that rejecting $H_0$ would risk the \"wrong answer\"\nwith this particular sample. 
One possible cause of a high p-value is _high variance_ in the data.\n\nSo the test for p values is to test that probability against the threshold,\nor significance level $\\alpha$.\n\n• p-value $\u003c \\alpha \\Rightarrow$ reject $H_0$.\n\n• p-value $\\geq \\alpha \\Rightarrow$ do not reject $H_0$. \n\n\u003e One sample t test\n\n$t.test()$ for t distributions. For z distributions, that is going to be a lot of\ndata so we tend to not write out all the samples individually in R, and also\nthere appears to be no equivalent builtin for the same reason, so for z-dist,\ncompute z scores by hand and then find p values with $pnorm()$.\n\n\u003e Testing a population proportion, or percentage; \"large sample z-test\"\n\nTest statistic: $z = \\dfrac{\\hat{p} - p_0}{\\sqrt{\\dfrac{p_0(1-p_0)}{n}}}$,\n\nwhere $\\hat{p} = \\dfrac{x}{n}$ and $x$ is the number of sample \"successes\" (here, $x = 82$).\n\nUse $prop.test()$ for a \"one sample population proportion z-test.\"\n\n\u003e Large sample test: need $np \\geq 10, nq \\geq 10$.\n\n## Hypothesis Testing and Inferences Based on Two Populations: Means and Proportions\n\nInferences based on two samples usually happen via means ($\\mu, \\overline{x}$),\nand population proportions.\n\n- Two sample t test: see R code;\nThe test statistic is \n$t = \\dfrac{(\\overline{x} - \\overline{y}) - \\Delta}{\\sqrt{\\dfrac{s_1^2}{m} + \\dfrac{s_2^2}{n}}}$, where $\\Delta = 0$ is used if working with the hypothesis that $\\mu_1 = \\mu_2$.\n\n```R\n# Welch's t-test, requested by the last argument var.equal = FALSE, does not assume\n# equal variances for the two samples.\n# h and p are numeric vectors holding the two samples.\n\nt.test(h, p, var.equal = FALSE)\n```\n\n- Two-sample t confidence interval:\n\n$(\\overline{x} - \\overline{y}) \\pm t_{\\alpha/2, df} * \\sqrt{\\dfrac{s_1^2}{m} + \\dfrac{s_2^2}{n}}$, assuming approximately normal populations\n\n\u003e Paired data: use $t.test()$ with $paired = TRUE$\n\nFor matching \"paired data,\" what that really means is sampling the same _thing_,\nbut in different ways. 
So instead of the \"two objects\" or \"two samples\" usually\ndiscussed in two-sample hypothesis testing, we have two samples of the same thing.\n\nEquations for most \"2 sample\" problems involve independent sets of data, but in \npaired testing, e.g. duplicate of same \"object\", there's dependence,\nso we use different equations generally called \"paired t testing.\"\n\n- Paired t test: See R code, and also,\n\nThe t-statistic for a paired t-test is $t = \\dfrac{\\overline{d}}{s_d/\\sqrt{n}}$ where $\\overline{d}$ is the mean of the differences, and $s_d$ is the standard deviation of the differences.\n\n- 2 sample, large sample Z test: when $\\sigma_1, \\sigma_2$ known, and approximately normal distribution,\n\nZ test statistic is $z = \\dfrac{\\overline{x} - \\overline{y}}{\\sqrt{\\dfrac{\\sigma_1^2}{m} + \\dfrac{\\sigma_2^2}{n}}}$\n\n- Two-proportion population large sample test:\n\ntest statistic is $z = \\dfrac{\\hat{p_1} - \\hat{p_2}}{\\sqrt{\\hat{p}\\hat{q}(\\dfrac{1}{m}+\\dfrac{1}{n})}}$, where $\\hat{p}$ is the pooled sample proportion and $\\hat{q} = 1 - \\hat{p}$.\n\n## Analysis of Variance (ANOVA)\n\nANOVA, which stands for Analysis of Variance, is a statistical test used to analyze the difference between the means of more than two groups. A one-way ANOVA uses one independent variable, while a two-way ANOVA uses two independent variables.\n\n```R\n# one-way (\"single factor\") ANOVA p-value:\npf(test_stat, df1 = df_treatment, df2 = df_error, lower.tail = FALSE)\n# always right tailed for one-way ANOVA.\n\n# and, critical values (alpha = 0.05 is a common choice):\nqf(1 - alpha, df_treatment, df_error)\n\n# as usual, if p \u003c alpha, reject H0, and check critical vals too.\n```\n\nTODO ANOVA table:\n\n## Linear Regression\n\nTODO. See R code for now :)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabstractmachines%2Fr-stats-probability","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabstractmachines%2Fr-stats-probability","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabstractmachines%2Fr-stats-probability/lists"}