7 changes: 4 additions & 3 deletions 1_DATASCITOOLBOX/Data Scientists Toolbox Course Notes.Rmd
@@ -20,13 +20,14 @@ $\pagebreak$
* `pwd` = print working directory (current directory)
* `clear` = clear screen
* `ls` = list stuff
* `-a` = see all (hidden)
* `-a` = see all (including hidden files)
* `-l` = details
* `cd` = change directory
* `mkdir` = make directory
* `touch` = creates an empty file
* `cp` = copy
* `cp <file> <directory>` = copy a file to a directory
* `cp <file> <renamed_file>` = rename a file
* `cp -r <directory> <newDirectory>` = copy a directory and all of its contents to a new directory
* `-r` = recursive
* `rm` = remove
@@ -102,7 +103,7 @@ $\pagebreak$
* **Big data** = now possible to collect data cheap, but not necessarily all useful (need the right data)

## Experimental Design
* Formulate you question in advance
* Formulate your question in advance
* **Statistical inference** = select subset, run experiment, calculate descriptive statistics, use inferential statistics to determine if results can be applied broadly
* ***[Inference]*** **Variability** = lower variability + clearer differences = decision
* ***[Inference]*** **Confounding** = underlying variable might be causing the correlation (sometimes called Spurious correlation)
@@ -118,5 +119,5 @@ $\pagebreak$
* **Accuracy** = Pr(correct outcome)
* **Data dredging** = use data to fit hypothesis
* **Good experiments** = have replication, measure variability, generalize problem, transparent
* Prediction is not inference, and be ware of data dredging
* Prediction is not inference, and beware of data dredging

10 changes: 5 additions & 5 deletions 2_RPROG/R Programming Course Notes.Rmd
@@ -2,13 +2,13 @@
title: "R Programming Course Notes"
author: "Xing Su"
output:
pdf_document:
toc: yes
toc_depth: 3
html_document:
highlight: pygments
theme: spacelab
toc: yes
pdf_document:
toc: yes
toc_depth: 3
---
$\pagebreak$

@@ -360,7 +360,7 @@ $\pagebreak$
* ***examples***
* `apply(x, 1, sum)` or `apply(x, 1, mean)` = find row sums/means
* `apply(x, 2, sum)` or `apply(x, 2, mean)` = find column sums/means
* `apply(x, 1, quantile, props = c(0.25, 0.75))` = find 25% 75% percentile of each row
* `apply(x, 1, quantile, probs = c(0.25, 0.75))` = find 25% 75% percentile of each row
* `a <- array(rnorm(2*2*10), c(2, 2, 10))` = creates 10 stacked 2x2 matrices
* `apply(a, c(1, 2), mean)` = returns a 2x2 matrix of means taken across the 10 matrices
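The `apply` calls above can be tried out with simulated data (dimensions here are arbitrary examples):

```r
# 20 rows x 10 columns of standard normal draws
x <- matrix(rnorm(200), 20, 10)

apply(x, 2, mean)                             # 10 column means
apply(x, 1, sum)                              # 20 row sums
apply(x, 1, quantile, probs = c(0.25, 0.75))  # 2 x 20 matrix of row quantiles

# 10 stacked 2x2 matrices; average over the 3rd dimension
a <- array(rnorm(2 * 2 * 10), c(2, 2, 10))
apply(a, c(1, 2), mean)                       # 2 x 2 matrix of element-wise means
```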

@@ -551,7 +551,7 @@ $\pagebreak$
### Larger Tables
* ***Note**: help page for read.table important*
* need to know how much RAM is required $\rightarrow$ calculating memory requirements
* `numRow` x `numCol` x 8 bytes/numeric value = size required in bites
* `numRow` x `numCol` x 8 bytes/numeric value = size required in bytes
* double the above results and convert into GB = amount of memory recommended
* set `comment.char = ""` to save time if there are no comments in the file
* specifying `colClasses` can make reading data much faster
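The memory rule of thumb above can be computed directly; the row and column counts here are made-up examples:

```r
# rough memory estimate for a numeric data frame:
# rows x cols x 8 bytes per numeric value
num_row <- 1500000
num_col <- 120
bytes <- num_row * num_col * 8   # 1,440,000,000 bytes
gb <- bytes / 2^30               # convert bytes -> GB
round(gb, 2)                     # ~1.34 GB of raw data
round(2 * gb, 2)                 # ~2.68 GB: rule of thumb doubles it
```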
20 changes: 10 additions & 10 deletions 3_GETDATA/Getting and Cleaning Data Course Notes.Rmd
@@ -63,7 +63,7 @@ $\pagebreak$
* ***Relative***: `setwd("./data")`, `setwd("../")` = move up in directory
* ***Absolute***: `setwd("/User/Name/data")`
* **Check if file exists and download file**
* `if(!file.exists("data"){dir.create("data")}`
* `if(!file.exists("./data")) {dir.create("./data")}`
* **Download file**
* `download.file(url, destfile = "directory/filename.extension", method = "curl")`
* `method = "curl"` [required on Mac for `https` URLs]
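A minimal sketch of the check-then-create-then-download pattern; the URL is a placeholder, and the download is guarded so the snippet runs without a network connection:

```r
# placeholder directory; the course notes use "./data"
data_dir <- file.path(tempdir(), "data")
if (!file.exists(data_dir)) {
  dir.create(data_dir)
}

# hypothetical URL; fill in a real one to actually download
file_url <- ""
dest <- file.path(data_dir, "dataset.csv")
if (nzchar(file_url) && !file.exists(dest)) {
  download.file(file_url, destfile = dest, method = "curl")
}
file.exists(data_dir)   # TRUE
```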
@@ -120,7 +120,7 @@ $\pagebreak$
* `xpathSApply(rootNode, "//price", xmlValue)` = get the values of all elements with tag "price"
* **extract content by attributes**
* `doc <- htmlTreeParse(url, useInternalNodes = TRUE)`
* `scores <- xpathSApply(doc, "//li@class='score'", xmlvalue)` = look for li elements with `class = "score"` and return their value
* `scores <- xpathSApply(doc, "//li[@class='score']", xmlValue)` = look for `li` elements with `class = "score"` and return their values



@@ -153,14 +153,14 @@ $\pagebreak$
## data.table
* inherits from `data.frame` (external package) $\rightarrow$ all functions that accept `data.frame` work on `data.table`
* can be much faster (written in C), ***much much faster*** at subsetting/grouping/updating
* **syntax**: `dt <- data.table(x = rnorm(9), y = rep(c(a, b, c), each = 3), z = rnorm(9)`
* **syntax**: `dt <- data.table(x = rnorm(9), y = rep(c("a","b","c"), each = 3), z = rnorm(9))`
* `tables()` = returns all data tables in memory
* shows name, nrow, MB, cols, key
* some subsetting works as before: `dt[2, ]`, `dt[dt$y == "a", ]`
* `dt[c(2, 3)]` = subset by rows, rows 2 and 3 in this case
* **column subsetting** (modified for `data.table`)
* argument after comma is called an ***expression*** (collection of statements enclosed in `{}`)
* `dt[, list(means(x), sum(z)]` = returns mean of x column and sum of z column (no `""` needed to specify column names, x and z in example)
* `dt[, list(mean(x), sum(z))]` = returns mean of x column and sum of z column (no `""` needed to specify column names, x and z in example)
* `dt[, table(y)]` = get table of y value (perform any functions)
* **add new columns**
* `dt[, w:=z^2]`
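The `data.table` operations above can be strung together in one short sketch (requires the `data.table` package):

```r
library(data.table)

dt <- data.table(x = rnorm(9), y = rep(c("a", "b", "c"), each = 3), z = rnorm(9))

dt[, list(mean(x), sum(z))]   # one-row table: mean of x, sum of z
dt[, table(y)]                # counts of each y value
dt[, w := z^2]                # adds column w by reference (no copy made)
dt[, .N, by = y]              # observation count per value of y
```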
@@ -176,9 +176,9 @@ $\pagebreak$
* **special variables**
* `.N` = returns integer, length 1, containing the number (essentially count)
* `dt <- data.table(x = sample(letters[1:3], 1E5, TRUE))` = generates data table
* `dt[, .N by =x]` = creates a table to count observations by the value of x
* `dt[, .N, by =x]` = creates a table to count observations by the value of x
* **keys** (quickly filter/subset)
* *example*: `dt <- data.table(x = rep(c("a", "b", "c"), each 100), y = rnorm(300))` = generates data table
* *example*: `dt <- data.table(x = rep(c("a", "b", "c"), each = 100), y = rnorm(300))` = generates data table
* `setkey(dt, x)` = set the key to the x column
* `dt['a']` = returns the rows of the data table where x = 'a' (effectively a filter)
* **joins** (merging tables)
@@ -187,9 +187,9 @@ $\pagebreak$
* `setkey(dt1, x); setkey(dt2, x)` = sets the keys for both data tables to be column x
* `merge(dt1, dt2)` = returns a single table, joining the two tables on column x and keeping only the rows whose x values appear in both tables (i.e. 'a')
* **fast reading of files**
* *example*: `big_df <- data.frame(norm(1e6), norm(1e6))` = generates data table
* *example*: `big_df <- data.frame(rnorm(1e6), rnorm(1e6))` = generates data table
* `file <- tempfile()` = generates a temporary file path
* `write.table(big.df, file=file, row.names=FALSE, col.names = TRUE, sep = "\t". quote = FALSE)` = writes the generated data from big.df to the empty temp file
* `write.table(big_df, file=file, row.names=FALSE, col.names = TRUE, sep = "\t", quote = FALSE)` = writes the generated data from big.df to the empty temp file
* `fread(file)` = read file and load data = much faster than `read.table()`
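The write-then-read round trip can be sketched with base R alone (`fread` comes from the `data.table` package; `read.table` is used here so the snippet stays self-contained, and the data is smaller than the 1e6 rows above to keep it quick):

```r
big_df <- data.frame(x = rnorm(1e4), y = rnorm(1e4))
file <- tempfile()                       # temporary file path
write.table(big_df, file = file, row.names = FALSE, col.names = TRUE,
            sep = "\t", quote = FALSE)
back <- read.table(file, header = TRUE, sep = "\t")
dim(back)                                # 10000 2
# with data.table loaded, fread(file) reads the same file much faster
```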


@@ -202,7 +202,7 @@ $\pagebreak$
* free, widely used open-source database software, common in Internet-based applications
* each row = record
* data are structured in databases $\rightarrow$ a series of tables (datasets) $\rightarrow$ fields (columns in a dataset)
* `dbConnect(MySQL(), user = "genome", db = "hg19", host = "genome-mysql.cse.ucsc.edu)` = open a connection to the database
* `dbConnect(MySQL(), user = "genome", db = "hg19", host = "genome-mysql.cse.ucsc.edu")` = open a connection to the database
* `db = "hg19"` = select specific database
* `MySQL()` can be replaced with other arguments to use other data structures
* `dbGetQuery(db, "show databases;")` = return the result from the specified SQL query executed through the connection
@@ -473,7 +473,7 @@ $\pagebreak$
## Subsetting and Sorting
* **subsetting**
* `x <- data.frame("var1" = sample(1:5), "var2" = sample(6:10), "var3" = (11:15))` = initiates a data frame with three named columns
* `x <- x[sample(1:5)` = this scrambles the rows
* `x <- x[sample(1:5),]` = this scrambles the rows
* `x$var2[c(2,3)] = NA` = setting the 2nd and 3rd element of the second column to NA
* `x[1:2, "var2"]` = subsetting the first two rows of the second column
* `x[(x$var1 <= 3 | x$var3 > 15), ]` = return all rows of x where the first column is less than or equal to three or where the third column is bigger than 15
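The subsetting steps above in one runnable sketch (the seed is arbitrary, just for reproducibility):

```r
set.seed(13)
x <- data.frame(var1 = sample(1:5), var2 = sample(6:10), var3 = 11:15)
x <- x[sample(1:5), ]             # scramble the rows
x$var2[c(2, 3)] <- NA             # introduce missing values
x[1:2, "var2"]                    # first two rows of var2
x[(x$var1 <= 3 | x$var3 > 15), ]  # logical OR across two columns
x[which(x$var1 <= 3), ]           # which() drops NA rows rather than returning them
```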
11 changes: 6 additions & 5 deletions 7_REGMODS/Regression Models Course Notes.Rmd
@@ -743,13 +743,14 @@ $\pagebreak$
### Intervals/Tests for Coefficients
* standard errors for coefficients
$$\begin{aligned}
Var(\hat \beta_1) & = Var\left(\frac{\sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X)}{((X_i - \bar X)^2)}\right) \\
(expanding) & = Var\left(\frac{\sum_{i=1}^n Y_i (X_i - \bar X) - \bar Y \sum_{i=1}^n (X_i - \bar X)}{((X_i - \bar X)^2)}\right) \\
& Since~ \sum_{i=1}^n X_i - \bar X = 0 \\
(simplifying) & = \frac{\sum_{i=1}^n Y_i (X_i - \bar X)}{(\sum_{i=1}^n (X_i - \bar X)^2)^2} \Leftarrow \mbox{denominator taken out of } Var\\
Var(\hat \beta_1) & = Var\left(\frac{\sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X)}{\sum_{i=1}^n (X_i - \bar X)^2}\right) \\
(expanding) & = Var\left(\frac{\sum_{i=1}^n Y_i (X_i - \bar X) - \bar Y \sum_{i=1}^n (X_i - \bar X)}{\sum_{i=1}^n (X_i - \bar X)^2}\right) \\
& Since~ \sum_{i=1}^n (X_i - \bar X) = 0 \\
(simplifying) & = \frac{Var\left(\sum_{i=1}^n Y_i (X_i - \bar X)\right)}{(\sum_{i=1}^n (X_i - \bar X)^2)^2} \Leftarrow \mbox{denominator taken out of } Var\\
& Since~ Var\left(\sum aY\right) = \sum a^2 Var\left(Y\right) \\
(Var(Y_i) = \sigma^2) & = \frac{\sigma^2 \sum_{i=1}^n (X_i - \bar X)^2}{(\sum_{i=1}^n (X_i - \bar X)^2)^2} \\
\sigma_{\hat \beta_1}^2 = Var(\hat \beta_1) &= \frac{\sigma^2 }{ \sum_{i=1}^n (X_i - \bar X)^2 }\\
\Rightarrow \sigma_{\hat \beta_1} &= \frac{\sigma}{ \sum_{i=1}^n X_i - \bar X} \\
\Rightarrow \sigma_{\hat \beta_1} &= \frac{\sigma}{ \sqrt {\sum_{i=1}^n (X_i - \bar X)^2}} \\
\\
\mbox{by the same derivation} \Rightarrow & \\
\sigma_{\hat \beta_0}^2 = Var(\hat \beta_0) & = \left(\frac{1}{n} + \frac{\bar X^2}{\sum_{i=1}^n (X_i - \bar X)^2 }\right)\sigma^2 \\
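The closing formula $\sigma_{\hat \beta_1} = \sigma / \sqrt{\sum_{i=1}^n (X_i - \bar X)^2}$ can be checked numerically against `lm()`; the simulated data and coefficients here are illustrative:

```r
set.seed(42)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

fit <- lm(y ~ x)
sigma_hat <- summary(fit)$sigma                     # residual standard deviation
se_manual <- sigma_hat / sqrt(sum((x - mean(x))^2)) # formula from the derivation
se_lm <- summary(fit)$coefficients["x", "Std. Error"]
all.equal(se_manual, se_lm)                         # TRUE
```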