7 changes: 4 additions & 3 deletions 1_DATASCITOOLBOX/Data Scientists Toolbox Course Notes.Rmd
@@ -20,13 +20,14 @@ $\pagebreak$
* `pwd` = print working directory (current directory)
* `clear` = clear screen
* `ls` = list stuff
* `-a` = see all (hidden)
* `-a` = see all (including hidden files)
* `-l` = details
* `cd` = change directory
* `mkdir` = make directory
* `touch` = creates an empty file
* `cp` = copy
* `cp <file> <directory>` = copy a file to a directory
* `cp <file> <renamed_file>` = rename a file
* `cp -r <directory> <newDirectory>` = copy a directory and all of its contents to a new directory
* `-r` = recursive
* `rm` = remove
@@ -102,7 +103,7 @@ $\pagebreak$
* **Big data** = now possible to collect data cheap, but not necessarily all useful (need the right data)

## Experimental Design
* Formulate you question in advance
* Formulate your question in advance
* **Statistical inference** = select subset, run experiment, calculate descriptive statistics, use inferential statistics to determine if results can be applied broadly
* ***[Inference]*** **Variability** = lower variability + clearer differences = decision
* ***[Inference]*** **Confounding** = underlying variable might be causing the correlation (sometimes called Spurious correlation)
@@ -118,5 +119,5 @@ $\pagebreak$
* **Accuracy** = Pr(correct outcome)
* **Data dredging** = use data to fit hypothesis
* **Good experiments** = have replication, measure variability, generalize problem, transparent
* Prediction is not inference, and be ware of data dredging
* Prediction is not inference, and beware of data dredging

10 changes: 5 additions & 5 deletions 2_RPROG/R Programming Course Notes.Rmd
@@ -2,13 +2,13 @@
title: "R Programming Course Notes"
author: "Xing Su"
output:
pdf_document:
toc: yes
toc_depth: 3
html_document:
highlight: pygments
theme: spacelab
toc: yes
pdf_document:
toc: yes
toc_depth: 3
---
$\pagebreak$

@@ -360,7 +360,7 @@ $\pagebreak$
* ***examples***
* `apply(x, 1, sum)` or `apply(x, 1, mean)` = find row sums/means
* `apply(x, 2, sum)` or `apply(x, 2, mean)` = find column sums/means
* `apply(x, 1, quantile, props = c(0.25, 0.75))` = find 25% 75% percentile of each row
* `apply(x, 1, quantile, probs = c(0.25, 0.75))` = find 25% 75% percentile of each row
* `a <- array(rnorm(2*2*10), c(2, 2, 10))` = creates 10 stacked 2x2 matrices
* `apply(a, c(1, 2), mean)` = returns a 2x2 matrix of means taken across the 10 matrices
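The `apply` calls above can be tried out with simulated data (dimensions here are arbitrary examples):

```r
# 20 rows x 10 columns of standard normal draws
x <- matrix(rnorm(200), 20, 10)

apply(x, 2, mean)                             # 10 column means
apply(x, 1, sum)                              # 20 row sums
apply(x, 1, quantile, probs = c(0.25, 0.75))  # 2 x 20 matrix of row quantiles

# 10 stacked 2x2 matrices; average over the 3rd dimension
a <- array(rnorm(2 * 2 * 10), c(2, 2, 10))
apply(a, c(1, 2), mean)                       # 2 x 2 matrix of element-wise means
```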

@@ -551,7 +551,7 @@ $\pagebreak$
### Larger Tables
* ***Note**: help page for read.table important*
* need to know how much RAM is required $\rightarrow$ calculating memory requirements
* `numRow` x `numCol` x 8 bytes/numeric value = size required in bites
* `numRow` x `numCol` x 8 bytes/numeric value = size required in bytes
* double the above results and convert into GB = amount of memory recommended
* set `comment.char = ""` to save time if there are no comments in the file
* specifying `colClasses` can make reading data much faster
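The memory rule of thumb above can be computed directly; the row and column counts here are made-up examples:

```r
# rough memory estimate for a numeric data frame:
# rows x cols x 8 bytes per numeric value
num_row <- 1500000
num_col <- 120
bytes <- num_row * num_col * 8   # 1,440,000,000 bytes
gb <- bytes / 2^30               # convert bytes -> GB
round(gb, 2)                     # ~1.34 GB of raw data
round(2 * gb, 2)                 # ~2.68 GB: rule of thumb doubles it
```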
20 changes: 10 additions & 10 deletions 3_GETDATA/Getting and Cleaning Data Course Notes.Rmd
@@ -63,7 +63,7 @@ $\pagebreak$
* ***Relative***: `setwd("./data")`, `setwd("../")` = move up in directory
* ***Absolute***: `setwd("/User/Name/data")`
* **Check if file exists and download file**
* `if(!file.exists("data"){dir.create("data")}`
* `if(!file.exists("./data")) {dir.create("./data")}`
* **Download file**
* `download.file(url, destfile = "directory/filename.extension", method = "curl")`
* `method = "curl"` [required on Mac for `https` URLs]
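A minimal sketch of the check-then-create-then-download pattern; the URL is a placeholder, and the download is guarded so the snippet runs without a network connection:

```r
# placeholder directory; the course notes use "./data"
data_dir <- file.path(tempdir(), "data")
if (!file.exists(data_dir)) {
  dir.create(data_dir)
}

# hypothetical URL; fill in a real one to actually download
file_url <- ""
dest <- file.path(data_dir, "dataset.csv")
if (nzchar(file_url) && !file.exists(dest)) {
  download.file(file_url, destfile = dest, method = "curl")
}
file.exists(data_dir)   # TRUE
```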
@@ -120,7 +120,7 @@ $\pagebreak$
* `xpathSApply(rootNode, "//price", xmlValue)` = get the values of all elements with tag "price"
* **extract content by attributes**
* `doc <- htmlTreeParse(url, useInternalNodes = TRUE)`
* `scores <- xpathSApply(doc, "//li@class='score'", xmlvalue)` = look for li elements with `class = "score"` and return their value
* `scores <- xpathSApply(doc, "//li[@class='score']", xmlValue)` = look for `li` elements with `class = "score"` and return their values



@@ -153,14 +153,14 @@ $\pagebreak$
## data.table
* inherits from `data.frame` (external package) $\rightarrow$ all functions that accept `data.frame` work on `data.table`
* can be much faster (written in C), ***much much faster*** at subsetting/grouping/updating
* **syntax**: `dt <- data.table(x = rnorm(9), y = rep(c(a, b, c), each = 3), z = rnorm(9)`
* **syntax**: `dt <- data.table(x = rnorm(9), y = rep(c("a","b","c"), each = 3), z = rnorm(9))`
* `tables()` = returns all data tables in memory
* shows name, nrow, MB, cols, key
* some subsetting works as before: `dt[2, ]`, `dt[dt$y == "a", ]`
* `dt[c(2, 3)]` = subset by rows, rows 2 and 3 in this case
* **column subsetting** (modified for `data.table`)
* argument after comma is called an ***expression*** (collection of statements enclosed in `{}`)
* `dt[, list(means(x), sum(z)]` = returns mean of x column and sum of z column (no `""` needed to specify column names, x and z in example)
* `dt[, list(mean(x), sum(z))]` = returns mean of x column and sum of z column (no `""` needed to specify column names, x and z in example)
* `dt[, table(y)]` = get table of y value (perform any functions)
* **add new columns**
* `dt[, w:=z^2]`
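The `data.table` operations above can be strung together in one short sketch (requires the `data.table` package):

```r
library(data.table)

dt <- data.table(x = rnorm(9), y = rep(c("a", "b", "c"), each = 3), z = rnorm(9))

dt[, list(mean(x), sum(z))]   # one-row table: mean of x, sum of z
dt[, table(y)]                # counts of each y value
dt[, w := z^2]                # adds column w by reference (no copy made)
dt[, .N, by = y]              # observation count per value of y
```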
@@ -176,9 +176,9 @@ $\pagebreak$
* **special variables**
* `.N` = returns integer, length 1, containing the number (essentially count)
* `dt <- data.table(x = sample(letters[1:3], 1E5, TRUE))` = generates data table
* `dt[, .N by =x]` = creates a table to count observations by the value of x
* `dt[, .N, by =x]` = creates a table to count observations by the value of x
* **keys** (quickly filter/subset)
* *example*: `dt <- data.table(x = rep(c("a", "b", "c"), each 100), y = rnorm(300))` = generates data table
* *example*: `dt <- data.table(x = rep(c("a", "b", "c"), each = 100), y = rnorm(300))` = generates data table
* `setkey(dt, x)` = set the key to the x column
* `dt['a']` = returns the rows of the data table where x = 'a' (effectively a filter)
* **joins** (merging tables)
@@ -187,9 +187,9 @@ $\pagebreak$
* `setkey(dt1, x); setkey(dt2, x)` = sets the keys for both data tables to be column x
* `merge(dt1, dt2)` = returns a single table, joining the two tables on column x and keeping only the rows whose x values appear in both tables (i.e. 'a')
* **fast reading of files**
* *example*: `big_df <- data.frame(norm(1e6), norm(1e6))` = generates data table
* *example*: `big_df <- data.frame(rnorm(1e6), rnorm(1e6))` = generates data table
* `file <- tempfile()` = generates a temporary file path
* `write.table(big.df, file=file, row.names=FALSE, col.names = TRUE, sep = "\t". quote = FALSE)` = writes the generated data from big.df to the empty temp file
* `write.table(big_df, file=file, row.names=FALSE, col.names = TRUE, sep = "\t", quote = FALSE)` = writes the generated data from big.df to the empty temp file
* `fread(file)` = read file and load data = much faster than `read.table()`
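The write-then-read round trip can be sketched with base R alone (`fread` comes from the `data.table` package; `read.table` is used here so the snippet stays self-contained, and the data is smaller than the 1e6 rows above to keep it quick):

```r
big_df <- data.frame(x = rnorm(1e4), y = rnorm(1e4))
file <- tempfile()                       # temporary file path
write.table(big_df, file = file, row.names = FALSE, col.names = TRUE,
            sep = "\t", quote = FALSE)
back <- read.table(file, header = TRUE, sep = "\t")
dim(back)                                # 10000 2
# with data.table loaded, fread(file) reads the same file much faster
```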


@@ -202,7 +202,7 @@ $\pagebreak$
* free, widely used open-source database software, common in Internet-based applications
* each row = record
* data are structured in databases $\rightarrow$ a series of tables (datasets) $\rightarrow$ fields (columns in a dataset)
* `dbConnect(MySQL(), user = "genome", db = "hg19", host = "genome-mysql.cse.ucsc.edu)` = open a connection to the database
* `dbConnect(MySQL(), user = "genome", db = "hg19", host = "genome-mysql.cse.ucsc.edu")` = open a connection to the database
* `db = "hg19"` = select specific database
* `MySQL()` can be replaced with other arguments to use other data structures
* `dbGetQuery(db, "show databases;")` = return the result from the specified SQL query executed through the connection
@@ -473,7 +473,7 @@ $\pagebreak$
## Subsetting and Sorting
* **subsetting**
* `x <- data.frame("var1" = sample(1:5), "var2" = sample(6:10), "var3" = (11:15))` = initiates a data frame with three named columns
* `x <- x[sample(1:5)` = this scrambles the rows
* `x <- x[sample(1:5),]` = this scrambles the rows
* `x$var2[c(2,3)] = NA` = setting the 2nd and 3rd element of the second column to NA
* `x[1:2, "var2"]` = subsetting the first two rows of the second column
* `x[(x$var1 <= 3 | x$var3 > 15), ]` = return all rows of x where the first column is less than or equal to three or where the third column is bigger than 15
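The subsetting steps above in one runnable sketch (the seed is arbitrary, just for reproducibility):

```r
set.seed(13)
x <- data.frame(var1 = sample(1:5), var2 = sample(6:10), var3 = 11:15)
x <- x[sample(1:5), ]             # scramble the rows
x$var2[c(2, 3)] <- NA             # introduce missing values
x[1:2, "var2"]                    # first two rows of var2
x[(x$var1 <= 3 | x$var3 > 15), ]  # logical OR across two columns
x[which(x$var1 <= 3), ]           # which() drops NA rows rather than returning them
```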
11 changes: 6 additions & 5 deletions 7_REGMODS/Regression Models Course Notes.Rmd
@@ -743,13 +743,14 @@ $\pagebreak$
### Intervals/Tests for Coefficients
* standard errors for coefficients
$$\begin{aligned}
Var(\hat \beta_1) & = Var\left(\frac{\sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X)}{((X_i - \bar X)^2)}\right) \\
(expanding) & = Var\left(\frac{\sum_{i=1}^n Y_i (X_i - \bar X) - \bar Y \sum_{i=1}^n (X_i - \bar X)}{((X_i - \bar X)^2)}\right) \\
& Since~ \sum_{i=1}^n X_i - \bar X = 0 \\
(simplifying) & = \frac{\sum_{i=1}^n Y_i (X_i - \bar X)}{(\sum_{i=1}^n (X_i - \bar X)^2)^2} \Leftarrow \mbox{denominator taken out of } Var\\
Var(\hat \beta_1) & = Var\left(\frac{\sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X)}{\sum_{i=1}^n (X_i - \bar X)^2}\right) \\
(expanding) & = Var\left(\frac{\sum_{i=1}^n Y_i (X_i - \bar X) - \bar Y \sum_{i=1}^n (X_i - \bar X)}{\sum_{i=1}^n (X_i - \bar X)^2}\right) \\
& Since~ \sum_{i=1}^n (X_i - \bar X) = 0 \\
(simplifying) & = \frac{Var\left(\sum_{i=1}^n Y_i (X_i - \bar X)\right)}{(\sum_{i=1}^n (X_i - \bar X)^2)^2} \Leftarrow \mbox{denominator taken out of } Var\\
& Since~ Var\left(\sum aY\right) = \sum a^2 Var\left(Y\right) \\
(Var(Y_i) = \sigma^2) & = \frac{\sigma^2 \sum_{i=1}^n (X_i - \bar X)^2}{(\sum_{i=1}^n (X_i - \bar X)^2)^2} \\
\sigma_{\hat \beta_1}^2 = Var(\hat \beta_1) &= \frac{\sigma^2 }{ \sum_{i=1}^n (X_i - \bar X)^2 }\\
\Rightarrow \sigma_{\hat \beta_1} &= \frac{\sigma}{ \sum_{i=1}^n X_i - \bar X} \\
\Rightarrow \sigma_{\hat \beta_1} &= \frac{\sigma}{ \sqrt {\sum_{i=1}^n (X_i - \bar X)^2}} \\
\\
\mbox{by the same derivation} \Rightarrow & \\
\sigma_{\hat \beta_0}^2 = Var(\hat \beta_0) & = \left(\frac{1}{n} + \frac{\bar X^2}{\sum_{i=1}^n (X_i - \bar X)^2 }\right)\sigma^2 \\
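The closing formula $\sigma_{\hat \beta_1} = \sigma / \sqrt{\sum_{i=1}^n (X_i - \bar X)^2}$ can be checked numerically against `lm()`; the simulated data and coefficients here are illustrative:

```r
set.seed(42)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

fit <- lm(y ~ x)
sigma_hat <- summary(fit)$sigma                     # residual standard deviation
se_manual <- sigma_hat / sqrt(sum((x - mean(x))^2)) # formula from the derivation
se_lm <- summary(fit)$coefficients["x", "Std. Error"]
all.equal(se_manual, se_lm)                         # TRUE
```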