CS395T Computational Statistics with Application to Bioinformatics
Prof. William H. Press
Course Lecture Notes (Spring, 2008)

Unit 1: Probability Theory and Bayesian Inference

Concepts: probability theorems and examples; inference, Bayesian inference; marginalization; nuisance parameter; posterior; Bernoulli trials; conjugate prior

MATLAB: syms, int, ezplot, diff, simplify, solve, pretty
Mathematica: Integrate, GenerateConditions, D, Plot, Simplify, Solve

Unit 2: Univariate Distributions and the Central Limit Theorem

Concepts: measures of central tendency, mean, median; normal (Gaussian), Student, Cauchy, lognormal, exponential, gamma, chi-square; PDF, CDF, characteristic function; Central Limit Theorem

NR3 (C++): Normaldist

Unit 3: Random Number Generators, Tests for Randomness, and Tail Tests Generally

Concepts: random number generator (RNG); multiplicative RNG, p-values, t-values; binomial distribution; chi-square test; 1- vs. 2-point distribution; Xorshift RNG; combinations of generators; p-value paradigm

MATLAB: uint32, mod, accumarray, betainc, normcdf, ceil, zeros, chi2cdf
MATLAB API (C): mex functions, mxGetData, mxGetM, mxGetN
NR3 (C++): nr3.h, struct Toyran1, Chisqdist, Ran

Unit 4: Tail Test Perils and Pitfalls: Chi-Square Misuse, Multiple Hypotheses, Stopping Criteria

Concepts: moments of chi-square variable, how chi-square becomes normal; chi-square failure for Poisson events; linear constraints; multiple hypothesis correction, Bonferroni, FDR; stopping rule paradoxes

MATLAB: symsum, betapdf, quad, linspace
Mathematica: Sum

Unit 5: More on Random Deviate Generation

Concepts: Xorshift generators; matrix powers by successive squaring; GCD and Gorilla randomness tests; transformation method; rejection method; ratio of uniforms method; squeezes; Leva's algorithm

MATLAB: ndgrid, eye, spy, jacobian, abs, det
Mathematica: FactorInteger

Unit 6: Understanding Distributions Known Only Empirically

Concepts: empirical distributions, samples; Kolmogorov-Smirnov (KS) test; IQagent data structure; genestats.dat data file, intron and exon lengths; plotting PDFs, uniformity of errors, PDFs on log scales; resampling; statistical significance vs. data quantity

MATLAB: readgenestats (custom), fopen, fclose, repmat, cell, textscan, dataset, error, cell2mat, plot, hold, log10, cdfplot, kstest2, arrayfun, loglog, semilogy
MATLAB API (C): mxCreateDoubleMatrix
NR3 (C++): IQagent

Unit 7: Fitting Models to Data and Estimating Errors in Model-Derived Quantities

Concepts: binned data; nonlinear least squares (NLS) fits; covariance matrix; goodness of fit; linear propagation of errors; Jacobian matrix; sampling the posterior distribution; bootstrap resampling

MATLAB: hist, bar, nlinfit, nlinfitw (custom), diag, randn, numel, chi2cdf, jacobian, subs, mvnrnd, mean, std, randsample, arrayfun

Unit 8: Contingency Tables, Experimental Protocols, and All That

Concepts: contingency tables; null hypothesis; Pearson statistic; retrospective or case-control; prospective or longitudinal; cross-sectional or snapshot; nuisance parameters, marginalization; hypergeometric distribution; multinomial distribution; Fisher Exact Test; Wald statistic; nominal, ordinal, cardinal tables; permutation test; bootstrap resampling; Dirichlet distribution

MATLAB: crosstab, contingencytable (custom), sum, repmat, size, squeeze, permute, ndgrid, accumarray, arrayfun, randperm, hist, randsample, gamrnd, mnrnd, reshape

Unit 9: Working with Multivariate Normal Distributions

Concepts: multivariate normal distribution; covariance matrix; spliceosome; linear correlation matrix; Cholesky decomposition; error ellipses

MATLAB: mean, cov, randsample, mvnrnd, corrcoef, chol, errorellipse (custom)

Unit 10: Hierarchical Clustering by Phylogenetic Trees

Concepts: phylogenetic trees; cladograms, additive trees, ultrametric trees; distance matrix, neighbor joining; agglomerative method; vertebrate species; gene chip; Hamming distance; rooted vs. unrooted; gene co-expression; Pearson r; TreeView

NR3 (C++): Phylo_nj, newick

Unit 11: Gaussian Mixture Models and EM Methods

Concepts: Gaussian mixture model (GMM); E-step, M-step, EM method; log-sum-exp; k-means clustering; Jensen's inequality; missing data problems

MATLAB: sum, repmat, arrayfun, ksdensity, mvnrnd
NR3 MATLAB interface: nr3_matlab.h, mxScalar, mxT, MatDoub, VecDoub

Unit 12: Maximum Likelihood Estimation (MLE) on a Statistical Model

Concepts: likelihood function; Fisher Information Matrix, Hessian; centered second difference; outliers; Student-t; AIC, BIC

MATLAB: hist, bar, fminsearch, hessian (custom), inv, jacobian, subs, arrayfun

Unit 13: Markov Chain Monte Carlo (MCMC)

Concepts: unnormalized distribution, posterior; Markov chain; detailed balance, ergodicity; Metropolis-Hastings algorithm, proposal distribution, acceptance probability; Poisson process, fluctuations

MATLAB: rand, subfunction

Unit 14: SVD, PCA, and the Linear Perspective

Concepts: data matrix, design matrix; standardize; Singular Value Decomposition (SVD); orthogonal basis; low-rank approximation; Principal Component Analysis (PCA); main effects; Gaussian random matrix; order statistic; dimensional reduction; eigengenes, eigenarrays; non-negative matrix factorization (NMF)

MATLAB: prctile, repmat, colormap, image, svd, axis, semilogy, randn, cumsum

Unit 15: Dynamic Programming, Viterbi, and Needleman-Wunsch

Concepts: Bellman-Dijkstra-Viterbi algorithm, forward pass, backward pass; error-correcting code; trellis graph; soft decision decoding; sequence alignment; Needleman-Wunsch algorithm; multiple alignment

NR3 (C++): stringalign

Unit 16: Hidden Markov Models

Concepts: Markov model; transition probability; irreducibility, aperiodicity, ergodicity; successive squaring method; Hidden Markov Model (HMM); symbol probability; state estimation; forward-backward algorithm, alpha pass, beta pass; Baum-Welch re-estimation; likelihood; EM method; Generalized HMM, Hidden Semi-Markov Model

NR3 (C++): HMM
NR3 MATLAB interface: hmmmex

Unit 17: Classifier Performance: ROC, Precision-Recall, and All That

Concepts: confusion matrix, TP, FP, TN, FN; conservative, liberal; performance curve; TPR, FPR, PPV, NPV, FDR; accuracy, sensitivity, specificity, precision, recall; ROC curve; convex hull; precision-recall curve

Mathematica: Solve, FullSimplify, substitution operator (/.)

Unit 18: Support Vector Machines (SVMs)

Concepts: linear separation; fat plane; maximum margin SVM; quadratic programming; primal vs. dual problem; soft-margin SVM; embedding; the kernel trick; linear, power, polynomial, sigmoid, Gaussian radial basis kernels; mitochondrial genes

Software: SVMlight

Unit 19: Wiener Filtering (and some Wavelets)

Concepts: signal, noise, filter; Wiener filter; best estimate in L2 norm; Fourier basis; Nyquist frequency; low-pass filter; signal and noise models; spatial (pixel) basis; smoothed image; wavelet basis; quadrature mirror filter; orthogonality conditions; moment conditions; pyramidal algorithm; DAUB; left- and right-derivative

MATLAB: fopen, fread, fclose, flipud, image, axis, fft2, ndgrid, randsample, ifft2, wiener2, wavelet2 (custom)

Unit 20: Multidimensional Interpolation on Scattered Data

Concepts: dimensional explosion; Shepard interpolation; Radial Basis Function interpolation; multiquadric, inverse multiquadric, thin plate spline, Gaussian; over- and under-smoothing; Laplace interpolation; boundary conditions; biconjugate gradient method; Gaussian process regression; linear prediction; Kriging; variogram

MATLAB: interp1, meshgrid, arrayfun, contour, cell, cellfun, shepinterp (custom), \-operator, std, laplaceinterp (custom), krig (custom)

Unit 21: Information Theory Characterization of Distributions

Concepts: character, alphabet, message; entropy; compression; log cut-down; fair game; payoff odds; protein, amino acid; monographic, digraphic entropy; flattened; conditional entropy; mutual information; Lagrange multiplier; Kelly's formula, proportional betting; CG richness, 3rd codon; Kullback-Leibler distance; log odds