Introduction to Computational Thinking and Data Science (4)
3.1 Fitting Curves to Data
- Hooke's Law

Hooke's law was proposed by R. Hooke in 1678. It is expressed as F = -k·x, or ΔF = -k·Δx, where the constant k is the stiffness (spring constant, elastic coefficient) of the object. In SI units, F is in newtons and x, the amount of (elastic) deformation, is in meters, so k is in newtons per meter. Numerically, the spring constant equals the elastic force exerted when the spring is stretched (or compressed) by one unit of length.
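As a tiny worked example (values made up): a spring with k = 100 N/m stretched by x = 0.05 m exerts a restoring force F = -k·x = -100 × 0.05 = -5 N, i.e. 5 N opposing the stretch.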
- Fitting Curves to Data
- Least squares: choose the curve that minimizes the sum of squared differences between predicted and observed values, as formalized below.
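Concretely, with observed values y[i] and a candidate curve's predictions p[i], least squares picks the curve minimizing the objective

\[\sum_i (y[i] - p[i])^2\]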
- Polynomials (representing the curve as a polynomial)
- Line: ax + b
- Parabola (U-shape): ax**2 + bx + c
- pylab.polyfit(x, y, n)
Polynomial curve fitting: p = polyfit(x, y, n) returns the coefficients of the degree-n polynomial p(x) that best fits the data in y in the least-squares sense. The coefficients in p are in descending powers, and the length of p is n+1.
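A minimal sketch of the call, on made-up points that lie roughly on y = 2x:

```python
import pylab

x = pylab.array([1.0, 2.0, 3.0, 4.0])
y = pylab.array([2.1, 3.9, 6.2, 7.8])
p = pylab.polyfit(x, y, 1)   # best-fit line; p holds [slope, intercept]
print(p)                     # slope near 2, intercept near 0
```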
- y = polyval(p, x)

Polynomial evaluation: y = polyval(p, x) evaluates the polynomial p at each point in x. The argument p is a vector of length n+1 whose elements are the coefficients of a degree-n polynomial, in descending powers.
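A minimal sketch of evaluating a polynomial from its coefficient vector:

```python
import pylab

p = [2, -1, 3]                       # 2x**2 - x + 3, descending powers
print(pylab.polyval(p, 0))           # 3
print(pylab.polyval(p, 2))           # 2*4 - 2 + 3 = 9
print(pylab.polyval(p, [0, 1, 2]))   # evaluated at each point: [3 4 9]
```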
```python
def fitData(fileName):
    xVals, yVals = getData(fileName)
    xVals = pylab.array(xVals)
    yVals = pylab.array(yVals)
    xVals = xVals*9.81  # convert mass to force (F = mg)
    pylab.plot(xVals, yVals, 'bo', label = 'Measured points')
    labelPlot()
    a, b = pylab.polyfit(xVals, yVals, 1)  # returns the best-fit coefficients
    estYVals = a*xVals + b                 # predictions from the best-fit line
    print('a =', a, 'b =', b)
    pylab.plot(xVals, estYVals, 'r',
               label = 'Linear fit, k = ' + str(round(1/a, 5)))
    pylab.legend(loc = 'best')
```
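getData and labelPlot are helper functions defined elsewhere in the lecture code (presumably reading the measurements and labeling the axes). The spring constant is reported as 1/a because the axes are distance (y) versus force (x): Hooke's law gives distance = F/k, so the fitted slope a estimates 1/k.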
- Coefficient of determination (R²)

\[R^2 = 1 - \frac{\sum_i (y[i] - p[i])^2}{\sum_i (y[i] - \mu)^2}\]

where y[i] are the observed values, p[i] the predicted values, and μ the mean of the observed values.

```python
import numpy

def rSquared(observed, predicted):
    error = ((predicted - observed)**2).sum()
    meanError = error/len(observed)             # numerator term, divided by n
    return 1 - (meanError/numpy.var(observed))  # numpy.var is the denominator over n
```
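A quick check of rSquared on made-up numbers:

```python
import numpy

observed = numpy.array([1.0, 2.0, 3.0, 4.0])
predicted = numpy.array([1.1, 1.9, 3.2, 3.8])
print(rSquared(observed, predicted))   # 0.98: close to 1, a good fit
```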
- The coefficient of determination, commonly known as R-squared, is a measure used in statistical analysis of how well a model explains the observed outcomes and predicts future ones. It indicates the proportion of the variability in the data set that the model explains, and is used as a guideline for the accuracy of the model.
- The coefficient of determination is a statistical summary of how well one model of the data can be expected to account for future outcomes.
- The coefficient of determination is used to express how much of the variability of one factor can be explained by its relationship to another factor.
- This correlation is known as the “goodness of fit.” A value of 1.0 indicates a perfect fit: the model explains all of the variation observed, so it is a very reliable basis for forecasts. A value of 0, on the other hand, indicates that the model fails to model the data at all. For a model with several variables, such as a multiple regression model, the adjusted R² is a better coefficient of determination. In economics, an R² value above 0.60 is seen as worthwhile.
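For reference, the usual adjusted R² formula, with n observations and m predictors, is:

\[\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - m - 1}\]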
3.2 Fits for Data, Overfitting & Cross-Validation
- A model needs to generalize: the model that fits the training data best can do worst on out-of-sample data.
- Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that simply repeated the labels of the samples it had just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting.
- Choosing an over-complex model can lead to overfitting to the training data.
- Over-complexity increases the risk of a model that works poorly on data not included in the training set (see the sketch after these bullets).
- “Everything should be made as simple as possible, but not simpler.” —Albert Einstein
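A minimal sketch of that risk on synthetic data: a degree-9 polynomial threads all ten noisy training points, driving training error to almost zero, but its error on held-out points from the same underlying line is typically far worse than a plain linear fit's.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = 2*x + rng.normal(0, 0.1, 10)        # noisy samples of a line
xTest = np.linspace(0.03, 0.97, 10)     # held-out points from the same line
yTest = 2*xTest + rng.normal(0, 0.1, 10)

for d in (1, 9):
    p = np.polyfit(x, y, d)             # degree 9 interpolates the training data
    trainMSE = ((np.polyval(p, x) - y)**2).mean()
    testMSE = ((np.polyval(p, xTest) - yTest)**2).mean()
    print('degree', d, 'train MSE =', trainMSE, 'test MSE =', testMSE)
```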
- Cross-Validation
- Use cross-validation results to guide the choice of model complexity.
- If the dataset is small, use leave-one-out cross-validation.
- If the dataset is large enough, use k-fold cross-validation or repeated-random-sampling validation.
- Cross-validation (CV) is an important technique for evaluating model performance. It is mainly used for model selection: choosing, among several candidates (different kinds of models, or the same kind with different hyperparameter combinations), the one that performs best for the problem at hand. The main CV variants are the following:
- K-fold

K-fold cross-validation is the most basic CV method. Randomly partition the training set into k equal parts; take one part as the validation set to evaluate the model and train on the remaining k-1 parts; repeat this step k times, taking a different part as the validation set each time. This yields k different models (not one model iterated k times) and k scores, and the model's fitness for the problem is judged from their combined performance (the average score, or some other summary). See the sketch after this item.

The choice of k deserves care: the larger k is, the smaller the bias on the training set, but larger training sets lead to larger variance, and the procedure takes longer; so picking a suitable k matters, and the usual empirical value is k = 10.
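A sketch of generating the k folds with scikit-learn's KFold (toy arrays; fitting and scoring omitted):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.arange(10)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for trainIdx, valIdx in kf.split(X):
    # fit on X[trainIdx], y[trainIdx]; score on X[valIdx], y[valIdx]
    print('train:', trainIdx, 'validate:', valIdx)
```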
- Leave-one-out (LOO)

Leave-one-out takes a different single sample from the N training samples as the validation set each time, with the remaining samples as the training set, so it trains N different models, each on N-1 samples. LOOCV is a special case of K-fold: when k = N, the two are identical. A sketch follows.
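The same idea with scikit-learn's LeaveOneOut, where each sample is held out exactly once:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(8).reshape(4, 2)     # N = 4 samples
loo = LeaveOneOut()
for trainIdx, valIdx in loo.split(X):
    print('train:', trainIdx, 'validate:', valIdx)   # N = 4 splits in total
```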
- Repeated Random Sampling (RRS)

Randomly “shuffle” the sample set, then split it into a training set and a test set. The ShuffleSplit iterator will generate a user-defined number of independent train/test dataset splits: samples are first shuffled and then split into a pair of train and test sets. A sketch follows.
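A sketch of ShuffleSplit on a toy array, requesting three independent splits:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20).reshape(10, 2)
ss = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
for trainIdx, testIdx in ss.split(X):
    print('train:', trainIdx, 'test:', testIdx)
```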
```python
import random
import numpy
import pylab
from sklearn.metrics import r2_score

L = [0.59, 18.38, 33.01, 54.14, 72.48, 89.8, 97.07, 112.6, 142.87, 199.84]
A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
xVals = pylab.array(A)
yVals = pylab.array(L)
numSubsets = 10
dimensions = [1, 2, 3, 4, 5]
rSquares = {}
for d in dimensions:
    rSquares[d] = []

def splitData(xVals, yVals):
    # randomly assign half the indices to the training set
    toTrain = random.sample(range(len(xVals)), len(xVals)//2)
    trainX, trainY, testX, testY = [], [], [], []
    for i in range(len(xVals)):
        if i in toTrain:
            trainX.append(xVals[i])
            trainY.append(yVals[i])
        else:
            testX.append(xVals[i])
            testY.append(yVals[i])
    return trainX, trainY, testX, testY

for f in range(numSubsets):
    trainX, trainY, testX, testY = splitData(xVals, yVals)
    for d in dimensions:
        model = pylab.polyfit(trainX, trainY, d)
        estYVals = pylab.polyval(model, testX)   # evaluate on the held-out half
        rSquares[d].append(r2_score(testY, estYVals))

for d in dimensions:
    mean = round(sum(rSquares[d])/len(rSquares[d]), 4)
    sd = round(numpy.std(rSquares[d]), 4)
    print('For dimensionality', d, 'mean =', mean, 'Std =', sd)
```
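Comparing the mean and standard deviation of the held-out R² across dimensionalities is what guides the choice of complexity: the highest-degree polynomial fits each training half best, but it is not necessarily the one that scores best on the test halves.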