Appendix C: Theory¶

Linear Regression¶

Models which consist of a linear combination of a particular set of functions X_k are called linear models, and linear regression can be used to minimize the difference between the model and data. The general form of this kind of model is

$y(x) = \sum a_k X_k(x)$

where X_k(x) are fixed functions of x that are called the basis functions, and a_k are the free parameters. Note that “linear” refers only to dependence of the model on the parameters a_k; the functions X_k(x) may be nonlinear. Minimization of the above linear model is performed with respect to the merit function

$S(a) = \sum_{i=1}^n \left [ y_i - \sum_{k=1}^{np} a_k X_k(x_i) \right ]^2$

The minimum of the above equation occurs where the derivative of S with respect to the parameters disappears. Substituting the linear model into this function, taking the first derivative, and setting this equal to zero gives the normal equations that can be solved directly for the parameters a_k.

Nonlinear Regression¶

Introduction¶

This program uses the Levenberg-Marquardt method to solve nonlinear regressions. This method combines the steepest-descent method and a Taylor series based method to obtain a fast, reliable technique for nonlinear optimization. This statement will be repeated later, but can bear stating at the beginning of this discussion: neither of the above optimization methods are ideal all of the time; the steepest descent method works best far away from the minimum, and the Taylor series method works best close to the minimum. The Levenberg-Marquardt (LM) algorithm allows for a smooth transition between these two methods as the iteration proceeds.

Development¶

In general, the data modeling equation (with any number of independent variables) can be written as follows:

$y = y(\vec x; \vec a)$

The above expression simply states that the dependent variable y can be expressed as a function of the independent variables $\vec x$ and vector of parameters $\vec a$ of arbitrary length. Note that using the ML method, any nonlinear equation with an arbitrary number of parameters can be used as the data modeling equation. Then, the merit function we are trying to minimize is

$\chi^2 (\vec a) = \sum_{i=1}^n \left ( \frac{y_i - y(\vec x_i; \vec a)}{ \sigma_i} \right )^2$

where N is the number of data points, $\vec x_i$ denotes the x data points, $y_i$ denotes the y data points, $\sigma_i$ is the standard deviation (uncertainty) at point i, and $y(x_i;a)$ is an arbitrary nonlinear model evaluated at the ith data point. This merit function simply measures the agreement between the data points and the parametric model; a smaller value for the merit function denotes better agreement. Commonly, this merit function is called the chi-square.

Now, we will take a step back to provide a framework of optimization methods. From the area of pure optimization, which will not be explained here, two basic ways of finding a function minimum are a Taylor series based method and the steepest-descent method. The Taylor series method states that sufficiently close to the minimum, the function can be approximated as a quadratic. Without detailed explanation of that method, a step from the current parameters a to the best parameters amin can be written as

$\vec a_{min} = \vec a_{cur} + H^{-1} \left [ - \Delta \chi^2 ( \vec a_{cur}) \right ]$

where H is the Hessian matrix (a matrix of second derivatives). If the approximation of the function as a quadratic is a poor one, then we might instead use the steepest-descent method, where a step to the best parameters from the current parameters is

$\vec a_{min} = \vec a_{cur} - c \Delta \chi^2 ( \vec a_{cur})$

This equation simply states that the next guess for the parameters is a step down the gradient of the merit function. The constant c is forced to be small enough that a small step is taken and the gradient is accurate in the region that the step is taken.

Since we know the chi-square function, we can directly differentiate to obtain the gradient vector and the Hessian matrix. Taking the partial derivatives of the merit function with respect to a gives

$\frac{\partial \chi^2}{\partial a_k} = -2 \sum_{i=1}^n \frac{y_i - y(\vec x_i; \vec a)}{\sigma_i^2} \frac{\partial y (\vec x_i; \vec a)}{\partial a_k}$

To obtain the Hessian matrix, take the gradient of the gradient above (so that we have a matrix of partial second derivatives):

$\frac{\partial^2 \chi^2}{\partial a_k \partial a_l} = -2 \sum_{i=1}^n \left [ \frac{1}{\sigma_i^2}\frac{\partial y(\vec x_i;\vec a)}{\partial a_k} \frac{\partial y (\vec x_i; \vec a)}{\partial a_l} - \frac{y_i - y(\vec x_i; \vec a)}{\sigma_i^2} \frac{\partial^2 y (\vec x_i; \vec a)}{\partial a_l \partial a_k} \right ]$

Now, for convenience, define the gradient vector and the curvature matrix as

$G_k = -\frac{1}{2} \frac{\partial \chi^2}{\partial a_k} = \sum_{i=1}^n \left [ \frac{y_i - y(\vec x_i; \vec a)}{\sigma_i^2} \frac{\partial y (\vec x_i; \vec a)}{\partial a_k} \right ]$

$C_{kl} = \frac{\partial^2 \chi^2}{\partial a_k \partial a_l} = \sum_{i=1}^n \left [ \frac{1}{\sigma_i^2}\frac{\partial y(\vec x_i;\vec a)}{\partial a_k} \frac{\partial y (\vec x_i; \vec a)}{\partial a_l} \right ]$

[Note that the second derivative term in C will be ignored because of two reasons: it tends to be small because it is multiplied by ( $y-y_i$ ), and it tends to destabilize the algorithm for badly fitting models or data sets contaminated with outliers. This action in no way affects the minimum found by the algorithm; it only affects the route in getting there.] So, the Taylor series method (inverse Hessian method) can be written as the following set of linear equations:

$\sum_{k=1}^{np} C_{kl} \delta a_l = G_k [1]$

where NP is the number of parameters in the model that is being optimized. This linear matrix will be our workhorse for this method after some modification; it can be solved for the increments da that, when added to the current approximation for the parameters, gives the next approximation. Likewise, we can substitute our “convenient” definitions into the steepest descent formula to obtain

$\delta a_l = c G_l [2]$

Neither of the aforementioned optimization methods are ideal all of the time; the steepest descent method works best far away from the minimum, and the Taylor series method works best close to the minimum. The Levenberg-Marquardt (LM) algorithm allows for a smooth transition between these two methods as the iteration proceeds.

The first issue in deriving the LM method is to attach some sort of scale to the constant c in the steepest-gradient method (equation 2). Typically, there is no obvious way to determine this number, even within an order of magnitude. However, in this case, we have access to the Hessian matrix; examining its members, we see that the scale on this constant must be $1/C_{ll}$ . But, that still may be too large, so let’s divide that scale by a nondimensional factor (l) and plan on setting this much larger than one so that the step will be reduced (for safety and stability).

The second issue to formulate the LM method is noting that the steepest-descent and Taylor series methods may be combined if we define a new matrix Mij by the following:

$M_{ii} = C_{ii} (1 + \lambda), i=j$

$M_{ij} = C_{ij}, i \ne j$

This matrix combines equations [1] and [2] into a convenient and compact form. So finally, we have a means of calculating the step da in the parameters by the following system of linear equations:

$\sum_{k=1}^{np} M_{kl} \delta a_l = G_k [3]$

Note that when l is large, the matrix M is forced to be diagonally dominant; consequently, the above equation is equivalent to the steepest descent method (equation 2). Conversely, when the parameter l goes to zero, the above equation is equivalent to the Taylor series method (equation 1). So, we vary l to switch between the two methods, continually calculating a parameter correction da that we apply to our most recent guess for the parameter vector.

Procedure¶

The steps that are taken in the LM algorithm are as follows:

Compute $\chi^2(a)$
Pick a conservative value for $\lambda$ (0.001 in CurveExpert)
Solve the linear equations (equation 3) for $\delta a$
Evaluate $\chi^2(a+\delta a)$
If $\chi^2(a+\delta a) >= \chi^2(a)$ , increase $\lambda$ by a factor (10 in CurveExpert) and go back to step [3].
If $\chi^2(a+\delta a) < \chi^2(a)$ , decrease $\lambda$ by a factor (10 in CurveExpert), correct the parameter vector by $a=a+\delta a$ , and go back to step [3].

Iteration is stopped when $| \chi^2(a+\delta a) - \chi^2(a)| < \epsilon$ . This tolerance $\epsilon$ is the value specified by the user.