.. _modeling:

Modeling your data
*******************************************************

Introduction
=========================

Given a set of data points, often called "observations," a common need is to condense the data by fitting it to a model in the form of a parametric equation. This "model equation" can be anything that the user desires -- it can range from a simple polynomial to an extremely complex model with many parameters. One important consideration in selecting an appropriate model is the underlying law that the data represents. Optimally, the model should be chosen to reflect that law so that the parameters in the curve fit have physical interpretation and meaning.

Data modeling may be performed by several techniques: interpolation, regression, or data smoothing. Interpolation and regression are the types supported by CurveExpert. Interpolation guarantees that the fitted curve will pass through each and every data point. Regression simply ensures that the "merit function", which is an arbitrary function that measures the disagreement between the data and the model, is minimized. In this approach, the model parameters are adjusted until the merit function becomes as small as possible. One of the features of CurveExpert is the large number of nonlinear regression models that are available; this type of model is the most commonly used in real-world applications.

Linear Regression
=========================

All linear regressions are accessible in the first half of the "Apply Fit" menu. Models which consist of a linear combination of a particular set of functions X\ :sub:`k` are called linear models, and linear regression can be used to minimize the difference between the model and data. The general form of this kind of model is

.. math:: y(x) = \sum_{k=1}^{np} a_k X_k(x)

where X\ :sub:`k`\ (x) are fixed functions of x that are called the basis functions, and a\ :sub:`k` are the free parameters. Note that "linear" refers only to the dependence of the model on the parameters a\ :sub:`k`\ ; the functions X\ :sub:`k`\ (x) may themselves be nonlinear. Minimization of the above linear model is performed with respect to the merit function

.. math:: S(a) = \sum_{i=1}^n \left [ y_i - \sum_{k=1}^{np} a_k X_k(x_i) \right ]^2

The minimum of the above equation occurs where the derivative of S with respect to each parameter vanishes. Taking the first derivative of S with respect to each a\ :sub:`k` and setting it equal to zero gives the normal equations, which can be solved directly for the parameters a\ :sub:`k`\ .

Linear Regression Models
--------------------------------------

In CurveExpert, all of the linear models implemented are linear combinations of powers of x (x\ :sup:`n`\ ). These models are listed below:

* Linear Fit: y = a + bx
* Quadratic Fit: y = a + bx + cx^2
* Polynomial Fit: y = a + bx + cx^2 + dx^3 + ...

Setting the Polynomial Degree
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For a polynomial regression, you must specify the degree of polynomial that you would like; CurveExpert will prompt you for that information. Polynomials usually tend to give decent curve fits to almost any data set, but higher-order polynomials (as a rule of thumb, over 9) tend to oscillate badly. Also, polynomials offer no insight into the model that governs the data.

.. image:: images/polydeg.png
   :align: center

Note that you may not choose a polynomial degree greater than or equal to the number of points in the data set minus one. A polynomial degree equal to the number of points in the data set minus one is simply interpolation, and that type of curve fitting can be obtained via the Interpolation menu. Higher degrees leave the problem underdetermined (more parameters than data points), so the solution is not unique. Note also that you may not choose a polynomial degree of one or two; use the built-in linear and quadratic curve fits instead. In short, the polynomial degree that you can enter in this dialog box is limited to 2 < degree < n-1.
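To make the normal-equations solution concrete, here is a minimal sketch in Python using NumPy, with hypothetical data invented purely for illustration. It shows the mathematics above (the basis functions are the columns of a design matrix, and the least-squares problem is solved for the parameters); it is not CurveExpert's internal implementation.

.. code-block:: python

   import numpy as np

   # Hypothetical observations, for illustration only
   x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
   y = np.array([0.2, 1.1, 4.3, 8.9, 16.1, 24.8])

   degree = 3  # must satisfy 2 < degree < n-1 for a polynomial regression

   # Design matrix whose columns are the basis functions X_k(x) = x^k
   X = np.vander(x, degree + 1, increasing=True)

   # The normal equations (X^T X) a = X^T y define the minimum of S(a);
   # lstsq solves the same least-squares problem by a numerically stabler route.
   a, *_ = np.linalg.lstsq(X, y, rcond=None)
   print("fitted coefficients a_k:", a)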
Nonlinear Regression
============================

All nonlinear regressions are accessible in the last half of the "Apply Fit" menu. These models include the CurveExpert built-in models, as well as any user-defined models that you have designed.

This program uses the Levenberg-Marquardt method to solve nonlinear regressions. This method combines the steepest-descent method and a Taylor series based method to obtain a fast, reliable technique for nonlinear optimization. This statement will be repeated later, but bears stating at the beginning of this discussion: neither of these optimization methods is ideal all of the time; the steepest-descent method works best far away from the minimum, and the Taylor series method works best close to the minimum. The Levenberg-Marquardt (LM) algorithm allows for a smooth transition between these two methods as the iteration proceeds.

In general, the data modeling equation (with one independent variable) can be written as follows:

.. math:: y = y(x; \vec a)

The above expression simply states that the dependent variable y can be expressed as a function of the independent variable x and a vector of parameters a of arbitrary length. Note that using the LM method, any nonlinear equation with an arbitrary number of parameters can be used as the data modeling equation. Then, the merit function we are trying to minimize is

.. math:: \chi^2 (\vec a) = \sum_{i=1}^n \left ( \frac{y_i - y(x_i; \vec a)}{ \sigma_i} \right )^2

where n is the number of data points, :math:`x_i` denotes the x data points, :math:`y_i` denotes the y data points, :math:`\sigma_i` is the standard deviation (uncertainty) at point i, and :math:`y(x_i;\vec a)` is an arbitrary nonlinear model evaluated at the ith data point. This merit function simply measures the agreement between the data points and the parametric model; a smaller value for the merit function denotes better agreement. Commonly, this merit function is called the chi-square.

Now, we will take a step back to provide a framework of optimization methods. From the area of pure optimization, which will not be explained here, two basic ways of finding a function minimum are a Taylor series based method and the steepest-descent method. The Taylor series method states that sufficiently close to the minimum, the function can be approximated as a quadratic. Without detailed explanation of that method, a step from the current parameters a to the best parameters a\ :sub:`min` can be written as

.. math:: \vec a_{min} = \vec a_{cur} + H^{-1} \left [ - \nabla \chi^2 ( \vec a_{cur}) \right ]

where H is the Hessian matrix (a matrix of second derivatives). If the approximation of the function as a quadratic is a poor one, then we might instead use the steepest-descent method, where a step to the best parameters from the current parameters is

.. math:: \vec a_{min} = \vec a_{cur} - c \nabla \chi^2 ( \vec a_{cur})

This equation simply states that the next guess for the parameters is a step down the gradient of the merit function. The constant c is forced to be small enough that only a small step is taken, so that the gradient remains accurate over the region in which the step is taken.
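Before differentiating the chi-square, here is a minimal Python sketch of the merit function itself. The exponential model and the data are hypothetical, chosen only to make the expression concrete; this is an illustration of the formula, not CurveExpert code.

.. code-block:: python

   import numpy as np

   def chi_square(params, model, x, y, sigma):
       """Merit function: sum of squared, sigma-weighted residuals."""
       residuals = (y - model(x, *params)) / sigma
       return np.sum(residuals**2)

   # Hypothetical model y = a * exp(b x) and invented data
   model = lambda x, a, b: a * np.exp(b * x)
   x = np.array([0.0, 1.0, 2.0, 3.0])
   y = np.array([1.1, 2.6, 7.2, 20.1])
   sigma = np.ones_like(y)            # uniform uncertainty

   print(chi_square([1.0, 1.0], model, x, y, sigma))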
Since we know the chi-square function, we can directly differentiate to obtain the gradient vector and the Hessian matrix. Taking the partial derivatives of the merit function with respect to a gives

.. math:: \frac{\partial \chi^2}{\partial a_k} = -2 \sum_{i=1}^n \frac{y_i - y(x_i; \vec a)}{\sigma_i^2} \frac{\partial y (x_i; \vec a)}{\partial a_k}

To obtain the Hessian matrix, take the gradient of the gradient above (so that we have a matrix of partial second derivatives):

.. math:: \frac{\partial^2 \chi^2}{\partial a_k \partial a_l} = 2 \sum_{i=1}^n \left [ \frac{1}{\sigma_i^2}\frac{\partial y(x_i;\vec a)}{\partial a_k} \frac{\partial y (x_i; \vec a)}{\partial a_l} - \frac{y_i - y(x_i; \vec a)}{\sigma_i^2} \frac{\partial^2 y (x_i; \vec a)}{\partial a_l \partial a_k} \right ]

Now, for convenience, define the gradient vector and the curvature matrix as

.. math:: G_k = -\frac{1}{2} \frac{\partial \chi^2}{\partial a_k} = \sum_{i=1}^n \left [ \frac{y_i - y(x_i; \vec a)}{\sigma_i^2} \frac{\partial y (x_i; \vec a)}{\partial a_k} \right ]

.. math:: C_{kl} = \frac{1}{2} \frac{\partial^2 \chi^2}{\partial a_k \partial a_l} = \sum_{i=1}^n \left [ \frac{1}{\sigma_i^2}\frac{\partial y(x_i;\vec a)}{\partial a_k} \frac{\partial y (x_i; \vec a)}{\partial a_l} \right ]

Note that the second derivative term in C is ignored for two reasons: it tends to be small because it is multiplied by the residual :math:`y_i - y(x_i; \vec a)`, and it tends to destabilize the algorithm for badly fitting models or data sets contaminated with outliers. Dropping this term in no way affects the minimum found by the algorithm; it only affects the route in getting there.
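These two definitions translate directly into code. The sketch below (illustrative only, and not CurveExpert's implementation) builds G and C for an arbitrary model, approximating the partial derivatives :math:`\partial y/\partial a_k` by forward differences; an analytic Jacobian could be substituted where available.

.. code-block:: python

   import numpy as np

   def gradient_and_curvature(params, model, x, y, sigma, h=1e-6):
       """Build the gradient vector G_k and curvature matrix C_kl."""
       n, n_par = len(x), len(params)
       y0 = model(x, *params)
       J = np.empty((n, n_par))             # J[i, k] = dy(x_i; a)/da_k
       for k in range(n_par):
           p = list(params)
           p[k] += h                        # forward-difference partial
           J[:, k] = (model(x, *p) - y0) / h
       w = 1.0 / sigma**2
       G = J.T @ (w * (y - y0))             # G_k = sum_i (y_i - y)/sigma^2 * dy/da_k
       C = (J * w[:, None]).T @ J           # C_kl = sum_i (1/sigma^2) dy/da_k dy/da_l
       return G, C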
So, the Taylor series method (inverse Hessian method) can be written as the following set of linear equations:

.. math:: \sum_{l=1}^{np} C_{kl} \, \delta a_l = G_k \qquad [1]

where np is the number of parameters in the model that is being optimized. This linear system will be our workhorse for this method after some modification; it can be solved for the increments :math:`\delta a` that, when added to the current approximation for the parameters, give the next approximation. Likewise, we can substitute our "convenient" definitions into the steepest-descent formula to obtain

.. math:: \delta a_l = c G_l \qquad [2]

Neither of the aforementioned optimization methods is ideal all of the time; the steepest-descent method works best far away from the minimum, and the Taylor series method works best close to the minimum. The Levenberg-Marquardt (LM) algorithm allows for a smooth transition between these two methods as the iteration proceeds.

The first issue in deriving the LM method is to attach some sort of scale to the constant c in the steepest-descent method (equation 2). Typically, there is no obvious way to determine this number, even within an order of magnitude. However, in this case, we have access to the Hessian matrix; examining its members, we see that the scale on this constant must be :math:`1/C_{ll}`. But, that still may be too large, so divide that scale by a nondimensional factor :math:`\lambda`, and plan on setting this much larger than one so that the step will be reduced (for safety and stability).

The second step in formulating the LM method is to note that the steepest-descent and Taylor series methods may be combined if we define a new matrix M by the following:

.. math:: M_{ii} = C_{ii} (1 + \lambda)

.. math:: M_{ij} = C_{ij}, \quad i \ne j

This matrix combines equations [1] and [2] into a convenient and compact form. So finally, we have a means of calculating the step :math:`\delta a` in the parameters by the following system of linear equations:

.. math:: \sum_{l=1}^{np} M_{kl} \, \delta a_l = G_k \qquad [3]

Note that when :math:`\lambda` is large, the matrix M is forced to be diagonally dominant; consequently, the above equation is equivalent to the steepest-descent method (equation 2). Conversely, when :math:`\lambda` goes to zero, the above equation is equivalent to the Taylor series method (equation 1). So, we vary :math:`\lambda` to switch between the two methods, continually calculating a parameter correction :math:`\delta a` that we apply to our most recent guess for the parameter vector. The steps that are taken in the LM algorithm are as follows:

#. Compute :math:`\chi^2(a)`.
#. Pick a conservative value for :math:`\lambda` (0.001 in CurveExpert).
#. Solve the linear equations (equation 3) for :math:`\delta a`.
#. Evaluate :math:`\chi^2(a+\delta a)`.
#. If :math:`\chi^2(a+\delta a) \ge \chi^2(a)`, increase :math:`\lambda` by a factor (10 in CurveExpert) and go back to step 3.
#. If :math:`\chi^2(a+\delta a) < \chi^2(a)`, decrease :math:`\lambda` by a factor (10 in CurveExpert), correct the parameter vector by :math:`a=a+\delta a`, and go back to step 3.

Iteration is stopped when :math:`| \chi^2(a+\delta a) - \chi^2(a)| < \epsilon`. This tolerance :math:`\epsilon` is the value specified in the File|Preferences dialog.
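As an illustration only, the loop above can be sketched in Python by reusing the ``chi_square`` and ``gradient_and_curvature`` helpers from the earlier sketches. This is a minimal rendering of the listed steps under those assumptions, not CurveExpert's source; in practice, library routines such as SciPy's ``scipy.optimize.curve_fit`` provide a production-quality Levenberg-Marquardt solver.

.. code-block:: python

   import numpy as np

   # Assumes chi_square() and gradient_and_curvature() from the sketches above.
   def levenberg_marquardt(params, model, x, y, sigma,
                           lam=1e-3, factor=10.0, eps=1e-9, max_iter=200):
       a = np.asarray(params, dtype=float)
       chi2 = chi_square(a, model, x, y, sigma)
       for _ in range(max_iter):
           G, C = gradient_and_curvature(a, model, x, y, sigma)
           M = C + lam * np.diag(np.diag(C))   # M_ii = C_ii * (1 + lambda)
           da = np.linalg.solve(M, G)          # equation [3]
           chi2_new = chi_square(a + da, model, x, y, sigma)
           if chi2_new >= chi2:
               lam *= factor                   # step rejected: lean toward steepest descent
               continue
           lam /= factor                       # step accepted: lean toward Taylor series
           a = a + da
           if abs(chi2_new - chi2) < eps:      # stopping criterion
               chi2 = chi2_new
               break
           chi2 = chi2_new
       return a, chi2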
Setting Initial Guesses
-----------------------------------------

All nonlinear regression algorithms require initial guesses to commence the search for the optimum regression parameters. As long as AutoGuess is enabled, CurveExpert will normally calculate initial guesses for all built-in nonlinear regressions automatically (linear regressions do not require initial guesses). However, in the case of a poorly performing curve fit, you might want to set these initial guesses manually. Also, you must set these initial guesses manually if you are using a user-defined nonlinear regression model.

In many cases, the physical interpretation of the individual parameters in a regression model may not be clear; hence, an inordinate amount of time can be spent simply trying to obtain appropriate initial guesses. To facilitate this process, CurveExpert provides a dynamic view plot which adjusts itself every time you adjust the initial parameters. This allows you to see the behavior of the curve and adjust the parameters by hand in such a way that the curve is fairly close to the data points. To display this plot, simply click the Show Plot button. Subsequently, any changes you make to the initial guesses will be shown immediately on the dynamic view plot.

Hint: CurveExpert automatically increases the plot drawing speed by raising the "Resolution" parameter. To increase the speed even more, right-click on the plot window, choose "Attributes" from the popup menu, and increase the "Resolution" parameter further.

The initial guess window is shown below:

.. image:: images/initguess.png
   :align: center

Initial Guess Control Buttons
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- OK: Press OK to accept the initial guesses entered in the Parameter Guesses section and let CurveExpert perform the nonlinear regression.
- Cancel: Press Cancel to abort the regression and return to the main CurveExpert window.
- Help: Press Help to get CurveExpert help on setting initial guesses for a nonlinear regression.

Model Display
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To ensure that the meaning of the parameters is clear, this display shows you the equation for the model that you are providing initial guesses for.

Parameter Guesses
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Guess Listbox/Edit Control: Enter an initial guess for each parameter in the corresponding field. Refer to the Model Display for the use of each of the parameters in your nonlinear regression model. The number of parameters that you must enter matches the number of parameters in the model, and you may scroll up or down as necessary.
- Show Plot: Click this button to dynamically view the changes that you make to the regression parameters. This will greatly assist you in obtaining good initial estimates for the parameters.

Regression Weighting Schemes
----------------------------------------------

CurveExpert implements five different regression weighting schemes, so that you have a range of methods to choose from. Weighting allows the user to define how much influence each data point will have on the final set of fitted parameters; a large weight denotes that a particular point influences the parameters more (pulls the curve toward itself strongly), and a small weight on a data point decreases its effect on the parameters.

Select "Tools\|Weighting Schemes" to pick the weighting to use for a particular curve fit. Only one scheme can be selected at a time; the currently selected scheme is denoted by a check in the menu mentioned above. The particular type of weighting used in a curve fit is reported in the Graph Information page of the Graph Properties dialog. Also note that weighting only has influence on regressions; any item from the Interpolation menu will not be affected by weighting, since (by definition) interpolations must pass through each data point.

The types of weighting offered by CurveExpert are the following:

None
~~~~

The regression is performed as is, and no weighting is applied during the curve fitting process. This is the most common type of regression analysis; each point has equal influence on the final converged curve. No weighting is equivalent to weighting each data point by 1.0.

By Uncertainty
~~~~~~~~~~~~~~~~

The regression is weighted based on the supplied uncertainties in each data point. The uncertainties (in the STD column in the built-in spreadsheet) are interpreted as one standard deviation around the data point. So, during the regression, the calculation will be weighted by

.. math:: \frac{1}{\sigma_i^2}

The data points with high uncertainty will contribute less to the converged curve, and data points with low uncertainty will contribute more. This menu choice is not available if CurveExpert detects that no standard deviation data is present in the built-in spreadsheet. Also, if a file that has standard deviation data is read in, CurveExpert automatically selects uncertainty weighting for you.

By 1/x
~~~~~~~~~~~

This type of weighting is rarely used, but weights each data point by the inverse of its x value. Be careful *not* to use this type of weighting when zero or negative x data exists in the data set.

By 1/x^2
~~~~~~~~~

This type of weighting is rarely used, but weights each data point by the inverse of its x value squared. The farther the point is from x=0, the less influence it will have on the final parameter set. This type of weighting is sometimes used when it is known that the uncertainty is a percentage of the independent variable's value.

By 1/y
~~~~~~~~~

This type of weighting is rarely used, but weights each data point by the inverse of its y value. Be careful *not* to use this type of weighting when zero or negative y data exists in the data set.

By 1/y^2
~~~~~~~~~

This type of weighting is sometimes used when uncertainties in each data point are not known, but the uncertainty tends to be a percentage of the measured y value. The larger the y value is, the less weight a particular point has.
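To illustrate how these schemes enter the merit function, here is a short sketch using SciPy's ``curve_fit`` with invented data (this demonstrates the weighting mathematics, not CurveExpert itself). The ``sigma`` argument plays the role of the STD column, so each point enters the merit function with weight :math:`1/\sigma_i^2`; passing y itself as ``sigma`` reproduces "By 1/y^2" weighting.

.. code-block:: python

   import numpy as np
   from scipy.optimize import curve_fit

   def model(x, a, b):
       return a * np.exp(b * x)

   # Hypothetical data for illustration
   x = np.array([1.0, 2.0, 3.0, 4.0])
   y = np.array([2.7, 7.5, 19.8, 54.2])

   # "By Uncertainty": each point weighted by 1/sigma_i^2
   sigma = np.array([0.1, 0.3, 1.0, 3.0])
   popt_w, _ = curve_fit(model, x, y, p0=[1.0, 1.0], sigma=sigma)

   # "By 1/y^2": equivalent to treating sigma_i as proportional to y_i
   popt_y, _ = curve_fit(model, x, y, p0=[1.0, 1.0], sigma=y)

   print(popt_w, popt_y)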
Spline Interpolation
=============================

CurveExpert implements four types of spline interpolation: linear, quadratic, and cubic splines, and tension splines.

.. warning:: The data must be sorted for spline interpolations to complete successfully. To sort your data, choose Data\|Manipulate from the main menu and select the "Sort" tab.

Linear Splines
------------------------

Linear splines are the simplest form of spline -- they simply approximate the range between data points by a straight line. The purest definition of a linear spline is that it simply ensures that each discrete spline matches the data points. These splines are simply a "connect-the-dots" approach that can be useful in increasing the resolution of a given table of values by using linear interpolation between each point. However, if this is the application, cubic splines provide a much more accurate representation between the data points.

Note also that the data must be sorted for the spline operation to complete successfully. To sort your data, choose Data\|Manipulate and select the Sort tab.

Quadratic Splines
-------------------

Quadratic splines interpolate the range between data points by a 2nd order polynomial (quadratic). Quadratic spline curve fitting ensures that each spline is equal to the data points, and that the 1st derivative of the splines is continuous, even at the knots. Quadratic splines perform better than simple linear splines, but are often characterized by "swinging" too high or too low past the data points.

Quadratic splines must have one free condition specified for the first spline. CurveExpert follows common practice, which is to specify that the 2nd derivative of the first spline is zero -- this has the effect of connecting the first and second points with a straight line. The implementation of quadratic splines used in CurveExpert requires a good deal of memory and crunching time (even more than that of the cubic spline). Do not use this curve fit on an excessive number of data points, and use caution when applying this class of splines.

Note also that the data must be sorted for the spline operation to complete successfully. To sort your data, choose Data\|Manipulate and select the Sort tab.

Cubic Splines
----------------------------

Cubic splines interpolate the range between data points by 3rd order polynomials (cubics). Cubic spline curve fitting ensures that each spline is equal to the data points, that the 1st derivatives are continuous at the knots, and that the 2nd derivatives are continuous at the knots. Since discontinuities in 3rd and higher derivatives cannot be visually detected, cubic splines have an "aesthetically pleasing" appearance. In fact, cubic splines mimic the behavior of the drafting splines used in olden days.

Various end conditions can be used for cubic splines. Common practice is to use "natural" end conditions, specifying that the 2nd derivatives of the end splines are zero at the endpoints. This gives the visual appearance of "straightening out" at the ends, which is the behavior of a physical drafting spline. This is the type of end condition implemented in this version of CurveExpert.

Note also that the data must be sorted for the spline operation to complete successfully. To sort your data, choose Data\|Manipulate and select the Sort tab.
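For comparison, SciPy's ``CubicSpline`` supports the "natural" end conditions described above. The following sketch, with invented (sorted) data, illustrates the same kind of interpolation outside of CurveExpert:

.. code-block:: python

   import numpy as np
   from scipy.interpolate import CubicSpline

   # Data must be sorted in x for spline interpolation
   x = np.array([0.0, 1.0, 2.5, 3.7, 5.0])
   y = np.array([1.0, 2.7, 1.4, 3.9, 2.2])

   # 'natural' sets the 2nd derivative to zero at both endpoints,
   # matching the end conditions described above.
   spline = CubicSpline(x, y, bc_type='natural')

   x_fine = np.linspace(x[0], x[-1], 101)
   y_fine = spline(x_fine)   # interpolated values; spline(x) reproduces y exactly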
Tension Splines
----------------------

Tension splines behave similarly to cubic splines, with the addition of a parameter (the tension) which specifies how "tightly" the spline conforms to each knot. A low tension parameter will yield a spline similar to a cubic spline, whereas at high tension, the spline will approach a linear spline. The Set Tension dialog box allows you to set this parameter before the tension spline curve fit is performed. You can experiment with the tension to obtain the visual quality that you want in your spline.

Note that there is no easy way to express a tension spline, except by creating a table of its values. It is also very ill-behaved -- expect a few complaints from CurveExpert about floating point errors. To alleviate this type of error, make sure that your data is scaled to an order of one, and use intermediate tension parameters (0.1 to 20).

Note also that the data must be sorted for the spline operation to complete successfully. To sort your data, choose Data\|Manipulate and select the Sort tab.

Lagrangian Interpolation
=====================================

.. warning:: The data must be sorted for Lagrangian interpolation to complete successfully. To sort your data, choose Data\|Manipulate from the main menu and select the "Sort" tab.

Lagrangian interpolation is simply interpolating the data set with a polynomial. The order of the polynomial is determined by the number of points in the data set -- the polynomial must have enough degrees of freedom to fit the data set at every point. Therefore, in a data set with n points, the interpolating polynomial will be of degree n-1.

Lagrangian polynomials are particularly well suited for data sets with a low number of data points (8 or fewer). Preferably, the data points should not have too much scatter. Like all polynomials, the Lagrangian interpolating polynomial exhibits more oscillatory behavior as the degree increases. Therefore, the use of Lagrangian interpolation should be avoided if the number of data points is large; currently, CurveExpert limits the use of this type of interpolation to data sets with 13 points or fewer.

As the name implies, interpolations only provide accuracy between the data points; extrapolating beyond the range of the data points with an interpolating polynomial can be very detrimental, and using it to obtain y data outside of the range of the data points must be done very carefully.

Note also that the data must be sorted for the interpolation to complete successfully. To sort your data, choose Data\|Manipulate and select the Sort tab.
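SciPy's ``scipy.interpolate.lagrange`` constructs the interpolating polynomial directly; the small sketch below, with invented sorted data, illustrates both the degree rule (n points give a degree n-1 polynomial) and why extrapolation is risky. This is an illustration, not CurveExpert's implementation.

.. code-block:: python

   import numpy as np
   from scipy.interpolate import lagrange

   # Small, sorted data set (n = 5 points -> degree-4 interpolating polynomial)
   x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
   y = np.array([1.0, 2.7, 5.8, 6.6, 9.1])

   poly = lagrange(x, y)          # a numpy.poly1d of degree n-1
   print(poly(1.5))               # safe: inside the data range
   # Values outside [0, 4] are extrapolation and may be wildly inaccurate.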
.. _curvefinder:

Using CurveFinder
=================

Using the built-in CurveFinder essentially does all of the hard work for you: you provide the data file, and CurveFinder does the rest. It sifts through every possible curve fit, ranks the fits from best to worst, and presents you with the best one.

Choosing CurveFinder from the Data menu will invoke a dialog box that asks you which families of regression models to search. You may use the "all on" and "all off" buttons to check or uncheck all of the families at once. See Regression Models and Linear Regression for details on each built-in model.

If the polynomial family is included for consideration, you must specify the maximum degree of polynomial that CurveFinder will consider in the "Polynomial Constraint" area. Of course, if the polynomial family is not included, the polynomial constraint will simply be ignored.

Once you are finished selecting which model families CurveFinder should search, press the OK button. The calculation will then proceed, each curve fit will be ranked according to its standard error and correlation coefficient (see the Data Modeling with CurveExpert section for the appropriate equations), and the best fit will be displayed in the graphing window. That's all there is to it!

Note that CurveFinder considers ONLY regressions (linear and nonlinear) in its search for the best fit. Further, as described above, you can select only certain families to be considered in the search for the best curve fit. See Regression Models for a description of the model families. Interpolations are guaranteed to pass through each of the data points, so an evaluation of their performance relative to a regression-type curve fit is rather difficult. Therefore, interpolations are excluded from CurveFinder's search; however, you may still decide to try these curve fits manually when CurveFinder is done.

This dialog will automatically generate initial guesses for the nonlinear regression models, whether AutoGuess is enabled or not. If a particular curve fit diverges or generates a floating point exception, it is excluded from the calculation process. In the case that an important curve fit in the group is not calculated, you must select that curve fit manually from the Apply Fit menu and adjust the initial guesses, if necessary.

The CurveFinder dialog is shown below:

|CurveFinder Dialog|

.. |CurveFinder Dialog| image:: images/curvefinder.png
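The fit-and-rank idea behind CurveFinder is easy to sketch outside of the program. The following illustration uses a hypothetical trio of candidate models and invented data; the standard error shown (residual sum of squares divided by the degrees of freedom, square-rooted) is one common definition, and CurveExpert's exact ranking equations are given in the Data Modeling section. Fits that diverge are simply excluded, mirroring the behavior described above.

.. code-block:: python

   import numpy as np
   from scipy.optimize import curve_fit

   # Invented data for illustration
   x = np.array([0.5, 1.0, 2.0, 3.0, 4.0, 5.0])
   y = np.array([1.9, 2.6, 4.4, 7.1, 11.8, 19.6])

   # Hypothetical stand-ins for the model families CurveFinder searches
   candidates = {
       "linear":      (lambda x, a, b: a + b * x,         [1.0, 1.0]),
       "exponential": (lambda x, a, b: a * np.exp(b * x), [1.0, 0.5]),
       "power":       (lambda x, a, b: a * x**b,          [1.0, 1.0]),
   }

   ranking = []
   for name, (f, p0) in candidates.items():
       try:
           popt, _ = curve_fit(f, x, y, p0=p0, maxfev=2000)
       except RuntimeError:
           continue                        # diverged: exclude from the ranking
       resid = y - f(x, *popt)
       stderr = np.sqrt(np.sum(resid**2) / (len(x) - len(popt)))
       ranking.append((stderr, name))

   for stderr, name in sorted(ranking, key=lambda r: r[0]):
       print(f"{name}: standard error = {stderr:.4f}")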
Calculating Groups of Models
============================

The Calculate Groups dialog allows you to automate the process of finding the right data model. This option is most helpful when you have data that follows a certain trend that matches a particular model family or families. Using Calculate Groups, you can select the families that you would like to see, and CurveExpert will automatically calculate all of the curve fits in each group, rank them on the main window's ranking chart, and open a graphing window for each of the curve fits. See Regression Models for a description of the model families.

This dialog will automatically generate initial guesses for the nonlinear regression models, whether AutoGuess is enabled or not. If a particular curve fit diverges or generates a floating point exception, it is excluded from the calculation process. In the case that an important curve fit in the group is not calculated, you must select that curve fit manually from the Apply Fit menu and adjust the initial guesses, if necessary.

The Calculate Groups dialog is shown below:

|Calculate Groups Dialog|

All On
------------

Press this button to select all of the families listed in this dialog.

All Off
------------

Press this button to deselect all of the families listed in this dialog.

Model Families
------------------

Click the checkbutton beside each model family to include it in the calculation. Families that are not marked by an X will not be included.

Calculate Groups Control Buttons
-------------------------------------

- OK: Click the OK button to signify that you are finished choosing regression families, and that CurveExpert should begin calculating the families that you selected.
- Cancel: Click the Cancel button to abort and not calculate any curve fits.
- Help: Click the Help button to find out how to use the Calculate Groups dialog.

.. |Calculate Groups Dialog| image:: images/calgroups.png