Ph. D. & Dr. Sc. Lev Gelimson's General Data Direction, as well as Scatter and Trend Measure and Estimation Theories

General Data Direction, as well as Scatter and Trend Measure and Estimation Theories

© Ph. D. & Dr. Sc. Lev Gelimson

Academic Institute for Creating Fundamental Sciences (Munich, Germany)

Mathematical Journal

of the "Collegium" All World Academy of Sciences

Munich (Germany)

11 (2011), 10

In data modeling, processing, estimation, and approximation [1], data scatter is often so great that the most suitable type or form of analytic approximation expression cannot be determined directly, e.g. linear, piecewise linear, parabolic, hyperbolic, circular, elliptic, sinusoidal, etc. for two-dimensional data, or linear, piecewise linear, paraboloidal, hyperboloidal, spherical, ellipsoidal, etc. for three-dimensional data. Before considering any approximation to such data, decide whether the data can be regarded as directed at all. It is therefore necessary and very useful to measure precisely, or at least to estimate quantitatively, data scatter and trend (directedness).

Apparently, there are no known measures of data scatter and trend (directedness). Only purely qualitative estimations of data scatter (e.g. "data scatter is often great") are known. Generally, data direction and trend (directedness) are not considered at all. The nearest known concepts are the principal central axes of continual areas and volumes, determined via their moments of inertia in mechanics, and usually only linear directions are considered.

In overmathematics [2, 3] and the fundamental sciences of estimation [4], approximation [5], and data modeling and processing [6], data can be any, possibly mixed, quantisets (e.g. including discrete and continual parts), data directions can be linear or nonlinear, and universal invariant relative (dimensionless) precise measures and (quantitative) estimations of data scatter and trend (directedness) are introduced.

A preliminary data centralization transformation of the coordinate system is often very useful. Further, take into account the valid types of data invariance and, generally, of coordinate system transformation invariance. Namely, under rotation invariance, apply distance quadrat theories (DQT) and general theories of moments of inertia (GTMI) to the given data. Under linear transformation invariance, use quadratic mean theories (QMT) along with a coordinate system normalization transformation.

Nota bene: Data scatter and trend are conjugate (twin, pair, dual). Further, their absolute measures are necessarily trivial (e.g. 0 and 1, respectively) because, for any given data, it is in principle possible to consider the data graph itself as an approximation to these data. Hence these absolute measures are not sensitive at all and do nothing to discriminate different data sets or, generally, quantisets in overmathematics [2, 3] by their scatter and trend. That is why only relative (not absolute) precise measures and (quantitative) estimations of data scatter and trend (directedness) can be sensitive and hence reasonable. They are relative in two different senses: they are not absolute, and they are dimensionless (i.e. pure numbers) and hence independent of any physical dimensions (units).

Use both equivalent terminologies: an approximation to data (or a data approximation) and a bisector of data (or a data bisector). Further, use the following naturally reduced expressions:

linear (nonlinear, quadratic, cubic, etc.) data directions, scatter and trend measures and estimations instead of measures and estimations of data scatter and trend with respect to linear (nonlinear, quadratic, cubic, etc., respectively) directions of data, as well as approximations to data or bisectors of data.

As ever, the fundamental principle of tolerable simplicity [2, 3] plays a key role.

In the simplest case of linear data directions, scatter and trend measures and estimations, straightforwardly use either (by rotation invariance) distance quadrat theories (DQT) and general theories of moments of inertia (GTMI) or (by linear transformation invariance) quadratic mean theories (QMT) along with data normalization transformation of a coordinate system.

By rotation invariance, linear data direction is the principal central axis (of data) about which the moment J of inertia (of data) and, equivalently, with respect to which the sum ²S of squared distances (of data) take their common minimum value J_min = ²S_min . Define and determine measures S_L = (J_min / J_max)^1/2 = (²S_min / ²S_max)^1/2 of data scatter and T_L = 1 - S_L = 1 - (J_min / J_max)^1/2 = 1 - (²S_min / ²S_max)^1/2 of data trend with respect to linear approximation (bisector).

By linear transformation invariance, linear data direction is determined by applying the inversion of data normalization transformation of a coordinate system to the principal central axis (of preliminarily normalized data) with respect to which the sum ²S of squared distances (of these data) takes its minimum value ²S_min . Define and determine measures S_L = (²S_min / ²S_max)^1/2 of data scatter and T_L = 1 - S_L = 1 - (²S_min / ²S_max)^1/2 of data trend with respect to linear approximation (bisector).

Naturally, we have 0 ≤ S_L ≤ 1 and 0 ≤ T_L ≤ 1.

Nota bene: In principle, to define and determine such measures of data scatter and trend, the ratio J_min / J_max = ²S_min / ²S_max itself and many suitable functions of this ratio (other than the square root above) can also be applied. But using precisely the square root seems to be the most adequate and natural. Consider a rectangle with length L and width W such that L ≥ W (otherwise, rotate the rectangle by angle π/2 = 90° to provide this relation). The moments of inertia of this rectangle about its longitudinal and transversal central axes are J_L = LW³/12 and J_T = L³W/12, respectively. These axes are its axes of symmetry and hence its principal central axes. Relation L ≥ W provides relation J_L ≤ J_T . Hence

J_min = J_L = LW³/12,

J_max = J_T = L³W/12,

J_min / J_max = W²/L² .

Now we see that using precisely the square root above provides very simple and natural formulae for the rectangle scatter and trend measures

S_L = W/L ,

T_L = 1 - W/L .

If L = W , then we have a square for which each central axis (even if it is not an axis of symmetry) is principal, S_L = 1, and T_L = 0. The last two equalities also hold for any other fully scattered or, equivalently, fully nondirected data, e.g. a solid circle.
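The rectangle example above can be checked numerically. The following is a minimal sketch, assuming a homogeneous solid rectangle; the function name `rectangle_scatter_trend` is hypothetical and only illustrates the formulae S_L = (J_min / J_max)^1/2 and T_L = 1 - S_L.

```python
import math
from typing import Tuple


def rectangle_scatter_trend(length: float, width: float) -> Tuple[float, float]:
    """Return (S_L, T_L) for a solid rectangle from its principal central moments of inertia."""
    if length <= 0 or width <= 0:
        raise ValueError("length and width must be positive")
    # Moments of inertia about the two central axes of symmetry.
    j_longitudinal = length * width ** 3 / 12.0
    j_transversal = length ** 3 * width / 12.0
    # Taking min/max makes the explicit rotation by 90 degrees unnecessary.
    j_min = min(j_longitudinal, j_transversal)
    j_max = max(j_longitudinal, j_transversal)
    scatter = math.sqrt(j_min / j_max)
    return scatter, 1.0 - scatter


# A 4 x 1 rectangle: S_L should equal W/L = 0.25 up to rounding.
print(rectangle_scatter_trend(4.0, 1.0))
# A square is fully nondirected: S_L = 1 and T_L = 0.
print(rectangle_scatter_trend(2.0, 2.0))
```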

In the case of 2D discrete data, let n (n ∈ N⁺ = {1, 2, ...}, n > 2) points [_j=1ⁿ (x'_j , y'_j )] = [(x'₁ , y'₁), (x'₂ , y'₂), ... , (x'_n , y'_n)] with any real coordinates be given. Use centralization transformation x = x' - Σ_j=1ⁿ x'_j / n , y = y' - Σ_j=1ⁿ y'_j / n to provide a coordinate system xOy central for the given data and further work in this system with points [_j=1ⁿ (x_j , y_j)].

In the linear case, these data points should be approximated with a straight line ax + by = 0 containing the origin O(0, 0). The distance between this line and the jth data point (x_j , y_j), and further the sum of the squared distances between this line and every one of the n data points [_j=1ⁿ (x_j , y_j)], are, respectively,

d_j = |ax_j + by_j|/(a² + b²)^1/2,

²S(a , b) = Σ_j=1ⁿ d_j² = Σ_j=1ⁿ(ax_j + by_j)²/(a² + b²).
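For discrete points, the rotation-invariant measures S_L and T_L can be computed directly from the sum ²S(a , b) above. The following is a minimal sketch, not a definitive implementation. It assumes that the extreme values ²S_min and ²S_max are obtained as the two eigenvalues of the symmetric 2×2 matrix of centered second moments, which is the standard way to minimize and maximize such a quadratic form over unit normals (a , b). The function name `linear_scatter_trend` is hypothetical.

```python
import math
from typing import Iterable, Tuple


def linear_scatter_trend(points: Iterable[Tuple[float, float]]) -> Tuple[float, float, float]:
    """Return (S_L, T_L, angle) for 2D points under rotation invariance.

    angle is the inclination (in radians) of the line through the data
    center that minimizes the sum of squared distances.
    """
    pts = list(points)
    n = len(pts)
    if n < 3:
        raise ValueError("at least three points are required (n > 2)")
    # Centralization: shift the origin to the data center.
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    syy = sum((y - mean_y) ** 2 for _, y in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    # With a^2 + b^2 = 1, 2S(a, b) = a^2*sxx + 2ab*sxy + b^2*syy, so its
    # extrema are the eigenvalues of the matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2.0
    radius = math.hypot((sxx - syy) / 2.0, sxy)
    s_min = max(half_trace - radius, 0.0)  # guard against tiny negative rounding
    s_max = half_trace + radius
    if s_max == 0.0:
        raise ValueError("all points coincide; the direction is undefined")
    scatter = math.sqrt(s_min / s_max)
    # Inclination of the principal axis with the minimum sum of squared distances.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return scatter, 1.0 - scatter, angle


# Points lying exactly on y = 2x are fully directed: S_L near 0, T_L near 1.
print(linear_scatter_trend([(-1, -2), (0, 0), (1, 2)]))
```

The closed-form eigenvalues keep the sketch free of external libraries; a general eigenvalue solver would serve equally well.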

In the general case of nonlinear approximation to the given data, consider the true deviation and distance of a data point (x_j , y_j) from a curve y = f(x), namely from its point (x , y) nearest to that data point.

The equation of the tangent to curve y = f(x) at point (x , y) is

Y - y = f'(x)(X - x)

with inclination angle α and

tan α = f'(x).

The equation of the perpendicular (normal) to curve y = f(x) at point (x , y) is

Y - y = - 1/f'(x) (X - x)

with inclination angle β and

tan β = - 1/f'(x).

If data point (x_j , y_j) belongs to this perpendicular, then

y_j - y = - 1/f'(x) (x_j - x),

x + f'(x)f(x) = x_j + y_jf'(x).

In particular, for a straight line y = a + bx , we obtain

x + b(a + bx) = x_j + by_j ,

x = (x_j + by_j - ab)/(1 + b²),

y = (a + bx_j + b²y_j)/(1 + b²).
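The two formulae for the straight line can be verified with a short sketch. The function name `foot_on_line` is hypothetical. The check confirms that the computed point lies on y = a + bx and that the vector from it to the data point is perpendicular to the line direction (1 , b).

```python
from typing import Tuple


def foot_on_line(a: float, b: float, xj: float, yj: float) -> Tuple[float, float]:
    """Return the point (x, y) of the line y = a + b*x nearest to (xj, yj)."""
    denom = 1.0 + b * b
    x = (xj + b * yj - a * b) / denom
    y = (a + b * xj + b * b * yj) / denom
    return x, y


a, b = 1.0, 2.0
xj, yj = 3.0, 0.0
x, y = foot_on_line(a, b, xj, yj)
# Residual of the line equation at the foot point (should be near 0).
print(y - (a + b * x))
# Scalar product of (xj - x, yj - y) with the line direction (1, b) (should be near 0).
print((xj - x) * 1.0 + (yj - y) * b)
```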

For a quadratic curve y = a + bx + cx² , we have

x + f'(x)f(x) = x_j + y_jf'(x),

x + (b + 2cx)(a + bx + cx²) = x_j + y_j(b + 2cx),

2c²x³ + 3bcx² + (2ac + b² + 1 - 2cy_j)x + ab - x_j - by_j = 0,

x³ + 3b/(2c) x² + [a/c + b²/(2c²) + 1/(2c²) - y_j/c] x + ab/(2c²) - x_j/(2c²) - by_j/(2c²) = 0.

Introducing x = u - b/(2c) , we obtain a reduced cubic equation [1] u³ + pu + q = 0, namely

u³ + [a/c - b²/(4c²) + 1/(2c²) - y_j/c] u - b/(4c³) - x_j/(2c²) = 0

with

p = a/c - b²/(4c²) + 1/(2c²) - y_j/c ,

q = - b/(4c³) - x_j/(2c²) .

By the Cardano formulae [1] with Q = (p/3)³ + (q/2)², we obtain for Q > 0 one real solution and two complex conjugate solutions, for Q = 0 one real solution and another double real solution (a triple real solution for p = q = 0), and for Q < 0 three different real solutions:

u₁ = A + B ,

u_{2,3} = - (A + B)/2 ± i (3^1/2 / 2)(A - B) where i² = -1,

A = (- q/2 + Q^1/2)^1/3,

B = (- q/2 - Q^1/2)^1/3,

for each value A , take value B with AB = - p/3; for real equations (which is here the case), take real values of A and B . We consider exclusively real values of u and x .
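The nearest point of a quadratic curve y = a + bx + cx² to a data point can be sketched as follows, using the coefficients p and q of the reduced cubic and the discriminant Q. This is an illustration under stated assumptions, not a definitive implementation. For Q < 0 the sketch swaps in the equivalent trigonometric form of the three real roots in order to avoid complex cube roots. The exact comparison Q == 0 is kept only for clarity and is fragile in floating point. The function names are hypothetical.

```python
import math
from typing import List, Tuple


def _real_cbrt(v: float) -> float:
    """Real cube root, valid for negative arguments as well."""
    return math.copysign(abs(v) ** (1.0 / 3.0), v)


def real_roots_reduced_cubic(p: float, q: float) -> List[float]:
    """Real roots of u^3 + p*u + q = 0."""
    big_q = (p / 3.0) ** 3 + (q / 2.0) ** 2
    if big_q > 0.0:
        root = math.sqrt(big_q)
        # Real A and B automatically satisfy A*B = -p/3.
        return [_real_cbrt(-q / 2.0 + root) + _real_cbrt(-q / 2.0 - root)]
    if big_q == 0.0:
        if p == 0.0:
            return [0.0]  # triple root
        a_val = _real_cbrt(-q / 2.0)
        return [2.0 * a_val, -a_val]  # simple root and double root
    # Q < 0 (hence p < 0): three different real roots, trigonometric form.
    r = 2.0 * math.sqrt(-p / 3.0)
    arg = max(-1.0, min(1.0, 3.0 * q / (p * r)))  # clamp against rounding
    phi = math.acos(arg)
    return [r * math.cos((phi - 2.0 * math.pi * k) / 3.0) for k in range(3)]


def nearest_point_on_parabola(a: float, b: float, c: float,
                              xj: float, yj: float) -> Tuple[float, float, float]:
    """Return (x, y, distance) of the point of y = a + b*x + c*x^2 nearest to (xj, yj)."""
    if c == 0.0:
        raise ValueError("c must be nonzero for a quadratic curve")
    p = a / c - b * b / (4.0 * c * c) + 1.0 / (2.0 * c * c) - yj / c
    q = -b / (4.0 * c ** 3) - xj / (2.0 * c * c)
    candidates = []
    for u in real_roots_reduced_cubic(p, q):
        x = u - b / (2.0 * c)
        y = a + b * x + c * x * x
        candidates.append((math.hypot(xj - x, yj - y), x, y))
    # Several real roots are stationary points; keep the one with the least distance.
    dist, x, y = min(candidates)
    return x, y, dist


print(nearest_point_on_parabola(0.0, 0.0, 1.0, 0.0, -1.0))
```

When the cubic has three real roots, not every root is a minimum of the distance, so the sketch compares the candidate distances explicitly.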

Generally, for any nonlinear bisector of data points, use it as a global curvilinear g-axis, or abscissa. Its origin can be an arbitrary suitable point on it. Each local h-axis, or ordinate, is perpendicular to this g-axis at each point of their intersection which is the origin of this local h-axis. These axes can be bounded (limited), these bounds being sufficient to determine both the g-coordinate and the h-coordinate of every point in the given data set. To define and determine these coordinates of any point P of a work area including the given data set, find the set of the points of this g-axis which are nearest to point P. Try to define these axes so that for every point P of a work area, there is one and only one point G(P) of this g-axis which is nearest to point P, vector G(P)P being perpendicular to this g-axis at this point G(P). Consider this case only.

Define and determine the abscissa of point G(P) on this g-axis to be the abscissa of point P , and the deviation h of point P from point G(P) (and simultaneously from this g-axis) to be the ordinate of point P .

To ensure that the distance between point P and the points of this g-axis indeed has a minimum at point G(P), consider the product of this deviation and the signed curvature

k = f''(x)[1 + f'²(x)]^-3/2 .

If |hk| < 1, or if the deviation of point P and the deviation of the curvature center have different signs, then the distance has a minimum at point G(P).

For a quadratic curve y = a + bx + cx² , we have

k = 2c[1 + (b + 2cx)²]^-3/2 .
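The curvature formula and the minimum criterion can be sketched together. The sign convention is an assumption of this sketch: the deviation h is taken positive on the side of the curve towards increasing y along the normal, so that the curvature center lies at signed deviation 1/k and "different signs" means hk < 0. The function names are hypothetical.

```python
def parabola_curvature(b: float, c: float, x: float) -> float:
    """Signed curvature k of y = a + b*x + c*x^2 at abscissa x."""
    slope = b + 2.0 * c * x
    return 2.0 * c * (1.0 + slope * slope) ** -1.5


def is_distance_minimum(h: float, k: float) -> bool:
    """Criterion from the text: |hk| < 1, or h and the curvature center have different signs."""
    return abs(h * k) < 1.0 or h * k < 0.0


# Parabola y = x^2 at its vertex: k = 2, curvature center at (0, 0.5).
k = parabola_curvature(0.0, 1.0, 0.0)
print(is_distance_minimum(0.25, k))   # point (0, 0.25), between the curve and its curvature center
print(is_distance_minimum(2.0, k))    # point (0, 2), beyond the curvature center
print(is_distance_minimum(-1.0, k))   # point (0, -1), on the convex side
```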

Now introduce an additional Cartesian rectangular coordinate system O'L'D' by rectifying this curve. The abscissa axis O'L' of this system develops this curve: its length coordinate is determined by the indefinite integral (any constant, which has no influence on calculating the corresponding definite integrals, is dropped here)

∫[1 + (b + 2cx)²]^1/2 dx =

∫[1 + b² + 4bcx + 4c²x²]^1/2 dx =

[x/2 + b/(4c)][1 + (b + 2cx)²]^1/2 + ln{[1 + (b + 2cx)²]^1/2 + (b + 2cx)sign(c)}/(4|c|) =

[x/2 + b/(4c)][1 + b² + 4bcx + 4c²x²]^1/2 + ln{[1 + b² + 4bcx + 4c²x²]^1/2 + (b + 2cx)sign(c)}/(4|c|).
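The closed-form length coordinate can be compared with a numerical quadrature. The following is a minimal sketch, assuming c ≠ 0. The composite Simpson rule is used here only as an independent check and is not part of the theory. The function names are hypothetical.

```python
import math
from typing import Callable


def length_primitive(x: float, b: float, c: float) -> float:
    """Closed-form primitive of [1 + (b + 2cx)^2]^(1/2), additive constant dropped."""
    t = b + 2.0 * c * x
    root = math.sqrt(1.0 + t * t)
    sign_c = math.copysign(1.0, c)
    return (x / 2.0 + b / (4.0 * c)) * root + math.log(root + t * sign_c) / (4.0 * abs(c))


def curve_length(x0: float, x1: float, b: float, c: float) -> float:
    """Length of y = a + b*x + c*x^2 between abscissas x0 and x1."""
    return length_primitive(x1, b, c) - length_primitive(x0, b, c)


def simpson(func: Callable[[float], float], x0: float, x1: float, steps: int = 1000) -> float:
    """Composite Simpson rule with an even number of steps."""
    if steps % 2:
        steps += 1
    h = (x1 - x0) / steps
    total = func(x0) + func(x1)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * func(x0 + i * h)
    return total * h / 3.0


b, c = 0.5, -1.0
closed = curve_length(-1.0, 2.0, b, c)
numeric = simpson(lambda x: math.sqrt(1.0 + (b + 2.0 * c * x) ** 2), -1.0, 2.0)
print(closed, numeric)  # the two values should agree closely
```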

The ordinate axis O'D' of this system shows deviations of data points from this curve.

First take any suitable point O' on O'L' as the origin of this system. In it, determine the center O of the abscissas of all the data points and use centralization transformation L = L' - Σ_j=1ⁿ L'_j / n to provide coordinate system OLD which is abscissa-central for the given data and further work in this system with the images [_j=1ⁿ (L_j , D_j)] of the given data points [_j=1ⁿ (x_j , y_j)]. Determine ²S_min about the OL axis and ²S_max about the OD axis. Then determine data scatter measure

S = [²S_min / ²S_max]^1/2

and data trend measure

T = 1 - S = 1 - [²S_min / ²S_max]^1/2.
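Once the images (L'_j , D_j) of the data points in the rectified system are available, the measures S and T follow in a few lines. The following is a minimal sketch under one interpretive assumption: the sum about the OL axis is taken as the sum of squared deviations D_j², and the sum about the OD axis as the sum of squared centered abscissas L_j². The function name `rectified_scatter_trend` is hypothetical.

```python
import math
from typing import Iterable, Tuple


def rectified_scatter_trend(images: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Return (S, T) from images (L'_j, D_j) of the data points in the rectified system."""
    pts = list(images)
    n = len(pts)
    if n < 3:
        raise ValueError("at least three points are required (n > 2)")
    # Abscissa centralization: L = L' - mean(L').
    mean_l = sum(l for l, _ in pts) / n
    # Sum of squared distances from the OL axis (deviations from the curve).
    s_about_ol = sum(d * d for _, d in pts)
    # Sum of squared distances from the OD axis (spread along the curve).
    s_about_od = sum((l - mean_l) ** 2 for l, _ in pts)
    if s_about_od == 0.0:
        raise ValueError("all abscissas coincide; the measure is undefined")
    # Following the text, the OL sum plays the role of the minimum; the result
    # exceeds 1 if the deviations dominate the spread along the curve.
    scatter = math.sqrt(s_about_ol / s_about_od)
    return scatter, 1.0 - scatter


print(rectified_scatter_trend([(0.0, 0.1), (1.0, -0.1), (2.0, 0.1), (3.0, -0.1)]))
```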

Nota bene: Sum ²S can be considered as a moment of inertia about a curvilinear axis, i.e. as a generalization of a moment of inertia. Then we can also use notation J as above. If such a curve is reasonable, i.e. has essential advantages over the linear approximation (bisector) that justify the additional complication, then S_L ≥ S and, since T_L = 1 - S_L and T = 1 - S , also T_L ≤ T .

These theories are very efficient in data estimation, approximation, and processing.

Acknowledgements to Anatolij Gelimson for our constructive discussions on coordinate system transformation invariances and his very useful remarks.

References

[1] Encyclopaedia of Mathematics. Ed. M. Hazewinkel. Volumes 1 to 10. Kluwer Academic Publ., Dordrecht, 1988-1994

[2] Lev Gelimson. Providing Helicopter Fatigue Strength: Flight Conditions [Overmathematics and Other Fundamental Mathematical Sciences]. In: Structural Integrity of Advanced Aircraft and Life Extension for Current Fleets – Lessons Learned in 50 Years After the Comet Accidents, Proceedings of the 23rd ICAF Symposium, Dalle Donne, C. (Ed.), 2005, Hamburg, Vol. II, 405-416

[3] Lev Gelimson. Overmathematics: Fundamental Principles, Theories, Methods, and Laws of Science. The "Collegium" All World Academy of Sciences Publishers, Munich, 2010

[4] Lev Gelimson. Fundamental Science of Estimation. The "Collegium" All World Academy of Sciences Publishers, Munich, 2010

[5] Lev Gelimson. Fundamental Science of Approximation. The "Collegium" All World Academy of Sciences Publishers, Munich, 2010

[6] Lev Gelimson. Fundamental Science of Data Modeling and Processing. The "Collegium" All World Academy of Sciences Publishers, Munich, 2010

[7] Lev Gelimson. Corrections and Generalizations of the Least Square Method. In: Review of Aeronautical Fatigue Investigations in Germany during the Period May 2007 to April 2009, Ed. Dr. Claudio Dalle Donne, Pascal Vermeer, CTO/IW/MS-2009-076 Technical Report, International Committee on Aeronautical Fatigue, ICAF 2009, EADS Innovation Works Germany, 2009, 59-60