ECTS
5 credits
Component
Faculty of Science
Description
Statistical data are becoming ever more massive. Before modeling such data, it is essential to explore them and to reduce their dimensionality while losing as little information as possible. This is the aim of this course in multidimensional exploratory statistics. Methodologically, the tools used are essentially those of Euclidean geometry: statistical problems and notions are therefore translated into the language of Euclidean geometry before being treated in that framework. The two families of exploratory methods covered in this course are:
1) automatic classification methods, which group observations into classes and reduce the disparity between observations to the disparity between these classes;
2) component analysis methods, which search for the main directions of disparity between observations and provide images of this disparity that can be interpreted in reduced dimensions.
Objectives
Bridge the gap between Euclidean geometry and multidimensional exploratory statistics. Build comprehensive competence in exploring and analyzing large data tables prior to statistical modeling.
Necessary prerequisites
Courses in Euclidean geometry, normed vector spaces, and reduction (diagonalization) of endomorphisms.
Recommended prerequisites: Univariate and bivariate descriptive statistics. Good command of matrix calculus.
Assessment
Continuous assessment (homework / mini-projects) + final assessment
Syllabus
I - Introduction:
a) Multidimensional data: observations, variables, coding.
b) Translation into point clouds in Euclidean metric spaces.
c) The need for dimension reduction: components / classes.
II - Geometric expressions for statistical quantities
Univariate description:
a) Mean, frequency.
b) Variance and standard deviation.
c) Centering and reduction of a variable.
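For illustration (standard notation, not drawn from the syllabus itself): with observations x_1, ..., x_n of a quantitative variable,

\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s^2 = \frac{1}{n}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2,
\qquad
z_i = \frac{x_i - \bar{x}}{s}.
\]

In one standard Euclidean reading, consistent with the course's framing, the centered variable is a vector of R^n whose norm (for the 1/n-weighted inner product) is the standard deviation, so the centered-reduced variable z has unit norm.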
Bivariate relationships:
a) Bivariate linkage & conditioning.
b) Covariance and correlation of two quantitative variables.
c) R2 of the analysis of variance of a quantitative variable on a qualitative variable.
d) Phi2 and T2 of two qualitative variables.
e) Unified formulation of these links.
f) Limits of bivariate analysis and how to overcome them.
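A worked illustration of items b) and d) (standard notation, assumed here rather than taken from the course):

\[
\operatorname{cov}(x,y) = \frac{1}{n}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right),
\qquad
r(x,y) = \frac{\operatorname{cov}(x,y)}{s_x\, s_y},
\]

and, for two qualitative variables with joint relative frequencies f_{jk} and margins f_{j.}, f_{.k},

\[
\Phi^2 = \sum_{j,k} \frac{\left(f_{jk} - f_{j\cdot}\, f_{\cdot k}\right)^2}{f_{j\cdot}\, f_{\cdot k}}.
\]

Geometrically, r(x, y) is the cosine of the angle between the two centered variables, which is exactly the kind of translation the course builds on.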
III - Automatic classification
Dissimilarity and similarity:
a) Measures.
b) Partial vs. global similarity.
Partial similarity: logical/conceptual classification by Galois lattice.
Global similarity:
a) Partitioning in metric space: K-means method & refinements.
b) Hierarchical classification: aggregation indices, the AHC algorithm (agglomerative hierarchical clustering), partition selection criteria.
c) Mixed classification.
d) Class interpretation.
e) Classification on variables.
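To make the partitioning idea of item a) under "Global similarity" concrete, here is a minimal K-means sketch in Python with NumPy; the function name, stopping rule, and toy data are illustrative choices, not the course's reference implementation.

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=None):
        """Minimal Lloyd-style K-means on the rows of X (Euclidean metric)."""
        rng = np.random.default_rng(seed)
        # Initialize centroids by drawing k distinct observations.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assignment step: each observation joins its nearest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: each centroid moves to the mean of its class
            # (an emptied class keeps its previous centroid in this sketch).
            new_centroids = np.array(
                [X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                 for j in range(k)])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids

    # Toy usage: two well-separated Gaussian clouds in the plane.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
    labels, centroids = kmeans(X, k=2, seed=0)

Each iteration can only decrease the within-class inertia, so the algorithm converges to a local optimum; this is why refinements (for example, multiple random starts) are usually considered.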
IV - Principal component analysis
Standardized PCA
a) Individual cloud, inertia and direct PCA.
b) Variable cloud, inertia and dual PCA.
c) Duality relationships and joint interpretation of graphs.
d) Additional elements & duality relationship.
e) The first component as an estimate of a continuous latent variable.
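An illustrative sketch of standardized PCA as diagonalization of the correlation matrix (function and variable names are hypothetical; this is one standard route, not necessarily the course's):

    import numpy as np

    def standardized_pca(X):
        """Standardized PCA via diagonalization of the correlation matrix."""
        n = X.shape[0]
        Z = (X - X.mean(axis=0)) / X.std(axis=0)   # centered-reduced variables
        R = (Z.T @ Z) / n                          # correlation matrix
        eigval, eigvec = np.linalg.eigh(R)         # eigh sorts ascending
        order = np.argsort(eigval)[::-1]           # reorder by decreasing inertia
        eigval, eigvec = eigval[order], eigvec[:, order]
        scores = Z @ eigvec                        # coordinates of the individuals
        return eigval, eigvec, scores

    # The eigenvalues sum to the number of variables, i.e. the total
    # inertia of the standardized cloud.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    eigval, eigvec, scores = standardized_pca(X)
    print(eigval.sum())   # close to 4.0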
General PCA (with any metric)
a) Row cloud and direct PCA.
b) Application to multidimensional scaling.
c) Which column PCA for which duality relations?
d) Interpretation aids.
e) Additional elements & duality relationship.
f) Reconstitution formula (decomposition into singular elements).
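Item f) is the singular value decomposition in statistical dress. In the unweighted, identity-metric case (standard notation, assumed here), a centered table X of rank r is reconstituted as

\[
X = \sum_{\alpha=1}^{r} \sqrt{\lambda_\alpha}\; u_\alpha v_\alpha^{\top},
\]

where the u_alpha and v_alpha are the left and right singular vectors and the lambda_alpha the PCA eigenvalues; truncating the sum after the first q terms yields the best rank-q least-squares approximation of X (Eckart-Young theorem).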
Binary correspondence analysis (CA)
a) Phi2 as direct and dual inertia of the row-profile and column-profile clouds.
b) Which metrics for which duality relationships: barycentric positioning.
c) Interpretation of graphs.
d) Guttman effect.
e) Additional elements.
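A compact statement of item a) in standard CA notation (assumed here): the total inertia of the row-profile cloud, with weights f_{j.} and chi-square distances to the mean profile, equals Phi2 and decomposes over the CA axes,

\[
\Phi^2 \;=\; \sum_{j} f_{j\cdot}\, d_{\chi^2}^{2}\!\left(\mathrm{profile}_j,\ \overline{\mathrm{profile}}\right) \;=\; \sum_{\alpha \ge 1} \lambda_\alpha,
\]

and the dual computation on the column-profile cloud yields the same total.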
Multiple correspondence analysis
a) Application of CA to a complete disjunctive table.
b) Application of CA to a Burt table; equivalence of the two analyses.
c) Barycentric relations between individuals and modalities; barycentric relations between modalities.
d) Guttman effect.
e) Additional elements.
f) The first component as an estimate of a continuous latent variable.
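A minimal sketch of the coding step in item a), using pandas; the toy data and column names are hypothetical illustrations, not course material:

    import pandas as pd

    # Hypothetical toy table of qualitative data: 4 individuals, 2 variables.
    df = pd.DataFrame({"color": ["red", "blue", "red", "green"],
                       "size":  ["S", "M", "M", "S"]})

    # Complete disjunctive table: one 0/1 indicator column per modality.
    Z = pd.get_dummies(df).astype(int)

    # Burt table: all pairwise cross-tabulations of modalities, i.e. Z'Z.
    B = Z.T @ Z

    print(Z)
    print(B)

Running CA on Z or on B yields equivalent factors (item b)), up to a known transformation of the eigenvalues.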
Multidimensional data analysis in practice
a) Complementarity of component analyses and automatic classification.
b) How to carry out a good multidimensional data analysis.
Further information
Hourly volumes:
Lectures (CM): 21
Tutorials (TD): 21
Practicals (TP): 0
Fieldwork: 0