ECTS
5 credits
Training structure
Faculty of Science
Description
Statistical data continues to grow in volume. Before modeling it, it is essential to explore it and reduce its size while losing as little information as possible. This is the objective of this course in multidimensional exploratory statistics. Methodologically, the tools used are essentially those of Euclidean geometry. Statistical problems and concepts will therefore be translated into the language of Euclidean geometry before being addressed within this framework. The two families of exploratory methods that will be covered in this course are:
1) Automatic classification (clustering) methods, which group observations into classes and reduce their disparity to the disparity between these classes;
2) Component analysis methods, which seek out the main directions of disparity between observations and provide interpretable images of this disparity in reduced dimensions.
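As an illustration of these two ideas, here is a minimal sketch in plain Python. It is not course material: the toy data, the class labels, and the helper names (`inertia_decomposition`, `first_principal_axis`) are assumptions made for the example. It checks that the total inertia of a point cloud splits into between-class and within-class parts, and it estimates the first principal direction of an unstandardized cloud by power iteration, a simple stand-in for a full eigendecomposition.

```python
from math import sqrt

def mean_point(points):
    """Barycenter of a list of points (uniform weights)."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

def inertia(points, center):
    """Mean squared Euclidean distance from the points to `center`."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in points) / len(points)

def inertia_decomposition(points, labels):
    """Return (total, between-class, within-class) inertia for a given partition."""
    g = mean_point(points)
    total = inertia(points, g)
    n = len(points)
    between = within = 0.0
    for k in set(labels):
        cls = [p for p, lab in zip(points, labels) if lab == k]
        gk = mean_point(cls)
        weight = len(cls) / n
        within += weight * inertia(cls, gk)
        between += weight * sum((a - b) ** 2 for a, b in zip(gk, g))
    return total, between, within

def first_principal_axis(points, n_iter=200):
    """Leading eigenvector and eigenvalue of the covariance matrix (power iteration).

    Assumes the cloud is not degenerate and that the starting vector is not
    orthogonal to the leading eigenvector.
    """
    g = mean_point(points)
    centered = [[x - c for x, c in zip(p, g)] for p in points]
    n, d = len(points), len(g)
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, eigenvalue

# Hypothetical toy data: two well-separated groups in the plane.
points = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0], [4.0, 4.2], [4.1, 3.9], [3.9, 4.0]]
labels = [0, 0, 0, 1, 1, 1]

total, between, within = inertia_decomposition(points, labels)
axis, lam = first_principal_axis(points)
print(f"share of inertia kept by the 2 classes: {between / total:.3f}")
print(f"share of inertia kept by the first axis: {lam / total:.3f}")
```

In both cases the quality of the reduction is read as a share of the total inertia: the between-class share for a partition, and the leading eigenvalue's share for a principal axis.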
Objectives
Bridge the gap between Euclidean geometry and multidimensional exploratory statistics. Build comprehensive skills in exploring large data tables and analyzing them prior to statistical modeling.
Teaching hours
- Multidimensional Data Analysis - Lecture (CM): 21 h
- Multidimensional Data Analysis - Tutorial (TD): 21 h
Mandatory prerequisites
Course on Euclidean geometry, normed vector spaces, and reduction of endomorphisms.
Recommended prerequisites
Courses in univariate and bivariate descriptive statistics. Good command of matrix calculations.
Knowledge assessment
Continuous assessment (homework/mini-projects) + final exam
Syllabus
I - Introduction:
a) Multidimensional data, observations, variables, coding.
b) Translation into point clouds in Euclidean metric spaces.
c) Need for dimension reduction: components/classes.
II - Geometric representations of statistical quantities
Univariate description:
a) Mean, frequency.
b) Variance and standard deviation.
c) Centering and reduction of a variable.
Bivariate relationships:
a) Bivariate linkage & conditioning.
b) Covariance and correlation of two quantitative variables.
c) R² from the analysis of variance of a quantitative variable on a qualitative variable.
d) Phi² and T² of two qualitative variables.
e) Unified notation of relationships.
f) Limitations of bivariate analysis & how to overcome them.
III - Automatic classification
Dissimilarity and similarity.
a) Measures.
b) Partial vs. overall resemblance.
Partial resemblance: logical/conceptual classification using Galois lattices.
Overall resemblance:
a) Partitioning in metric space: K-means method & refinements.
b) Hierarchical classification: indices, agglomerative hierarchical clustering (CAH) algorithm, partition selection criteria.
c) Mixed classification.
d) Interpretation of classes.
e) Classification based on variables.
IV - Principal component analysis
Standardized PCA
a) Cloud of individuals, inertia, and direct PCA.
b) Variable cloud, inertia, and dual PCA.
c) Duality relationships and joint interpretation of graphs.
d) Supplementary elements & duality relationship.
e) The first component as an estimate of a continuous latent variable.
General PCA (with arbitrary metrics)
a) Row cloud and direct PCA.
b) Application to multidimensional scaling.
c) Which PCA of columns, for which duality relationships?
d) Aids to interpretation.
e) Supplementary elements & duality relationship.
f) Reconstruction formula (decomposition into elementary terms).
Binary correspondence analysis
a) Phi² as the direct and dual inertias of the clouds of row profiles and column profiles.
b) Which metrics for which duality relationships: barycentric positioning.
c) Joint interpretation of graphs.
d) Guttman effect.
e) Supplementary elements.
Multiple correspondence analysis.
a) Application of binary correspondence analysis to a complete disjunctive table.
b) Application of binary correspondence analysis to a Burt table; equivalence.
c) Barycentric relationships between individuals and categories. Barycentric relationships between categories.
d) Guttman effect.
e) Supplementary elements.
f) The first component as an estimate of a continuous latent variable.
The practice of multidimensional data analysis (ADM)
a) Complementarity of factor analysis and automatic classification.
b) How to conduct a good multidimensional data analysis.
Additional information
Hourly volumes:
Lectures (CM): 21
Tutorials (TD): 21
Practical work (TP): 0
Fieldwork: 0