Machine Learning Primer -- Part I: Foundations 
 
 
Claudius Gros, WS 2025/26
Institut für theoretische Physik
Goethe-University Frankfurt a.M.
Machine Learning - Linear Classifier
linear classifier
-  A method using a linear combination of features 
 for the characterization or the separation of
 two or more classes of objects or events.
-  The resulting combination may be used as a linear 
 classifier, or,  more commonly, for dimensionality
 reduction before later classification.
$\qquad\quad
y= f(\mathbf{w}\cdot\mathbf{x}-b),
\qquad\quad
\mathbf{w}\cdot\mathbf{x}=\sum_{j=0}^{d-1} w_j x_j
$
-  feature vector: $\ \ \mathbf{x}=(x_0,\,..,\,x_{d-1})$ 
 weight vector:   $\ \ \mathbf{w}=(w_0,\,..,\,w_{d-1})$
 offset (threshold): $\ \ b$
-  (non-linear) transfer function: $\ \ f(z)$
     
hyperplane
$$ \mathbf{x}\qquad \mathrm{such\ that}
   \qquad 
   \fbox{$\phantom{\big|}\mathbf{w}\cdot\mathbf{x}=b\phantom{\big|}$}
$$
-  a $(d-1)$-dimensional hyperplane divides a $d$-dimensional space into two half-spaces
     
$$
\begin{array}{rcl}
\mathrm{right-side} &:&  \mathbf{w}\cdot\mathbf{x}> b \\
\mathrm{left-side}  &:&  \mathbf{w}\cdot\mathbf{x}< b
\end{array}
$$
-  use transfer function $\ \ f(z)=y(z) \ \ $ to map to feature class
     
$$
y(z<0)\ \to\  -1,\quad\qquad 
y(z>0)\ \to\  1,\quad\qquad 
z=\mathbf{w}\cdot\mathbf{x}-b
$$
-  also called 'perceptron' $\ \ \to \ \ $ neural networks
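 
a minimal sketch (not part of the lecture code) of the perceptron decision rule
$y=f(\mathbf{w}\cdot\mathbf{x}-b)$ with a sign-type transfer function; the
vectors and the offset are illustrative values

import numpy as np                                   # numerics

def linearClassify(x, w, b):                         # perceptron decision rule
  z = np.dot(w, x) - b                               # z = w.x - b
  return 1 if z > 0 else -1                          # class +1 (right) / -1 (left)

w = np.array([1.0, -2.0])                            # illustrative weight vector
b = 0.5                                              # illustrative offset
print(linearClassify(np.array([2.0, 0.0]), w, b))    # w.x = 2.0 > b  ->  1
print(linearClassify(np.array([0.0, 1.0]), w, b))    # w.x = -2.0 < b -> -1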
     
task
-  determine "optimal" 
 weights $\ \ \mathbf{w}$
 offset $\ \ b$
 
training / testing sets
supervised learning
-  labeled data: $\ \ N=N_1+N_2$ 
$
\qquad
\begin{array}{rcl}
\mathrm{first\ class}   &:&  \{\mathbf{x}_\alpha\}, \qquad \alpha\in[1,N_1] \\
\mathrm{second\ class}  &:&  \{\mathbf{x}_\beta\}, \qquad \beta\in[N_1+1,N]
\end{array}
$
-  divide randomly into training / test data
 : typically 70/30, 80/20, ...
-  model parameters not optimized during training (hyperparameters) 
 may be fine-tuned using a separate validation data set, if needed
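 
a sketch of a random training / test split (80/20 here); the data array is
illustrative, not from the lecture

import numpy as np                                   # numerics

data   = np.random.rand(100, 2)                      # 100 illustrative data points
perm   = np.random.permutation(len(data))            # random ordering
nTrain = int(0.8*len(data))                          # 80/20 split
trainSet = data[perm[:nTrain]]                       # training data
testSet  = data[perm[nTrain:]]                       # test data
print(len(trainSet), len(testSet))                   # 80 20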
center of mass of individual classes
$$
\mathbf{m}_1 = \frac{1}{N_1}\sum_\alpha \mathbf{x}_\alpha,
\quad\qquad
\mathbf{m}_2 = \frac{1}{N_2}\sum_\beta \mathbf{x}_\beta
$$
-  implicit: sum is over training set
     
-  intuition: $\ \ \mathbf{w}\ \propto \ \mathbf{m}_1-\mathbf{m}_2$ 
 : optimal only for isotropically distributed classes (combined covariance $\propto$ identity)
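 
a sketch computing the class centers of mass $\mathbf{m}_1$, $\mathbf{m}_2$ over
the training set; the two Gaussian clouds are illustrative

import numpy as np                                   # numerics

class1 = np.random.multivariate_normal([ 2.0, 0.0], np.eye(2), 50)  # N_1 points
class2 = np.random.multivariate_normal([-2.0, 0.0], np.eye(2), 50)  # N_2 points
m1 = class1.mean(axis=0)                             # center of mass, class 1
m2 = class2.mean(axis=0)                             # center of mass, class 2
print(m1, m2, m1 - m2)                               # m1 - m2: naive direction for w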
 
covariance matrix
probability distribution
$\qquad\quad
p(\mathbf{x})\ge0, 
     \qquad\qquad 
\int p(\mathbf{x})\,d\mathbf{x}=1
$
-  generating the data 
 
$
p(\mathbf{x}) \quad\longrightarrow
              \quad \mathbf{x}_\alpha,
$ with mean (generic)
 
$
\mathbf{m} = \frac{1}{N}\sum_\alpha \mathbf{x}_\alpha
 = \int \mathbf{x}\,p(\mathbf{x})\,d\mathbf{x}
 = \langle \mathbf{x}\rangle
$ 
-  Cartesian components 
$$
\mathbf{x}=\left(\begin{array}{c}
x_0\\ \vdots \\ x_{d-1}\end{array}\right),\qquad\quad
\mathbf{m}=\left(\begin{array}{c}
m_0\\ \vdots \\ m_{d-1}\end{array}\right),\qquad\quad
\mathbf{x}^T=(x_0,\,\dots,x_{d-1})
$$
note: distinguish here between the (column) vector $\ \ \mathbf{x}\ \ $ and the
transposed (row) vector $\ \ \mathbf{x}^T\ \ $ 
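 
sketch: drawing samples $\mathbf{x}_\alpha$ from a (here Gaussian, illustrative)
$p(\mathbf{x})$ and comparing the sample mean with $\langle\mathbf{x}\rangle$

import numpy as np                                   # numerics

trueMean = np.array([1.0, -2.0])                     # <x> of the generating p(x)
samples  = np.random.multivariate_normal(trueMean, np.eye(2), 10000)
sampleMean = samples.mean(axis=0)                    # m = (1/N) sum_alpha x_alpha
print(trueMean, sampleMean)                          # agree up to ~1/sqrt(N)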
covariance matrix
$$
S_{ij} = \big\langle (x_i-m_i)(x_j-m_j)\big\rangle
$$
-  symmetric 
 : eigenvalues are real
 : their square roots give the half-widths (semi-axes) of the covariance ellipse
-  matrix notation
     
$$
\hat S =
\big\langle (\mathbf{x}-\mathbf{m})\,(\mathbf{x}-\mathbf{m})^T\big\rangle,
\qquad\quad
S_{ij} = (\hat{S})_{ij},
\qquad\quad
\mathbf{x}=\left(\begin{array}{c}
x_0\\ \vdots \\ x_{d-1}
\end{array}\right)
$$
-  outer product 
$$ \mathbf{x}\,\mathbf{x}^T= \mathbf{x}\otimes\mathbf{x},
     \qquad\quad \big(\mathbf{x}\,\mathbf{x}^T\big)_{ij}=x_ix_j
$$
note
$$ \mathbf{x}^T\mathbf{x} = \mathbf{x}\cdot\mathbf{x}
   = \sum_i x_i x_i
$$
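 
sketch contrasting the outer product $\mathbf{x}\mathbf{x}^T$ with the scalar
product $\mathbf{x}^T\mathbf{x}$ in numpy (illustrative vector)

import numpy as np                                   # numerics

x = np.array([1.0, 2.0, 3.0])
print(np.outer(x, x))                                # (x x^T)_{ij} = x_i x_j, 3x3 matrix
print(np.dot(x, x))                                  # x^T x = sum_i x_i x_i = 14.0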
     
 
covariance ellipse
$$
\frac{x^2}{a^2}+\frac{y^2}{b^2} = 1,
\quad\qquad
\left\{\begin{array}{rcl}
x(t) &=& \sqrt{\lambda_1}\cos\Theta\cos t 
-\sqrt{\lambda_2}\sin\Theta\sin t
\\[0.1ex]
y(t) &=& \sqrt{\lambda_1}\sin\Theta\cos t 
+\sqrt{\lambda_2}\cos\Theta\sin t
\end{array}\right.
$$
-  angle with x-axis: $\ \Theta$  
-  eigenvalues $\ \lambda_i$  
$$
\mathbf{x} = 
\hat{V} \cdot \hat{\Lambda} \cdot 
\mathbf{T},
\quad\qquad
\mathbf{x} = {x\choose y},
\quad\qquad
\mathbf{T} = {\cos t\choose \sin t},
$$
-  eigen-vector matrix $\ \hat{V}$
     
$$
\hat{V}=\left(\begin{array}{cc} 
\cos\Theta & -\sin\Theta \\ 
\sin\Theta & \cos\Theta 
\end{array}\right)
\quad\qquad
\hat{\Lambda}=\left(\begin{array}{cc} 
\sqrt{\lambda_1} & 0 \\ 0 & \sqrt{\lambda_2}
\end{array}\right)
$$
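 
sketch checking numerically that $\mathbf{x}=\hat{V}\hat{\Lambda}\mathbf{T}$
reproduces the parametric form $x(t)$, $y(t)$ given above; angle and
eigenvalues are illustrative

import numpy as np                                   # numerics

theta, lam1, lam2 = 0.3, 2.0, 0.5                    # illustrative values
V = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])      # eigenvector matrix
L = np.diag([np.sqrt(lam1), np.sqrt(lam2)])          # Lambda
t = 1.2                                              # some curve parameter
T = np.array([np.cos(t), np.sin(t)])
x_matrix = V @ L @ T                                 # V . Lambda . T
x_param  = np.sqrt(lam1)*np.cos(theta)*np.cos(t) - np.sqrt(lam2)*np.sin(theta)*np.sin(t)
y_param  = np.sqrt(lam1)*np.sin(theta)*np.cos(t) + np.sqrt(lam2)*np.cos(theta)*np.sin(t)
print(x_matrix, [x_param, y_param])                  # identical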
 
 
 
plotting a covariance ellipse
-  if (__name__ == "__main__"):  
 not executed when imported as a module
 allows test code to be included in the module
-  $A=\sum_i\lambda_i |\lambda_i\rangle\langle\lambda_i|$
 matrix decomposition in terms of eigensystem
 projection operators   $|\lambda\rangle\langle\lambda|$  
     are outer products
-  np.random.multivariate_normal(mean, 
           coVar, N)
 draws  N  samples from a
 multi-variate Gaussian with covariance matrix 
      coVar
-  two covariance matrices: 
 one used for data generation, one measured from the generated data
 
 
 
#!/usr/bin/env python3
import math                      # math
import numpy as np               # numerics
import matplotlib.pyplot as plt  # plotting
from numpy import linalg as LA   # linear algebra
# ***
# *** covariance matrix from data
# ***
def covarianceMatrix(data):
  '''normalized covariance matrix of input data'''
  nRow = len(data)                     # number of data points
  nCol = len(data[0])                  # dimension
  mean = np.zeros(nCol)                # vector mean (accumulated below)
  for thisRow in data:                 # loop over data points
    mean += thisRow*1.0/nRow           # vector-mean
#
  coVar = np.zeros((nCol,nCol))        # empty matrix
  for thisRow in data:
    dataPoint = thisRow - mean         # shifted data point
    coVar += np.outer(dataPoint,dataPoint)/nRow    # outer product
#
  if (1==2):
    print("\ninput data\n", data)
    print("nRow, nCol\n", nRow, nCol)
    print("mean\n", mean)
  print("\n covariance matrix: measured \n",coVar)
#
  return mean, coVar                   # returning both
# ***
# *** generate data for plotting an ellipse
# ***
def plotEllipseData(data):
  "generates ellipse from symmetric 2x2 slice of the input matrix"
#
  slice22 = [inner[:2] for inner in data[:2]]   # slice
  slice22[0][1] = slice22[1][0]                 # symmetrize
                                                # ^ should not be necessary
  eigenValues, eigenVectors = LA.eig(slice22)   # eigensystem
#
  if (eigenValues[0]<0.0) or (eigenValues[1]<0.0):
    print("# plotEllipseData: only positive eigenvalues (variance)")
    return [], []                               # avoid unpacking None in the caller
#
  if (1==2):
    print("\n# coVar     \n", coVar  )
    print("\n# slice22     ", slice22)
    print("\n# eigenValues ", eigenValues)
    print("\n# eigenVectors\n",eigenVectors)
#
  a = math.sqrt(eigenValues[0])
  b = math.sqrt(eigenValues[1])
  cTheta = eigenVectors[0][0]
  sTheta = eigenVectors[1][0]
  x = []
  y = []
  for i in range(nPoints:=101):          # walrus assignment
    tt = i*2.0*math.pi/(nPoints-1)       # full loop
    cc = math.cos(tt)
    ss = math.sin(tt)
    xx = a*cTheta*cc - b*sTheta*ss
    yy = a*sTheta*cc + b*cTheta*ss
    x.append(xx)
    y.append(yy)
#   print(xx,yy)
  return x, y
# ***
# *** generate 2D test data
# ***
def testData(angle, var1, var2, nData, startMean=[0.0,0.0]):
  "2D Gaussian for a given angle and main variances.\
   A = \sum_i \lambda_i |lambda_i><lambda_i|"
#
  eigen1 = [math.cos(angle),-math.sin(angle)]
  eigen2 = [math.sin(angle), math.cos(angle)]
  startCoVar  = var1*np.outer(eigen1,eigen1)
  startCoVar += var2*np.outer(eigen2,eigen2)
  print("\n covariance matrix: data generation \n",startCoVar)
  return np.random.multivariate_normal(startMean, startCoVar, nData)
# ***
# *** main (with a block)
# ***
if (__name__ == "__main__"):                 # blocking out-of-use access
#
  dataMatrix = testData(0.3*math.pi, 1.0, 9.0, 100)
  mean, coVar = covarianceMatrix(dataMatrix)
  if (1==2):
    print("\n#main: data\n", dataMatrix)
    print("#main: mean\n", mean)
    print("\n#main: coVar matrix \n",coVar)
  xEllipse, yEllipse = plotEllipseData(coVar)  # coVar-ellipse
  xData = [thisRow[0] for thisRow in dataMatrix]
  yData = [thisRow[1] for thisRow in dataMatrix]
  Z_90 = math.sqrt(4.605)                      # 90% confidence
  Z_95 = math.sqrt(5.991)                      # 95% confidence
  Z_99 = math.sqrt(9.210)                      # 99% confidence
#
  xE_90 = [Z_90*xx + mean[0] for xx in xEllipse]
  yE_90 = [Z_90*yy + mean[1] for yy in yEllipse]
  xE_99 = [Z_99*xx + mean[0] for xx in xEllipse]
  yE_99 = [Z_99*yy + mean[1] for yy in yEllipse]
#
  plt.plot(xE_90, yE_90, label="90%")
  plt.plot(xE_99, yE_99, label="99%")
  plt.plot(xData, yData, "ob", markersize=5)
#
  plt.legend(loc="upper left")
  plt.axis('square')                           # square plot
  plt.savefig('foo.svg')                       # export figure
  plt.show()
  
 
python intermezzo
-  loading a private (user-defined) module 
 :: plotting the covariance ellipse
-  __pycache__
 :: directory created when importing private files
-  imports are not transferred 
 :: modules imported inside the loaded file do not enter the importing namespace
     
#!/usr/bin/env python3
import math                        # math
import matplotlib.pyplot as plt    # plotting
import ML_covarianceMatrix as CM   # loading user-defined module
dataMatrix  = CM.testData(0.3*math.pi, 1.0, 9.0, 100)
mean, coVar = CM.covarianceMatrix(dataMatrix)
xEllipse, yEllipse = CM.plotEllipseData(coVar)  # coVar-ellipse
xData = [thisRow[0] for thisRow in dataMatrix]
yData = [thisRow[1] for thisRow in dataMatrix]
Z_90 = math.sqrt(4.605)                      # 90% confidence
Z_95 = math.sqrt(5.991)                      # 95% confidence
Z_99 = math.sqrt(9.210)                      # 99% confidence
xE_90 = [Z_90*xx + mean[0] for xx in xEllipse]
yE_90 = [Z_90*yy + mean[1] for yy in yEllipse]
xE_99 = [Z_99*xx + mean[0] for xx in xEllipse]
yE_99 = [Z_99*yy + mean[1] for yy in yEllipse]
plt.plot(xE_90, yE_90, label="90%")
plt.plot(xE_99, yE_99, label="99%")
plt.plot(xData, yData, "ob", markersize=5)
plt.legend(loc="upper left")
plt.axis('square')                           # square plot
plt.savefig('foo.svg')                       # export figure
plt.show()
  
 
 multivariate Gaussians 
-  uni/multi-variate : one-/multi-dimensional
     
-  one-dimensional Gaussian
$$
N(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x-\mu)^2/(2\sigma^2)}
$$
     
-  multivariate
$$
N(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^d|\hat{S}|}}
\exp\left(\frac{-1}{2}({\mathbf x}-{\mathbf m})^T{\hat{S}}^{-1}
           ({\mathbf x}-{\mathbf m})\right)
$$
     
diagonal case
$$
\hat{S}=\left(\begin{array}{ccc}
\sigma_0^2 & \dots & 0 \\
\vdots     & \ddots & \vdots \\
0          & \dots  & \sigma_{d-1}^2
\end{array}\right),
\qquad\quad
\sigma_k^2 = \big\langle (x_k-m_k)^2\big\rangle
$$
-  if $\ \ \big\langle (x_i-m_i)(x_j-m_j)\big\rangle=0\quad \mbox{for}\quad i\ne j$
 : components with $\ \ i\ne j\ \ $ are pairwise uncorrelated
$$
|\hat{S}| = \prod_k \sigma_k^2,
\qquad\quad
\hat{S}^{-1}=\left(\begin{array}{ccc}
1/\sigma_0^2 & \dots & 0 \\
\vdots     & \ddots & \vdots \\
0          & \dots  & 1/\sigma_{d-1}^2
\end{array}\right)
$$
-  multivariate Gaussian factorizes
     
$$
N(\mathbf{x}) = \prod_k
 \frac{\exp(-(x_k-m_k)^2/(2\sigma_k^2))}{\sqrt{2\pi\sigma_k^2}}
$$
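 
numerical check (sketch) that the multivariate density with a diagonal
$\hat{S}$ equals the product of one-dimensional Gaussians; all values are
illustrative

import numpy as np                                   # numerics

m     = np.array([1.0, -1.0])                        # mean
sigma = np.array([0.5,  2.0])                        # standard deviations
S     = np.diag(sigma**2)                            # diagonal covariance matrix
x     = np.array([0.3,  0.7])                        # test point
d     = len(m)
diff  = x - m
N_multi = np.exp(-0.5*diff @ np.linalg.inv(S) @ diff) \
          / np.sqrt((2.0*np.pi)**d*np.linalg.det(S)) # multivariate formula
N_prod  = np.prod(np.exp(-diff**2/(2.0*sigma**2))
                  / np.sqrt(2.0*np.pi*sigma**2))     # product of 1D Gaussians
print(N_multi, N_prod)                               # identical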
 
statistics of features
-  a linear classifier combines the features of the data via
     $\ \ \mathbf{w}\cdot\mathbf{x}_\alpha,\qquad \alpha=1,2\ \ $ (class label)
     
-  mean $\ \ \mu_\alpha=\mathbf{w}\cdot\mathbf{m}_\alpha$
     
-  variance
$$
\sigma_\alpha^2=\left\langle
     (\mathbf{w}\cdot\mathbf{x}_\alpha
     -\mathbf{w}\cdot\mathbf{m}_\alpha)^2
\right\rangle = \left\langle
     (\mathbf{w}\cdot(\mathbf{x}_\alpha -\mathbf{m}_\alpha))^2
\right\rangle = \left\langle
     (\mathbf{w}^T(\mathbf{x}_\alpha -\mathbf{m}_\alpha))^2
\right\rangle 
$$
or
$$
\sigma_\alpha^2=\left\langle
\mathbf{w}^T
(\mathbf{x}_\alpha -\mathbf{m}_\alpha)\,
(\mathbf{x}_\alpha -\mathbf{m}_\alpha)^T
\mathbf{w}
\right\rangle =
\mathbf{w}^T \left\langle
(\mathbf{x}_\alpha -\mathbf{m}_\alpha)\,
(\mathbf{x}_\alpha -\mathbf{m}_\alpha)^T
\right\rangle \mathbf{w}
$$
     
in terms of the covariance matrix
$$
\fbox{$\phantom{\big|}\sigma_\alpha^2=
\mathbf{w}^T \hat S_\alpha \mathbf{w}\phantom{\big|}$}
\qquad\quad
\hat S_\alpha =
\big\langle (\mathbf{x}_\alpha-\mathbf{m}_\alpha)\,
            (\mathbf{x}_\alpha-\mathbf{m}_\alpha)^T\big\rangle
$$
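 
sketch verifying empirically that the variance of the projection
$\mathbf{w}\cdot\mathbf{x}$ equals $\mathbf{w}^T\hat{S}\mathbf{w}$; covariance
matrix and weight vector are illustrative

import numpy as np                                   # numerics

S_true = np.array([[4.0, 1.5], [1.5, 1.0]])          # illustrative covariance
x = np.random.multivariate_normal([0.0, 0.0], S_true, 100000)  # one class
w = np.array([0.6, 0.8])                             # some weight vector, |w| = 1
z = x @ w                                            # projections w.x_alpha
print(z.var(), w @ S_true @ w)                       # agree up to sampling noise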
 
linear discriminant analysis (LDA)
-  linear discriminant, mean, variance
$$
     \mathbf{w}\cdot\mathbf{x}_\alpha,\qquad\quad
\mu_\alpha=\mathbf{w}\cdot\mathbf{m}_\alpha,\qquad\quad
\sigma_\alpha^2=\mathbf{w}^T \hat S_\alpha \mathbf{w}
$$
     
-  many linear classifiers exist
     
Fisher objective function
-  maximize $\ \ (\mu_1-\mu_2)^2\ \ $ 
 keeping the combined variance $\ \ \sigma_1^2+\sigma_2^2\ \ $ constant
task
-  find feature vector $\ \ \mathbf{w}\ \ $ maximizing
     
$$
S = \frac{(\mu_1-\mu_2)^2}{\sigma_1^2+\sigma_2^2}
  = \frac{(\mathbf{w}\cdot(\mathbf{m}_1-\mathbf{m}_2))^2}
         {\mathbf{w}^T (\hat{S}_1+\hat{S}_2) \mathbf{w}}
  = \frac{\mathbf{w}^T(\mathbf{m}_1-\mathbf{m}_2)
                      (\mathbf{m}_1-\mathbf{m}_2)^T\mathbf{w}}
         {\mathbf{w}^T (\hat{S}_1+\hat{S}_2) \mathbf{w}}
$$
$$
S = \frac{\mathbf{w}^T\hat{A}\mathbf{w}}
         {\mathbf{w}^T \hat{B}\mathbf{w}}
\qquad\quad
\hat{A}=(\mathbf{m}_1-\mathbf{m}_2)(\mathbf{m}_1-\mathbf{m}_2)^T
\qquad\quad
\hat{B}=\hat{S}_1+\hat{S}_2
$$
-  quantum mechanics:
 $\mathbf{w}\ \ $ wavefunction
 $\hat{A}\ \ $ Hamilton operator
 $\hat{B}\ \ $ identity matrix
 $S\ \ $ expectation value of energy
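 
sketch evaluating the Fisher objective $S=\mathbf{w}^T\hat{A}\mathbf{w}/
\mathbf{w}^T\hat{B}\mathbf{w}$ for the naive choice
$\mathbf{w}\propto\mathbf{m}_1-\mathbf{m}_2$ and for the LDA solution derived
below; the two Gaussian classes are illustrative

import numpy as np                                   # numerics

np.random.seed(0)                                    # reproducibility
x1 = np.random.multivariate_normal([ 2.0, 0.0], [[3.0, 1.0], [ 1.0, 1.0]], 500)
x2 = np.random.multivariate_normal([-2.0, 0.0], [[2.0,-0.8], [-0.8, 1.5]], 500)
m1, m2 = x1.mean(axis=0), x2.mean(axis=0)            # class means
A = np.outer(m1-m2, m1-m2)                           # A = (m1-m2)(m1-m2)^T
B = np.cov(x1, rowvar=False) + np.cov(x2, rowvar=False)   # B = S_1 + S_2

def fisherScore(w):                                  # S = w^T A w / w^T B w
  return (w @ A @ w)/(w @ B @ w)

w_naive = m1 - m2                                    # intuition only
w_lda   = np.linalg.solve(B, m1 - m2)                # LDA solution (see below)
print(fisherScore(w_naive), fisherScore(w_lda))      # LDA score is at least as large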
 
 
LDA - maximization
$$
S = \frac{\mathbf{w}^T\hat{A}\mathbf{w}}
         {\mathbf{w}^T \hat{B}\mathbf{w}}
\qquad\quad
\hat{A}=(\mathbf{m}_1-\mathbf{m}_2)(\mathbf{m}_1-\mathbf{m}_2)^T
\qquad\quad
\hat{B}=\hat{S}_1+\hat{S}_2
$$
-  variation
$$
\delta S=
\frac{\mathbf{w}^T\hat{A}\,\delta\mathbf{w}}{\mathbf{w}^T \hat{B}\mathbf{w}}
-
\frac{\mathbf{w}^T\hat{A}\mathbf{w}}{(\mathbf{w}^T \hat{B}\mathbf{w})^2}
\mathbf{w}^T \hat{B}\,\delta\mathbf{w}
$$
plus terms in $\ \ (\delta\mathbf{w})^T$
     
-  stationarity
$$
\delta S=0\qquad\quad
\big(\mathbf{w}^T \hat{B}\mathbf{w}\big)\big(\mathbf{w}^T\hat{A}\big)=
\big(\mathbf{w}^T \hat{A}\mathbf{w}\big)\big(\mathbf{w}^T\hat{B}\big)
$$
     
proportionality factors
$$
\mathbf{w}^T\hat{A}=\mathbf{w}^T
\big((\mathbf{m}_1-\mathbf{m}_2)(\mathbf{m}_1-\mathbf{m}_2)^T\big)
\propto
(\mathbf{m}_1-\mathbf{m}_2)^T
$$
-  scalar quantities
$$
\mathbf{w}^T(\mathbf{m}_1-\mathbf{m}_2)\qquad\quad
\mathbf{w}^T \hat{A}\mathbf{w}\qquad\quad
\mathbf{w}^T \hat{B}\mathbf{w}
$$
     
-  proportionality 
$$
(\mathbf{m}_1-\mathbf{m}_2)^T\propto \mathbf{w}^T\hat{B}
\qquad\quad
\mathbf{m}_1-\mathbf{m}_2\propto 
\hat{B}^T \mathbf{w}= \hat{B} \mathbf{w}
\qquad\quad
\hat{B}^T = \hat{B} 
$$
     
-  the solution for the feature vector
$$
\fbox{$\phantom{\big|}
\mathbf{w}\propto \big(\hat{S}_1+\hat{S}_2\big)^{-1}
\big(\mathbf{m}_1-\mathbf{m}_2\big)
      \phantom{\big|}$}
\qquad
\hat{B}=\hat{S}_1+\hat{S}_2
$$
may be normalized to unity
     
regularization
-  the system of linear equations
$$
\big(\hat{S}_1+\hat{S}_2\big)\mathbf{w}=
\mathbf{m}_1-\mathbf{m}_2
\qquad\quad
\hat{S}_1+\hat{S}_2\ \to\ (1-\lambda)\big(\hat{S}_1+\hat{S}_2\big)
                            +\lambda \hat{I}
$$
    may be regularized for a singular combined covariance matrix
    $\ \ \hat{S}_1+\hat{S}_2$
     
-  shrinking intensity (regularization parameter): $\ \ \lambda$
     
-  identity matrix: $\ \ \hat{I}$
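 
sketch of the shrinkage regularization before solving for $\mathbf{w}$; the
covariance matrices are chosen such that $\hat{S}_1+\hat{S}_2$ is singular, and
$\lambda$ is an illustrative value

import numpy as np                                   # numerics

def ldaWeights(S1, S2, m1, m2, lam=0.1):             # lam: shrinking intensity
  B = S1 + S2                                        # combined covariance
  B_reg = (1.0-lam)*B + lam*np.eye(len(B))           # regularized matrix
  w = np.linalg.solve(B_reg, m1 - m2)                # (S1+S2) w = m1 - m2
  return w/np.sqrt(np.sum(w**2))                     # normalize to unity

S1 = np.array([[1.0, 1.0], [1.0, 1.0]])              # singular
S2 = np.array([[1.0, 1.0], [1.0, 1.0]])              # S1 + S2 singular as well
m1 = np.array([ 1.0, 0.0])
m2 = np.array([-1.0, 0.0])
print(ldaWeights(S1, S2, m1, m2))                    # without lam>0, solve would fail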
     
 
LDA threshold
-  already determined: $\ \ \mathbf{w}$
$$
     z_\alpha=\mathbf{w}\cdot\mathbf{x}_\alpha\qquad\quad
\mu_\alpha=\mathbf{w}\cdot\mathbf{m}_\alpha\qquad\quad
\sigma_\alpha^2=\mathbf{w}^T \hat S_\alpha \mathbf{w}\qquad\quad
|\mathbf{w}|=1
$$
     
-  now determine the threshold $\ \ b:\quad\quad \mathbf{w}\cdot\mathbf{x}=b$
     
 
 
misclassifications
$$
E= \int_{-\infty}^b p_2(z)\,dz + \int_b^{\infty} p_1(z)\,dz 
$$
$$
\frac{\partial E}{\partial b} = 0
\qquad\quad
\fbox{$\phantom{\big|} p_1(b) = p_2(b)\phantom{\big|}$}
$$
-  principle of equal probabilities 
     
-  threshold close to $\ \ (\mu_1+\mu_2)/2\ \ $ for comparable variances
 otherwise: use the Gaussian approximation for $\ p_1(z),\ p_2(z)$
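 
sketch determining the threshold $b$ from $p_1(b)=p_2(b)$ in the Gaussian
approximation of the projected class distributions; the means and variances
are illustrative

import numpy as np                                   # numerics

def gauss(z, mu, sig):                               # 1D Gaussian density
  return np.exp(-(z-mu)**2/(2.0*sig**2))/np.sqrt(2.0*np.pi*sig**2)

mu1, sig1 = -2.0, 1.0                                # class 1: projected mean/std
mu2, sig2 =  3.0, 2.0                                # class 2: projected mean/std
b_grid = np.linspace(mu1, mu2, 100001)               # search between the means
diff   = gauss(b_grid, mu1, sig1) - gauss(b_grid, mu2, sig2)
b      = b_grid[np.argmin(np.abs(diff))]             # p_1(b) = p_2(b)
print(b, 0.5*(mu1+mu2))                              # compare with the midpoint (mu1+mu2)/2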
 
basic LDA code
-  solving 
 $\displaystyle\qquad\qquad
\big(\hat{S}_1+\hat{S}_2\big)\mathbf{w}= \mathbf{m}_1-\mathbf{m}_2
$
-  numpy.linalg.inv
 ::  matrix inversion
-  numpy.matmul
 ::  matrix multiplication
-  np.sqrt(np.sum(vector**2))
 ::  vector norm
 
 
#!/usr/bin/env python3
import math                        # math
import matplotlib.pyplot as plt    # plotting
import numpy as np 
from numpy.linalg import inv       # inverse matrix
import ML_covarianceMatrix as CM   # loading user-defined module
#
# --- data generation
#
dataMatrix_A  = CM.testData( 0.4*math.pi, 1.0, 9.0, 30, [ 6.0,0.0])
dataMatrix_B  = CM.testData(-0.4*math.pi, 1.0, 9.0, 30, [-6.0,0.0])
mean_A, coVar_A = CM.covarianceMatrix(dataMatrix_A)  
mean_B, coVar_B = CM.covarianceMatrix(dataMatrix_B)  
#
# --- LDA
#
S_inverse    = inv(coVar_A+coVar_B)                     # built-in inverse
w_vector     = np.matmul(S_inverse,mean_B-mean_A)       # LDA
w_normalized = w_vector / np.sqrt(np.sum(w_vector**2))  # normalization
w_orthogonal = [w_normalized[1], -w_normalized[0]]      # \perp vector
LL = 5.0
midPoint = 0.5*(mean_B+mean_A)
x_lower  = midPoint[0] - LL*w_orthogonal[0]
y_lower  = midPoint[1] - LL*w_orthogonal[1]
x_upper  = midPoint[0] + LL*w_orthogonal[0]
y_upper  = midPoint[1] + LL*w_orthogonal[1]
x_LDA_plane = [x_lower, x_upper]
y_LDA_plane = [y_lower, y_upper]
if (1==2):
  print("\n#main w_orthogonal\n", w_orthogonal)
  print("\n#main midPoint    \n", midPoint    )
#
# --- printing
#
xEllipse_A, yEllipse_A = CM.plotEllipseData(coVar_A)  # coVar-ellipse
xEllipse_B, yEllipse_B = CM.plotEllipseData(coVar_B)  # coVar-ellipse
xData_A = [thisRow[0] for thisRow in dataMatrix_A]
yData_A = [thisRow[1] for thisRow in dataMatrix_A]
xData_B = [thisRow[0] for thisRow in dataMatrix_B]
yData_B = [thisRow[1] for thisRow in dataMatrix_B]
Z_95 = math.sqrt(5.991)                      # 95% confidence  
 
xE_95_A = [Z_95*xx + mean_A[0] for xx in xEllipse_A]
yE_95_A = [Z_95*yy + mean_A[1] for yy in yEllipse_A]
xE_95_B = [Z_95*xx + mean_B[0] for xx in xEllipse_B]
yE_95_B = [Z_95*yy + mean_B[1] for yy in yEllipse_B]
x_connectMean = [ mean_A[0], mean_B[0] ]
y_connectMean = [ mean_A[1], mean_B[1] ]
 
plt.plot(xE_95_A, yE_95_A, "k", linewidth=2.0, label="95%")
plt.plot(xE_95_B, yE_95_B, "b", linewidth=2.0, label="95%")
plt.plot(xData_A, yData_A, "ok", markersize=5)
plt.plot(xData_B, yData_B, "ob", markersize=5)
plt.plot(x_connectMean, y_connectMean, "r", linewidth=3.0)
plt.plot(x_connectMean, y_connectMean, "or", markersize=9.0)
plt.plot(x_LDA_plane, y_LDA_plane, "g", linewidth=4.0, label="LDA plane")
 
plt.legend(loc="upper left")
#plt.axis('square')                           # square plot
plt.savefig('foo.svg')                       # export figure
plt.show()