# Advanced Introduction to C++, Scientific Computing and Machine Learning

Claudius Gros, SS 2024

Institut für theoretische Physik
Goethe-University Frankfurt a.M.

# statistical evaluation of integrals

• evaluation of multi-dimensional integrals
• $$I \ =\ \int \, f(x_1,x_2,\dots,x_N)\,dx_1\,dx_2\cdots dx_N$$
• e.g. $1000$ trapezoids per dimension
• $$\left(10^3\right)^N \ =\ 10^{3N} \qquad\quad \mbox{function evaluations}$$
• not feasible for large $N$

Monte Carlo sampling

• evaluate the integral by sampling statistically the integrand $\ f(x_1,x_2,\dots,x_N)$

Metropolis algorithm

• evaluate the integral by sampling statistically the
most important terms of the integrand $\ f(x_1,x_2,\dots,x_N)$
(importance sampling)

# integration as averaging

$\hspace{10ex} \displaystyle I \ =\ \int_a^b \, f(x)\, dx \ \approx\ \sum_{i=1}^N f(x_i)\,\Delta x$
• box width $\ \Delta x = \frac{b-a}{N}$
$\hspace{10ex} \displaystyle I \ \approx\ (b-a) \left(\frac{1}{N} \sum_i\, f(x_i) \right) \ =\ (b-a)\, \langle f\rangle$
• expectation value (average) $\ \ \ \langle f\rangle$
of integrand
$$I\ =\ V\, \langle f\rangle$$
• volume $\ V$
• expectation values may be evaluated statistically

# area integration

$\hspace{10ex} \displaystyle I\ = \ \int_{-a}^a f(x,y)\,dx\,dy$

with

$\hspace{10ex} \displaystyle f(x,y)\ =\ \left\{ \begin{array}{ccl} 1 && \mbox{inside}\\ 0 && \mbox{outside} \end{array} \right.$

• evaluation of $\ \pi$
/** Evaluating \pi with simple Monte Carlo evaluation.
*/
#include <iostream>
#include <stdio.h>     // for printf
#include <stdlib.h>    // srand, rand
#include <time.h>      // for time()
#include <math.h>      // for pow, etc.
#include <fstream>     // file streams
using namespace std;

double myFunction(double x, double y, double x0, double x1, double y0, double y1)
{
if ( (x<x0)||(x>x1)||(y<y0)||(y>y1) )
{
printf("# myFunction(): x or y value out of range \n");
return 0.0;
}
if (y*y<=1.0-x*x)     // x*x+y*y=1 is the circle equation
return 1.0;
else
return 0.0;
} // end of myFunction()

// ---
// --- main
// ---
int main()
{
const double x0 = 0.0, x1 = 1.0;    // x-bounds
const double y0 = 0.0, y1 = 1.0;    // y-bounds
//
int N = 1000000;                           // number of draws
int deltaN = N/10;                         // printout intervals
double x, y, sum = 0;
srand(time(NULL));                         // seed for rand()
// --- loop over all random evaluations
printf("%8s %8s %8s\n","# steps","pi", "error");
for (int i=0; i<N; i++)
{
x = (rand()/(double)RAND_MAX);       // random draws
y = (rand()/(double)RAND_MAX);       // random draws
sum += myFunction(x,y,x0,x1,y0,y1);
double result = 4.0*sum/(i+1.0);
if ((i+1)%deltaN==0)
printf("%8d %8.5f %8.5f\n",i+1,result,(M_PI-result)/M_PI);
}
printf("%s\n","-----------------");
printf("%8s %8.5f\n","infinity",M_PI);
return 0;
}  // end of main()


# basic statistics

statistics

• mean (expectation value)
• $$\langle f\rangle \ =\ \frac{1}{N}\sum_{i=1}^N f(x_i)$$
• variance $\ \sigma^2$
standard deviation $\ \sigma$
• $$\sigma^2 \ =\ \frac{1}{N}\sum_{i=1}^N \Big(f(x_i)-\langle f\rangle \Big)^2$$
• note: $\ \sigma$ is the standard deviation of the process,
not an estimate for the accuracy of $\ \langle f\rangle$
• question: how do we estimate the accuracy of the measurement $\ \langle f\rangle$?

# Monte Carlo - convergence

data binning

• divide the $\ N$ measurements into $\ k=1,\dots,K$ bins
containing $\ N_k=N/K$ data points each
• $$\langle f\rangle_k \ =\ \frac{1}{N_k} \sum_{i=(k-1)N_k+1}^{k N_k} f(x_i), \qquad\quad k=1,\dots,K$$
• example: $\ N_k=3$
• $$\underbrace{f(x_1)\ \ f(x_2)\ \ f(x_3)}_{\langle f\rangle_1}\ \ \underbrace{f(x_4)\ \ f(x_5)\ \ f(x_6)}_{\langle f\rangle_2}\ \ \underbrace{f(x_7)\ \ f(x_8)\ \ f(x_9)}_{\langle f\rangle_3}\ \ \dots$$
• the accuracy of $\ \langle f\rangle$ is given
by the standard deviation of $\ \langle f\rangle_k$
• $$\Delta \langle f\rangle \ =\ \sqrt{\frac{1}{K}\sum_{k=1}^K \Big(\langle f\rangle_k-\langle f\rangle\Big)^2}$$
• error estimate: $\ \Delta \langle f\rangle$
• usually $\ K\sim 10,\, 20$,
so that $\ N_k$ remains large

law of large numbers

$$\Delta \langle f\rangle \ \sim\ \frac{1}{\sqrt{N}}$$
• convergence very slow

# Monte Carlo with error estimation

• evaluation of $\ \pi$ (again)
/** Evaluating \pi with simple Monte Carlo evaluation,
* calculating variance and standard deviation (error).
* On input: number of measurements N.
*/
#include <iostream>
#include <stdio.h>     // for printf
#include <stdlib.h>    // srand, rand
#include <time.h>      // for time()
#include <math.h>      // for pow, etc.
#include <fstream>     // file streams

using namespace std;

double myFunction(double x, double y, double x0, double x1, double y0, double y1)
{
if ( (x<x0)||(x>x1)||(y<y0)||(y>y1) )
{
printf("# myFunction(): x or y value out of range \n");
return 0.0;
}
if (y*y<=1.0-x*x)     // x*x+y*y=1 is the circle equation
return 1.0;
else
return 0.0;
} // end of myFunction()

// ---
// --- main
// ---
int main(int argLength, char* argValues[])   // parsing command line arguments
{
const double x0 = 0.0, x1 = 1.0;    // x-bounds
const double y0 = 0.0, y1 = 1.0;    // y-bounds
//
int N = 0;                          // number of MC steps
if (argLength==2)
N = atoi(argValues[1]);
else
{
printf("on input N: number of MC steps\n");
return 0;
}
//
// --- other parameters and variables
//
const int K = 20;                                // number of bins
int nK = N/K;                                    // number of data per bin
double* data = new double[N];                    // storing all measurements
double* meanBin = new double[K];                 // means of bins
double mean = 0.0, sigma = 0.0;                  // result: mean and error
double x, y;
//
// --- loop over all random evaluations
//
srand(time(NULL));                         // seed for rand()
for (int i=0; i<N; i++)
{
x = (rand()/(double)RAND_MAX);              // random draws
y = (rand()/(double)RAND_MAX);              // random draws
data[i] = 4.0*myFunction(x,y,x0,x1,y0,y1);      // storing measurement
}
//
// --- evaluate the overall mean and the mean of the bins
//
for (int i=0; i<N; i++)
mean += data[i];
mean = mean/N;
//
for (int k=0; k<K; k++)
{
meanBin[k] = 0.0;
for (int n=0; n<nK; n++)                     // nK data points in each bin
meanBin[k] += data[k*nK+n];
meanBin[k] = meanBin[k]/nK;
printf("# (mean of bin) - mean : %8.5f\n",meanBin[k]-mean);
}  // end of loop over all bins
//
// --- evaluate the error: variance of means of bins
//
for (int k=0; k<K; k++)
sigma += (meanBin[k]-mean)*(meanBin[k]-mean);
sigma = sqrt(sigma/K);
//
// --- printout of result
//
printf(" number of measurements : %d\n",N);
printf(" number of bins         : %d\n",K);
printf(" data points per bin    : %d\n",nK);
printf(" estimate for pi        : %8.5f\n",mean);
printf(" estimate for accuracy  : %8.5f\n",sigma);
printf(" true error             : %8.5f\n",M_PI-mean);
//
delete[] data; delete[] meanBin;    // free heap allocations
return 0;
}  // end of main()


# integrals over distribution functions

• probability distribution function $\ \rho(x)$
• $$I \ =\ \int g(x)\,\rho(x)\,dx, \qquad\quad \rho(x)\ \ge\ 0, \qquad\quad \int\rho(x)\,dx\ =\ 1$$
• example: Boltzmann factor $\ e^{-\beta E_\lambda}$
probability that a state with energy $\ E_\lambda$
is thermally occupied
• $$\rho_\lambda \ =\ \frac{e^{-\beta E_\lambda}}{Z}, \qquad\quad Z \ =\ \sum_\alpha\, e^{-\beta E_\alpha}$$
• inverse temperature $\ \beta = 1/(k_B T)$
partition function $\ Z$

importance sampling

• the integral can be dominated by small regions of
phase space with large values of $\ \rho(x)$
$\hspace{6ex}\displaystyle \rho(x) \ =\ \frac{1}{y_n}\frac{x^n}{x+a}, \quad\quad y_n\ =\ \int_0^1\, \frac{x^n}{x+a}\, dx$
• importance sampling
Monte Carlo sampling of phase space with probability $\rho(x)$ to sample $\ x$

# random walk through configuration space

• random walk
• $$x_1, \ \ x_2,\ \ x_3,\ \ \dots, \ \ x_N$$
• probability to visit a given phase space $\ x$
• $$P_N(x) \ =\ \frac{1}{N}\sum_{n=1}^N\delta(x-x_n), \qquad\quad \int P_N(x)dx = 1$$
• construct walk with correct limiting behaviour
• $$\lim_{N\to\infty} P_N(x)\ =\ \rho(x)$$
• we then have
• \begin{eqnarray} I& =& \int g(x)\rho(x)dx \ =\ \lim_{N\to\infty} \int g(x)P_N(x)dx\\ & =& \lim_{N\to\infty} \frac{1}{N}\sum_{n=1}^N\int g(x)\delta(x-x_n)dx \end{eqnarray}
• and
• $$I\ =\ \lim_{N\to\infty} \frac{1}{N}\sum_{n=1}^N g(x_n)$$
• this is the basis of importance sampling:
the important parts of phase space are visited (sampled) often,
the unimportant parts seldom,
by the random walk $\ \{x_n\}$

# Metropolis algorithm

• method to generate a random walk $\ \{x_n\}$
with the correct limiting behaviour
• $$P_N(x) \ =\ \frac{1}{N}\sum_{n=1}^N\delta(x-x_n), \qquad\quad \lim_{N\to\infty} P_N(x)\ =\ \rho(x)$$
• Markov chain
random walk where the probability to select $\ x_{n+1}$
depends only on $\ x_n$
(and not on $\ x_{n-1},\ x_{n-2},\ \dots$: no memory)

transition rules

• select a candidate $\ \tilde x_{n+1}$ randomly
• draw a random number $\ r\in[0,1]$
• $$\mbox{if} \quad \left\{ \begin{array}{c} {\rho(\tilde x_{n+1})}/{\rho(x_n)} \ge r \\ {\rho(\tilde x_{n+1})}/{\rho(x_n)} < r \end{array} \right\} \quad\mbox{then}\quad \left\{ \begin{array}{rrcl} \mbox{accept candidate:} & x_{n+1}&=&\tilde x_{n+1} \\ \mbox{reject candidate:} & x_{n+1}&=& x_{n} \end{array} \right.$$
• repeat
• transition probability from state $\ x\$ to state $\ y$
• $$P(x\to y) \ =\ \min\!\left(1,\ \frac{\rho(y)}{\rho(x)}\right)$$

# detailed balance

• Markov chain $\ \{x_n\}$ with transition probability
• $$P(x_n\to x_{n+1}) \ =\ \min\!\left(1,\ \frac{\rho(x_{n+1})}{\rho(x_n)}\right)$$
• two states $\ x\$ and $\ y\$ in phase space
with $\ \rho(y)>\rho(x)$
• (relative) number of transitions $\ x\to y$ $$T(x\to y) \ =\ P_N(x)\,P(x\to y) \ =\ P_N(x)$$ since $\ \rho(y)/\rho(x)>1$
• (relative) number of transitions $\ y\to x$ $$T(y\to x) \ =\ P_N(y)\,P(y\to x) \ =\ P_N(y)\,\rho(x)/\rho(y)$$ since $\ \rho(x)/\rho(y)<1$

detailed balance

• for a large number of walkers (random walks),
or for $\ N\to\infty$,
the probability $\ P_N(x)$ is determined
by the steady-state condition (detailed balance) $$T(y\to x) \ =\ T(x\to y)$$ since otherwise probability would accumulate indefinitely in parts of phase space
• $$P_N(x) \ =\ P_N(y)\rho(x)/\rho(y), \qquad\quad \frac{P_N(x)}{P_N(y)} \ =\ \frac{\rho(x)}{\rho(y)}$$
• the Metropolis algorithm hence correctly yields
• $$\lim_{N\to\infty} P_N(x)\ =\ \rho(x)$$

# evaluation of relative properties

$$y_n \ =\ \int_0^1 \frac{x^n}{x+a}\,dx,\qquad\quad y_{n+1} \ =\ \int_0^1 \frac{x^{n+1}}{x+a}\,dx \ =\ \int_0^1 x\,\frac{x^{n}}{x+a}\,dx$$
• with the Metropolis algorithm only relative quantities can be evaluated
• $$\frac{y_{n+1}}{y_n} \ =\ \int x\,\rho(x)\,dx,\qquad\quad \rho(x)\ =\ \frac{1}{y_n}\frac{x^n}{x+a}$$
• use statistics (error estimate) from Monte Carlo program
/** Evaluating y_{n+1}/y_n, with y_n = \int_0^1 x^n/(a+x) dx
* with Monte Carlo (Metropolis) algorithm.
* On input: number of Monte Carlo steps N.
*/
#include <iostream>
#include <stdio.h>     // for printf
#include <stdlib.h>    // srand, rand
#include <time.h>      // for time()
#include <math.h>      // for pow, etc.
#include <fstream>     // file streams

using namespace std;

/** The probability density.
*/
double myDensity(double x, double nPow, double aPar) {
return pow(x,nPow)/(x+aPar);
}

// ---
// --- main
// ---
int main(int argLength, char* argValues[])   // parsing command line arguments
{
const double aPar = 10.0;                 // parameter
const double nPow = 3.0;                  // power
srand(time(NULL));                  // seed for rand()
//
int N = 0;                          // number of MC steps
if (argLength==2)
N = atoi(argValues[1]);
else
{
printf("on input N: number of MC steps\n");
return 0;
}
//
// --- other parameters and variables
//
const int K = 20;                                // number of bins
int nK = N/K;                                    // number of data per bin
double* data = new double[N];                    // storing all measurements
double* meanBin = new double[K];                 // means of bins
double mean = 0.0, sigma = 0.0;                  // result: mean and error
double x = rand()/(double)RAND_MAX;              // random starting value
double y, ratio;                                 // variables
data[0] = x;                                     // first measurement
//
// --- loop over all Monte Carlo steps
//
for (int i=1; i<N; i++)
{
y = rand()/(double)RAND_MAX;
ratio = myDensity(y,nPow,aPar)/ myDensity(x,nPow,aPar);
//
double probability = rand()/(double)RAND_MAX;   // of acceptance
if (ratio>probability)
x = y;                                        // accept candidate
data[i] = x;                                    // storing measurement
}  // end of loop over all Monte Carlo draws
//
// --- evaluate the overall mean and the mean of the bins
//
for (int i=0; i<N; i++)
mean += data[i];
mean = mean/N;
//
for (int k=0; k<K; k++)
{
meanBin[k] = 0.0;
for (int n=0; n<nK; n++)                     // nK data points in each bin
meanBin[k] += data[k*nK+n];
meanBin[k] = meanBin[k]/nK;
}  // end of loop over all bins
//
// --- evaluate the error: variance of means of bins
//
for (int k=0; k<K; k++)
sigma += (meanBin[k]-mean)*(meanBin[k]-mean);
sigma = sqrt(sigma/K);
//
// --- printout of result
//
printf(" a  in  x^n/(x+a)       : %8.4f\n",aPar);
printf(" n  in  x^n/(x+a)       : %8.4f\n",nPow);
printf(" number of measurements : %d\n",N);
printf(" number of bins         : %d\n",K);
printf(" data points per bin    : %d\n",nK);
printf(" y_{n+1}/y_n  estimate  : %8.4f\n",mean);
printf(" estimate for accuracy  : %8.4f\n",sigma);
printf("\n");
printf(" ratio y_4/y_3 (exact)  : %8.4f\n",0.797490089);
//
delete[] data; delete[] meanBin;    // free heap allocations
return 0;
}  // end of main()