\section{Parallelization}

\begin{figure}
\center
\begin{tabular}{cccc}
\includegraphics[width=0.21\textwidth]{fig/macro.pdf} &
\includegraphics[width=0.21\textwidth]{fig/local1.pdf} &
\includegraphics[width=0.21\textwidth]{fig/global.pdf} &
\includegraphics[width=0.21\textwidth]{fig/local2.pdf} \\
(a)&(b)&(c)&(d)\\
\end{tabular}
\caption{(a) A triangular macro mesh, (b) domain decomposition after six global refinements and overlap for partition 0, (c) composite mesh after adaptation loop, (d) local mesh of process 0 after adaptation loop}
\label{f:full domain covering meshes2}
\end{figure}

Before we start with the application example, we give a short overview of the parallelization approach with full domain covering meshes used in AMDiS. The approach can be summarized by the following steps:

\begin{itemize}
\item Create a partitioning level by some adaptive or global refinements of the macro mesh.
\item Create a partitioning with approximately the same number of elements in each partition.
\item Every process computes on the whole domain, but adaptive mesh refinements are only allowed within the local partition including some overlap.
\item After each adaptive iteration, a new partitioning can be computed if the parallel load balance has become too bad. Partitionings are always computed at the same mesh level. The partitioning elements are weighted by the number of corresponding leaf elements (elements with no children).
\item At the end of each timestep or at the end of the parallel computation, a global solution is constructed by a partition of unity method (a weighted sum over all process solutions; see the formula below). 
\end{itemize}
Figure \ref{f:full domain covering meshes2} illustrates the concept of full domain covering meshes.
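The last step, the construction of a global solution by a partition of unity, can be written as follows (the notation is ours and only meant for orientation): if $u_i$ denotes the solution computed by process $i$ and $w_i \geq 0$ is a weight function that is, e.g., equal to one inside partition $i$, decays to zero across the overlap and vanishes elsewhere, the global solution is
\[
u(x) = \sum_i w_i(x)\, u_i(x), \qquad \sum_i w_i(x) = 1 \quad \mbox{for all } x \in \Omega .
\]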

The so-called three-level approach decouples the following levels:
\begin{enumerate}
\item Partitioning level: Basis for the partitioning. The partitioning level is built by the current leaf elements of the mesh when the parallelization is initialized.
\item Global coarse grid level: Number of global refinements, starting from the partitioning level, applied on the whole domain. The global coarse grid level is the coarsest level for all future computations. 
\item Local coarse grid level: Number of global refinements, starting from the partitioning level, applied within the local domain including neighbor elements. The local coarse grid level reduces the size of the partition overlap.
\end{enumerate}
Figure \ref{f:threelevels2} shows an example of the three-level approach. The dashed line marks the overlap of size 1 for the local partition, computed at the partitioning level.
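In this example, the three levels are controlled by the following parameters, which are discussed in detail in the parameter file section below; the values in angle brackets are placeholders for the respective numbers of refinement steps:
\begin{lstlisting}{}
heatMesh->global refinements:             <partitioning level>
heat->parallel->global coarse grid level: <global coarse grid level>
heat->parallel->local coarse grid level:  <local coarse grid level>
\end{lstlisting}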

The parallelization in AMDiS uses the {\it Message Passing Interface} MPI, see
\begin{lstlisting}{}
http://www-unix.mcs.anl.gov/mpi/ .
\end{lstlisting}


The partitioning itself is computed with the parallel graph partitioning library ParMETIS, see
\begin{lstlisting}{}
http://glaros.dtc.umn.edu/gkhome/metis/parmetis/overview .
\end{lstlisting}

To use AMDiS in parallel, it must be configured as follows:
\begin{lstlisting}{}
> ./configure --prefix=<AMDIS-DIR> 
              --with-mpi=<MPI-DIR>
              --with-parmetis=<PARMETIS-DIR> .
\end{lstlisting}

The parallel application (in this example \verb+parallelheat+) must be compiled with the MPI C++ compiler \verb+mpiCC+ and linked against the ParMETIS library.
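A corresponding compile command could, for instance, look as follows; the include and library paths and the library names (in particular \verb+-lamdis+) depend on the local installation and are only meant as a sketch:
\begin{lstlisting}{}
> mpiCC -o parallelheat parallelheat.cc \
        -I<AMDIS-DIR>/include -L<AMDIS-DIR>/lib -lamdis \
        -L<PARMETIS-DIR> -lparmetis -lmetis
\end{lstlisting}
Then the application can be called with \verb+mpirun+: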

\begin{lstlisting}{}
> mpirun -np <num-procs> ./parallelheat init/parallelheat.dat.2d
\end{lstlisting}

Now we start with the example. We want to solve a time-dependent problem as described in Section~\ref{s:heat}. The boundary function $g$ and the right hand side $f$ are now based on the {\it moving source}
\begin{equation}
g(x,t)=\sin(\pi t)\, e^{-10(x-\vec{t})^2}
\end{equation}
with $0 \leq t \leq 1$ and $\vec{t}$ a vector with all components set to $t$. The maximum of this function moves from $(0,0)$ to $(1,1)$ within the specified time interval. 
We use 4 parallel processes.

\begin{figure}
\center
\includegraphics[width=0.95\textwidth]{fig/threelevels.pdf}
\caption{Example for the three levels of partitioning. The partitioning mesh is globally refined twice to get the global coarse grid level. Then two further global refinement steps applied on $\Omega_i$ and its direct neighbor elements (in each step) result in the local coarse grid level.}
\label{f:threelevels2}
\end{figure}

\begin{table}
\center
\begin{tabular}{|cc|c|c|c|}
\hline 
&& {\bf 1d} & {\bf 2d} & {\bf 3d} \\
\hline 
{\bf source code} & {\tt src/} & \multicolumn{3}{|c|}{\tt parallelheat.cc} \\
\hline 
{\bf parameter file} & {\tt init/} & \tt - & \tt parallelheat.dat.2d & \tt - \\
\hline 
{\bf macro file} & {\tt macro/} & \tt - & \tt macro.stand.2d & \tt - \\
\hline 
{\bf output files} & {\tt output/} & \multicolumn{3}{|c|}{\tt parallelheat\_proc<n>\_<t>.mesh/.dat} \\
\hline 
\end{tabular}
\caption{Files of the {\tt parallelheat} example. In the output file names, {\tt <n>} is replaced by the process number and {\tt <t>} is replaced by the time.}
\label{t:parallelheat}
\end{table}


\subsection{Source code}

In this section, the interesting aspects of the file \verb+parallelheat.cc+ are described. First, the moving functions $g$ and $f$ are defined.

\begin{lstlisting}{}
class G : public AbstractFunction<double, WorldVector<double> >,
	  public TimedObject
{
public:
  /// Implementation of AbstractFunction::operator().
  double operator()(const WorldVector<double>& argX) const
  {
    WorldVector<double> x = argX;
    int dim = x.getSize();
    for (int i = 0; i < dim; i++)
      x[i] -= *timePtr;

    return sin(M_PI * (*timePtr)) * exp(-10.0 * (x * x));
  }
};

class F : public AbstractFunction<double, WorldVector<double> >,
	  public TimedObject
{
public:
  F(int degree) : AbstractFunction<double, WorldVector<double> >(degree) {}

  /// Implementation of AbstractFunction::operator().
  double operator()(const WorldVector<double>& argX) const 
  {
    WorldVector<double> x = argX;
    int dim = x.getSize();
    for (int i = 0; i < dim; i++)
      x[i] -= *timePtr;

    double r2 = (x * x);
    double ux = sin(M_PI * (*timePtr)) * exp(-10.0 * r2);
    double ut = M_PI * cos(M_PI*(*timePtr)) * exp(-10.0 * r2);
    return ut - (400.0 * r2 - 20.0 * dim) * ux;
  }
};
\end{lstlisting}
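The factor $(400\, r^2 - 20\, \mathrm{dim})$ in \verb+F::operator()+ is the Laplacian of the Gaussian factor: for $\varphi(x) = e^{-10\, r^2}$ with $r^2 = \sum_i x_i^2$ one has
\[
\partial_{x_i} \varphi = -20\, x_i\, \varphi, \qquad
\partial^2_{x_i} \varphi = (400\, x_i^2 - 20)\, \varphi, \qquad
\Delta \varphi = (400\, r^2 - 20\, \mathrm{dim})\, \varphi .
\]
The return value therefore has the form $u_t - \Delta u$ known from the right hand side of the heat example in Section \ref{s:heat}, evaluated here in the coordinates shifted by the current time.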

The main program starts with \verb+MPI::Init+ to initialize MPI, and ends with \verb+MPI::Finalize+ to finalize MPI.

\begin{lstlisting}{}
int main(int argc, char** argv)
{
  MPI::Init(argc, argv);

  ...

  std::vector<DOFVector<double>*> vectors;
  ParallelProblemScal parallelheat("heat->parallel", heatSpace, heat, 
                                   vectors);

  AdaptInstationary *adaptInstat =
    new AdaptInstationary("heat->adapt",
			  &parallelheat,
			  adaptInfo,
			  &parallelheat,
			  adaptInfoInitial);

  ... 

  parallelheat.initParallelization(adaptInfo);
  adaptInstat->adapt();
  parallelheat.exitParallelization(adaptInfo);

  MPI::Finalize();
}
\end{lstlisting}

The parallel problem \verb+parallelheat+ has to know the sequential instationary problem \verb+heat+ and the space problem \verb+heatSpace+. The vector \verb+vectors+ stores pointers to DOF vectors that are used by operator terms; the values of these vectors have to be exchanged during repartitioning. The solution of the last timestep is considered automatically, so \verb+vectors+ remains empty here.
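If, hypothetically, an operator term used an additional DOF vector, it would have to be added to \verb+vectors+ before the parallel problem is created, so that its values are redistributed on repartitioning. In the following sketch, the name \verb+concentration+ is made up for illustration and the \verb+...+ stands for its creation:
\begin{lstlisting}{}
DOFVector<double> *concentration = ...; // used by some operator term

std::vector<DOFVector<double>*> vectors;
vectors.push_back(concentration); // values exchanged on repartitioning

ParallelProblemScal parallelheat("heat->parallel", heatSpace, heat, 
                                 vectors);
\end{lstlisting}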

The adaptation loop \verb+adaptInstat+ uses \verb+parallelheat+ as iteration interface and as time interface. Before the adaptation loop starts, the parallelization is initialized by \verb+initParallelization+. After the adaptation loop has finished, the parallelization is finalized by \verb+exitParallelization+.

\subsection{Parameter file}

We use the parameter file described in Section \ref{s:heat parameter} as the basis for the file \verb+parallelheat.dat.2d+ and describe the changes relevant for the parallel case.

\begin{lstlisting}{}
heatMesh->global refinements:         6
\end{lstlisting}

The macro mesh is globally refined 6 times to create the partitioning mesh, which is used as the basis for the domain decomposition. 

In the following, one can see the parameters for the parallel problem \verb+heat->parallel+.

\begin{lstlisting}{}
heat->parallel->upper part threshold: 1.5
heat->parallel->lower part threshold: 0.66
\end{lstlisting}

These two parameters determine the upper and lower partitioning thresholds $rt_{high}$ and $rt_{low}$. If at least one process exceeds $rt_{high} \cdot sum_{av}$ elements or falls below $rt_{low} \cdot sum_{av}$ elements, a repartitioning is performed ($sum_{av}$ is the average number of elements per partition). 
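Written out, with $n_i$ the (leaf-element weighted) number of partitioning elements assigned to process $i$ and $p$ the number of processes (this notation is ours), repartitioning is triggered if
\[
\max_i n_i > rt_{high} \cdot sum_{av}
\quad\mbox{or}\quad
\min_i n_i < rt_{low} \cdot sum_{av},
\qquad
sum_{av} = \frac{1}{p}\sum_{i=0}^{p-1} n_i .
\]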

%If adaptive thresholds are used, these values are also the limits for the threshold adaptation. The start value for the upper threshold is also the minimum for the upper threshold, the start value for the lower threshold is also the maximum for the lower threshold.

%\begin{lstlisting}{}
%heat->parallel->adaptive thresholds:  1
%heat->parallel->threshold inc factor: 2.0
%heat->parallel->threshold dec factor: 0.5
%heat->parallel->repart time factor:   10.0
%\end{lstlisting}

%If \verb+adaptive thresholds+ is set to $1$, the \verb+threshold inc factor+ and \verb+threshold dec factor+ determine the incremenent and decrement factors for the thresholds. The value of \verb+repart time factor+ is the desired relation between repartitioning time and computation time.

\begin{lstlisting}{}
heat->parallel->global coarse grid level: 0
heat->parallel->local coarse grid level:  4
\end{lstlisting}

Here, the parameters for the three-level approach are given. The partitioning level has already been determined by \verb+heatMesh->global refinements+. The value of \verb+global coarse grid level+ is set to $0$, and \verb+local coarse grid level+ is set to $4$. This means that 4 global refinements are performed within the local partition of each process (including neighboring elements) to reduce the overlap size. No further refinements are done outside of the local partitions. 

\begin{lstlisting}{}
heat->adapt->timestep:               0.1
heat->adapt->min timestep:           0.1
heat->adapt->max timestep:           0.1
heat->adapt->start time:             0.0
heat->adapt->end time:               1.0
\end{lstlisting}

We use a fixed timestep of $0.1$. The start time is set to $0.0$ and the end time to $1.0$. Thus, we have 10 timesteps (not counting the initial solution at $t=0.0$). 
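The number of timesteps follows directly from these values:
\[
\frac{\mbox{end time} - \mbox{start time}}{\mbox{timestep}} = \frac{1.0 - 0.0}{0.1} = 10 .
\]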

\begin{lstlisting}{}
heat->space->output->filename:       output/heat
\end{lstlisting}

For the output, the prefix \verb+output/heat+ is used. Each process appends its process ID, then the time, and finally the file extensions. The result of the initial problem of process 0, for example, is written to \verb+output/heat_proc0_00.000.mesh+ and \verb+output/heat_proc0_00.000.dat+.

\subsection{Macro file}
We again use the macro file \verb+macro/macro.stand.2d+, which was described in Section \ref{s:ellipt macro}.

\subsection{Output}
For each timestep and each process, one mesh file and one value file are written (88 files in total: $11 \cdot 4 \cdot 2$). In Figure \ref{f:parallel heat}, the local process solutions for $t=0.1$, $t=0.5$ and $t=0.9$ are visualized after the partition of unity was applied. One can see that the partitioning follows the moving source.

\begin{figure}
\center
\begin{tabular}{ccc}
\includegraphics[width=0.3\textwidth]{fig/parallelheat_01.jpg}&
\includegraphics[width=0.3\textwidth]{fig/parallelheat_05.jpg}&
\includegraphics[width=0.3\textwidth]{fig/parallelheat_09.jpg}\\
(a)&(b)&(c)
\end{tabular}
\caption{Local process solutions for $t=0.1$ (a), $t=0.5$ (b), and $t=0.9$ (c).}
\label{f:parallel heat}
\end{figure}