Commit e9e0b8b1 authored by Maicon Hieronymus

Merge branch 'physics_tests2' into 'master'

Cleaned dependencies

See merge request mahieron/process_plot_scripts!8
parents 6d874d5c 75160364
......@@ -2,7 +2,10 @@ AD with ICON and Processing Results
===================================
Here are various scripts to plot and process data from [netCDF](https://www.unidata.ucar.edu/software/netcdf/) files and my implementation of a two-moment scheme (similar to [ICON](https://www.dwd.de/EN/research/weatherforecasting/num_modelling/01_num_weather_prediction_modells/icon_description.html)) with [AD using CoDiPack](https://github.com/scicompkl/codipack). For the C++ code we follow the
[doxygen standard](http://www.doxygen.nl/manual/docblocks.html).
[doxygen standard](http://www.doxygen.nl/manual/docblocks.html). \
We recommend an [anaconda](https://www.anaconda.com/) environment, for which you may
install the packages in `requirements_conda.txt` with conda and then those in `requirements_pip.txt` with pip.
Contents
---------
......@@ -14,29 +17,32 @@ Contents
- ***.ipynb:** Jupyter Notebooks to get started. See below for more explanation.
Python and Plotting Prerequisites
---------------------
Python Prerequisites for Post-Processing
----------------------------------------
- [Python3](https://www.python.org/) (Tested with v3.7.6)
- [Iris](https://github.com/SciTools/iris) (Recommended to install via conda or from source; Tested with v2.4.0)
- [pandas](https://pandas.pydata.org/) (Tested with v1.0.1)
- [DASK](https://dask.org/) (Tested with v2.16.0)
- [dask_ml](https://dask-ml.readthedocs.io/en/latest/) (Tested with v1.2.0)
- [progressbar2](https://pypi.org/project/progressbar2/) (Tested with v3.37.1)
- [seaborn](https://seaborn.pydata.org/) (Tested with v0.10.0)
- [matplotlib](https://matplotlib.org/) (Tested with v3.1.3)
- [scipy](https://www.scipy.org/) (Tested with v1.4.1)
- [matplotlib toolkits](https://matplotlib.org/1.4.3/mpl_toolkits/index.html)
- [netCDF4](https://unidata.github.io/netcdf4-python/netCDF4/index.html) (Tested with v1.5.3)
- [pillow, formerly known as PIL](https://pillow.readthedocs.io/en/stable/) (Tested with v7.0.0)
- [xarray, formerly known as xray](http://xarray.pydata.org/en/stable/) (Tested with v0.15.0)
- [GEOS](https://github.com/libgeos/geos) (Tested with v3.7.1)
- [cartopy](https://scitools.org.uk/cartopy/docs/latest/) (Tested with v0.17.0)
- [scikit-learn](https://scikit-learn.org/stable/) (Tested with v0.22.1)
- [datashader](https://datashader.org/index.html) (Tested with v0.10.0)
- [fastparquet](https://github.com/dask/fastparquet) (Tested with v0.3.3)
- [python-snappy](http://google.github.io/snappy/) (Tested with v0.5.4)
Python Prerequisites for Plotting
----------------------------------------
- [Iris](https://github.com/SciTools/iris) (Install from conda-forge; Tested with v2.4.0)
- [holoviews](http://holoviews.org/) (Tested with v1.13.2)
- [hvplot](https://hvplot.holoviz.org/) (Tested with v0.5.2)
- [fastparquet](https://github.com/dask/fastparquet) (Tested with v0.3.3)
- [datashader](https://datashader.org/index.html) (Tested with v0.10.0)
- [matplotlib](https://matplotlib.org/) (Tested with v3.1.3)
- [geoviews](https://geoviews.org/) (Install via `conda install -c pyviz geoviews`; Tested with v1.8.1)
- [GEOS](https://github.com/libgeos/geos) (Tested with v3.7.1)
- [cartopy](https://scitools.org.uk/cartopy/docs/latest/) (Tested with v0.17.0)
Docs Prerequisites
-------------------------
......@@ -56,10 +62,12 @@ C++ Prerequisites
- [Boost](https://www.boost.org/) (Tested with v1.3.4)
- [CoDiPack](https://www.scicomp.uni-kl.de/software/codi/) (Tested with v1.8.0)
Optional Prerequisites
----------------------
- [GNU Parallel](https://www.gnu.org/software/parallel/) (Executing multiple processes with ease; Tested with v20161222)
- [JupyterLab](https://jupyter.org/) (A more convenient way of working with Python and plotting; Tested with v1.2.6)
- JupyterLab extension for geoviews: `jupyter labextension install @pyviz/jupyterlab_pyviz`
Compiling code
......
......@@ -22,7 +22,7 @@ TARGET_TIME="60"
# Where to write output files. Keep the naming convention of the files, i.e.
# wcb403220_traj0_MAP...
OUTPUT_PATH="/data/project/wcb/sim_results/sample_vladiana_test/"
OUTPUT_PATH="data/sim/"
# Whether to take the data from the netcdf-file every 20 seconds (=1)
# or just use the initial conditions from the file and simulate microphysics
......@@ -45,7 +45,7 @@ parallel -j ${NTASKS} --no-notice --delay .2 build/apps/src/microphysics/./traje
# try "netcdf"
# DISCLAIMER: netcdf is not yet supported!
FILE_TYPE="parquet"
INPUT_PATH="/data/project/wcb/sim_results/sample_vladiana_test"
STORE_PATH="/data/project/wcb/parquet/sample_vladiana_test"
INPUT_PATH="/uni-mainz.de/homes/mahieron/Documents/PhD/physics_tests/data/sim"
STORE_PATH="/uni-mainz.de/homes/mahieron/Documents/PhD/physics_tests/data/parquet"
cd scripts
python Create_parquet_local.py ${FILE_TYPE} ${INPUT_PATH} ${STORE_PATH}
\ No newline at end of file
python Create_parquet_local.py ${FILE_TYPE} ${INPUT_PATH} ${STORE_PATH}
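The script above converts the simulation results under `INPUT_PATH` into parquet files under `STORE_PATH` (with `FILE_TYPE="parquet"`). A minimal sketch of loading the converted output for post-processing, assuming the directory layout and columns produced by `Create_parquet_local.py`; it only relies on Dask and fastparquet, which are listed in the prerequisites above:

```python
# Minimal sketch: load the parquet output written by Create_parquet_local.py
# with Dask and the fastparquet engine. The path is a hypothetical local copy
# of STORE_PATH; the exact directory layout and columns are assumptions.
import dask.dataframe as dd

store_path = "data/parquet"
df = dd.read_parquet(store_path, engine="fastparquet")

# Inspect the available columns before any further post-processing.
print(df.columns)
print(df.head())
```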
......@@ -104,7 +104,7 @@ const std::vector<std::string> output_par_idx =
*/
const std::vector<std::string> output_grad_idx =
{"da_1", "da_2", "de_1", "de_2", "dd", "dN_c", "dgamma", "dbeta_c",
"dbeta_r", "ddelta1", "ddelta2", "dzeta", "dcc.rain_gfak", "dcloud_k_au",
"dbeta_r", "ddelta1", "ddelta2", "dzeta", "drain_gfak", "dcloud_k_au",
"dcloud_k_sc", "dkc_autocon", "dinv_z",
// Rain
"drain_a_geo", "drain_b_geo", "drain_min_x", "drain_min_x_act",
......
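Renaming `dcc.rain_gfak` to `drain_gfak` keeps every entry of `output_grad_idx` on the `d<parameter>` naming scheme, which the Python post-processing relies on when it selects derivative columns by their `d` prefix (see the `f[0] == "d"` check further below). A small sketch with a made-up DataFrame to illustrate that selection:

```python
# Hypothetical frame: one gradient column following the d<parameter> scheme
# and one non-gradient column.
import pandas as pd

df = pd.DataFrame({"drain_gfak": [0.1, 0.2], "timestep": [0, 1]})

# Select derivative columns by their "d" prefix, mirroring the check used
# in the post-processing scripts.
grad_cols = [c for c in df.columns if c.startswith("d")]
print(grad_cols)  # ['drain_gfak']
```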
datashader==0.10.0
jupyterlab==1.2.6
matplotlib==3.1.3
netCDF4==1.5.3
numpy==1.18.1
pandas==1.0.1
progressbar2==3.37.1
recommonmark==0.6.0
seaborn==0.10.0
Sphinx==3.0.3
xarray==0.15.0
python-snappy==0.5.4
holoviews==1.13.2
hvplot==0.5.2
iris==2.4.0
geoviews==1.8.1
\ No newline at end of file
fastparquet==0.3.3
sphinx-rtd-theme==0.4.3
breathe==4.18.1
exhale==0.2.3
......@@ -16,10 +16,6 @@ from matplotlib import cm
from pylab import rcParams
import os
from sklearn.cluster import MiniBatchKMeans, SpectralClustering, DBSCAN
from sklearn.mixture import BayesianGaussianMixture
from sklearn.metrics import adjusted_rand_score
from multiprocessing import Pool
from itertools import repeat
from progressbar import progressbar as pb
......@@ -80,12 +76,12 @@ class Deriv:
self.pool = Pool(processes=threads)
self.threads = threads
self.data = loader.load_mult_derivates_direc_dic(
direc=direc,
filt=filt,
EPSILON=EPSILON,
trajectories=trajectories,
suffix=suffix,
pool=self.pool,
direc=direc,
filt=filt,
EPSILON=EPSILON,
trajectories=trajectories,
suffix=suffix,
pool=self.pool,
file_list2=file_list
)
df = list(self.data.values())[0]
......@@ -646,157 +642,3 @@ class Deriv:
p, v = sorted_tuples.pop()
in_params_2.append(p)
plot_helper(df, in_params=in_params_2, out_param=out_param, **kwargs)
def cluster(self, k, method, out_params=None, features=None,
new_col="cluster", truth=None, thresh=0.10):
"""
Cluster the data where "features" are columns and "truth" can be either a
column identifying the ground truth or a list of assignments.
If truth is given, the adjusted Rand index is calculated.
"method" chooses the clustering method. Columns where more than a fraction
"thresh" of values is NaN are dropped, as are any remaining rows with NaNs.
Parameters
----------
k : int
Number of clusters. Ignored by certain clustering methods such as
"dbscan".
method : string
Choose the clustering method. Options are "kmeans", "spectral",
"gaussian" or "dbscan".
out_params : list of strings
The output parameters for which the derivatives shall be considered.
If None is given, all output parameters are considered.
features : list of strings
Columns of the data that shall be used. If it is None, all
derivatives will be used.
new_col : string
Name of the column where the cluster assignment shall be stored.
truth : string or list of int
Either a column name of the data or a list of cluster assignments.
thresh : float
Threshold for dropping columns. If more than this fraction of values
is NaN, that column is dropped.
Returns
-------
Array of float, optionally float
Cluster centers (or an equivalent such as means or an affinity matrix,
depending on the method) and the adjusted Rand index if truth is given.
"""
if out_params is None:
out_params = self.data.keys()
if features is None:
features_tmp = self.data[out_params[0]].columns.values
features = []
for f in features_tmp:
if f[0] == "d":
features.append(f)
if truth is not None:
if isinstance(truth, str):
truth_list = []
for out_p in out_params:
truth_list.append(self.data[out_p][truth].tolist())
truth = truth_list
X = None
for out_p in out_params:
# Remove columns where more than a fraction `thresh` of values is NaN
df_tmp = self.data[out_p]
cols_to_delete = df_tmp.columns[df_tmp.isnull().sum()/len(df_tmp) > thresh]
for col in cols_to_delete:
try:
features.remove(col)
except ValueError:
# The column was not among the selected features to begin with.
pass
if features == []:
print("No derivatives in this dataset. Returning.")
if method == "gaussian":
return None, None, None
return None, None
X_tmp = df_tmp[features].to_numpy()
if X is None:
X = X_tmp
else:
X = np.concatenate((X, X_tmp), axis=0)
# Remove rows with NaNs from X and the corresponding labels from truth
delete_index = []
for i, row in enumerate(X):
delete = np.sum(np.isnan(row))
if delete > 0:
delete_index.append(i)
X = np.delete(X, delete_index, axis=0)
truth = np.delete(truth, delete_index)
print("Shape of X {}".format(np.shape(X)))
if np.shape(truth) == (0,):
print("No derivatives in this dataset. Returning.")
if method == "gaussian":
return None, None, None
return None, None
if method == "kmeans":
clustering = MiniBatchKMeans(n_clusters=k).fit(X)
elif method == "spectral":
try:
clustering = SpectralClustering(n_clusters=k).fit(X)
except MemoryError:
print("Dataset is too large to process. Aborting.")
return None, None
elif method == "gaussian":
clustering = BayesianGaussianMixture(n_components=k).fit(X)
labels = clustering.predict(X)
elif method == "dbscan":
# TODO: Change eps and min_samples
clustering = DBSCAN(eps=0.5, min_samples=5, metric='euclidean').fit(X)
else:
print("No such method")
return
for i, out_p in enumerate(out_params):
n = len(self.data[out_p].index)
if method == "gaussian":
this_labels = labels[i*n:(i+1)*n]
else:
this_labels = clustering.labels_[i*n:(i+1)*n]
# Re-insert the rows that could not be processed due to NaNs with label -2
for idx in delete_index:
if idx < i*n:
continue
if idx >= (i+1)*n:
break
this_labels = np.insert(this_labels, idx, -2)
self.data[out_p][new_col] = this_labels
if out_p in self.cluster_names.keys():
self.cluster_names[out_p].append(new_col)
else:
self.cluster_names[out_p] = [new_col]
if truth is not None:
if method == "gaussian":
adj_index = adjusted_rand_score(truth, labels)
return clustering.means_, adj_index, clustering.score(X)
else:
adj_index = adjusted_rand_score(truth, clustering.labels_)
if method == "kmeans":
return clustering.cluster_centers_, adj_index
elif method == "spectral":
return clustering.affinity_matrix_, adj_index
elif method == "dbscan":
# Calculate the centers
centers = {}
for out_p in self.cluster_names.keys():
centers[out_p] = []
df_tmp = self.data[out_p]
center_idx = df_tmp[new_col].unique()
for c in center_idx:
df_tmp2 = df_tmp.loc[df_tmp[new_col] == c, features].to_numpy()
n = len(df_tmp2)
cen = np.sum(df_tmp2, axis=0)/n
centers[out_p].append(cen)
return centers, adj_index
if method == "kmeans":
return clustering.cluster_centers_
elif method == "gaussian":
return clustering.means_
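The removed `cluster` method bundled k-means, spectral, Gaussian-mixture and DBSCAN clustering of the derivative columns. A self-contained sketch of its core idea, k-means on the `d`-prefixed columns plus the adjusted Rand index against a ground truth; the data and column names below are made up for illustration:

```python
# Sketch of the core of the removed cluster() method: k-means on the
# derivative columns and the adjusted Rand index against hypothetical
# ground-truth labels. All data here is synthetic.
import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "da_1": rng.normal(size=100),
    "drain_a_geo": rng.normal(size=100),
    "T": rng.normal(size=100),  # non-derivative column, ignored below
})
truth = rng.integers(0, 2, size=100)  # hypothetical ground-truth labels

# Keep only the derivative columns (those starting with "d").
features = [c for c in df.columns if c.startswith("d")]
X = df[features].to_numpy()

clustering = MiniBatchKMeans(n_clusters=2).fit(X)
print(clustering.cluster_centers_)
print(adjusted_rand_score(truth, clustering.labels_))
```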
......@@ -649,7 +649,7 @@ int main(int argc, char** argv)
uint32_t used_parameter = 0;
uint32_t used_parameter2 = 0;
std::cout << "qi,Ni,T,S,qv,p,w,delta_qi,delta_Ni,delta_qv,"
std::cout << "qi,Ni,T,S,S_i,qv,p,w,delta_qi,delta_Ni,delta_qv,"
<< "delta_lat_cool,delta_lat_heat\n";
for(uint32_t i=0; i<=n1; ++i)
......@@ -725,6 +725,7 @@ int main(int argc, char** argv)
<< Ni.getValue() << ","
<< T_prime_in.getValue() << ","
<< S.getValue() << ","
<< ssi.getValue() << ","
<< qv_prime_in.getValue() << ","
<< p_prime_in.getValue() << ","
<< w_prime_in.getValue() << ",";
......
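With the added `ssi` output, the comma-separated output gains an `S_i` column between `S` and `qv`. A minimal sketch of reading that output with pandas, assuming stdout has been redirected to a file (the filename is hypothetical); the column names are taken from the header printed above:

```python
# Read the CSV written to stdout by the test program (hypothetical filename).
import pandas as pd

df = pd.read_csv("ice_test_output.csv")
# "S_i" is the newly added column written from ssi in the code above.
print(df[["qi", "Ni", "T", "S", "S_i", "qv", "p", "w"]].head())
```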