Data
This page describes the main data objects that are used by ClusterKinG.
If you do not need to include errors in your analysis, use
Data
, else
DataWithErrors
(which inherits from
Data
but adds additional methods to it).
Both classes inherit from a very basic class,
DFMD
, which provides basic input and output
methods.
DFMD
- class clusterking.data.DFMD(path: Optional[Union[str, pathlib.PurePath]] = None, log: Optional[Union[str, logging.Logger]] = None)[source]#
Bases:
object
DFMD = DataFrame with MetaData. This class bundles a pandas dataframe together with metadata and provides methods to save and load such an object.
- __init__(path: Optional[Union[str, pathlib.PurePath]] = None, log: Optional[Union[str, logging.Logger]] = None)[source]#
Initialize a DFMD object.
- Parameters
path – Optional: load from this file (specified as string or pathlib.PurePath)
log – Optional: instance of logging.Logger or name of logger to be created
- md#
This will hold all the configuration that we will write out
- df: Optional[pandas.core.frame.DataFrame]#
pandas.DataFrame
to hold all of the results
- log#
Instance of
logging.Logger
- write(path: Union[str, pathlib.PurePath], overwrite='ask')[source]#
Write output files.
- Parameters
path – Path to output file
overwrite – How to proceed if output file already exists: ‘ask’ (ask interactively for approval if we have to overwrite), ‘overwrite’ (overwrite without asking), ‘raise’ (raise Exception if file exists). Default is ‘ask’.
- Returns
None
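The three overwrite modes described above can be illustrated with a small standalone sketch. The helper `write_with_policy` and the JSON payload are hypothetical and only mimic the documented behaviour of the `overwrite` argument; this is not the ClusterKinG implementation, and unlike `write` it returns a flag so the outcome is easy to inspect.

```python
import json
import tempfile
from pathlib import Path


def write_with_policy(path, payload, overwrite="ask"):
    """Hypothetical helper mimicking the three documented overwrite modes."""
    path = Path(path)
    if path.exists():
        if overwrite == "raise":
            # 'raise': refuse to touch an existing file.
            raise FileExistsError(f"{path} already exists")
        if overwrite == "ask":
            # 'ask': request interactive approval before overwriting.
            answer = input(f"Overwrite {path}? [y/N] ")
            if answer.strip().lower() != "y":
                return False
        elif overwrite != "overwrite":
            raise ValueError(f"Unknown overwrite mode: {overwrite!r}")
    path.write_text(json.dumps(payload))
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "output.json"
    # The file does not exist yet, so any mode writes it.
    first = write_with_policy(target, {"md": {"scan": "demo"}}, overwrite="raise")
    try:
        write_with_policy(target, {"md": {}}, overwrite="raise")
        second_raised = False
    except FileExistsError:
        second_raised = True
    # 'overwrite' replaces the existing file without asking.
    third = write_with_policy(target, {"md": {"scan": "new"}}, overwrite="overwrite")
    final_content = json.loads(target.read_text())
```

For non-interactive scripts, an explicit `overwrite='overwrite'` or `overwrite='raise'` avoids the prompt that the default `'ask'` would trigger.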
Data
- class clusterking.data.Data(*args, **kwargs)[source]#
Bases:
clusterking.data.dfmd.DFMD
This class inherits from the DFMD class and adds additional methods to it. It is the basic container that contains:
The distributions to cluster
The cluster numbers after clustering
The benchmark points after they are selected.
- property bin_cols: List[str]#
All columns that correspond to the bins of the distribution. This is automatically read from the metadata as set in e.g.
clusterking.scan.Scanner.run()
.
- property par_cols: List[str]#
All columns that correspond to the parameters (e.g. Wilson parameters). This is automatically read from the metadata as set in e.g. the
clusterking.scan.Scanner.run()
.
- property npars: int#
Number of parameters that were sampled (i.e. number of dimensions of the sampled parameter space).
- data(normalize=False) numpy.ndarray [source]#
Returns all histograms as a large matrix.
- Parameters
normalize – Normalize all histograms
- Returns
numpy.ndarray of shape self.n x self.nbins
- norms() numpy.ndarray [source]#
Returns a vector of all normalizations of all histograms (where each histogram corresponds to one sampled point in parameter space).
- Returns
numpy.ndarray of shape self.n
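As a rough illustration of how the histogram matrix and the vector of normalizations relate, here is a NumPy sketch on toy numbers. It assumes that the normalization of a histogram is the sum of its bin contents; the page does not state the exact definition, so treat that as an assumption.

```python
import numpy as np

# Toy matrix: 3 sample points (rows) with 4 bins each (columns).
histograms = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 2.0, 2.0, 2.0],
    [0.5, 0.5, 1.0, 2.0],
])

# Assumption: the normalization of a histogram is the sum of its bin contents.
norms = histograms.sum(axis=1)

# Divide every row by its own normalization so that each row sums to 1.
normalized = histograms / norms[:, np.newaxis]
```

Here `norms` has one entry per sample point and `normalized` keeps the `n x nbins` shape of the input.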
- clusters(cluster_column='cluster') List[Any] [source]#
Return list of all cluster names (unique)
- Parameters
cluster_column – Column that contains the cluster names
- get_param_values(param: Union[None, str] = None)[source]#
Return all unique values of this parameter
- Parameters
param – Name of parameter. If none is given, instead return a dictionary mapping of parameters to their values.
Returns:
- only_bpoints(bpoint_column='bpoint', inplace=False)[source]#
Keep only the benchmark points as sample points.
- Parameters
bpoint_column – benchmark point column (boolean)
inplace – If True, the current Data object is modified, if False, a new copy of the Data object is returned.
- Returns
None or Data
- fix_param(inplace=False, bpoints=False, bpoint_slices=False, bpoint_column='bpoint', **kwargs)[source]#
Fix some parameter values to get a subset of sample points.
- Parameters
inplace – Modify this Data object instead of returning a new one
bpoints – Keep bpoints (no matter if they are selected by the other selection or not)
bpoint_slices – Keep all parameter values that are attained by benchmark points.
bpoint_column – Column with benchmark points (default ‘bpoint’) (for use with the bpoints option)
**kwargs – Specify parameter values: Use <parameter name>=<value> or <parameter name>=[<value1>, ..., <valuen>].
- Returns
If
inplace == False
, return new Data with subset of sample points.
Examples:
d = Data("/path/to/tutorial/csv/folder", "tutorial_basics")
Return a new Data object, keeping the two values of CT_bctaunutau closest to -0.75 or 0.5:
d.fix_param(CT_bctaunutau=[-.75, 0.5])
Return a new Data object, where we also fix CSL_bctaunutau to the value closest to -1.0:
d.fix_param(CT_bctaunutau=[-.75, 0.5], CSL_bctaunutau=-1.0)
Return a new Data object, keeping the two values of CT_bctaunutau closest to -0.75 or 0.5, but make sure we do not discard any benchmark points in that process:
d.fix_param(CT_bctaunutau=[-.75, 0.5], bpoints=True)
Return a new Data object, keeping the two values of CT_bctaunutau closest to -0.75 or 0.5, but keep all values of CT_bctaunutau that are attained by at least one benchmark point:
d.fix_param(CT_bctaunutau=[-.75, 0.5], bpoint_slices=True)
Return a new Data object, keeping only those values of CT_bctaunutau that are attained by at least one benchmark point:
d.fix_param(CT_bctaunutau=[], bpoint_slices=True)
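The "closest value" selection used in these examples can be sketched in plain Python. The grid of sampled values, the row dictionaries and the helper `closest_values` are hypothetical; the sketch only illustrates the idea of snapping requested values to the nearest sampled ones and is not the library's code.

```python
def closest_values(available, requested):
    """Return the available values closest to each requested value (sorted, unique)."""
    picked = {min(available, key=lambda v: abs(v - target)) for target in requested}
    return sorted(picked)


# Hypothetical sampled values of one parameter.
grid = [-1.0, -0.7, -0.3, 0.0, 0.4, 1.0]

# Requested values need not lie on the grid; they are snapped to it.
kept = closest_values(grid, [-0.75, 0.5])

# Hypothetical sample points: keep only rows whose parameter value was selected.
rows = [{"CT_bctaunutau": v, "cluster": i % 2} for i, v in enumerate(grid)]
subset = [row for row in rows if row["CT_bctaunutau"] in kept]
```

With this grid, -0.75 snaps to -0.7 and 0.5 snaps to 0.4, so two rows survive.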
- sample_param(bpoints=False, bpoint_slices=False, bpoint_column='bpoint', inplace=False, **kwargs)[source]#
Return a Data object that contains a subset of the sample points (points in parameter space). Similar to Data.fix_param.
- Parameters
inplace – Modify this Data object instead of returning a new one
bpoints – Keep bpoints (no matter if they are selected by the other selection or not)
bpoint_slices – Keep all parameter values that are attained by benchmark points
bpoint_column – Column with benchmark points (default ‘bpoint’) (for use with the bpoints option)
**kwargs – Specify parameter ranges: <coeff name>=(min, max, npoints) or <coeff name>=npoints. For each coeff (identified by <coeff name>), select (at most) npoints points between min and max. In total this will therefore result in npoints_{coeff_1} x … x npoints_{coeff_npar} sample points (provided that there are enough sample points available). If a coefficient isn’t contained in the dictionary, this dimension of the sample remains untouched.
- Returns
If
inplace == False
, return new Data with subset of sample points.
Examples:
d = Data("/path/to/tutorial/csv/folder", "tutorial_basics")
Return a new Data object, subsampling CT_bctaunutau to the values closest to 10 points between -1 and 1:
d.sample_param(CT_bctaunutau=(-1, 1, 10))
The same in shorter syntax (because -1 and 1 are the minimum and maximum of the parameter):
d.sample_param(CT_bctaunutau=10)
For the bpoints and bpoint_slices syntax, see the documentation of clusterking.data.Data.fix_param().
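One way to picture the "(at most) npoints points between min and max" rule is the following plain-Python sketch. It assumes that evenly spaced target values are snapped to the nearest available sampled value; the page does not spell out the selection strategy, so this is an assumption, and `subsample` is a hypothetical helper.

```python
def subsample(available, lo, hi, npoints):
    """Pick at most npoints of the available values between lo and hi."""
    in_range = [v for v in available if lo <= v <= hi]
    if npoints <= 1 or len(in_range) <= 1:
        return in_range[:1]
    # Assumption: evenly spaced targets, each snapped to the nearest sampled value.
    step = (hi - lo) / (npoints - 1)
    targets = [lo + i * step for i in range(npoints)]
    picked = {min(in_range, key=lambda v: abs(v - t)) for t in targets}
    return sorted(picked)


# Hypothetical grid of 21 sampled values from -1.0 to 1.0.
available = [i / 10 for i in range(-10, 11)]

coarse = subsample(available, -1.0, 1.0, 5)
# Asking for more points than exist cannot return more than are available.
dense = subsample(available, -1.0, 1.0, 50)
```

Because several targets can snap to the same sampled value, the result may contain fewer than `npoints` values, which matches the "at most" wording above.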
- sample_param_random(inplace=False, bpoints=False, bpoint_column='bpoint', **kwargs)[source]#
Random subsampling in parameter space.
- Parameters
inplace – Modify this Data object instead of returning a new one
bpoints – Keep bpoints (no matter if they are selected by the other selection or not)
bpoint_column – Column with benchmark points (default ‘bpoint’) (for use with the bpoints option)
**kwargs – Arguments for pandas.DataFrame.sample()
- Returns
If
inplace == False
, return new Data with subset of sample points.
- find_closest_spoints(point: Dict[str, float], n=10) clusterking.data.data.Data [source]#
Given a point in parameter space, find the closest sampling points to it and return them as a Data object with the corresponding subset of spoints. The order of the rows in the dataframe Data.df will be in order of increasing parameter space distance from the given point.
- Parameters
point – Dictionary of parameter name to value
n – Maximal number of rows to return
- Returns
Data
object with subset of rows of dataframe corresponding to the closest points in parameter space.
- find_closest_bpoints(point: Dict[str, float], n=10, bpoint_column='bpoint')[source]#
Given a point in parameter space, find the closest benchmark points to it and return them as a Data object with the corresponding subset of benchmark points. The order of the rows in the dataframe Data.df will be in order of increasing parameter space distance from the given point.
- Parameters
point – Dictionary of parameter name to value
n – Maximal number of rows to return
bpoint_column – Column name of the benchmark column
- Returns
Data
object with subset of rows of dataframe corresponding to the closest points in parameter space.
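The ordering described for the two closest-point methods above can be sketched as follows. The sketch assumes a Euclidean distance in parameter space (the page does not name the metric), and the parameter names, rows and helper `closest_points` are hypothetical.

```python
import math


def closest_points(rows, point, n=10):
    """Return at most n rows, ordered by increasing distance from point."""
    keys = list(point)

    def distance(row):
        # Assumption: Euclidean distance over the parameters named in `point`.
        return math.dist([row[k] for k in keys], [point[k] for k in keys])

    return sorted(rows, key=distance)[:n]


# Hypothetical sample points in a two-dimensional parameter space.
rows = [
    {"CT": 0.0, "CSL": 0.0},
    {"CT": 1.0, "CSL": 0.0},
    {"CT": 0.5, "CSL": 0.5},
    {"CT": -1.0, "CSL": 1.0},
]

nearest = closest_points(rows, {"CT": 0.4, "CSL": 0.4}, n=2)
```

Restricting `rows` to benchmark points before calling the helper gives the analogous behaviour for benchmark points.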
- configure_variable(variable, axis_label=None)[source]#
Set additional information for variables, e.g. the variable on the x axis of the plots of the distribution or the parameters.
- Parameters
variable – Name of the variable
axis_label – An alternate name which will be used on the axes of plots.
- rename_clusters(arg=None, column='cluster', new_column=None)[source]#
Rename clusters based on either
1. A dictionary of the form {<old cluster name>: <new cluster name>}
2. A function that maps the old cluster name to the new cluster name
Example for 2: Say our Data object d contains clusters 1 to 10 in the default column cluster. The following method call will instead use the numbers 0 to 9:
d.rename_clusters(lambda x: x-1)
- Parameters
arg – Dictionary or function as described above.
column – Column that contains the cluster names
new_column – New column to write to (default None, i.e. rename in place)
- Returns
None
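The two ways of passing `arg` can be pictured with a plain-Python sketch operating on a list of cluster labels. The helper `rename` is hypothetical, and leaving names that are missing from the dictionary unchanged is an assumption of this sketch, not documented behaviour.

```python
def rename(labels, arg):
    """Apply a dictionary or a function to every cluster label."""
    if callable(arg):
        return [arg(label) for label in labels]
    # Assumption: labels that are not keys of the dictionary stay unchanged.
    return [arg.get(label, label) for label in labels]


labels = [1, 2, 3, 3, 10]

shifted = rename(labels, lambda x: x - 1)   # function variant
remapped = rename(labels, {10: 0})          # dictionary variant
```

The function variant is convenient for systematic changes such as shifting all labels, while the dictionary variant suits renaming a few specific clusters.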
- plot_dist(cluster_column='cluster', bpoint_column='bpoint', title: Union[None, str] = None, clusters: Optional[List[int]] = None, nlines=None, bpoints=True, legend=True, ax=None, hist_kwargs: Optional[Dict[str, Any]] = None, hist_kwargs_bp: Optional[Dict[str, Any]] = None)[source]#
Plot several examples of distributions for each cluster specified.
- Parameters
cluster_column – Column with the cluster names (default ‘cluster’)
bpoint_column – Column with bpoints (default ‘bpoint’)
title – Plot title (None: automatic)
clusters – List of clusters to select or single cluster. If None (default), all clusters are chosen.
nlines – Number of example distributions of each cluster to be plotted (default 0)
bpoints – Draw benchmark points (default True)
legend – Draw legend? (default True)
ax – Instance of matplotlib.axes.Axes to plot on. If None, a new one is instantiated.
hist_kwargs – Keyword arguments passed on to
plot_histogram()
hist_kwargs_bp – Like hist_kwargs but used for benchmark points. If None, hist_kwargs is used.
Note: To customize this kind of plot further, check the BundlePlot class and the plot_bundles() method thereof.
- Returns
Figure
- plot_dist_minmax(cluster_column='cluster', bpoint_column='bpoint', title: Union[None, str] = None, clusters: Optional[List[int]] = None, bpoints=True, legend=True, ax=None, hist_kwargs: Optional[Dict[str, Any]] = None, fill_kwargs: Optional[Dict[str, Any]] = None)[source]#
Plot the minimum and maximum of each bin for the specified clusters.
- Parameters
cluster_column – Column with the cluster names (default ‘cluster’)
bpoint_column – Column with bpoints (default ‘bpoint’)
title – Plot title (None: automatic)
clusters – List of clusters to select or single cluster. If None (default), all clusters are chosen.
bpoints – Draw benchmark points (default True)
legend – Draw legend? (default True)
ax – Instance of matplotlib.axes.Axes to plot on. If None, a new one is instantiated.
hist_kwargs – Keyword arguments to
plot_histogram()
fill_kwargs – Keyword arguments to matplotlib.pyplot.fill_between
Note: To customize this kind of plot further, check the BundlePlot class and the plot_minmax() method thereof.
- Returns
Figure
- plot_dist_box(cluster_column='cluster', bpoint_column='bpoint', title: Union[None, str] = None, clusters: Optional[List[int]] = None, bpoints=True, whiskers=2.5, legend=True, ax=None, boxplot_kwargs: Optional[Dict[str, Any]] = None, hist_kwargs: Optional[Dict[str, Any]] = None)[source]#
Box plot of the bin contents of the distributions corresponding to selected clusters.
- Parameters
cluster_column – Column with the cluster names (default ‘cluster’)
bpoint_column – Column with bpoints (default ‘bpoint’)
title – Plot title (None: automatic)
clusters – List of clusters to select or single cluster. If None (default), all clusters are chosen.
bpoints – Draw benchmark points (default True)
whiskers – Length of the whiskers of the box plot in units of IQR (interquartile range, containing 50% of all values). Default 2.5.
legend – Draw legend? (default True)
ax – Instance of matplotlib.axes.Axes to plot on. If None, a new one is instantiated.
boxplot_kwargs – Keyword arguments to matplotlib.pyplot.boxplot
hist_kwargs – Keyword arguments to
plot_histogram()
Note: To customize this kind of plot further, check the BundlePlot class and the box_plot() method thereof.
- Returns
Figure
- plot_clusters_scatter(params=None, clusters=None, cluster_column='cluster', bpoint_column='bpoint', legend=True, max_subplots=16, max_cols=4, markers=('o', 'v', '^', 'v', '<', '>'), figsize=4, aspect_ratio=None)[source]#
Create scatter plot, specifying the columns to be on the axes of the plot. If 3 columns are specified, 3D scatter plots are presented, else 2D plots. If the dataframe contains more columns, such that each row is not only specified by the columns on the axes, a selection of subplots is created, showing ‘cuts’. Benchmark points are marked by enlarged plot markers.
- Parameters
params – The names of the columns to be shown on the x, (y, (z)) axis of the plots.
clusters – The clusters to be plotted (default: all)
cluster_column – Column with the cluster names (default ‘cluster’)
bpoint_column – Column with bpoints (default ‘bpoint’)
legend – Draw legend? (default True)
max_subplots – Maximal number of subplots
max_cols – Maximal number of columns of the subplot grid
markers – List of markers of the clusters
figsize – Base size of each subplot
aspect_ratio – Aspect ratio of 2D plots. If None, will be chosen automatically based on data ranges.
- Returns
Figure
- plot_clusters_fill(params=None, cluster_column='cluster', bpoint_column='bpoint', legend=True, max_subplots=16, max_cols=4, figsize=4, aspect_ratio=None)[source]#
Call this method with two column names, x and y. The results are similar to those of 2D scatter plots as created by the scatter method, except that the coloring is expanded to the whole xy plane. Note: This method only works with uniformly sampled NP!
- Parameters
params – The names of the columns to be shown on the x, y (and z) axis of the plots.
cluster_column – Column with the cluster names (default ‘cluster’)
bpoint_column – Column with bpoints (default ‘bpoint’)
legend – Draw legend? (default True)
max_subplots – Maximal number of subplots
max_cols – Maximal number of columns of the subplot grid
figsize – Base size of each subplot
aspect_ratio – Aspect ratio of 2D plots. If None, will be chosen automatically based on data ranges.
- Returns
Figure
- plot_bpoint_distance_matrix(cluster_column='cluster', bpoint_column='bpoint', metric='euclidean', ax=None)[source]#
Plot the pairwise distances of all benchmark points.
- Parameters
cluster_column – Column with the cluster names (default ‘cluster’)
bpoint_column – Column with bpoints (default ‘bpoint’)
metric – String or function. See clusterking.maths.metric.metric_selection(). Default: Euclidean distance.
ax – Matplotlib axes or None (automatic)
- Returns
Figure
- df: Optional[pd.DataFrame]#
pandas.DataFrame
to hold all of the results
DataWithErrors
- class clusterking.data.DataWithErrors(*args, **kwargs)[source]#
Bases:
clusterking.data.data.Data
This class extends the Data class by convenient and performant ways to add errors to the distributions.
See the description of the Data class for more information about the data structure itself.
There are three basic ways to add errors:
1. Add relative errors (with correlation) relative to the bin content of each bin in the distribution: add_rel_err_... (\(\mathrm{Cov}^{(k)}_{\text{rel}}(i, j)\))
2. Add absolute errors (with correlation): add_err_... (\(\mathrm{Cov}^{(k)}_{\text{abs}}(i, j)\))
3. Add Poisson errors: add_err_poisson()
The covariance matrix for bin i and j of distribution n (with contents \(d^{(n)}_i\)) will then be
\[\begin{split}\mathrm{Cov}(d^{(n)}_i, d^{(n)}_j) = &\sum_{k}\mathrm{Cov}_{\text{rel}}^{(k)}(i, j) \cdot d^{(n)}_i d^{(n)}_j + \\ + &\sum_k\mathrm{Cov}_{\text{abs}}^{(k)}(i, j) + \\ + &\delta_{ij} \sqrt{d^{(n)}_i d^{(n)}_j} / \sqrt{s}\end{split}\]
Note
All of these methods add the errors in a consistent way for all sample points/distributions, i.e. it is impossible to add a certain error specifically to one sample point only!
Afterwards, you can get errors, correlation and covariance matrices for every data point by using one of the methods such as cov(), corr(), err().
Note
When saving your dataset, your error configuration is saved as well, so you can reload it like any other Data or DFMD object.
Warning
The appendix of our paper mistakenly hinted at the unit of the relative uncertainties being in percent. This is not the case. That means that d.add_rel_err_uncorr(0.1) adds a 10% relative uncertainty, not 0.1%.
- Parameters
data – n x nbins matrix
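The covariance formula in the class description can be evaluated directly with NumPy. The sketch below is a literal transcription of that formula on toy numbers, not the library's implementation; the function name `total_covariance` is hypothetical, `s` is taken as the scale appearing in the Poisson term, and skipping that term when `s` is `None` is an assumption of the sketch.

```python
import numpy as np


def total_covariance(data, rel_cov, abs_cov, s=None):
    """Evaluate the covariance formula for every distribution.

    data: (n, nbins) bin contents; rel_cov, abs_cov: (nbins, nbins) matrices.
    s: scale in the Poisson term; the term is skipped if s is None
       (an assumption of this sketch).
    """
    data = np.asarray(data, dtype=float)
    # Outer product d_i * d_j for each distribution, shape (n, nbins, nbins).
    outer = data[:, :, np.newaxis] * data[:, np.newaxis, :]
    cov = np.asarray(rel_cov) * outer + np.asarray(abs_cov)
    if s is not None:
        idx = np.arange(data.shape[1])
        # delta_ij * sqrt(d_i * d_j) / sqrt(s) only contributes on the diagonal,
        # where sqrt(d_i * d_i) = |d_i|.
        cov[:, idx, idx] += np.abs(data) / np.sqrt(s)
    return cov


data = np.array([[1.0, 2.0], [3.0, 4.0]])
rel_cov = 0.01 * np.eye(2)   # 10% uncorrelated relative error: 0.1 ** 2 on the diagonal
abs_cov = np.zeros((2, 2))

cov_sys = total_covariance(data, rel_cov, abs_cov)
cov_all = total_covariance(data, rel_cov, abs_cov, s=4.0)
```

The relative part scales with the bin contents of each distribution, which is why a single `nbins x nbins` relative covariance yields a different `nbins x nbins` block for every sample point.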
- property rel_cov#
Relative covariance matrix that will be later applied to the data (see class documentation).
\[\mathrm{Cov}_{\text{rel}}(i, j) = \sum_k\mathrm{Cov}_{\text{rel}}^{(k)}(i, j)\]
If no errors have been added, this is defined to be a zero matrix.
- Returns
self.nbins * self.nbins
matrix
- property abs_cov#
Absolute covariance matrix that will be later applied to the data (see class documentation).
\[\mathrm{Cov}_{\text{abs}}(i, j) = \sum_k\mathrm{Cov}_{\text{abs}}^{(k)}(i, j)\]
If no errors have been added, this is defined to be a zero matrix.
- Returns
self.nbins * self.nbins
matrix
- property poisson_errors_scale: float#
Scale poisson errors. See documentation of
add_err_poisson()
.
- cov(relative=False) numpy.ndarray [source]#
Return covariance matrix \(\mathrm{Cov}(d^{(n)}_i, d^{(n)}_j)\)
If no errors have been added, a zero matrix is returned.
- Parameters
relative – “Relative to data”, i.e. \(\mathrm{Cov}(d^{(n)}_i, d^{(n)}_j) / (d^{(n)}_i \cdot d^{(n)}_j)\)
- Returns
self.n x self.nbins x self.nbins
array
- corr() numpy.ndarray [source]#
Return correlation matrix. If covariance matrix is empty (because no errors have been added), a unit matrix is returned.
- Returns
self.n x self.nbins x self.nbins
array
- err(relative=False) numpy.ndarray [source]#
Return errors per bin, i.e. \(e_i^{(n)} = \sqrt{\mathrm{Cov}(d^{(n)}_i, d^{(n)}_i)}\)
- Parameters
relative – Relative errors, i.e. \(e_i^{(n)}/d_i^{(n)}\)
- Returns
self.n x self.nbins
array
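The relation between the covariance array, the per-bin errors and the correlation matrices can be checked on a toy example. The numbers are made up, and the correlation is computed with the standard definition (covariance divided by the product of the two errors), which is assumed here rather than quoted from the library.

```python
import numpy as np

# One distribution (n = 1) with two bins; toy covariance of shape (1, 2, 2).
cov = np.array([[[0.04, 0.01],
                 [0.01, 0.09]]])
data = np.array([[2.0, 3.0]])

# Errors per bin: square root of the diagonal of each covariance matrix.
err = np.sqrt(np.diagonal(cov, axis1=1, axis2=2))

# Relative errors: e_i / d_i.
rel_err = err / data

# Standard correlation: Cov(i, j) / (e_i * e_j).
corr = cov / (err[:, :, np.newaxis] * err[:, np.newaxis, :])
```

The shapes follow the return values listed above: `err` is `n x nbins`, while `cov` and `corr` are `n x nbins x nbins`.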
- add_err_cov(cov) None [source]#
Add error from covariance matrix.
- Parameters
cov –
self.n x self.nbins x self.nbins
array of covariance matrices or self.nbins x self.nbins covariance matrix (if equal for all data points)
- add_err_corr(err, corr) None [source]#
Add error from errors vector and correlation matrix.
- Parameters
err – self.n x self.nbins vector of errors for each data point and bin or self.nbins vector of uniform errors per data point or float (uniform error per bin and datapoint)
corr – self.n x self.nbins x self.nbins correlation matrices or self.nbins x self.nbins correlation matrix
- add_err_uncorr(err) None [source]#
Add uncorrelated error.
- Parameters
err – see argument of
add_err_corr()
- df: Optional[pd.DataFrame]#
pandas.DataFrame
to hold all of the results
- add_err_maxcorr(err) None [source]#
Add maximally correlated error.
- Parameters
err – see argument of
add_err_corr()
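How an error vector and a correlation matrix combine into a covariance matrix can be sketched with NumPy. The sketch assumes that "uncorrelated" corresponds to an identity correlation matrix and "maximally correlated" to a correlation matrix filled with ones; the helper `cov_from_err_corr` is hypothetical and handles only a single `nbins` error vector.

```python
import numpy as np


def cov_from_err_corr(err, corr):
    """Covariance from errors and correlation: Cov(i, j) = e_i * corr(i, j) * e_j."""
    err = np.asarray(err, dtype=float)
    return np.outer(err, err) * np.asarray(corr, dtype=float)


err = np.array([0.1, 0.2, 0.3])

# Assumption: uncorrelated errors use the identity as correlation matrix.
uncorr = cov_from_err_corr(err, np.eye(3))

# Assumption: maximally correlated errors use a correlation matrix of ones.
maxcorr = cov_from_err_corr(err, np.ones((3, 3)))
```

In the uncorrelated case only the variances on the diagonal survive, while the maximally correlated case fills every entry with the product of the two errors.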
- add_rel_err_cov(cov) None [source]#
Add error from “relative” covariance matrix
- Parameters
cov – see argument of
add_err_cov()
- add_rel_err_corr(err, corr) None [source]#
Add error from relative errors and correlation matrix. err=0.1 means 10% uncertainty.
- Parameters
err – see argument of
add_err_corr()
corr – see argument of
add_err_corr()
- add_rel_err_uncorr(err) None [source]#
Add uncorrelated relative uncertainty. err=0.1 means 10% uncertainty.
- Parameters
err – see argument of
add_err_corr()
- add_rel_err_maxcorr(err) None [source]#
Add maximally correlated relative error. err=0.1 means 10% uncertainty.
- Parameters
err – see argument of
add_err_corr()
- add_err_poisson(normalization_scale=1) None [source]#
Add poisson errors/statistical errors.
- Parameters
normalization_scale – Apply Poisson errors corresponding to data normalization scaled up by this factor. For example, if your data is normalized to 1 and you still want to apply Poisson errors that correspond to a yield of 200, you can call add_err_poisson(200). Your data will stay normalized, but the Poisson errors are appropriate for a total yield of 200.
- Returns
None
- plot_dist_err(cluster_column='cluster', bpoint_column='bpoint', title: Union[None, str] = None, clusters: Optional[List[int]] = None, bpoints=True, legend=True, hist_kwargs: Optional[Dict[str, Any]] = None, hist_fill_kwargs: Optional[Dict[str, Any]] = None, ax=None)[source]#
Plot distribution with errors.
- Parameters
cluster_column – Column with the cluster names (default ‘cluster’)
bpoint_column – Column with bpoints (default ‘bpoint’)
title – Plot title (None: automatic)
clusters – List of clusters to select or single cluster. If None (default), all clusters are chosen.
bpoints – Draw benchmark points if available (default True). If False or no benchmark points are available, pick a random sample point for each cluster.
legend – Draw legend? (default True)
hist_kwargs – Keyword arguments to
plot_histogram()
hist_fill_kwargs – Keyword arguments to
plot_histogram_fill()
ax – Instance of matplotlib.axes.Axes to plot on. If
None
, a new one is instantiated.
Note: To customize this kind of plot further, check the BundlePlot class and the err_plot() method thereof.
- Returns
Figure