Data

This page describes the main data objects used by ClusterKinG.
If you do not need to include errors in your analysis, use Data; otherwise use DataWithErrors (which inherits from Data and adds additional methods to it).
Both classes inherit from a very basic class, DFMD, which provides basic input and output methods.
DFMD

- class clusterking.data.dfmd.DFMD(*args, log=None, **kwargs)[source]¶ Bases: object
This class bundles a pandas dataframe together with metadata and provides methods to load both from and write them to files.
- __init__(*args, log=None, **kwargs)[source]¶ There are five different ways to initialize this class:
- Initialize it empty: DFMD().
- From another DFMD object my_dfmd: DFMD(my_dfmd) or DFMD(dfmd=my_dfmd).
- From a directory path and a project name: DFMD("path/to/io", "my_name") or DFMD(directory="path/to/io", name="my_name").
- From a dataframe and a metadata object (a nested dictionary like object) or paths to corresponding files: DFMD(df=my_df, md=my_metadata) or DFMD(df="/path/to/df.csv", md=my_metadata) etc.
Warning
If you use df=<pd.DataFrame> or md=<dict like>, please be aware that this will not copy these objects, i.e. any changes that are done to these objects subsequently will affect both the original DataFrame/metadata and self.df or self.md. To avoid this, use pd.DataFrame.copy() or copy.deepcopy() to create an independent copy.
Parameters: - log – instance of logging.Logger or name of a logger to be created
- *args – See above
- **kwargs – See above
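The warning above can be illustrated with a plain nested dictionary standing in for the metadata object (a minimal sketch; ClusterKinG itself is not needed):

```python
import copy

# A nested dictionary standing in for the metadata object (md).
metadata = {"scan": {"npoints": 100}}

# Passing it around without copying: both names refer to the same
# object, so a change through one name is visible through the other.
shared = metadata
shared["scan"]["npoints"] = 500
print(metadata["scan"]["npoints"])  # 500 -- the "original" changed too

# copy.deepcopy() decouples the two objects entirely. Note that for
# nested metadata, dict.copy() would only copy the top level and
# still share the inner dictionaries.
independent = copy.deepcopy(metadata)
independent["scan"]["npoints"] = 42
print(metadata["scan"]["npoints"])  # still 500
```

For flat metadata dict.copy() suffices, but since the metadata is described as a nested dictionary-like object, copy.deepcopy() is the safe choice.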
-
log
= None¶ instance of
logging.Logger
-
df
= None¶ Pandas dataframe to hold all of the results
-
md
= None¶ This will hold all the configuration that we will write out
- static get_df_path(directory: Union[pathlib.PurePath, str], name: str) → pathlib.Path[source]¶ Return path to dataframe csv file based on directory and project name.
Parameters: - directory – Path to input/output directory
- name – Name of project
Returns: Path to dataframe csv file.
- static get_md_path(directory: Union[pathlib.PurePath, str], name: str) → pathlib.Path[source]¶ Return path to metadata json file based on directory and project name.
Parameters: - directory – Path to input/output directory
- name – Name of project
Returns: Path to metadata json file.
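Both helpers simply derive file paths from the directory and the project name. A sketch of the idea with pathlib; the concrete file-name suffixes (`_data.csv`, `_metadata.json`) are illustrative assumptions here, not necessarily ClusterKinG's actual naming scheme:

```python
from pathlib import Path
from typing import Union

# The suffixes below are hypothetical, chosen for illustration only.
def get_df_path(directory: Union[Path, str], name: str) -> Path:
    """Path to the dataframe csv file of a project."""
    return Path(directory) / (name + "_data.csv")

def get_md_path(directory: Union[Path, str], name: str) -> Path:
    """Path to the metadata json file of a project."""
    return Path(directory) / (name + "_metadata.json")

df_path = get_df_path("output", "tutorial_basics")
md_path = get_md_path("output", "tutorial_basics")
```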
-
load_md
(md_path: Union[pathlib.PurePath, str]) → None[source]¶ Load metadata from json file generated by
write_md()
.
- load_df(df_path: Union[pathlib.PurePath, str]) → None[source]¶ Load dataframe from csv file generated by write_df().
-
load
(directory: Union[pathlib.PurePath, str], name: str) → None[source]¶ Load from input files which have been generated from
write()
.Parameters: - directory – Path to input/output directory
- name – Name of project
Returns: None
-
write_md
(md_path: Union[pathlib.PurePath, str], overwrite='ask')[source]¶ Write out metadata. The file can later be read in using
load_md()
.Parameters: - md_path –
- overwrite – How to proceed if output file already exists: ‘ask’, ‘overwrite’, ‘raise’
Returns:
-
write_df
(df_path, overwrite='ask')[source]¶ Write out dataframe. The file can later be read in using
load_df()
.Parameters: - df_path –
- overwrite – How to proceed if output file already exists: ‘ask’, ‘overwrite’, ‘raise’
Returns:
-
write
(directory: Union[pathlib.PurePath, str], name: str, overwrite='ask') → None[source]¶ Write to input files that can be later loaded with
load()
.Parameters: - directory – Path to input/output directory
- name – Name of project
- overwrite – How to proceed if output file already exists: ‘ask’, ‘overwrite’, ‘raise’
Returns:
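The write/load pair above persists the dataframe as csv and the metadata as json. A minimal, self-contained sketch of the same roundtrip idea using plain pandas and the standard library (a sketch of the concept, not ClusterKinG's actual implementation):

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Toy dataframe and metadata standing in for self.df and self.md.
df = pd.DataFrame({"bin0": [0.1, 0.4], "bin1": [0.9, 0.6]})
md = {"project": "demo", "nbins": 2}

with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp)

    # "write": one csv file for the data, one json file for the metadata
    df.to_csv(directory / "demo_data.csv", index=False)
    (directory / "demo_metadata.json").write_text(json.dumps(md))

    # "load": read both files back
    df2 = pd.read_csv(directory / "demo_data.csv")
    md2 = json.loads((directory / "demo_metadata.json").read_text())

print(md2["nbins"])  # 2
```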
Data

- class clusterking.data.data.Data(*args, **kwargs)[source]¶ Bases: clusterking.data.dfmd.DFMD
This class inherits from the DFMD class and adds additional methods to it. It is the basic container that contains
- The distributions to cluster
- The cluster numbers after clustering
- The benchmark points after they are selected.
-
bin_cols
¶ All columns that correspond to the bins of the distribution. This is automatically read from the metadata as set in e.g. clusterking.scan.scanner.Scanner.run().
-
par_cols
¶ All columns that correspond to the parameters (e.g. Wilson parameters). This is automatically read from the metadata as set in e.g. the
clusterking.scan.scanner.Scanner.run()
.
-
n
¶ Number of points in parameter space that were sampled.
-
nbins
¶ Number of bins of the distribution.
-
npars
¶ Number of parameters that were sampled (i.e. number of dimensions of the sampled parameter space).
-
data
(normalize=False) → numpy.ndarray[source]¶ Returns all histograms as a large matrix.
Parameters: normalize – Normalize all histograms Returns: numpy.ndarray of shape self.n x self.nbins
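What normalize=True does can be sketched in plain numpy: stack all histograms into an n x nbins matrix and divide each row by its sum (a sketch of the semantics, not ClusterKinG's implementation):

```python
import numpy as np

# Three toy histograms (n = 3 sample points, nbins = 4 bins),
# standing in for the matrix returned by Data.data().
histograms = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 2.0, 2.0, 2.0],
    [0.5, 0.5, 1.0, 0.0],
])

def normalized(data: np.ndarray) -> np.ndarray:
    """Divide each histogram (row) by its total so every row sums to 1."""
    return data / data.sum(axis=1, keepdims=True)

norm = normalized(histograms)
print(norm.sum(axis=1))  # every entry is 1.0
```

The row sums of the unnormalized matrix are exactly what the norms() method below returns.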
-
norms
()[source]¶ Returns a vector of all normalizations of all histograms (where each histogram corresponds to one sampled point in parameter space).
Returns: numpy.ndarray of shape self.n
-
clusters
(cluster_column='cluster')[source]¶ Return list of all cluster names (unique)
Parameters: cluster_column – Column that contains the cluster names
-
get_param_values
(param: Union[None, str] = None)[source]¶ Return all unique values of this parameter
Parameters: param – Name of parameter. If none is given, a dictionary mapping parameters to their values is returned instead. Returns:
-
only_bpoints
(bpoint_column='bpoint', inplace=False)[source]¶ Keep only the benchmark points as sample points.
Parameters: - bpoint_column – benchmark point column (boolean)
- inplace – If True, the current Data object is modified, if False, a new copy of the Data object is returned.
Returns: None or Data
-
fix_param
(inplace=False, bpoints=False, bpoint_slices=False, bpoint_column='bpoint', **kwargs)[source]¶ Fix some parameter values to get a subset of sample points.
Parameters: - inplace – Modify this Data object instead of returning a new one
- bpoints – Keep bpoints (no matter if they are selected by the other selection or not)
- bpoint_slices – Keep all parameter values that are attained by benchmark points.
- bpoint_column – Column with benchmark points (default ‘bpoint’) (for use with the bpoints option)
- **kwargs – Specify parameter values: Use <parameter name>=<value> or <parameter name>=[<value1>, ..., <valuen>].
Returns: If inplace == False, return new Data with subset of sample points.
Examples:
d = Data("/path/to/tutorial/csv/folder", "tutorial_basics")
Return a new Data object, keeping the two values of CT_bctaunutau closest to -0.75 or 0.5:

d.fix_param(CT_bctaunutau=[-.75, 0.5])

Return a new Data object, where we also fix CSL_bctaunutau to the value closest to -1.0:

d.fix_param(CT_bctaunutau=[-.75, 0.5], CSL_bctaunutau=-1.0)

Return a new Data object, keeping the two values of CT_bctaunutau closest to -0.75 or 0.5, but make sure we do not discard any benchmark points in that process:

d.fix_param(CT_bctaunutau=[-.75, 0.5], bpoints=True)

Return a new Data object, keeping the two values of CT_bctaunutau closest to -0.75 or 0.5, but keep all values of CT_bctaunutau that are attained by at least one benchmark point:

d.fix_param(CT_bctaunutau=[-.75, 0.5], bpoint_slices=True)

Return a new Data object, keeping only those values of CT_bctaunutau that are attained by at least one benchmark point:

d.fix_param(CT_bctaunutau=[], bpoint_slices=True)
-
sample_param
(bpoints=False, bpoint_slices=False, bpoint_column='bpoint', inplace=False, **kwargs)[source]¶ Return a Data object that contains a subset of the sample points (points in parameter space). Similar to Data.fix_param.
Parameters: - inplace – Modify this Data object instead of returning a new one
- bpoints – Keep bpoints (no matter if they are selected by the other selection or not)
- bpoint_slices – Keep all parameter values that are attained by benchmark points
- bpoint_column – Column with benchmark points (default ‘bpoint’) (for use with the bpoints option)
- **kwargs – Specify parameter ranges: <coeff name>=(min, max, npoints) or <coeff name>=npoints. For each coeff (identified by <coeff name>), select (at most) npoints points between min and max. In total this will therefore result in npoints_{coeff_1} x … x npoints_{coeff_npar} sample points (provided that there are enough sample points available). If a coefficient isn’t contained in the dictionary, this dimension of the sample remains untouched.
Returns: If inplace == False, return new Data with subset of sample points.
Examples:
d = Data("/path/to/tutorial/csv/folder", "tutorial_basics")
Return a new Data object, subsampling CT_bctaunutau to 10 values between -1 and 1:

d.sample_param(CT_bctaunutau=(-1, 1, 10))

The same in shorter syntax (because -1 and 1 are the minimum and maximum of the parameter):

d.sample_param(CT_bctaunutau=10)
For the bpoints and bpoint_slices syntax, see the documentation of clusterking.data.data.Data.fix_param().
-
rename_clusters
(arg=None, column='cluster', new_column=None)[source]¶ Rename clusters based on either
- A dictionary of the form {<old cluster name>: <new cluster name>}
- A function that maps the old cluster name to the new cluster name
Example for 2: Say our Data object d contains clusters 1 to 10 in the default column cluster. The following method call will instead use the numbers 0 to 9:

d.rename_clusters(lambda x: x - 1)
Parameters: - arg – Dictionary or function as described above.
- column – Column that contains the cluster names
- new_column – New column to write to (default None, i.e. rename in place)
Returns: None
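Both call styles map old cluster names to new ones. The underlying idea can be sketched with pandas, whose Series.map accepts either a dictionary or a function (a sketch of the semantics, not ClusterKinG's implementation):

```python
import pandas as pd

# A toy cluster column, standing in for Data.df["cluster"].
df = pd.DataFrame({"cluster": [1, 2, 3, 2, 1]})

# Case 1: dictionary of the form {old cluster name: new cluster name}
df["renamed_dict"] = df["cluster"].map({1: "a", 2: "b", 3: "c"})

# Case 2: function mapping old name -> new name (shift 1..3 down to 0..2)
df["renamed_func"] = df["cluster"].map(lambda x: x - 1)

print(df["renamed_func"].tolist())  # [0, 1, 2, 1, 0]
```

Writing to a new column here mirrors the new_column argument; rename_clusters with new_column=None would instead overwrite the cluster column in place.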
-
plot_dist
(cluster_column='cluster', bpoint_column='bpoint', title=None, clusters=None, nlines=0, bpoints=True, legend=True)[source]¶ Plot several examples of distributions for each cluster specified.
Parameters: - cluster_column – Column with the cluster names (default ‘cluster’)
- bpoint_column – Column with bpoints (default ‘bpoint’)
- title – Plot title (
None
: automatic) - clusters – List of clusters to select, or a single cluster. If None (default), all clusters are chosen.
- nlines – Number of example distributions of each cluster to be plotted (default 0)
- bpoints – Draw benchmark points (default True)
- legend – Draw legend? (default True)
Note: To customize these kind of plots further, check the
BundlePlot
class and theplot_bundles()
method thereof.Returns: Figure
-
plot_dist_minmax
(cluster_column='cluster', bpoint_column='bpoint', title=None, clusters=None, bpoints=True, legend=True)[source]¶ Plot the minimum and maximum of each bin for the specified clusters.
Parameters: - cluster_column – Column with the cluster names (default ‘cluster’)
- bpoint_column – Column with bpoints (default ‘bpoint’)
- title – Plot title (
None
: automatic) - clusters – List of clusters to select, or a single cluster. If None (default), all clusters are chosen.
- bpoints – Draw benchmark points (default True)
- legend – Draw legend? (default True)
Note: To customize these kind of plots further, check the
BundlePlot
class and theplot_minmax()
method thereof.Returns: Figure
-
plot_dist_box
(cluster_column='cluster', bpoint_column='bpoint', title=None, clusters=None, bpoints=True, whiskers=2.5, legend=True)[source]¶ Box plot of the bin contents of the distributions corresponding to selected clusters.
Parameters: - cluster_column – Column with the cluster names (default ‘cluster’)
- bpoint_column – Column with bpoints (default ‘bpoint’)
- title – Plot title (
None
: automatic) - clusters – List of clusters to select, or a single cluster. If None (default), all clusters are chosen.
- bpoints – Draw benchmark points (default True)
- whiskers – Length of the whiskers of the box plot in units of IQR (interquartile range, containing 50% of all values). Default 2.5.
- legend – Draw legend? (default True)
Note: To customize these kind of plots further, check the
BundlePlot
class and thebox_plot()
method thereof.Returns: Figure
-
plot_clusters_scatter
(params, clusters=None, cluster_column='cluster', bpoint_column='bpoint', legend=True, max_subplots=16, max_cols=4, figsize=(4, 4), markers=('o', 'v', '^', 'v', '<', '>'))[source]¶ Create scatter plots, specifying the columns to be on the axes of the plot. If 3 columns are specified, 3D scatter plots are presented, else 2D plots. If the dataframe contains more columns, such that each row is not only specified by the columns on the axes, a selection of subplots is created, showing ‘cuts’. Benchmark points are marked by enlarged plot markers.
Parameters: - params – The names of the columns to be shown on the x, y (and z) axis of the plots.
- clusters – The clusters to be plotted (default: all)
- cluster_column – Column with the cluster names (default ‘cluster’)
- bpoint_column – Column with bpoints (default ‘bpoint’)
- legend – Draw legend? (default True)
- max_subplots – Maximal number of subplots
- max_cols – Maximal number of columns of the subplot grid
- figsize – Figure size of each subplot
- markers – List of markers for the clusters
Returns: Figure
-
plot_clusters_fill
(params, cluster_column='cluster', bpoint_column='bpoint', legend=True, max_subplots=16, max_cols=4, figsize=(4, 4))[source]¶ Call this method with two column names, x and y. The results are similar to those of 2D scatter plots as created by the scatter method, except that the coloring is expanded to the whole xy plane. Note: This method only works with uniformly sampled NP!
Parameters: - params – The names of the columns to be shown on the x, y (and z) axis of the plots.
- cluster_column – Column with the cluster names (default ‘cluster’)
- bpoint_column – Column with bpoints (default ‘bpoint’)
- legend – Draw legend? (default True)
- max_subplots – Maximal number of subplots
- max_cols – Maximal number of columns of the subplot grid
- figsize – Figure size of each subplot
Returns:
DataWithErrors
-
class
clusterking.data.dwe.
DataWithErrors
(*args, **kwargs)[source]¶ Bases:
clusterking.data.data.Data
This class extends the Data class by convenient and performant ways to add errors to the distributions. See the description of the Data class for more information about the data structure itself.
There are three basic ways to add errors:
1. Add relative errors (with correlation) relative to the bin content of each bin in the distribution: add_rel_err_...
2. Add absolute errors (with correlation): add_err_...
3. Add poisson errors: add_err_poisson
Note
All of these methods add the errors in a consistent way for all sample points/distributions, i.e. it is impossible to add a certain error specifically to one sample point only!
Afterwards, you can get errors, correlation and covariance matrices for every data point by using one of the methods such as cov, corr, err.
Note
When saving your dataset, your error configuration is saved as well, so you can reload it like any other Data object.
Parameters: data – n x nbins matrix
-
rel_cov
¶
-
abs_cov
¶
-
poisson_errors
¶
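The three ways of adding errors boil down to standard constructions on an error vector and a correlation matrix: the covariance is Cov_ij = err_i * err_j * corr_ij, relative errors are scaled by the bin contents first, and Poisson errors are the square roots of the bin contents. A plain-numpy sketch of this math (not ClusterKinG's code):

```python
import numpy as np

# One toy distribution (nbins = 3), standing in for a row of the data matrix.
data = np.array([100.0, 400.0, 25.0])

# Absolute errors with correlation: Cov_ij = err_i * err_j * corr_ij
err = np.array([1.0, 2.0, 0.5])
corr = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
cov = np.outer(err, err) * corr

# Relative errors: scale by the bin contents before building the covariance.
rel_err = np.array([0.01, 0.01, 0.02])
cov_rel = np.outer(rel_err * data, rel_err * data) * corr

# Poisson errors: err_i = sqrt(data_i), uncorrelated, so Cov_ii = data_i.
cov_poisson = np.diag(data)

print(cov[0, 1])  # 1.0 * 2.0 * 0.5 = 1.0
```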
-
data
(decorrelate=False, **kwargs)[source]¶ Return data matrix
Parameters: - decorrelate – Unrotate the correlation matrix to return uncorrelated data entries
- **kwargs – Any keyword argument to
Data.data()
Returns: self.n x self.nbins array
-
cov
(relative=False)[source]¶ Return covariance matrix
Parameters: relative – “Relative to data”, i.e. \(\mathrm{Cov}_{ij} / (\mathrm{data}_i \cdot \mathrm{data}_j)\) Returns: self.n x self.nbins x self.nbins array
-
err
(relative=False)[source]¶ Return errors per bin
Parameters: relative – Relative errors Returns: self.n x self.nbins array
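Conversely, per-bin errors and correlations can be recovered from a covariance matrix via err_i = sqrt(Cov_ii) and corr_ij = Cov_ij / (err_i * err_j). A plain-numpy sketch of that relation (illustrating the math, not ClusterKinG's code):

```python
import numpy as np

# A toy covariance matrix for one data point (nbins = 2).
cov = np.array([
    [4.0, 1.0],
    [1.0, 9.0],
])

err = np.sqrt(np.diag(cov))       # per-bin errors: [2.0, 3.0]
corr = cov / np.outer(err, err)   # unit diagonal by construction

print(err)         # [2. 3.]
print(corr[0, 1])  # 1.0 / 6.0
```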
-
add_err_cov
(cov) → None[source]¶ Add error from covariance matrix.
Parameters: cov – self.n x self.nbins x self.nbins array of covariance matrices or self.nbins x self.nbins covariance matrix (if equal for all data points)
-
add_err_corr
(err, corr) → None[source]¶ Add error from errors vector and correlation matrix.
Parameters: - err – self.n x self.nbins vector of errors for each data point and bin or self.nbins vector of uniform errors per data point or float (uniform error per bin and datapoint)
- corr – self.n x self.nbins x self.nbins correlation matrices or self.nbins x self.nbins correlation matrix
-
add_err_uncorr
(err) → None[source]¶ Add uncorrelated error.
Parameters: err – see argument of add_err_corr()
-
add_err_maxcorr
(err) → None[source]¶ Add maximally correlated error.
Parameters: err – see argument of add_err_corr()
-
add_rel_err_cov
(cov: numpy.ndarray) → None[source]¶ Add error from “relative” covariance matrix.
Parameters: cov – see argument of add_err_cov()
-
add_rel_err_corr
(err, corr) → None[source]¶ Add error from relative errors and correlation matrix.
Parameters: - err – see argument of
add_err_corr()
- corr – see argument of
add_err_corr()
- err – see argument of
-
add_rel_err_uncorr
(err: numpy.ndarray) → None[source]¶ Add uncorrelated relative error.
Parameters: err – see argument of add_err_corr()
-
add_rel_err_maxcorr
(err: numpy.ndarray) → None[source]¶ Add maximally correlated relative error.
Parameters: err – see argument of add_err_corr()