References¶
Release: 1.0.3 Date: July 10, 2019
yafe.base module¶
Base functions and classes for experiments.
Designing an Experiment¶
Designing an experiment consists in:
- Creating an Experiment object which includes, as arguments of the constructor:
  - a get_data function to load data,
  - a get_problem class or function to generate problems from the data,
  - a get_solver class or function to solve problems,
  - a measure function to compute performance.
- Adding tasks to the Experiment object by specifying the data, problem and solver parameters.
The API for designing an experiment is described below and is illustrated in the yafe tutorial.
Loading data: get_data and data parameters¶
Experiments are often applied to several data items. Items may be entries loaded from a dataset or data synthesized from some control parameters.
The input of function get_data is one or several parameters that characterize which data item must be returned. The output of function get_data is a dictionary with arbitrary keys, whose values are the returned data.
Note that in the current version, get_data should have at least one parameter (which may take a single value).
Examples:
- For a dataset of n items, one may design a get_data(i_item) function that takes an integer i_item, loads item number i_item in a variable x and returns a dictionary {'data_item': x}; one may then add tasks using data_params = {'i_item': np.arange(n)}.
- For synthetic data, one may design a get_data(f0) function that takes a real-valued frequency f0, synthesizes a sinusoid with frequency f0 in a variable x and returns a dictionary {'signal': x}; one may then add tasks using data_params = {'f0': np.linspace(0, 1, 10)}.
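The second, synthetic-data example can be sketched as follows; the signal length and the function body are illustrative choices for this sketch, not part of the yafe API:

```python
import numpy as np

def get_data(f0):
    """Synthesize a sinusoid with frequency f0 (illustrative sketch)."""
    n_samples = 128  # arbitrary signal length chosen for the example
    x = np.sin(2 * np.pi * f0 * np.arange(n_samples))
    return {'signal': x}  # arbitrary key; downstream blocks must match it
```

One may then pass data_params = {'f0': np.linspace(0, 1, 10)} to add_tasks() so that each task receives one frequency value.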
Generating problems: get_problem and problem parameters¶
Problems to be solved are generated from the data. Several instances of a problem may be generated from one data item by varying some problem parameters, e.g., the amount of noise. Thus, designing the problem generation stage consists in specifying how to turn problem parameters and data into problem data (and its solution). This may be done equivalently by a class or a function.
get_problem may be a class, in which case:
- the inputs of the __init__ method are the parameters of the problem,
- the parameters of the __call__ method must match the keys of the dictionary obtained from get_data,
- the output of the __call__ method is a tuple (problem_data, solution_data) where problem_data is a dictionary containing the problem data for the solver and solution_data is a dictionary containing the solution of the problem for the performance measure.
Thus, an instance of that class is generated for each set of problem parameters and is called on data to generate the problem data and its solution.
Alternatively, get_problem may be a function, in which case:
- its inputs are the parameters of the problem,
- its output must be a function that takes some data as inputs and returns the tuple (problem_data, solution_data) as described in the case above.
Note that in the current version, get_problem should have at least one parameter (which may take a single value).
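A minimal class-based get_problem might look as follows; it assumes a get_data block returning a dictionary with the key 'signal', and the noise model and key names are illustrative:

```python
import numpy as np

class get_problem:
    """Generate a denoising problem by adding white Gaussian noise (sketch)."""

    def __init__(self, noise_std):
        # problem parameter
        self.noise_std = noise_std

    def __call__(self, signal):
        # the parameter name 'signal' matches the key returned by get_data
        rng = np.random.RandomState(0)  # fixed seed for reproducibility
        observation = signal + self.noise_std * rng.randn(*signal.shape)
        problem_data = {'observation': observation}   # passed to the solver
        solution_data = {'clean_signal': signal}      # kept for the measure
        return problem_data, solution_data
```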
Solving problems: get_solver and solver parameters¶
A solver generates a solution from some problem data. Several instances of a solver may be used by varying the solver’s parameters, e.g., the order of the model. As for problems, solvers can be implemented equivalently by a class or a function.
get_solver may be a class, in which case:
- the inputs of the __init__ method are the parameters of the solver,
- the parameters of the __call__ method must match the keys of the dictionary obtained from the problem generation method __call__,
- the output of the __call__ method is a dictionary containing the solution generated by the solver.
Alternatively, get_solver may be a function, in which case:
- its inputs are the parameters of the solver,
- its output must be a function that takes some problem data as inputs and returns a dictionary containing the solution generated by the solver, similarly to the __call__ method described above.
Note that in the current version, get_solver should have at least one parameter (which may take a single value).
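A function-based get_solver consistent with a denoising problem could be sketched as follows; the moving-average filter and the key names ('observation', 'estimate') are illustrative assumptions:

```python
import numpy as np

def get_solver(filter_len):
    """Return a solver that denoises by a moving-average filter (sketch)."""
    def solver(observation):
        # 'observation' matches the key produced by the problem generator
        kernel = np.ones(filter_len) / filter_len
        estimate = np.convolve(observation, kernel, mode='same')
        return {'estimate': estimate}  # key read by the measure function
    return solver
```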
Computing performance: measure¶
Several performance measures may be calculated from the estimated solution, using other data like the original data, the problem data, or parameters of the data, problem or solver.
These performance measures must be computed within a function measure as follows:
- its arguments should be dictionaries source_data, problem_data, solution_data and solved_data, as returned by the data access function, the __call__ method of the problem generation class and the __call__ method of the solver class, respectively; an additional argument task_params is a dictionary that contains the data, problem and solver parameters;
- its output is a dictionary whose keys and values are the names and values of the various performance measures.
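Following the signature described above, a measure function for a denoising experiment could be sketched as follows; the key names ('clean_signal', 'estimate') and the error metrics are illustrative assumptions:

```python
import numpy as np

def measure(source_data=None, problem_data=None, solution_data=None,
            solved_data=None, task_params=None):
    """Compare the solver output to the true solution (illustrative sketch)."""
    err = solved_data['estimate'] - solution_data['clean_signal']
    return {'mse': float(np.mean(err ** 2)),
            'max_abs_err': float(np.max(np.abs(err)))}
```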
Setting data, problem and solver parameters: adding tasks¶
Once an instance of Experiment is created based on the four processing blocks given by functions or classes get_data, get_problem, get_solver and measure, one must define the parameters that must be used for each of these blocks. This is done by calling method add_tasks(): its inputs are three dictionaries data_params, problem_params and solver_params whose keys match the inputs of get_data, get_problem and get_solver respectively, and whose values are array-like objects containing the parameters that must be used in the instances of those processing blocks. A hypercube is automatically generated, based on the cartesian product of all the arrays of parameters (internally stored in a so-called schema). Each point in this hypercube is a full set of parameters that defines a so-called task by generating some particular instances of data, problem and solver.
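The cartesian-product construction of the task hypercube can be illustrated in plain Python; the parameter names and values below are hypothetical:

```python
from itertools import product

# hypothetical parameter grids, as they would be passed to add_tasks()
data_params = {'f0': [0.1, 0.2, 0.3]}
problem_params = {'noise_std': [0.0, 1.0]}
solver_params = {'filter_len': [3, 5]}

all_params = {**data_params, **problem_params, **solver_params}
names = list(all_params)
# one task per point of the cartesian product of all parameter arrays
tasks = [dict(zip(names, values)) for values in product(*all_params.values())]
# 3 * 2 * 2 = 12 tasks
```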
Running an Experiment¶
Once an Experiment is defined and tasks have been added, one can run the tasks by calling method launch_experiment(), which executes all the tasks, except those that may have already been executed previously. At any time, one may monitor the execution of the tasks by calling method display_status().
Running an Experiment on a cluster¶
Tasks may also be executed in parallel, individually or in batches, by specifying the parameter task_ids in launch_experiment(). Function generate_oar_script() may be used or taken as an example to split the set of tasks among jobs that can be run by several cores in parallel.
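The grouping of task ids into batches can be sketched as follows; this mimics the batching logic conceptually and is not the actual generate_oar_script() implementation:

```python
def batch_task_ids(task_ids, batch_size):
    """Group task ids into consecutive batches of at most batch_size ids."""
    return [task_ids[i:i + batch_size]
            for i in range(0, len(task_ids), batch_size)]
```

For instance, batch_task_ids(list(range(5)), 2) yields [[0, 1], [2, 3], [4]], i.e., three jobs for five tasks.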
Adding tasks¶
Method add_tasks() may be used to add new parameter values, in which case the parameter hypercube is extended, recomputing all cartesian products between parameters, and new tasks are created while results from already-executed tasks are kept.
Collecting results¶
Method collect_results() gathers the task results into a single structure and saves it into a single file. Method load_results() then loads and returns the results so that the user can process and display them. The main structure that contains the results is a multi-dimensional array whose dimensions match the parameters of the tasks (see above) and the performance measures (last dimension). This array may be loaded as a numpy.ndarray or as a xarray.DataArray.
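The layout of the collected results array can be illustrated with hypothetical dimensions; the labels and values below are made up for the example:

```python
import numpy as np

# hypothetical axes: 3 values of f0, 2 noise levels, 2 filter lengths,
# and 2 performance measures on the last axis
results = np.arange(3 * 2 * 2 * 2, dtype=float).reshape(3, 2, 2, 2)
axes_labels = ['f0', 'noise_std', 'filter_len', 'measure']
axes_values = [[0.1, 0.2, 0.3], [0.0, 1.0], [3, 5], ['mse', 'max_abs_err']]

# value of the 'mse' measure for f0=0.2, noise_std=1.0, filter_len=3
mse = results[axes_values[0].index(0.2),
              axes_values[1].index(1.0),
              axes_values[2].index(3),
              axes_values[3].index('mse')]
```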
Misc¶
Tasks IDs and parameters¶
Task IDs are generated from the parameters and depend on the sequence of calls to method add_tasks(). Finding a task ID from a set of parameters and vice versa is not trivial and may be done by calling methods get_task_data_by_id() and get_task_data_by_params().
Dealing with several solvers¶
In order to compare several solvers, one may design one experiment per solver, using the same generators and parameters for data and problems, and merge the results.
Random seeds (reproducibility, consistent problem generation)¶
It is often important to control the random seeds, e.g., in order to reproduce the experiments. Since all the tasks are run independently, controlling the random seed is also required in order to have the exact same problem data and solution in all tasks with the same data parameters and problem parameters, especially when noise is added.
This can be obtained by using the random state mechanism provided by numpy.random.RandomState. The random state seed may be used inside each block (see get_problem in the yafe tutorial) or passed as a parameter to each processing block (as in scikit-learn).
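Seeding inside the problem block can be sketched as follows, so that any two tasks sharing the same data and problem parameters generate the exact same noisy observation; the parameter and key names are illustrative:

```python
import numpy as np

class get_problem:
    """Noisy problem generation with a controlled random state (sketch)."""

    def __init__(self, noise_std, random_seed=42):
        self.noise_std = noise_std
        self.random_seed = random_seed

    def __call__(self, signal):
        # a fresh RandomState per call: identical parameters -> identical noise
        rng = np.random.RandomState(self.random_seed)
        observation = signal + self.noise_std * rng.randn(*signal.shape)
        return {'observation': observation}, {'clean_signal': signal}
```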
Files and folders¶
A number of files and subfolders are generated in a folder named after the experiment and located in the data_path set when creating the experiment. The user should not need to handle those files, since results are collected automatically, except when tracking errors.
At the experiment level, a so-called schema is generated and contains all the data, problem and solver parameters (file _schema.pickle). Results from all tasks are gathered in a single file results.npz.
At the task level, a sub-directory is created for each task. It contains several files generated when the task is created (parameters of the task in task_params.pickle), when the task is run (raw data in source_data.pickle, problem data in problem_data.pickle, true solution in solution_data.pickle, solved data in solved_data.pickle, performance measures in result.pickle) and when an error occurs during any processing step while running the task (error_log).
yafe configuration file¶
The yafe package uses a configuration file stored in $HOME/.config/yafe.conf to specify the default data_path used to store the experiment data, as well as the default logger_path used to store the processing log.
You can create an empty configuration file by running the static method generate_config() and then edit the generated file $HOME/.config/yafe.conf.
The resulting configuration file should have the following form:
[USER]
# user directory (data generated by experiments)
data_path = /some/directory/path
[LOGGER]
# path to the log file
path = /some/file/path/yafe.log
If you want to avoid the use of the yafe configuration file, you need to systematically specify the data_path and logger_path parameters when creating an Experiment (or alternatively, for logger_path, you can set the flag log_to_file to False).
class yafe.base.Experiment(name, get_data, get_problem, get_solver, measure, force_reset=False, data_path=None, log_to_file=True, log_to_console=False, logger_path=None)[source]¶
Bases: object
Definition of a class to design experiments.
Parameters: name : str
Name of the experiment, used to store experiment data in an eponym subfolder.
get_data : function
A function that returns data from some data parameters.
get_problem : callable class or function
A class or function that takes some problem parameters and returns a callable object that takes input data from method get_data and returns some problem data and true solution data.
get_solver : callable object or function
A class or function that takes some solver parameters and returns a callable object that takes problem data from method get_problem and returns some solution data.
measure : function
Function that takes some task information and data returned by methods get_data, get_problem and get_solver in order to compute one or several performance measures.
force_reset : bool
Flag forcing the reset of the Experiment during initialization. If True, the previously computed data stored on disk for the Experiment with the same name are destroyed. If False, the previously computed data are reused (default option).
data_path : None or str or pathlib.Path
Path of the folder where the experiment data subfolder must be created. If data_path is None, the data path given in the yafe configuration file is used.
log_to_file : bool
Indicates if the processing and debug information must be logged in a file.
log_to_console : bool
Indicates if the processing information must be displayed in the console.
logger_path : None or str or pathlib.Path
Full file path where the logged data must be written when log_to_file == True. If logger_path is None, the logger path given in the yafe configuration file is used.

- add_tasks(data_params, problem_params, solver_params)[source]¶
Add tasks using combinations of parameters.
Parameters: data_params : dict
Parameters of generation of the dataset.
problem_params : dict
Parameters of generation of the problem.
solver_params : dict
Parameters of generation of the solver.
Raises: TypeError
If parameters are not dict.
ValueError
If parameters contain unexpected keys.
Notes
- All parameters should be dict, keyed by the name of the parameter and valued by a list.
- Tasks are generated as the cartesian product of the sets of parameters. If parameters are added by successive calls to add_tasks(), the resulting set of tasks is the cartesian product of all updated sets of parameters (which is equivalent to the set of tasks obtained by calling add_tasks() only once with all the parameters).
- collect_results()[source]¶
Collect results from an experiment.
Merge all the task results in a numpy.ndarray and save the resulting data with axes labels and values in a file.
The resulting data are stored in the data directory of the experiment in a file named results.npz. This file contains three numpy.ndarray:
- results: the collected results,
- axes_labels: the labels of the axes of results,
- axes_values: the values corresponding to the indexes for all axes of results.
The dimensions of the main array results match the parameters of the tasks and the performance measures (last dimension). More precisely, results[i_0, ..., i_j, ..., i_N] is the value of the performance measure axes_values[N][i_N] for parameter values axes_values[j][i_j], j < N, related to parameter names axes_labels[j].
The stored data can be loaded using the method load_results().
Raises: RuntimeError
If no results have been computed prior to running the method.
- display_status(summary=True)[source]¶
Display the status of the tasks.
Parameters: summary : bool, optional
Indicates if only a summary of the status is displayed or not.
- generate_tasks()[source]¶
Generate the tasks from _schema.
Already existing tasks are not rewritten.
- get_pending_task_ids()[source]¶
Get the ids of the tasks that have not been run and successfully completed.
Returns: set
Set of pending task ids.
- get_task_data_by_id(idt)[source]¶
Get data for a task specified by its id.
Parameters: idt : int
Id of the task.
Returns: dict
Task data.
Raises: ValueError
If idt is not a valid value.
- get_task_data_by_params(data_params, problem_params, solver_params)[source]¶
Get data for a task specified by its parameters.
Parameters: data_params : dict
Parameters of generation of the dataset.
problem_params : dict
Parameters of generation of the problem.
solver_params : dict
Parameters of generation of the solver.
Returns: dict
Task data.
Raises: TypeError
If parameters are not dict.
ValueError
If parameters contain unexpected keys or unexpected values.
- launch_experiment(task_ids=None)[source]¶
Launch an experiment.
Parameters: task_ids : str or list of ints or None
List of task ids to run. If str, task ids must be separated by commas. If None, only pending tasks are run, to avoid running the same tasks several times.
- load_results(array_type='numpy')[source]¶
Load the collected results from an experiment.
Load the data stored by the method collect_results() in the file results.npz.
Parameters: array_type : {'numpy', 'xarray'}
Type of output. With the option 'numpy', the output is split into three numpy.ndarray, while with the option 'xarray' the output is a single xarray.DataArray.
Returns: (results, axes_labels, axes_values) : tuple of numpy.ndarray
Returned when using the default option array_type='numpy'. See collect_results() for a detailed description.
xresults : xarray.DataArray
Returned when using the option array_type='xarray'. This is equivalent to the case array_type='numpy', combining all three arrays into a single data structure for improved consistency and ease of use. xresults can be obtained from (results, axes_labels, axes_values) as xresults = xarray.DataArray(data=results, dims=axes_labels, coords=axes_values).
Raises: ImportError
If the xarray package is not installed and the array_type='xarray' option is used.
ValueError
If an unrecognized value is given for array_type.
RuntimeError
If no collected results file can be found.
yafe.utils module¶
Utils classes and functions for yafe.
class yafe.utils.ConfigParser[source]¶
Bases: configparser.ConfigParser
Configuration file parser for yafe.
This class inherits from ConfigParser in the configparser module. It enables reading the yafe configuration file $HOME/.config/yafe.conf at initialization of an experiment. It also provides a method to properly read a path in the configuration file, and a static method to generate a basic empty configuration file.
- static generate_config()[source]¶
Generate an empty configuration file.
The generated configuration file is stored in $HOME/.config/yafe.conf.
- get_path(section, option, check_exists=True)[source]¶
Get the path filled in a given option of a given section.
Parameters: section : str
Name of the section.
option : str
Name of the option.
check_exists : bool, optional
Indicates if the existence of the path is checked.
Returns: pathlib.Path or None
Path if the option is defined, None otherwise.
Raises: IOError
If the parameter check_exists is set to True and the path does not exist.
yafe.utils.generate_oar_script(script_file_path, xp_var_name, task_ids=None, batch_size=1, oar_walltime='02:00:00', activate_env_command=None, use_gpu=False)[source]¶
Generate a script to launch an experiment using OAR.
Tasks are divided into batches that are executed by OAR jobs.
The resulting script is written in the experiment folder, and the command to launch the jobs with OAR is displayed in the terminal.
An example script illustrating how to use yafe.utils.generate_oar_script() is available in the corresponding tutorial.
Parameters: script_file_path : str
File path to the script that defines the experiment.
xp_var_name : str
Name of the variable containing the experiment in the script.
task_ids : list
List of task ids to run. If task_ids is None, the list of pending tasks of the experiment is used.
batch_size : int
Number of tasks run in each batch.
oar_walltime : str
Wall time for each OAR job (‘HH:MM:SS’).
activate_env_command : str or None
Optional command that must be run to activate a Python virtual environment before launching the experiment. Typically, this is a command of the form source some_virtual_env/bin/activate when using virtualenv and source activate some_conda_env when using conda. If activate_env_command is None, no virtual environment is activated.
use_gpu : bool
Flag specifying if a GPU resource is needed when running the experiment.
yafe.utils.get_logger(name='', to_file=True, to_console=True, logger_path=None)[source]¶
Return a yafe logger with the given name.
Parameters: name : str
Name of the logger.
to_file : bool
Indicates if the logger writes processing and debug information in a file.
to_console : bool
Indicates if the logger displays processing information in the console.
logger_path : None or str or pathlib.Path
Full file path where the data must be written by the logger when to_file == True. If logger_path is None, the logger path given in the yafe configuration file is used.
Returns: logging.Logger
A logger with the given name prefixed by 'yafe.'.
Notes
The name of the logger is automatically prefixed by 'yafe.'.