References

Release: 1.0.3
Date: July 10, 2019

yafe.base module

Base functions and classes for experiments.

Designing an Experiment

Designing an experiment consists in:

  • Creating an Experiment object which includes, as arguments of the constructor:

    • a get_data function to load data,
    • a get_problem class or function to generate problems from the data,
    • a get_solver class or function to solve problems,
    • a measure function to compute performance.
  • Adding tasks to the Experiment object by specifying the data, problem and solver parameters.

The API for designing an experiment is described below and is illustrated in the yafe tutorial.

Loading data: get_data and data parameters

Experiments are often applied to several data items. Items may be entries loaded from a dataset or data synthesized from some control parameters.

The input of function get_data is one or several parameters that characterize which data item must be returned.

The output of function get_data is a dictionary with arbitrary keys, whose values are the returned data.

Note that in the current version, get_data should have at least one parameter (which may take a single value).

Examples:

  • For a dataset of n items, one may design a get_data(i_item) function that takes an integer i_item, loads item number i_item into a variable x and returns a dictionary {'data_item': x}; one may then add tasks using data_params = {'i_item': np.arange(n)}.
  • For synthetic data, one may design a get_data(f0) function that takes a real-valued frequency f0, synthesizes a sinusoid with frequency f0 in a variable x and returns a dictionary {'signal': x}; one may then add tasks using data_params = {'f0': np.linspace(0, 1, 10)}, as in the sketch below.
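
Here is that sketch in minimal form (the signal length of 1000 samples is an arbitrary choice, not imposed by yafe):

import numpy as np

def get_data(f0):
    # Synthesize a sinusoid with frequency f0 (in cycles per sample).
    x = np.sin(2 * np.pi * f0 * np.arange(1000))
    return {'signal': x}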

Generating problems: get_problem and problem parameters

Problems to be solved are generated from the data. Several instances of a problem may be generated from one data item by varying some problem parameters, e.g., the amount of noise. Thus, designing the problem generation stage consists in specifying how to turn problem parameters and data into problem data (and its solution). This may be done equivalently by a class or a function.

get_problem may be a class, in which case:

  • the inputs of the __init__ method are the parameters of the problem,
  • the parameters of the __call__ method must match the keys of the dictionary obtained from get_data,
  • the output of the __call__ method is a tuple (problem_data, solution_data) where problem_data is a dictionary containing the problem data for the solver and solution_data is a dictionary containing the solution of the problem for the performance measure.

Thus, an instance of that class is generated for each set of problem parameters and is called on the data to generate the problem data and its solution.
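
As an illustration, here is a minimal sketch of a class-based get_problem that adds Gaussian noise at a given signal-to-noise ratio (the noise model and key names are illustrative, not imposed by yafe; see the section on random seeds below for a reproducible variant):

import numpy as np

class get_problem:
    def __init__(self, snr_db):
        # Problem parameter: signal-to-noise ratio in dB.
        self.snr_db = snr_db

    def __call__(self, signal):
        # The parameter name 'signal' matches the key returned by get_data.
        noise = np.random.randn(*signal.shape)
        noise *= (np.linalg.norm(signal) / np.linalg.norm(noise)
                  * 10 ** (-self.snr_db / 20))
        problem_data = {'observation': signal + noise}
        solution_data = {'clean_signal': signal}
        return problem_data, solution_data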

Alternatively, get_problem may be a function, in which case:

  • its inputs are the parameters of the problem,
  • its output must be a function that takes some data as inputs and returns the tuple (problem_data, solution_data) as described in the case above.
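
Equivalently, the same illustrative problem as a function returning a closure (numpy is assumed to be imported as np, as above):

def get_problem(snr_db):
    def generate_problem(signal):
        noise = np.random.randn(*signal.shape)
        noise *= (np.linalg.norm(signal) / np.linalg.norm(noise)
                  * 10 ** (-snr_db / 20))
        return {'observation': signal + noise}, {'clean_signal': signal}
    return generate_problem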

Note that in the current version, get_problem should have at least one parameter (which may take a single value).

Solving problems: get_solver and solver parameters

A solver generates a solution from some problem data. Several instances of a solver may be used by varying the solver’s parameters, e.g., the order of the model. As for problems, solvers can be implemented equivalently by a class or a function.

get_solver may be a class, in which case:

  • the inputs of the __init__ method are the parameters of the solver,
  • the parameters of the __call__ method must match the keys of the dictionary obtained from the problem generation method __call__,
  • the output of the __call__ method is a dictionary containing the solution generated by the solver.
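
As an illustration, a minimal class-based get_solver implementing a moving-average denoiser (the smoothing approach and key names are illustrative, not imposed by yafe):

import numpy as np

class get_solver:
    def __init__(self, filter_len):
        # Solver parameter: length of the smoothing window.
        self.filter_len = filter_len

    def __call__(self, observation):
        # The parameter name 'observation' matches the key in problem_data.
        kernel = np.ones(self.filter_len) / self.filter_len
        return {'estimate': np.convolve(observation, kernel, mode='same')}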

Alternatively, get_solver may be a function, in which case:

  • its inputs are the parameters of the solver,
  • its output must be a function that takes some problem data as inputs and returns a dictionary containing the solution generated by the solver, similarly as the __call__ method described above.
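
Equivalently, as a function returning a closure (numpy imported as np, as above):

def get_solver(filter_len):
    def solve_problem(observation):
        kernel = np.ones(filter_len) / filter_len
        return {'estimate': np.convolve(observation, kernel, mode='same')}
    return solve_problem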

Note that in the current version, get_solver should have at least one parameter (which may take a single value).

Computing performance: measure

Several performance measures may be calculated from the estimated solution, using other data like the original data, the problem data, or parameters of the data, problem or solver.

These performance measures must be computed within a function measure as follows:

  • its arguments should be the dictionaries source_data, problem_data, solution_data and solved_data, as returned by the data access function, the __call__ method of the problem generation class (which returns both problem_data and solution_data) and the __call__ method of the solver class, respectively; an additional argument task_params is a dictionary that contains the data, problem and solver parameters;
  • its output is a dictionary whose keys and values are the names and values of the various performance measures.
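
As an illustration, a minimal measure function consistent with the description above (the key names follow the illustrative sketches of the previous sections; the default values of None keep the sketch robust to keyword-based calls):

import numpy as np

def measure(source_data=None, problem_data=None, solution_data=None,
            solved_data=None, task_params=None):
    # Compare the solver estimate with the true solution.
    err = solution_data['clean_signal'] - solved_data['estimate']
    return {'rmse': np.sqrt(np.mean(err ** 2)),
            'max_error': np.max(np.abs(err))}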

Setting data, problem and solver parameters: adding tasks

Once an instance of Experiment has been created from the four processing blocks given by the functions or classes get_data, get_problem, get_solver and measure, one must define the parameters to be used for each of these blocks. This is done by calling method add_tasks(): its inputs are three dictionaries data_params, problem_params and solver_params, whose keys match the inputs of get_data, get_problem and get_solver respectively, and whose values are array-like objects containing the parameter values to be used by those processing blocks. A hypercube is automatically generated as the Cartesian product of all the parameter arrays (internally stored in a so-called schema). Each point in this hypercube is a full set of parameters that defines a so-called task, by generating particular instances of data, problem and solver.
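
Putting the illustrative blocks of the previous sections together, a sketch of a full experiment definition could read (name and parameter values are arbitrary):

import numpy as np
from yafe.base import Experiment

experiment = Experiment(name='sinusoid_denoising',
                        get_data=get_data,
                        get_problem=get_problem,
                        get_solver=get_solver,
                        measure=measure)
experiment.add_tasks(data_params={'f0': np.linspace(0.01, 0.1, 10)},
                     problem_params={'snr_db': [-10, 0, 10]},
                     solver_params={'filter_len': [2, 4, 8]})

This defines 10 x 3 x 3 = 90 tasks, one per point of the parameter hypercube.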

Running an Experiment

Once an Experiment is defined and tasks have been added, one can run the tasks by calling method launch_experiment(), which executes all the tasks except those that have already been executed. At any time, one may monitor the execution of the tasks by calling method display_status().
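
A typical run could look as follows:

experiment.generate_tasks()     # create one subfolder per task (existing tasks are kept)
experiment.launch_experiment()  # execute all pending tasks
experiment.display_status()     # print a summary of the task statuses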

Running an Experiment on a cluster

Tasks may also be executed in parallel, individually or in batches, by specifying the parameter task_ids in launch_experiment(). Function generate_oar_script() may be used, or adapted, to split the set of tasks among jobs that can be run in parallel on several cores.

Adding tasks

Method add_tasks() may be used to add new parameter values, in which case the parameter hypercube is extended, recomputing all cartesian products between parameters, and new tasks are created while results from already-executed tasks are kept.

Collecting results

Method collect_results() gathers the task results into a single structure and saves it into a single file. Method load_results() then loads and returns the results so that the user can process and display them. The main structure containing the results is a multi-dimensional array whose dimensions match the parameters of the tasks (see above) and the performance measures (last dimension). This array may be loaded as a numpy.ndarray or as a xarray.DataArray.
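
For instance:

experiment.collect_results()
results, axes_labels, axes_values = experiment.load_results()

# With the xarray package installed, the same data as a single labeled array:
xresults = experiment.load_results(array_type='xarray')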

Misc

Tasks IDs and parameters

Task IDs are generated from the parameters and depend on the sequence of calls to method add_tasks(). Finding a task ID from a set of parameters, and vice versa, is not trivial; this may be done by calling methods get_task_data_by_id() and get_task_data_by_params().
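
For instance, with the illustrative experiment defined above:

task_data = experiment.get_task_data_by_id(idt=0)  # assuming task id 0 exists

# Equivalent lookup from a full set of parameter values:
task_data = experiment.get_task_data_by_params(
    data_params={'f0': 0.01},
    problem_params={'snr_db': -10},
    solver_params={'filter_len': 2})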

Dealing with several solvers

In order to compare several solvers, one may design one experiment per solver, using the same generators and parameters for data and problems, and merge the results.

Random seeds (reproducibility, consistent problem generation)

It is often important to control the random seeds, e.g., in order to reproduce the experiments. Since all the tasks are run independently, controlling the random seed is also required in order to have the exact same problem data and solution in all tasks with the same data parameters and problem parameters, especially when noise is added.

This can be obtained by using the random state mechanism provided by numpy.random.RandomState. The random seed may be used inside each block (see get_problem in the yafe tutorial) or passed as a parameter to each processing block (as in scikit-learn).
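
As a sketch, the illustrative get_problem class of the previous sections could be made reproducible by passing the seed as an extra problem parameter (the seed then appears in problem_params like any other parameter):

import numpy as np

class get_problem:
    def __init__(self, snr_db, random_seed):
        self.snr_db = snr_db
        # One RandomState per problem instance: all tasks sharing the same
        # data and problem parameters draw the exact same noise.
        self.random_state = np.random.RandomState(random_seed)

    def __call__(self, signal):
        noise = self.random_state.randn(*signal.shape)
        noise *= (np.linalg.norm(signal) / np.linalg.norm(noise)
                  * 10 ** (-self.snr_db / 20))
        return {'observation': signal + noise}, {'clean_signal': signal}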

Files and folders

A number of files and subfolders are generated in a folder named after the experiment and located in the data_path set when creating the experiment. The user should not normally need to handle those files, since results are collected automatically, except when tracking down errors.

At the experiment level, a so-called schema is generated and contains all the data, problem and solver parameters (file _schema.pickle). Results from all tasks are gathered in a single file results.npz.

At the task level, a sub-directory is created for each task. It contains several files generated when the task is created (parameters of the task in task_params.pickle), when the task is run (raw data in source_data.pickle, problem data in problem_data.pickle, true solution in solution_data.pickle, solved data in solved_data.pickle, performance measures in result.pickle) and when an error occurs during any processing step (error_log).

yafe configuration file

The yafe package uses a configuration file stored in $HOME/.config/yafe.conf to specify the default data_path used to store the experiment data, as well as the default logger_path used to store the processing log.

You can create an empty configuration file by running the static method generate_config() and then edit the generated file in $HOME/.config/yafe.conf.

The resulting configuration file should have the following form:

[USER]
# user directory (data generated by experiments)
data_path = /some/directory/path

[LOGGER]
# path to the log file
path = /some/file/path/yafe.log

If you want to avoid using the yafe configuration file, you must systematically specify the data_path and logger_path parameters when creating an Experiment (alternatively, for logger_path, you can set the flag log_to_file to False).
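
For instance, reusing the illustrative blocks defined above (path values are placeholders):

from yafe.base import Experiment

experiment = Experiment(name='sinusoid_denoising',
                        get_data=get_data,
                        get_problem=get_problem,
                        get_solver=get_solver,
                        measure=measure,
                        data_path='/some/directory/path',
                        log_to_file=False)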

class yafe.base.Experiment(name, get_data, get_problem, get_solver, measure, force_reset=False, data_path=None, log_to_file=True, log_to_console=False, logger_path=None)[source]

Bases: object

Definition of a class to design experiments.

Parameters:

name : str

Name of the experiment, used to store experiment data in an eponymous subfolder.

get_data : function

A function that returns data from some data parameters.

get_problem : callable class or function

A class or function that takes some problem parameters and returns a callable object that takes input data from method get_data and returns some problem data and true solution data.

get_solver : callable class or function

A class or function that takes some solver parameters and returns a callable object that takes problem data from method get_problem and returns some solution data.

measure : function

Function that takes some task information and data returned by methods get_data, get_problem and get_solver in order to compute one or several performance measures.

force_reset : bool

Flag forcing a reset of the Experiment during initialization. If True, previously computed data stored on disk for an Experiment with the same name are destroyed. If False (default), previously computed data are reused.

data_path : None or str or pathlib.Path

Path of the folder where the experiment data subfolder must be created. If data_path is None, the data path given in the yafe configuration file is used.

log_to_file : bool

Indicates if the processing and debug information must be logged in a file.

log_to_console : bool

Indicates if the processing information must be displayed in the console.

logger_path : None or str or pathlib.Path

Full file path where the logged data must be written when log_to_file == True. If logger_path is None the logger path given in the yafe configuration file is used.

add_tasks(data_params, problem_params, solver_params)[source]

Add tasks using combinations of parameters.

Parameters:

data_params : dict

Parameters of generation of the dataset.

problem_params : dict

Parameters of generation of the problem.

solver_params : dict

Parameters of generation of the solver.

Raises:

TypeError

If parameters are not dict.

ValueError

If parameters contain unexpected keys.

Notes

  • All parameters should be dict, keyed by the name of the parameter and valued by a list.
  • Tasks are generated as the Cartesian product of the sets of parameters. If parameters are added by successive calls to add_tasks(), the resulting set of tasks is the Cartesian product of all updated sets of parameters (which is equivalent to the set of tasks obtained by calling add_tasks() only once with all the parameters).

collect_results()[source]

Collect results from an experiment.

Merge all the task results in a numpy.ndarray and save the resulting data with axes labels and values in a file.

The resulting data are stored in the data directory of the experiment in a file named results.npz. This file contains three numpy.ndarray:

  • results: the collected results,
  • axes_labels: the labels of the axes of results,
  • axes_values: the values corresponding to the indexes for all axes of results.

The dimensions of the main array results match the parameters of the tasks and the performance measures (last dimension). More precisely, results[i_0, ..., i_j, ..., i_N] is the value of the performance measure axes_values[N][i_N] for parameter values axes_values[j][i_j], j < N, related to parameter names axes_labels[j].

The stored data can be loaded using the method load_results().

Raises:

RuntimeError

If no results have been computed prior to running the method.

display_status(summary=True)[source]

Display the status of the tasks.

Parameters:

summary : bool, optional

Indicates if only a summary of the status is displayed or not.

generate_tasks()[source]

Generate the tasks from _schema.

Already existing tasks are not rewritten.

get_pending_task_ids()[source]

Get the ids of the tasks that have not been run or have not completed successfully.

Returns:

set

Set of pending task ids.

get_task_data_by_id(idt)[source]

Get data for a task specified by its id.

Parameters:

idt : int

Id of the task.

Returns:

dict

Task data.

Raises:

ValueError

If idt is not a valid value.

get_task_data_by_params(data_params, problem_params, solver_params)[source]

Get data for a task specified by its parameters.

Parameters:

data_params : dict

Parameters of generation of the dataset.

problem_params : dict

Parameters of generation of the problem.

solver_params : dict

Parameters of generation of the solver.

Returns:

dict

Task data.

Raises:

TypeError

If parameters are not dict.

ValueError

If parameters contain unexpected keys or unexpected values.

launch_experiment(task_ids=None)[source]

Launch an experiment.

Parameters:

task_ids : str or list of ints or None

List of task ids to run. If str, task ids must be separated by a comma. If None, run pending tasks only to avoid running the same tasks several times.

load_results(array_type='numpy')[source]

Load the collected results from an experiment.

Load the data stored by the method collect_results() in the file results.npz.

Parameters:

array_type : {‘numpy’, ‘xarray’}

Type of output. With the option 'numpy', the output is split into three numpy.ndarray, while with the option 'xarray' the output is a single xarray.DataArray.

Returns:

(results, axes_labels, axes_values) : tuple of numpy.ndarray

Returned when using the default option array_type='numpy'. See collect_results() for a detailed description.

xresults : xarray.DataArray

Returned when using the option array_type='xarray'. This is equivalent to the case array_type='numpy', combining all three arrays into a single data structure for improved consistency and ease of use. xresults can be obtained from (results, axes_labels, axes_values) as

xresults = xarray.DataArray(data=results,
                            dims=axes_labels,
                            coords=axes_values)
Raises:

ImportError

If the xarray package is not installed and the array_type='xarray' option is used.

ValueError

If an unrecognized value is given for array_type.

RuntimeError

If no collected results file can be found.

reset()[source]

Remove all tasks and previously computed results.

run_task_by_id(idt)[source]

Run a task given an id.

Parameters:

idt : int

Id of the task.

yafe.utils module

Utility classes and functions for yafe.

class yafe.utils.ConfigParser[source]

Bases: configparser.ConfigParser

Configuration file parser for yafe.

This class inherits from ConfigParser in the configparser module.

It enables reading the yafe configuration file $HOME/.config/yafe.conf at initialization of an experiment. It also provides a method to properly read a path in the configuration file, and a static method to generate a basic empty configuration file.

static generate_config()[source]

Generate an empty configuration file.

The generated configuration file is stored in $HOME/.config/yafe.conf.

get_path(section, option, check_exists=True)[source]

Get the path filled in a given option of a given section.

Parameters:

section : str

Name of the section.

option : str

Name of the option.

check_exists : bool, optional

Indicates if the existence of the path is checked.

Returns:

pathlib.Path or None

Path if the option is defined, None otherwise.

Raises:

IOError

If the parameter check_exists is set to True and the path does not exist.

yafe.utils.generate_oar_script(script_file_path, xp_var_name, task_ids=None, batch_size=1, oar_walltime='02:00:00', activate_env_command=None, use_gpu=False)[source]

Generate a script to launch an experiment using OAR.

Tasks are divided into batches that are executed by oar jobs.

The resulting script is written in the experiment folder, and the command to launch the jobs with OAR is displayed in the terminal.

An example script illustrating how to use yafe.utils.generate_oar_script() is available in the corresponding tutorial.

Parameters:

script_file_path : str

File path to the script that defines the experiment.

xp_var_name : str

Name of the variable containing the experiment in the script.

task_ids : list

List of task ids to run. If task_ids is None, the list of pending tasks of the experiment is used.

batch_size : int

Number of tasks run in each batch.

oar_walltime : str

Wall time for each OAR job (‘HH:MM:SS’).

activate_env_command : str or None

Optional command that must be run to activate a Python virtual environment before launching the experiment. Typically, this is a command of the form source some_virtual_env/bin/activate when using virtualenv and source activate some_conda_env when using conda. If activate_env_command is None, no virtual environment is activated.

use_gpu : bool

Flag specifying whether a GPU resource is needed when running the experiment.
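
As a sketch of a typical call (the file path and variable name are illustrative and must match your own experiment script):

from yafe.utils import generate_oar_script

generate_oar_script(script_file_path='/path/to/my_experiment.py',
                    xp_var_name='experiment',
                    batch_size=10,
                    oar_walltime='01:00:00',
                    activate_env_command='source activate some_conda_env')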

yafe.utils.get_logger(name='', to_file=True, to_console=True, logger_path=None)[source]

Return a yafe logger with the given name.

Parameters:

name : str

Name of the logger.

to_file : bool

Indicates if the logger writes processing and debug information in a file.

to_console : bool

Indicates if the logger displays processing information in the console.

logger_path : None or str or pathlib.Path

Full file path where the data must be written by the logger when to_file == True. If logger_path is None the logger path given in the yafe configuration file is used.

Returns:

logging.Logger

A logger whose name is the given name prefixed by 'yafe.'.

Notes

The name of the logger is automatically prefixed by 'yafe.'.