Data loading¶

In plotszoo, data is organized in a plotszoo.data.DataCollection.

A plotszoo.data.DataCollection collects two data types:

scalars: organized in a pandas DataFrame
series: organized in a Python dict whose keys are the indices of the scalars and whose values are time series DataFrames
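As a rough illustration of this layout, the sketch below builds the two structures with plain pandas. It does not call plotszoo, and the run names and column names are made up for the example.

```python
import pandas as pd

# Scalars: one row per run, indexed by a run identifier (hypothetical names).
scalars = pd.DataFrame(
    {"learning_rate": [0.1, 0.01], "final_reward": [12.5, 30.0]},
    index=["run-a", "run-b"],
)

# Series: one time series DataFrame for each index of the scalars.
series = {
    "run-a": pd.DataFrame({"step": [0, 1, 2], "reward": [1.0, 6.0, 12.5]}),
    "run-b": pd.DataFrame({"step": [0, 1], "reward": [4.0, 30.0]}),
}

# Collect any scalars index that has no matching key in the series dict.
missing = [idx for idx in scalars.index if idx not in series]
```

An empty `missing` list means the dict is keyed consistently with the scalars.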

Classes that pull data from common services, such as plotszoo.data.WandbData, are also provided.

class plotszoo.data.DataCollection¶

Base class for data collection

Attributes:

scalars: DataFrame containing the scalars
series: dict containing the time series

align_series(to='longest', **kwargs)¶

Align series to the longest or shortest one.

Resets all the series indices, chooses the new index according to the strategy, and reindexes all the series.

Args:

to: alignment strategy (one of longest or shortest) (Default: longest)
**kwargs: keyword arguments for pandas reindex

Example:

>>> data.align_series(to="longest", method="nearest")
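The three steps described above (reset, choose an index, reindex) can be sketched with plain pandas. This is only an approximation of the documented behavior, not plotszoo's implementation; `align_to` is a hypothetical helper, and the extra keyword arguments are forwarded to `DataFrame.reindex`.

```python
import pandas as pd

def align_to(series, to="longest", **kwargs):
    """Hypothetical helper mirroring the documented steps of align_series."""
    # 1. Reset every series index to 0..n-1.
    reset = {k: df.reset_index(drop=True) for k, df in series.items()}
    # 2. Choose the new index: the longest or the shortest one.
    pick = max if to == "longest" else min
    new_index = pick((df.index for df in reset.values()), key=len)
    # 3. Reindex every series to that index (kwargs go to pandas reindex).
    return {k: df.reindex(new_index, **kwargs) for k, df in reset.items()}

series = {
    "run-a": pd.DataFrame({"reward": [1.0, 2.0, 3.0, 4.0]}),
    "run-b": pd.DataFrame({"reward": [10.0, 20.0]}),
}

aligned = align_to(series, to="longest", method="nearest")
truncated = align_to(series, to="shortest")
```

With `method="nearest"`, the shorter series is padded by repeating its closest existing row; without a fill method, the added rows would contain NaN.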

are_series_aligned()¶: Returns True if all series share the same indices

astype(columns, type=<class 'float'>)¶

Change the type of the scalars columns by calling pandas astype

Args:

columns: List of columns to convert
type: Type to cast the column to (Default: float)
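For reference, the underlying pandas call looks roughly like the sketch below. The DataFrame and column names are invented, and the code uses pandas directly rather than plotszoo.

```python
import pandas as pd

# Scalars whose values were loaded as strings (hypothetical data).
scalars = pd.DataFrame(
    {"lr": ["0.1", "0.01"], "layers": ["2", "4"], "name": ["a", "b"]},
    index=["run-a", "run-b"],
)

columns = ["lr", "layers"]
# Cast only the listed columns, leaving the others untouched.
scalars = scalars.astype({column: float for column in columns})
```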

create_categorical(column, new_column)¶

Creates a new categorical column (0, 1, 2, …) from a textual one (“sin”, “cos”, “tan”, …)

Args:

column: Column to use as input
new_column: Name of the new categorical column
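One way to obtain such integer codes in pandas is `pd.factorize`, shown below as a sketch. Whether plotszoo numbers the categories in order of appearance (as `factorize` does) or in some other order is an assumption here; the column names are made up.

```python
import pandas as pd

scalars = pd.DataFrame({"activation": ["sin", "cos", "sin", "tan"]})

# factorize assigns 0, 1, 2, ... in order of first appearance.
codes, labels = pd.factorize(scalars["activation"])
scalars["activation_id"] = codes
```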

create_scalar_from_series(scalar_name, agg_fn)¶

Create a new column of scalars using the corresponding time series

Args:

scalar_name: The name of the new scalar
agg_fn: Function called on the corresponding time series to create the scalar

Example:

>>> data.create_scalar_from_series("start_time", lambda s: s["timestamp"].min())
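Conceptually, the aggregation function is applied once per run and the results become a new scalars column. A minimal pandas sketch of that idea, with invented data and without claiming to match plotszoo's internals:

```python
import pandas as pd

scalars = pd.DataFrame({"lr": [0.1, 0.01]}, index=["run-a", "run-b"])
series = {
    "run-a": pd.DataFrame({"timestamp": [105, 110, 120]}),
    "run-b": pd.DataFrame({"timestamp": [230, 200, 260]}),
}

# Same aggregation as in the example above.
agg_fn = lambda s: s["timestamp"].min()

# Apply agg_fn to the series of each scalars index and store the result.
scalars["start_time"] = pd.Series(
    {idx: agg_fn(series[idx]) for idx in scalars.index}
)
```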

dropna(columns)¶

Discard rows containing NaN values from the scalars, and also discard the corresponding time series

Args:

columns: Columns to check for NaN
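The effect on both structures can be sketched in plain pandas as follows. The data is hypothetical, and the code only approximates the documented behavior.

```python
import numpy as np
import pandas as pd

scalars = pd.DataFrame(
    {"loss": [0.3, np.nan, 0.1]}, index=["run-a", "run-b", "run-c"]
)
series = {name: pd.DataFrame({"step": [0, 1]}) for name in scalars.index}

# Drop scalar rows with NaN in the checked columns...
scalars = scalars.dropna(subset=["loss"])
# ...and keep only the time series whose run survived.
series = {name: df for name, df in series.items() if name in scalars.index}
```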

dropna_series(columns)¶

Drop all the rows of the series where a column is NaN. This will probably unalign the series.

Args:

columns: Columns to check for NaN
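The sketch below shows, with plain pandas and made-up data, why dropping rows tends to leave the series unaligned: each series loses different rows, so their indices no longer match.

```python
import numpy as np
import pandas as pd

series = {
    "run-a": pd.DataFrame({"reward": [1.0, np.nan, 3.0]}),
    "run-b": pd.DataFrame({"reward": [4.0, 5.0, 6.0]}),
}

# Drop the rows where the checked column is NaN, one series at a time.
cleaned = {name: df.dropna(subset=["reward"]) for name, df in series.items()}

# run-a lost its middle row, so the two indices now differ.
still_aligned = cleaned["run-a"].index.equals(cleaned["run-b"].index)
```

Calling an alignment step afterwards restores a shared index if later plots require it.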

fillna(column, value=0)¶

Substitute NaN values in the scalars

Args:

column: Column to check for NaN
value: Value to use (Default: 0)

fillna_series(column, value=0)¶

Substitute NaN values in the series with a new one

Args:

column: Series column to fill
value: Value to use (Default: 0)

is_both()¶: Returns True if the DataCollection contains both scalars and time series

is_empty()¶: Returns True if the DataCollection is empty

is_scalars()¶: Returns True if the DataCollection contains scalars

is_series()¶: Returns True if the DataCollection contains time series

rolling_series(column, new_column, fn='mean', **kwargs)¶

Apply pandas rolling function to all the series

Args:

column: Series column to apply the rolling to
new_column: Series column in which to store the rolling function results
fn: pandas rolling function; for example, "mean" means series[column].rolling().mean()
**kwargs: Keyword arguments for the pandas rolling function

Example:

>>> data.rolling_series("reward", "mean_reward", window=20, fn="mean")
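Per series, this corresponds roughly to the pandas calls sketched below. `rolling_column` is a hypothetical helper written for illustration; it resolves `fn` by name on the pandas rolling object, which is an assumption about how the string is interpreted.

```python
import pandas as pd

def rolling_column(df, column, new_column, fn="mean", **kwargs):
    """Hypothetical helper: apply a pandas rolling function to one column."""
    out = df.copy()
    # e.g. fn="mean" resolves to out[column].rolling(**kwargs).mean()
    out[new_column] = getattr(out[column].rolling(**kwargs), fn)()
    return out

df = pd.DataFrame({"reward": [1.0, 2.0, 3.0, 4.0]})
smoothed = rolling_column(df, "reward", "mean_reward", window=2, fn="mean")
```

The first `window - 1` rows of the new column are NaN unless `min_periods` is passed among the keyword arguments.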

set_scalars(data)¶

Set the scalars

Args:

data: The DataFrame containing the scalars

set_series(series)¶

Set the series

Args:

series: The dict containing the time series

The series must be set after the scalars.

The series dict must have a key for each index of the scalars.

class plotszoo.data.WandbData(entity, project, query, cache=True, cache_dir='./.plotszoo-wandb-cache', verbose=True)¶

Retrieve scalars and time series from wandb.

Args:

entity: wandb entity (username or team name)
project: wandb project
query: MongoDB query for wandb (check here.)
cache: Cache retrieved data (Default: True)
cache_dir: Directory to cache the data to (Default: ./.plotszoo-wandb-cache)
verbose: Be verbose about pulling and caching (Default: True)

pull_scalars(state='finished', force_update=False)¶

Pull scalars from wandb

Args:

state: Filter the runs using their state, None to disable (Default: “finished”)
force_update: Force cache update (Default: False)

pull_series(scan_history=True, force_update=False)¶

Pull series from wandb

Args:

scan_history: Use wandb.Api.run.scan_history to pull the full history (Default: True)
force_update: Force cache update (Default: False)

class plotszoo.data.OptunaData(storage, study_name, cache=True, cache_dir='./.plotszoo-optuna-cache', verbose=True)¶

Retrieve scalars and time series from an optuna storage.

Args:

storage: optuna storage (example: sqlite:///example.db)
study_name: optuna study name
cache: Cache retrieved data (Default: True)
cache_dir: Directory to cache the data to (Default: ./.plotszoo-optuna-cache)
verbose: Be verbose about pulling and caching (Default: True)

pull_scalars(force_update=False)¶

Pull scalars from the optuna storage

Args:

force_update: Force cache update (Default: False)