Data loading

In plotszoo, data is organized in a plotszoo.data.DataCollection.

A plotszoo.data.DataCollection collects two data types:

  • scalars: organized in a pandas DataFrame

  • series: organized in a Python dict whose keys are the indices of the scalars and whose values are time series DataFrames

Classes to pull data from common services are also provided, such as plotszoo.data.WandbData.
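As a rough illustration of the layout described above, the two containers can be built with plain pandas. The run identifiers, column names, and values below are made up for the example; in plotszoo such objects would be handed to a DataCollection through set_scalars and set_series.

```python
import pandas as pd

# Scalars: one row per run, indexed by a run identifier (hypothetical ids and columns).
scalars = pd.DataFrame(
    {"learning_rate": [0.1, 0.01], "activation": ["sin", "cos"]},
    index=["run-a", "run-b"],
)

# Series: one time series DataFrame per index of the scalars.
series = {
    "run-a": pd.DataFrame({"step": [0, 1, 2], "reward": [0.0, 0.5, 1.0]}),
    "run-b": pd.DataFrame({"step": [0, 1], "reward": [0.1, 0.3]}),
}
```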

class plotszoo.data.DataCollection

Base class for data collection

Attributes:
scalars

DataFrame containing the scalars

series

dict containing the time series

align_series(to='longest', **kwargs)

Align series to the longest or shortest one.

Resets all the series indices, chooses the new index according to the strategy, and reindexes all the series.

Args:
to

alignment strategy (one of longest or shortest) (Default: longest)

**kwargs

keyword arguments for pandas reindex

Example:

>>> data.align_series(to="longest", method="nearest")

are_series_aligned()

Returns True if all series share the same indices
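The alignment steps described for align_series, together with the check made by are_series_aligned, can be sketched with plain pandas. This is an approximation of the documented behaviour for the "longest" strategy, not plotszoo's implementation; the helper names align_to_longest and share_same_index are hypothetical.

```python
import pandas as pd

def align_to_longest(series, **kwargs):
    """Reindex every DataFrame in `series` to the index of the longest one."""
    # Reset all indices so that they become 0, 1, 2, ...
    reset = {key: df.reset_index(drop=True) for key, df in series.items()}
    # Use the index of the longest series as the common index.
    target = max((df.index for df in reset.values()), key=len)
    # Extra keyword arguments (e.g. method="nearest") are passed to pandas reindex.
    return {key: df.reindex(index=target, **kwargs) for key, df in reset.items()}

def share_same_index(series):
    """Return True if all the DataFrames in `series` have identical indices."""
    indices = [df.index for df in series.values()]
    return all(index.equals(indices[0]) for index in indices)

series = {
    "run-a": pd.DataFrame({"reward": [0.0, 0.5, 1.0]}),
    "run-b": pd.DataFrame({"reward": [0.1, 0.3]}),
}
aligned = align_to_longest(series, method="nearest")
```

With method="nearest", the missing last row of the shorter series is filled from its nearest existing row; without a fill method, pandas reindex leaves NaN there.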

astype(columns, type=<class 'float'>)

Change the type of the scalars columns by calling pandas astype

Args:
columns

List of columns to convert

type

Type to cast the column to (Default: float)
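The underlying pandas operation looks roughly like the sketch below, assuming scalars whose numeric values were loaded as strings. The column names are hypothetical.

```python
import pandas as pd

# Hypothetical scalars where numeric values were loaded as strings.
scalars = pd.DataFrame({"learning_rate": ["0.1", "0.01"], "name": ["a", "b"]})

columns = ["learning_rate"]
# Cast only the selected columns; the other columns keep their type.
scalars = scalars.astype({column: float for column in columns})
```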

create_categorical(column, new_column)

Creates a new categorical column (0, 1, 2, …) from a textual one (“sin”, “cos”, “tan”, …)

Args:
column

Column to use as input

new_column

Name of the new categorical column
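One way to obtain such a mapping in pandas is pd.factorize, shown below as a sketch. Whether plotszoo numbers the categories in order of appearance or in another order is an assumption here; the column names are hypothetical.

```python
import pandas as pd

scalars = pd.DataFrame({"activation": ["sin", "cos", "tan", "sin"]})

# pd.factorize assigns 0, 1, 2, ... in order of first appearance.
codes, labels = pd.factorize(scalars["activation"])
scalars["activation_id"] = codes
```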

create_scalar_from_series(scalar_name, agg_fn)

Create a new column of scalars using the corresponding time series

Args:
scalar_name

The name of the new scalar

agg_fn

Function called on the corresponding time series to create the scalar
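The aggregation can be pictured as calling agg_fn once per run and storing the results in a new scalars column. The following is a sketch with plain pandas under that assumption; scalar_from_series is a hypothetical helper, not part of plotszoo.

```python
import pandas as pd

scalars = pd.DataFrame({"learning_rate": [0.1, 0.01]}, index=["run-a", "run-b"])
series = {
    "run-a": pd.DataFrame({"timestamp": [10, 11, 12]}),
    "run-b": pd.DataFrame({"timestamp": [20, 21]}),
}

def scalar_from_series(scalars, series, scalar_name, agg_fn):
    """Return a copy of scalars with a column computed from each run's series."""
    result = scalars.copy()
    # agg_fn receives the time series DataFrame of the run and returns one value.
    result[scalar_name] = [agg_fn(series[key]) for key in result.index]
    return result

with_start = scalar_from_series(
    scalars, series, "start_time", lambda s: s["timestamp"].min()
)
```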

Example:

>>> data.create_scalar_from_series("start_time", lambda s: s["timestamp"].min())

dropna(columns)

Discard NaN values from the scalars and also discard the corresponding time series

Args:
columns

Columns to check for NaN
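The effect on both containers can be sketched with plain pandas as follows. The data is made up, and this only approximates the documented behaviour.

```python
import pandas as pd

scalars = pd.DataFrame({"score": [1.0, float("nan")]}, index=["run-a", "run-b"])
series = {
    "run-a": pd.DataFrame({"reward": [0.0, 1.0]}),
    "run-b": pd.DataFrame({"reward": [0.5, 0.7]}),
}

# Drop the scalars rows that have NaN in the checked columns...
kept_scalars = scalars.dropna(subset=["score"])
# ...and keep only the time series whose key survived.
kept_series = {key: df for key, df in series.items() if key in kept_scalars.index}
```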

dropna_series(columns)

Drop all the rows where a column is NaN in the series. This will probably unalign the series.

Args:
columns

Columns to check for NaN
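The sketch below, using plain pandas on made-up data, shows why dropping rows can leave the series unaligned: each series loses different rows, so the indices stop matching.

```python
import pandas as pd

series = {
    "run-a": pd.DataFrame({"reward": [0.0, float("nan"), 1.0]}),
    "run-b": pd.DataFrame({"reward": [0.1, 0.2, 0.3]}),
}

# Drop, in every series, the rows where the checked column is NaN.
cleaned = {key: df.dropna(subset=["reward"]) for key, df in series.items()}

# The indices no longer match, so the series are not aligned any more.
still_aligned = cleaned["run-a"].index.equals(cleaned["run-b"].index)
```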

fillna(column, value=0)

Substitute NaN values in the scalars with a new one

Args:
column

Column to check for NaN

value

Value to use (Default: 0)

fillna_series(column, value=0)

Substitute NaN values in the series with a new one

Args:
column

Series column to fill

value

Value to use (Default: 0)
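Both fill operations map onto pandas fillna restricted to one column. A minimal sketch on made-up data, assuming the fill applies to every series in the dict:

```python
import pandas as pd

scalars = pd.DataFrame({"score": [1.0, float("nan")]}, index=["run-a", "run-b"])
series = {
    "run-a": pd.DataFrame({"reward": [0.0, float("nan")]}),
    "run-b": pd.DataFrame({"reward": [float("nan"), 0.7]}),
}

# Passing a dict to fillna limits the substitution to the named column.
filled_scalars = scalars.fillna({"score": 0})
filled_series = {key: df.fillna({"reward": 0}) for key, df in series.items()}
```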

is_both()

Returns True if the DataCollection contains both scalars and time series

is_empty()

Returns True if the DataCollection is empty

is_scalars()

Returns True if the DataCollection contains scalars

is_series()

Returns True if the DataCollection contains time series

rolling_series(column, new_column, fn='mean', **kwargs)

Apply pandas rolling function to all the series

Args:
column

Series column to apply the rolling to

new_column

Series column in which to store the rolling function results

fn

pandas rolling function; for example, "mean" means series[column].rolling().mean()

**kwargs

Keyword arguments for the pandas rolling function
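How fn and **kwargs combine can be sketched for a single series with plain pandas. The helper rolling_column is hypothetical and looks up the rolling method by name, which is an assumption about how the string is resolved.

```python
import pandas as pd

def rolling_column(df, column, new_column, fn="mean", **kwargs):
    """Return a copy of df with new_column = df[column].rolling(**kwargs).<fn>()."""
    result = df.copy()
    # Look up the rolling method by name, e.g. "mean" -> Rolling.mean.
    result[new_column] = getattr(result[column].rolling(**kwargs), fn)()
    return result

df = pd.DataFrame({"reward": [1.0, 2.0, 3.0, 4.0]})
smoothed = rolling_column(df, "reward", "mean_reward", fn="mean", window=2)
```

The first window - 1 rows of the new column are NaN unless min_periods is passed among the keyword arguments.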

Example:

>>> data.rolling_series("reward", "mean_reward", window=20, fn="mean")

set_scalars(data)

Set the scalars

Args:
data

The DataFrame containing the scalars

set_series(series)

Set the series

Args:
series

The dict containing the time series

The series must be set after the scalars.

The series dict must have a key for each index of the scalars.
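Before calling set_series, the key requirement can be checked with a few lines of pandas. The helper missing_series_keys below is hypothetical, and how plotszoo itself reacts to a missing key is not specified here.

```python
import pandas as pd

def missing_series_keys(scalars, series):
    """Return the scalars indices that have no entry in the series dict."""
    return [key for key in scalars.index if key not in series]

scalars = pd.DataFrame({"learning_rate": [0.1, 0.01]}, index=["run-a", "run-b"])
series = {"run-a": pd.DataFrame({"reward": [0.0, 1.0]})}

# "run-b" is in the scalars but has no time series yet.
missing = missing_series_keys(scalars, series)
```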

class plotszoo.data.WandbData(entity, project, query, cache=True, cache_dir='./.plotszoo-wandb-cache', verbose=True)

Retrieve scalars and time series from wandb.

Args:
entity

wandb entity (username or team name)

project

wandb project

query

MongoDB query for wandb (see the wandb documentation for the query syntax)

cache

Cache retrieved data (Default: True)

cache_dir

Directory to cache the data to (Default: ./.plotszoo-wandb-cache)

verbose

Be verbose about pulling and caching (Default: True)

pull_scalars(state='finished', force_update=False)

Pull scalars from wandb

Args:
state

Filter the runs using their state, None to disable (Default: “finished”)

force_update

Force cache update (Default: False)

pull_series(scan_history=True, force_update=False)

Pull series from wandb

Args:
scan_history

Use wandb.Api.run.scan_history to pull the full history (Default: True)

force_update

Force cache update (Default: False)
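A possible way to put the pieces together is sketched below. The query uses MongoDB-style operators on run config values; the config keys are hypothetical. The helper load_wandb_runs is also hypothetical and is only defined here, not called, since running it requires plotszoo, wandb credentials, and network access.

```python
# MongoDB-style query selecting runs by config values (hypothetical keys).
query = {
    "$and": [
        {"config.optimizer": "adam"},
        {"config.learning_rate": {"$lt": 0.01}},
    ]
}

def load_wandb_runs(entity, project, query):
    """Pull scalars and series for the runs matching the query."""
    import plotszoo  # imported here so the query can be built without it

    data = plotszoo.data.WandbData(entity, project, query)
    data.pull_scalars()  # state="finished" by default
    data.pull_series()   # scan_history=True by default
    return data
```

Passing force_update=True to either pull method refreshes the cached data instead of reading it from cache_dir.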

class plotszoo.data.OptunaData(storage, study_name, cache=True, cache_dir='./.plotszoo-optuna-cache', verbose=True)

Retrieve scalars and time series from an optuna storage.

Args:
storage

optuna storage (example: sqlite:///example.db)

study_name

optuna study name

cache

Cache retrieved data (Default: True)

cache_dir

Directory to cache the data to (Default: ./.plotszoo-optuna-cache)

verbose

Be verbose about pulling and caching (Default: True)

pull_scalars(force_update=False)

Pull scalars from the optuna storage

Args:
force_update

Force cache update (Default: False)