Data loading
In plotszoo, data is organized in a plotszoo.data.DataCollection.
A plotszoo.data.DataCollection collects two data types:
- scalars: organized in a pandas DataFrame
- series: organized in a Python dict whose keys are the indices of the scalars and whose values are time series DataFrames
Classes that pull data from common services are also provided, such as plotszoo.data.WandbData.
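The layout described above can be pictured with plain pandas objects. This is only an illustrative sketch: the run ids and column names (`run-a`, `learning_rate`, `reward`, and so on) are made up, and the snippet does not use plotszoo itself.

```python
import pandas as pd

# Scalars: one row per run. The index values are hypothetical run ids.
scalars = pd.DataFrame(
    {"learning_rate": [0.1, 0.01], "final_reward": [10.0, 12.5]},
    index=["run-a", "run-b"],
)

# Series: a dict keyed by the scalars index, one time-series DataFrame per run.
series = {
    "run-a": pd.DataFrame({"reward": [1.0, 5.0, 10.0]}),
    "run-b": pd.DataFrame({"reward": [2.0, 12.5]}),
}
```

Objects shaped like these are what set_scalars and set_series are documented to accept.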
class plotszoo.data.DataCollection
    Base class for data collection.
    - Attributes:
        - scalars: DataFrame containing the scalars
        - series: dict containing the time series
align_series(to='longest', **kwargs)
    Align series to the longest or shortest one.
    Resets all the series indices, chooses the new index according to the strategy, and reindexes all the series.
    - Args:
        - to: alignment strategy (one of longest or shortest) (Default: longest)
        - **kwargs: keyword arguments for pandas reindex
    - Example:
        >>> data.align_series(to="longest", method="nearest")
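The three steps in the description (reset indices, choose a common index, reindex) could look roughly like the following in plain pandas. This is a sketch of the idea under the "longest" strategy, not plotszoo's actual implementation; `align_to_longest` is a hypothetical helper name.

```python
import pandas as pd

series = {
    "run-a": pd.DataFrame({"reward": [1.0, 2.0, 3.0, 4.0]}, index=[10, 20, 30, 40]),
    "run-b": pd.DataFrame({"reward": [5.0, 6.0]}, index=[10, 20]),
}

def align_to_longest(series, **kwargs):
    # Hypothetical helper, for illustration only.
    # 1. Reset every index to 0..n-1 so the series become comparable.
    reset = {k: s.reset_index(drop=True) for k, s in series.items()}
    # 2. Use the index of the longest series as the common index.
    new_index = max((s.index for s in reset.values()), key=len)
    # 3. Reindex every series; extra keyword arguments go to pandas reindex.
    return {k: s.reindex(new_index, **kwargs) for k, s in reset.items()}

aligned = align_to_longest(series, method="nearest")
```

With `method="nearest"`, rows added to the shorter series are filled from the closest existing row instead of being left as NaN.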
are_series_aligned()
    Returns True if all series share the same indices.
astype(columns, type=<class 'float'>)
    Change the type of the scalars columns by calling pandas astype.
    - Args:
        - columns: List of columns to convert
        - type: Type to cast the columns to (Default: float)
create_categorical(column, new_column)
    Creates a new categorical column (0, 1, 2, …) from a textual one (“sin”, “cos”, “tan”, …).
    - Args:
        - column: Column to use as input
        - new_column: Name of the new categorical column
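As a rough picture of the text-to-integer mapping, pandas `factorize` produces codes of this kind. Whether plotszoo numbers the categories in order of first appearance (as `factorize` does) or in some other order is not stated here, so treat the ordering below as an assumption; the column names are made up.

```python
import pandas as pd

scalars = pd.DataFrame({"activation": ["sin", "cos", "tan", "sin"]})

# pd.factorize assigns 0, 1, 2, ... in order of first appearance.
codes, uniques = pd.factorize(scalars["activation"])
scalars["activation_id"] = codes
```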
create_scalar_from_series(scalar_name, agg_fn)
    Create a new column of scalars using the corresponding time series.
    - Args:
        - scalar_name: The name of the new scalar
        - agg_fn: Function called on the corresponding time series to create the scalar
    - Example:
        >>> data.create_scalar_from_series("start_time", lambda s: s["timestamp"].min())
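Conceptually, the aggregation function is applied once per run and the results become a new scalars column. A minimal pandas sketch of that idea, with made-up run ids and timestamps and without using plotszoo:

```python
import pandas as pd

scalars = pd.DataFrame({"lr": [0.1, 0.01]}, index=["run-a", "run-b"])
series = {
    "run-a": pd.DataFrame({"timestamp": [30, 10, 20]}),
    "run-b": pd.DataFrame({"timestamp": [7, 5]}),
}

def agg_fn(s):
    # Same aggregation as the documented example: earliest timestamp.
    return s["timestamp"].min()

# Apply agg_fn to the series matching each scalars index.
scalars["start_time"] = [agg_fn(series[idx]) for idx in scalars.index]
```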
dropna(columns)
    Discard NaN values from the scalars and also discard the corresponding time series.
    - Args:
        - columns: Columns to check for NaN
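The coupling between scalars and series is the key point: dropping a scalars row also drops its series. A sketch of that behaviour in plain pandas, assuming made-up data and not reflecting plotszoo's internals:

```python
import pandas as pd

scalars = pd.DataFrame(
    {"score": [1.0, float("nan"), 3.0]}, index=["a", "b", "c"]
)
series = {k: pd.DataFrame({"reward": [0.0, 1.0]}) for k in scalars.index}

# Drop scalars rows with NaN in the checked columns...
scalars = scalars.dropna(subset=["score"])
# ...then keep only the series whose key survived.
series = {k: v for k, v in series.items() if k in scalars.index}
```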
dropna_series(columns)
    Drop all the rows where a column is NaN in the series. Will probably unalign the series.
    - Args:
        - columns: Columns to check for NaN
fillna(column, value=0)
    Substitute NaN values in the scalars.
    - Args:
        - column: Column to check for NaN
        - value: Value to use (Default: 0)
fillna_series(column, value=0)
    Substitute NaN values in the series with a new value.
    - Args:
        - column: Series column to fill
        - value: Value to use (Default: 0)
is_both()
    Returns True if the DataCollection contains both scalars and time series.

is_empty()
    Returns True if the DataCollection is empty.

is_scalars()
    Returns True if the DataCollection contains scalars.

is_series()
    Returns True if the DataCollection contains time series.
rolling_series(column, new_column, fn='mean', **kwargs)
    Apply a pandas rolling function to all the series.
    - Args:
        - column: Series column to apply the rolling function to
        - new_column: Series column in which to store the rolling function results
        - fn: pandas rolling function; for example, "mean" means series[column].rolling().mean()
        - **kwargs: Keyword arguments for the pandas rolling function
    - Example:
        >>> data.rolling_series("reward", "mean_reward", window=20, fn="mean")
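The description of fn suggests the method name is looked up on the pandas rolling object. One way to sketch that in plain pandas is shown below; `rolling_sketch` is a hypothetical helper, and passing the keyword arguments to `rolling()` is an assumption about how they are used.

```python
import pandas as pd

series = {"run-a": pd.DataFrame({"reward": [1.0, 2.0, 3.0, 4.0]})}

def rolling_sketch(series, column, new_column, fn="mean", **kwargs):
    # Hypothetical helper, for illustration only.
    for s in series.values():
        window = s[column].rolling(**kwargs)
        # Look up the aggregation by name, e.g. fn="mean" -> window.mean()
        s[new_column] = getattr(window, fn)()
    return series

rolling_sketch(series, "reward", "mean_reward", window=2, fn="mean")
```

The first `window - 1` rows of the new column are NaN unless `min_periods` is passed, which is standard pandas rolling behaviour.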
set_scalars(data)
    Set the scalars.
    - Args:
        - data: The DataFrame containing the scalars
set_series(series)
    Set the series.
    - Args:
        - series: The dict containing the time series
    series must be set after scalars, and the series dict must have a key for each index of the scalars.
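Before calling set_series, it can help to check the key requirement yourself. The helper below is a hypothetical pre-check written in plain pandas; it is not part of plotszoo, and how plotszoo reacts to a missing key is not documented here.

```python
import pandas as pd

def missing_series_keys(scalars, series):
    # Hypothetical pre-check: scalars indices that have no series entry.
    return [idx for idx in scalars.index if idx not in series]

scalars = pd.DataFrame({"lr": [0.1, 0.01]}, index=["run-a", "run-b"])
series = {"run-a": pd.DataFrame({"reward": [1.0, 2.0]})}

missing = missing_series_keys(scalars, series)
```

An empty result means every scalars index has a matching series.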
class plotszoo.data.WandbData(entity, project, query, cache=True, cache_dir='./.plotszoo-wandb-cache', verbose=True)
    Retrieve scalars and time series from wandb.
    - Args:
        - entity: wandb entity (username or team name)
        - project: wandb project
        - query: MongoDB query for wandb (check here)
        - cache: Cache retrieved data (Default: True)
        - cache_dir: Directory to cache the data to (Default: ./.plotszoo-wandb-cache)
        - verbose: Be verbose about pulling and caching (Default: True)
pull_scalars(state='finished', force_update=False)
    Pull scalars from wandb.
    - Args:
        - state: Filter the runs using their state, None to disable (Default: "finished")
        - force_update: Force cache update (Default: False)
pull_series(scan_history=True, force_update=False)
    Pull series from wandb.
    - Args:
        - scan_history: Use wandb.Api.run.scan_history to pull the full history (Default: True)
        - force_update: Force cache update (Default: False)
class plotszoo.data.OptunaData(storage, study_name, cache=True, cache_dir='./.plotszoo-optuna-cache', verbose=True)
    Retrieve scalars and time series from an optuna storage.
    - Args:
        - storage: optuna storage (example: sqlite:///example.db)
        - study_name: optuna study name
        - cache: Cache retrieved data (Default: True)
        - cache_dir: Directory to cache the data to (Default: ./.plotszoo-optuna-cache)
        - verbose: Be verbose about pulling and caching (Default: True)
pull_scalars(force_update=False)
    Pull scalars from the optuna storage.
    - Args:
        - force_update: Force cache update (Default: False)