dataset module

The dataset module provides an abstraction for sets of data, primarily aimed at use in machine learning (ML).

class dataset.DataSet(logger)

Bases: object

Represents a set of ML data, with methods to ingest, manipulate (i.e. preprocess) and extract it


Passed a list of columns; removes those columns from the dataset


Display data

duplicate_column(current_column_name, new_column_name)

Passed the name of a current column; copies that column to a new column with the given new column name


Return data in native format

in_partition(partition_name, row_number)

Passed a partition name and row number; after consulting internal partition settings, returns 1 if the given row belongs to the partition, otherwise 0
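The modulo-based membership test described above and under partition below can be sketched as a standalone function (the function takes the partition list explicitly here; the real method consults internal partition settings instead):

```python
def in_partition(partitions, partition_name, row_number):
    """Return 1 if the row falls in the named partition under
    modulo division by the length of the partition list, else 0.

    Illustrative sketch: row 0 maps to partitions[0], row 1 to
    partitions[1], and so on, wrapping around.
    """
    return 1 if partitions[row_number % len(partitions)] == partition_name else 0
```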


Load CSV data from file into the class as a list of dictionaries of rows. Requires the first row in the file to be a header row and uses these values as keys in the row dictionaries. Example row: {'dataset': 'ML', 'min_interpacket_interval': '0.001'}
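The ingest behaviour described above can be sketched with the standard-library csv module (the function name and text-input form are illustrative; the real method reads from a file):

```python
import csv
import io

def ingest_csv(text):
    """Parse CSV text into a list of row dictionaries, keyed by the
    header row. Values are kept as strings, matching the example row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]
```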


Return input data as a numpy array. Filters out output column(s) and only includes rows from the specified partition, which defaults to 'A'

one_hot_encode(column_name, keys)

Take an existing column and use it to build new columns that are each one hot encoded for one of the specified keys.

Supplied with the column_name string and a list of the specific key names for which to build new columns.
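The one-hot encoding described above can be sketched over a list-of-dictionaries representation (the naming of the new columns after the keys themselves is an assumption, not confirmed by the docstring):

```python
def one_hot_encode(rows, column_name, keys):
    """For each key, add a new column set to 1 where the source
    column matches that key and 0 otherwise (illustrative sketch;
    new columns are assumed to be named after the keys)."""
    for row in rows:
        for key in keys:
            row[key] = 1 if row[column_name] == key else 0
    return rows
```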


Return output data as a numpy array. Filters out input columns


Set partition parameters for splitting the dataset into arbitrary partitions, which are named by strings. Note that partitioning is applied when data is retrieved, not to the internal dataset

Passed a list of partition names which are used to divide the dataset based on modulo division by the length of the list.

Setting partitions overwrites any previously set partition configuration

Default partition is partitions=['A'] (i.e. all data in partition 'A')

Standard convention for usage of partitions is:

* Partition 'Training' is used as training data
* Partition 'Validation' is used as validation (test) data

Example: Randomise row order, then allocate 75% of rows to partition 'Training' with the last 25% in partition 'Validation':

dataset.shuffle()
dataset.partition(partitions=['Training', 'Training', 'Training', 'Validation'])

Return the number of sets in the partition

rescale(column_name, min_x, max_x)

Rescale all values in a column so that they sit between 0 and 1. Uses the rescaling formula: x' = (x - min(x)) / (max(x) - min(x))
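The rescaling formula above can be sketched as follows (the real method operates on the dataset's internal state; here the rows are passed explicitly for illustration):

```python
def rescale(rows, column_name, min_x, max_x):
    """Apply x' = (x - min_x) / (max_x - min_x) to every value in a
    column, so values between min_x and max_x map into [0, 1]."""
    for row in rows:
        x = float(row[column_name])
        row[column_name] = (x - min_x) / (max_x - min_x)
    return rows
```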


Set the name for the dataset


Set which columns are used as output data from the dataset (i.e. which columns contain the expected answer(s)). Pass it a list of output column names


Shuffle dataset rows. Set seed=1 if you want predictable randomness for reproducible shuffling


Passed policy transforms; runs them against the dataset.

translate(column_name, value_mapping)

Go through all values in a column, replacing any occurrences of a key in the value_mapping dictionary with the corresponding value
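The value substitution described above can be sketched as follows (rows are passed explicitly for illustration; values not present as keys in the mapping are left unchanged):

```python
def translate(rows, column_name, value_mapping):
    """Replace column values that appear as keys in value_mapping
    with their mapped values; other values are left unchanged."""
    for row in rows:
        if row[column_name] in value_mapping:
            row[column_name] = value_mapping[row[column_name]]
    return rows
```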


Passed a list of fields (columns) to retain; trims the internal representation of the training data to just those columns

trim_to_rows(key, fields)

Passed a key (column name) and a list of fields (column values); matches rows that should be retained and removes the other rows
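The row filtering described above can be sketched as follows (rows are passed explicitly for illustration; the real method trims the dataset's internal representation in place):

```python
def trim_to_rows(rows, key, fields):
    """Keep only rows whose value for the given key (column name)
    is in the fields list of column values."""
    return [row for row in rows if row[key] in fields]
```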