A dataset is a collection of samples, each an (input, output) pair, used to train a model to perform a specific task. A dataset can have any number of versions, each with its own model.

You should make new datasets for the following reasons:

  • The problem the model should solve is different from the problem addressed by any existing dataset.

  • The model will be evaluated differently (that is, the way you submit feedback changes).

    • e.g. previously you sent only boolean feedback (0 or 1) and now you need to send graded feedback (0.1, 0.4, 0.6, etc.).

Datasets are designed to help you iterate on a single problem with a series of ever-improving models. You should not create datasets that mix different problems, even if the problems are very similar in type.

  • e.g. a tweet classifier and a report classifier are both classifiers, but they should be in separate datasets.
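
The rules above can be sketched as a small data model. This is only an illustration, not the product's API: the `Dataset` and `Version` classes, the `needs_new_dataset` helper, and the `"boolean"` / `"graded"` feedback labels are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Version:
    """One iteration of a dataset, paired with the model trained on it."""
    name: str
    model_id: str  # hypothetical identifier for the version's model


@dataclass
class Dataset:
    """Samples for a single problem, plus the versions built from them."""
    problem: str
    feedback_type: str  # hypothetical labels: "boolean" or "graded"
    samples: list[tuple[str, str]] = field(default_factory=list)
    versions: list[Version] = field(default_factory=list)


def needs_new_dataset(existing: Dataset, problem: str, feedback_type: str) -> bool:
    """Return True when the problem or the feedback scheme differs."""
    return existing.problem != problem or existing.feedback_type != feedback_type


# One dataset, one problem, several versions (each with its own model).
tweets = Dataset(problem="tweet classification", feedback_type="boolean")
tweets.samples.append(("great launch today!", "positive"))
tweets.versions.append(Version(name="v1", model_id="model-a"))
tweets.versions.append(Version(name="v2", model_id="model-b"))

# Same problem and same feedback scheme: keep iterating with new versions.
same_problem = needs_new_dataset(tweets, "tweet classification", "boolean")

# A similar but different problem: use a separate dataset.
report_problem = needs_new_dataset(tweets, "report classification", "boolean")

# Same problem, but feedback moves from 0/1 to graded values: new dataset.
graded_feedback = needs_new_dataset(tweets, "tweet classification", "graded")
```

In this sketch, improving the tweet classifier adds versions to the existing dataset, while the report classifier and the switch to graded feedback each call for a dataset of their own.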
