Improving/Editing a dataset

Once a model is trained and running in production, you will learn where it needs to perform better. You may also need to update the prompt or response schema so that it carries more information or uses a different format. You can make all of these changes by editing the dataset, which creates a new version.

Note: Editing a dataset will create a new version, which you can use to train a new model version. You have access to all prior versions as well.

The following edit types are currently available:

Samples

You can add or remove rows from your dataset based on certain conditions using the sample edits. The query builder on the edit page lets you define:

  • Whether to add or remove samples

  • How many samples to add or remove (a specific number, or all)

  • The condition used to select samples, e.g. "Add 100 samples where prompt has a minimum length of 500 tokens" or "Remove all samples where response is not JSON"
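The three parts of a sample edit (action, count, condition) can be pictured as a filter over rows. The following is a minimal local sketch, not the product's API: the row shape, the function names, the candidate `pool`, and the whitespace-based token count are all assumptions made for illustration.

```python
import json

# Hypothetical row shape: {"prompt": str, "response": str}

def is_json(text):
    """Return True if `text` parses as JSON."""
    try:
        json.loads(text)
        return True
    except (TypeError, ValueError):
        return False

def token_count(text):
    """Rough stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())

def remove_samples(rows, condition, limit=None):
    """Return a new list without the first `limit` matching rows (all if None)."""
    kept, removed = [], 0
    for row in rows:
        if condition(row) and (limit is None or removed < limit):
            removed += 1
        else:
            kept.append(row)
    return kept

def add_samples(rows, pool, condition, limit=None):
    """Return a new list with up to `limit` matching rows from `pool` appended."""
    matches = [row for row in pool if condition(row)]
    if limit is not None:
        matches = matches[:limit]
    return rows + matches

dataset_v1 = [
    {"prompt": "Summarize the report", "response": '{"summary": "ok"}'},
    {"prompt": "Translate this", "response": "not json"},
]
# "Remove all samples where response is not JSON"
dataset_v2 = remove_samples(dataset_v1, lambda r: not is_json(r["response"]))

# Hypothetical pool of candidate rows to add from.
pool = [
    {"prompt": "Explain how transformers work", "response": '{"answer": "..."}'},
    {"prompt": "Hi", "response": '{"answer": "hello"}'},
]
# "Add 1 sample where prompt has a minimum length of 3 tokens"
dataset_v3 = add_samples(dataset_v2, pool, lambda r: token_count(r["prompt"]) >= 3, limit=1)
```

Each call returns a new list and leaves its input untouched, which mirrors the idea that an edit produces a new dataset version while prior versions stay available.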

Knowledge

You can use the knowledge edit to add specific information to your dataset through keywords and key phrases. On the knowledge edit page you define the keywords and phrases; the data generation pipeline then creates data samples for those topics and appends them to your existing dataset.

Schema

You can use schema edits to make changes to the prompt or response templates and schema variables of your dataset. All prompts and responses of your existing dataset will be converted to the new schema.
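As a rough illustration of what converting rows to a new schema means, the sketch below re-renders every row from its stored variables using new prompt and response templates. The row layout (a `variables` dict per row), the `{name}` placeholder syntax, and the function name `apply_schema` are assumptions for this example, not the product's actual format.

```python
def apply_schema(rows, prompt_template, response_template):
    """Re-render each row's prompt and response from its variables.

    Templates use Python's str.format placeholders, e.g. "{question}".
    Returns new rows; the input rows are not modified.
    """
    converted = []
    for row in rows:
        variables = row["variables"]
        converted.append({
            "variables": variables,
            "prompt": prompt_template.format(**variables),
            "response": response_template.format(**variables),
        })
    return converted

# Hypothetical existing dataset version.
old_rows = [
    {
        "variables": {"question": "What is 2+2?", "answer": "4"},
        "prompt": "What is 2+2?",
        "response": "4",
    },
]

# New schema: labelled prompt and response.
new_rows = apply_schema(old_rows, "Question: {question}", "Answer: {answer}")
```

Here `new_rows[0]["prompt"]` becomes `"Question: What is 2+2?"`, while `old_rows` keeps the earlier format.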

Sources

If you are using custom data sources for your dataset, you can use the sources edit to add, remove, or refresh the sources present in the dataset.

Add source adds samples from the defined custom source to your dataset.

Remove source removes all samples that came from the defined source.

Refresh source pulls in new information from the defined sources, creates samples from it, and adds them to your dataset. This is useful if you have defined a web source whose underlying information keeps changing and you want the dataset to stay up to date with it.
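The three source operations can be sketched as functions over rows tagged with the source they came from. This is an illustrative model only: the `source` field, the `fetch` callable, and the choice to skip samples that already exist on refresh are assumptions, not documented behavior.

```python
def add_source(rows, source, fetch):
    """Append every sample that `fetch(source)` yields, tagged with its source."""
    return rows + [dict(sample, source=source) for sample in fetch(source)]

def remove_source(rows, source):
    """Drop all rows that came from `source`."""
    return [row for row in rows if row.get("source") != source]

def refresh_source(rows, source, fetch):
    """Append only samples from `source` that are not already in the dataset."""
    seen = {
        (row["prompt"], row["response"])
        for row in rows
        if row.get("source") == source
    }
    fresh = [
        dict(sample, source=source)
        for sample in fetch(source)
        if (sample["prompt"], sample["response"]) not in seen
    ]
    return rows + fresh

# Hypothetical source whose content changes over time.
site = {"docs": [{"prompt": "What is v1?", "response": "The first release."}]}

def fetch(source):
    """Stand-in for sample generation from a custom source."""
    return list(site[source])

v1 = add_source([], "docs", fetch)
site["docs"].append({"prompt": "What is v2?", "response": "The second release."})
v2 = refresh_source(v1, "docs", fetch)   # picks up only the new sample
v3 = remove_source(v2, "docs")           # drops everything from "docs"
```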
