Designing a dataset

Building a well-designed dataset is the first and most important step towards creating a model that reliably and accurately produces the results you want.

This guide covers:

  • How to create a dataset

  • An overview of how to use the Dataset Designer

  • How to effectively iterate with previews

Creating a dataset

From your personal or team page, navigate to the Datasets page on the left-hand nav, then click the "New dataset" button on the top-left side of the main panel. A popup will prompt you for a dataset name & optional description. The name can contain any characters, and is only used as a display name for your own reference.

Once you're ready, clicking "Create Dataset" will take you to the template selection page. Templates provide convenient preset inputs for common use cases. If none of them are relevant to your use case, proceed with "Custom". Either way, you'll be taken to the Dataset Designer.

Using the Dataset Designer

This section will give a step-by-step outline of how to make use of each tool in the Dataset Designer.

Description

The description input is where you define the use case, key goals and behaviours of the model you eventually want to train, as well as any important characteristics the dataset should exhibit. There aren't any hard and fast rules for what can go in the description, but some general guidelines for good results are:

  • Keep the description as straightforward and concise as possible

  • Avoid any ambiguity in phrasing, word choice, and logic

  • Include any important distinctions, caveats, and assumptions

To see examples of what a description should look like, you can check out the preset templates that are presented before the Dataset Designer page.

Note that the description is NOT editable after you create your dataset, so it's important to have a clear idea of what you need before finalizing it.

Data Sources

Data sources allow your dataset to be generated from custom data and information that you provide. For in-depth information on how to create & manage sources, see the Sources page.

Under "Custom sources", you can select a source you've already created on your profile to link to this dataset. When the dataset is generated, this source will be referenced, and the data it contains will be used to generate some of the samples in the dataset.

If you want your dataset to be generated only from the data in your custom sources, you can select the "Custom sources only" option. This will disable our large internal data crawl as a source for data generation, and thus disable the Knowledge Graph feature.

Keyphrases

Keyphrases allow you to adjust the topics and concepts that our data generation pipeline will include in your dataset. Clicking the "Generate Keyphrases" button will generate a list of suggested topics & concepts inferred from the description you provided at the beginning of the Dataset Designer.

The suggested keyphrases can be tweaked by adding or deleting phrases.

Note that the Keyphrases feature is disabled if using the "Custom sources only" option in the Data Sources section.

Prompt Schema & Response Schema

The prompt & response schema builders are powerful tools that allow you to describe the structure of the prompts & responses generated in your dataset. In other words, they give you control over what the prompts & responses should look like using typed variables & templates.

For basic examples of what a well-formed schema should look like, you can take a look at the ones provided in the templates. For in-depth examples and information, see the Schemas page.

Schema Variables

Schema variables are the building blocks of a schema. Each schema variable describes a distinct component of the schema. In cases where the prompt/response should be unstructured, a single string variable will suffice. However, in cases where the prompt/response should follow a consistent structure with multiple parts, multiple schema variables can be used to compose a richer schema.

To create a variable, enter its name into the text field next to the "New variable" button and select a type for the variable using the "Type" dropdown menu. Once the variable is created, provide a description of what information the variable represents. For complex types (structs, arrays, enums), you will need to add additional information to the variable data.
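As a rough mental model only, a schema variable can be thought of as a small record holding a name, a type, a description, and any extra data that a complex type needs. The sketch below is not the Dataset Designer's internal format: the class `SchemaVariable`, the field `enum_values`, and the method `missing_details` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SchemaVariable:
    """Hypothetical record mirroring the fields entered in the variable card."""
    name: str
    type: str                      # e.g. "string", "struct", "array", "enum"
    description: str = ""
    # Assumed extra data for an enum variable: its allowed values.
    enum_values: List[str] = field(default_factory=list)

    def missing_details(self) -> List[str]:
        """List the fields that still need to be filled in."""
        problems = []
        if not self.description:
            problems.append("description")
        if self.type == "enum" and not self.enum_values:
            problems.append("enum_values")
        return problems


# A simple string variable with a description is complete.
question = SchemaVariable("question", "string", "The question asked by the user")

# An enum variable without its allowed values still needs additional data.
sentiment = SchemaVariable("sentiment", "enum", "Overall tone of the review")

print(question.missing_details())
print(sentiment.missing_details())
```

The point of the sketch is simply that complex types carry more information than a name and description, which is why the variable card asks for it.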

Variables can be reordered using the handles on the top-left of the variable card, or deleted using the trash icon on the top-right of the card.

Schema Template

The schema template is used to define how the schema variables should be arranged. Variables are referenced inside the template by wrapping the variable name in double curly braces like this: {{variable_name}}. Other text can also be placed within the template as arbitrary labels or delimiters of information.
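To make the double-curly-brace convention concrete, here is a minimal Python sketch of how such a template could be filled in. It only illustrates the substitution idea and is not the platform's actual rendering code: the function `render_template`, the variable names `question` and `context`, and the choice to raise an error on an undefined variable are all assumptions.

```python
import re

# Matches {{variable_name}}, allowing optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template: str, values: dict) -> str:
    """Replace each {{name}} in the template with the matching value."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"template references undefined variable: {name}")
        return str(values[name])

    return PLACEHOLDER.sub(substitute, template)


# "Question:" and "Context:" are plain text acting as labels around the variables.
template = "Question: {{question}}\nContext: {{context}}"
rendered = render_template(
    template,
    {"question": "What is a schema?", "context": "Dataset design"},
)
print(rendered)
```

Text outside the braces passes through unchanged, which is what lets labels and delimiters give each sample a consistent shape.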

Any variables that are not referenced in the schema template will be ignored when generating the dataset.
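One way to catch variables that would be ignored is to compare the names you have defined against the names the template actually references. The helper below is a hypothetical sketch (the function `unused_variables` and the example names `summary` and `rating` are not part of the Dataset Designer), useful mainly as a checklist you could run by hand or in a notebook.

```python
import re


def unused_variables(template: str, variable_names: list) -> list:
    """Return defined variable names that never appear as {{name}} in the template."""
    referenced = set(re.findall(r"\{\{\s*(\w+)\s*\}\}", template))
    return [name for name in variable_names if name not in referenced]


# "rating" is defined but never referenced, so it would be ignored.
unused = unused_variables("Summary: {{summary}}", ["summary", "rating"])
print(unused)
```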

Schema variables are not a requirement, and you can create a dataset by defining only the templates, but we strongly recommend defining variables.

Response Complexity

This is an optional selector that allows you to specify the complexity of the responses generated. Simpler complexities tend to result in shorter, more concise, and less in-depth responses. Greater complexities tend to result in longer, more verbose, and more complicated responses.

Previews

Generating a full dataset is a longer process that can take upwards of an hour, depending on your inputs. Previews allow you to quickly see a sample of what your actual dataset will look like before you commit to generating the whole thing.

Once you've finished filling in the rest of the Dataset Designer, try generating a small number of previews (~5-10) to quickly view samples. Make adjustments and repeat as necessary. Once you feel confident in your design, you can generate a larger number of samples for a better snapshot of the variety the final dataset might contain.

Generating version 1

Once you've finished iterating using the Dataset Designer and are satisfied with your previews, you're ready to click the big "Create Version" button located at the bottom of the page (or on the right-hand navigation panel). If your subscription includes model training, you'll be given the option to queue a model for immediate training once your dataset is finished generating. Otherwise, confirm the initialization and you're all set!

Generation typically takes anywhere from 30 minutes to over 2 hours, depending on the complexity of your design and current platform load. If you have notifications enabled on your account, you'll receive an email when it's ready.
