Templates: LLM Dataset Generation

LLM Dataset Generation

Before you can generate your dataset, gather all the documents you want your model to be built from. These documents must be in one folder, and each must be in one of the following formats: PDF, text, or docx. You can mix any of these types in the same folder.
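If you want to double-check the folder before pointing the Element at it, a short script can list any files that are not in a supported format. The following is a minimal sketch, not part of the product: the function name check_references_folder is hypothetical, the ".txt" extension is an assumption for "text" files, and it only looks at files directly inside the folder because this article does not say whether subfolders are read.

```python
from pathlib import Path

# Assumed extensions for "PDFs, text, and docx"; ".txt" is a guess for "text".
SUPPORTED = {".pdf", ".txt", ".docx"}

def check_references_folder(folder):
    """Split the files directly inside `folder` into supported and unsupported names."""
    supported, unsupported = [], []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue  # subfolders are skipped, not inspected
        if path.suffix.lower() in SUPPORTED:
            supported.append(path.name)
        else:
            unsupported.append(path.name)
    return supported, unsupported
```

For example, calling check_references_folder("my_references") on a hypothetical folder returns two lists of file names; anything in the second list is worth converting or removing before you run the flow.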

Open the LLM Dataset Generator Element settings and adjust the following settings:

Topic: This can be anything you would like.

References folder path: Using the “Select Directory” button, choose the folder where your documents are located.

Output folder path: Using the “Select Directory” button, choose the folder where you would like to save the output of the dataset generation.

Dataset size: Enter the number of topics you want your dataset to contain.
Note: We recommend starting with 5 while you test and get familiar with the dataset generation process. This generates a list of five topics and is quicker to train on, but it will not produce as accurate a model as a larger dataset size would. The higher the dataset size, the more accurate your dataset and trained model will be, but the longer the dataset will take to generate. Generating a large dataset can take several hours, so be patient.

Next, enter your Groq, GPT, Claude, or Gemini API keys. You can add as many as you like, but at least one is required. You can get a free Groq key here.

Now you can hit Run. Dependencies are installed the first time this flow is run, so the first run may take a while.
The output will be a folder named dataset_[your_topic_name]_[timestamp].
This folder is what you connect to the Dataset Folder Path in the LLM Trainer Element in the Training step.
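After several runs, the output folder can contain more than one dataset_ folder. The following is a minimal sketch for finding the newest one; the function name latest_dataset_folder is hypothetical, and because this article does not specify the timestamp format, the sketch compares folder modification times rather than parsing the folder name.

```python
from pathlib import Path

def latest_dataset_folder(output_dir):
    """Return the most recently modified 'dataset_*' subfolder, or None if there is none."""
    candidates = [
        p for p in Path(output_dir).iterdir()
        if p.is_dir() and p.name.startswith("dataset_")
    ]
    # Modification time is used because the timestamp format in the name is not documented here.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```

The returned path is the one you would select for the Dataset Folder Path setting.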

For a complete deep dive into Dataset Generation, see the LLM Dataset Generator article.
