Load Dataset
Processes and loads a dataset, generating prompts from the provided templates. The input can be a file path, a dictionary or list of dictionaries, or a Hugging Face Dataset object. Each record is rendered into a structured prompt, and records are processed concurrently for efficiency.
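The concurrent step can be pictured as mapping a prompt builder over the records with a thread pool. This is only a sketch under stated assumptions: `render_prompt` and `build_prompts` are hypothetical names invented for illustration, and nothing here reflects how `gemma_template` is actually implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def render_prompt(record: dict) -> str:
    # Hypothetical stand-in for template rendering: title, blank line, document.
    return f"{record['title']}\n\n{record['document']}"


def build_prompts(records: list[dict], max_workers: int = 4) -> list[str]:
    # executor.map preserves input order, so prompts line up with records.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(render_prompt, records))


records = [
    {"title": "Gemma open models", "document": "Gemma 2 comes in 2B, 9B and 27B sizes."},
    {"title": "Second record", "document": "Another document."},
]
prompts = build_prompts(records)
```

Threads suit this kind of work when rendering is dominated by I/O; for CPU-heavy analysis a process pool may be the better fit.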
Parameters
Sample data
data_dict = [
    {
        "id": "JnZJolR76_u2",
        "title": "Gemma open models",
        "description": "Gemma: Introducing new state-of-the-art open models",
        "document": "Gemma open models are built from the same research and technology as Gemini models. Gemma 2 comes in 2B, 9B and 27B and Gemma 1 comes in 2B and 7B sizes.",
        "categories": ["Topic 1", "Topic 2"],
        "tags": ["Tag 1", "Tag 2"],
        "output": "Sample output",
        "main_points": ["Main point 1", "Main point 2"],
    },
    {
        "id": "JnZJolR76_u2",
        "title": "Gemma open models",
        "description": "Gemma: Introducing new state-of-the-art open models",
        "document": "Gemma open models are built from the same research and technology as Gemini models. Gemma 2 comes in 2B, 9B and 27B and Gemma 1 comes in 2B and 7B sizes.",
        "categories": ["Topic 1", "Topic 2"],
        "tags": ["Tag 1", "Tag 2"],
        "output": "Sample output",
        "main_points": ["Main point 1", "Main point 2"],
    },
]
Text Dataset
Generate text dataset format.
>>> from gemma_template import gemma_template
>>> dataset = gemma_template.load_dataset(data_dict, output_format='text')
>>> dataset
Dataset({
    features: ['text', 'analysis', 'is_masked', 'origin_data'],
    num_rows: 2
})
>>> dataset[0]
{
    'text': '<start_of_turn>user\nYou are...<end_of_turn>\n<start_of_turn>model\n## **Title:**...<end_of_turn>\n',
    'analysis': {
        'bigrams': ['technology as'],
        'keyword_value': 'Tag 1, Tag 2',
        'language': 'English',
        'language_code': 'en',
        'topic_value': 'Topic 1, Topic 2',
        'trigrams': ['technology as Gemini'],
        'unigrams': ['and', 'built', 'from', 'the', 'research']
    },
    'is_masked': False,
    'origin_data': {}
}
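The `text` field above uses Gemma's turn markers: each turn opens with `<start_of_turn>` plus a role (`user` or `model`) and closes with `<end_of_turn>`. As a minimal sketch of how such a string can be assembled, the helper below wraps one exchange; `to_gemma_text` is a hypothetical name for illustration, not a function of `gemma_template`.

```python
def to_gemma_text(user_prompt: str, model_response: str) -> str:
    """Wrap one user/model exchange in Gemma turn markers."""
    return (
        f"<start_of_turn>user\n{user_prompt}<end_of_turn>\n"
        f"<start_of_turn>model\n{model_response}<end_of_turn>\n"
    )


# Using the truncated placeholders from the sample output above.
example = to_gemma_text("You are...", "## **Title:**...")
print(repr(example))
```

With those placeholders, the result has the same shape as the `text` value shown in the sample record.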
Alpaca Dataset
Generate Alpaca dataset format.
>>> from gemma_template import gemma_template
>>> dataset = gemma_template.load_dataset(data_dict, output_format='alpaca')
>>> dataset
Dataset({
    features: ['instruction', 'input', 'output', 'analysis', 'is_masked', 'origin_data'],
    num_rows: 2
})
>>> dataset[0]
{
    'instruction': 'You are a multilingual professional writer...',
    'input': '# Input Text:\nRewrite the input text..',
    'output': '## **Title:**\n### Gemma open models\n\n## **Meta Description:**\nGemma: Introducing new state-of-the-art open models...',
    'analysis': {
        'bigrams': ['technology as'],
        'keyword_value': 'Tag 1, Tag 2',
        'language': 'English',
        'language_code': 'en',
        'topic_value': 'Topic 1, Topic 2',
        'trigrams': ['technology as Gemini'],
        'unigrams': ['and', 'built', 'from', 'the', 'research']
    },
    'is_masked': False,
    'origin_data': {}
}
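Records in this format are usually rendered into a single training string before fine-tuning. The sketch below uses the prompt wording popularized by the Stanford Alpaca project; `render_alpaca_prompt` is a hypothetical helper, and whether your trainer expects this exact wording is an assumption to verify.

```python
# Prompt wording from the Stanford Alpaca project (variant with an input field).
ALPACA_PROMPT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def render_alpaca_prompt(record: dict) -> str:
    # Fill the template, then append the target output for supervised training.
    prompt = ALPACA_PROMPT.format(
        instruction=record["instruction"], input=record["input"]
    )
    return prompt + record["output"]


record = {
    "instruction": "You are a multilingual professional writer...",
    "input": "# Input Text:\nRewrite the input text..",
    "output": "## **Title:**\n### Gemma open models",
}
rendered = render_alpaca_prompt(record)
```

The extra columns (`analysis`, `is_masked`, `origin_data`) are ignored here; only the three Alpaca fields feed the prompt.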
OpenAI Dataset
Generate OpenAI dataset format.
>>> from gemma_template import gemma_template
>>> dataset = gemma_template.load_dataset(data_dict, output_format='openai')
>>> dataset
Dataset({
    features: ['messages', 'analysis', 'is_masked', 'origin_data'],
    num_rows: 2
})
>>> dataset[0]
{
    'messages': [
        {
            'content': 'You are a multilingual professional writer...',
            'role': 'developer'
        },
        {
            'content': '# Input Text:\nRewrite the input text...',
            'role': 'user'
        },
        {
            'content': '## **Title:**\n### Gemma open models...',
            'role': 'assistant'
        }
    ],
    'analysis': {
        'bigrams': ['technology as'],
        'keyword_value': 'Tag 1, Tag 2',
        'language': 'English',
        'language_code': 'en',
        'topic_value': 'Topic 1, Topic 2',
        'trigrams': ['technology as Gemini'],
        'unigrams': ['and', 'built', 'from', 'the', 'research']
    },
    'is_masked': False,
    'origin_data': {}
}
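Chat-style fine-tuning files are commonly stored as JSON Lines, with one `{"messages": [...]}` object per line. The sketch below serializes records of the shape shown above into that layout; `to_chat_jsonl` is a hypothetical helper, and dropping the `analysis`, `is_masked`, and `origin_data` columns is an assumption about what the consuming tool expects.

```python
import json


def to_chat_jsonl(records: list[dict]) -> str:
    # Keep only the `messages` list from each record, one JSON object per line.
    lines = [
        json.dumps({"messages": record["messages"]}, ensure_ascii=False)
        for record in records
    ]
    return "\n".join(lines) + "\n"


records = [
    {
        "messages": [
            {"content": "You are a multilingual professional writer...", "role": "developer"},
            {"content": "# Input Text:\nRewrite the input text...", "role": "user"},
            {"content": "## **Title:**\n### Gemma open models...", "role": "assistant"},
        ],
        "analysis": {"language": "English", "language_code": "en"},
        "is_masked": False,
        "origin_data": {},
    }
]
jsonl = to_chat_jsonl(records)
```

Newlines inside message content are escaped by `json.dumps`, so each record stays on a single physical line.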