Dataset Builder

A tool for building, standardizing and validating datasets using language models.

Quickstart

Initialization

First, import the DatasetBuilder class and create an instance:

from txt2dataset import DatasetBuilder

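# input_path and output_path are CSV paths, for example the ones passed
# to build() below: "data/msft_8k_item_5_02.csv" and "data/msft_officers.csv"
builder = DatasetBuilder(input_path, output_path)

# Set your API key (api_key is a string holding the key)
builder.set_api_key(api_key)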

Set the base prompt that defines what the model should extract:

base_prompt = """Extract officer changes and movements to JSON format.
    Track when officers join, leave, or change roles.
    Provide the following information:
    - date (YYYYMMDD)
    - name (First Middle Last)
    - title
    - action (one of: ["HIRED", "RESIGNED", "TERMINATED", "PROMOTED", "TITLE_CHANGE"])
    Return an empty list if info unavailable."""

Define the expected response schema:

response_schema = {
    "type": "ARRAY",
    "items": {
        "type": "OBJECT",
        "properties": {
            "date": {"type": "STRING", "description": "Date of action in YYYYMMDD format"},
            "name": {"type": "STRING", "description": "Full name (First Middle Last)"},
            "title": {"type": "STRING", "description": "Official title/position"},
            "action": {
                "type": "STRING",
                "enum": ["HIRED", "RESIGNED", "TERMINATED", "PROMOTED", "TITLE_CHANGE"],
                "description": "Type of personnel action"
            }
        },
        "required": ["date", "name", "title", "action"]
    }
}
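
To make the schema concrete, here is a minimal sketch of checking one extracted record against it in plain Python. The helper check_record is hypothetical and not part of txt2dataset. It only checks required keys, string types, and the enum constraint, and it assumes the item shape shown above (descriptions omitted for brevity).

```python
# Hypothetical helper, not part of txt2dataset: checks one record against
# the item definition of an ARRAY-of-OBJECT schema like the one above.
ACTIONS = ["HIRED", "RESIGNED", "TERMINATED", "PROMOTED", "TITLE_CHANGE"]

item_schema = {
    "type": "OBJECT",
    "properties": {
        "date": {"type": "STRING"},
        "name": {"type": "STRING"},
        "title": {"type": "STRING"},
        "action": {"type": "STRING", "enum": ACTIONS},
    },
    "required": ["date", "name", "title", "action"],
}


def check_record(record, schema):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for key in schema["required"]:
        if key not in record:
            problems.append(f"missing required field: {key}")
    for key, spec in schema["properties"].items():
        if key not in record:
            continue
        value = record[key]
        if spec["type"] == "STRING" and not isinstance(value, str):
            problems.append(f"{key} should be a string")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{key} has unexpected value: {value!r}")
    return problems


good = {"date": "20160630", "name": "Kevin Turner",
        "title": "Chief Operating Officer", "action": "RESIGNED"}
bad = {"date": "20160630", "name": "Kevin Turner", "action": "QUIT"}

print(check_record(good, item_schema))  # no problems expected
print(check_record(bad, item_schema))   # missing title, unknown action
```

A check like this can be run on model output before it is written to disk, so malformed records are caught early.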

Optional Configurations

The request rate, save frequency, and model can be customized:

builder.set_rpm(1500)  # Set requests per minute
builder.set_save_frequency(100)  # Save progress every 100 items
builder.set_model('gemini-1.5-flash-8b')  # Set the model to use
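
For a sense of what a requests-per-minute limit implies: 1500 RPM leaves 60 / 1500 = 0.04 seconds between requests on average. The sketch below is a generic fixed-interval throttle, not txt2dataset's internal implementation. The names min_interval and Throttle are made up for illustration.

```python
import time


def min_interval(rpm):
    """Seconds between requests needed to stay at or under rpm."""
    return 60.0 / rpm


class Throttle:
    """Generic fixed-interval throttle (illustrative, not the library's code)."""

    def __init__(self, rpm, clock=time.monotonic, sleep=time.sleep):
        self.interval = min_interval(rpm)
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # no request sent yet

    def wait(self):
        """Block until the next request may be sent."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval


throttle = Throttle(rpm=1500)
for _ in range(3):
    throttle.wait()  # a request would be sent here
print(min_interval(1500))
```

The clock and sleep arguments are injectable only so the throttle can be exercised without real delays.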

Building the Dataset

Build your dataset using the configured settings:

builder.build(
    base_prompt=base_prompt,
    response_schema=response_schema,
    text_column='text',
    index_column='accession_number',
    input_path="data/msft_8k_item_5_02.csv",
    output_path='data/msft_officers.csv'
)
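
The call above takes the source text from the text column and uses accession_number as the row identifier. As a sketch of the input this implies, the snippet below writes and reads back a CSV with those two columns using only the standard library. The file name and row values are placeholders, and the exact input requirements of build are an assumption based on the arguments shown.

```python
import csv
import os
import tempfile

# Placeholder rows: one text passage per identifier.
rows = [
    {"accession_number": "example-0001",
     "text": "Item 5.02 Departure of Directors... (filing text goes here)"},
    {"accession_number": "example-0002",
     "text": "Item 5.02 ... (another filing's text)"},
]

path = os.path.join(tempfile.mkdtemp(), "input_example.csv")

# newline="" lets the csv module handle line endings itself.
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["accession_number", "text"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm both columns are present.
with open(path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(sorted(loaded[0].keys()))
```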

Standardizing the Dataset

Standardize selected columns of the output dataset (here, the name column):

builder.standardize(
    response_schema=response_schema,
    input_path='data/msft_officers.csv',
    output_path='data/msft_officers_standardized.csv',
    columns=['name']
)

Validating the Dataset

Validate the generated dataset:

results = builder.validate(
    input_path='data/msft_8k_item_5_02.csv',
    output_path='data/msft_officers_standardized.csv',
    text_column='text',
    index_column='accession_number',
    base_prompt=base_prompt,
    response_schema=response_schema,
    n=5,
    quiet=False
)

Example Validation Output

The validate call returns a list of results in this format:

[{
    "input_text": "Item 5.02 Departure of Directors... Kevin Turner provided notice he was resigning his position as Chief Operating Officer of Microsoft.",
    "process_output": [{
        "date": 20160630,
        "name": "Kevin Turner",
        "title": "Chief Operating Officer",
        "action": "RESIGNED"
    }],
    "is_valid": true,
    "reason": "The generated JSON is valid..."
},...]
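
Given results in the shape shown above, a short summary can be computed in plain Python. This is a sketch that assumes results is a list of dicts with the is_valid and reason keys. The helper summarize is hypothetical, and the second sample entry is made up for illustration.

```python
def summarize(results):
    """Count valid results and collect reasons for the invalid ones."""
    valid = sum(1 for r in results if r["is_valid"])
    failures = [r["reason"] for r in results if not r["is_valid"]]
    return {"total": len(results), "valid": valid, "failures": failures}


sample = [
    {"input_text": "Item 5.02 Departure of Directors...",
     "process_output": [{"date": 20160630, "name": "Kevin Turner",
                         "title": "Chief Operating Officer",
                         "action": "RESIGNED"}],
     "is_valid": True,
     "reason": "The generated JSON is valid..."},
    # Made-up failing entry, for illustration only.
    {"input_text": "(placeholder text)",
     "process_output": [],
     "is_valid": False,
     "reason": "(placeholder reason)"},
]

summary = summarize(sample)
print(f"{summary['valid']}/{summary['total']} valid")
for reason in summary["failures"]:
    print("invalid:", reason)
```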