Adam Bronte

Using LLMs for factual dataset generation

When building Fresh Tracks, I needed to put together a dataset of ski resorts and some basic information about each one. In the past, this would have been a tedious task: either scraping the data together yourself or hoping to find an existing dataset.

With today's LLMs, this becomes really easy to do. It feels like such an obvious, yet non-obvious, use case for LLMs. The only thing you really need to be careful of is out-of-date information, but things like addresses and websites don't generally change.

I used Claude 3.5 Sonnet for this, first to prototype and then for the full dataset. For example, I started by asking for a list of ski resorts in Utah.

List all the ski resorts in utah. Format your response in <resort> tag.

Which returned:

Here are the ski resorts in Utah:

<resort>Alta Ski Area</resort>
<resort>Beaver Mountain</resort>
<resort>Brian Head Resort</resort>
<resort>Brighton Resort</resort>
<resort>Cherry Peak Resort</resort>
<resort>Deer Valley Resort</resort>
<resort>Eagle Point Resort</resort>
<resort>Nordic Valley Resort</resort>
<resort>Park City Mountain Resort</resort>
<resort>Powder Mountain</resort>
<resort>Snowbasin Resort</resort>
<resort>Snowbird</resort>
<resort>Solitude Mountain Resort</resort>
<resort>Sundance Mountain Resort</resort>
<resort>Woodward Park City</resort>
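
If you want to feed that response into a script, the tagged format is easy to pull apart. Here is a minimal sketch, assuming the model's reply is available as a plain string. The `extract_resorts` helper and the shortened `response` sample are hypothetical names for illustration, not part of the original workflow:

```ruby
# Hypothetical sample: a shortened copy of the response shown above.
response = <<~TEXT
  Here are the ski resorts in Utah:

  <resort>Alta Ski Area</resort>
  <resort>Beaver Mountain</resort>
  <resort>Brian Head Resort</resort>
TEXT

# Collect the text inside every <resort>...</resort> pair.
# String#scan with one capture group returns an array of one-element
# arrays, so flatten it and trim stray whitespace.
def extract_resorts(text)
  text.scan(%r{<resort>(.*?)</resort>}m).flatten.map(&:strip).reject(&:empty?)
end

names = extract_resorts(response)
puts names
```

Asking for a tag around each item means the chatty intro line ("Here are the ski resorts in Utah:") is simply ignored by the parser.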

Then I simply asked for additional data for each resort, formatted as a Ruby array of hashes so I could seed my database easily.

For each resort, get the address, city, zip, state, latitude, longitude, and website for these ski resorts.

Put them in a ruby hash for example:
resorts = [
  {
    name: "Vail Ski Resort",
    address1: "1000 Vail Mountain Drive",
    city: "Vail",
    zip: 81657,
    state: "CO",
    country: "USA",
    lat: 39.6061,
    lon: -106.3550,
    website: "https://www.vail.com"
  }
]

This gave me a 100% accurate list of all the information I needed.

resorts = [
  {
    name: "Alta Ski Area",
    address1: "10230 E Highway 210",
    city: "Alta",
    zip: 84092,
    state: "UT",
    country: "USA",
    lat: 40.5884,
    lon: -111.6377,
    website: "https://www.alta.com"
  },
  # ...
  # truncated for the blog post
]

You should still double-check that everything is correct. This saved me a ton of time, and it's a technique I'll use more in the future.
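
Part of that double-checking can be automated. The sketch below is one possible sanity check, not something from the original workflow: `resort_problems` is a hypothetical helper, and the rules (five-digit zip, coordinate ranges, http(s) website) are my assumptions about what "looks right" for US resorts. It only catches malformed values; it cannot tell you whether an address is actually correct.

```ruby
require "uri"

# Hypothetical helper: returns a list of problems found in one resort hash.
# An empty list means the record passed these (assumed) sanity rules.
def resort_problems(resort)
  problems = []

  required = %i[name address1 city zip state country lat lon website]
  missing = required.select { |key| resort[key].to_s.strip.empty? }
  problems << "missing: #{missing.join(', ')}" unless missing.empty?

  lat = resort[:lat]
  lon = resort[:lon]
  problems << "lat out of range" unless lat.is_a?(Numeric) && lat.between?(-90, 90)
  problems << "lon out of range" unless lon.is_a?(Numeric) && lon.between?(-180, 180)

  # Assumes US-style five-digit ZIP codes.
  problems << "zip is not 5 digits" unless resort[:zip].to_s.match?(/\A\d{5}\z/)

  uri = begin
    URI.parse(resort[:website].to_s)
  rescue URI::InvalidURIError
    nil
  end
  # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes.
  problems << "website is not an http(s) URL" unless uri.is_a?(URI::HTTP)

  problems
end

resorts = [
  {
    name: "Alta Ski Area",
    address1: "10230 E Highway 210",
    city: "Alta",
    zip: 84092,
    state: "UT",
    country: "USA",
    lat: 40.5884,
    lon: -111.6377,
    website: "https://www.alta.com"
  }
]

resorts.each do |resort|
  problems = resort_problems(resort)
  puts "#{resort[:name]}: #{problems.empty? ? 'ok' : problems.join('; ')}"
end
```

One side effect worth knowing: because the example stores `zip` as an integer, a ZIP code with a leading zero would lose that zero and fail the five-digit check, so storing zips as strings may be the safer choice.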

#AI