You may find yourself in a situation where you’d like to generate mock data, for example when writing a blog post series on taking a pipeline and model to production. Luckily, NumPy and pandas make this task incredibly easy.

For my use case, I wanted to generate a Pandas DataFrame with one independent column temperature_celsius and one dependent column ice_cream_sales_euro. The goal was to make a data set where temperature_celsius would affect ice_cream_sales_euro.

On the first attempt, the data showed too easy a relationship between the two, so we should generate some noise and add it in. This adds variation that temperature alone cannot explain.

First we’ll import our required packages:

import datetime
import pandas as pd
import numpy as np

It’s a good idea to set a random seed so our code is reproducible.

np.random.seed(42)
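Note that the seed has to be set by calling `np.random.seed(42)`; assigning `np.random.seed = 42` only replaces the function with an integer and seeds nothing. A quick way to confirm that seeding works is to draw the same numbers twice. This is a minimal sketch, and the helper name `draw_sample` is only for illustration:

```python
import numpy as np


def draw_sample(seed, size=5):
    """Seed NumPy's global generator, then draw `size` uniform values."""
    np.random.seed(seed)  # a function call, not an assignment
    return np.random.uniform(-1, 3, size=size)


first = draw_sample(42)
second = draw_sample(42)
different = draw_sample(43)

print(np.array_equal(first, second))     # same seed, same draws
print(np.array_equal(first, different))  # different seed, different draws
```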

Now we can create our index column, which will be the year, month and day and list our column names.

how_many_days = 365
today = datetime.date.today()
today_last_year = today - datetime.timedelta(days=how_many_days)
index = pd.date_range(today_last_year, periods=how_many_days, freq='D')

columns = ["temperature_celsius", "ice_cream_sales_euro"]
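To see what this index covers, it can help to pin "today" to a fixed date. The sketch below assumes an arbitrary date of 2024-03-01 purely for illustration; the index then starts 365 days earlier and ends the day before the pinned date.

```python
import datetime

import pandas as pd

how_many_days = 365
today = datetime.date(2024, 3, 1)  # fixed date, assumed for illustration only
start = today - datetime.timedelta(days=how_many_days)

# One entry per day, starting at `start`
index = pd.date_range(start, periods=how_many_days, freq="D")

print(len(index))
print(index[0].date(), index[-1].date())
```

Because the range starts `how_many_days` before today and contains exactly that many daily entries, the last entry is always yesterday.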

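Now we can create the data for our temperature_celsius column and the noise we mentioned above. The temperature is a linear trend of 0.1 degrees per day plus a uniform offset between -1 and 3, and the noise is drawn from a normal distribution with mean 25000 and standard deviation 10000: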

temperature_celsius_data_x = np.arange(how_many_days)
temperature_celsius_data_delta = np.random.uniform(-1, 3, size=(how_many_days,))
temperature_celsius_data = (.1 * temperature_celsius_data_x) + temperature_celsius_data_delta

noise_data = np.random.normal(loc=25000, scale=10000, size=(how_many_days,))
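To get a feel for the numbers: the trend term alone runs from 0 to 36.4, and the uniform offset shifts each day by between -1 and 3 degrees, so temperatures stay within roughly -1 to 39.4 and rise steadily through the year rather than following seasons. A minimal sketch to check this (the names `trend` and `offset` are only for illustration):

```python
import numpy as np

np.random.seed(42)
how_many_days = 365

trend = 0.1 * np.arange(how_many_days)                 # 0.0 up to 36.4
offset = np.random.uniform(-1, 3, size=how_many_days)  # daily wobble in [-1, 3)
temperature_celsius_data = trend + offset

print(temperature_celsius_data.min(), temperature_celsius_data.max())
print(temperature_celsius_data[:90].mean(), temperature_celsius_data[-90:].mean())
```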

And now we can take all the data and create our ice_cream_sales_euro_data column:

ice_cream_sales_euro_data = (temperature_celsius_data * 1200) + noise_data
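One way to see what the noise buys us is to compare the correlation between temperature and sales with and without it. This is a rough sketch using the same formulas as above; the names `sales_without_noise`, `sales_with_noise`, `corr_clean` and `corr_noisy` are only for illustration.

```python
import numpy as np

np.random.seed(42)
how_many_days = 365

temperature = 0.1 * np.arange(how_many_days) + np.random.uniform(-1, 3, size=how_many_days)
noise = np.random.normal(loc=25000, scale=10000, size=how_many_days)

sales_without_noise = temperature * 1200
sales_with_noise = temperature * 1200 + noise

# Pearson correlation between temperature and each version of the sales column
corr_clean = np.corrcoef(temperature, sales_without_noise)[0, 1]
corr_noisy = np.corrcoef(temperature, sales_with_noise)[0, 1]

print(f"without noise: {corr_clean:.3f}")
print(f"with noise:    {corr_noisy:.3f}")
```

Without noise the sales column is an exact multiple of the temperature, so the correlation is perfect. With noise the relationship is still clearly positive but no longer trivial to fit.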

At this point we’re ready to combine everything into a single NumPy array, wrap it in a pandas DataFrame, and save that to a CSV file.

data = np.array([temperature_celsius_data, ice_cream_sales_euro_data]).T

df = pd.DataFrame(data, index=index, columns=columns)
df.to_csv("ice_cream_shop.csv")
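When loading the file again, the date index comes back as plain strings unless pandas is told to parse it. The sketch below round-trips a tiny hand-made frame (the three rows are made-up values, not output of the script) through a temporary directory, so it does not overwrite the real file:

```python
import os
import tempfile

import numpy as np
import pandas as pd

columns = ["temperature_celsius", "ice_cream_sales_euro"]
index = pd.date_range("2023-01-01", periods=3, freq="D")
data = np.array([[1.5, 2.0, 2.5], [26800.0, 27400.0, 28000.0]]).T
df = pd.DataFrame(data, index=index, columns=columns)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ice_cream_shop.csv")
    df.to_csv(path)
    # index_col=0 restores the index; parse_dates=True turns it back into dates
    loaded = pd.read_csv(path, index_col=0, parse_dates=True)

print(loaded.index.dtype)
print(loaded.head())
```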

And the full script:

import datetime
import pandas as pd
import numpy as np

np.random.seed(42)

how_many_days = 365
today = datetime.date.today()
today_last_year = today - datetime.timedelta(days=how_many_days)
index = pd.date_range(today_last_year, periods=how_many_days, freq='D')

columns = ["temperature_celsius", "ice_cream_sales_euro"]

temperature_celsius_data_x = np.arange(how_many_days)
temperature_celsius_data_delta = np.random.uniform(-1, 3, size=(how_many_days,))
temperature_celsius_data = (.1 * temperature_celsius_data_x) + temperature_celsius_data_delta

noise_data = np.random.normal(loc=25000, scale=10000, size=(how_many_days,))

ice_cream_sales_euro_data = (temperature_celsius_data * 1200) + noise_data

data = np.array([temperature_celsius_data, ice_cream_sales_euro_data]).T

df = pd.DataFrame(data, index=index, columns=columns)
df.to_csv("ice_cream_shop.csv")

Additional Approaches

One could also use a mock-data library, such as the popular Faker package.