Visualizing Geospatial Data

April 22, 2018
python tutorial visualization

# ensure our graphs are displayed inline
%matplotlib inline
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import numpy as np
import folium
from folium import plugins
from folium.plugins import HeatMap
from folium.plugins import MarkerCluster
# useful to define where we'll be storing our data
data_directory = "data/"

# useful to define where we'll be storing our output
output_directory = "output/"


Our goal today is to create visualizations of geospatial data. We’ll do that by first acquiring the data, taking a quick look at the data set, and doing some very minor cleanup.

Then we’ll walk through creating multiple visualizations, which can be applied to many data sets. Specifically, we’ll be doing the following:

Data Acquisition

First we’ll create a pandas DataFrame from a JSON file hosted by NASA.

# Data from NASA on meteorite landings
df = pd.read_json(data_directory + "y77d-th95.json")
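
To make the loading step concrete, here is a minimal sketch of how pd.read_json turns a JSON array of records into rows. The two records are abbreviated copies of the first two rows shown later in this post, and sample_json and sample_df are names made up for this illustration; the real file has more fields.

```python
import io

import pandas as pd

# Two abbreviated records in the same list-of-objects shape as the NASA file
# (values copied from the first two rows of the data set shown in this post).
sample_json = """
[
  {"name": "Aachen", "id": 1, "mass": 21.0, "reclat": 50.775, "reclong": 6.08333},
  {"name": "Aarhus", "id": 2, "mass": 720.0, "reclat": 56.18333, "reclong": 10.23333}
]
"""

# read_json accepts a path, a URL, or a file-like object;
# StringIO stands in for the file on disk here.
sample_df = pd.read_json(io.StringIO(sample_json))

print(sample_df.shape)
print(sample_df.columns.tolist())
```

Each JSON object becomes one row and each key becomes one column, which is why the real file ends up with one row per meteorite.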

Now we’ll take a high-level look at the data.

Initial Data High Level View

I always like to start by looking at the thirty-thousand-foot view of any data set.
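
The three blocks of output that follow look like the results of info(), describe(), and head(). The cells that produced them are not shown, so treat that as an inference. A minimal sketch of those calls on a small stand-in frame (overview_df is a made-up name, with values taken from the first two rows of the data set):

```python
import pandas as pd

# Stand-in frame built from the first two rows of the data set
overview_df = pd.DataFrame(
    {
        "id": [1, 2],
        "mass": [21.0, 720.0],
        "name": ["Aachen", "Aarhus"],
        "reclat": [50.775, 56.18333],
        "reclong": [6.08333, 10.23333],
    }
)

overview_df.info()              # column names, non-null counts, dtypes, memory usage
print(overview_df.describe())   # summary statistics for the numeric columns only
print(overview_df.head())       # the first rows (five at most)
```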
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 12 columns):
:@computed_region_cbhk_fwbd    133 non-null float64
:@computed_region_nnqa_25f4    134 non-null float64
fall                           1000 non-null object
geolocation                    988 non-null object
id                             1000 non-null int64
mass                           972 non-null float64
name                           1000 non-null object
nametype                       1000 non-null object
recclass                       1000 non-null object
reclat                         988 non-null float64
reclong                        988 non-null float64
year                           999 non-null object
dtypes: float64(5), int64(1), object(6)
memory usage: 101.6+ KB
       :@computed_region_cbhk_fwbd  :@computed_region_nnqa_25f4           id          mass      reclat      reclong
count                   133.000000                   134.000000   1000.00000  9.720000e+02  988.000000   988.000000
mean                     26.939850                  1537.888060  15398.72800  5.019020e+04   29.691592    19.151208
std                      12.706929                   899.826915  10368.70402  7.539857e+05   23.204399    68.644015
min                       1.000000                    10.000000      1.00000  1.500000e-01  -44.116670  -157.866670
25%                      17.000000                   650.250000   7770.50000  6.795000e+02   21.300000    -5.195832
50%                      24.000000                  1647.000000  12757.50000  2.870000e+03   35.916665    17.325000
75%                      37.000000                  2234.250000  18831.25000  1.005000e+04   45.817835    76.004167
max                      50.000000                  3190.000000  57168.00000  2.300000e+07   66.348330   174.400000
:@computed_region_cbhk_fwbd :@computed_region_nnqa_25f4 fall geolocation id mass name nametype recclass reclat reclong year
0 NaN NaN Fell {'type': 'Point', 'coordinates': [6.08333, 50.... 1 21.0 Aachen Valid L5 50.77500 6.08333 1880-01-01T00:00:00.000
1 NaN NaN Fell {'type': 'Point', 'coordinates': [10.23333, 56... 2 720.0 Aarhus Valid H6 56.18333 10.23333 1951-01-01T00:00:00.000
2 NaN NaN Fell {'type': 'Point', 'coordinates': [-113, 54.216... 6 107000.0 Abee Valid EH4 54.21667 -113.00000 1952-01-01T00:00:00.000
3 NaN NaN Fell {'type': 'Point', 'coordinates': [-99.9, 16.88... 10 1914.0 Acapulco Valid Acapulcoite 16.88333 -99.90000 1976-01-01T00:00:00.000
4 NaN NaN Fell {'type': 'Point', 'coordinates': [-64.95, -33.... 370 780.0 Achiras Valid L6 -33.16667 -64.95000 1902-01-01T00:00:00.000

We see twelve columns:

* five floats
* six strings or mixed data
* one int64

Additionally, the geolocation column holds nested JSON objects, which is something I’ve never worked with inside of a pandas DataFrame. Also, we may be able to leverage pandas’ datetime dtype for the year column.
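
As a rough sketch of that datetime idea (not something this post goes on to do), pd.to_datetime can parse strings in the format the year column uses. The series below is a stand-in with two values copied from the output above plus one missing value; year_strings, year_parsed, and year_only are illustrative names.

```python
import pandas as pd

# Stand-in for the year column: timestamp strings plus one missing value
year_strings = pd.Series(
    ["1880-01-01T00:00:00.000", "1951-01-01T00:00:00.000", None]
)

# errors="coerce" turns anything that cannot be parsed into NaT instead of raising
year_parsed = pd.to_datetime(year_strings, errors="coerce")

# Only the year carries information here, so keep it as a nullable integer
year_only = year_parsed.dt.year.astype("Int64")

print(year_only.tolist())
```

One caveat worth checking on the real column: nanosecond-resolution timestamps only cover roughly the years 1678 to 2261, so depending on the pandas version, older years may be coerced to NaT rather than parsed.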

Removing Redundant Data

As geolocation’s data is already represented in reclat and reclong, we’ll simply remove it. We’re dropping this column rather than the other two because it’s a more complex JSON data type, whereas reclat and reclong are already separate columns.
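
If you want to confirm that redundancy rather than take it on faith, a check along these lines could work. It is only a sketch on a stand-in frame: check_df and the coordinate helper are hypothetical, and it assumes the points store [longitude, latitude] in that order, which matches the rows printed above.

```python
import numpy as np
import pandas as pd

# Stand-in rows; the real check would run on df before the column is dropped
check_df = pd.DataFrame(
    {
        "geolocation": [
            {"type": "Point", "coordinates": [6.08333, 50.775]},
            {"type": "Point", "coordinates": [10.23333, 56.18333]},
            None,  # rows without a location have no point at all
        ],
        "reclat": [50.775, 56.18333, np.nan],
        "reclong": [6.08333, 10.23333, np.nan],
    }
)


def coordinate(point, position):
    """Return one coordinate of a point dict, or NaN when the point is missing."""
    return point["coordinates"][position] if isinstance(point, dict) else np.nan


geo_long = check_df["geolocation"].apply(coordinate, position=0)
geo_lat = check_df["geolocation"].apply(coordinate, position=1)

# equal_nan=True treats "missing in both places" as a match
redundant = np.allclose(geo_lat, check_df["reclat"], equal_nan=True) and np.allclose(
    geo_long, check_df["reclong"], equal_nan=True
)
print(redundant)
```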

df.drop(labels="geolocation", axis=1, inplace=True)

NaN Inspection

Let’s look at all columns that have at least one NaN value.

nan_columns = df.columns[df.isna().any()].tolist()

We see that six of the remaining eleven columns have at least one NaN value. Let’s look into how many NaN values are in each column so we can get an idea of how to proceed with cleaning.

nan_column_counts = {}

for nan_column in nan_columns:
    nan_column_counts[nan_column] = sum(pd.isnull(df[nan_column]))

{':@computed_region_cbhk_fwbd': 867,
 ':@computed_region_nnqa_25f4': 866,
 'mass': 28,
 'reclat': 12,
 'reclong': 12,
 'year': 1}
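
The same counts can also be computed without an explicit loop. This is an alternative sketch on a small stand-in frame (nan_demo is a made-up name), not the code used above:

```python
import numpy as np
import pandas as pd

nan_demo = pd.DataFrame(
    {
        "mass": [21.0, np.nan, 107000.0],
        "reclat": [50.775, 56.18333, np.nan],
        "name": ["Aachen", "Aarhus", "Abee"],
    }
)

# isna() marks the missing cells, sum() counts them per column
counts = nan_demo.isna().sum()

# keep only the columns that have at least one missing value
nan_counts = counts[counts > 0].to_dict()
print(nan_counts)
```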

We see here that the number of NaN values ranges from as high as 867 to as low as 1. Recall that there are 1000 rows in this data set, so most of the rows have NaN for :@computed_region_cbhk_fwbd and :@computed_region_nnqa_25f4.

We’ll have to handle these after performing some more data inspection.

Unique Values Inspection

We’ll now look at the unique values.

The following cell has been made a raw cell to keep its large output from printing.

for column in list(df):
    print(f"{column} has {df[column].nunique()} unique values:")
    print(df[column].unique())
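
If the full listing is too long to print, a more compact summary is one option. The sketch below uses a stand-in frame (unique_demo is a made-up name) with values taken from the first rows of the data set:

```python
import pandas as pd

unique_demo = pd.DataFrame(
    {
        "fall": ["Fell", "Fell", "Fell"],
        "nametype": ["Valid", "Valid", "Valid"],
        "recclass": ["L5", "H6", "EH4"],
    }
)

# one number per column instead of every distinct value
unique_counts = unique_demo.nunique()
print(unique_counts)

# for a single column, the most frequent values are often enough
print(unique_demo["recclass"].value_counts().head(10))
```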

NaN Handling

Since we’re not building any specific model, we’re going to leave the NaN values as they are. I just want to note that you’ll usually have to handle the NaN values in a data set or, at the very least, be aware that they exist. There are many techniques for handling NaN values, but they won’t be discussed here.

Geospatial Visualizations

Now we’re going to work on creating geospatial visualizations for our data set. These can be incredibly helpful for exploring your data, as well as when it comes time to present or share your work.

These visualizations are handy because they help you answer questions quickly. For example, we currently don’t know how many meteorites land in the oceans. We’d expect that many do, in fact probably more than on land, but we don’t have an easy way to determine this. Once we have our visualizations created, we can quickly answer this question.

Data Preparation

First, we’ll need to prepare a DataFrame of rows that have both latitude and longitude values.

# Keep only the rows that have both a latitude and a longitude
geo_df = df.dropna(axis=0, how="any", subset=['reclat', 'reclong'])
geo_df = geo_df.set_index("id") # we'll preserve the id from the data set
:@computed_region_cbhk_fwbd :@computed_region_nnqa_25f4 fall mass name nametype recclass reclat reclong year
1 NaN NaN Fell 21.0 Aachen Valid L5 50.77500 6.08333 1880-01-01T00:00:00.000
2 NaN NaN Fell 720.0 Aarhus Valid H6 56.18333 10.23333 1951-01-01T00:00:00.000
6 NaN NaN Fell 107000.0 Abee Valid EH4 54.21667 -113.00000 1952-01-01T00:00:00.000
10 NaN NaN Fell 1914.0 Acapulco Valid Acapulcoite 16.88333 -99.90000 1976-01-01T00:00:00.000
370 NaN NaN Fell 780.0 Achiras Valid L6 -33.16667 -64.95000 1902-01-01T00:00:00.000

Creation of the Visualizations

Everything looks good.

Now we’ll create our visualizations. First, let’s make one with every row as a single marker. This may be overkill.

markers_map = folium.Map(zoom_start=4, tiles="CartoDB dark_matter")

# create an individual marker for each meteorite
# (iterrows yields the id index with each row, so no lookup by coordinates is
# needed; that lookup would return the wrong id when two rows share a location)
for index, row in geo_df.iterrows():
    latitude = row["reclat"]
    longitude = row["reclong"]
    mass = row["mass"]
    name = row["name"]
    rec_class = row["recclass"]

    # build a small HTML table to show in the marker's popup
    html = f"""
    <table border="1">
        <tr>
            <th> Index </th>
            <th> Latitude </th>
            <th> Longitude </th>
            <th> Mass </th>
            <th> Name </th>
            <th> Recclass </th>
        </tr>
        <tr>
            <td> {index} </td>
            <td> {latitude} </td>
            <td> {longitude} </td>
            <td> {mass} </td>
            <td> {name} </td>
            <td> {rec_class} </td>
        </tr>
    </table>
    """
    iframe = folium.IFrame(html=html, width=375, height=125)
    popup = folium.Popup(iframe, max_width=375)

    folium.Marker(location=[latitude, longitude], popup=popup).add_to(markers_map)

# save the finished map to our output directory
markers_map.save(output_directory + "markers_map.html")