The goal of this mini project was to analyze crime records for the City of Virginia Beach while trying out tools I had never used before. Data analysis often highlights information gaps, strengths, and weaknesses, and points the way forward.
Data was sourced from the City of Virginia Beach website: https://data.vbgov.com/Public-Safety/Police-Incident-Reports/iqkq-gr5p and was saved in a csv file.
Before I dive into data, I normally ask myself, "What is it that I want to achieve with this dataset?" I relate this process to Lo-Fi prototyping, which I learned as a Master's student in my Mobile App Development class; similar principles can be applied here. With the crime statistics data, I wrote down a few basic questions that I hoped would be answered.
Often considered tedious and boring, this is where most of the learning takes place. If you get this step right, there are a lot of fun things you can do with your data. The first step in preprocessing is identifying the libraries you will need, which correlates with Step 2 of Lo-Fi prototyping.
If you know what you want from your data, you can determine the libraries you need to get it done.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
%matplotlib inline
Now that we have imported the essential modules, let's use pandas' built-in function to read a CSV file into a DataFrame. The DataFrame will let us inspect the contents of the file.
#reading a csv file
file = pd.read_csv("./file.csv", encoding="unicode_escape")
#displaying the columns and the first five rows
file.head()
#in order to display the end of the file, you can use file.tail()
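If the dataset is not at hand, the same pd.read_csv call can be tried on an in-memory CSV via io.StringIO. The column names below mirror the ones used later in this project, but the two rows are invented purely for illustration.

```python
import io

import pandas as pd

# An in-memory stand-in for the real CSV; note the quoted field, since
# offense names like "ASSAULT, SIMPLE" contain a comma.
sample_csv = io.StringIO(
    'Date Reported,Offense Description\n'
    '01/15/2019,"ASSAULT, SIMPLE"\n'
    '02/10/2019,LARCENY\n'
)

df = pd.read_csv(sample_csv)
print(df.head())
```

The quoting matters here: without it, pandas would split "ASSAULT, SIMPLE" into two fields and the row would no longer line up with the header.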
Let's look at the column names first and see which of these need to be changed. Renaming makes the data easier to work with in Python (no spaces in identifiers). There are two columns I want to focus on for the purpose of this sample project: Date Reported and Offense Description.
#Display the names of the columns
file.columns
#Rename the columns
file = file.rename(columns = {"Date Reported": "Date_Reported", "Offense Description": "Offense_Description"})
#Display the new names of the columns.
file.columns
Next, I want to add a new column called "Year" so that it will be easier for me to visualize some of the trends over the years.
# pandas .insert function will do the trick. We have to specify the place of the new column, the name,
# and the value it will hold.
file.insert(3, 'Year', 'Any')
# pandas datetime index will help us extract the year from the Date_Reported Column and .head() will display
# the changes we made.
file['Year'] = pd.DatetimeIndex(file['Date_Reported']).year
file.head()
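The insert-then-overwrite step above can also be collapsed into one pass: a hedged alternative is to parse the dates once with pd.to_datetime (with errors='coerce' so malformed dates become NaT instead of raising) and pull the year from the .dt accessor. The column name matches the renamed dataset; the rows here are made up.

```python
import pandas as pd

# Toy frame with the same column name as the renamed dataset.
df = pd.DataFrame({"Date_Reported": ["01/15/2019", "07/04/2018", "not a date"]})

# Parse once; unparseable values become NaT rather than raising an error.
parsed = pd.to_datetime(df["Date_Reported"], format="%m/%d/%Y", errors="coerce")

# .dt.year is float-valued when NaT is present; nullable Int64 keeps it tidy.
df["Year"] = parsed.dt.year.astype("Int64")

print(df["Year"].tolist())
```

The coerce option is worth having on real incident data, where a handful of hand-entered dates are often malformed.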
#Seaborn is a data visualization library built on Matplotlib. I love
# that you can create simple count plots in a few lines of code. Here's one below.
# Drop rows with missing values; keep Offense_Description, since we use it next.
crime = file.dropna(axis=0)
sns.countplot(data=crime, x='Year')
# In the column name, "Offense_Description" we see a lot of different categories or names of crimes. Let's
# see how many kinds there are
print('There are {} types of crimes in this dataset'.format(len(set(crime['Offense_Description']))))
file['Offense_Description'].unique()
# Let's look at the top 10
file.Offense_Description.value_counts().head(10)
#Now I want to focus on only the current year i.e. 2019 and look at the 10 highest categories of crime rates.
file = file.loc[file['Year'].isin([2019])]
file.Offense_Description.value_counts().head(10)
file.Offense_Description.count()
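The .loc/.isin pattern above keeps only the rows whose Year matches; a toy version with made-up rows shows the shape of it.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2018, 2019, 2019, 2017],
    "Offense_Description": ["LARCENY", "ASSAULT, SIMPLE", "LARCENY", "VANDALISM"],
})

# Keep only 2019; .isin also accepts several years at once, e.g. [2018, 2019].
recent = df.loc[df["Year"].isin([2019])]

print(recent["Offense_Description"].value_counts().head(10))
```

Passing a list to .isin is what makes this more flexible than a plain equality test: widening the window to two years is a one-character change.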
# Visualizing the first five.
plt.subplots(figsize = (8,4))
top5 = file.Offense_Description.value_counts()[:5]
sns.barplot(x=top5.index, y=top5.values)
plt.xticks(rotation = 90)
plt.xlabel('Type of Crime')
plt.ylabel('Number of Incidents')
plt.title('Crime Record in Virginia Beach')
plt.show()
# Using the datetime index like earlier, we can see the month-wise distribution. A simple bar
# graph shows that the highest crime count was in the 5th month, i.e. May.
file.insert(4,"Month","Any")
file['Month'] = pd.DatetimeIndex(file['Date_Reported']).month
file.head()
sns.countplot(data=file, x='Month')
file.insert(5,"Day","Any")
file['Day'] = pd.DatetimeIndex(file['Date_Reported']).day
file.head()
sns.countplot(data=file, x='Day')
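The "highest month" observation can also be checked programmatically rather than read off the chart: value_counts().idxmax() returns the most frequent value. A toy example with made-up month numbers:

```python
import pandas as pd

# Made-up month numbers standing in for the extracted Month column.
months = pd.Series([5, 5, 5, 1, 2, 5, 12, 7])

# idxmax on the counts gives the month with the most incidents.
peak_month = months.value_counts().idxmax()
print(peak_month)  # → 5, i.e. May
```

This is handy as a sanity check when bars in a crowded countplot are close in height.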
plt.subplots(figsize = (10,6))
top5 = file.Offense_Description.value_counts()[:5]
sns.barplot(x=top5.index, y=top5.values)
plt.xticks(rotation = 90)
plt.xlabel('Type of Crime')
plt.ylabel('Number of Incidents')
plt.title('Crime Record in Virginia Beach 2019')
plt.show()
I was interested to see which crimes occurred and where. I pulled the most readily available shapefile from the City of Virginia Beach GIS center. Link to the GIS database: https://data-vbgov.opendata.arcgis.com/
In order to visualize the locations and load those maps, I needed to import GeoPandas and a few other dependencies.
import descartes
import geopandas as gpd
from shapely.geometry import Point
#Let's look at the shape files for VB.
map = gpd.read_file('./City_Boundary/City_Boundary.shp')
figure,ax = plt.subplots(figsize = [7,7])
map.plot(ax=ax)
#Let's focus on latitudes and longitudes.
geometry = [Point(xy) for xy in zip(file["lon"], file["lat"])]
geometry[:3]
# WGS84 latitude/longitude; the dict form {'init': 'epsg:4326'} is deprecated.
crs = 'EPSG:4326'
geo_df = gpd.GeoDataFrame(file,crs=crs,geometry=geometry)
geo_df.head()
# From the "Offense_Description" column I can choose any category to map and have it
# overlay the map/shapefile we loaded. I have chosen 'ASSAULT, SIMPLE', but you can change
# that to any other category and overlay it the same way.
fig,ax = plt.subplots(figsize = [7,7])
map.plot(ax=ax, alpha=0.4, color='grey')
geo_df[geo_df['Offense_Description'] == 'ASSAULT, SIMPLE'].plot(ax=ax, markersize=10, color='blue',marker='o',label='OD')
plt.legend(prop={'size': 6})
At the beginning of this mini project I asked a few questions I hoped to answer, and with these few techniques I did get my answers.
Simple bar graphs can say a lot about data. Again, this was a small afternoon project. I learned something new about GeoPandas, open-data GIS, and Google's Geocoding API. I am going to write about Google's geocoder and using it with Microsoft Excel in a blog post. I found it really fascinating. There are probably better/faster ways to get latitudes and longitudes, but the goal at the end of the day is to learn something new every day.