Getting to Know Rexburg

Where to live, where to eat, what to do.

Rexburg, Idaho is a small college town in the northern intermountain west of the United States.

A close family member of mine is moving to Rexburg in a couple of months to attend college. He’s never been to Rexburg before. I’ll help him get to know the town and think about where to look for apartments by building and analyzing an unsupervised machine learning model to cluster areas of the town.

I’ll also pull information from the Foursquare API’s categories endpoint to recreate Foursquare’s category hierarchy for my analysis. The rationale for this will become clear in the Methodology section.

Since my objective is to help a prospective college student who’ll be moving to Rexburg get familiar with the town’s residential, food, and recreation options, we’ll somewhat arbitrarily choose a 4km by 4km square centered on the college campus.

The coordinates of the four corners of this square are:

Since each call to the Foursquare API is limited to 100 results, I want each search to cover an area small enough that it captures every venue in it. The “sandbox” version of the API is limited to 950 “regular” API calls (of which explore is one) per day.

To balance the need for granularity with the number of daily API calls I can make, I split the area of interest into a 20 by 20 grid of 200m by 200m sections, and call the API from the center of each grid section. That should be a small enough grid section size to ensure I don’t hit the limit of 100 venues, and at 400 calls it uses less than half of my daily API quota.
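The grid of search points described above can be sketched as follows. This is a minimal illustration, not the code used for the article: the center point is a hypothetical placeholder near Rexburg, the function name `grid_centers` is mine, and the meters-to-degrees conversion is a flat-earth approximation that is reasonable over a few kilometers.

```python
import math

METERS_PER_DEG_LAT = 111_320.0  # approximate length of one degree of latitude


def grid_centers(center_lat, center_lon, side_m=4000.0, n=20):
    """Return (lat, lon) centers of an n-by-n grid covering a square of side side_m."""
    cell_m = side_m / n
    # A degree of longitude shrinks with the cosine of the latitude.
    meters_per_deg_lon = METERS_PER_DEG_LAT * math.cos(math.radians(center_lat))
    centers = []
    for row in range(n):
        # Offset of this cell's center from the middle of the square, in meters.
        dy = (row + 0.5) * cell_m - side_m / 2
        for col in range(n):
            dx = (col + 0.5) * cell_m - side_m / 2
            centers.append((center_lat + dy / METERS_PER_DEG_LAT,
                            center_lon + dx / meters_per_deg_lon))
    return centers


# Hypothetical center point; NOT the coordinates used in the article.
centers = grid_centers(43.82, -111.78)
print(len(centers), "search points, one API call each")
```

Each returned point would be passed as the location of one explore call, so the length of the list is also the number of calls spent from the daily quota.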

The data required cleaning in three main areas.

API calls will overlap, so clean-up will be needed on aisle 4.

Duplicate and out-of-area venues. The radius I passed to the API needed to be large enough to inscribe each grid section in a circle. The searches will overlap, as shown below, so the initial dataset included a substantial number of duplicates, as well as venues outside the border, which needed to be eliminated.
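As a rough sketch of this clean-up step, the snippet below computes the smallest radius that covers a square grid section and then drops duplicate and out-of-area records. The flat record layout (`id`, `lat`, `lng` keys), the function names, and the bounding-box values are assumptions made for illustration; the real API response nests these fields differently.

```python
import math


def search_radius(cell_m):
    """Smallest radius (in meters) of a circle, centered on a square cell, that covers it."""
    # Half the diagonal of the square.
    return cell_m * math.sqrt(2) / 2


def clean_venues(venues, south, north, west, east):
    """Keep the first record for each venue id, and only venues inside the bounding box."""
    seen = set()
    kept = []
    for v in venues:
        if v["id"] in seen:
            continue  # already returned by an overlapping search
        seen.add(v["id"])
        if south <= v["lat"] <= north and west <= v["lng"] <= east:
            kept.append(v)
    return kept


# Toy records with hypothetical coordinates.
raw = [
    {"id": "a", "lat": 43.82, "lng": -111.78},
    {"id": "a", "lat": 43.82, "lng": -111.78},  # duplicate from an overlapping search
    {"id": "b", "lat": 43.90, "lng": -111.78},  # outside the area of interest
]
cleaned = clean_venues(raw, south=43.80, north=43.84, west=-111.81, east=-111.75)
print(round(search_radius(200), 1), [v["id"] for v in cleaned])
```

For a 200m section the covering radius is about 141m, which is why neighboring searches overlap and return the same venue more than once.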

Blank categories. Thirty venue records came back with a blank category. I individually recategorized or deleted these based on whether the venues seemed relevant to our task and on what I could find out about the venue through internet searches.

In what could have been a histogram, this chart shows that a lot of categories only got to play with 1 or 2 venues.

Highly fragmented categories. A look at the distribution of categories in the Rexburg venues data revealed a very long tail: of the 213 unique categories, over 170 appeared in fewer than 5 venue records. This long tail would make it more difficult for the model to cluster venues in these low-frequency categories. Also, venue-level categories proved too granular for some of my analysis. For example, I wanted to see all places relating to food, but “Grocery Stores” and “Mexican Restaurants” were in separate categories, and even at different hierarchy levels.
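A long tail like this is easy to measure once each venue record carries a category name. The sketch below uses only the standard library; the function name `long_tail` and the toy category list are illustrative assumptions, not the Rexburg data.

```python
from collections import Counter


def long_tail(categories, threshold=5):
    """Return the number of unique categories and those seen fewer than threshold times."""
    counts = Counter(categories)
    rare = sorted(name for name, n in counts.items() if n < threshold)
    return len(counts), rare


# Toy data: one category label per venue record.
sample = ["Apartment"] * 6 + ["Pizza Place"] * 2 + ["Park"]
n_unique, rare = long_tail(sample)
print(n_unique, "unique categories;", len(rare), "appear in fewer than 5 records")
```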

Residential (blue), Food (red), and Recreation (green) in Rexburg

To address this issue, I retrieved the category hierarchy from the Foursquare API and added category groups (hierarchy level 1) and venue types (hierarchy level 0) to the dataset, enabling me to roll up the data. This allows visualizations like the one nearby, which shows residential venues in blue, food venues in red, and recreational venues in green.
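One way to do this roll-up is to walk the nested category tree and record, for every category, its level 0 and level 1 ancestors. The sketch below assumes each node is a dict with a `name` and an optional nested `categories` list, which is how I understand the categories response to be shaped; the toy tree, the function name, and the output keys `venue_type` and `category_group` are illustrative.

```python
def build_lookup(tree):
    """Map each category name to its level-0 (venue type) and level-1 (category group) ancestors."""
    lookup = {}

    def walk(node, path):
        path = path + [node["name"]]
        lookup[node["name"]] = {
            "venue_type": path[0],
            # A top-level category has no level-1 ancestor, so it stands in for itself.
            "category_group": path[1] if len(path) > 1 else path[0],
        }
        for child in node.get("categories", []):
            walk(child, path)

    for top in tree:
        walk(top, [])
    return lookup


# Toy hierarchy for illustration only.
tree = [
    {"name": "Food", "categories": [{"name": "Mexican Restaurant", "categories": []}]},
    {"name": "Shop & Service", "categories": [
        {"name": "Food & Drink Shop", "categories": [{"name": "Grocery Store"}]},
    ]},
]
lookup = build_lookup(tree)
print(lookup["Grocery Store"])
```

Joining this lookup onto the venue records by category name gives every venue a coarse type and a mid-level group, regardless of how deep its own category sits in the hierarchy.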

We’ll employ the k-means algorithm to cluster areas of town into groups with similar venues. Since we don’t have neighborhoods, and since most towns in the intermountain west are laid out as a grid, we’ll divide Rexburg up into a grid and use the grid sections to represent “neighborhoods”.
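To turn grid sections into rows the model can cluster, each venue has to be assigned to a section and counted by category. The following is a minimal sketch under assumed names: the bounding box values, the `group` key on each venue, and the helper functions are all hypothetical, and the grid size `n` is left as a parameter.

```python
from collections import Counter


def cell_index(lat, lng, south, north, west, east, n):
    """Return (row, col) of the grid section containing the point, or None if outside."""
    if not (south <= lat <= north and west <= lng <= east):
        return None
    # Points exactly on the north or east edge fall into the last row or column.
    row = min(int((lat - south) / (north - south) * n), n - 1)
    col = min(int((lng - west) / (east - west) * n), n - 1)
    return row, col


def count_features(venues, bounds, n):
    """Count venues per category group within each grid section."""
    counts = {}
    for v in venues:
        cell = cell_index(v["lat"], v["lng"], *bounds, n)
        if cell is not None:
            counts.setdefault(cell, Counter())[v["group"]] += 1
    return counts


# Hypothetical bounding box (south, north, west, east) and toy venues.
bounds = (43.80, 43.84, -111.81, -111.75)
venues = [
    {"lat": 43.801, "lng": -111.809, "group": "Food"},
    {"lat": 43.801, "lng": -111.809, "group": "Food"},
    {"lat": 43.839, "lng": -111.751, "group": "Residential"},
]
features = count_features(venues, bounds, n=15)
print(features)
```

Each section's counter then becomes one row of the feature matrix, with one column per category group and zeros where a group is absent.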

After testing several grid sizes, I selected a 15 by 15 grid for modeling. I looked at inertia and silhouette score across many values of k.

Silhouette score dipped at 2 clusters, which felt like too few, and then again at 8 clusters.

The inertia plot was fairly smooth, with no obvious elbow, although if I squinted a bit, I saw one at k=5.

Since both metrics accommodated five clusters, I used that hyper-parameter in the production model.
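A scan like this can be sketched with scikit-learn, which I am assuming here; the article does not say which library was used. The data below is synthetic (three well-separated blobs), so the numbers it prints say nothing about Rexburg. The function name `scan_k` and the seed are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def scan_k(X, k_values, seed=0):
    """Fit k-means for each k and return (k, inertia, silhouette score) tuples."""
    results = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        results.append((k, model.inertia_, silhouette_score(X, model.labels_)))
    return results


# Synthetic stand-in for the grid-section feature matrix.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in (0.0, 5.0, 10.0)])

results = scan_k(X, range(2, 7))
for k, inertia, sil in results:
    print(f"k={k}  inertia={inertia:.1f}  silhouette={sil:.3f}")
```

Inertia always tends to fall as k grows, so it is read for an elbow, while the silhouette score is read for peaks; looking at both together is what makes a compromise value of k defensible.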

Our production model produces the following clusters of grid sections, with a silhouette score of 0.587.

To suss out the distinguishing features of each cluster, I mapped average venue count per grid section against cluster and venue category types to produce a heatmap.
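The numbers behind such a heatmap are just per-cluster averages. Here is a minimal sketch of that aggregation, assuming one dict of venue counts per grid section and one cluster label per section; the function name and the toy inputs are hypothetical.

```python
from collections import defaultdict


def cluster_profile(cell_counts, cell_labels, categories):
    """Average venue count per grid section for each (cluster, category) pair."""
    sums = defaultdict(lambda: defaultdict(float))
    sizes = defaultdict(int)
    for cell, label in cell_labels.items():
        sizes[label] += 1
        for cat in categories:
            # Sections with no venues in a category contribute zero.
            sums[label][cat] += cell_counts.get(cell, {}).get(cat, 0)
    return {
        label: {cat: sums[label][cat] / sizes[label] for cat in categories}
        for label in sizes
    }


# Toy inputs: venue counts per grid section and the cluster assigned to each section.
cell_counts = {(0, 0): {"Food": 4}, (0, 1): {"Food": 2}, (1, 0): {"Residential": 6}}
cell_labels = {(0, 0): 0, (0, 1): 0, (1, 0): 1}
profile = cluster_profile(cell_counts, cell_labels, ["Food", "Residential"])
print(profile)
```

The resulting cluster-by-category table can be handed to any plotting library's heatmap function; rows with one dominant column are the clusters that are easiest to name.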

Combining the heatmap with a listing of the top 15 venue category groups per cluster, I was able to give the five clusters snappy names and descriptions:

It’s also worth noting that there is absolutely no night life in Rexburg, so you’d better have Netflix or really like homework.

Although we might have been able to get similar information from a few Google Maps searches, this model allows my family member to quickly see what’s going on in Rexburg and get a sense of where to live, where to eat, and what to do when he gets to college.
