As a final for Social Data Analysis at NYU ITP, I wanted to examine some of our National Parks. I have always had a great appreciation for the outdoors, and for the effort made by the US to establish and preserve these areas for generations to visit and experience. Considering these are typically very scenic areas, Instagram was a great source of data. People take a lot of pictures in these parks, and I wanted to see where in the parks people were actually taking photos. I also wanted to try to get a glimpse of who is coming to the parks. People from all social classes within the US visit the parks, but I wanted to see which international groups visit parks, and if there were differences in demographics between the parks.
I chose Yellowstone, the Grand Canyon, and the Great Smoky Mountains. Yellowstone was the first National Park established in the US, and the Grand Canyon is one of the most famous locations in the world (though not often thought of in terms of its park boundaries). I chose the Smoky Mountains after finding out they were the most-visited park in 2013, a fact that was pretty surprising to me.
After investigating the hashtags most used in each park, I ran a hashtag search through the Instagram API and pulled about 40,000 media items per park. That code is here. After perusing the data, I wanted to start with some timeseries analysis.
This is a graph of the frequency of posts per day using the hashtag #smokymountains. The frequency hovered around 30–40 posts per day, with expected spikes on the 4th of July (90 pics) and dips in the winter, even in the milder climate of Tennessee. The long tail represents some sort of data leak from Instagram, where several photos appeared intermittently, up to two years back. That caused some confusion in further timeseries analysis, which in hindsight could be remedied by collecting data further back in time. For comparison, the same graph from Yellowstone produces this:
Of particular note is the frequency: the 40,000 photos from the Smoky Mountains stretched over one full year, whereas the 40,000 from Yellowstone covered only one month (and note the drop in frequency as the year gets later and temperatures drop). The Smokies may have been the most-visited park, but the rate at which photos are posted is drastically different.
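The per-day counts behind graphs like these can be sketched roughly as follows. This is a minimal sketch, not the code linked above: it assumes each media item is a dict carrying a `created_time` Unix timestamp (as a string), and the `posts_per_day` function and its `since` cutoff are hypothetical names of my own. The cutoff is one simple way to drop the stray older photos that make up the long tail.

```python
from collections import Counter
from datetime import datetime, timezone


def posts_per_day(media_items, since=None):
    """Count media items per calendar day (UTC).

    Assumes each item has a "created_time" Unix timestamp (hypothetical
    field layout). Items dated before `since` (a date) are skipped.
    """
    counts = Counter()
    for item in media_items:
        ts = int(item["created_time"])
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date()
        if since is not None and day < since:
            continue  # drop stray posts from long before the window
        counts[day] += 1
    # Return the days in chronological order
    return dict(sorted(counts.items()))


# Made-up sample data, for illustration only
sample = [
    {"created_time": "1404432000"},  # 2014-07-04 00:00 UTC
    {"created_time": "1404475200"},  # 2014-07-04 12:00 UTC
    {"created_time": "1404518400"},  # 2014-07-05 00:00 UTC
]
daily = posts_per_day(sample)
for day, n in daily.items():
    print(day.isoformat(), n)
```

Dividing the total number of items by the number of days returned gives a quick average posting rate to compare between parks.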
Clearly people are taking a lot of photos related to the parks, but where in the parks are the photos being taken? Using Python and the Basemap library, I plotted the photos that fell within geographic bounding boxes around the parks on top of historic maps.
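The bounding-box step can be sketched like this. It is only an illustration under assumptions: I am assuming each media item has a `location` dict with `latitude` and `longitude` (or no location at all), the function names are hypothetical, and the Yellowstone box is an approximate rectangle I picked for the example rather than the official park boundary. The plotting itself is left out; the resulting points could be handed to whatever mapping library is in use.

```python
def in_bounding_box(lat, lon, box):
    """Return True if (lat, lon) falls inside box = (south, west, north, east)."""
    south, west, north, east = box
    return south <= lat <= north and west <= lon <= east


def geotagged_points(media_items, box):
    """Collect (lat, lon) pairs for items geotagged inside the box."""
    points = []
    for item in media_items:
        loc = item.get("location") or {}  # many posts carry no geotag
        lat, lon = loc.get("latitude"), loc.get("longitude")
        if lat is None or lon is None:
            continue
        if in_bounding_box(lat, lon, box):
            points.append((lat, lon))
    return points


# Approximate, illustrative box around Yellowstone: (south, west, north, east)
YELLOWSTONE_BOX = (44.1, -111.2, 45.1, -109.8)

# Made-up sample data, for illustration only
sample = [
    {"location": {"latitude": 44.46, "longitude": -110.83}},  # near Old Faithful
    {"location": {"latitude": 36.17, "longitude": -115.14}},  # Las Vegas
    {"location": None},                                       # no geotag
]
points = geotagged_points(sample, YELLOWSTONE_BOX)
print(points)
```

A plain rectangle is crude, since it also catches photos taken just outside the park, but it is cheap to compute and good enough for a first look at where photos cluster.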
This map shows activity around Grand Canyon National Park. The photos begin as Route 180 comes north into the park, clustering at the visitor center. The other large cluster is around the Lake Mead National Recreation Area, offset by various remote photos. Out of view on the map, you could see photos trace along 180, with a huge cluster in Las Vegas, just off the left edge of the map.
The Smoky Mountains were somewhat of an anomaly. The photos concentrate in the center of the park, around the main visiting and camping areas, but the Appalachian Trail seems strangely underrepresented. It runs diagonally through the middle of the park, and I had expected to see photos all along the trail. It’s possible that this is not the most picturesque section of the trail, but I doubt that. It’s something I can’t currently explain from this map.
Yellowstone doesn’t offer a ton of surprises. It’s treated as a park many people drive through in a day, as there is a cluster of National Parks in the area. You can see photos come in straight up the road from Grand Teton National Park to the south. The photos continue to trace directly along the roads, especially the main loop in the middle of the park. Of note is Old Faithful, the dense cluster at the bottom left corner of the loop.
Going Forward ::
One of the biggest things I had hoped to discover was the nationalities of park visitors. Through some manual examination of the data, I found groups visiting from countries like Norway, Denmark, Spain, and Tibet, but I would like to take a more programmatic approach to uncovering that information. Collecting more data to perform comprehensive timeseries analysis is another goal, as is developing a more sophisticated method of data collection. Focusing on the geography from the beginning, instead of the hashtag, would produce more comprehensive results. Still, it is an exciting project, one which is near and dear to me, and I will continue to refine my techniques and move this project forward.