Conclusion
The Brewery Project takes an in-depth look at breweries in the US to determine what exactly makes a city a so-called brewery hotspot. As brewery lovers ourselves, we wanted to analyze the different features that contributed to the frequency of breweries in a given area and see if we could build a model to predict the frequency of breweries in a city given its specific features. We were also interested in building a recommendation model for potential brewery owners to assist them in determining a location for a future brewery. Initially, we theorized that certain nearby recreational areas such as national parks and ski resorts would influence the frequency of breweries in nearby cities. We also theorized that city features such as whether the city is considered a college town, major city, or tech hub and the demographics of the city would influence brewery frequency.
After building our dataset, we preprocessed the data and ran exploratory data analysis to visualize it and look for patterns or interesting features. This part of the project can be viewed on the ‘Data Exploration’ tab of our website. Next, we trained and fine-tuned several models to address our research questions. Our model-building process and the model results can be viewed on the ‘Models Implemented’ tab of our website.
The data for this project was compiled from various online resources. Our brewery data comes directly from an online database. City features, namely proximity to national parks or ski resorts and whether a city is considered a major city, tech hub, or college town, were web scraped from online articles. Lastly, city population data was pulled from the US Census’ public database. These sources were combined to understand the key features that predict the number of breweries a city may have and to learn more about what characterizes a brewery hotspot in the US. Initially, we focused on a limited dataset with brewery information and city features. Then, we expanded our dataset according to the results of each subsequent model.
As part of our model implementation process, we first ran PCA to check the relationships between our variables and identify any redundancy in our dataset. We noticed strong patterns in the PCA plots, indicating clustering within the data as well as subsets of data that shared similar characteristics.
Next, we implemented linear regression on 5 selected city features: the count of ski resorts in the state, the count of national parks in the state, and whether the city was considered a tech hub, college town, or major city. This analysis helped us determine which of those features had a significant impact on brewery frequency. Our model found that being a tech hub had the most significant impact on brewery frequency: it was associated with an increase of around 27 breweries on average. However, our best linear regression model indicated that the city features we identified are likely influential but do not capture the entire story.
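To illustrate how a coefficient on a yes/no feature such as "tech hub" is read, here is a minimal sketch using ordinary least squares on synthetic data. The feature names, the rows, and the "true" effects are assumptions made up for this example (the tech hub effect is set to 27 only to echo the figure reported above). This is not our dataset or our fitted model.

```python
import numpy as np

# Hypothetical feature order; the rows below are synthetic, not project data.
FEATURES = ["ski_resorts_in_state", "national_parks_in_state",
            "tech_hub", "college_town", "major_city"]

X = np.array([
    [0, 1, 0, 0, 0],
    [2, 0, 0, 1, 0],
    [5, 3, 1, 0, 1],
    [1, 2, 0, 0, 1],
    [3, 1, 1, 1, 0],
    [0, 0, 0, 1, 1],
    [4, 2, 1, 0, 0],
    [2, 4, 0, 0, 0],
], dtype=float)

# Assumed effects, used only to generate a noise-free toy target.
true_intercept = 3.0
true_effects = np.array([0.5, 1.0, 27.0, 4.0, 10.0])
y = true_intercept + X @ true_effects

# Ordinary least squares with an explicit intercept column.
design = np.column_stack([np.ones(len(X)), X])
solution, *_ = np.linalg.lstsq(design, y, rcond=None)

intercept = solution[0]
coef = dict(zip(FEATURES, solution[1:]))
for name, value in coef.items():
    print(f"{name}: {value:.2f}")
```

Because the toy target has no noise, the fit recovers the assumed effects. The tech hub coefficient is read as the expected difference in brewery count between a tech hub and a non-tech hub city with the other four features held equal.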
With an understanding of how these city features predict the count of breweries in each city, we then assigned each city a rank based on its total brewery count. These ranks (1-6) were then used to train models to classify brewery hotspots based on the basic city features and total population, using a few different methods. The best-performing model used a Naive Bayes classifier. With this model, we were able to classify cities into brewery hotspot ranks reasonably well based on city characteristics and overall population.
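A minimal sketch of how a six-level rank could be derived from a city's brewery count is shown below. The cutoff values in RANK_EDGES and the function name hotspot_rank are hypothetical, chosen only for illustration. They are not the thresholds used in this project.

```python
from bisect import bisect_right

# Hypothetical lower bounds for ranks 2 through 6 (illustrative only).
RANK_EDGES = [2, 5, 10, 25, 50]


def hotspot_rank(brewery_count: int) -> int:
    """Map a city's brewery count to a rank from 1 (fewest) to 6 (most)."""
    if brewery_count < 1:
        # Cities without breweries are outside the scope of the ranking.
        raise ValueError("brewery_count must be at least 1")
    return bisect_right(RANK_EDGES, brewery_count) + 1


for count in (1, 2, 30, 80):
    print(count, "->", hotspot_rank(count))
```

The resulting rank is the label a classifier would then be trained to predict from city features and population.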
In an attempt to create a model with better performance in classifying cities into our hotspot ranking system, we decided to use all available features. These included the number of breweries in a state, the type of breweries in a city, information on nearby attractions such as colleges, metropolitan areas, national parks, and ski resorts, as well as census data and region designations. We only excluded variables we knew would cause collinearity issues. After testing across different data transformations and model parameters, we were ultimately able to create a model with better performance than the prior model with reduced features.
This provides evidence that having more information about the demographics and existing brewery scene of a city does result in a better hotspot ranking classification. Although the number of features was notably greater than in the reduced model, they are not too elusive for an invested business owner to find, or even estimate, if they are curious about placing a brewery in a given city.
We can conclude that our models performed better when information about the current breweries in the city and details about the population were incorporated. While this is significant, it may be unrealistic for a typical business owner to use such a complex, code-based model. With that in mind, we produced a more user-friendly decision tree that is easily readable and actionable for business owners. This tree works with basic market research on existing breweries and local demographics to sort prospective business locations into hotspot ranks 1-6. Rank 4 and 5 hotspots are strong contenders for brewery locations. Rank 6 is a demonstrated top-tier hotspot, meaning the city already supports many breweries, although this may also mean more competition.
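As a rough sketch of what a readable decision tree looks like in practice, the example below fits a small tree with scikit-learn and prints its rules as plain text. The two feature names, the six toy cities, and their ranks are assumptions invented for this example. It is not our published tree, which uses more features and all six ranks.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features and synthetic cities (illustrative only).
feature_names = ["population_thousands", "existing_breweries"]
X = [
    [20, 1],
    [35, 2],
    [150, 8],
    [220, 12],
    [900, 40],
    [1500, 75],
]
y = [1, 1, 3, 3, 6, 6]  # toy hotspot ranks

# A shallow tree keeps the printed rules short enough to read by hand.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# Plain-text if/else rules that a non-programmer can follow.
rules = export_text(clf, feature_names=feature_names)
print(rules)

# Rank predicted for a hypothetical city of 180,000 people with 10 breweries.
predicted_rank = int(clf.predict([[180, 10]])[0])
print("Predicted rank:", predicted_rank)
```

The printed rules can be followed top to bottom like a checklist, which is the property that makes a tree easier to hand to a business owner than a fitted classifier object.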
Even with decent performance in both the minimal feature and expanded feature models, we had several limitations due to availability of data that resulted in us using a more macro approach than we had previously expected. The brewery data itself contained information such as zip code, latitude and longitude that did not make the final modeling dataset as features. Cities, especially those on the larger end, have distinguishable pockets of demographics, industries, and customers. Even being able to discern our data at the granular level of zip codes would provide improved insight for potential business owners and brewery goers.
Part of this issue stemmed from aggregating the data between all of our sources in a way useful for modeling. An extension of this project would be to gather data in a manner conducive to providing finer details about different locales and communities within a city.
Another potential shortcoming of our model is the hotspot ranking system. Although we were overall satisfied with our method of defining hotspots by the number of breweries in a city, we recognize that it wasn’t a perfect system. Another extension of this project could be refining the definition and ranking system of a brewery hotspot. This could involve applying our existing algorithm at the more granular level of zip codes or locales mentioned above, adding features such as metrics based on brewery ratings, or even adopting a more complex ranking system. We were also limited to cities we knew had breweries, so we weren’t able to include cities without breweries in our models.
To continue with the idea of granularity improving recommendations for multiple facets within the brewing community, further research could include data associated with pricing and information about ease of import and export to certain areas.
Limitations aside, we have created models that brewers and consumers can use to make informed decisions. Even given limited information about a city, someone setting up shop or deciding where to grab a beer can rely on this research.