NDA terms are in progress. Full post to follow.
As people spend more and more of their lives "connected" and the number of digital interactions keeps climbing, a staggering amount of data is being generated every second of every day (think credit card transactions, geospatial location data, satellite imagery, Internet of Things (IoT) devices, app "check-ins," Facebook likes/dislikes, etc.). A mobile advertising company has amassed data on households across the USA, ranging from geospatial location data to retail purchases and purchase frequency to hobbies and interests. Our job was to sift through that data and segment the customer base to uncover specific behavioral patterns and trends.
The data set provided contained approximately 150 million observations and more than 950 features. The data was split into Parquet files and uploaded to an Amazon Web Services (AWS) S3 bucket. An AWS instance was set up through Databricks, and the S3 bucket containing the Parquet files was linked to our Databricks account.
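
As a rough illustration of that setup, here is a minimal PySpark sketch of how Parquet files in a linked S3 bucket might be read from a Databricks notebook. The bucket name, prefix, and helper functions are hypothetical (the actual names are not given here), and the sketch assumes the notebook's built-in `spark` session and a cluster that already has access to the bucket.

```python
# Hypothetical bucket and prefix, used only for illustration.
BUCKET = "example-ad-data"
PREFIX = "households/parquet"


def s3_uri(bucket: str, prefix: str) -> str:
    """Build an s3a:// URI pointing at the folder that holds the Parquet files."""
    return f"s3a://{bucket}/{prefix.strip('/')}/"


def load_households(spark, uri: str):
    """Read every Parquet file under `uri` into a single Spark DataFrame.

    `spark` is the SparkSession that Databricks notebooks provide by default.
    """
    return spark.read.parquet(uri)


def summarize(df):
    """Return (row count, column count) as a quick sanity check on the load."""
    return df.count(), len(df.columns)
```

In a notebook, this could be used as `df = load_households(spark, s3_uri(BUCKET, PREFIX))` followed by `summarize(df)` to check that the observation and feature counts are in the expected range. Note that `df.count()` triggers a full scan, so on a data set of this size it may take a while.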