Recently, Facebook has been in the news almost daily for its failure to prevent Cambridge Analytica (CA) from gathering personal data from 87 million users. CA used the harvested data to profile users and predict voting patterns in the 2016 US presidential election. Some important dates in the affair are:
- March 16, 2018: The news of the scandal broke.
- March 21, 2018: Mark Zuckerberg (CEO of Facebook) accepted responsibility for failing to protect users’ data and assured that it would never happen again.
- March 25, 2018: Facebook took out full-page ads in several major US and UK newspapers.
- March 26, 2018: The Federal Trade Commission (FTC) opened an official investigation into the breach of a 2011 settlement which had required improvements to Facebook’s privacy settings.
- March 27, 2018: Zuckerberg declined an invitation to speak to the UK parliament and offered to send one of his deputies.
- April 10, 2018: Zuckerberg testified to the US Congress regarding the illicit sharing of user data and took personal responsibility (Hicks, 2018).
This chain of events provided the motivation for this project. The aim of the project is to build a data warehouse (DW) that collects tweets, social media mentions, news articles and blog discussions about Facebook, and to measure the effect of the social media discussion on its stock price.
Data Warehouse Architecture
The DW architecture is structured into multiple layers to reduce the load on the data sources while optimizing performance and supporting the data migration needed to meet business intelligence (BI) needs. It consists of five layers, from the data sources through dimensional modelling to the layer for the analysts (Lih Ong et al., 2018).
The DW is built using the bottom-up, or Kimball, approach. It consolidates data into an integrated, time-variant, subject-oriented and non-volatile collection that supports BI queries. Data is cleaned and made consistent before being stored in staging data marts. These are then divided into different dimensions by subject, and the sum of these alongside the fact table constitutes the DW. This process helps to measure the effect of social media on stock prices to support decision making. The collected data is separated into key subject areas such as tweets, stock and mentions.
A detailed logical model is created for reporting on the stock price, tweets and mentions, with all the details about each entity. Data redundancy is avoided to maintain a single, unambiguous version of the truth and to ease data loading. The timeframe is discrete, with snapshots at points in time. The structure of the data is non-metric and is transformed to meet varied and multiple information needs. The DW is also able to scale with growing scope and changing requirements; e.g. the term “Facebook” can be replaced by any other technology company and the DW will still present a complete view of the information needed to answer the queries.
Further, the reporting requirement was a deciding factor: it is tactical, with relatively stable sources of data. Dimensions are built for specific subjects with denormalized data to aid reporting (Breslin, 2016).
Figure 1: Architecture
Data Source Layer
The source layer is where all the data that will be stored in the data warehouse is assembled, in its unprocessed form. The data is collected from three sources: Twitter, Yahoo Finance and Mention.com. The purpose is to hold the datasets, in a variety of formats from different sources, in a staging database for a short period of time for analysis before migrating them to the next layer. This establishes a single point of extraction with operational fields such as a timestamp, and the data is structured to mirror the source, including its columns, repeating values and denormalization. It also enables data to be retained for problem detection and auditing. The layer holds data that is of value both while acquiring data and while designing the star schema; for example, facts and statistics about Facebook are saved but not used later for dimensional modelling.
Twitter is a social networking platform for exchanging short messages. Since its inception in 2006, it has become an extremely influential channel for real-time information exchange and broadcasting news. It offers Application Programming Interfaces (APIs) for retrieving data about its users and their tweets. The Search API is used for querying and filtering message content, and the REST API for accessing the core primitives of the Twitter platform.
Yahoo Finance offers news about stock quotes, financial reports and press releases.
Mention is a media monitoring tool which helps businesses track conversations in social media, blog comments and user reviews. For the project, an account with a 14-day free trial was created to track mentions matching the brand “Facebook”. The data was collected for the date range 25th March 2018 to 24th April 2018.
The data from Twitter and Yahoo Finance is extracted programmatically with Python scripts through their REST (Representational State Transfer) APIs (Application Programming Interfaces).
For Twitter, an application is created and registered to interact with its API. After a name and description are chosen for the app, Twitter issues a private consumer key, consumer secret, access token and access token secret. These have read-only permissions by default and should not be shared with anyone. The extraction of tweets is limited to 15 requests per 15-minute window per access token, as required by Twitter’s terms and conditions (Bonzanini, 2015).
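As a hedged sketch of this step (the project’s actual script is shown in Figure 2), the request construction might look as follows. The v1.1 Search endpoint URL reflects the API as documented at the time; the bearer-token header and the `fetch_page` helper are illustrative assumptions, not the project’s code:

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

def build_search_url(query, count=100, max_id=None):
    """Build a Search API request URL; max_id pages backwards through results."""
    params = {"q": query, "count": count, "lang": "en"}
    if max_id is not None:
        params["max_id"] = max_id
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)

def fetch_page(url, bearer_token):
    """One authenticated request (app-only auth assumed). The caller should
    back off when the 15-requests-per-15-minutes window is exhausted."""
    req = urllib.request.Request(
        url, headers={"Authorization": "Bearer " + bearer_token})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

In practice the keys issued during app registration would be exchanged for the token before any page is fetched.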
Figure 2: Excerpt of Script for Retrieving Tweets for Facebook
Figure 3: Excerpt of JSON file of Tweets
Yahoo Finance does not allow EOD (End of Day) data to be downloaded without a “crumb” for authentication, paired with a cookie. A script is written to download daily stock quotes with a matching crumb and cookie. The historical data is downloaded as a CSV (Comma-Separated Values) file in chronological order, ranging from 4th April 2016 to 3rd April 2018.
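A minimal sketch of the crumb handling, assuming the `CrumbStore` pattern Yahoo embedded in its quote pages at the time of writing; both the pattern and the v7 download endpoint may since have changed:

```python
import re

def extract_crumb(page_html):
    """Pull the crumb out of the quote-page HTML. The caller must reuse the
    cookie set by the same page request when downloading the CSV."""
    match = re.search(r'"CrumbStore":\{"crumb":"(.*?)"\}', page_html)
    if match is None:
        raise ValueError("no crumb found in page")
    return match.group(1)

def build_download_url(symbol, start_ts, end_ts, crumb):
    """Historical-data URL; period1/period2 are Unix timestamps."""
    return ("https://query1.finance.yahoo.com/v7/finance/download/"
            f"{symbol}?period1={start_ts}&period2={end_ts}"
            f"&interval=1d&events=history&crumb={crumb}")
```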
Data Integration Layer
Unprocessed data is loaded into the integration layer from the source layer. The data is consolidated, checked for quality, rationalized, cleansed and prepared for downstream consumption. The purpose of this layer is to process the data acquired from the different sources into a format suitable for Microsoft SQL Server (a Relational Database Management System (RDBMS) that uses Structured Query Language (SQL)).
The retained data is assessed for quality using scientific and statistical processes to determine whether it meets the standard required for the project and whether data-dependent processes will work as expected. Checks are made to determine whether the data is of the right type, of sufficient quantity, accurate, reliable and able to support its intended use. This also helps in planning data cleansing and enrichment strategies.
ETL (Extract, Transform and Load) is performed to translate data from the different sources into useful content for the data warehouse. For extraction, the data is read into Jupyter Notebooks and a logical data map is created to record the relationship between start and end points before the data is transformed. For transformation, data formats are converted to Excel format; e.g. tweets are extracted from JSON, converted to CSV for processing, and then to Excel for loading to match the requirements of the SQL server. Other transformations include finding and correcting incorrect data, removing duplicated records, correcting misspellings, handling missing values and deriving new features. The data is subsequently loaded into a database in Microsoft SQL Server. The strategy is explained further below in this report.
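The ETL flow described above can be sketched with the standard library. The field paths in `LOGICAL_DATA_MAP` are hypothetical examples of Twitter attributes, not the project’s actual map:

```python
import csv
import json

# Hypothetical logical data map: path into the source JSON -> target column.
LOGICAL_DATA_MAP = {
    ("id_str",): "tweet_id",
    ("text",): "text",
    ("lang",): "lang",
    ("user", "screen_name"): "user_screen_name",
}

def extract(json_path):
    """Read one JSON record per line from the raw extract."""
    with open(json_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def transform(records):
    """Apply the logical data map (flattening nested objects) and dedupe."""
    rows, seen = [], set()
    for rec in records:
        row = {}
        for path, target in LOGICAL_DATA_MAP.items():
            value = rec
            for key in path:  # walk nested JSON, e.g. user -> screen_name
                value = value.get(key, "") if isinstance(value, dict) else ""
            row[target] = value
        if row["tweet_id"] not in seen:  # remove duplicated records
            seen.add(row["tweet_id"])
            rows.append(row)
    return rows

def load(rows, csv_path):
    """Write the cleansed rows out as CSV for the staging database."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(LOGICAL_DATA_MAP.values()))
        writer.writeheader()
        writer.writerows(rows)
```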
Third-Normal Form Data Model
The data is normalized into third normal form (3NF) so that there are no transitive functional dependencies. This promotes data integrity, provides a high degree of flexibility, tunes the model for load performance and minimizes redundancy. The data is processed to an atomic, well-defined state while retaining all features from which other values can be derived. The records are designed to be a single trusted source according to the business logic (Hay, 2011).
Further, tweets and data from Mention.com consist of different components, namely users, their updates or text, timelines and the relationships between and within these components. A user is defined as someone who can write a tweet or retweet another user’s text (Ben Kraiem et al., 2015). The user has an accumulated view of their activities according to the timeline, with descriptions of text, status counts and relationships with their friends and followers. Therefore, the semi-structured format of the files is not suitable for OLAP, because the logical model (the relationships between the different dimensions and the fact table that form a star schema) requires constraints such as no missing values, atomicity and homogeneity (Rehman et al., 2012).
However, the retained data sources are temporal and non-volatile and maintain accumulative features, and they are therefore suitable for multidimensional aggregation.
In this layer, the data is denormalized for re-use, ease of navigation and efficient business intelligence performance. The tables form a “star schema” and are populated with trusted and consistent data, derived dimensional attributes, and aggregated and computed measures.
Business Intelligence Layer
This layer gives end users access to query the data and produce reports for decision making. The business intelligence tools are Tableau and SSAS (SQL Server Analysis Services). These provide OLAP (Online Analytical Processing) solutions and a secure interface that integrates data from different sources for faster and deeper analysis.
Data Warehouse Data Model
The schema is shown below with the different dimension tables and a fact table containing a periodic snapshot. The schema represents many-to-many and many-to-one relationships in the snapshot data.
The dimension tables are created after identifying the different subjects and are the entry points for getting to the facts; e.g. stock price is one domain and tweets are another. Further, the dataset about tweets is a JSON object with a hierarchical structure, so it is flattened to extract the attributes for TweetDim. TweetDim and MentionDim contain surrogate keys converted from the primary keys of the Tweets and Mentions tables in the staging database. The schema is designed to decrease upkeep and increase user accessibility; however, data integrity is not enforced. DateDim stores one record per day, with month, year and quarter. Its date column is parsed in the same format as in the other dimensions and as in the tables of the entity relationship diagram.
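The surrogate-key conversion can be sketched as follows; the function and field names are illustrative, not the project’s actual code:

```python
def assign_surrogate_keys(rows, natural_key):
    """Replace the source primary key with a compact integer surrogate key.
    Returns the keyed rows plus the natural->surrogate lookup that is reused
    when the fact rows are loaded."""
    lookup = {}
    keyed = []
    for row in rows:
        nk = row[natural_key]
        if nk not in lookup:
            lookup[nk] = len(lookup) + 1  # 1-based surrogate key
        keyed.append({"sk": lookup[nk], **row})
    return keyed, lookup
```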
The dimension tables for Tweet and Mention do not contain all the attributes listed in the ER diagram above, because many attributes have missing values or duplicate records; e.g. geography and co-ordinates for tweets were collected only from the USA and hence held the same value in every row. The country in which a mention originated is also missing. TweetDim, MentionDim and StockPriceDim contain fields from the tweets, mentions and stock price datasets. The attribute names in the dimensions are kept the same as in the source tables because these are the conventions used by Twitter, Yahoo Finance and Mention.com; the idea is to keep consistency with the data dictionaries documented by the three sources.
For Twitter, the data dictionary address is available at https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object
For Yahoo Finance, the information about the attributes is available at https://www.nasdaq.com/investing/glossary/
For Mention.com, more information can be found at https://mention.com/en/enterprise/insights/
Figure 5: Star Schema
FactDim in the schema is a periodic snapshot table which captures the measures daily. Measures such as the number of tweets, the number of mentions and the count of retweets per tweet are rolled up to present their occurrence at the daily grain. It also contains records such as the count of languages for the tweets, to capture different nuances in the data. The fact table is designed this way because the data ranges from a predetermined point in time tied to a specific event. The granularity of the stock price is daily, hence it is essential to roll up the social media conversation to daily as well. The measures are both additive and semi-additive, with the ability to use aggregate functions such as SUM, MAX or AVG to answer queries about the trend over time. Unlike a transaction table, which answers “how many” questions, it helps to answer “how much” questions after a good understanding of the sources. A row in the fact table summarizes measurements daily rather than recording individual transactions, and contains facts such as the number of tweets and mentions and the daily closing, opening and low share prices (Kimball and Ross, 2013, p. 44). The reason for precomputing different aggregations and storing them in the fact table is to increase the performance of the BI tool and to report counts, statistics and totals that provide insights. “Date” is also not the sole dimension for the semi-additive facts, which are compared against the daily stock price. The fact table also contains all the surrogate keys; a decision was taken not to create a separate key look-up table because no new dimension entity or Type 2 (Slowly Changing Dimension) handling was required for the project. Furthermore, keeping all the keys with the measures decreases random access for any incoming fact record.
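The daily roll-up described above can be sketched with the standard library; the record layouts and field names are illustrative assumptions, not the project’s actual schema:

```python
from collections import defaultdict

def build_daily_fact_rows(tweets, mentions, prices):
    """Roll tweet- and mention-level records up to the daily grain and join
    them with the daily stock quote."""
    facts = defaultdict(lambda: {"tweet_count": 0, "retweet_total": 0,
                                 "mention_count": 0})
    for t in tweets:
        f = facts[t["date"]]
        f["tweet_count"] += 1                 # additive measure
        f["retweet_total"] += t["retweet_count"]
    for m in mentions:
        facts[m["date"]]["mention_count"] += 1
    rows = []
    for p in prices:                          # keep only days with a quote
        if p["date"] in facts:
            rows.append({"date": p["date"], **facts[p["date"]],
                         "open": p["open"], "close": p["close"]})
    return rows
```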
The process starts with extracting the data from Twitter and Yahoo Finance using Python scripts. The extracted data is downloaded in JSON and CSV format respectively. These files are checked for data quality and usability using Python libraries in Jupyter notebooks. The JSON file is converted to CSV before exploration and cleansing. The data from Mention.com is downloaded in CSV format, and all three files are stored in the same folder. A logical data map is created in an Excel file after the Twitter data is formatted as CSV, and is updated regularly as findings are made.
Figure 6: Logical Data Map
Data cleaning involved the following:
Twitter: the file is converted to CSV, and then user, entities and location are extracted from the data to obtain more features and a flatter file. Columns which are empty or cryptic, contain missing values, or do not conform to the rules are deleted. The timestamp (created_at) is then parsed to extract the date, which is also reformatted from hyphens to slashes to match the format of the data extracted from Yahoo Finance. Some columns are deleted because they contain the same value in every record. No imputation of missing records is done, for lack of domain knowledge. The text of the tweets is not altered and is saved as an attribute of TweetDim.
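The created_at parsing might look like the following sketch; Twitter’s timestamp layout is as documented, but the slashed day/month/year target format is an assumption about the project’s chosen layout:

```python
from datetime import datetime

def tweet_date(created_at):
    """Parse Twitter's created_at (e.g. 'Tue Mar 27 09:15:02 +0000 2018')
    and reformat the date with slashes for consistency with the other
    extracts."""
    return (datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
            .strftime("%d/%m/%Y"))
```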
Figure 7: Jupyter Notebook for tweets
The Yahoo Finance data had no “dirty data”. The Mention.com data is cleansed by checking for missing values and duplicates, and its timestamp (published_at) is parsed. The code from the Jupyter notebooks is written back into Python scripts so that the extracted files are cleansed and put into the right format automatically.
Figure 8: Script to extract and Transform data Programmatically
The cleansed files are stored in a staging database in MS SQL Server. The CSV files are converted to Excel files because of Unicode and truncation errors when reading the files into the OLE DB destination. The categorical variables are also converted to the DT_STR string type, as SQL Server does not handle mixed text and integer columns well. Special attention is paid to the data types of the columns, which have to be changed to meet the requirements of the SSIS (SQL Server Integration Services) package.
Figure 9: Staging Table and Loading Tables in MS SQL Server
Figure 10: Staging Table for Tweets Data
Also, the size of certain fields has to be set to the maximum to meet the requirements of the SSIS package. DateDim in the staging area was created using an XML script.
Figure 11: Data Conversion
The fact table contains repeating measurements described daily. To create and load it, data is extracted from the staging database. The data from the Tweets table in ProjectDB (the staging database) is aggregated by grouping on “Date” and “ID” and then averaging the counts of retweets and favourites. The date type has to be converted to DT_DATE and is then used to sort the rows before the merge join with the Mentions table.
A merge join specifying “inner join”, with “Date” as the join key, is used to combine sorted data from more than one OLE DB source into one table before sending it to the OLE DB destination. It is used instead of a lookup in order to return all the matching rows from each input: a lookup finds rows matching an input column (only one match per row, and the data need not be sorted), whereas a merge join returns all matching rows.
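The behaviour of the SSIS Merge Join can be mimicked in Python for intuition; this is an illustrative sorted two-pointer join, not part of the project’s pipeline:

```python
def merge_join(left, right, key="Date"):
    """Inner merge join over sorted inputs: every pairing of rows with equal
    keys is emitted, unlike a Lookup which returns one match per row.
    Keys must compare consistently (e.g. dates in a sortable format)."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1
        elif left[i][key] > right[j][key]:
            j += 1
        else:
            k = left[i][key]
            li, rj = i, j  # find the runs of equal keys on both sides
            while li < len(left) and left[li][key] == k:
                li += 1
            while rj < len(right) and right[rj][key] == k:
                rj += 1
            for a in left[i:li]:
                for b in right[j:rj]:
                    out.append({**a, **b})
            i, j = li, rj
    return out
```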
Figure 12: Fact Table ETL
Data is extracted from the Mentions table in the database and aggregated by grouping on “Date” and “Mention_id”; direct_reach (the number of people in the community reached directly) and score (measuring the impact of the text of the mention) are then averaged for the day. The date type is likewise changed and then used to sort the data for the inner join with the tweet data. The stock price data is simply sorted and then joined with the merged table of tweets and mentions. The resulting table is loaded into a fact table in an MS SQL Server database called DimensionDB. The process is automated and can be run daily on a rolling basis because the calculations for the fact table are not complex (Kimball and Ross, 2013). The different dimension tables are also loaded into DimensionDB by running the SSIS package.
From the staging database, there are 1299 records in Tweets (ranging from 25th March 2018 to 2nd April 2018), 3287 records in DateDim (ranging from 1st January 2010 to 31st December 2018), 504 records in StockPrice (ranging from 4th April 2016 to 3rd April 2018) and 13007 rows in Mentions (ranging from 25th March 2018 to 24th April 2018). The fact table has data for only 4 days because data was not available for every day from every source. However, the ETL framework is automated and can be used in the future for collecting and analysing data.
Figure 13: ETL in SSIS Package
The package extracts, transforms and loads data into the multiple dimension tables and the fact table automatically, alongside the Python scripts. A cube is also created using SSAS to allow fast analytics using pre-calculated aggregations.
Application of Data Warehouse
The data is loaded into Tableau to answer the following questions for decision making.
How much effect did the number of tweets have on the number of shares that changed hands between buyers and sellers?
Figure 14: BI Query 1
The data sources for the above query are tweets and stock prices. The plot shows that as the number of tweets about Facebook increased, the volume of traded shares dropped.
Figure 15: Table Query 1
27th March was the day on which the most tweets were posted. This could be because Facebook took out newspaper advertisements to apologize on 26th March. The volume of shares traded also dropped nearly 37% from the previous day. This could be because of the increase in the number of tweets or because of the investigation initiated by the FTC (Hicks, 2018). However, as time passed, the number of tweets decreased. 2nd April 2018 was a Monday, and the reduction in the number of shares traded could be because investors wanted to wait and see the public’s reaction.
How much effect did the score have on the opening and closing price on a given day?
Figure 16: BI query No. 2
The above query is generated from the mention and stock price data sources. The plot shows that the score, which measures the impact of the text of a mention, had no effect on the opening and closing price of Facebook shares.
What was the sentiment of the discussion in social media?
Figure 17: BI query No. 3
The above plot is a sample query to discover the sentiment of social media mentions during the days when Facebook was in the news almost daily. The plot uses data from mentions and tweets to find that over these four days tweets were made in 18 different languages and the overall sentiment was neutral. It also shows that tweets in English were retweeted the most. On 27th March 2018, the most-retweeted tweets were from users who had registered their language as English (Great Britain).
Figure 18: Table for BI query 3
However, those users retweeted the most only on 27th March 2018, whereas users who registered their language as English retweeted every day.
The analysis of the queries suggests that the public was not greatly interested in the Facebook scandal. This finding is echoed in other recently published reports. For example, Aylien reported that the volume of stories spiked when the initial scandal about Cambridge Analytica broke and that the discussion has fallen dramatically since (Gannon, 2018). Facebook even posted Q1 revenue and earnings higher than expected on 25th April 2018 (Divine, 2018). This suggests that Facebook as a brand is strong and has eased the public’s doubts about data privacy after the Cambridge Analytica scandal.
Extra material and code are available here.
Bonzanini, M. (2015) Mining Twitter Data with Python (Part 1: Collecting data) – Marco Bonzanini, marcobonzanini.com. Available at: https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/ (Accessed: 12 May 2018).
Breslin, M. (2016) Data Warehousing Battle of the Giants: Comparing the Basics of the Kimball and Inmon Models, bi-bestpractices.com. Available at: http://www.bi-bestpractices.com/view-articles/4768 (Accessed: 14 May 2018).
Divine, J. (2018) Facebook, Inc. (FB) Crushes Earnings, Stock Soars | Stock Market News | US News. Available at: https://money.usnews.com/investing/stock-market-news/articles/2018-04-25/facebook-inc-fb-stock (Accessed: 20 May 2018).
Gannon, W. (2018) Cambridge Analytica, Facebook, and user data – Monthly Media Review with the AYLIEN News API, April. Available at: https://flipboard.com/@flipboard/-cambridge-analytica-facebook-and-user-d/f-7d92ae3094%2Faylien.com (Accessed: 20 May 2018).
Hay, J. (2011) EDW Reference Architecture: Information Warehousing Layer | EDW Strategic Positions. Available at: https://justinhay.wordpress.com/2011/05/24/edw-reference-architecture-information-warehousing-layer/ (Accessed: 13 May 2018).
Hicks, M. (2018) The Cambridge Analytica and Facebook data scandal: how to tell if your data was shared | TechRadar, Techradar.com. Available at: https://www.techradar.com/news/us-uk-investigating-facebooks-role-in-cambridge-analytica-data-breach (Accessed: 4 May 2018).
Kimball, R. and Ross, M. (2013) The Data Warehouse Toolkit : The Definitive Guide to Dimensional Modeling. 3rd edn. Wiley.
Lih Ong, I. et al. (2018) A Five-Layered Business Intelligence Architecture.
Ben Kraiem, M., Feki, J., Khrouf, K., Ravat, F. and Teste, O. (2015) ‘Modeling and OLAPing Social Media: The case of Twitter’. Available at: ftp://ftp.irit.fr/IRIT/SIG/2015_SNAM_BFKRT.pdf (Accessed: 24 April 2018).
Rehman, N. U. et al. (2012) ‘Building a Data Warehouse for Twitter Stream Exploration’, 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. Available at: https://pdfs.semanticscholar.org/104d/4e33f7f0776c15062babed8998e60f4625a8.pdf (Accessed: 24 April 2018).