This is a repost of my article from the Pellucid Tech Blog.
Background
At Pellucid Analytics we are building a platform that automates and simplifies the creation of data-driven chartbooks, so that it takes minutes instead of hours to get from raw data to powerful visualizations and compelling stories.
One of the industries we are focusing on is investment banking. We help IB advisory professionals build pitch books and provide them with analytical and quantitative support to sell their ideas. Comparable company analysis is central to this business.
Comparable company analysis starts with establishing a peer group consisting of similar companies of similar size in the same industry and region.
The problem we are faced with is finding a scalable solution to establish a peer group for any chosen company.
Approaches That We Tried
Company Industry
Data vendors provide an industry classification for each company. It helps a lot in industries like retail (Wal-Mart is a good comparable to Costco) and energy (Chevron and Exxon Mobil), but it stumbles with many other companies. People tend to compare Amazon with Google as two major players in the IT business, but data vendors tend to put Amazon in the retail industry, with Wal-Mart and Costco as comparables.
Company Financials and Valuation Multiples
We tried cluster analysis and k-nearest neighbors to group companies based on their financials (Sales, Revenue) and valuation multiples (EV/EBITDA, P/E). However, the assumption that similar companies will have similar valuation multiples is wrong. People compare Twitter with Facebook as the two biggest companies in social media, but based on their financials they don’t have much in common: Facebook’s 2013 revenue was almost $8 billion, while Twitter’s was only $600 million.
Novel Approach
We came up with the idea that if companies are often mentioned together in news articles and tweets, it’s probably a sign that people think of them as comparable companies. In this post I’ll show how we built a proof of concept for this idea with Spark, Spark Streaming and Cassandra. We use only Twitter live stream data for now; accessing high-quality news data is a somewhat more complicated problem.
Let’s take for example this tweet from CNN:
Trying to spot the next $FB or $TWTR? These 10 startups are worth keeping an eye on http://t.co/FEKNtm7QqB
— CNN Public Relations (@CNNPR) October 3, 2014
From this tweet we can derive 2 mentions for 2 companies: for Facebook it will be Twitter, and vice versa. If we collect tweets for all companies over some period of time and take the ratio of joint appearances in the same tweet as a measure of “similarity”, we can build comparable company recommendations based on this measure.
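To make the mention idea concrete, here is a minimal sketch in plain Scala of how a tweet could be turned into mentions. It is not the code from our application: the names `Mention`, `MentionExtractor`, `extractTickers` and `mentions` are hypothetical, and the cashtag format (a `$` followed by 1 to 6 letters) is an assumption made for illustration.

```scala
// Hypothetical record: `ticker` was mentioned together with `mentionedWith`.
final case class Mention(ticker: String, mentionedWith: String)

object MentionExtractor {
  // Assumed cashtag format: '$' followed by 1 to 6 letters, e.g. $FB or $TWTR.
  private val Cashtag = """\$([A-Za-z]{1,6})\b""".r

  // Collect the distinct tickers that appear in the tweet text.
  def extractTickers(text: String): Set[String] =
    Cashtag.findAllMatchIn(text).map(_.group(1).toUpperCase).toSet

  // Every ordered pair of distinct tickers in one tweet becomes a mention.
  def mentions(text: String): Set[Mention] = {
    val tickers = extractTickers(text)
    for {
      a <- tickers
      b <- tickers
      if a != b
    } yield Mention(a, b)
  }
}
```

For the CNN tweet above, this would yield two mentions: FB with TWTR, and TWTR with FB. A tweet with a single cashtag produces no mentions at all.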
Data Model
We use Cassandra to store all mentions, aggregates and final recommendations. We use the Phantom DSL for Scala to define the schema and for most Cassandra operations (Spark integration is not yet supported in Phantom).
Ingest Real-Time Twitter Stream
We use the Spark Streaming Twitter integration to subscribe to real-time Twitter updates, then we extract company mentions and put them into Cassandra. Unfortunately Phantom doesn’t support Spark yet, so we used the DataStax Spark Cassandra Connector with custom type mappers to map from Phantom record types into Cassandra tables.
Spark For Aggregation and Recommendation
To come up with comparable company recommendations we use a two-step process.
1. Count mentions for each pair of tickers
After the Mentions table is loaded into Spark as an RDD[Mention], we extract pairs of tickers, which enables a bunch of aggregate and reduce functions from Spark’s PairRDDFunctions. With aggregateByKey and a pair of combine functions, we efficiently build a counter map, Map[Ticker, Long], for each ticker, distributed across the cluster. From a single Map[Ticker, Long] we emit multiple aggregates, one for each ticker pair.
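The counting step above can be sketched without a cluster. The example below uses plain Scala collections, not Spark: the local `aggregateByKey` is a stand-in that only mirrors the shape of Spark’s `aggregateByKey(zeroValue)(seqOp, combOp)`, with each inner `Seq` playing the role of a partition. All names in it (`PairCounts`, `seqOp`, `combOp`, `aggregates`) are hypothetical and are not taken from our application code.

```scala
object PairCounts {
  type Ticker = String
  type Counter = Map[Ticker, Long]

  // seqOp: fold one co-mentioned ticker into a partial counter.
  val seqOp: (Counter, Ticker) => Counter =
    (acc, t) => acc.updated(t, acc.getOrElse(t, 0L) + 1L)

  // combOp: merge two partial counters built on different partitions.
  val combOp: (Counter, Counter) => Counter =
    (a, b) => b.foldLeft(a) { case (acc, (t, n)) =>
      acc.updated(t, acc.getOrElse(t, 0L) + n)
    }

  // Local stand-in for aggregateByKey: apply seqOp within each "partition",
  // then merge the per-partition results with combOp.
  def aggregateByKey(partitions: Seq[Seq[(Ticker, Ticker)]]): Map[Ticker, Counter] = {
    val partials: Seq[Map[Ticker, Counter]] = partitions.map { part =>
      part.groupBy(_._1).map { case (k, kvs) =>
        k -> kvs.map(_._2).foldLeft(Map.empty[Ticker, Long])(seqOp)
      }
    }
    partials.foldLeft(Map.empty[Ticker, Counter]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (k, c)) =>
        a.updated(k, combOp(a.getOrElse(k, Map.empty[Ticker, Long]), c))
      }
    }
  }

  // Emit one (ticker, mentionedWith, count) aggregate per counter entry.
  def aggregates(counters: Map[Ticker, Counter]): Seq[(Ticker, Ticker, Long)] =
    counters.toSeq.flatMap { case (t, c) =>
      c.toSeq.map { case (w, n) => (t, w, n) }
    }
}
```

The point of splitting the work into `seqOp` and `combOp` is that each partition can build its own small counter maps locally, and only those maps need to be merged afterwards.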
2. Sort aggregates and build recommendations
After the aggregates are computed, we sort them globally and then group them by key (Ticker). Once all aggregates are grouped, we produce a Recommendation for each key in a single traversal, distributed across the cluster.
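Here is a rough sketch of this second step, again in plain Scala collections rather than Spark. The case classes and the `recommend` function are hypothetical. The score is an assumption: we take the pair count divided by the ticker’s total co-mention count, which is one possible reading of the “ratio of joint appearance” idea and not necessarily the exact formula used in the application.

```scala
// Hypothetical records for illustration only.
final case class Aggregate(ticker: String, mentionedWith: String, count: Long)
final case class Similarity(ticker: String, score: Double)
final case class Recommendation(ticker: String, comparables: List[Similarity])

object Recommender {
  def recommend(aggregates: Seq[Aggregate], topN: Int): Map[String, Recommendation] =
    aggregates
      .sortBy(a => -a.count) // global sort, highest count first (stable)
      .groupBy(_.ticker)     // on a local Seq, each group keeps the sorted order
      .map { case (ticker, group) =>
        // Assumed score: share of this ticker's co-mentions that involve the peer.
        val total = group.map(_.count).sum.toDouble
        val top = group
          .take(topN)
          .map(a => Similarity(a.mentionedWith, a.count / total))
          .toList
        ticker -> Recommendation(ticker, top)
      }
}
```

Because the aggregates are already sorted when they are grouped, building each recommendation is a single pass over the group: take the first `topN` entries and compute their scores.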
Results
You can check the comparable company recommendations built from the Twitter stream using this link.
Cassandra and Spark work perfectly together and allow you to build scalable data-driven applications that are super easy to scale out and can handle gigabytes and terabytes of data. In this particular case, it’s probably overkill. Twitter doesn’t have enough finance-related activity to produce serious load. However, it’s easy to extend this application and add other streams: Bloomberg News Feed, Thomson Reuters, etc.
The code for this application can be found on GitHub.