NLP_0-project
Group project for MFIN7036. Our goal is to predict firm profitability with text-based competition measures [1]. We are a "democratic" and collaborative group of five; members are tagged below according to our initial division of work.
Here is the outline of our project:
Data collection.
@LeiyuanHuo, jyang130, FanFanShark, xdc1999, gaojiamin1116
- Based on the file data-WRDS-list.csv, write a web-scraping script to download all 10-Ks (HTML format) that these companies filed with the SEC from 2010 to 2022, using Historical EDGAR documents, and rename them data-10K-COMPNAME-Year.html.
- Parse the HTML files to extract the Business and MD&A sections.
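The parsing step could be sketched roughly as follows. This is a minimal, standard-library-only illustration, not our final scraper: the download from EDGAR is omitted, the `Item 1` / `Item 1A` / `Item 7` / `Item 7A` heading patterns are assumptions about how a typical 10-K is laid out, and the helper names (`html_to_text`, `filing_filename`, `extract_section`) and the sample HTML are made up for the example.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the visible text of an HTML document, skipping scripts and styles."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so item headings can be matched with simple patterns.
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def filing_filename(comp_name, year):
    """Build the data-10K-COMPNAME-Year.html name from the outline."""
    # Assumption: spaces and punctuation are dropped from the company name.
    safe = re.sub(r"[^A-Za-z0-9]+", "", comp_name)
    return f"data-10K-{safe}-{year}.html"


def extract_section(text, start_pattern, end_pattern):
    """Return the longest span from a start heading up to the next end heading.

    A 10-K usually lists each item twice (table of contents and body);
    keeping the longest candidate is a heuristic that tends to pick the body.
    """
    best = ""
    for start in re.finditer(start_pattern, text, flags=re.IGNORECASE):
        end = re.search(end_pattern, text[start.end():], flags=re.IGNORECASE)
        if end is None:
            continue
        candidate = text[start.start(): start.end() + end.start()].strip()
        if len(candidate) > len(best):
            best = candidate
    return best


# Assumed heading patterns (start, end) for the two sections we need.
BUSINESS = (r"Item\s*1\.?\s*Business", r"Item\s*1A\.?")
MDA = (r"Item\s*7\.?\s*Management", r"Item\s*7A\.?")

# Made-up miniature filing: a table of contents followed by the body.
sample_html = """
<html><body>
<p>Item 1. Business</p><p>Item 1A. Risk Factors</p>
<p>Item 7. Management's Discussion</p><p>Item 7A. Quantitative Disclosures</p>
<h2>Item 1. Business</h2><p>We design and sell widgets.</p>
<h2>Item 1A. Risk Factors</h2><p>Competition is intense.</p>
<h2>Item 7. Management's Discussion and Analysis</h2><p>Revenue grew.</p>
<h2>Item 7A. Quantitative Disclosures</h2>
</body></html>
"""

text = html_to_text(sample_html)
business = extract_section(text, *BUSINESS)
mda = extract_section(text, *MDA)
print(filing_filename("Acme Corp.", 2015))
print(business)
print(mda)
```

Real filings vary a lot in how they format item headings, so the patterns would likely need tuning against a sample of downloaded files.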
Text Processing: feature extraction [2]
- Part-of-speech (POS) tagging (one of our two main methods) to extract product names and descriptions. Store these for each company.
- Named entity recognition (NER) (the other main method) to extract the names of competitors mentioned in the filing. Store these for each company.
- Product texts: build bag-of-words (BoW) and tf-idf representations of each company's product(s), which should give us a term-product matrix.
- Competitor texts: use BoW, since we care about how often each competitor is mentioned.
- ‼️ We also need to combine sector and firm size/market power into competitor texts and re-count.
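To make the BoW, tf-idf, and mention-count ideas above concrete, here is a small standard-library sketch. It does not reproduce the POS or NER steps: the product descriptions and competitor names are made up and stand in for whatever those steps would produce, and the function names are hypothetical. The tf-idf weighting used is the plain `tf * log(N / df)` variant, which is one of several common definitions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphabetic tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def bow(texts):
    """Bag-of-words: one Counter of term frequencies per document."""
    return {name: Counter(tokenize(text)) for name, text in texts.items()}


def tfidf(counts):
    """Weight raw counts by tf * log(N / df); terms found in every document get 0."""
    n_docs = len(counts)
    # Iterating a Counter yields each distinct term once, so df counts documents.
    df = Counter(term for c in counts.values() for term in c)
    return {
        name: {term: tf * math.log(n_docs / df[term]) for term, tf in c.items()}
        for name, c in counts.items()
    }


def count_mentions(text, competitor_names):
    """Count case-insensitive whole-word mentions of each known competitor name."""
    return {
        name: len(re.findall(rf"\b{re.escape(name)}\b", text, flags=re.IGNORECASE))
        for name in competitor_names
    }


# Hypothetical product descriptions, one per company.
product_texts = {
    "AlphaCo": "cloud storage software and cloud backup software",
    "BetaInc": "industrial pumps and valves",
    "GammaLtd": "cloud analytics software",
}
counts = bow(product_texts)
weights = tfidf(counts)

# Hypothetical competitor names, assumed to come from the NER step.
mentions = count_mentions(
    "We compete with BetaInc and GammaLtd. BetaInc is larger.",
    ["BetaInc", "GammaLtd", "DeltaCorp"],
)
print(counts["AlphaCo"].most_common(2))
print(mentions)
```

The nested dictionaries in `weights` act as a sparse term-product matrix: each company's dictionary holds the weights of the terms that appear in its product text.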
Text Processing: feature transformation and representation [2]
- Term-product matrix: calculate pairwise cosine similarity scores between products; use a score threshold to group similar products together.
- Term-product matrix: alternatively, apply a clustering method (e.g., KMeans) directly to the product vectors.
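The threshold-based grouping could look something like the sketch below. It assumes each product is a sparse `{term: weight}` vector, and it merges groups transitively (if A is similar to B and B to C, all three end up together), which is only one possible reading of "use a threshold to group". The vectors, the threshold of 0.5, and the function names are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def threshold_groups(vectors, threshold):
    """Group products whose similarity is >= threshold, merging transitively."""
    names = list(vectors)
    parent = {n: n for n in names}

    def find(n):
        # Union-find root lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical product vectors (e.g., rows of a term-product matrix).
vectors = {
    "cloud_storage": {"cloud": 2.0, "storage": 1.0},
    "cloud_backup": {"cloud": 2.0, "backup": 1.0},
    "pumps": {"pumps": 1.0, "valves": 1.0},
}
similarity = cosine(vectors["cloud_storage"], vectors["cloud_backup"])
groups = threshold_groups(vectors, threshold=0.5)
print(round(similarity, 3))
print(groups)
```

The number of products in a firm's group (minus the firm's own) is one candidate for the "similar product count" competition measure.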
Econometric Analysis and Hypothesis Testing [2]
- Multivariate regression: the dependent variable (DV) is profitability (e.g., sales, revenue, Tobin's q); the independent variables (IVs) are our competition measures (one from the count of similar products, one from mentions as a competitor); also include relevant control variables.
- Cross-sectional portfolios: our competition measures are cross-sectional (one for each year), so we can create long-short portfolios for both measures and examine the effects on stock returns.
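As a rough illustration of the long-short idea, the sketch below sorts firms by a competition measure within one year and compares the average return of the two extreme groups. The direction (long the low-competition firms, short the high-competition firms), the equal weighting, the `fraction` parameter, and all numbers are assumptions made for the example, not results or settled design choices.

```python
from statistics import mean


def long_short_return(firms, fraction=0.3):
    """Mean return of the low-competition leg minus the high-competition leg.

    firms: list of (competition_measure, next_period_return) pairs for one year.
    fraction: share of firms placed in each leg (equal-weighted).
    """
    if not 0 < fraction <= 0.5:
        raise ValueError("fraction must be in (0, 0.5]")
    if len(firms) < 2:
        raise ValueError("need at least two firms to form both legs")
    ranked = sorted(firms, key=lambda f: f[0])
    k = max(1, int(len(ranked) * fraction))
    low_leg = ranked[:k]    # least competition: long leg (assumed direction)
    high_leg = ranked[-k:]  # most competition: short leg (assumed direction)
    return mean(r for _, r in low_leg) - mean(r for _, r in high_leg)


# Made-up data: (competition measure, next-year return) per firm, by year.
panel = {
    2015: [(1, 0.10), (2, 0.06), (8, -0.02), (9, -0.04)],
    2016: [(3, 0.05), (1, 0.07), (7, 0.01), (6, 0.03)],
}
spreads = {year: long_short_return(firms, fraction=0.5) for year, firms in panel.items()}
for year, spread in sorted(spreads.items()):
    print(year, round(spread, 4))
```

Repeating this for each year and for each of the two measures gives a time series of spreads whose mean can then be tested against zero.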
Footnotes
- [1] Two papers inspired this project: Eisdorfer, A., Froot, K., Ozik, G., & Sadka, R. (2021). Competition Links and Stock Returns. The Review of Financial Studies, 2021-12-20. Hoberg, G., & Phillips, G. (2016). Text-Based Network Industries and Endogenous Product Differentiation. The Journal of Political Economy, 124(5), 1423-1465.
- [2] The text processing steps are based on the MFIN7036 Lecture_Notes and a review paper: Marty, T., Vanstone, B., & Hahn, T. (2020). News media analytics in finance: A survey. Accounting and Finance, 60(2), 1385-1434.