# datapreprocessing_rosetta_parser
I've never done any NLP or text data processing before, so I wanted to use this hackathon as a learning opportunity, specifically targeting popular packages like `pandas`, `beautifulsoup` and `spacy`.
The main idea of my project is to recreate Jelle Teijema's preprocessing pipeline and then try to run a Dutch language model on each document to extract things of interest, such as emails, URLs, organizations, people and dates. Maybe at this point, it shouldn't be considered just pre-processing, hmmm. Anyway, I've used the `nl_core_news_lg` model. It is not very reliable, especially for organization and person names, but it still allows for interesting queries.
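For the curious, here is a minimal sketch of how such entities can be pulled out with spacy. This is my illustration, not the actual `generate.py` code, and the NER label names (`ORG`, `PERSON`, `DATE`) are assumptions that may differ between model versions:

```python
# Minimal sketch of entity extraction with spacy's nl_core_news_lg.
# Labels are assumptions; older Dutch models use e.g. PER instead of PERSON.
import spacy

nlp = spacy.load('nl_core_news_lg')

def extract_entities(text):
    doc = nlp(text)
    return {
        # token-level heuristics built into spacy
        'found_emails': [t.text for t in doc if t.like_email],
        'found_urls': [t.text for t in doc if t.like_url],
        # named entities from the statistical NER component
        'found_organizations': [e.text for e in doc.ents if e.label_ == 'ORG'],
        'found_people': [e.text for e in doc.ents if e.label_ == 'PERSON'],
        'found_dates': [e.text for e in doc.ents if e.label_ == 'DATE'],
    }
```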
Moreover, I've decided to try summarization and collecting the most frequent words in the documents. My script tries to find the `N_SUMMARY_SENTENCES` most important sentences and stores them in the `summary` column. Please note, my Dutch is not very strong, so I can't really judge how well it works :)
Finally, the script also saves the cleaned title and file contents, as per the track's anticipated output.
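The cleaning itself is presumably where `beautifulsoup` comes in; something along these lines (my guess, not the actual code):

```python
# Guess at the cleaning step: strip markup with BeautifulSoup and
# collapse whitespace runs into single spaces.
from bs4 import BeautifulSoup

def clean_text(raw):
    text = BeautifulSoup(raw, 'html.parser').get_text(separator=' ')
    return ' '.join(text.split())  # collapse runs of spaces/newlines
```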
## Output file
`generate.py` reads `.csv` files from the `input_data` folder and produces an output `.csv` file with a `|` separator. It is pretty heavy (about 1.8x the size of the input csv, ~75MB) and has a total of 15 columns:
| Column name | Description |
|---|---|
| `filename` | Original filename provided in the input file |
| `file_content` | Original file contents provided in the input file |
| `id` | The dot-separated numbers from the filename |
| `category` | Type of the file |
| `filename_date` | Date extracted from the filename |
| `parsed_date` | Date extracted from the file contents |
| `found_emails` | Emails found in the file contents |
| `found_urls` | URLs found in the file contents |
| `found_organizations` | Organizations found in the file contents |
| `found_people` | People found in the file contents |
| `found_dates` | Dates found in the file contents |
| `summary` | Summary of the document |
| `top5words` | Top 5 most frequently used words in the file contents |
| `title` | Somewhat cleaned title |
| `abstract` | Somewhat cleaned file contents |
## Some interesting queries that I could think of at 12pm
- Load the output processed `.csv` file:

  ```python
  import pandas as pd

  df = pd.read_csv('./output_data/processed_data.csv', sep='|',
                   index_col=0, dtype=str)
  ```
- All unique emails found in the documents:

  ```python
  import ast

  # each cell holds a stringified Python list, so parse it first
  emails = sum([ast.literal_eval(x) for x in df['found_emails']], [])
  unique_emails = set(emails)
  ```
- Top 10 most common email domains in the documents:

  ```python
  from collections import Counter

  # reuses the emails list from the previous query
  domains = [x.split('@')[1] for x in emails]
  d_counter = Counter(domains)
  print(d_counter.most_common(10))
  ```
- Top 10 organizations mentioned in the documents:

  ```python
  orgs = sum([ast.literal_eval(x) for x in df['found_organizations']], [])
  o_counter = Counter(orgs)
  print(o_counter.most_common(10))
  ```
- Find IDs of documents that contain the word "confidential":

  ```python
  # na=False keeps rows with missing abstracts from breaking the mask
  df['id'][df['abstract'].str.contains('confidential', na=False)]
  ```
- How many documents and categories there are in the dataset:

  ```python
  print(f'Total number of documents: {len(df)}')
  print('Documents by category:')
  print(df['category'].value_counts())
  ```
and I am sure you can be significantly more creative with this :)
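For instance, here is one more (untested) idea: documents per year, assuming `filename_date` is in a format pandas can parse:

```python
# Hypothetical extra query: count documents per year from filename_date,
# reusing the df loaded in the first query above.
years = pd.to_datetime(df['filename_date'], errors='coerce').dt.year
print(years.value_counts().sort_index())
```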
## How to generate output data
- Install dependencies with conda and switch to the environment:

  ```bash
  conda env create -f environment.yml
  conda activate ftm_hackathon
  ```

  Alternatively (not tested), you can install the packages into your current environment manually:

  ```bash
  pip install spacy tqdm pandas bs4
  ```

- Download the Dutch spacy model (~500MB):

  ```bash
  python -m spacy download nl_core_news_lg
  ```

- Put your raw `.csv` files into the `input_data` folder.
- Run `generate.py`. On my six-year-old laptop it takes ~17 minutes.
- The result will be written to `output_data/processed_data.csv`.