Weakly Supervised BertSum Text Summarizer

This repository contains scripts for a weakly supervised summarization pipeline for Arabic and English. See Conda Environment Setup below.

Overview

Generate stories (snorkel_compute_labels_languageagnostic.py)
Run the Snorkel pipeline to get labels (snorkel_compute_labels_languageagnostic.py). A hard-coded Boolean (arabic=True) switches between Arabic and English.
Train XLMRSum on training data (see BertSum-XLMRoberta/README.md for instructions)
Download the Arabic data from https://sourceforge.net/projects/easc-corpus/ and unzip it. Generate test stories (generate_stories_stanfordtokenizer.py) for the Arabic data only; CNNDM stories are already available for download.
Test XLMRobertaSum on test data (see BertSum-XLMRoberta/README.md for instructions)
Assess output (test_bertsum_output.py)
Run weak learner assessment (test_summarizers.py)
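
The labeling step above switches language through a hard-coded Boolean (arabic=True), so changing language means editing the script. As a minimal sketch of an alternative, the same Boolean could be derived from a command-line option. The helper name parse_arabic_flag and the --language option are hypothetical and are not part of this repository; the default is chosen to match arabic=True.

```python
import argparse
from typing import Optional, Sequence


def parse_arabic_flag(argv: Optional[Sequence[str]] = None) -> bool:
    """Return True for Arabic, False for English.

    Hypothetical helper: the repository hard-codes arabic=True instead.
    """
    parser = argparse.ArgumentParser(description="Select the pipeline language.")
    parser.add_argument(
        "--language",
        choices=["arabic", "english"],
        default="arabic",  # assumption: mirrors the hard-coded arabic=True
        help="Language of the input stories.",
    )
    args = parser.parse_args(argv)
    # Collapse the choice back into the Boolean the script already uses.
    return args.language == "arabic"
```

Calling parse_arabic_flag(["--language", "english"]) would yield False, which could then be assigned to the existing arabic variable; with no option given, the result stays True.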