Software
Weakly Supervised BertSum Text Summarizer
This repository contains scripts for a weakly supervised summarization pipeline for Arabic and English. See Conda Environment Setup below.
Overview
- Generate stories (snorkel_compute_labels_languageagnostic.py)
- Run the Snorkel pipeline to compute labels (snorkel_compute_labels_languageagnostic.py). A hard-coded Boolean in the script (arabic=True) switches between Arabic and English.
- Train XLMRSum on training data (see BertSum-XLMRoberta/README.md for instructions)
- Download the Arabic data from https://sourceforge.net/projects/easc-corpus/ and unzip it. Generate test stories (generate_stories_stanfordtokenizer.py) for the Arabic data only; CNNDM stories are already available for download.
- Test XLMRobertaSum on test data (see BertSum-XLMRoberta/README.md for instructions)
- Assess output (test_bertsum_output.py)
- Run weak learner assessment (test_summarizers.py)
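
For readers new to weak supervision, the labeling step above can be pictured roughly as follows. This is a minimal, standard-library sketch and not the code in snorkel_compute_labels_languageagnostic.py: the three heuristics (lf_lead, lf_short, lf_shared_words), their thresholds, and the helper names are hypothetical, and a simple majority vote stands in for the label model that Snorkel would fit over the heuristic votes.

```python
from collections import Counter

# Vote values, following the usual weak-supervision convention.
ABSTAIN, EXCLUDE, INCLUDE = -1, 0, 1


def lf_lead(i, sents):
    """Hypothetical heuristic: the first two sentences belong in the summary."""
    return INCLUDE if i < 2 else ABSTAIN


def lf_short(i, sents):
    """Hypothetical heuristic: very short sentences are excluded."""
    return EXCLUDE if len(sents[i].split()) < 4 else ABSTAIN


def lf_shared_words(i, sents):
    """Hypothetical heuristic: include a sentence if at least half of its
    distinct whitespace tokens also occur in some other sentence."""
    words = set(sents[i].lower().split())
    if not words:
        return ABSTAIN
    others = set()
    for j, s in enumerate(sents):
        if j != i:
            others.update(s.lower().split())
    shared = sum(1 for w in words if w in others)
    return INCLUDE if shared / len(words) >= 0.5 else ABSTAIN


LABELING_FUNCTIONS = [lf_lead, lf_short, lf_shared_words]


def majority_vote(votes):
    """Combine votes, ignoring abstentions; ties and all-abstain give ABSTAIN."""
    counted = Counter(v for v in votes if v != ABSTAIN)
    if not counted:
        return ABSTAIN
    ranked = counted.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def weak_labels(sents):
    """Return one weak label per sentence of a single document."""
    return [
        majority_vote([lf(i, sents) for lf in LABELING_FUNCTIONS])
        for i in range(len(sents))
    ]


if __name__ == "__main__":
    doc = [
        "The council approved the new budget on Monday.",
        "The budget raises school funding.",
        "Thanks.",
    ]
    for sentence, label in zip(doc, weak_labels(doc)):
        print(label, sentence)
```

The heuristics rely only on sentence position and whitespace tokens, so the same sketch applies to Arabic and English text. Sentences that end up with ABSTAIN would typically be dropped or treated as negatives before training; which of these the repository does is not stated here.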