This repository contains scripts for a weak-learning summarization pipeline for Arabic and English. See Conda Environment Setup below.

Overview

  • Generate stories (snorkel_compute_labels_languageagnostic.py)
  • Run the Snorkel pipeline to get labels (snorkel_compute_labels_languageagnostic.py). A hard-coded Boolean (arabic=True) switches between Arabic and English.
  • Train XLMRSum on training data (see BertSum-XLMRoberta/README.md for instructions)
  • Download the Arabic data from https://sourceforge.net/projects/easc-corpus/ and unzip it. Generate test stories (generate_stories_stanfordtokenizer.py) for the Arabic data only; CNNDM stories are already available for download.
  • Test XLMRobertaSum on test data (see BertSum-XLMRoberta/README.md for instructions)
  • Assess output (test_bertsum_output.py)
  • Run weak learner assessment (test_summarizers.py)
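
The Snorkel step above switches languages with a hard-coded Boolean (arabic=True). One way to avoid editing the source between runs is a command-line option. This is a minimal sketch, not the script's actual interface: the `--language` flag and the helpers `parse_args` and `is_arabic` are hypothetical names, and only the standard-library `argparse` module is used.

```python
import argparse


def parse_args(argv=None):
    """Parse a hypothetical --language option (not part of the real script)."""
    parser = argparse.ArgumentParser(
        description="Select the language for the labeling step (illustrative only)."
    )
    parser.add_argument(
        "--language",
        choices=["arabic", "english"],
        default="arabic",  # mirrors the hard-coded arabic=True default
        help="Language of the input stories.",
    )
    return parser.parse_args(argv)


def is_arabic(argv=None):
    """Return the Boolean that would replace the hard-coded arabic=True."""
    return parse_args(argv).language == "arabic"
```

With this in place, the script could set `arabic = is_arabic()` at startup, and a run with `--language english` would select the English path without a code change.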