If the library already supports the format of your dataset, you can use it directly without any library-level modification. The currently supported task types and file formats are:
- TaskType.text_classification: FileType.tsv
- TaskType.named_entity_recognition: FileType.conll
- TaskType.summarization: FileType.tsv
- TaskType.extractive_qa: FileType.json (same format as SQuAD)
For example, suppose that you have a summarization dataset in tsv format and a system output in plain-text format:
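    from explainaboard import FileType, Source, TaskType, get_dataset_class, get_processor_class

    dataset_path = "./integration_tests/artifacts/summarization/dataset.tsv"
    output_path = "./integration_tests/artifacts/summarization/output.txt"

    # Load the dataset (tsv) and the system output (plain text) from local files.
    loader = get_dataset_class(TaskType.summarization)(
        dataset_path,
        output_path,
        Source.local_filesystem,
        Source.local_filesystem,
        FileType.tsv,
        FileType.text,
    )
    data = loader.load()

    # Analyze the loaded data and write the report to the current directory.
    # Assumes process() accepts metadata and sys_output keyword arguments;
    # check the signature in your installed version.
    processor = get_processor_class(TaskType.summarization)()
    analysis = processor.process(metadata={}, sys_output=data)
    analysis.write_to_directory("./")

If your dataset is in a new format that the current SDK doesn't support, you can either: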
- (1) reformat your data into a format that the current library supports, or
- (2) rewrite the loader.load() function so that it supports your format.

Taking the summarization task as an example, suppose that the existing SDK supports only the tsv format. We can add support for the json format by putting the following code inside loaders.summarization.TextSummarizationLoader.load():

    def load(self) -> Iterable[Dict]:
        raw_data = self._load_raw_data_points()
        data: List[Dict] = []
        if self._file_type == FileType.tsv:
            for id, dp in enumerate(raw_data):
                source, reference, hypothesis = dp[:3]
                data.append(
                    {"id": id, "source": source.strip(),
                     "reference": reference.strip(), "hypothesis": hypothesis.strip()}
                )
        # elif (not a second if), so that tsv input does not fall through to the else branch
        elif self._file_type == FileType.json:  # This function has been unit-tested
            for id, info in enumerate(raw_data):
                source = info["source"]
                reference = info["references"]
                hypothesis = info["hypothesis"]
                data.append(
                    {"id": id, "source": source.strip(),
                     "reference": reference.strip(), "hypothesis": hypothesis.strip()}
                )
        else:
            raise NotImplementedError
        return data
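Option (1) can often be handled with a short conversion script instead of a change to the loader. The following is a minimal sketch, assuming the JSON file holds a list of records with the keys read by the json branch (source, references, hypothesis) and that the tsv loader expects the three-column order read by the tsv branch. The function name json_to_tsv and the helper _clean are hypothetical and not part of ExplainaBoard.

```python
import json


def _clean(text: str) -> str:
    # Collapse tabs, newlines, and repeated spaces so that each record
    # stays on a single tsv line with exactly three columns.
    return " ".join(text.split())


def json_to_tsv(json_path: str, tsv_path: str) -> int:
    """Convert a JSON list of summarization records to a three-column tsv file.

    Hypothetical helper: assumes every record has the string fields
    "source", "references", and "hypothesis". Returns the number of rows written.
    """
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)

    with open(tsv_path, "w", encoding="utf-8") as f:
        for record in records:
            row = [_clean(record[key]) for key in ("source", "references", "hypothesis")]
            f.write("\t".join(row) + "\n")

    return len(records)
```

A call such as json_to_tsv("my_data.json", "my_data.tsv") (both paths are placeholders) produces a file that could then be passed to the loader with FileType.tsv, provided the column order assumed here matches what your version of the library expects.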