Currently, datasets are given as markdown files with lots of unused columns:
Our dataset loader in fact only uses the project name and clone URL, so the dataset files and the loading code should be simplified. The Domain and Repository URL columns are interesting but not essential, so maybe they could stay in the files as the last two columns.
Also, except for line 2 of the file, a markdown file containing just a single table like this is effectively a CSV file with `|` as the separator instead of `,` or `;`. So maybe we could reuse our CSV IO classes here.
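
To illustrate the idea, here is a rough sketch in Python that reads such a table with a stock CSV reader using `|` as the delimiter. It is not our actual loader or CSV IO classes: the function name `load_dataset` and the header labels `Project` and `Clone URL` are assumptions made for the example, and it assumes every row starts and ends with a `|`.

```python
import csv
import io


def load_dataset(markdown_text):
    """Return (project name, clone URL) pairs from a single-table markdown file.

    Hypothetical helper: the header labels "Project" and "Clone URL" are
    assumed for illustration and may differ from the real dataset files.
    """
    lines = markdown_text.strip().splitlines()
    # Line 2 is the markdown separator row (e.g. |---|---|), so drop it.
    table_lines = lines[:1] + lines[2:]
    reader = csv.reader(io.StringIO("\n".join(table_lines)), delimiter="|")
    # The leading and trailing pipes produce empty first and last cells;
    # slice them off and trim the padding around each remaining cell.
    rows = [[cell.strip() for cell in row[1:-1]] for row in reader if row]
    header, data = rows[0], rows[1:]
    name_idx = header.index("Project")
    url_idx = header.index("Clone URL")
    return [(row[name_idx], row[url_idx]) for row in data]
```

Looking the two columns up by header label rather than by position means the optional columns can sit anywhere, including last, without changes to the loader.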