Simpler Dataset Files

Currently, datasets are given as markdown files with lots of unused columns:

Project name       | Domain                  | Source code available (**y**es/**n**o)? | Is it a git repository (**y**es/**n**o)? | Repository URL                                               | Clone URL                                          | Estimated number of commits
-------------------|-------------------------|-----------------------------------------|------------------------------------------|--------------------------------------------------------------|----------------------------------------------------|-----------------------------
apache-httpd       | web server              | y                                       | y                                        | https://github.com/apache/httpd                              | https://github.com/DiffDetective/httpd.git         | 32,927
berkeley-db-libdb  | database system         | y                                       | y                                        | https://github.com/berkeleydb/libdb                          | https://github.com/DiffDetective/libdb.git         | 7

Our dataset loader in fact only uses the project name and clone URL. Hence, both the dataset files and the loading code should be simplified. The columns for `Domain` and `Repository URL` are interesting but not essential, so maybe these could stay in the files but be the last two columns.
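To make the proposed layout concrete, here is a minimal sketch of a one-off conversion script. It assumes the header names shown in the table above and a hypothetical target order (project name, clone URL, domain, repository URL); `KEPT_COLUMNS`, `split_row`, and `simplify_table` are illustrative names, not existing code in the project.

```python
# Hypothetical column layout for the simplified files. The header names come
# from the existing table; the order is the one suggested in this issue.
KEPT_COLUMNS = ["Project name", "Clone URL", "Domain", "Repository URL"]


def split_row(line):
    """Split one markdown table row into whitespace-stripped cells."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def simplify_table(markdown):
    """Return the table reduced to KEPT_COLUMNS, in that order."""
    lines = [line for line in markdown.splitlines() if line.strip()]
    header = split_row(lines[0])
    # Look columns up by name so the input column order does not matter.
    indices = [header.index(name) for name in KEPT_COLUMNS]
    out = [
        " | ".join(KEPT_COLUMNS),
        " | ".join("---" for _ in KEPT_COLUMNS),
    ]
    for line in lines[2:]:  # lines[1] is the markdown separator row
        cells = split_row(line)
        out.append(" | ".join(cells[i] for i in indices))
    return "\n".join(out)
```

Looking columns up by header name rather than by position would also let a loader accept both the old and the new layout during a transition.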

Also, except for line 2 of the file (the header separator row), markdown files containing just a single table like this are effectively CSV files with `|` as the separator instead of `,` or `;`. So maybe we could reuse our CSV IO classes here.
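The CSV idea can be sketched as follows. This is only an illustration using Python's standard `csv` module as a stand-in for the project's own CSV IO classes; `load_dataset` is a hypothetical name, and the sketch assumes that cells contain no `|` characters and that rows have no leading or trailing `|`.

```python
import csv


def load_dataset(text):
    """Parse a single-table markdown file as '|'-separated CSV.

    Returns a list of (project name, clone URL) pairs.
    """
    lines = text.splitlines()
    # Drop line 2 (the markdown separator row) and any blank lines.
    rows = [line for i, line in enumerate(lines) if i != 1 and line.strip()]
    reader = csv.reader(rows, delimiter="|")
    # Cells are padded with spaces for alignment, so strip them.
    header = [cell.strip() for cell in next(reader)]
    name_col = header.index("Project name")
    url_col = header.index("Clone URL")
    return [(row[name_col].strip(), row[url_col].strip()) for row in reader]
```

The only markdown-specific steps are skipping the separator row and stripping the alignment padding; everything else is plain CSV parsing.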

Simpler Dataset Files #120
