This research artifact accompanies our ICSE 2019 paper "Going Farther Together: The Impact of Social Capital on Sustained Participation in Open Source". If you use the artifact, please consider citing:
@inproceedings{QiuNBSV19,
author = {Qiu, Huilian Sophie and
Nolte, Alexander and
Brown, Anita and
Serebrenik, Alexander and
Vasilescu, Bogdan},
title = {Going Farther Together: The Impact of Social Capital on Sustained Participation in Open Source},
booktitle = {Proceedings of the 41st International Conference on Software Engineering (ICSE) 2019, Montreal, Canada},
note = {to appear},
organization = {IEEE},
year = {2019},
}
The artifact consists of three main parts:
-
Data collection scripts, written in Python.
The code can be used to select open source contributors, collect their GitHub projects, gather data such as contributors’ years of experience on GitHub, and projects’ age and size. The code also calculates social capital measures, including team familiarity, recurring cohesion, and heterogeneity of programming language expertise.
The final output is a csv file, each row of which is a data point used in our survival analysis. Each row consists of information per person per project per quarter (three-month time window), including all the social capital measures.
The code was implemented in Python 2 and tested on a Linux machine. Required dependencies: pymysql, sqlalchemy, numpy, scipy, sklearn, and pandas. You also need to have (access to) a MySQL dump of GHTorrent.
-
The survey instrument used in the paper.
-
Survey analysis scripts, written in R.
We give more details on data collection scripts next.
In addition to the standard GHTorrent tables, we created a table ght_namsor_s, containing inferred gender data kindly provided by NamSor (thanks, Elian Carsenat!).
-
Use MySQL_queries/filter_valid_users to find valid users.
-
Run sample_user.py to construct a balanced sample of male and female contributors. The result is saved in data/uid.list. To obtain a sample with equal numbers of men and women, sample_user.py calls our gender classifier to determine users' genders. The code for the gender classifier is stored in the gender/ folder. Details about these files are in the following section.
-
Run setup.py, which reads the files dict/alias_map_b.dict, dict/reverse_alias_map_b.dict, and data/uid.list, and generates the files data/pid.list, data/all_contributors.list, data/watchers_monthly_counts_win.csv, dict/contr_projs.dict, data/all_projs.list, and dict/proj_contrs_count.dict.
-
Run get_user_info.py, get_proj_info.py, and get_user_proj_info.py. They write to data/results_users.csv, data/results_proj.csv, and data/results_user_proj.csv, respectively.
-
Run merge_result.py to combine these tables. The result will be saved in data/proj_user_proj.csv, which will be used for data analysis.
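The final merge step can be pictured as two lookups per row. The following is only a minimal sketch, not the code in merge_result.py. It assumes that the user table is keyed by (u_id, window_num), the project table by (p_id, window_num), and the user-project table by all three. These key names are borrowed from the column list of the released data set, and the intermediate CSV files may be keyed differently. The helper merge_results and the toy rows are hypothetical.

```python
def merge_results(users, projs, user_projs):
    """Attach user and project attributes to each user-project row.

    Rows are plain dicts, as csv.DictReader would produce them.
    Assumption: the join keys are u_id, p_id, and window_num.
    """
    users_by_key = {(r["u_id"], r["window_num"]): r for r in users}
    projs_by_key = {(r["p_id"], r["window_num"]): r for r in projs}
    merged = []
    for row in user_projs:
        out = dict(row)
        # Rows without a match keep only their own columns (left join).
        out.update(users_by_key.get((row["u_id"], row["window_num"]), {}))
        out.update(projs_by_key.get((row["p_id"], row["window_num"]), {}))
        merged.append(out)
    return merged


# Toy stand-ins for the three result tables (values are made up).
users = [
    {"u_id": 1, "window_num": 1, "u_age": 3},
    {"u_id": 2, "window_num": 1, "u_age": 5},
]
projs = [{"p_id": 10, "window_num": 1, "p_age": 8}]
user_projs = [
    {"u_id": 1, "p_id": 10, "window_num": 1, "u_is_owner": 1},
    {"u_id": 2, "p_id": 10, "window_num": 1, "u_is_owner": 0},
]

merged = merge_results(users, projs, user_projs)
for row in merged:
    print(row)
```

Each merged row then carries one user's data for one project in one three-month window, which is the shape the survival analysis expects.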
Our gender classifier uses names' n-grams, as well as results from two existing gender classifiers, NamSor and genderComputer, as features.
In the gender/ folder are two Python files that demonstrate how
our gender classifier works.
First, get_feature.py reads users'
information from the MySQL ght_namsor_s table, which contains
users' combined data from GHTorrent and origin and gender
information obtained from NamSor.
Then it gets classification results from genderComputer.
To get a better result from genderComputer, we need to know the
user's country.
For this, we use the origin data provided by NamSor, which is
inferred from the user's name.
There are other gender classifiers one can use, e.g., genderize.io. To use them, simply make the result a new feature in the model.
In determine_gender.py, our
classifier divides the name into n-grams and uses them as
additional features.
The result will be written to data/gender.csv, which will later
be used in sample_user.py for balanced sampling as described above.
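To make the feature construction more concrete, here is a minimal sketch of how n-gram features could be combined with the outputs of the two existing classifiers. It is not the code in determine_gender.py. The use of character n-grams, the n-gram sizes (2 and 3), the '^' and '$' padding, and the function names char_ngrams and build_features are all assumptions made for illustration.

```python
def char_ngrams(name, n_values=(2, 3)):
    """Return character n-grams of a lowercased name.

    Assumption: '^' and '$' mark the start and end of the name so that
    prefixes and suffixes become distinct features.
    """
    padded = "^" + name.strip().lower() + "$"
    grams = []
    for n in n_values:
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams


def build_features(name, namsor_gender, gendercomputer_gender):
    """Combine the two classifier outputs with n-gram indicator features."""
    feats = {
        "namsor": namsor_gender,
        "genderComputer": gendercomputer_gender,
    }
    for gram in char_ngrams(name):
        feats["ngram=" + gram] = 1
    return feats


print(char_ngrams("Anna", n_values=(2,)))  # ['^a', 'an', 'nn', 'na', 'a$']
print(build_features("Al", "male", "male"))
```

Under this layout, the output of another classifier such as genderize.io would simply become one more key in the feature dictionary.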
The survey analysis script is survey.R. It contains
code to calculate reliability measures and correlations, create plots, and conduct logistic regression analysis on the collected survey data.
The models reported in the paper are created by the
R script survival_analysis.R.
We have also included an anonymized version of the data we used for this paper
here.
Each row in the csv file is one data point in our model. It represents one user's activity in one project during one three-month window.
The csv file consists of 34 columns. Columns with the prefix "u_" contain information about users, and those with the prefix "p_" contain information about projects:
- u_age is the number of three-month windows since the user's first activity.
- u_commits_to_date is the number of commits made by this user across all projects up to that three-month window.
- u_email is the md5 hash of the user's email address.
- u_follower is the number of followers the user had up to that three-month window.
- u_gender is the user's gender.
- u_id is the id of the user in the GHTorrent dataset users table.
- u_login is the md5 hash of the user's login.
- u_nichewidth is the number of programming languages that the user had used up to that three-month window.
- u_projects_to_date is the number of projects to which the user had submitted commits up to that three-month window.
- u_temp_failure is a binary indicator of whether the user had been inactive for half a year (2 three-month windows).
- u_temp_failure_1_year is a binary indicator of whether the user had been inactive for a year (4 three-month windows).
- u_window_active_to_date is the number of three-month windows during which the user had submitted commits.
- window_num represents the current three-month window. 2008 Jan to 2008 Mar will be window_num = 1.
- owner_company is a binary indicator of whether the owner of the repository displays their company in their profile.
- owner_gender is the repository owner's gender, with -1 representing male, 1 female, and 0 unknown.
- p_id is the id of the project in the GHTorrent dataset projects table.
- u_is_major is a binary indicator of whether the user is a major contributor (more than 5% of commits) to that project.
- u_is_owner is a binary indicator of whether the user is the owner of that project.
- u_pr_merge is a binary indicator of whether the user can merge pull requests in that project.
- p_age is the number of three-month windows since the creation of that project.
- p_div_langdenom is the value of the programming language diversity of project p_id in window window_num.
- p_fam_no_decay is the value of the team familiarity of that project in that window.
- p_lang is the major programming language of that project.
- p_num_commits is the number of commits of that project in that window.
- p_num_commits_to_date is the total number of commits of that project since its creation.
- p_num_stars is the project's number of stars in that window.
- p_num_users_to_date is the number of users who had sent commits to the project up to that window.
- p_owner is the project owner's id in the GHTorrent dataset users table.
- p_recurring_co is the value of the recurring cohesion of that project in that window.
- p_sharenewcomers is the percentage of new GitHub users out of the users who had sent commits to that project.
- p_sharenewcomers is the percentage of new users to that project out of the users who had sent commits to that project.
- p_team_size
- p_windows_active_to_date is the number of three-month windows in which the project received commits.
- window is the three-month window in date format.
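As an illustration of the window numbering, the sketch below maps a calendar date to window_num. It relies only on the anchor given above (2008 Jan to 2008 Mar is window 1) and assumes that windows are consecutive calendar quarters. The function name window_num_for and the start_year parameter are hypothetical and not taken from the scripts.

```python
from datetime import date


def window_num_for(d, start_year=2008):
    """Map a date to its three-month window, with Jan-Mar 2008 as window 1.

    Assumption: windows are consecutive calendar quarters.
    """
    quarter = (d.month - 1) // 3  # 0 for Jan-Mar, ..., 3 for Oct-Dec
    return (d.year - start_year) * 4 + quarter + 1


print(window_num_for(date(2008, 2, 14)))  # 1
print(window_num_for(date(2008, 4, 1)))   # 2
print(window_num_for(date(2009, 1, 1)))   # 5
```

Under the same assumption, u_temp_failure would correspond to 2 consecutive window numbers without activity, and u_temp_failure_1_year to 4.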