This repository was archived by the owner on Dec 31, 2025. It is now read-only.

Commit 51c7377

Merge pull request #30 from poldrack/text/workflows-Dec27
Rename Python module from BetterCodeBetterScience to bettercode
2 parents 9c8f3ae + 43dc98b

144 files changed

Lines changed: 4565 additions & 178 deletions

Note: large commits have some content hidden by default; only a subset of the 144 changed files is shown below.

CLAUDE.md

Lines changed: 2 additions & 2 deletions
@@ -31,7 +31,7 @@ rm -rf book/_build
 pytest

 # Run tests with coverage
-pytest --cov=src/BetterCodeBetterScience --cov-report term-missing
+pytest --cov=src/bettercode --cov-report term-missing

 # Run specific test modules
 pytest tests/textmining/
@@ -62,7 +62,7 @@ pre-commit run --all-files
 ## Project Structure

 - `book/` - MyST markdown chapters (configured in myst.yml)
-- `src/BetterCodeBetterScience/` - Example Python code referenced in book chapters
+- `src/bettercode/` - Example Python code referenced in book chapters
 - `tests/` - Test examples demonstrating testing concepts from the book
 - `data/` - Data files for examples
 - `scripts/` - Utility scripts
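The coverage command above reports which lines the test suite actually exercises. As a quick illustration of what `pytest --cov ... --cov-report term-missing` measures, here is a hypothetical module/test pair (not from the repository); an untested branch is exactly the kind of line that the `term-missing` report would flag:

```python
# Hypothetical module/test pair for illustrating coverage; neither function
# is part of the bettercode package.
def word_count(text: str) -> int:
    """Count whitespace-separated words, treating blank input as empty."""
    if not text.strip():
        return 0  # without a test for blank input, term-missing would flag this line
    return len(text.split())


def test_word_count():
    assert word_count("better code better science") == 4
    assert word_count("   ") == 0


test_word_count()
print("ok")
```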

book/AI_coding_assistants.md

Lines changed: 3 additions & 3 deletions
@@ -96,7 +96,7 @@ def linear_regression_normal_eq(X: np.ndarray, y: np.ndarray) -> np.ndarray:
 ```

 Unlike the previous examples, the code now includes type hints.
-It's always a bad idea to generalize from a single result, so we ran these prompts through ChatGPT 10 times each (using the Openai API to generate them programmatically; see the [notebook](../src/BetterCodeBetterScience/incontext_learning_example.ipynb)).
+It's always a bad idea to generalize from a single result, so we ran these prompts through ChatGPT 10 times each (using the Openai API to generate them programmatically; see the [notebook](../src/bettercode/incontext_learning_example.ipynb)).
 Here are the function signatures generated for each of the 10 runs without mentioning type hints:

 ```
@@ -272,7 +272,7 @@ In addition to the time and labor of running things by hand, it is also a recipe

 You might be asking at this point, "What's an API"? The acronym stands for "Application Programming Interface", which is a method by which one can programmatically send commands to and receive responses from a computer system, which could be local or remote[^1].
 To understand this better, let's see how to send a chat command and receive a response from the Claude language model.
-The full outline is in [the notebook](https://github.com/poldrack/BetterCodeBetterScience/blob/main/src/BetterCodeBetterScience/language_model_api_prompting.ipynb).
+The full outline is in [the notebook](https://github.com/poldrack/BetterCodeBetterScience/blob/main/src/bettercode/language_model_api_prompting.ipynb).
 Coding agents are very good at generating code to perform API calls, so I used Claude Sonnet 4 to generate the example code in the notebook:

 ```python
@@ -358,7 +358,7 @@ Let's see how we could get the previous example to return a JSON object containi
 Here we will use a function called `send_prompt_to_claude()` that wraps the call to the model object and returns the text from the result:

 ```python
-from BetterCodeBetterScience.llm_utils import send_prompt_to_claude
+from bettercode.llm_utils import send_prompt_to_claude

 json_prompt = """
 What is the capital of France?
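The last hunk above concerns getting a JSON object back from the model. One practical wrinkle, not shown in the excerpt, is that chat models often wrap JSON replies in Markdown code fences, so the raw text cannot always be passed straight to `json.loads`. A minimal, hypothetical helper for this (it is not part of `bettercode.llm_utils`) might look like:

```python
import json


def extract_json(reply: str) -> dict:
    """Parse a JSON object from a model reply, tolerating ```json fences.

    Hypothetical helper for illustration; not from the book's code.
    """
    text = reply.strip()
    if text.startswith("```"):
        # Drop the opening fence line (possibly "```json") and the closing fence
        lines = text.splitlines()
        if lines[-1].strip() == "```":
            lines = lines[:-1]
        text = "\n".join(lines[1:])
    return json.loads(text)


# A fenced reply of the kind a model might produce for the capital-of-France prompt
reply = '```json\n{"capital": "Paris"}\n```'
print(extract_json(reply))  # {'capital': 'Paris'}
```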

book/data_management.md

Lines changed: 9 additions & 9 deletions
@@ -471,7 +471,7 @@ df_merged = pd.concat([df1, df2, df3], ignore_index=True)

 The most common file formats are *comma-separated value* (CSV) or *tab-separated value* (TSV) files. Both of these have the benefit of being represented in plain text, so their contents can be easily examined without any special software. I generally prefer to use tabs rather than commas as the separator (or *delimiter*), primarily because they can more easily naturally represent longer pieces of text that may include commas. These can also be represented using CSV, but they require additional processing in order to *escape* the commas within the text so that they are not interpreted as delimiters.

-Text file formats like CSV and TSV are nice for their ease of interpretability, but they are highly inefficient for large data compared to optimized file formats, such as the *Parquet* format. To see this in action, I loaded a brain image and saved all of the non-zero data points (857,785 to be exact) to a data frame, which I then saved to CSV and Parquet formats; see [the management notebook](src/BetterCodeBetterScience/data_management.ipynb) for details. Looking at the resulting files, we can see that the Parquet file is only about 20% the size of the CSV file:
+Text file formats like CSV and TSV are nice for their ease of interpretability, but they are highly inefficient for large data compared to optimized file formats, such as the *Parquet* format. To see this in action, I loaded a brain image and saved all of the non-zero data points (857,785 to be exact) to a data frame, which I then saved to CSV and Parquet formats; see [the management notebook](src/bettercode/data_management.ipynb) for details. Looking at the resulting files, we can see that the Parquet file is only about 20% the size of the CSV file:

 ```bash
 ➤ du -sk /tmp/brain_tabular.*
@@ -718,7 +718,7 @@ In this section we discuss data organization. The most important principle of da

 ### File granularity

-One common decision that we need to make when managing data is to save data in more smaller files versus fewer larger files. The right answer to this question depends in part on how we will have to access the data. If we only need to access a small portion of the data and we can easily determine which file to open to obtain those data, then it probably makes sense to save many small files. However, if we need to combine data across many small files, then it likely makes sense to save the data as one large file. For example, in the [data management notebook](src/BetterCodeBetterScience/data_management.ipynb) there is an example where we create a large (10000 x 100000) matrix of random numbers, and save them either to a single file or to a separate file for each row. When loading these data, the loading of the single file is about 5 times faster than loading the individual files.
+One common decision that we need to make when managing data is to save data in more smaller files versus fewer larger files. The right answer to this question depends in part on how we will have to access the data. If we only need to access a small portion of the data and we can easily determine which file to open to obtain those data, then it probably makes sense to save many small files. However, if we need to combine data across many small files, then it likely makes sense to save the data as one large file. For example, in the [data management notebook](src/bettercode/data_management.ipynb) there is an example where we create a large (10000 x 100000) matrix of random numbers, and save them either to a single file or to a separate file for each row. When loading these data, the loading of the single file is about 5 times faster than loading the individual files.

 Another consideration about the number of files has to do with storage systems that are commonly used on high-performance computing systems. On these systems, it is common to have separate quotas for total space used (e.g., in terabytes) as well as for the number of *inodes*, which are structures that store information about files and folders on a UNIX filesystem. Thus, generating many small files (e.g., millions) can sometimes cause problems on these systems. For this reason, we generally err on the side of generating fewer larger files versus more smaller files when working on high-performance computing systems.

@@ -1038,7 +1038,7 @@ unlock(ok): my_datalad_repo/data/demographics.csv (file)
 We then use a Python script to make the change, which in this case is removing some columns from the dataset:

 ```bash
-➤ python src/BetterCodeBetterScience/modify_data.py my_datalad_repo/data/demographics.csv
+➤ python src/bettercode/modify_data.py my_datalad_repo/data/demographics.csv

 ```

@@ -1074,15 +1074,15 @@ nothing to save, working tree clean
 Although the previous example was meant to provide background on how DataLad works, in practice there is actually a much easier way to accomplish these steps, which is by using the [`datalad run`](https://docs.datalad.org/en/stable/generated/man/datalad-run.html) command. This command will automatically take care of fetching and unlocking the relevant files, running the command, and then committing the files back in, generating a commit message that tracks the specific command that was used:

 ```bash
-➤ datalad run -i my_datalad_repo/data/demographics.csv -o my_datalad_repo/data/demographics.csv -- uv run src/BetterCodeBetterScience/modify_data.py my_datalad_repo/data/demographics.csv
+➤ datalad run -i my_datalad_repo/data/demographics.csv -o my_datalad_repo/data/demographics.csv -- uv run src/bettercode/modify_data.py my_datalad_repo/data/demographics.csv
 [INFO ] Making sure inputs are available (this may take some time)
 unlock(ok): my_datalad_repo/data/demographics.csv (file)
 [INFO ] == Command start (output follows) =====
 Built bettercodebetterscience @ file:///Users/poldrack/Dropbox/code/BetterCode
 Uninstalled 1 package in 1ms
 Installed 1 package in 1ms
 [INFO ] == Command exit (modification check follows) =====
-run(ok): /Users/poldrack/Dropbox/code/BetterCodeBetterScience (dataset) [uv run src/BetterCodeBetterScience/modif...]
+run(ok): /Users/poldrack/Dropbox/code/BetterCodeBetterScience (dataset) [uv run src/bettercode/modif...]
 add(ok): data/demographics.csv (file)
 save(ok): my_datalad_repo (dataset)
 add(ok): my_datalad_repo (dataset)
@@ -1095,12 +1095,12 @@ commit 3ef3b94a0abffec6a8db7570a97339f48ee728ed (HEAD -> text/datamgmt-Nov3)
 Author: Russell Poldrack <poldrack@gmail.com>
 Date: Mon Dec 15 13:28:06 2025 -0800

-[DATALAD RUNCMD] uv run src/BetterCodeBetterScience/modif...
+[DATALAD RUNCMD] uv run src/bettercode/modif...

 === Do not change lines below ===
 {
 "chain": [],
-"cmd": "uv run src/BetterCodeBetterScience/modify_data.py my_datalad_repo/data/demographics.csv",
+"cmd": "uv run src/bettercode/modify_data.py my_datalad_repo/data/demographics.csv",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [
@@ -1220,7 +1220,7 @@ The question that I will ask is as follows: How well can the biological similari
 - A dataset of genome-wise association study (GWAS) results for specific traits obtained from [here](https://www.ebi.ac.uk/gwas/docs/file-downloads).
 - Abstracts that refer to each of the traits identified in the GWAS result, obtained from the [PubMed](https://pubmed.ncbi.nlm.nih.gov/) database.

-I will not present all of the code for each step; this can be found [here](src/BetterCodeBetterScience/database_example_funcs.py) and [here](src/BetterCodeBetterScience/database.py). Rather, I will show portions that are particularly relevant to the databases being used.
+I will not present all of the code for each step; this can be found [here](src/bettercode/database_example_funcs.py) and [here](src/bettercode/database.py). Rather, I will show portions that are particularly relevant to the databases being used.

 ### Adding GWAS data to a document store

@@ -1236,7 +1236,7 @@ In this case, looking at the data we see that several columns contain multiple v
 gwas_data = get_exploded_gwas_data()
 ```

-We can now import the data from this data frame into a MongoDB collection, mapping each unique trait to the genes that are reported as being associated with it. First I generated a separate function that sets up a MongoDB collection (see `setup_mongo_collection` [here](src/BetterCodeBetterScience/database.py)). We can then use that function to set up our gene set collection:
+We can now import the data from this data frame into a MongoDB collection, mapping each unique trait to the genes that are reported as being associated with it. First I generated a separate function that sets up a MongoDB collection (see `setup_mongo_collection` [here](src/bettercode/database.py)). We can then use that function to set up our gene set collection:


 ```python
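The first hunk above argues that tab delimiters avoid the comma-escaping that CSV requires for free text. The difference is easy to see with the standard library's `csv` module (a generic sketch, not code from the book; the row contents are made up):

```python
import csv
import io

# A row whose text field contains commas, as in survey or abstract data
row = ["subj-01", "I say things without thinking, often, and quickly"]

# CSV: the writer must quote the field so its commas aren't read as delimiters
csv_buf = io.StringIO()
csv.writer(csv_buf).writerow(row)
print(csv_buf.getvalue().strip())  # subj-01,"I say things without thinking, often, and quickly"

# TSV: with a tab delimiter the text is stored verbatim, no escaping needed
tsv_buf = io.StringIO()
csv.writer(tsv_buf, delimiter="\t").writerow(row)
print(tsv_buf.getvalue().strip())

# Both round-trip to the original row when read back with the matching delimiter
assert next(csv.reader(io.StringIO(csv_buf.getvalue()))) == row
assert next(csv.reader(io.StringIO(tsv_buf.getvalue()), delimiter="\t")) == row
```

The quoting in the CSV output is exactly the extra processing the text refers to; the TSV line needs none.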

book/project_organization.md

Lines changed: 2 additions & 2 deletions
@@ -237,7 +237,7 @@ A final way that one might use notebooks is as a way to create standalone progra

 It's very common for researchers to use different coding languages to solve different problems. A common use case is the Python user who wishes to take advantage of the much wider range of statistical methods that are implemented in R. There is a package called `rpy2` that allows this within pure Python code, but it can be cumbersome to work with, particularly due to the need to convert complex data types. Fortunately, Jupyter notebooks provide a convenient solution to this problem, via [*magic* commands](https://scipy-ipython.readthedocs.io/en/latest/interactive/magics.html). These are commands that start with either a `%` (for line commands) or `%%` for cell commands, which enable additional functionality.

-An example of this can be seen in the [mixing_languages.ipynb](src/BetterCodeBetterScience/notebooks/mixing_languages.ipynb) notebook, in which we load and preprocess some data using Python and then use R magic commands to analyze the data using a package only available within R. In this example, we will work with data from a study published by our laboratory (Eisenberg et al., 2019), in which 522 people completed a large battery of psychological tests and surveys. We will focus here on the responses to a survey known as the "Barratt Impulsiveness Scale" which includes 30 questions related to different aspects of the psychological construct of "impulsiveness"; for example, "I say things without thinking" or "I plan tasks carefully". Each participant rated each of these statements on a four-point scale from 'Rarely/Never' to 'Almost Always/Always'; the scores were coded so that the number 1 always represented the most impulsive choice and 4 represented the most self-controlled choice.
+An example of this can be seen in the [mixing_languages.ipynb](src/bettercode/notebooks/mixing_languages.ipynb) notebook, in which we load and preprocess some data using Python and then use R magic commands to analyze the data using a package only available within R. In this example, we will work with data from a study published by our laboratory (Eisenberg et al., 2019), in which 522 people completed a large battery of psychological tests and surveys. We will focus here on the responses to a survey known as the "Barratt Impulsiveness Scale" which includes 30 questions related to different aspects of the psychological construct of "impulsiveness"; for example, "I say things without thinking" or "I plan tasks carefully". Each participant rated each of these statements on a four-point scale from 'Rarely/Never' to 'Almost Always/Always'; the scores were coded so that the number 1 always represented the most impulsive choice and 4 represented the most self-controlled choice.

 In order to enable the R magic commands, we first need to load the rpy2 extension for Jupyter:

@@ -526,7 +526,7 @@ test output from container
 To create a reproducible software execution environment, we will often need to create our own new Docker image that contains the necessary dependencies and application code. AI coding tools are generally quite good at creating the required `Dockerfile` that defines the image. We use the following prompt to Claude Sonnet 4:

 ```
-I would like to generate a Dockerfile to define a Docker image based on the python:3.13.9 image. The Python package wonderwords should be installed from PyPi. A local Python script should be created that creates a random sentence using wonderwords.RandomSentence() and prints it. This script should be the entrypoint for the Docker container. Create this within src/BetterCodeBetterScience/docker-example inside the current project. Do not create a new workspace - use the existing workspace for this project.
+I would like to generate a Dockerfile to define a Docker image based on the python:3.13.9 image. The Python package wonderwords should be installed from PyPi. A local Python script should be created that creates a random sentence using wonderwords.RandomSentence() and prints it. This script should be the entrypoint for the Docker container. Create this within src/bettercode/docker-example inside the current project. Do not create a new workspace - use the existing workspace for this project.
 ```

 Here is the content of the resulting `Dockerfile`:
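The prompt in the last hunk describes what Claude was asked to produce; the generated file itself appears in the chapter but is outside this diff. A minimal Dockerfile consistent with that prompt might look like the following sketch (illustrative only, not the model's actual output; the script name is hypothetical):

```dockerfile
# Illustrative sketch; the chapter shows the Dockerfile Claude actually generated
FROM python:3.13.9

# Install the wonderwords package from PyPI
RUN pip install --no-cache-dir wonderwords

WORKDIR /app

# random_sentence.py (hypothetical name) calls wonderwords.RandomSentence()
# and prints the resulting sentence
COPY random_sentence.py .

ENTRYPOINT ["python", "random_sentence.py"]
```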

book/software_engineering.md

Lines changed: 9 additions & 9 deletions
@@ -758,7 +758,7 @@ C = 299792458
 We could then import this from our module within the iPython shell:

 ```
-In: from BetterCodeBetterScience.constants import C
+In: from bettercode.constants import C

 In: C
 Out: 299792458
@@ -793,7 +793,7 @@ class Constants:
 Then within our iPython shell, we generate an instance of the Constants class, and see what happens if we try to change the value once it's instantiated:

 ```
-In: from BetterCodeBetterScience.constants import Constants
+In: from bettercode.constants import Constants

 In: constants = Constants()

@@ -806,7 +806,7 @@ AttributeError Traceback (most recent call last)
 Cell In[4], line 1
 ----> 1 constants.C = 42

-File ~/Dropbox/code/BetterCodeBetterScience/src/BetterCodeBetterScience/constants.py:11, in Constants.__setattr__(self, name, value)
+File ~/Dropbox/code/BetterCodeBetterScience/src/bettercode/constants.py:11, in Constants.__setattr__(self, name, value)
 10 def __setattr__(self, name, value):
 ---> 11 raise AttributeError("Constants cannot be modified")

@@ -847,8 +847,8 @@ We see that `ruff` detects both formatting problems (such as the lack of spaces
 We can also use `ruff` from the command line to detect and fix code problems:

 ```bash
-❯ ruff check src/BetterCodeBetterScience/formatting_example.py
-src/BetterCodeBetterScience/formatting_example.py:6:1: F403 `from numpy.random import *` used; unable to detect undefined names
+❯ ruff check src/bettercode/formatting_example.py
+src/bettercode/formatting_example.py:6:1: F403 `from numpy.random import *` used; unable to detect undefined names
 |
 4 | # Poorly formatted code for linting example
 5 |
@@ -858,7 +858,7 @@ src/BetterCodeBetterScience/formatting_example.py:6:1: F403 `from numpy.random i
 8 | mynum=randint(0,100)
 |

-src/BetterCodeBetterScience/formatting_example.py:8:7: F405 `randint` may be undefined, or defined from star imports
+src/bettercode/formatting_example.py:8:7: F405 `randint` may be undefined, or defined from star imports
 |
 6 | from numpy.random import *
 7 |
@@ -872,12 +872,12 @@ Found 2 errors.
 Most linters can also automatically fix the issues that they detect in the code. `ruff` modifies the file in place, so we will first create a copy (so that our original remains intact) and then run the formatter on that copy:

 ```bash
-❯ cp src/BetterCodeBetterScience/formatting_example.py src/BetterCodeBetterScience/formatting_example_ruff.py
+❯ cp src/bettercode/formatting_example.py src/bettercode/formatting_example_ruff.py

-❯ ruff format src/BetterCodeBetterScience/formatting_example_ruff.py
+❯ ruff format src/bettercode/formatting_example_ruff.py
 1 file reformatted

-❯ diff src/BetterCodeBetterScience/formatting_example.py src/BetterCodeBetterScience/formatting_example_ruff.py
+❯ diff src/bettercode/formatting_example.py src/bettercode/formatting_example_ruff.py
 1,3d0
 <
 <
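The tracebacks in the hunks above imply the shape of the `Constants` class under discussion: a module-level value `C = 299792458` and a class whose `__setattr__` raises on any assignment. A sketch consistent with those excerpts (the book's actual `constants.py` may differ in detail):

```python
# Sketch of a frozen-constants class consistent with the excerpts above;
# the book's constants.py may differ in detail.
class Constants:
    C = 299792458  # speed of light in m/s, as in the module-level example

    def __setattr__(self, name, value):
        # Reject any attribute assignment on an instance
        raise AttributeError("Constants cannot be modified")


constants = Constants()
print(constants.C)  # 299792458

try:
    constants.C = 42
except AttributeError as err:
    print(err)  # Constants cannot be modified
```

Note that overriding `__setattr__` guards only instance assignment; `Constants.C = 42` on the class itself would still succeed, which is one reason `@dataclass(frozen=True)` or plain module-level constants are common alternatives.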
