This repository was archived by the owner on Dec 31, 2025. It is now read-only.
Unlike the previous examples, the code now includes type hints.
It's always a bad idea to generalize from a single result, so we ran these prompts through ChatGPT 10 times each (using the OpenAI API to generate them programmatically; see the [notebook](../src/bettercode/incontext_learning_example.ipynb)).
Here are the function signatures generated for each of the 10 runs without mentioning type hints:
You might be asking at this point, "What's an API?" The acronym stands for "Application Programming Interface": a method by which one can programmatically send commands to and receive responses from a computer system, whether local or remote[^1].
To understand this better, let's see how to send a chat command and receive a response from the Claude language model.
The full outline is in [the notebook](https://github.com/poldrack/BetterCodeBetterScience/blob/main/src/bettercode/language_model_api_prompting.ipynb).
Coding agents are very good at generating code to perform API calls, so I used Claude Sonnet 4 to generate the example code in the notebook:
Here we will use a function called `send_prompt_to_claude()` that wraps the call to the model object and returns the text from the result:
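Although the notebook's implementation is not reproduced here, a wrapper along these lines can be sketched as follows (the model name is an assumption, and the client is passed in explicitly so that it can be replaced with a stub when testing without an API key):

```python
def send_prompt_to_claude(prompt, client,
                          model="claude-sonnet-4-20250514", max_tokens=1024):
    """Send a single user prompt to Claude and return the text of the response.

    `client` is expected to behave like an anthropic.Anthropic() instance;
    the default model name here is an assumption, not the book's choice.
    """
    message = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    # The response content is a list of blocks; the first block holds the text.
    return message.content[0].text
```

With the real SDK this would be called as `send_prompt_to_claude("Hello", anthropic.Anthropic())`, with the `ANTHROPIC_API_KEY` environment variable set.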
The most common file formats are *comma-separated value* (CSV) or *tab-separated value* (TSV) files. Both have the benefit of being plain text, so their contents can be easily examined without any special software. I generally prefer to use tabs rather than commas as the separator (or *delimiter*), primarily because tabs more naturally accommodate longer pieces of text that may include commas. Such text can also be represented using CSV, but it requires additional processing to *escape* the embedded commas so that they are not interpreted as delimiters.
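To see the escaping issue concretely, here is a small sketch using pandas (the column names are made up):

```python
import io

import pandas as pd

df = pd.DataFrame({"id": [1], "note": ["fast, cheap, and accurate"]})

# With a comma delimiter, the text field must be quoted to escape its commas...
csv_text = df.to_csv(index=False)
# ...but with a tab delimiter the embedded commas need no special treatment.
tsv_text = df.to_csv(sep="\t", index=False)

print(csv_text)
print(tsv_text)
# Either format round-trips cleanly through pandas:
restored = pd.read_csv(io.StringIO(tsv_text), sep="\t")
```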
Text file formats like CSV and TSV are attractive for their interpretability, but they are highly inefficient for large data compared to optimized formats such as *Parquet*. To see this in action, I loaded a brain image and saved all of the non-zero data points (857,785 to be exact) to a data frame, which I then saved to CSV and Parquet formats; see [the data management notebook](src/bettercode/data_management.ipynb) for details. Looking at the resulting files, we can see that the Parquet file is only about 20% the size of the CSV file:
```bash
➤ du -sk /tmp/brain_tabular.*
```
### File granularity
One common decision that we need to make when managing data is whether to save data in many smaller files or fewer larger files. The right answer depends in part on how we will need to access the data. If we only need to access a small portion of the data and can easily determine which file to open to obtain it, then it probably makes sense to save many small files. However, if we need to combine data across many small files, then it likely makes sense to save the data as one large file. For example, in the [data management notebook](src/bettercode/data_management.ipynb) there is an example where we create a large (10000 x 100000) matrix of random numbers and save it either to a single file or to a separate file for each row. When loading these data, the single file loads about 5 times faster than the individual files.
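A scaled-down sketch of that comparison (a much smaller matrix than the notebook's, so that it runs quickly; the exact speedup will vary by filesystem):

```python
import os
import tempfile
import time

import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((500, 100))

with tempfile.TemporaryDirectory() as tmpdir:
    # One large file holding the whole matrix.
    big_file = os.path.join(tmpdir, "matrix.npy")
    np.save(big_file, data)

    # One small file per row.
    row_files = []
    for i, row in enumerate(data):
        path = os.path.join(tmpdir, f"row_{i:04d}.npy")
        np.save(path, row)
        row_files.append(path)

    start = time.perf_counter()
    loaded_big = np.load(big_file)
    t_single = time.perf_counter() - start

    start = time.perf_counter()
    loaded_rows = np.vstack([np.load(f) for f in row_files])
    t_many = time.perf_counter() - start

print(f"single file: {t_single:.4f}s, many files: {t_many:.4f}s")
```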
Another consideration regarding the number of files involves the storage systems commonly used in high-performance computing. On these systems, it is common to have separate quotas for total space used (e.g., in terabytes) and for the number of *inodes*, the structures that store information about files and folders on a UNIX filesystem. Generating many small files (e.g., millions) can thus sometimes cause problems on these systems. For this reason, we generally err on the side of fewer, larger files rather than many smaller ones when working on high-performance computing systems.
Although the previous example was meant to provide background on how DataLad works, in practice there is actually a much easier way to accomplish these steps, which is by using the [`datalad run`](https://docs.datalad.org/en/stable/generated/man/datalad-run.html) command. This command will automatically take care of fetching and unlocking the relevant files, running the command, and then committing the files back in, generating a commit message that tracks the specific command that was used:
```bash
➤ datalad run -i my_datalad_repo/data/demographics.csv -o my_datalad_repo/data/demographics.csv -- uv run src/bettercode/modify_data.py my_datalad_repo/data/demographics.csv
[INFO ] Making sure inputs are available (this may take some time)
[DATALAD RUNCMD] uv run src/bettercode/modif...
=== Do not change lines below ===
{
"chain": [],
"cmd": "uv run src/bettercode/modify_data.py my_datalad_repo/data/demographics.csv",
"exit": 0,
"extra_inputs": [],
"inputs": [
...
```

The question that I will ask is as follows: How well can the biological similarity…
- A dataset of genome-wide association study (GWAS) results for specific traits obtained from [here](https://www.ebi.ac.uk/gwas/docs/file-downloads).
- Abstracts that refer to each of the traits identified in the GWAS result, obtained from the [PubMed](https://pubmed.ncbi.nlm.nih.gov/) database.
I will not present all of the code for each step; this can be found [here](src/bettercode/database_example_funcs.py) and [here](src/bettercode/database.py). Rather, I will show portions that are particularly relevant to the databases being used.
### Adding GWAS data to a document store
In this case, looking at the data we see that several columns contain multiple values…

```python
gwas_data = get_exploded_gwas_data()
```
We can now import the data from this data frame into a MongoDB collection, mapping each unique trait to the genes that are reported as being associated with it. First I generated a separate function that sets up a MongoDB collection (see `setup_mongo_collection` [here](src/bettercode/database.py)). We can then use that function to set up our gene set collection:
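The import itself is elided here, but the general pattern can be sketched as follows (the column names `trait` and `gene` and the helper function are illustrative assumptions, not the repository's actual code):

```python
import pandas as pd

def gwas_to_documents(gwas_data: pd.DataFrame) -> list[dict]:
    """Build one document per unique trait, listing its associated genes."""
    return [
        {"trait": trait, "genes": sorted(group["gene"].unique())}
        for trait, group in gwas_data.groupby("trait")
    ]

# A toy exploded GWAS table with one row per trait-gene association:
gwas_data = pd.DataFrame(
    {
        "trait": ["height", "height", "asthma"],
        "gene": ["HMGA2", "GDF5", "IL33"],
    }
)
documents = gwas_to_documents(gwas_data)
# With a real pymongo collection, the documents would then be inserted with:
#   collection.insert_many(documents)
```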
book/project_organization.md
It's very common for researchers to use different coding languages to solve different problems. A common use case is the Python user who wishes to take advantage of the much wider range of statistical methods implemented in R. There is a package called `rpy2` that allows this within pure Python code, but it can be cumbersome to work with, particularly due to the need to convert complex data types. Fortunately, Jupyter notebooks provide a convenient solution to this problem via [*magic* commands](https://scipy-ipython.readthedocs.io/en/latest/interactive/magics.html). These are commands that start with either `%` (for line commands) or `%%` (for cell commands) and enable additional functionality.
An example of this can be seen in the [mixing_languages.ipynb](src/bettercode/notebooks/mixing_languages.ipynb) notebook, in which we load and preprocess some data using Python and then use R magic commands to analyze the data using a package only available within R. In this example, we will work with data from a study published by our laboratory (Eisenberg et al., 2019), in which 522 people completed a large battery of psychological tests and surveys. We will focus here on the responses to a survey known as the "Barratt Impulsiveness Scale" which includes 30 questions related to different aspects of the psychological construct of "impulsiveness"; for example, "I say things without thinking" or "I plan tasks carefully". Each participant rated each of these statements on a four-point scale from 'Rarely/Never' to 'Almost Always/Always'; the scores were coded so that the number 1 always represented the most impulsive choice and 4 represented the most self-controlled choice.
In order to enable the R magic commands, we first need to load the rpy2 extension for Jupyter:
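Loading the extension takes a single line magic in a notebook cell:

```
%load_ext rpy2.ipython
```

A Python data frame can then be passed into an R cell with the `-i` flag; the cell below is an illustrative sketch (the variable name and the R call are assumptions, not the book's actual analysis):

```
%%R -i df
summary(df)
```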
To create a reproducible software execution environment, we will often need to create our own Docker image that contains the necessary dependencies and application code. AI coding tools are generally quite good at creating the required `Dockerfile` that defines the image. We gave the following prompt to Claude Sonnet 4:
```
I would like to generate a Dockerfile to define a Docker image based on the python:3.13.9 image. The Python package wonderwords should be installed from PyPi. A local Python script should be created that creates a random sentence using wonderwords.RandomSentence() and prints it. This script should be the entrypoint for the Docker container. Create this within src/bettercode/docker-example inside the current project. Do not create a new workspace - use the existing workspace for this project.
```
Here is the content of the resulting `Dockerfile`:
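The generated file itself is elided in this excerpt; a minimal Dockerfile meeting the prompt's requirements might look something like this sketch (the script name is an assumption, and the actual generated content may differ):

```dockerfile
FROM python:3.13.9

# Install the wonderwords package from PyPI
RUN pip install --no-cache-dir wonderwords

WORKDIR /app

# random_sentence.py calls wonderwords.RandomSentence() and prints the result
COPY random_sentence.py .

ENTRYPOINT ["python", "random_sentence.py"]
```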
book/software_engineering.md
Suppose our constants module defines `C = 299792458` (the speed of light in meters per second). We could then import this from our module within the iPython shell:
```
In: from bettercode.constants import C
In: C
Out: 299792458
```
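The immutability can also be seen in a self-contained sketch (this assumes `Constants` is implemented as a frozen dataclass, which is one common way to get this behavior):

```python
from dataclasses import FrozenInstanceError, dataclass

@dataclass(frozen=True)
class Constants:
    """Physical constants; frozen=True makes instances read-only."""
    C: int = 299792458  # speed of light (m/s)

constants = Constants()
print(constants.C)  # → 299792458
try:
    constants.C = 1  # attempting to modify a frozen instance raises an error
except FrozenInstanceError as err:
    print(f"error: {err}")
```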
Then within our iPython shell, we generate an instance of the Constants class, and see what happens if we try to change the value once it's instantiated:
```
In: from bettercode.constants import Constants
```

```
src/bettercode/formatting_example.py:6:1: F403 `from numpy.random import *` used; unable to detect undefined names
  |
6 | from numpy.random import *
7 |
8 | mynum=randint(0,100)
  |
src/bettercode/formatting_example.py:8:7: F405 `randint` may be undefined, or defined from star imports
  |
6 | from numpy.random import *
7 |
8 | mynum=randint(0,100)
  |

Found 2 errors.
```
Most linters can also automatically fix the issues that they detect in the code. `ruff` modifies the file in place, so we will first create a copy (so that our original remains intact) and then run the formatter on that copy: