
Add framesets (.dtas)#28

Open
rpguiteras wants to merge 21 commits into main from frames

Conversation

@rpguiteras

Collaborator

Add support for Stata framesets (.dtas).

The main substantive changes are in pypkg/src/pystatacons/stata_utils.py and src/complete_datasignature.ado.

The checksum for a .dtas file is the concatenation of the checksums of its individual frames, in alphabetical order by frame name. For example:

output/data/project-area.dtas: LGD_district=23:4(30770):1387127574:1572535773:1213271928|LGD_subdistrict=97:9(94209):517290162:1301972081:2412684702

The checksum ignores the order in which frames are saved within the frameset, and it ignores each frame's date characteristic.
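The concatenation rule above can be sketched in a few lines of Python. This is only an illustration, not the code in stata_utils.py: the function name `combine_frame_signatures` is hypothetical, and the `name=signature` pairs joined by `|` are inferred from the example line in this description. Note that Python's `sorted` uses code-point order, which matches alphabetical order only when frame names share the same case.

```python
def combine_frame_signatures(frame_sigs):
    """Join per-frame signatures into a single frameset signature.

    frame_sigs maps frame name -> that frame's datasignature string.
    Sorting by frame name makes the result independent of the order
    in which the frames were saved. (Hypothetical helper.)
    """
    return "|".join(f"{name}={sig}" for name, sig in sorted(frame_sigs.items()))


# Frames supplied in "wrong" order on purpose; the output is still sorted.
sigs = {
    "LGD_subdistrict": "97:9(94209):517290162:1301972081:2412684702",
    "LGD_district": "23:4(30770):1387127574:1572535773:1213271928",
}
combined = combine_frame_signatures(sigs)
print(combined)
```

With the two frames from the example, the printed string matches the signature shown for output/data/project-area.dtas.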

statacons.ado and stataconsign.ado are not substantively altered (only the version number is bumped).

Also added examples in frames/examples and tests in frames/tests.

This still works on Stata 16 and 17, even though those versions cannot use framesets (.dtas): statacons falls back to the default SCons checksum for .dtas files. (It works in the sense that statacons parses everything properly and gives warnings; code that actually uses .dtas with Stata < 18 will, of course, fail.)

rpguiteras and others added 18 commits May 19, 2026 15:55
Support signing Stata .dtas framesets end-to-end: add a frameset_file option to complete_datasignature.ado (bumped version to 3.0.4) which signs each frame, concatenates results, and ignores time-tainted frlink_* chars; add skip_char to control excluded characteristics and tolerate empty frames. Add dev helper dev_adopath_prefix to let tests point Stata at a local src/ during development. Extend Python SCons integration: accept a file_arg in get_datasign, emit the dev adopath prefix into generated recipes, add get_dtas_sign wrapper, and register .dtas in special_sig_fns so SCons pipelines can build/consume framesets. Add SCons test pipeline and several Stata smoke/unit test scripts (producer/consumer, smoke tests) and update tests/SConstruct and statacons_test.do to exercise the .dtas path.
Remove embedded program definitions from tests/statacons_test.do and add separate helper files: tests/store_modts.ado, tests/touch_dta.ado, and tests/write_txt.ado. This separates reusable test helper programs (store_modts, touch_dta, write_txt) into their own .ado files for clarity and reuse; no functional changes intended.
Both are local-only private files (project instructions and session log)
that should not appear in the public repo.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Includes Stata-bundled datasets (auto, census) and Stata Press webuse
datasets (persons, txcounty, family, discharge1/2, hsng) along with
_refresh_datasets.do to re-download the webuse datasets from Stata's servers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Four examples covering: frame create/change/copy/put/drop/reset (01),
frlink/frget/fralias/frval (02), frame post Monte Carlo pattern (03),
and frames save/use/modify/describe (04). Datasets loaded via
`local datasets "../datasets"` relative to frames/datasets/.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three files: sources-format.md (bibliography of .dtas format sources),
sources-applications.md (bibliography of frames application sources),
and stata-features-frameset.md (format summary notes). Verbatim Stata
IP (.sthlp, PDFs, blog-post conversions) stays in a private repo.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… signing

Tests: basic round-trip, empty default frame, name collision, frame count,
and signature stability across re-saves. All cases pass. Must be run by
opening Stata interactively and doing the file -- not with -e do.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduce a comprehensive test harness for .dtas support under frames/tests: SConstruct files for scons-driven builds, producer/consumer Stata do-files for dtas and linked workflows, interactive and smoke test scripts (including SCons rerun checks), and a Python helper (make_malformed_dtas.py) to generate malformed .dtas fixtures. Also add testlib helpers, run_all wrapper, and binary fixtures for expected outputs. These tests validate signature stability across compression/topology/alias/linked workflows, error handling for malformed archives, and SCons no-rebuild behavior on identical reruns.
Move legacy small-data .dtas test scaffolding from the top-level tests/ folder into frames/tests. Added SConstruct-legacy, legacy producer/consumer do-files, legacy interactive round-trip and legacy smoke scripts, and integrated them into run_all.do and README-tests.md. Removed the old duplicates from tests/ to centralize the harness and keep branch-only legacy scripts out of the top-level tests tree.
Introduce a new 'frameset_signing' SCons config (auto|enabled|disabled) to control .dtas (frameset) signing on Stata <18. Implement logic in pystatacons/stata_utils: get_dtas_sign now detects Stata version once, delegates to get_datasign on Stata 18+, and in Stata <18 either falls back to MD5 with a one-time warning (auto), raises a hard error (enabled), or always use MD5 (disabled). init_env defaults frameset_signing to 'auto' and only registers the .dtas special signature function when not disabled. Also improve error reporting from get_datasign to include the Stata log when a batch run fails. Add tests and SCons test scaffolding for Stata 17, update config templates and docs, and bump package and ado/documentation versions to 3.1.0-alpha2. Note: switching between frameset-aware and MD5 signatures requires a one-time full rebuild due to incompatible sconsign entries.
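The decision rule described in the commit message above can be summarized as a small Python sketch. It is not the code in pystatacons/stata_utils: the function name `choose_dtas_signer` and its string return values are hypothetical, and the Stata version is passed in rather than detected.

```python
import warnings


def choose_dtas_signer(stata_version, frameset_signing="auto"):
    """Return which signer to use for .dtas files: 'datasignature' or 'md5'.

    Mirrors the auto|enabled|disabled behavior described in the commit
    message. (Hypothetical helper, for illustration only.)
    """
    if frameset_signing not in ("auto", "enabled", "disabled"):
        raise ValueError(f"unknown frameset_signing value: {frameset_signing!r}")
    if frameset_signing == "disabled":
        return "md5"  # never use the frameset-aware signature
    if stata_version >= 18:
        return "datasignature"  # framesets are available in Stata 18+
    if frameset_signing == "enabled":
        raise RuntimeError("frameset signing requires Stata 18 or later")
    # 'auto' on Stata < 18: warn and fall back to the default checksum.
    warnings.warn("Stata < 18 cannot read .dtas; falling back to MD5")
    return "md5"
```

Under Python's default warning filter, a warning raised from the same location is shown once, which roughly matches the "one-time warning" in the description.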
…ple do-files

Add consistent header comments to example and test .do scripts clarifying the assumed working directory (frames/examples, frames/tests, tests/, etc.) and showing how to run each script interactively or in batch (StataMP-64.exe -e). Mark legacy/interactive roundtrip tests as interactive-only and adjust run_all.do header and ordering (place clear all/do testlib.do after header). Also update the datasets refresh script comment to use paths relative to frames/datasets/.
@rpguiteras rpguiteras requested a review from bquistorff May 22, 2026 18:55
@bquistorff

Owner

A few comments:

  1. We should avoid committing data files when possible, so I'm concerned about the files in frames/datasets. Can we generate what we need for testing/docs on the fly using sysuse datasets?
  2. For complete_datasignature.ado in interactive mode, will the frame in use before the call to complete_datasignature be the same after the round-trip to the frames file?
  3. There can be many frames in a .dtas file, so the signature can grow quite long. Are there any signature length limits in SCons that we should be aware of? If so, we could do a checksum of the long signature like I did in the dtas branch.

Other than that, I can test the frames code, so I'm assuming there's sufficient tests and that they pass on your machine.

@rpguiteras

Collaborator Author

Ideas not implemented:

Allow SConscripts to refer to specific frames. Currently if one frame changes, the signature for the .dtas as a whole changes, so everything downstream is rebuilt. However, some tasks might use only a subset of frames and not need rebuilding if a different frame in the frameset changes. Since the signature for the .dtas concatenates signatures of individual frames, in principle this should be possible.
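As a rough illustration of why this should be possible, the sketch below splits a combined signature back into per-frame signatures and checks only the frames a task uses. Both function names are hypothetical, nothing like this exists in the PR, and it assumes that neither `|` nor `=` can occur inside a frame name or a frame's signature.

```python
def split_frameset_signature(combined):
    """Split 'frameA=sigA|frameB=sigB' into {'frameA': 'sigA', ...}."""
    frames = {}
    for part in combined.split("|"):
        if not part:
            continue  # tolerate an empty signature string
        name, _, sig = part.partition("=")
        frames[name] = sig
    return frames


def frames_changed(old_combined, new_combined, frames_used):
    """True if any frame in frames_used has a different (or missing) signature."""
    old_sigs = split_frameset_signature(old_combined)
    new_sigs = split_frameset_signature(new_combined)
    return any(old_sigs.get(f) != new_sigs.get(f) for f in frames_used)


old = "district=23:4(30770):1|subdistrict=97:9(94209):2"
new = "district=23:4(30770):1|subdistrict=97:9(94209):3"
print(frames_changed(old, new, ["district"]))     # only the other frame changed
print(frames_changed(old, new, ["subdistrict"]))  # this frame changed
```

A task that reads only the district frame would not need rebuilding here, while a task that reads the subdistrict frame would.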

Stata 16 and 17 could unzip the .dtas file and calculate complete datasignatures for each component .dta file that does not contain any alias variables. 16/17 could not open a .dta file with alias variables, so we would have to fall back to a default MD5.
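A first step for that idea could look like the following Python sketch, which lists the component .dta files inside a frameset. It assumes, as the comment above implies, that a .dtas file is a zip archive whose members include one .dta file per frame; the function name is hypothetical, and detecting alias variables inside a member is not attempted here.

```python
import zipfile


def list_component_dta(dtas_file):
    """Return the sorted names of .dta members inside a .dtas archive.

    dtas_file may be a path or a binary file-like object.
    (Hypothetical helper; assumes .dtas is a zip archive.)
    """
    with zipfile.ZipFile(dtas_file) as zf:
        return sorted(
            name for name in zf.namelist() if name.lower().endswith(".dta")
        )
```

Each listed member could then be extracted to a temporary directory and signed individually, with a fallback to MD5 for any member that Stata 16 or 17 cannot open.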

@rpguiteras

rpguiteras commented May 26, 2026

Collaborator Author

A few comments:

  1. We should avoid committing data files when possible, so I'm concerned about the files in frames/datasets. Can we generate what we need for testing/docs on the fly using sysuse datasets?

Yes, these are all sysuse or webuse datasets. The do-file _refresh_datasets.do grabs all the webuse ones.

I wasn't too worried about keeping them on the repo since they are <0.5 MB total, but we could .gitignore them.

  2. For complete_datasignature.ado in interactive mode, will the frame in use before the call to complete_datasignature be the same after the round-trip to the frames file?

Claude says no, and suggests a fix. I will give it a try and run the tests.

 - No. The current frame is not restored. Here is why:

  The round-trip (lines 29 and 51) saves with frames save, frames(_all) and restores with frames use, clear. Neither command records nor replays which frame was active at save time. The .dtas format's .frameinfo manifest stores frame names and DTA filenames, but not the "current frame" pointer. After frames use the active frame is default (Stata's reset behavior on load), regardless of what the user was in before.

  The fix is to capture c(frame) before the save and frame change back after the restore:

  if !`is_batch' {
      loc saved_frame "`c(frame)'"          // <-- add
      qui frames save "`tempdtas'", frames(_all) replace emptyok
      loc need_restore = 1
  }

  if `need_restore' {
      qui frames use "`tempdtas'", clear
      cap erase "`tempdtas'"
      qui frame change `saved_frame'         // <-- add
  }

  The frame change should be guarded (e.g., cap) in case the frame name that was active before happens not to exist in the restored frameset, though for a well-formed user session that situation shouldn't arise.

Copilot (GPT 5.4) also says no:

Not guaranteed by complete_datasignature.ado itself.

  In the interactive frameset_file() path, the ado saves all frames to a temp .dtas, loads the target frameset, computes signatures, then restores with:

   - frames save "`tempdtas'", frames(_all) ...
   - frames use "`tempdtas'", clear

  That restores the set of frames and their contents, but the code never records the pre-call active frame name and never does a final frame change <oldframe> afterward. So the frame you end up in after the restore depends entirely on how frames use chooses the active frame when loading the temp frameset, not on an explicit restoration step in the ado.

  So the safe answer is: the user state is restored, but the previously active frame is not explicitly preserved.

  3. There can be many frames in dtas file and so the signature can grow quite long. Are there any signature length limits in cons that we should be aware of? If so, we could do a checksum of the long signature like I did in the dtas branch.

Gemini says no

Custom Signatures and Database Storage

If you implement a custom Decider function in your SConstruct file, you can technically return a custom string or object to serve as the "signature" for a file.

Storage via Pickle: SCons saves its tracking data inside a database file (usually .sconsign.dblite). It handles this data using Python’s standard pickle serialization module.

No Arbitrary Cap: Because pickle stores arbitrary Python objects and strings natively, SCons will not reject a custom signature string just because it is exceptionally long. Your only real boundary is your system's available RAM and Python's theoretical string limit ($2^{31}-1$ bytes on 32-bit systems or $2^{63}-1$ bytes on 64-bit systems).
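If a length cap ever did turn out to matter, the "checksum of the long signature" idea from the question could be sketched as below. This is an assumption-laden illustration, not code from this PR or from the dtas branch: the function name, the `sha256:` prefix, and the 256-character threshold are all hypothetical.

```python
import hashlib


def compact_signature(signature, max_len=256):
    """Return the signature unchanged if short, else a fixed-length digest.

    max_len is an arbitrary, hypothetical threshold.
    """
    if len(signature) <= max_len:
        return signature
    digest = hashlib.sha256(signature.encode("utf-8")).hexdigest()
    return "sha256:" + digest


short = "LGD_district=23:4(30770):1387127574:1572535773:1213271928"
long_sig = "|".join(f"frame{i}={short}" for i in range(100))
print(compact_signature(short) == short)  # short signatures pass through
print(len(compact_signature(long_sig)))   # fixed length regardless of frame count
```

The trade-off is that a hashed signature no longer exposes per-frame signatures, so it would rule out comparing individual frames later.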

Other than that, I can test the frames code, so I'm assuming there's sufficient tests and that they pass on your machine.

Yes, these passed on Stata 17 and 19 on my PC and on Stata 19 on our cluster (I think I only tested batch mode on the cluster).

It also worked well in an "out of sample" test: a project for which I had written an SConscript with framesets (.dtas), but which was not consulted in designing the new code.

Save and restore the previously-active frame when complete_datasignature loads a frameset in interactive mode. Implemented by capturing `c(frame)` before saving frames and calling `frame change` after restoring from the temporary .dtas. Add a legacy interactive test to verify restoration when calling from a non-default frame, and update documentation to mention that the previously-active frame is restored.
@rpguiteras

rpguiteras commented May 26, 2026

Collaborator Author

I implemented a fix for your comment 2 above and ran the interactive and batch mode tests again successfully.

This is in a new commit 683c8f5, with tag 3.1.0-alpha3.

I'm not sure what the etiquette is: do I close this PR and open a new one from that commit? Actually, it looks like those commits show up here, so I guess no action is needed?

@rpguiteras

Collaborator Author

@bquistorff should I have created a fork instead of a branch? So we can merge in milestone versions before publishing 3.1.0? If you think this is close to publication-ready then we can just work here, I guess, and I'll do a branch for future work (tutorial, frames within framesets).

Introduce several new frames/tests: smoke_dtas_frame_order, smoke_dtas_volatile_chars, smoke_dtas_fralias, smoke_dtas_degenerate, smoke_stata17_fallback, and interactive_collision. Update run_all.do to include the new non-interactive smoke tests and expand README-tests.md with descriptions, expected behavior, and diagnostics for each new test (including notes about interactive vs batch usage). The new tests cover frame ordering, volatile dataset characteristics and skip_char(), fralias aliasing behavior, single-/empty-frameset edge cases, Stata 17 version-guard behavior, and an interactive name-collision restoration scenario. Logs and cleanup steps are included in the test files.

2 participants