Thank you for your interest in contributing to dit!
git clone https://github.com/happyhackingspace/dit.git
cd dit
# Download training data and model from Hugging Face
go run ./cmd/dit data download
go build ./...
go test ./...

dit is a Go port of Formasaurus with zero external ML dependencies.
Three-stage ML pipeline:
- Form type detection -- Logistic regression (L-BFGS optimizer, L2 regularization) trained on features extracted from HTML forms: element counts, submit button text, input names, CSS classes, form action URL, label text, and link text.
- Field type detection -- Linear-chain CRF (Conditional Random Field) with OWL-QN trainer (L1 support). Features include field tag/type/name/value/placeholder, CSS class/ID, label text, text before/after the field, and the form type predicted by stage 1.
- Page type detection -- Logistic regression trained on page-level features: title, headings, meta description, CSS classes, nav text, URL patterns, page structure indicators, and form classification results from stage 1.
Accuracy is estimated via grouped k-fold cross-validation (grouped by domain using public suffix list).
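To illustrate why grouping matters, here is a minimal sketch of grouped k-fold splitting in Go. It is not dit's implementation: the function names are hypothetical, and `groupKey` is a simplified stand-in for a public suffix lookup (it keeps the last two hostname labels, which is wrong for suffixes such as co.uk). The point is only that every sample from one domain lands in the same fold, so a model is never evaluated on a site it saw during training.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
	"strings"
)

// groupKey is a simplified stand-in for a public-suffix lookup: it keeps the
// last two labels of the hostname. Assumption for illustration only.
func groupKey(rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil || u.Hostname() == "" {
		return rawURL
	}
	labels := strings.Split(u.Hostname(), ".")
	if len(labels) > 2 {
		labels = labels[len(labels)-2:]
	}
	return strings.Join(labels, ".")
}

// groupedKFold assigns each sample index to one of k folds so that all
// samples sharing a group key land in the same fold. Groups are placed
// largest-first into the currently smallest fold to keep folds balanced.
func groupedKFold(urls []string, k int) [][]int {
	groups := map[string][]int{}
	for i, u := range urls {
		key := groupKey(u)
		groups[key] = append(groups[key], i)
	}
	keys := make([]string, 0, len(groups))
	for key := range groups {
		keys = append(keys, key)
	}
	// Sort by descending group size, then by name for determinism.
	sort.Slice(keys, func(a, b int) bool {
		if len(groups[keys[a]]) != len(groups[keys[b]]) {
			return len(groups[keys[a]]) > len(groups[keys[b]])
		}
		return keys[a] < keys[b]
	})
	folds := make([][]int, k)
	for _, key := range keys {
		smallest := 0
		for f := 1; f < k; f++ {
			if len(folds[f]) < len(folds[smallest]) {
				smallest = f
			}
		}
		folds[smallest] = append(folds[smallest], groups[key]...)
	}
	return folds
}

func main() {
	urls := []string{
		"https://example.com/login",
		"https://www.example.com/signup",
		"https://example.org/search",
		"https://shop.example.net/cart",
		"https://example.net/contact",
	}
	for i, fold := range groupedKFold(urls, 2) {
		fmt.Printf("fold %d: %v\n", i, fold)
	}
}
```

Each fold serves once as the held-out set while the remaining folds are used for training.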
dit.go, train.go          Public SDK (dit.New, dit.Load, dit.Train, dit.Evaluate)
cmd/dit/                  CLI tool
cmd/dit-collect/          Data collection tool for page annotations
classifier/               Form type (LogReg) + field type (CRF) + page type (LogReg) classifiers
  formtype.go             Form LogReg training and inference
  fieldtype.go            CRF wrapper for field classification
  pagetype.go             Page LogReg training and inference
  formtype_features.go    9 form feature pipelines (FormElements, SubmitText, etc.)
  fieldtype_features.go   Per-field CRF features (ElemFeatures, GetFormFeatures)
  pagetype_features.go    9 page feature pipelines (PageStructure, PageTitle, etc.)
  model.go                Serialization (SaveModel, LoadClassifier)
crf/                      Standalone linear-chain CRF implementation
  trainer.go              OWL-QN optimizer (L1 regularization)
  forward_backward.go     Forward-backward algorithm
  viterbi.go              Viterbi decoding
  feature.go              Feature-to-attribute conversion
internal/htmlutil/        goquery-based HTML parsing, form/field/page extraction
internal/storage/         Annotation data loading (config.json, index.json, HTML files)
internal/textutil/        Tokenize, Ngrams, Normalize, NumberPattern
internal/vectorizer/      SparseVector, CountVectorizer, TfidfVectorizer, DictVectorizer
data/forms/               Annotated HTML forms + config
data/pages/               Annotated HTML pages + config
- CRF trainer uses manual OWL-QN (for L1 support) instead of gonum's Minimize
- Formasaurus hyperparameters preserved: c1=0.1655, c2=0.0236, max_iter=100 (CRF), C=5 with L2 penalty (LogReg)
- char_wb analyzer pads words with spaces and extracts char n-grams from padded words (matching sklearn)
- sklearn smooth IDF formula: log((1+n)/(1+df)) + 1
- GroupKFold by domain using publicsuffix for cross-validation
- No external ML dependencies -- LogReg and CRF are self-contained
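The char_wb and smooth IDF notes above can be sketched in a few lines of Go. This is an illustration, not the code in internal/vectorizer: `charWBNgrams` and `smoothIDF` are hypothetical names, and the n-gram logic follows sklearn's char_wb analyzer as I understand it (pad each whitespace-separated word with one space on each side, slide a window of each size, and emit a word shorter than the window only once).

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// charWBNgrams pads every whitespace-separated word with a single space on
// both sides and returns the character n-grams of sizes minN..maxN taken
// from within each padded word. N-grams never span two words.
func charWBNgrams(text string, minN, maxN int) []string {
	var out []string
	for _, word := range strings.Fields(text) {
		padded := []rune(" " + word + " ")
		for n := minN; n <= maxN; n++ {
			if n >= len(padded) {
				// The padded word is no longer than n: emit it once and
				// stop, so short words are not counted again for larger n.
				out = append(out, string(padded))
				break
			}
			for i := 0; i+n <= len(padded); i++ {
				out = append(out, string(padded[i:i+n]))
			}
		}
	}
	return out
}

// smoothIDF computes log((1+n)/(1+df)) + 1, where n is the number of
// documents and df is the number of documents containing the term.
func smoothIDF(n, df int) float64 {
	return math.Log(float64(1+n)/float64(1+df)) + 1
}

func main() {
	fmt.Printf("%q\n", charWBNgrams("hi", 2, 3))
	fmt.Printf("%.4f\n", smoothIDF(3, 1))
}
```

The added 1 in the numerator and denominator acts like one extra document containing every term, which avoids division by zero; the trailing + 1 keeps terms that occur in every document from being weighted to zero.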
Full documentation is available at pkg.go.dev/github.com/happyhackingspace/dit.
// Load
func New() (*Classifier, error) // auto-finds model.json
func Load(path string) (*Classifier, error) // from specific path
// Classify forms
func (c *Classifier) ExtractForms(html string) ([]FormResult, error)
func (c *Classifier) ExtractFormsProba(html string, threshold float64) ([]FormResultProba, error)
// Classify page type
func (c *Classifier) ExtractPageType(html string) (*PageResult, error)
func (c *Classifier) ExtractPageTypeProba(html string, threshold float64) (*PageResultProba, error)
// Train
func Train(dataDir string, config *TrainConfig) (*Classifier, error)
func (c *Classifier) Save(path string) error
// Evaluate
func Evaluate(dataDir string, config *EvalConfig) (*EvalResult, error)

- Fork the repository and create a feature branch from main.
- Write clear, minimal code that follows existing patterns.
- Add tests for new functionality.
- Run go vet ./... and go test ./... before submitting.
- Open a pull request with a clear description of the change.
- Keep the public API surface small. Internal packages should stay internal.
- No external ML dependencies.
- Match Python Formasaurus behavior where possible for compatibility.
The training data is hosted on Hugging Face and consists of annotated HTML forms and pages from real websites. Run dit data download (or go run ./cmd/dit data download) to get the data locally.
- Add the HTML file to data/forms/html/.
- Update data/forms/index.json with the URL, form types, and field annotations.
- Follow the type codes defined in data/forms/config.json.

- Add the HTML file to data/pages/html/.
- Update data/pages/index.json with the URL and page type.
- Follow the type codes defined in data/pages/config.json.
Re-train and verify accuracy doesn't regress:
go run ./cmd/dit train model.json --data-folder data
go run ./cmd/dit evaluate --data-folder data

After updating annotations, upload to Hugging Face:

go run ./cmd/dit data upload

This requires the Hugging Face CLI and being logged in (hf auth login).
Form annotations (data/forms/index.json): each entry maps an HTML file path to:
- url -- the source URL
- forms -- list of form type codes (one per <form> in the HTML)
- visible_html_fields -- list of field annotation maps (field_name -> type_code)
Page annotations (data/pages/index.json): each entry maps an HTML file path to:
- url -- the source URL
- page_type -- page type code (e.g. lg, er, s4)
See data/forms/config.json for form/field type codes and data/pages/config.json for page type codes.
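As a rough guide to the shape of these files, the sketch below decodes both index formats with encoding/json. The struct names, the sample file path, and the form and field codes in the sample are made up for illustration; dit's own loader in internal/storage may use different types, and the real codes live in the config.json files. The check that each form has its own field map is an assumption drawn from the description above, not a documented rule.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FormEntry mirrors one value in data/forms/index.json (illustrative type).
type FormEntry struct {
	URL               string              `json:"url"`
	Forms             []string            `json:"forms"`
	VisibleHTMLFields []map[string]string `json:"visible_html_fields"`
}

// PageEntry mirrors one value in data/pages/index.json (illustrative type).
type PageEntry struct {
	URL      string `json:"url"`
	PageType string `json:"page_type"`
}

// Sample data: the path and the form/field codes are placeholders.
const sampleFormIndex = `{
  "html/example-0.html": {
    "url": "https://example.com/login",
    "forms": ["l"],
    "visible_html_fields": [{"user": "us", "pass": "p1"}]
  }
}`

const samplePageIndex = `{
  "html/example-0.html": {"url": "https://example.com/login", "page_type": "lg"}
}`

// parseIndex decodes an index file that maps HTML file paths to entries.
func parseIndex[T any](data []byte) (map[string]T, error) {
	var index map[string]T
	if err := json.Unmarshal(data, &index); err != nil {
		return nil, err
	}
	return index, nil
}

// checkFormEntry assumes one field map per annotated form and reports a
// mismatch, which usually means an annotation was left incomplete.
func checkFormEntry(path string, e FormEntry) error {
	if len(e.Forms) != len(e.VisibleHTMLFields) {
		return fmt.Errorf("%s: %d forms but %d field maps",
			path, len(e.Forms), len(e.VisibleHTMLFields))
	}
	return nil
}

func main() {
	forms, err := parseIndex[FormEntry]([]byte(sampleFormIndex))
	if err != nil {
		panic(err)
	}
	for path, entry := range forms {
		if err := checkFormEntry(path, entry); err != nil {
			fmt.Println("invalid:", err)
			continue
		}
		fmt.Println(path, entry.URL, entry.Forms, entry.VisibleHTMLFields)
	}
	pages, err := parseIndex[PageEntry]([]byte(samplePageIndex))
	if err != nil {
		panic(err)
	}
	for path, entry := range pages {
		fmt.Println(path, entry.URL, entry.PageType)
	}
}
```

A check like this can be run over a new entry before re-training to catch malformed annotations early.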
Open an issue with:
- What you expected vs what happened
- Minimal HTML that reproduces the issue (if classification-related)
- Go version and OS
By contributing, you agree that your contributions will be licensed under the MIT License.