
Update profiles parser and related staged changes #156

Open
AJayaV06 wants to merge 1 commit into UTDNebula:develop from AJayaV06:feat/profiles-parser-staged-updates-2026-04-09

Conversation


AJayaV06 commented Apr 9, 2026

No description provided.

Copilot AI review requested due to automatic review settings April 9, 2026 20:38

Copilot AI left a comment


Pull request overview

This PR modernizes the UTD Profiles pipeline by switching the profiles scraper to use the official Profiles API (instead of ChromeDP HTML scraping) and adding a new parser path that consumes the raw API output while preserving a fallback to the legacy profiles.json format.

Changes:

  • Replaced ChromeDP-based profile scraping with net/http calls to the Profiles API, writing profiles_raw.json (+ index output) to disk.
  • Added optional PROFILE_SCHOOLS environment variable to scope profile detail requests by school code.
  • Added a new parser.LoadProfiles implementation for profiles_raw.json, updated loader fallback behavior, and introduced unit tests.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Summary per file:

  • scrapers/profiles.go: Reimplemented scraping via the Profiles API; adds batching, school-scoped fetching, and JSON file outputs.
  • scrapers/profiles_test.go: Adds unit coverage for delimiter parsing, request URL building, and dedupe logic.
  • parser/profiles.go: New parser/loader for profiles_raw.json into schema.Professor records, with heuristics for titles/URLs/location.
  • parser/profiles_test.go: Adds unit tests for “best” selection heuristics and fallbacks.
  • parser/profileLoader.go: Attempts the profiles_raw.json load first, falls back to legacy profiles.json.
  • README.md: Documents PROFILE_SCHOOLS usage and behavior.


Comment on lines +18 to +20
apiPrimaryLocationRegex = regexp.MustCompile(`^(\w+)\s+(\d+\.\d{3}[A-z]?)$`)
apiFallbackLocationRegex = regexp.MustCompile(`^([A-z]+)(\d+)\.?([\d]{3}[A-z]?)$`)
)

Copilot AI Apr 9, 2026


The location regexes use [A-z]/[A-z]+, which also matches non-letter ASCII characters (e.g., [, \, ], ^, _, `). This can lead to incorrect matches for building codes / room suffixes. Use [A-Za-z] (or a more specific allowed set) instead.
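The pitfall is easy to demonstrate: in ASCII, the range `A`–`z` also covers the six punctuation characters between `Z` and `a`. A minimal comparison of the PR's loose class against the suggested explicit one (the `ECSS` example string is illustrative, not from the PR):

```go
package main

import (
	"fmt"
	"regexp"
)

// looseRoom mirrors the PR's pattern: [A-z] spans ASCII 65–122, which
// includes the punctuation characters [ \ ] ^ _ ` between 'Z' and 'a'.
var looseRoom = regexp.MustCompile(`^(\w+)\s+(\d+\.\d{3}[A-z]?)$`)

// strictRoom uses the explicit class the review suggests.
var strictRoom = regexp.MustCompile(`^(\w+)\s+(\d+\.\d{3}[A-Za-z]?)$`)

func main() {
	// an underscore where a room-suffix letter belongs
	bad := "ECSS 2.415_"
	fmt.Println(looseRoom.MatchString(bad))  // true: [A-z] accepts '_'
	fmt.Println(strictRoom.MatchString(bad)) // false
}
```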

Comment on lines +43 to +46
type profileDetailsResponse struct {
Information []profileInformation `json:"information"`
Areas []profileArea `json:"areas"`
}

Copilot AI Apr 9, 2026


profileDetailsResponse is declared but never used in this package. Consider removing it (or using it) to avoid accumulating dead/unused types and reduce maintenance overhead.

Comment on lines +313 to +318
func firstInformationData(items []profileInformation) profileInformationData {
if len(items) == 0 {
return profileInformationData{}
}
return items[0].Data
}

Copilot AI Apr 9, 2026


firstInformationData is never called. If it’s not needed, removing it will keep the parser surface area smaller and make future refactors less error-prone.

Comment on lines +98 to +116
func parseLocation(text string) schema.Location {
	var building string
	var room string

	submatches := primaryLocationRegex.FindStringSubmatch(text)
	if submatches == nil {
		submatches = fallbackLocationRegex.FindStringSubmatch(text)
		if submatches == nil {
			return schema.Location{}
		} else {
			building = submatches[1]
			room = fmt.Sprintf("%s.%s", submatches[2], submatches[3])
		}
	} else {
		building = submatches[1]
		room = submatches[2]
	}

	return schema.Location{
		Building: building,
		Room:     room,
		Map_uri:  fmt.Sprintf("https://locator.utdallas.edu/%s_%s", building, room),
	}
}

if len(indexResponse.Profile) == 0 {
	log.Print("Profile API returned no profiles.")
	return
}

schools := parseProfileSchoolsFromEnv()

slugs := make([]string, 0, len(indexResponse.Profile))
for _, row := range indexResponse.Profile {
	slug := strings.TrimSpace(row.Slug)
	if slug == "" {
		continue
	}
	slugs = append(slugs, slug)
}
slugs = dedupeStrings(slugs)

if len(slugs) == 0 {
	log.Print("Profile API index contained no valid slugs.")

Copilot AI Apr 9, 2026


When PROFILE_SCHOOLS is set, the scraper still builds slugs from the index and returns early if the index is empty / contains no valid slugs. That prevents the school-scoped path from running even though it doesn’t require slugs. Consider branching on schools earlier so the school-based scrape can proceed even if the index has no/invalid slugs (and only enforce the slug checks in the slug-batch path).
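The suggested restructure can be sketched as a small path-selection function. This is an illustration of the proposed control flow, not the PR's code: `chooseScrapePath` and `indexRow` are hypothetical names, and the dedupe step from the original snippet is omitted for brevity.

```go
package main

import (
	"fmt"
	"strings"
)

type indexRow struct{ Slug string }

// chooseScrapePath branches on the school-scoped configuration first, so an
// empty or slug-less index no longer blocks the school-based scrape; the
// slug checks apply only to the slug-batch path.
func chooseScrapePath(schools []string, index []indexRow) (string, []string) {
	if len(schools) > 0 {
		// school-scoped scrape does not need index slugs at all
		return "schools", schools
	}
	slugs := make([]string, 0, len(index))
	for _, row := range index {
		if s := strings.TrimSpace(row.Slug); s != "" {
			slugs = append(slugs, s)
		}
	}
	if len(slugs) == 0 {
		return "none", nil
	}
	return "slugs", slugs
}

func main() {
	// with PROFILE_SCHOOLS set, an empty index no longer aborts the scrape
	path, args := chooseScrapePath([]string{"ecs"}, nil)
	fmt.Println(path, args) // schools [ecs]
}
```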
