Update profiles parser and related staged changes #156
AJayaV06 wants to merge 1 commit into UTDNebula:develop from
Conversation
Pull request overview
This PR modernizes the UTD Profiles pipeline by switching the profiles scraper to use the official Profiles API (instead of ChromeDP HTML scraping) and adding a new parser path that consumes the raw API output while preserving a fallback to the legacy profiles.json format.
Changes:
- Replaced ChromeDP-based profile scraping with `net/http` calls to the Profiles API, writing `profiles_raw.json` (plus index output) to disk.
- Added optional `PROFILE_SCHOOLS` environment variable to scope profile detail requests by school code.
- Added a new `parser.LoadProfiles` implementation for `profiles_raw.json`, updated loader fallback behavior, and introduced unit tests.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `scrapers/profiles.go` | Reimplemented scraping via Profiles API; adds batching, school-scoped fetching, and JSON file outputs. |
| `scrapers/profiles_test.go` | Adds unit coverage for delimiter parsing, request URL building, and dedupe logic. |
| `parser/profiles.go` | New parser/loader for `profiles_raw.json` into `schema.Professor` records with heuristics for titles/URLs/location. |
| `parser/profiles_test.go` | Adds unit tests for "best" selection heuristics and fallbacks. |
| `parser/profileLoader.go` | Attempts `profiles_raw.json` load first, falls back to legacy `profiles.json`. |
| `README.md` | Documents `PROFILE_SCHOOLS` usage and behavior. |
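The dedupe logic exercised by the new scraper tests is not shown in this review; a plausible order-preserving, first-occurrence-wins implementation would look like the sketch below (an assumption, not the PR's actual helper):

```go
package main

import "fmt"

// dedupeStrings: a sketch of an order-preserving dedupe where the first
// occurrence of each string wins. The PR's real implementation may differ.
func dedupeStrings(items []string) []string {
	seen := make(map[string]struct{}, len(items))
	out := make([]string, 0, len(items))
	for _, s := range items {
		if _, ok := seen[s]; ok {
			continue
		}
		seen[s] = struct{}{}
		out = append(out, s)
	}
	return out
}

func main() {
	fmt.Println(dedupeStrings([]string{"a", "b", "a", "c", "b"})) // [a b c]
}
```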
```go
	apiPrimaryLocationRegex  = regexp.MustCompile(`^(\w+)\s+(\d+\.\d{3}[A-z]?)$`)
	apiFallbackLocationRegex = regexp.MustCompile(`^([A-z]+)(\d+)\.?([\d]{3}[A-z]?)$`)
)
```
The location regexes use [A-z]/[A-z]+, which also matches non-letter ASCII characters (e.g., [, \, ], ^, _, `). This can lead to incorrect matches for building codes / room suffixes. Use [A-Za-z] (or a more specific allowed set) instead.
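To illustrate the pitfall: in Go's regexp syntax the range `[A-z]` covers ASCII 65–122, which includes the six punctuation characters between `Z` and `a`. A minimal demonstration (names here are illustrative, not from the PR):

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	looseRe  = regexp.MustCompile("^[A-z]+$")    // buggy: range also covers [ \ ] ^ _ `
	strictRe = regexp.MustCompile("^[A-Za-z]+$") // letters only
)

func main() {
	// '_' (0x5F) sits between 'Z' (0x5A) and 'a' (0x61), so [A-z] accepts it.
	fmt.Println(looseRe.MatchString("EC_S"))  // true
	fmt.Println(strictRe.MatchString("EC_S")) // false
}
```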
```go
type profileDetailsResponse struct {
	Information []profileInformation `json:"information"`
	Areas       []profileArea        `json:"areas"`
}
```
profileDetailsResponse is declared but never used in this package. Consider removing it (or using it) to avoid accumulating dead/unused types and reduce maintenance overhead.
```go
func firstInformationData(items []profileInformation) profileInformationData {
	if len(items) == 0 {
		return profileInformationData{}
	}
	return items[0].Data
}
```
firstInformationData is never called. If it’s not needed, removing it will keep the parser surface area smaller and make future refactors less error-prone.
```go
func parseLocation(text string) schema.Location {
	var building string
	var room string

	submatches := primaryLocationRegex.FindStringSubmatch(text)
	if submatches == nil {
		submatches = fallbackLocationRegex.FindStringSubmatch(text)
		if submatches == nil {
			return schema.Location{}
		} else {
			building = submatches[1]
			room = fmt.Sprintf("%s.%s", submatches[2], submatches[3])
		}
	} else {
		building = submatches[1]
		room = submatches[2]
	}

	return schema.Location{
		Building: building,
		Room:     room,
		Map_uri:  fmt.Sprintf("https://locator.utdallas.edu/%s_%s", building, room),
	}
}
```

```go
	if len(indexResponse.Profile) == 0 {
		log.Print("Profile API returned no profiles.")
		return
	}

	schools := parseProfileSchoolsFromEnv()

	slugs := make([]string, 0, len(indexResponse.Profile))
	for _, row := range indexResponse.Profile {
		slug := strings.TrimSpace(row.Slug)
		if slug == "" {
			continue
		}
		slugs = append(slugs, slug)
	}
	slugs = dedupeStrings(slugs)

	if len(slugs) == 0 {
		log.Print("Profile API index contained no valid slugs.")
```
When PROFILE_SCHOOLS is set, the scraper still builds slugs from the index and returns early if the index is empty / contains no valid slugs. That prevents the school-scoped path from running even though it doesn’t require slugs. Consider branching on schools earlier so the school-based scrape can proceed even if the index has no/invalid slugs (and only enforce the slug checks in the slug-batch path).
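The suggested reorder can be sketched as a small decision function. Everything here is hypothetical scaffolding (the type `profileRow` and function names are stand-ins for the PR's real code); only the ordering of the checks is the point:

```go
package main

import (
	"fmt"
	"strings"
)

// profileRow is a hypothetical stand-in for one entry of the Profiles API index.
type profileRow struct{ Slug string }

func collectSlugs(rows []profileRow) []string {
	slugs := make([]string, 0, len(rows))
	for _, row := range rows {
		if slug := strings.TrimSpace(row.Slug); slug != "" {
			slugs = append(slugs, slug)
		}
	}
	return slugs
}

// choosePath encodes the suggested ordering: the school-scoped path is taken
// before any slug checks, so an empty or slug-less index no longer blocks it.
func choosePath(schools []string, rows []profileRow) string {
	if len(schools) > 0 {
		return "schools" // this path doesn't need index slugs
	}
	if len(rows) == 0 {
		return "abort: no profiles"
	}
	if len(collectSlugs(rows)) == 0 {
		return "abort: no valid slugs"
	}
	return "slugs"
}

func main() {
	// PROFILE_SCHOOLS set, index empty: the school path still proceeds.
	fmt.Println(choosePath([]string{"ecs"}, nil)) // schools
	fmt.Println(choosePath(nil, nil))             // abort: no profiles
}
```

With this shape, the slug checks are enforced only on the slug-batch path, as the comment recommends.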
No description provided.