🚗 NEV-Opinion-Mining

A Hybrid Topic–Semantic Framework Based on LDA and Large Language Models

1. Motivation

Understanding user sentiment in the New Energy Vehicle (NEV) market presents unique challenges. User-generated reviews are highly unstructured, noisy, and rich in domain-specific expressions.

While Large Language Models (LLMs) excel at semantic understanding, fully replacing statistical models with LLM-based classification often sacrifices comparability, transparency, and structural interpretability.

The Research Question:

Can LLMs be used not as black-box classifiers, but as semantic operators that enhance traditional topic models while preserving their statistical structure?

2. Research Objective

The core objective is to design a hybrid opinion mining framework that enables:

Structurally interpretable topic modeling.
Fine-grained semantic interpretation of latent topics.
Comparative analysis between BEV (Battery Electric Vehicle) and PHEV (Plug-in Hybrid Electric Vehicle) user groups.

Rather than predicting sentiment alone, the framework aims to uncover structural differences in user concerns (e.g., smart-feature experience vs. range anxiety) to support data-driven business decisions.

3. Methodological Positioning

This framework positions itself between Classical Unsupervised Learning (LDA) and Generative AI (LLMs).

Approach	Pros	Cons
Pure LDA	Probabilistic structure, Stable	Low semantic interpretability, Sensitive to noise
Pure LLM	High semantic understanding	High cost, Black-box nature, Hard to quantify structurally
Our Hybrid Framework	Interpretable + Semantic-aware	Complex pipeline design

Design Philosophy:

LDA provides the probabilistic backbone for topic stability.
LLMs act as "Semantic Operators" for noise filtering and meaning extraction.

4. Core Innovations

4.1 LLM-Driven Noise Modeling

Inspired by recent computational social science literature (e.g., "Jobless Growth"), we use LLMs to identify recurrent noise patterns in raw user-generated content (UGC) and to draft high-precision Regular Expressions that target them. With a human in the loop to review the generated patterns, this approach is intended to make cleaning scale beyond manual keyword matching.
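The cleaning step can be sketched as follows. This is a minimal illustration, not the code in src/preprocessor.py: the three patterns and the helper names compile_patterns and clean are hypothetical, and the sketch assumes the LLM has already returned its candidate patterns as plain strings.

```python
import re

# Hypothetical patterns of the kind an LLM might propose after inspecting raw
# reviews; they are illustrative only and would be reviewed by a person first.
NOISE_PATTERNS = [
    r"https?://\S+",             # embedded links
    r"\[[^\[\]]{1,8}\]",         # bracketed emoticon codes such as [doge]
    r"(?i)posted from \w+ app",  # client signatures
]

def compile_patterns(patterns):
    """Compile candidate patterns, skipping any that are not valid regex."""
    compiled = []
    for pattern in patterns:
        try:
            compiled.append(re.compile(pattern))
        except re.error:
            continue  # generated regex can be malformed; drop it rather than crash
    return compiled

def clean(text, compiled):
    """Remove every match of every pattern, then collapse leftover whitespace."""
    for rx in compiled:
        text = rx.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

# "(unclosed" stands in for a malformed suggestion and is silently skipped.
rules = compile_patterns(NOISE_PATTERNS + ["(unclosed"])
cleaned = clean("Range is great [doge] http://t.cn/abc Posted from Autohome app", rules)
```

Validating each pattern before use matters here because generated regex is untrusted input: one bad suggestion should not stop the whole cleaning run.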

4.2 LLM-Assisted Semantic Interpretation

Traditional LDA outputs each topic as a weighted keyword list with no label or explanation. In this framework, LLMs serve as interpreters that bridge these probabilistic outputs and human understanding by generating a coherent label for each topic.
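One way the labeling step could look is sketched below. The prompt wording, the function names build_label_prompt and label_topic, and the fake_llm stand-in are all assumptions made for illustration; a real run would pass in a callable that wraps whichever LLM API the project adopts.

```python
from typing import Callable, List, Tuple

def build_label_prompt(topic_id: int, keywords: List[Tuple[str, float]], top_n: int = 8) -> str:
    """Turn one topic's weighted keywords into a labeling prompt."""
    ranked = sorted(keywords, key=lambda kw: kw[1], reverse=True)[:top_n]
    listing = ", ".join(f"{word} ({weight:.3f})" for word, weight in ranked)
    return (
        f"Topic {topic_id} from an LDA model of NEV user reviews has these "
        f"top keywords with weights: {listing}. "
        "Reply with a label of at most five words."
    )

def label_topic(topic_id: int, keywords: List[Tuple[str, float]],
                ask_llm: Callable[[str], str]) -> str:
    """Ask the injected LLM callable for a label and normalize its reply."""
    reply = ask_llm(build_label_prompt(topic_id, keywords))
    return reply.strip().strip('"').strip()

def fake_llm(prompt: str) -> str:
    """Offline stand-in for a real client so the sketch runs without network access."""
    return ' "Charging and range" \n'

keywords = [("charging", 0.041), ("range", 0.058), ("winter", 0.020)]
prompt = build_label_prompt(0, keywords)
label = label_topic(0, keywords, fake_llm)
```

Passing the LLM in as a plain callable keeps the LDA output and the labeling logic testable without committing to a particular provider.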

4.3 Comparative Topic–Sentiment Analysis

Unlike generic sentiment analysis, this framework is explicitly designed for group-level comparison (BEV vs. PHEV), so that asymmetric pain points across powertrain technologies can be identified.
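A group-level comparison of this kind could start from per-document topic distributions, as in the sketch below. The function names and the four toy distributions are invented for illustration and are not results; the sketch assumes each review already carries a powertrain tag and a topic distribution from the LDA model.

```python
from collections import defaultdict

def mean_topic_share(docs, num_topics):
    """Average per-document topic distributions within each group.

    docs: iterable of (group, distribution) pairs, where distribution is a
    list of num_topics probabilities for one review.
    """
    sums = defaultdict(lambda: [0.0] * num_topics)
    counts = defaultdict(int)
    for group, dist in docs:
        counts[group] += 1
        for k, p in enumerate(dist):
            sums[group][k] += p
    return {g: [s / counts[g] for s in sums[g]] for g in sums}

def share_gap(shares, a="BEV", b="PHEV"):
    """Per-topic difference in mean share (a minus b), largest absolute gap first."""
    gaps = [(k, shares[a][k] - shares[b][k]) for k in range(len(shares[a]))]
    return sorted(gaps, key=lambda kv: abs(kv[1]), reverse=True)

# Toy input with two topics; the numbers are placeholders, not measurements.
docs = [
    ("BEV", [0.75, 0.25]),
    ("BEV", [0.25, 0.75]),
    ("PHEV", [0.25, 0.75]),
    ("PHEV", [0.25, 0.75]),
]
shares = mean_topic_share(docs, num_topics=2)
gaps = share_gap(shares)
```

Topics at the top of the sorted list are the ones where the two groups differ most, which is where asymmetric pain points would be expected to show up.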

5. System Architecture

The project follows a hybrid pipeline:

Data Acquisition: Multi-source scraping from Chinese automobile platforms.
LLM-Enhanced Preprocessing: Noise pattern discovery & Regex generation.
Topic Modeling: Gensim-based LDA with Coherence optimization.
Semantic Enrichment: LLM-based topic labeling & sentiment interpretation.
Comparative Analysis: BEV vs. PHEV topic distribution visualization.
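The coherence-based choice of the number of topics in the Topic Modeling step can be illustrated without any dependencies. The source does not say which coherence measure will be used, so the sketch below assumes UMass coherence and computes it in plain Python as a stand-in for Gensim's coherence tooling; umass_coherence, pick_k, and the toy corpus are hypothetical.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    """UMass coherence of one topic.

    top_words: the topic's keywords, most probable first.
    documents: list of token lists (the cleaned corpus).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    # Each pair is conditioned on its higher-ranked (earlier) word.
    for earlier, later in combinations(top_words, 2):
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # word absent from the corpus; skip to avoid division by zero
        score += math.log((doc_freq(earlier, later) + 1) / denom)
    return score

def pick_k(topics_by_k, documents):
    """Return the K whose topics have the highest mean coherence."""
    def mean_score(topics):
        return sum(umass_coherence(t, documents) for t in topics) / len(topics)
    return max(topics_by_k, key=lambda k: mean_score(topics_by_k[k]))

# Toy corpus and hand-written candidate topics, standing in for trained models.
corpus = [["range", "charging"], ["range", "charging"], ["seat", "screen"], ["range", "seat"]]
candidates = {
    2: [["range", "charging"], ["seat", "screen"]],
    3: [["range", "screen"], ["charging", "seat"], ["seat", "range"]],
}
best_k = pick_k(candidates, corpus)
```

In a full run, the candidate topics for each K would come from separately trained LDA models rather than from hand-written lists.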

6. Research Stage Clarification ⚠️

Current Status: Methodological Design & Prototyping

This repository currently serves as a Research Proposal and System Design demonstration.

The pipeline architecture is fully specified.
Core algorithms (LDA + LLM interaction logic) are defined in the src/ directory.
Large-scale quantitative evaluation is planned for the next phase.

7. Project Roadmap

The project is currently in the active prototyping phase. (Legend: ✅ Completed | 🔄 In Progress | ⏳ Planned)

Phase 1: Research Design & Infrastructure

✅ Literature Review: Methodological validation based on "Jobless Growth" (LLM+Regex) & specialized NEV studies.
✅ System Architecture: Designing the hybrid pipeline of "LDA Structure + LLM Semantics".
✅ Environment Setup: Repository initialization and dependency management (requirements.txt).

Phase 2: Data & Preprocessing (The Hybrid Engine)

🔄 Data Acquisition: Developing multi-threaded scrapers for Autohome/Dongchedi.
🔄 LLM-Based Noise Modeling: Designing prompt engineering for noise pattern discovery.
⏳ Regex Generation Pipeline: Implementing the automated Regex construction module.

Phase 3: Modeling & Interpretation

⏳ LDA Model Training: Optimization of K topics via Coherence Score.
⏳ Semantic Injection: Integration of GPT-4/DeepSeek API for topic labeling.
⏳ Experimentation: Comparative analysis between BEV and PHEV datasets.

Phase 4: Visualization & Reporting

⏳ Result Visualization: Interactive pyLDAvis charts and Radar plots.
⏳ Final Report: Drafting the methodological paper.

8. Directory Structure

NEV-Opinion-Mining/
├── data/                # Raw and processed datasets
├── src/
│   ├── scraper.py       # Data collection scripts
│   ├── preprocessor.py  # LLM + Regex cleaning logic (Core Innovation)
│   ├── lda_model.py     # Topic modeling core
│   └── llm_agent.py     # LLM interface for semantic interpretation
├── notebooks/           # Experimental designs
├── requirements.txt     # Dependencies
└── README.md            # Project documentation
