Understanding user sentiment in the New Energy Vehicle (NEV) market presents unique challenges. User-generated reviews are highly unstructured, noisy, and rich in domain-specific expressions.
While Large Language Models (LLMs) excel at semantic understanding, fully replacing statistical models with LLM-based classification often sacrifices comparability, transparency, and structural interpretability.
The Research Question:
Can LLMs be used not as black-box classifiers, but as semantic operators that enhance traditional topic models while preserving their statistical structure?
The core objective is to design a hybrid opinion mining framework that enables:
- Structurally interpretable topic modeling.
- Fine-grained semantic interpretation of latent topics.
- Comparative analysis between BEV (Battery Electric Vehicle) and PHEV (Plug-in Hybrid Electric Vehicle) user groups.
Rather than predicting sentiment alone, the framework aims to uncover structural differences in user concerns (e.g., Intelligence experience vs. Range anxiety) to support data-driven business decisions.
This framework positions itself between Classical Unsupervised Learning (LDA) and Generative AI (LLMs).
| Approach | Pros | Cons |
|---|---|---|
| Pure LDA | Probabilistic structure, Stable | Low semantic interpretability, Sensitive to noise |
| Pure LLM | High semantic understanding | High cost, Black-box nature, Hard to quantify structurally |
| Our Hybrid Framework | Interpretable + Semantic-aware | Complex pipeline design |
Design Philosophy:
- LDA provides the probabilistic backbone for topic stability.
- LLMs act as "Semantic Operators" for noise filtering and meaning extraction.
Inspired by recent computational social science literature (e.g., "Jobless Growth"), we use LLMs to identify recurrent noise patterns in raw user-generated content (UGC) and to automatically generate high-precision regular expressions. Because the generated patterns can be reviewed before they are applied, this "human-in-the-loop" approach is intended to make cleaning more scalable than manual keyword matching.
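The cleaning step can be sketched as follows. This is a minimal illustration, not the project's implementation: the patterns in `CANDIDATE_PATTERNS` are hypothetical stand-ins for LLM-proposed regexes that a reviewer has approved, and the helper names `compile_approved` and `clean_review` are assumptions made for this example.

```python
import re

# Hypothetical patterns standing in for LLM-proposed regexes that a human
# reviewer has approved. They are illustrative, not taken from the project.
CANDIDATE_PATTERNS = [
    r"https?://\S+",       # URLs
    r"\[[^\[\]]{1,10}\]",  # short bracketed emoticon codes such as [smile]
    r"(.)\1{3,}",          # any character repeated four or more times
]


def compile_approved(patterns):
    """Compile patterns, skipping any that are not valid regular expressions."""
    compiled = []
    for pattern in patterns:
        try:
            compiled.append(re.compile(pattern))
        except re.error:
            # An LLM may emit malformed regexes; drop them instead of crashing.
            continue
    return compiled


def clean_review(text, compiled):
    """Replace every match with a space, then collapse repeated whitespace."""
    for regex in compiled:
        text = regex.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


# "(unclosed" is deliberately invalid to show that bad patterns are skipped.
compiled = compile_approved(CANDIDATE_PATTERNS + ["(unclosed"])
cleaned = clean_review(
    "Great range!!!!! see https://example.com [smile] love it", compiled
)
print(cleaned)
```

Validating each pattern at compile time keeps a single malformed suggestion from breaking the whole cleaning pass, which matters when the patterns are machine-generated.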
Traditional LDA outputs unordered keyword lists. In this framework, LLMs serve as interpreters to bridge probabilistic outputs with human understanding, automatically generating coherent topic labels.
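As a rough sketch of how keyword lists could be handed to an LLM for labeling, the function below formats one topic's weighted terms into a prompt. The function name `build_label_prompt`, the prompt wording, and the sample weights are all assumptions for illustration. The input shape, a list of `(term, weight)` pairs, matches what Gensim's `LdaModel.show_topic` returns, and the actual LLM call is left out on purpose.

```python
def build_label_prompt(topic_id, weighted_terms, max_terms=10):
    """Format one topic's top terms into a labeling prompt (hypothetical wording)."""
    # Keep only the highest-weighted terms so the prompt stays short.
    top = sorted(weighted_terms, key=lambda tw: tw[1], reverse=True)[:max_terms]
    term_lines = "\n".join(f"- {term} ({weight:.3f})" for term, weight in top)
    return (
        f"The following keywords describe topic {topic_id} from an LDA model "
        "trained on new energy vehicle reviews.\n"
        f"{term_lines}\n"
        "Reply with a short label (at most five words) that summarizes the topic."
    )


# Made-up weights, used only to show the output format.
terms = [("charging", 0.081), ("range", 0.120), ("winter", 0.045)]
prompt = build_label_prompt(3, terms, max_terms=2)
print(prompt)
```

Including the weights in the prompt lets the model see which terms dominate the topic, while the keyword list itself still comes unchanged from the statistical model.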
Unlike generic sentiment analysis, this framework is explicitly designed for group-level comparison (BEV vs. PHEV), identifying asymmetric pain points across powertrain technologies.
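One simple way to make the group-level comparison concrete is to average each document's topic distribution within a group and subtract the two group means. The sketch below assumes per-document topic distributions are already available as lists of probabilities. The helper names and the toy numbers are hypothetical, and the project may use a different comparison statistic.

```python
from collections import defaultdict


def mean_topic_share(doc_topics, groups):
    """Average the per-document topic distributions within each group."""
    sums = {}
    counts = defaultdict(int)
    for dist, group in zip(doc_topics, groups):
        if group not in sums:
            sums[group] = [0.0] * len(dist)
        sums[group] = [s + p for s, p in zip(sums[group], dist)]
        counts[group] += 1
    return {g: [s / counts[g] for s in sums[g]] for g in sums}


def topic_gap(means, a="BEV", b="PHEV"):
    """Per-topic difference in mean share; positive means group `a` talks about it more."""
    return [x - y for x, y in zip(means[a], means[b])]


# Toy data: four reviews, two topics (values are made up for illustration).
doc_topics = [[0.75, 0.25], [0.25, 0.75], [0.5, 0.5], [0.0, 1.0]]
groups = ["BEV", "BEV", "PHEV", "PHEV"]

means = mean_topic_share(doc_topics, groups)
gap = topic_gap(means)
print(means)
print(gap)
```

A topic with a large positive or negative gap is a candidate "asymmetric pain point"; whether the gap is meaningful would still need to be checked against sample sizes and topic labels.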
The project follows a hybrid pipeline:
- Data Acquisition: Multi-source scraping from Chinese automobile platforms.
- LLM-Enhanced Preprocessing: Noise pattern discovery & Regex generation.
- Topic Modeling: Gensim-based LDA with Coherence optimization.
- Semantic Enrichment: LLM-based topic labeling & sentiment interpretation.
- Comparative Analysis: BEV vs. PHEV topic distribution visualization.
Current Status: Methodological Design & Prototyping
This repository currently serves as a Research Proposal and System Design demonstration.
- The pipeline architecture is fully specified.
- Core algorithms (LDA + LLM interaction logic) are defined in the `src/` directory.
- Large-scale quantitative evaluation is planned for the next phase.
The project is currently in the active prototyping phase. (Legend: ✅ Completed | 🔄 In Progress | ⏳ Planned)
- ✅ Literature Review: Methodological validation based on "Jobless Growth" (LLM+Regex) & specialized NEV studies.
- ✅ System Architecture: Design of the hybrid pipeline of "LDA Structure + LLM Semantics".
- ✅ Environment Setup: Repository initialization and dependency management (`requirements.txt`).
- 🔄 Data Acquisition: Developing multi-threaded scrapers for Autohome/Dongchedi.
- 🔄 LLM-Based Noise Modeling: Designing prompts for noise pattern discovery.
- ⏳ Regex Generation Pipeline: Implementing the automated Regex construction module.
- ⏳ LDA Model Training: Optimization of K topics via Coherence Score.
- ⏳ Semantic Injection: Integration of GPT-4/DeepSeek API for topic labeling.
- ⏳ Experimentation: Comparative analysis between BEV and PHEV datasets.
- ⏳ Result Visualization: Interactive pyLDAvis charts and Radar plots.
- ⏳ Final Report: Drafting the methodological paper.
NEV-Opinion-Mining/
├── data/ # Raw and processed datasets
├── src/
│ ├── scraper.py # Data collection scripts
│ ├── preprocessor.py # LLM + Regex cleaning logic (Core Innovation)
│ ├── lda_model.py # Topic modeling core
│ └── llm_agent.py # LLM interface for semantic interpretation
├── notebooks/ # Experimental designs
├── requirements.txt # Dependencies
└── README.md # Project documentation