xxICEY/NEV-Sentiment-LLM
🚗 NEV-Opinion-Mining

A Hybrid Topic–Semantic Framework Based on LDA and Large Language Models

1. Motivation

Understanding user sentiment in the New Energy Vehicle (NEV) market presents unique challenges. User-generated reviews are highly unstructured, noisy, and rich in domain-specific expressions.

While Large Language Models (LLMs) excel at semantic understanding, fully replacing statistical models with LLM-based classification often sacrifices comparability, transparency, and structural interpretability.

The Research Question:

Can LLMs be used not as black-box classifiers, but as semantic operators that enhance traditional topic models while preserving their statistical structure?

2. Research Objective

The core objective is to design a hybrid opinion mining framework that enables:

  1. Structurally interpretable topic modeling.
  2. Fine-grained semantic interpretation of latent topics.
  3. Comparative analysis between BEV (Battery Electric Vehicle) and PHEV (Plug-in Hybrid Electric Vehicle) user groups.

Rather than predicting sentiment alone, the framework aims to uncover structural differences in user concerns (e.g., Intelligence experience vs. Range anxiety) to support data-driven business decisions.

3. Methodological Positioning

This framework positions itself between Classical Unsupervised Learning (LDA) and Generative AI (LLMs).

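| Approach             | Pros                             | Cons                                                      |
|----------------------|----------------------------------|-----------------------------------------------------------|
| Pure LDA             | Probabilistic structure, stable  | Low semantic interpretability, sensitive to noise         |
| Pure LLM             | High semantic understanding      | High cost, black-box nature, hard to quantify structurally |
| Our Hybrid Framework | Interpretable and semantic-aware | Complex pipeline design                                   |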

Design Philosophy:

  • LDA provides the probabilistic backbone for topic stability.
  • LLMs act as "Semantic Operators" for noise filtering and meaning extraction.

4. Core Innovations

4.1 LLM-Driven Noise Modeling

Inspired by recent computational social science literature (e.g., "Jobless Growth"), we use LLMs to identify recurrent noise patterns in raw user-generated content (UGC) and to draft regular expressions that target them. A human reviews the candidate patterns before they are applied ("human-in-the-loop"), so the cleaning step is intended to scale beyond manual keyword matching while remaining auditable.
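A minimal sketch of how reviewed patterns could be applied, assuming the LLM returns plain regex strings and a reviewer marks each one as approved or rejected. The function names and the sample patterns are hypothetical and are not taken from `src/preprocessor.py`.

```python
import re


def compile_reviewed_patterns(candidates):
    """Keep only candidates that a reviewer approved and that compile."""
    compiled = []
    for pattern, approved in candidates:
        if not approved:
            continue
        try:
            compiled.append(re.compile(pattern))
        except re.error:
            # LLM output is not guaranteed to be a valid regex; skip it.
            continue
    return compiled


def clean_review(text, patterns):
    """Replace every match with a space, then collapse whitespace."""
    for pattern in patterns:
        text = pattern.sub(" ", text)
    return " ".join(text.split())


# Hypothetical LLM proposals paired with a reviewer's decision.
candidates = [
    (r"https?://\S+", True),        # links
    (r"\[[^\[\]]{1,8}\]", True),    # short bracketed emoticon codes
    (r"(unclosed", True),           # invalid regex, dropped at compile time
    (r"\d+", False),                # rejected: would delete range figures
]

patterns = compile_reviewed_patterns(candidates)
cleaned = clean_review(
    "Range is 500 km [smile] see https://example.com/x now", patterns
)
print(cleaned)
```

Keeping the approval flag next to each pattern makes the review decision explicit, and compiling inside a `try` block means a malformed suggestion cannot break the cleaning run.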

4.2 LLM-Assisted Semantic Interpretation

Traditional LDA describes each topic only as a probability-ranked keyword list, with no human-readable label. In this framework, LLMs serve as interpreters that bridge these probabilistic outputs and human understanding by generating a coherent label for each topic.
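The labeling step could look roughly like the sketch below. It assumes each topic arrives as `(word, weight)` pairs and that the LLM client is passed in as a plain callable taking a prompt string. `build_label_prompt`, `label_topic`, and `fake_complete` are illustrative names, and the prompt wording is an assumption rather than the project's actual prompt.

```python
def build_label_prompt(keywords, max_words=4):
    """Format one topic's keywords into a labeling prompt."""
    ranked = sorted(keywords, key=lambda kw: kw[1], reverse=True)
    listing = ", ".join(f"{word} ({weight:.3f})" for word, weight in ranked)
    return (
        "The following keywords come from one LDA topic over NEV user "
        "reviews, with their topic-word probabilities.\n"
        f"Keywords: {listing}\n"
        f"Reply with a single topic label of at most {max_words} words."
    )


def label_topic(keywords, complete, max_words=4):
    """Ask the injected LLM callable for a label and normalize the reply."""
    raw = complete(build_label_prompt(keywords, max_words))
    words = raw.strip().strip('"').split()
    # Enforce the length limit locally instead of trusting the model.
    return " ".join(words[:max_words])


def fake_complete(prompt):
    """Stand-in for a real API client so the sketch runs offline."""
    return '  "Charging and Range Anxiety"  '


topic = [("charging", 0.081), ("range", 0.102), ("winter", 0.040)]
label = label_topic(topic, fake_complete)
print(label)
```

Passing the client as a callable keeps the prompt logic testable without network access and leaves the choice of provider open.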

4.3 Comparative Topic–Sentiment Analysis

Unlike generic sentiment analysis, this framework is explicitly designed for group-level comparison (BEV vs. PHEV), with the aim of identifying pain points that are asymmetric across powertrain technologies.
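One way to quantify such a group-level difference is to average the per-document topic distributions within each group and compare the two averages. The sketch below uses made-up numbers and takes Jensen-Shannon divergence as the summary measure. Both the data and the choice of measure are assumptions, not results or methods reported by the project.

```python
import math


def mean_topic_distribution(doc_topics):
    """Average per-document topic proportions across one user group."""
    n_docs = len(doc_topics)
    n_topics = len(doc_topics[0])
    return [sum(doc[k] for doc in doc_topics) / n_docs for k in range(n_topics)]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Illustrative document-topic proportions (rows sum to 1), not real data.
bev_docs = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
phev_docs = [[0.2, 0.3, 0.5], [0.2, 0.1, 0.7]]

bev = mean_topic_distribution(bev_docs)
phev = mean_topic_distribution(phev_docs)

# Positive gap: topic is more prominent for BEV; negative: for PHEV.
gaps = [b - p for b, p in zip(bev, phev)]
divergence = js_divergence(bev, phev)

print(gaps, divergence)
```

The per-topic gaps show which concerns are asymmetric, while the single divergence value summarizes how far apart the two groups are overall.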

5. System Architecture

The project follows a hybrid pipeline:

  1. Data Acquisition: Multi-source scraping from Chinese automobile platforms.
  2. LLM-Enhanced Preprocessing: Noise pattern discovery & Regex generation.
  3. Topic Modeling: Gensim-based LDA with Coherence optimization.
  4. Semantic Enrichment: LLM-based topic labeling & sentiment interpretation.
  5. Comparative Analysis: BEV vs. PHEV topic distribution visualization.
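To illustrate what "Coherence optimization" in step 3 can mean, the sketch below scores candidate topic sets with a simplified UMass-style coherence and picks the number of topics with the highest mean score. It is a standard-library stand-in for illustration only. The project names Gensim for this step, and the toy corpus, the `topics_by_k` candidates, and the pair-averaging are assumptions.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """Simplified UMass coherence, averaged over word pairs.

    top_words must be ordered from most to least probable; each pair is
    scored as log((D(w_i, w_j) + 1) / D(w_j)) with w_j the higher-ranked word.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score, pairs = 0.0, 0
    for j, i in combinations(range(len(top_words)), 2):  # j < i
        denom = doc_freq(top_words[j])
        if denom == 0:
            continue  # word never occurs in the corpus; skip the pair
        score += math.log((doc_freq(top_words[i], top_words[j]) + 1) / denom)
        pairs += 1
    return score / pairs if pairs else 0.0


# Toy tokenized reviews, for illustration only.
docs = [
    ["range", "battery", "winter"],
    ["range", "battery", "charging"],
    ["screen", "voice", "lag"],
    ["screen", "voice", "range"],
]

# Hypothetical top words per topic for two candidate values of K.
topics_by_k = {
    2: [["range", "battery"], ["screen", "voice"]],
    3: [["range", "lag"], ["battery", "screen"], ["voice", "winter"]],
}


def mean_coherence(topics):
    return sum(umass_coherence(t, docs) for t in topics) / len(topics)


best_k = max(topics_by_k, key=lambda k: mean_coherence(topics_by_k[k]))
print(best_k)
```

In a real run the candidate topics would come from LDA models trained at each K, and the coherence measure would be whichever one the pipeline settles on.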

6. Research Stage Clarification ⚠️

Current Status: Methodological Design & Prototyping

This repository currently serves as a Research Proposal and System Design demonstration.

  • The pipeline architecture is fully specified.
  • Core algorithms (LDA + LLM interaction logic) are defined in the src/ directory.
  • Large-scale quantitative evaluation is planned for the next phase.

7. Project Roadmap

The project is currently in the active prototyping phase. (Legend: ✅ Completed | 🔄 In Progress | ⏳ Planned)

Phase 1: Research Design & Infrastructure

  • Literature Review: Methodological validation based on "Jobless Growth" (LLM+Regex) & specialized NEV studies.
  • System Architecture: Designing the hybrid pipeline of "LDA Structure + LLM Semantics".
  • Environment Setup: Repository initialization and dependency management (requirements.txt).

Phase 2: Data & Preprocessing (The Hybrid Engine)

  • 🔄 Data Acquisition: Developing multi-threaded scrapers for Autohome/Dongchedi.
  • 🔄 LLM-Based Noise Modeling: Designing prompt engineering for noise pattern discovery.
  • Regex Generation Pipeline: Implementing the automated Regex construction module.

Phase 3: Modeling & Interpretation

  • LDA Model Training: Optimization of K topics via Coherence Score.
  • Semantic Injection: Integration of GPT-4/DeepSeek API for topic labeling.
  • Experimentation: Comparative analysis between BEV and PHEV datasets.

Phase 4: Visualization & Reporting

  • Result Visualization: Interactive pyLDAvis charts and Radar plots.
  • Final Report: Drafting the methodological paper.

8. Directory Structure

NEV-Opinion-Mining/
├── data/                # Raw and processed datasets
├── src/
│   ├── scraper.py       # Data collection scripts
│   ├── preprocessor.py  # LLM + Regex cleaning logic (Core Innovation)
│   ├── lda_model.py     # Topic modeling core
│   └── llm_agent.py     # LLM interface for semantic interpretation
├── notebooks/           # Experimental designs
├── requirements.txt     # Dependencies
└── README.md            # Project documentation

About

A hybrid framework combining LDA topic modeling and LLMs for interpretable opinion mining. Comparing BEV vs. PHEV user sentiments through a decision-science lens.
