This repository was archived by the owner on May 20, 2025. It is now read-only.

NhanPhamThanh-IT/Deepnews-Summarizer

Automated News Web Scraper and Summarizer

Overview

This project is a web application designed to automatically scrape news articles from web pages, summarize their content, and present them to the user. It features both a local development version with more direct control over scraping and summarization models, and a deployed version with a FastAPI backend and a Streamlit frontend.

Features

  • Automated scraping of news articles from configured sources.
  • Article summarization: a fine-tuned BART model in the local version, the OpenAI API in the deployed version.
  • Streamlit UI with "Daily News" and "Custom Fetch" pages.
  • FastAPI backend serving scraped and summarized content in the deployed version.

Project Structure

NLPApplication/
├── .gitignore
├── LICENSE
├── README.md
├── deploy/                  # Files for the deployed version
│   ├── backend/             # FastAPI backend
│   │   ├── main.py          # FastAPI application
│   │   ├── requirements.txt
│   │   └── scraper.py       # Backend scraping and summarization logic
│   └── frontend/            # Streamlit frontend for deployed version
│       ├── Home.py
│       ├── requirements.txt
│       ├── components/      # UI components for Streamlit
│       ├── pages/           # Streamlit pages
│       ├── tests/           # Frontend tests
│       └── utils/           # Frontend utilities (e.g., API calls)
├── local/                   # Files for local development and testing
│   ├── Home.py              # Main Streamlit app for local version
│   ├── config.json          # Configuration for local app
│   ├── requirements.txt
│   ├── config/              # Configuration utilities for local UI
│   ├── pages/               # Streamlit pages for local version
│   ├── ui/                  # UI layout utilities for local version
│   └── utils/               # Core utilities for local version (scraping, summarization)
└── models/                  # Machine learning models and scripts
    ├── finetuning/          # Scripts for fine-tuning models
    ├── preprocessing/       # Data preprocessing scripts
    └── testing/             # Scripts for testing models

Technologies Used

  • Python
  • Local Version:
    • Streamlit: For the user interface.
    • Crawl4AI, BeautifulSoup4: For web scraping.
    • Transformers (Hugging Face): For using the fine-tuned BART summarization model.
    • NLTK, Pandas, Scikit-learn: For text processing and data handling.
  • Deployed Version:
    • FastAPI: For the backend API.
    • Streamlit: For the frontend user interface.
    • HTTPX, BeautifulSoup4: For backend web scraping.
    • OpenAI API: For summarization.
  • General: Git, Docker (implied for potential deployment).
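As a rough illustration of the extraction step (not the project's actual scraper, which uses Crawl4AI/BeautifulSoup4), article text can be pulled from fetched HTML with only the standard library:

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collects the text of <p> elements — a simplified stand-in for the
    BeautifulSoup4 selectors the real scraper uses."""
    def __init__(self):
        super().__init__()
        self._in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            # Append text chunks; a single <p> may yield several data events.
            self.paragraphs[-1] += data

html = "<html><body><h1>Title</h1><p>First paragraph.</p><p>Second.</p></body></html>"
parser = ParagraphExtractor()
parser.feed(html)
article_text = " ".join(parser.paragraphs)
```

The real pipeline then hands `article_text` to the summarization model; this sketch only covers extraction.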

Setup and Installation

Prerequisites

  • Python 3.8+
  • pip

Local Development Version

  1. Clone the repository:
    git clone <repository-url>
    cd NLPApplication/local
  2. Create and activate a virtual environment (recommended):
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies: The local/requirements.txt file appears to have some encoding issues. Ensure it's a plain text file with one package per line, then run:
    pip install -r requirements.txt
    You may need to manually clean up local/requirements.txt first. It should look something like:
    fastapi==0.115.1
    httpx==0.28.1
    openai==1.75.0
    # ... and so on for all packages
  4. Configure local/config.json with your desired URLs and CSS selectors.
  5. Run the Streamlit application:
    streamlit run Home.py
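The exact schema of local/config.json is defined by the app itself; a hypothetical shape, assuming each source pairs a URL with CSS selectors (all field names below are illustrative, not taken from the repo), might look like:

```json
{
  "sources": [
    {
      "name": "Example News",
      "url": "https://example.com/news",
      "article_selector": "div.article-list a",
      "content_selector": "div.article-body p"
    }
  ]
}
```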

Deployed Version (Conceptual - requires backend to be running)

Backend (deploy/backend)

  1. Navigate to the backend directory:
    cd NLPApplication/deploy/backend
  2. Create and activate a virtual environment.
  3. Install dependencies: Clean up deploy/backend/requirements.txt if it has encoding issues, then:
    pip install -r requirements.txt
  4. Set up environment variables:
    • OPENAI_API_KEY: Your OpenAI API key (required for summarization).
  5. Run the FastAPI application:
    uvicorn main:app --reload
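The environment variable from step 4 can be read and validated at startup so the backend fails fast instead of erroring on the first summarization request; a minimal sketch (the helper name is ours, not from the repo):

```python
import os

def load_openai_key() -> str:
    """Return the OpenAI API key, or fail fast if it is missing."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; the backend cannot summarize without it."
        )
    return key
```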

Frontend (deploy/frontend)

  1. Navigate to the frontend directory:
    cd NLPApplication/deploy/frontend
  2. Create and activate a virtual environment.
  3. Install dependencies: Clean up deploy/frontend/requirements.txt if it has encoding issues, then:
    pip install -r requirements.txt
  4. Ensure the backend is running and accessible. The frontend makes API calls to https://nlpapplication-0xrw.onrender.com.
  5. Run the Streamlit application:
    streamlit run Home.py
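The frontend's utils make HTTP calls to the backend at the URL above; a sketch of how a request URL might be assembled (the `/summarize` path and `url` query parameter are assumptions, not taken from the repo):

```python
from urllib.parse import urlencode

BACKEND_BASE = "https://nlpapplication-0xrw.onrender.com"

def summarize_endpoint(article_url: str) -> str:
    """Build the backend URL for summarizing one article.
    The path and parameter name here are hypothetical."""
    return f"{BACKEND_BASE}/summarize?{urlencode({'url': article_url})}"
```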

Usage

Local Version

  • Open the Streamlit application in your browser (usually http://localhost:8501).
  • Daily News: Navigate to the "Daily News" page. It will display news from sources configured in local/config.json. Click "More details" to scrape and summarize an article.
  • Custom Fetch: Navigate to the "Custom Fetch" page. Enter a URL to scrape and summarize its content directly.

Deployed Version

  • Access the Streamlit frontend URL (once deployed).
  • Daily News: Select a news category (e.g., Politics, Sports). Articles will be fetched from the backend. Click "Read more" to view the summarized content.
  • Scrape Specific Article: Enter a direct CNN article URL to fetch and display its summarized content.

Models

The models/ directory contains scripts related to:

  • finetuning/ – fine-tuning the summarization models.
  • preprocessing/ – preparing and cleaning the training data.
  • testing/ – evaluating the trained models.

Contributing

Contributions are welcome! Please feel free to submit a pull request or open an issue.

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

A feature-rich web application for automated news scraping and summarization. It allows users to enter article URLs, fetches the full content, and generates concise summaries. The system supports both local inference with custom models and remote deployment via FastAPI or Streamlit interfaces.
