Real-time-relevant-news-on-multi-source-websites

A big data project that collects relevant news from multiple websites in real time.

Computing framework: Apache Spark

Language: Python

API: PySpark

The project involves the following steps:

1. Web Scraping

Real-time news related to a specific topic is scraped from multiple sources.
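As a rough illustration of this step, the sketch below pulls headlines out of a page and prints the ones that mention the topic, one per line, so they can be piped to nc. It is not the code in news.py: the names HeadlineParser, extract_headlines and scrape are hypothetical, and the assumption that headlines live in h2 and h3 elements will not hold for every site.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class HeadlineParser(HTMLParser):
    """Collects the text of <h2>/<h3> elements (assumed to hold headlines)."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._depth = 0   # > 0 while inside a headline element
        self._parts = []  # text fragments of the current headline

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3"):
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in ("h2", "h3") and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Join fragments and collapse runs of whitespace.
                text = " ".join("".join(self._parts).split())
                if text:
                    self.headlines.append(text)
                self._parts = []

    def handle_data(self, data):
        if self._depth > 0:
            self._parts.append(data)


def extract_headlines(html, topic):
    """Return the headlines in `html` that mention `topic` (case-insensitive)."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    topic = topic.lower()
    return [h for h in parser.headlines if topic in h.lower()]


def scrape(urls, topic):
    """Fetch each URL and print matching headlines, one per line."""
    for url in urls:
        with urlopen(url, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        for headline in extract_headlines(html, topic):
            # flush so a downstream `nc` sees each line immediately
            print(headline, flush=True)
```

Calling scrape() with a list of news site URLs and a topic such as kerala would write matching headlines to standard output; the URL list is left to the reader.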

2. Spark Streaming

The real-time news scraped from multiple websites in the previous step is passed to Spark Streaming over a TCP socket. Spark Streaming reads the live data as discretized streams (DStreams).
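A minimal sketch of reading such a socket feed with PySpark's DStream API is shown below. It assumes the feed is served on localhost port 9999 (the port used in the run instructions) and a 5-second batch interval; the function names and the application name are hypothetical and are not taken from streaming.py.

```python
def build_stream(host="localhost", port=9999, batch_seconds=5):
    """Create a StreamingContext and a DStream of text lines from a TCP socket."""
    # Imported here so the module can be loaded without a Spark installation.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="RelevantNews")  # hypothetical app name
    ssc = StreamingContext(sc, batch_seconds)  # one micro-batch per interval
    lines = ssc.socketTextStream(host, port)   # each record is one line of text
    return ssc, lines


def run():
    """Print the first records of every batch until the job is stopped."""
    ssc, lines = build_stream()
    lines.pprint()
    ssc.start()
    ssc.awaitTermination()
```

Each batch interval produces one RDD inside the DStream, so a shorter batch_seconds gives fresher output at the cost of more scheduling overhead.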

3. Filtering

Spark Streaming filters the input DStreams with the filter() and union() transformations to keep only the relevant news.
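One way the two transformations might be combined is sketched below: one filter() per keyword, then union() to merge the filtered streams. This is an assumption about the approach, not the contents of streaming.py, and the keyword list and the names contains and relevant_news are hypothetical.

```python
from functools import reduce

KEYWORDS = ["flood", "rescue"]  # hypothetical keywords, for illustration only


def contains(keyword):
    """Return a predicate that is True for lines mentioning `keyword`."""
    keyword = keyword.lower()
    return lambda line: keyword in line.lower()


def relevant_news(lines, keywords=KEYWORDS):
    """Filter `lines` once per keyword and merge the results with union().

    `lines` is expected to be a DStream, but any object that offers
    filter(f) and union(other) with the same meaning will work.
    """
    if not keywords:
        raise ValueError("at least one keyword is required")
    per_keyword = [lines.filter(contains(k)) for k in keywords]
    return reduce(lambda a, b: a.union(b), per_keyword)
```

Note that union() does not remove duplicates, so a line that matches several keywords appears once per matching keyword in the merged stream.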

To run

1. Run news.py in one terminal: python news.py {topic} | nc -lk 9999

2. Run streaming.py in another terminal: spark-submit streaming.py | cat > {filename}

Note: cat > {filename} stores the results in a file.

Example:

python news.py kerala | nc -lk 9999 (in one terminal)

spark-submit streaming.py | cat > file.txt (in another terminal)

The filtering keywords can be changed in streaming.py.
