
Crawl Clean and Learn

Description

Crawl Clean and Learn is a basic data crawling, data cleaning, and Hadoop project. It has three parts.

In Part 1, data is crawled from the WikiCFP website (http://www.wikicfp.com/cfp/) for conferences on Data Mining, Machine Learning, Databases, and Artificial Intelligence. For each of these categories, the conference acronym, conference name, and conference location are obtained for further processing. Up to 20 pages are crawled per category.
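
The crawl in Part 1 can be sketched roughly as follows. This is a hypothetical illustration, not the project's actual crawler: the URL pattern and the regular expression assume a particular WikiCFP page layout, and the real markup may differ.

```python
import re

# Hypothetical sample of a results page; the real WikiCFP markup may differ.
SAMPLE_HTML = """
<tr><td><a href="/cfp/servlet/event.showcfp?eventid=1">ICDM 2024</a></td></tr>
<tr><td>International Conference on Data Mining</td><td>Abu Dhabi, UAE</td></tr>
"""

# Assumed row structure: an event link with the acronym, followed by
# name and location cells.
ROW_RE = re.compile(
    r'<a href="/cfp/servlet/event\.showcfp\?eventid=\d+">(?P<acronym>[^<]+)</a>'
    r'.*?<td>(?P<name>[^<]+)</td><td>(?P<location>[^<]+)</td>',
    re.S,
)

def extract_conferences(html):
    """Return (acronym, name, location) tuples found on one result page."""
    return [(m.group("acronym"), m.group("name"), m.group("location"))
            for m in ROW_RE.finditer(html)]

def category_urls(category, pages=20):
    """Build the result-page URLs for one category (up to 20 pages).

    The query-string format here is an assumption about WikiCFP's search URLs.
    """
    base = "http://www.wikicfp.com/cfp/call?conference={}&page={}"
    return [base.format(category.replace(" ", "%20"), p)
            for p in range(1, pages + 1)]
```

Fetching each URL (e.g. with urllib) and running the extractor over the response body would yield the acronym/name/location triples described above.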

In Part 2, data is cleaned using the OpenRefine tool and various inconsistencies are removed.
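
The cleaning in Part 2 is done interactively in OpenRefine, but the kinds of inconsistencies it removes can be sketched in code. The specific fixes below (whitespace, casing, duplicates) are illustrative assumptions, not a record of the exact transformations applied in the project.

```python
def clean_record(acronym, name, location):
    """Normalize one crawled record (hypothetical cleaning rules)."""
    return (acronym.strip().upper(),
            " ".join(name.split()),       # collapse runs of whitespace
            location.strip().title())     # unify casing, e.g. 'paris, FRANCE'

def dedupe(records):
    """Drop exact duplicate records while preserving order."""
    seen, out = set(), []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            out.append(rec)
    return out
```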

In Part 3, the data is analyzed and different statistics are computed to gain insight into it. Hadoop MapReduce is used for this. In total, four different computations are performed.
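
As one hypothetical example of such a computation, counting conferences per location fits the MapReduce pattern naturally. The sketch below is written in Hadoop Streaming style; the project's actual four jobs may compute different statistics and may be implemented differently (e.g. in Java).

```python
from itertools import groupby

def mapper(lines):
    """Emit (location, 1) for each cleaned record 'acronym<TAB>name<TAB>location'."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:
            yield parts[2], 1

def reducer(pairs):
    """Sum the counts for each location; pairs must arrive sorted by key,
    which Hadoop's shuffle phase guarantees between mapper and reducer."""
    for location, group in groupby(pairs, key=lambda kv: kv[0]):
        yield location, sum(count for _, count in group)
```

Run standalone, `sorted()` stands in for the shuffle: `dict(reducer(sorted(mapper(records))))` gives the per-location totals.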

A report is also included with the project code. More details can be found there.

Disclaimer

The source code does not come with any kind of warranty or support, though you can contact me if you need help.

Copyright

Sharmistha Bardhan
