
Crawl Clean and Learn

Description

Crawl Clean and Learn is a basic data crawling, data cleaning, and Hadoop project. It has three parts.

In Part 1, data is crawled from the WikiCFP website (http://www.wikicfp.com/cfp/) for conferences on Data Mining, Machine Learning, Databases, and Artificial Intelligence. For each of these categories, the conference acronym, conference name, and conference location are obtained for further processing. Up to 20 pages are crawled per category.
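
The crawl in Part 1 can be sketched roughly as follows. This is a hypothetical illustration, not the project's actual crawler: the URL pattern and the regular expression assume a particular WikiCFP page layout, and the real markup may differ.

```python
import re

# Hypothetical sample of a results page; the real WikiCFP markup may differ.
SAMPLE_HTML = """
<tr><td><a href="/cfp/servlet/event.showcfp?eventid=1">ICDM 2024</a></td></tr>
<tr><td>International Conference on Data Mining</td><td>Abu Dhabi, UAE</td></tr>
"""

# Assumed row structure: an event link with the acronym, followed by
# name and location cells.
ROW_RE = re.compile(
    r'<a href="/cfp/servlet/event\.showcfp\?eventid=\d+">(?P<acronym>[^<]+)</a>'
    r'.*?<td>(?P<name>[^<]+)</td><td>(?P<location>[^<]+)</td>',
    re.S,
)

def extract_conferences(html):
    """Return (acronym, name, location) tuples found on one result page."""
    return [(m.group("acronym"), m.group("name"), m.group("location"))
            for m in ROW_RE.finditer(html)]

def category_urls(category, pages=20):
    """Build the result-page URLs for one category (up to 20 pages).

    The query-string format here is an assumption about WikiCFP's search URLs.
    """
    base = "http://www.wikicfp.com/cfp/call?conference={}&page={}"
    return [base.format(category.replace(" ", "%20"), p)
            for p in range(1, pages + 1)]
```

Fetching each URL (e.g. with urllib) and running the extractor over the response body would yield the acronym/name/location triples described above.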

In Part 2, data is cleaned using the OpenRefine tool and various inconsistencies are removed.
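
The cleaning in Part 2 is done interactively in OpenRefine, but the kinds of inconsistencies it removes can be sketched in code. The specific fixes below (whitespace, casing, duplicates) are illustrative assumptions, not a record of the exact transformations applied in the project.

```python
def clean_record(acronym, name, location):
    """Normalize one crawled record (hypothetical cleaning rules)."""
    return (acronym.strip().upper(),
            " ".join(name.split()),       # collapse runs of whitespace
            location.strip().title())     # unify casing, e.g. 'paris, FRANCE'

def dedupe(records):
    """Drop exact duplicate records while preserving order."""
    seen, out = set(), []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            out.append(rec)
    return out
```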

In Part 3, the data is analyzed and different statistics are computed to gain insight into it. Hadoop MapReduce is used for this. In total, four different computations are performed.
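
As one hypothetical example of such a computation, counting conferences per location fits the MapReduce pattern naturally. The sketch below is written in Hadoop Streaming style; the project's actual four jobs may compute different statistics and may be implemented differently (e.g. in Java).

```python
from itertools import groupby

def mapper(lines):
    """Emit (location, 1) for each cleaned record 'acronym<TAB>name<TAB>location'."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:
            yield parts[2], 1

def reducer(pairs):
    """Sum the counts for each location; pairs must arrive sorted by key,
    which Hadoop's shuffle phase guarantees between mapper and reducer."""
    for location, group in groupby(pairs, key=lambda kv: kv[0]):
        yield location, sum(count for _, count in group)
```

Run standalone, `sorted()` stands in for the shuffle: `dict(reducer(sorted(mapper(records))))` gives the per-location totals.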

A report is also included with the project code. More details can be found there.

Disclaimer

The source code does not come with any kind of warranty or support, though you can contact me if you need help.

Copyright

Sharmistha Bardhan
