Skip to content

metagrove/IMDB_web_scaping

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 

Repository files navigation

IMDB_web_scaping

Python code is designed to extract information about the top movies listed on IMDb and store this data in an Excel spreadsheet using web scraping.

Importing Libraries:

The code begins by importing the necessary libraries for this task: BeautifulSoup: This library is a powerful tool for parsing HTML and XML documents. It allows you to navigate through the structure of a webpage and extract specific elements and data. requests: This library simplifies the process of making HTTP requests. It allows you to send requests to websites and retrieve the responses, which is essential for web scraping. openpyxl: This library is designed for working with Excel files. It allows you to create, read, and modify Excel spreadsheets in Python.

Setting up the Excel Workbook: excel = openpyxl.Workbook() creates a new Excel workbook object. sheet = excel.active gets the currently active worksheet within the workbook. sheet.title = "Movie_list" sets the title of the worksheet to "Movie_list". sheet.append(["Rank", "Name", "Rating", "Vote", "Year"]) adds a header row to the worksheet with the specified column names. This header row provides labels for the data that will be extracted.

Web Scraping with Requests and BeautifulSoup: URL = "https://www.imdb.com/chart/top/" defines the URL of the IMDb Top 250 movies page. This is the target website from which the code will extract data. USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0" defines a user agent string. This string is included in the HTTP request to make the request appear as if it's coming from a web browser (Firefox in this case). This can be helpful to avoid being blocked by websites that restrict access from non-browser clients. headers = {"user-agent": USER_AGENT} creates a dictionary to store the user agent information, which will be included in the request headers. resp = requests.get(URL, headers=headers) sends a GET request to the specified URL, including the user agent in the headers. This retrieves the HTML content of the IMDb Top 250 page. soup = BeautifulSoup(resp.content, "html.parser") creates a BeautifulSoup object to parse the HTML content of the page. The html.parser argument specifies the parser to be used.

Extracting Movie Data: movies = soup.find('ul', class_="ipc-metadata-list ...").find_all('li') uses BeautifulSoup to locate the unordered list (ul) element with the specified class that contains the list of movies. Then, it finds all the list items (li) within that ul element, each representing a movie. The code then iterates through each movie in the movies list. Inside the loop, it extracts specific details for each movie using find() and .text methods of BeautifulSoup: movie_rank: The rank of the movie in the Top 250 list. movie_names: The title of the movie. movie_rating: The IMDb rating of the movie. movie_vote_cnt: The number of votes the movie has received. movie_year: The year the movie was released. print(movie_names) prints the name of each movie to the console. You could replace this with sheet.append([movie_rank, movie_names, movie_rating, movie_vote_cnt, movie_year]) to add the extracted data as a new row in the Excel spreadsheet.

Error Handling: The try...except block is used for error handling. If any exception occurs during the web scraping process, the code will catch the exception and print the error message to the console.

Saving the Excel File: excel.save("Scrape_movie_data.xlsx") saves the Excel workbook with the extracted movie data to a file named "Scrape_movie_data.xlsx". This code provides a basic example of web scraping using Python. It demonstrates how to retrieve data from a website, parse the HTML content, extract specific information, and store it in a structured format like an Excel spreadsheet. Remember to always respect website terms of service and robots.txt

About

Python code is designed to extract information about the top movies listed on IMDb and store this data in an Excel spreadsheet using web scraping.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors