diff --git a/README.md b/README.md
index 24cef390..54d32498 100644
--- a/README.md
+++ b/README.md
@@ -1,47 +1,131 @@
-

Apify SDK for Python

+

Apify SDK for Python

- PyPI package version - PyPI package downloads - Codecov report - PyPI Python version - Chat on Discord + The official Python SDK for building Apify Actors.

-The Apify SDK for Python is the official library to create [Apify Actors](https://docs.apify.com/platform/actors) -in Python. It provides useful features like Actor lifecycle management, local storage emulation, and Actor -event handling. +

+ PyPI version + PyPI downloads + Python versions + Build status + Coverage + License + Chat on Discord +

+
+`apify` is the official SDK for building [Apify Actors](https://docs.apify.com/platform/actors) in Python. It handles the Actor lifecycle, [storage](https://docs.apify.com/platform/storage) access, platform events, [Apify Proxy](https://docs.apify.com/platform/proxy), pay-per-event charging, and more.

-If you just need to access the [Apify API](https://docs.apify.com/api/v2) from your Python applications,
-check out the [Apify Client for Python](https://docs.apify.com/api/client/python) instead.
+> If you only need to **consume** the [Apify API](https://docs.apify.com/api/v2) from Python (running Actors, reading datasets, managing storages) rather than building Actors, use the [Apify API client for Python](https://docs.apify.com/api/client/python) instead. It comes bundled with this SDK.
+
+## Table of contents
+
+- [Installation](#installation)
+- [Quick start](#quick-start)
+- [What are Actors?](#what-are-actors)
+- [Features](#features)
+- [What you can build](#what-you-can-build)
+- [Usage examples](#usage-examples)
+- [Documentation](#documentation)
+- [Related projects](#related-projects)
+- [Support and community](#support-and-community)
+- [Contributing](#contributing)
+- [License](#license)

 ## Installation

-The Apify SDK for Python is available on PyPI as the `apify` package.
-For default installation, using Pip, run the following:
+The Apify SDK for Python requires **Python 3.11 or higher**. It is published on [PyPI](https://pypi.org/project/apify/) as the `apify` package and can be installed with [pip](https://pip.pypa.io/):

 ```bash
 pip install apify
 ```

-For users interested in integrating Apify with Scrapy, we provide a package extra called `scrapy`.
-To install Apify with the `scrapy` extra, use the following command:
+or with [uv](https://docs.astral.sh/uv/):

 ```bash
-pip install apify[scrapy]
+uv add apify
 ```

-## Documentation
+To use the Scrapy integration, install the `scrapy` extra:
+
+```bash
+pip install 'apify[scrapy]'
+```
+
+## Quick start
+
+An Actor is a Python program that runs inside the `async with Actor:` context. The context initializes the Actor when it starts and tears it down when it finishes. Here's a minimal Actor that reads its input and stores a result:
+
+```python
+from apify import Actor
+
+
+async def main() -> None:
+    async with Actor:
+        actor_input = await Actor.get_input()
+        Actor.log.info('Actor input: %s', actor_input)
+        await Actor.set_value('OUTPUT', 'Hello, world!')
+```
+
+The quickest way to scaffold a full Actor project, with the `.actor` configuration, input schema, and Dockerfile already in place, is the [Apify CLI](https://docs.apify.com/cli):
+
+1. Install the CLI:
+
+   ```bash
+   npm install -g apify-cli
+   ```
+
+2. Create a new Actor from the Python "getting started" template:
+
+   ```bash
+   apify create my-actor --template python-start
+   ```

-For usage instructions, check the documentation on [Apify Docs](https://docs.apify.com/sdk/python/).
+3. Run it locally:

-## Examples
+   ```bash
+   cd my-actor
+   apify run
+   ```

-Below are few examples demonstrating how to use the Apify SDK with some web scraping-related libraries.
+To create, run, and deploy your first Actor step by step, see the [Quick start guide](https://docs.apify.com/sdk/python/docs/quick-start).

-### Apify SDK with HTTPX and BeautifulSoup
+## What are Actors?
+
+Actors are serverless cloud programs that can do almost anything a human can do in a web browser. They range from small tasks, such as filling in forms or unsubscribing from online services, all the way up to scraping and processing vast numbers of web pages.
+
+They run either locally or on the [Apify platform](https://docs.apify.com/platform/), where you can run them at scale, monitor them, schedule them, or publish and monetize them. If you're new to Apify, learn [what Apify is](https://docs.apify.com/platform/about) in the platform documentation.
+
+## Features
+
+- Run the full Actor lifecycle inside `async with Actor:`, covering init, exit, failures, status messages, and reboots ([Actor lifecycle](https://docs.apify.com/sdk/python/docs/concepts/actor-lifecycle)).
+- Read Actor input validated against your input schema with `Actor.get_input()` ([Actor input](https://docs.apify.com/sdk/python/docs/concepts/actor-input)).
+- Read and write datasets, key-value stores, and request queues, locally or on the platform ([Working with storages](https://docs.apify.com/sdk/python/docs/concepts/storages)).
+- React to platform events such as system info, migration, and abort ([Actor events](https://docs.apify.com/sdk/python/docs/concepts/actor-events)).
+- Route requests through Apify Proxy with group selection, country targeting, and rotation ([Proxy management](https://docs.apify.com/sdk/python/docs/concepts/proxy-management)).
+- Start, call, abort, and metamorph other Actors and tasks, and attach webhooks to run events ([Interacting with other Actors](https://docs.apify.com/sdk/python/docs/concepts/interacting-with-other-actors), [Webhooks](https://docs.apify.com/sdk/python/docs/concepts/webhooks)).
+- Monetize your Actor with pay-per-event charging ([Pay-per-event](https://docs.apify.com/sdk/python/docs/concepts/pay-per-event)).
+- Reach the full [Apify API](https://docs.apify.com/api/v2) through a preconfigured `ApifyClient` ([Accessing the Apify API](https://docs.apify.com/sdk/python/docs/concepts/access-apify-api)).
+
+## What you can build
+
+Almost any Python project can become an Actor, including projects for:
+
+- **Web scraping and crawling** — The SDK is fully compatible with [Crawlee](https://crawlee.dev/python), which makes Apify a natural place to deploy and scale your crawlers (see the [Crawlee guide](https://docs.apify.com/sdk/python/docs/guides/crawlee)). It also works with other popular scraping libraries, such as [Scrapy](https://docs.apify.com/sdk/python/docs/guides/scrapy), [Scrapling](https://docs.apify.com/sdk/python/docs/guides/scrapling), or [Crawl4AI](https://docs.apify.com/sdk/python/docs/guides/crawl4ai).
+- **Browser automation** — Drive a real browser with [Playwright](https://docs.apify.com/sdk/python/docs/guides/playwright) or [Selenium](https://docs.apify.com/sdk/python/docs/guides/selenium), or with higher-level tools such as [Browser Use](https://docs.apify.com/sdk/python/docs/guides/browser-use).
+- **Web servers and APIs** — Run a [web server](https://docs.apify.com/sdk/python/docs/guides/running-webserver) inside an Actor to serve HTTP requests, for example to expose your scraper as a live API.
+- **AI agents** — Host agents built with your framework of choice. Ready-made Actor templates cover [PydanticAI](https://apify.com/templates/python-pydanticai), [CrewAI](https://apify.com/templates/python-crewai), [LangGraph](https://apify.com/templates/python-langgraph), [LlamaIndex](https://apify.com/templates/python-llamaindex-agent), and [Smolagents](https://apify.com/templates/python-smolagents).
+- **MCP servers** — Deploy a Python MCP server as an Actor and make its tools available to any MCP client. See the [MCP server](https://apify.com/templates/python-mcp-empty) and [MCP proxy](https://apify.com/templates/python-mcp-proxy) templates.
+
+Whatever you build, the Apify SDK doesn't lock you into a particular framework. Bring the libraries you already use, and let Apify run your project in the cloud.
+
+## Usage examples
+
+The examples below show two common setups, but the same `async with Actor:` pattern works with any stack. For more, see the [guides](https://docs.apify.com/sdk/python/docs/guides/beautifulsoup-httpx).
+
+### HTTPX with BeautifulSoup

-This example illustrates how to integrate the Apify SDK with [HTTPX](https://www.python-httpx.org/) and [BeautifulSoup](https://pypi.org/project/beautifulsoup4/) to scrape data from web pages.
+Scrape pages with [HTTPX](https://www.python-httpx.org/) and [BeautifulSoup](https://pypi.org/project/beautifulsoup4/), using the Actor's request queue to track URLs:

 ```python
 from bs4 import BeautifulSoup
@@ -52,45 +136,31 @@ from apify import Actor

 async def main() -> None:
     async with Actor:
-        # Retrieve the Actor input, and use default values if not provided.
         actor_input = await Actor.get_input() or {}
         start_urls = actor_input.get('start_urls', [{'url': 'https://apify.com'}])

-        # Open the default request queue for handling URLs to be processed.
+        # Enqueue the start URLs into the default request queue.
         request_queue = await Actor.open_request_queue()
-
-        # Enqueue the start URLs.
         for start_url in start_urls:
-            url = start_url.get('url')
-            await request_queue.add_request(url)
+            await request_queue.add_request(start_url['url'])

-        # Process the URLs from the request queue.
+        # Process the queue until it's empty.
         while request := await request_queue.fetch_next_request():
             Actor.log.info(f'Scraping {request.url} ...')
-
-            # Fetch the HTTP response from the specified URL using HTTPX.
             async with AsyncClient() as client:
                 response = await client.get(request.url)
-
-            # Parse the HTML content using Beautiful Soup.
             soup = BeautifulSoup(response.content, 'html.parser')

-            # Extract the desired data.
-            data = {
+            # Push the extracted data to the default dataset.
+            await Actor.push_data({
                 'url': request.url,
-                'title': soup.title.string,
-                'h1s': [h1.text for h1 in soup.find_all('h1')],
-                'h2s': [h2.text for h2 in soup.find_all('h2')],
-                'h3s': [h3.text for h3 in soup.find_all('h3')],
-            }
-
-            # Store the extracted data to the default dataset.
-            await Actor.push_data(data)
+                'title': soup.title.string if soup.title else None,
+            })
 ```

-### Apify SDK with PlaywrightCrawler from Crawlee
+### Crawlee with Playwright

-This example demonstrates how to use the Apify SDK alongside `PlaywrightCrawler` from [Crawlee](https://crawlee.dev/python) to perform web scraping.
+Scrape pages with [Crawlee](https://crawlee.dev/python)'s `PlaywrightCrawler`, which handles queueing, concurrency, and the browser for you:

 ```python
 from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
@@ -100,83 +170,61 @@ from apify import Actor

 async def main() -> None:
     async with Actor:
-        # Retrieve the Actor input, and use default values if not provided.
         actor_input = await Actor.get_input() or {}
-        start_urls = [url.get('url') for url in actor_input.get('start_urls', [{'url': 'https://apify.com'}])]
+        start_urls = [url['url'] for url in actor_input.get('start_urls', [{'url': 'https://apify.com'}])]

-        # Exit if no start URLs are provided.
-        if not start_urls:
-            Actor.log.info('No start URLs specified in Actor input, exiting...')
-            await Actor.exit()
+        crawler = PlaywrightCrawler(max_requests_per_crawl=50, headless=True)

-        # Create a crawler.
-        crawler = PlaywrightCrawler(
-            # Limit the crawl to max requests. Remove or increase it for crawling all links.
-            max_requests_per_crawl=50,
-            headless=True,
-        )
-
-        # Define a request handler, which will be called for every request.
         @crawler.router.default_handler
-        async def request_handler(context: PlaywrightCrawlingContext) -> None:
-            url = context.request.url
-            Actor.log.info(f'Scraping {url}...')
-
-            # Extract the desired data.
-            data = {
+        async def handler(context: PlaywrightCrawlingContext) -> None:
+            Actor.log.info(f'Scraping {context.request.url} ...')
+            await context.push_data({
                 'url': context.request.url,
                 'title': await context.page.title(),
-                'h1s': [await h1.text_content() for h1 in await context.page.locator('h1').all()],
-                'h2s': [await h2.text_content() for h2 in await context.page.locator('h2').all()],
-                'h3s': [await h3.text_content() for h3 in await context.page.locator('h3').all()],
-            }
-
-            # Store the extracted data to the default dataset.
-            await context.push_data(data)
-
-            # Enqueue additional links found on the current page.
+            })
+            # Follow links found on the page.
             await context.enqueue_links()

-        # Run the crawler with the starting URLs.
         await crawler.run(start_urls)
 ```

-## What are Actors?
+## Documentation

-Actors are serverless cloud programs that can do almost anything a human can do in a web browser.
-They can do anything from small tasks such as filling in forms or unsubscribing from online services,
-all the way up to scraping and processing vast numbers of web pages.
+The full SDK documentation lives at **[docs.apify.com/sdk/python](https://docs.apify.com/sdk/python)**. For the Apify platform itself, see the [Apify documentation](https://docs.apify.com/).

-They can be run either locally, or on the [Apify platform](https://docs.apify.com/platform/),
-where you can run them at scale, monitor them, schedule them, or publish and monetize them.
+| Section | What you'll find |
+|---|---|
+| [Overview](https://docs.apify.com/sdk/python/docs/overview) | What the SDK is, what Actors are, and how the pieces fit together. |
+| [Quick start](https://docs.apify.com/sdk/python/docs/quick-start) | Create, run, and deploy your first Python Actor. |
+| [Concepts](https://docs.apify.com/sdk/python/docs/concepts/actor-lifecycle) | Actor lifecycle, input, storages, events, proxy management, interacting with other Actors, webhooks, accessing the Apify API, logging, configuration, and pay-per-event. |
+| [Guides](https://docs.apify.com/sdk/python/docs/guides/beautifulsoup-httpx) | Integrations with BeautifulSoup, Parsel, Playwright, Selenium, Crawlee, Scrapy, Crawl4AI, and Browser Use, plus running a web server and using uv. |
+| [Upgrading](https://docs.apify.com/sdk/python/docs/upgrading/upgrading-to-v4) | Migrating between major versions. |
+| [API reference](https://docs.apify.com/sdk/python/reference) | Generated reference for every class and method. |
+| [Changelog](https://docs.apify.com/sdk/python/docs/changelog) | Release history and breaking changes. |

-If you're new to Apify, learn [what is Apify](https://docs.apify.com/platform/about)
-in the Apify platform documentation.
+## Related projects

-## Creating Actors
+- **[Apify API client for Python](https://docs.apify.com/api/client/python)** — talk to the Apify API directly from Python (bundled with this SDK).
+- **[Crawlee for Python](https://crawlee.dev/python)** — web scraping and browser automation framework; fully compatible with this SDK.
+- **[Apify SDK for JavaScript / TypeScript](https://docs.apify.com/sdk/js)** — the equivalent SDK for Node.js.
+- **[Apify API client for JavaScript / TypeScript](https://docs.apify.com/api/client/js)** — the equivalent API client for Node.js.
+- **[Crawlee for JavaScript / TypeScript](https://crawlee.dev)** — the original Node.js implementation of Crawlee.
+- **[Apify CLI](https://docs.apify.com/cli)** — command-line tool for creating, running, and deploying Actors locally and on the platform.

-To create and run Actors through Apify Console,
-see the [Console documentation](https://docs.apify.com/academy/getting-started/creating-actors#choose-your-template).
+## Support and community

-To create and run Python Actors locally, check the documentation for
-[how to create and run Python Actors locally](https://docs.apify.com/sdk/python/docs/quick-start).
+- **Discord** — chat with the team and other users on the [Apify Discord server](https://discord.gg/jyEM2PRvMU).
+- **GitHub issues** — report a bug or request a feature in the [issue tracker](https://github.com/apify/apify-sdk-python/issues).

-## Guides
+## Contributing

-To see how you can use the Apify SDK with other popular libraries used for web scraping,
-check out our guides for using
-[BeautifulSoup with HTTPX](https://docs.apify.com/sdk/python/docs/guides/beautifulsoup-httpx),
-[Parsel with Impit](https://docs.apify.com/sdk/python/docs/guides/parsel-impit),
-[Playwright](https://docs.apify.com/sdk/python/docs/guides/playwright),
-[Selenium](https://docs.apify.com/sdk/python/docs/guides/selenium),
-[Crawlee](https://docs.apify.com/sdk/python/docs/guides/crawlee),
-or [Scrapy](https://docs.apify.com/sdk/python/docs/guides/scrapy).
+Bug reports, fixes, and improvements are welcome! See [CONTRIBUTING.md](./CONTRIBUTING.md) for the development setup, coding standards, testing, and release process. The project uses [uv](https://docs.astral.sh/uv/) for project management and [Poe the Poet](https://poethepoet.natn.io/) as a task runner; the typical loop is:
+
+```bash
+uv run poe install-dev # install dev dependencies and git hooks
+uv run poe check-code # lint, type-check, and unit tests
+```

-## Usage concepts
+## License

-To learn more about the features of the Apify SDK and how to use them,
-check out the Usage Concepts section in the sidebar,
-particularly the guides for the [Actor lifecycle](https://docs.apify.com/sdk/python/docs/concepts/actor-lifecycle),
-[working with storages](https://docs.apify.com/sdk/python/docs/concepts/storages),
-[handling Actor events](https://docs.apify.com/sdk/python/docs/concepts/actor-events)
-or [how to use proxies](https://docs.apify.com/sdk/python/docs/concepts/proxy-management).
+Released under the [Apache License 2.0](./LICENSE).
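
Both usage examples in the new README read `start_urls` from the Actor input and index each entry with `['url']`, so an entry without a `url` key would raise `KeyError`. For Actors that may receive hand-written input, the list could be normalized first. The block below is a minimal, standard-library-only sketch of that idea. `normalize_start_urls` and `DEFAULT_START_URL` are hypothetical names chosen for illustration and are not part of the Apify SDK. Unlike the README examples, this sketch also falls back to the default when `start_urls` is present but empty.

```python
from __future__ import annotations

from typing import Any

# Same fallback URL the README examples use when no input is given.
DEFAULT_START_URL = 'https://apify.com'


def normalize_start_urls(actor_input: dict[str, Any] | None) -> list[str]:
    """Return the usable URLs from an Actor input dictionary.

    Hypothetical helper, not an SDK function. Entries that are not
    dictionaries, or that lack a non-empty string under 'url', are skipped
    instead of raising KeyError.
    """
    # Assumption: a missing or empty 'start_urls' falls back to the default.
    entries = (actor_input or {}).get('start_urls') or [{'url': DEFAULT_START_URL}]

    urls: list[str] = []
    for entry in entries:
        url = entry.get('url') if isinstance(entry, dict) else None
        if isinstance(url, str) and url.strip():
            urls.append(url.strip())
    return urls
```

Inside `async with Actor:`, the result of `await Actor.get_input()` could be passed to this helper, and the returned list handed to `request_queue.add_request(...)` one URL at a time or to `crawler.run(...)` as a whole.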