feat: add lesson about using the platform #1424

Open · wants to merge 15 commits into master

3 changes: 2 additions & 1 deletion .github/styles/config/vocabularies/Docs/accept.txt
@@ -87,19 +87,20 @@ preconfigured
[Mm]ultiselect

asyncio
Langflow
backlinks?
captchas?
Chatbot
combinator
deduplicating
dev
Fakestore
Fandom('s)?
IMDb
influencers
iPads?
iPhones?
jQuery
Langflow
learnings
livestreams
outro
@@ -44,11 +44,11 @@ Successfully installed Jinja2-0.0.0 ... ... ... crawlee-0.0.0 ... ... ...

## Running Crawlee

Now let's use the framework to create a new version of our scraper. In the same project directory where our `main.py` file lives, create a file `newmain.py`. This way, we can keep peeking at the original implementation while working on the new one. The initial content will look like this:
Now let's use the framework to create a new version of our scraper. Rename the `main.py` file to `oldmain.py` so that we can keep peeking at the original implementation while working on the new one. Then, in the same project directory, create a new `main.py` with this initial content:

```py title="newmain.py"
```py
import asyncio
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.crawlers import BeautifulSoupCrawler

async def main():
crawler = BeautifulSoupCrawler()
@@ -74,8 +74,8 @@ In the code, we do the following:
Don't worry if this involves a lot of things you've never seen before. For now, you don't need to know exactly how [`asyncio`](https://docs.python.org/3/library/asyncio.html) works or what decorators do. Let's stick to the practical side and see what the program does when executed:

```text
$ python newmain.py
[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Current request statistics:
$ python main.py
[BeautifulSoupCrawler] INFO Current request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished │ 0 │
│ requests_failed │ 0 │
@@ -91,7 +91,7 @@ $ python newmain.py
[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
Sales
[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Final request statistics:
[BeautifulSoupCrawler] INFO Final request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished │ 1 │
│ requests_failed │ 0 │
@@ -122,7 +122,7 @@ For example, it takes a single line of code to extract and follow links to produ

```py
import asyncio
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.crawlers import BeautifulSoupCrawler

async def main():
crawler = BeautifulSoupCrawler()
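# A hedged sketch of the "single line of code" mentioned above: the default
# handler might enqueue product links roughly like this (the CSS selector and
# the "DETAIL" label are illustrative assumptions, not the lesson's exact values):
#
#     @crawler.router.default_handler
#     async def handle_listing(context):
#         await context.enqueue_links(selector=".product-item__title", label="DETAIL")
#
#     @crawler.router.handler("DETAIL")
#     async def handle_detail(context):
#         context.log.info(context.request.url)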
@@ -152,8 +152,8 @@ Below that, we give the crawler another asynchronous function, `handle_detail()`
If we run the code, we should see how Crawlee first downloads the listing page and then makes parallel requests to each of the detail pages, printing their URLs along the way:

```text
$ python newmain.py
[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Current request statistics:
$ python main.py
[BeautifulSoupCrawler] INFO Current request statistics:
┌───────────────────────────────┬──────────┐
...
└───────────────────────────────┴──────────┘
@@ -164,7 +164,7 @@ https://warehouse-theme-metal.myshopify.com/products/sony-sacs9-10-inch-active-s
https://warehouse-theme-metal.myshopify.com/products/sony-ps-hx500-hi-res-usb-turntable
...
[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Final request statistics:
[BeautifulSoupCrawler] INFO Final request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished │ 25 │
│ requests_failed │ 0 │
@@ -232,7 +232,7 @@ Finally, the variants. We can reuse the `parse_variant()` function as-is, and in
```py
import asyncio
from decimal import Decimal
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.crawlers import BeautifulSoupCrawler

async def main():
crawler = BeautifulSoupCrawler()
@@ -309,7 +309,7 @@ async def main():
await context.push_data(item)
```

That's it! If you run the program now, there should be a `storage` directory alongside the `newmain.py` file. Crawlee uses it to store its internal state. If you go to the `storage/datasets/default` subdirectory, you'll see over 30 JSON files, each representing a single item.
That's it! If you run the program now, there should be a `storage` directory alongside the `main.py` file. Crawlee uses it to store its internal state. If you go to the `storage/datasets/default` subdirectory, you'll see over 30 JSON files, each representing a single item.

![Single dataset item](images/dataset-item.png)
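
If you'd rather inspect the stored items programmatically than open the JSON files one by one, a small standalone script along these lines should work; it only assumes the default storage path mentioned above:

```py
import json
from pathlib import Path

# Crawlee's default dataset lives next to main.py under storage/datasets/default,
# with one JSON file per pushed item.
dataset_dir = Path("storage/datasets/default")

for path in sorted(dataset_dir.glob("*.json")):
    item = json.loads(path.read_text())
    print(path.name, item)
```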

@@ -335,7 +335,7 @@ Crawlee gives us stats about HTTP requests and concurrency, but we don't get muc
```py
import asyncio
from decimal import Decimal
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.crawlers import BeautifulSoupCrawler

async def main():
crawler = BeautifulSoupCrawler()
@@ -398,7 +398,7 @@ if __name__ == '__main__':

Depending on what we find helpful, we can tweak the logs to include more or less detail. The `context.log` or `crawler.log` objects are just [standard Python loggers](https://docs.python.org/3/library/logging.html).
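
Because they are ordinary `logging.Logger` objects, the usual standard-library tooling applies. As a hedged sketch (assuming a `crawler` instance has already been created), we could raise the log level and mirror the output to a file like this:

```py
import logging

# `crawler` is assumed to be the BeautifulSoupCrawler instance created in main().
crawler.log.setLevel(logging.DEBUG)  # show more detail in the console

file_handler = logging.FileHandler("scraper.log")  # also keep a copy on disk
file_handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
crawler.log.addHandler(file_handler)
```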

Even with the extra logging we've added, we've managed to cut at least 20 lines of code compared to the original program. Throughout this lesson, we've been adding features to match the old scraper's functionality, but the new code is still clean and readable. Plus, we've been able to focus on what's unique to the website we're scraping and the data we care about, while the framework manages the rest.
If we compare `main.py` and `oldmain.py` now, it's clear that even with the extra logging we've added, the new program is at least 20 lines shorter than the original. Throughout this lesson, we've introduced features to match the old scraper's functionality, but the new code is still clean and readable. Plus, we've been able to focus on what's unique to the website we're scraping and the data we care about, while the framework manages the rest.

In the next lesson, we'll use a scraping platform to set up our application to run automatically every day.

@@ -454,7 +454,7 @@ Hints:
import asyncio
from datetime import datetime

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.crawlers import BeautifulSoupCrawler

async def main():
crawler = BeautifulSoupCrawler()
@@ -554,7 +554,7 @@ When navigating to the first search result, you might find it helpful to know th
from urllib.parse import quote_plus

from crawlee import Request
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.crawlers import BeautifulSoupCrawler

async def main():
crawler = BeautifulSoupCrawler()
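# A hedged sketch of how the two imports above might be used together: build the
# search URL with quote_plus() and enqueue it as a labeled Request. The site,
# query, and label below are placeholders for illustration, not the exercise's
# actual values:
#
#     query = "example search term"
#     url = f"https://example.com/search?q={quote_plus(query)}"
#     await crawler.run([Request.from_url(url, label="SEARCH")])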