You can do anything at Zombocom

Gork@sopuli.xyz · 23 hours ago

You can do anything at Zombocom

yetAnotherUser@lemmy.ca · 22 hours ago

Hey, you guys got any cool tips for website scraping?

MonkderVierte@lemmy.zip · 3 hours ago

Consider free API first if possible.

irelephant [he/him]@lemmy.dbzer0.com · 8 hours ago

what do you want to scrape?

MalReynolds@piefed.social · 17 hours ago

Beautiful Soup (Python library, bs4) is also fren
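
Rough sketch of what that looks like, assuming `beautifulsoup4` is installed. The HTML snippet and the `extract_links` helper are made up for illustration, not taken from any real site:

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical page fragment; note the second <li> is never closed.
HTML = """
<ul class="posts">
  <li><a href="/post/1">First post</a></li>
  <li><a href="/post/2">Second post</a>
</ul>
"""

def extract_links(html: str) -> list[tuple[str, str]]:
    """Return (href, link text) pairs for every <a> that carries an href."""
    # "html.parser" is the stdlib-backed parser, so no extra dependency like lxml.
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a["href"], a.get_text(strip=True))
        for a in soup.find_all("a", href=True)
    ]

if __name__ == "__main__":
    for href, text in extract_links(HTML):
        print(href, text)
```

The sloppy markup still yields both links, which is most of the appeal.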

luciole (he/him)@beehaw.org · 21 hours ago

They’re gonna tell you not to parse HTML with regular expressions. Heed this warning, and do it anyways.

yetAnotherUser@lemmy.ca · 13 hours ago

Thanks for your reply. What are your arguments in favour of parsing HTML with regex instead of using another method?

luciole (he/him)@beehaw.org · 2 hours ago

You have basically two options: treat HTML as a string, or parse it and then process it with higher-level DOM features.

The problem with the second approach is that HTML may look like an XML dialect, but it is actually immensely quirky and tolerant. Moreover, the modern web page is crazy bloated, so mass processing pages might be surprisingly demanding. And in the end you still need to write custom code to grab the data you’re after.
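
To give an idea of the custom code involved, here is a minimal sketch of the parsing route using only the standard library's `html.parser`. The markup, the `HeadingAndLinks` class and the `scrape` helper are all hypothetical:

```python
from html.parser import HTMLParser

# Made-up fragment with typical quirks: an unclosed <p>, an unquoted
# attribute value, and an upper-case tag and attribute name.
HTML = '<h1>Deals</h1><p>Unclosed paragraph<a href=/item/1>One</a><A HREF="/item/2">Two</a>'

class HeadingAndLinks(HTMLParser):
    """Collect the text of <h1> elements and every link target."""

    def __init__(self) -> None:
        super().__init__()
        self.heading = ""
        self.links: list[str] = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased.
        if tag == "h1":
            self._in_h1 = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.heading += data

def scrape(html: str) -> tuple[str, list[str]]:
    parser = HeadingAndLinks()
    parser.feed(html)
    parser.close()
    return parser.heading, parser.links

if __name__ == "__main__":
    print(scrape(HTML))
```

Even for two fields you end up tracking parser state by hand.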

On the other hand string searching is as lightweight as it gets and you typically don’t really need to care about document structure as a scraper anyways.
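
As a sketch of the string route, assuming the data sits next to a stable marker (the `class="price"` attribute and the snippet below are invented for the example):

```python
import re

# Hypothetical fragment; the surrounding structure is deliberately ignored.
HTML = (
    '<div><span class="price">$19.99</span></div>'
    '<td><span class="price" data-sale="1"> $5.00</span>'
)

# Find the marker, skip the rest of the opening tag, capture the amount.
PRICE_RE = re.compile(r'class="price"[^>]*>\s*\$([0-9]+(?:\.[0-9]{2})?)')

def extract_prices(html: str) -> list[float]:
    """Return every price found after a class="price" marker."""
    return [float(amount) for amount in PRICE_RE.findall(html)]

if __name__ == "__main__":
    print(extract_prices(HTML))
```

This breaks the moment the site renames that class or reorders attributes, so it suits quick one-off jobs more than long-lived scrapers.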

lime!@feddit.nu · 12 hours ago

it’s quick, it’s easy and it’s free

MonkderVierte@lemmy.zip · 12 hours ago

Are you an LLM?

TropicalDingdong@lemmy.world · 22 hours ago

Selenium is your fren