**Allanon** @allanon@mastodon.uno · 5d *

Just for the record, some of the most aggressive ones are those from #Microsoft & #Bing:

BingBot : 52.167.144.*
BingBot : 40.77.167.*

I had some intense visits from #OpenAI too:
OpenAI : 52.255.111.84-87

...at least from their #useragent

#bot #crawler #scraper
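
A minimal sketch of checking a visitor's address against the ranges reported in this post, using Python's standard `ipaddress` module. It assumes that `52.167.144.*` means a /24 and that `52.255.111.84-87` is the /30 starting at .84. The names `REPORTED_RANGES` and `classify` are made up for illustration, and the ranges are only what the post reports, not an authoritative list.

```python
import ipaddress
from typing import Optional

# Ranges transcribed from the post; the wildcard is read as a /24 (assumption).
REPORTED_RANGES = {
    "BingBot": [
        ipaddress.ip_network("52.167.144.0/24"),
        ipaddress.ip_network("40.77.167.0/24"),
    ],
    # 52.255.111.84-87 is exactly the four addresses of this /30.
    "OpenAI": [ipaddress.ip_network("52.255.111.84/30")],
}


def classify(ip: str) -> Optional[str]:
    """Return the label of the first reported range containing ip, else None."""
    addr = ipaddress.ip_address(ip)
    for label, networks in REPORTED_RANGES.items():
        if any(addr in net for net in networks):
            return label
    return None
```

Calling `classify` on each client address in an access log would give a rough count per crawler. An address match says nothing about whether the user agent was genuine, which is the caveat the post itself raises.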

**César Pose** @cesarpose@infosec.exchange · Apr 3

#cats #catsofmastodon #cat #scraper

**Michael Ditsch** @midide@digitalcourage.social · Apr 2

#Digitalisierung #KI #Scraper #Wikipedia - 50 percent more bandwidth for multimedia requests - "The online encyclopedia Wikipedia and its associated libraries registered a drastic increase in bandwidth for downloads of multimedia content over the past year and blame it on scrapers collecting training data for AI. [...] The traffic from the AI scrapers is 'unprecedented' and means 'growing risks and costs', the Foundation also writes. In return, there is no added value, such as more visibility for Wikipedia and more visits from humans." - by Martin Holland - possibly € https://www.heise.de/news/KI-Scraper-belasten-Wikipedia-50-Prozent-mehr-Bandbreite-fuer-Multimedia-Abrufe-10336776.html?wt_mc=sm.red.ho.mastodon.mastodon.md_beitraege.md_beitraege&utm_source=mastodon

heise online · Apr 2: AI scrapers strain Wikipedia: 50 percent more bandwidth for multimedia requests

**Halbheld** @hlbhld@mas.to · Apr 2

#KI #Scraper strain Wikipedia: 50 percent more bandwidth for multimedia requests

https://www.heise.de/news/KI-Scraper-belasten-Wikipedia-50-Prozent-mehr-Bandbreite-fuer-Multimedia-Abrufe-10336776.html

**Kevin Karhan** @kkarhan@infosec.space · Feb 9

@stefanmuelller robots.txt is a request, NOT a legally binding opt-out!

If you know of any #Scraper or other crap, let me know and I'll put it on the public blocklist that I maintain...

www.robotstxt.org: The Web Robots Pages

#scraper
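
To illustrate why robots.txt is only a request: the check runs entirely on the crawler's side, so a crawler that never performs it is not stopped by anything. A sketch with Python's standard `urllib.robotparser`, assuming a hypothetical robots.txt that disallows `GPTBot` and a placeholder `example.org` URL (neither comes from the post):

```python
from urllib import robotparser

# Hypothetical robots.txt content, used only for this illustration.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def polite_crawler_may_fetch(user_agent: str, url: str) -> bool:
    """What a well-behaved crawler would conclude before requesting url."""
    parser = robotparser.RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

The server plays no part in this decision, which is why the posts in this thread fall back on user-agent and IP blocklists.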

**Kevin Karhan** @kkarhan@infosec.space · Jan 20 *

The whole #AI #Enshittification #shitshow is so bad that at work, entire address blocks as big as /12 have to be blocklisted because they basically #DDoS clients, unless we'd want to bankrupt customers for #bezos' #scraper #bots!

If it were my decision, the entire #aws #ASN16509 would've been blocked!

#Sysadmin #SysadminProblems #Amazon
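
For a sense of scale, the size of a /12 can be worked out with Python's standard `ipaddress` module. The network `172.16.0.0/12` below is a private range chosen purely as an example of a /12; it is not one of the blocks the post refers to.

```python
import ipaddress

# Example /12 only; any IPv4 /12 has the same size.
block = ipaddress.ip_network("172.16.0.0/12")

# A /12 leaves 32 - 12 = 20 host bits, so 2 ** 20 addresses.
addresses_in_block = block.num_addresses

# Number of /24 subnets that fit into a /12: 2 ** (24 - 12).
slash24s_in_block = 2 ** (24 - block.prefixlen)
```

So a single /12 entry covers 1,048,576 addresses, or 4,096 networks of /24 size.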

**PaulaToThePeople** @PaulaToThePeople@climatejustice.social · Jan 17 *

FediBlock newsmast

**matthieu_coulanges** @matthieu_coulanges@mastoart.social · Jan 15

Etching point and drypoints

Www.matthieucoulanges.fr

#printmaking #printmakersofinstagram #printwork

**Kevin Karhan** @kkarhan@infosec.space · Dec 27, 2024

@khobochka guess why I maintain a #Scraper #blocklist?

In fact, I know multiple people and organizations that have decided to basically redirect #ValueRemoving #Scrapers like #GPTbot and #ByteSpider (which literally #DDoS'd #MattKC because #ClownFlare are a criminally incompetent #RogueISP!) to #Hetzner's 10GB Speedtest file, which can be found at http://hil-speed.hetzner.com/10GB.bin, as an extra middle finger!

GitHub: lists.d/scrapers.ipv4.block.list.tsv at main · greyhat-academy/lists.d. List of useful things.

#Cloudflare #hetznered #ByteDance
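
The redirect described in this post could be sketched as a small WSGI application. This is only an illustration of the mechanism, not how any of the mentioned people or organizations actually do it. The names `SCRAPER_UA` and `app` and the normal response body are assumptions; the two bot names and the target URL are taken from the post.

```python
import re

# Target URL as given in the post.
SPEEDTEST_URL = "http://hil-speed.hetzner.com/10GB.bin"

# Case-insensitive match on the two scraper names mentioned in the post.
SCRAPER_UA = re.compile(r"gptbot|bytespider", re.IGNORECASE)


def app(environ, start_response):
    """Redirect matching user agents, serve everyone else normally."""
    user_agent = environ.get("HTTP_USER_AGENT", "")
    if SCRAPER_UA.search(user_agent):
        start_response("302 Found", [("Location", SPEEDTEST_URL)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"hello\n"]
```

For a quick local try, the standard library's `wsgiref.simple_server.make_server("", 8000, app)` can serve this callable. A production setup would more likely express the same rule in the web server's own configuration.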

**Seirdy** @Seirdy@pleroma.envs.net · Nov 2, 2024

Anybody know anything about the following User Agent strings?

ReplicantReaderBot: “Replicant” isn’t an entirely unique brand name. I hope this is unrelated to the Replicant LLM chatbots. If it is, is it used to train or is it just a client of the chatbots?
ArenaBot/1.0 (+https://arena.im/bot/; contact@arena.im) (page is a 404; is this used to train LLMs or does an LLM use this as a client to fetch data?)
SocialBeeAgent: again, used to train LLMs or a client of an LLM?
Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5 Tencent/BrandProtection. Does this obey robots.txt or am I gonna have to add another Nginx rule? I normally block brand-protection bots.

#bot #scraper #LazyWeb

**vanta rainbow black** @vantablack@cyberpunk.lol · Sep 11, 2024

found another scraper indexer thingy

https://mastogizmos.com

mastogizmos.com: MastoGizmos - Mastodon Tools and Searches

#scraper #indexer #fediblock

**BeAware** @BeAware@social.beaware.live · Aug 30, 2024

I'm sure it's not massively known quite yet, so I'll mention them again:

There's a Fedi scraper on multiple instances that's been here for about 6 months now. It won't go away.

Search for "Awakari" on your instance and block/report every account with the same profile pic, as a data scraper.

I've posted about them 7 or 8 times now because they've made multiple instances to ban evade. Those instances include Awakari.com, Awakari.app, and Indy.rest

The owner is Akurilov@mastodon.social

Just thought I should keep mentioning them for the new users that haven't blocked them yet, as Mastodon Gmbh seems content with allowing them to operate from their instance.

#Fediblock #Fedi #Fediverse

**Webrocker** @blog@webrocker.de · Aug 21, 2024 *

A(I)ll nuts

Over on the blog of the Uberspace operators there is a very interesting article about what the bots of the AI companies (which by now seem to have gone completely off the rails) are causing, with no regard for the damage:

(…) In summary, according to our observations, roughly 30%-50% of all requests for small sites are now generated by bots. For large sites, this figure even varies between 20% and 75%. In our view, and given that robots.txt is being ignored, a point has now been reached where this behavior by bots is no longer acceptable and harms our operations.
blog.uberspace.de

On my irregular excursions into the server logs of my own sites, and also those of my clients' sites, it is exactly the same: bot requests have increased disproportionately, and it is sometimes really brutal how frequently and from how many changing IPs these things hammer the site. >:-(

#Bots #DigitaleSelbstVerteidigung #robotsTxt #Scraper #WildWest

https://webrocker.de/?p=29216

blog.uberspace.de: Bad Robots
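
A share like the 30%-50% quoted above can be estimated from the user-agent column of an access log. The sketch below is not Uberspace's method. It assumes the user agents have already been extracted into a list, and the pattern `BOT_HINT` and the function `bot_share` are made up for illustration.

```python
import re

# Crude heuristic (assumption): many bots self-identify with one of these words.
BOT_HINT = re.compile(r"bot|crawler|spider|scraper", re.IGNORECASE)


def bot_share(user_agents):
    """Return the fraction of requests whose user agent looks like a bot."""
    agents = list(user_agents)
    if not agents:
        return 0.0
    bots = sum(1 for ua in agents if BOT_HINT.search(ua))
    return bots / len(agents)
```

This only counts bots that identify themselves. Crawlers that send a browser-like user agent from changing IPs, as described in the post, would be missed, so the result is a lower bound.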

**tomate** @jascha@ohai.social · Aug 11, 2024

The AI scrapers from Blogmojo still have not replied to my email. I sent them a friendly reminder.

https://jascha.wtf/blogmojo-ai-plagiat-im-jahr-2023-wenn-kuenstliche-intelligenz-texte-klaut/

jascha.wtf · Aug 4, 2024: Blogmojo.ai: Plagiarism in the year 2023 (sic!) - When artificial intelligence steals texts! I found an AI bot in my log files. It belongs to a service that writes blog posts for you. I had a few questions and got (unsatisfying) answers.

#ai #robots #scraper

**Seirdy** @Seirdy@pleroma.envs.net · Aug 6, 2024 *

Another new LLM scraper just dropped: AI2 Bot.

First-party documentation does not list any way to opt-out except filtering the user-agent on your server/firewall. The docs list the following User-Agent to filter:

Mozilla/5.0 (compatible) AI2Bot (+https://www.allenai.org/crawler)

My server logs contained the following string:

Mozilla/5.0 (compatible) Ai2Bot-Dolma (+https://www.allenai.org/crawler)

That appears to be for Ai2’s Dolma product.

159 hits came from 174.174.51.252, a Comcast-owned IP in Oregon.

I recommend adding ai2bot to your server’s user-agent matching rules if you don’t want to be in the Dolma dataset; unlike Common Crawl, this seems tailored specifically for training LLMs with few other uses.

allenai.org: Crawling notice | Ai2. Explanation and technical details of Ai2's web crawler.

#Scraper
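
The documented and the observed user-agent strings differ in capitalization and suffix, so a case-insensitive match on `ai2bot` is what covers both. A minimal sketch in Python, where the names `AI2BOT` and `is_ai2bot` are assumptions and the two strings are copied from the post:

```python
import re

# Case-insensitive so that "AI2Bot" and "Ai2Bot-Dolma" both match.
AI2BOT = re.compile(r"ai2bot", re.IGNORECASE)

DOCUMENTED_UA = "Mozilla/5.0 (compatible) AI2Bot (+https://www.allenai.org/crawler)"
OBSERVED_UA = "Mozilla/5.0 (compatible) Ai2Bot-Dolma (+https://www.allenai.org/crawler)"


def is_ai2bot(user_agent: str) -> bool:
    """True if the user-agent string contains 'ai2bot' in any capitalization."""
    return AI2BOT.search(user_agent) is not None
```

A case-sensitive rule for the documented string `AI2Bot` would have missed the `Ai2Bot-Dolma` hits seen in the logs.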

**Feilner IT** @FeilnerIT@mastodon.social · Aug 5, 2024

I was a guest on the TechnikTechnik podcast yesterday - many thanks for that to @MariusQuabeck@mastodon.rocks and colleagues!
Topic: unauthorized data use by AI scrapers and crawlers. They cause massive costs, do not stick to rules or robots.txt, and their tech billionaires take the position that all your data is belong to us.
Tools like #konterfAI allow even ordinary website operators to defend themselves.
https://techniktechnik.de/189/

TechnikTechnik: TT189 Gruscheln für Fortgeschrittene - TechnikTechnik. Anna tried out foodsharing, Peter installed Android on his Rabbit R1 and Marius freed his Audible audiobooks. Also: Marius has to tie his own shoes again, Google is shutting down short links, Google Chrome Manifest v3 & uBlock Origin, AI data usage, costs & konterfAI, Friend.com and much more!

#KI #crawler #scraper

**Kevin Karhan** @kkarhan@infosec.space · Aug 2, 2024

@xogium this issue of excessive crawlers is sadly nothing new. @MattKC / #MattKC experienced the same with #ByteSpider, the #Scraper used by #TikTok which results basically in his site getting #DDoS'd despite #ClownFlare being tasked to prevent it!

https://youtu.be/Hi5sd3WEh0c

Personally, I've run out of patience and tolerance for such actions by #GAFAMs and #TechBros and I'm so close to just blocklisting their entire ASN as a matter of principle!

It's just that I'd likely have to make an entire dedicated blocklist and tool up some script to pull a BGP feed, or rather IP assignment data, for their entire ASN, submit these in #git as branch updates, merge that, and block said network in its entirety as I did with the DoD networks.

#git
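
One step of the script described above, turning a list of announced prefixes into a compact blocklist, can be sketched with Python's standard `ipaddress` module. How the prefixes are obtained (BGP feed or assignment data) is left out. The function name `build_blocklist` is hypothetical, and the sketch handles IPv4 only.

```python
import ipaddress


def build_blocklist(prefixes):
    """Collapse IPv4 prefixes into the smallest equivalent list of CIDR blocks."""
    networks = [ipaddress.ip_network(p.strip()) for p in prefixes if p.strip()]
    # collapse_addresses needs a single IP version, so keep IPv4 only here.
    ipv4 = [net for net in networks if net.version == 4]
    # Adjacent and overlapping networks are merged; the result is sorted.
    return [str(net) for net in ipaddress.collapse_addresses(ipv4)]
```

Usage with placeholder prefixes from the documentation ranges (not real assignments of any ASN):

```python
blocklist = build_blocklist([
    "192.0.2.0/25",
    "192.0.2.128/25",
    "198.51.100.0/24",
    "198.51.100.16/28",
])
```

Writing the result one prefix per line keeps diffs between updates small, which suits the git-based workflow the post describes.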

**Metin Seven** @metin@graphics.social · Jul 25, 2024

And another AI scraping case (also see my previous post)…

AI video startup Runway reportedly trained on ‘thousands’ of YouTube videos without permission

https://www.engadget.com/ai-video-startup-runway-reportedly-trained-on-thousands-of-youtube-videos-without-permission-182314160.html

Engadget · Jul 25, 2024: AI video startup Runway reportedly trained on ‘thousands’ of YouTube videos without permission, by Will Shanklin

#AI #GenAI #ArtificialIntelligence

**Metin Seven** @metin@graphics.social · Jul 25, 2024

Noo… Really?!

Anthropic’s crawler is ignoring websites’ anti-AI scraping policies…

https://www.theverge.com/2024/7/25/24205943/anthropic-ai-web-crawler-claudebot-ifixit-scraping-training-data

The Verge · Jul 25, 2024: Anthropic’s crawler is ignoring websites’ anti-AI scraping policies, by Jess Weatherbed

#AI #GenAI #ArtificialIntelligence

**Austin Huang** @austin@mstdn.party · Jul 24, 2024 *

With regards to the utoots.com #scraper:
1. It currently depends on a Mastodon instance flashist[.]video; it is recommended to block the instance. flashist.(me|health) and previously flashist.(org|vip|live) are also operated by the same person. Ban evasion is to be expected.
2. I wrote a GitHub issue about it, archived at https://archive.ph/8ynKh. However he has chosen to cover up his GitHub profile instead.

Update: https://cyberpunk.lol/@vantablack/112849043193285926 (tldr: it's gone)

#FediBlock #MastoAdmin #FediAdmin
