Nostr Spam Filter

Slides for the presentation on Fighting Spam & Abuse on Nostr:

Recently (mid-February), Nostr was being hit with ~500k spam messages a day: ads for spam services, scams, NSFW content and so on. Some relays installed a paywall, which helped them, but spammers simply shifted to other relays. Open relays try to fight back by blocking IPs or pubkeys, but spammers adapt and keep going.

Nostr.Band is a Nostr search engine that collects events from all relays. So it receives the spam from the whole network, and can't even block spammers by IP, because the data arrives from relays rather than from the spammers themselves.

We've used trust rank from day 1 to suppress spam from search results - you might have noticed that there is no garbage showing up on the website. However, we still had to ingest all the spam, store it, and filter it out. Sometimes that's useful - it's a signal that drags spammers' trust rank to zero. But other times it's just a waste of resources.

So now we've added another, content-based spam filter. Nostr.Band uses it to dismiss the spam that we get from other relays. Since most spam right now is ads - meaningful, highly repetitive commercial messages - a simple clustering by content similarity works well.

We take all events from the last hour, convert each event's content into a list of words/ngrams, then group messages by how similar their word lists are. If a group grows larger than 100 (meaning a similar message was posted more than 100 times in the last hour), we assume it's spam, and start dismissing new messages that are similar to it.
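The steps above can be sketched roughly as follows. This is only an illustration, not the actual Nostr.Band code: the similarity measure (Jaccard over words plus word bigrams), the 0.7 cutoff, the greedy first-match grouping, and all function names are assumptions. Only the group size limit of 100 comes from the description above.

```python
import re

SIMILARITY_THRESHOLD = 0.7  # assumed cutoff; not stated in the text above
SPAM_CLUSTER_SIZE = 100     # from the text: a group larger than 100 is treated as spam


def tokenize(content):
    """Lowercase the content and return its set of words plus word bigrams."""
    words = re.findall(r"\w+", content.lower())
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return frozenset(words + bigrams)


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class Cluster:
    def __init__(self, tokens):
        self.tokens = tokens  # token set of the first message in the group
        self.size = 1


def cluster_events(contents, threshold=SIMILARITY_THRESHOLD):
    """Greedily put each message into the first group it is similar enough to."""
    clusters = []
    for content in contents:
        tokens = tokenize(content)
        for cluster in clusters:
            if jaccard(tokens, cluster.tokens) >= threshold:
                cluster.size += 1
                break
        else:
            # No existing group matched, so this message starts a new one.
            clusters.append(Cluster(tokens))
    return clusters


def is_spam(content, clusters, threshold=SIMILARITY_THRESHOLD,
            min_size=SPAM_CLUSTER_SIZE):
    """True if the message resembles a group that has outgrown the size limit."""
    tokens = tokenize(content)
    return any(
        c.size > min_size and jaccard(tokens, c.tokens) >= threshold
        for c in clusters
    )
```

In this sketch you would call `cluster_events` on the contents of the last hour's events, then call `is_spam` on each incoming message. Comparing every message against every group is quadratic in the worst case, so a real deployment would likely need an index or hashing scheme on top.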

This isn't a perfect solution - it still lets some volume of spam through before it starts rejecting it, and it only works with highly repetitive content. But it's simple, and it's saving us a couple of gigabytes of database space daily.

If you're facing similar issues, you can try applying the same logic at your relay. Or you can use the APIs below to at least check how well this approach could work for you.

  1. List of stop-words. We block an event if it contains all strings from a single word cluster (case-insensitive):
  2. List of blocked pubkeys. You can block these, if string matching isn't an option, or you can drop them from your database if they already got in:
  3. List of blocked events. You could use it to retroactively drop events from your database, or for research purposes:

The data above represents the last hour of events, and is updated every several minutes.
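Applying the first two lists at a relay could look something like the sketch below. The shapes of the API responses are not specified here, so the data layout (a list of word clusters, each a list of strings, and a collection of pubkey strings) and the function names are assumptions. The `pubkey` and `content` fields are the standard Nostr event fields.

```python
def matches_stop_words(content, word_clusters):
    """True if the content contains every string of at least one word cluster.

    Matching is case-insensitive substring matching, as described in item 1.
    Empty clusters are ignored so they cannot match everything.
    """
    text = content.lower()
    for cluster in word_clusters:
        if cluster and all(s.lower() in text for s in cluster):
            return True
    return False


def should_drop(event, word_clusters, blocked_pubkeys):
    """Decide whether to reject an incoming event (a dict with Nostr event fields)."""
    if event.get("pubkey") in blocked_pubkeys:
        return True
    return matches_stop_words(event.get("content", ""), word_clusters)


# Hypothetical data in the assumed shape, for illustration only.
word_clusters = [["cheap", "followers"], ["airdrop", "claim", "wallet"]]
blocked_pubkeys = {"deadbeef"}

event = {"pubkey": "cafe", "content": "Get CHEAP Followers today"}
print(should_drop(event, word_clusters, blocked_pubkeys))
```

Since the lists only cover the last hour, a relay using them this way would need to re-fetch them regularly.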

The API is free, though we might start charging sats one day if it gets popular.

To discuss this, contact on Nostr.