feed: enable ignoring duplicate textnotes option #54

Open
offbyn wants to merge 2 commits from reduce-duplicates into master
offbyn commented 1 year ago
Owner

This option drops textNotes with exactly the same content after
it was already shown 5 times.
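
The gist of the check, sketched from the review snippet further down (the hunk there is truncated, so the final threshold comparison is an assumption based on the description above):

```js
// drop a text note once 5 notes with identical content were already shown;
// dropDuplicate is the settings toggle, textNoteList the in-memory note array
function isValidNote(evt) {
  if (dropDuplicate) {
    const similarEvents = textNoteList.filter(({content}) => content === evt.content);
    return similarEvents.length < 5;  // assumed cutoff, not visible in the truncated hunk
  }
  return true;
}
```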

offbyn added 1 commit 1 year ago
ci/woodpecker/push/woodpecker Pipeline was successful Details
ci/woodpecker/pr/woodpecker Pipeline was successful Details
0fecfc837c
feed: enable ignoring duplicate textnotes option
This option drops textNotes with exactly the same content after
it was already shown 5 times.
offbyn requested review from x1ddos 1 year ago
x1ddos reviewed 1 year ago
@@ -262,0 +269,4 @@
function isValidNote(evt) {
if (dropDuplicate) {
const similarEvents = textNoteList.filter(({content}) => content === evt.content);
x1ddos commented 1 year ago
Owner

all looks good except this. i see a couple of potential issues:

  1. the complexity is O(N). as the number of notes increases, so does the average run time of the loop, and it's run for each text note...; how about using an object or something, as in this pseudo-code: `localStorage.setItem(content_hash, content_hash_counter+1)` (rough sketch after this list). the "// could use indexDB" comment on `textNoteList = []` is a good hint :)

  2. there's no eviction of "old" items. suppose one says "gm" once a day. after the 5th day, their "gm" text note won't make it anymore, assuming the browser window stays open for 5 days. so, maybe have some kind of a queue where old items are expunged after some time?
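
A runnable take on that pseudo-code could look like the following; `hashContent` is a placeholder for whatever digest ends up being used (see the hashing discussion further down), and the key scheme is made up:

```js
// O(1) duplicate counting keyed by a content hash, instead of filtering textNoteList
function bumpDuplicateCount(content) {
  const key = `dup:${hashContent(content)}`;                // hypothetical key scheme
  const count = Number(localStorage.getItem(key) || 0) + 1;
  localStorage.setItem(key, String(count));
  return count;                                             // caller compares against the threshold
}
```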

x1ddos commented 1 year ago
Owner

> where old items are expunged after some time?

i actually meant, where a counter of duplicate notes resets or decreases
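
As a purely hypothetical sketch of that (nothing here is in the PR): keep a per-content counter and let it shrink as time passes, instead of hard-evicting entries.

```js
// decaying duplicate counter: one occurrence is "forgotten" per hour (arbitrary choice)
const counters = new Map();                       // contentHash -> {n, lastSeen}
const DECAY_MS = 60 * 60 * 1000;

function bumpWithDecay(contentHash) {
  const now = Date.now();
  const entry = counters.get(contentHash) || {n: 0, lastSeen: now};
  const forgotten = Math.floor((now - entry.lastSeen) / DECAY_MS);
  entry.n = Math.max(0, entry.n - forgotten) + 1; // decrease for elapsed hours, then count this one
  entry.lastSeen = now;
  counters.set(contentHash, entry);
  return entry.n;
}
```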

offbyn commented 1 year ago
Poster
Owner
  1. I guess it's time to try indexDB, should have done that earlier. I think it makes more sense to finally go with indexDB instead of the content hash workaround.

  2. yep, I've already seen dropped events with the content "hi" when loading only the last 24h. the check could take only events into account that are younger than 1h; this should work ok-ish for keeping the client open for days, I guess + with this change it wouldn't need to check all events but just the newest.

offbyn added 1 commit 1 year ago
ci/woodpecker/push/woodpecker Pipeline was successful Details
ci/woodpecker/pr/woodpecker Pipeline was successful Details
3b99dfac79
only take events within the last hour into account
offbyn commented 1 year ago
Poster
Owner

pushed a proposal in the 2nd commit; technically it's even worse now, but it takes only events within the last hour into account.

i'll take a look into indexdb in another pr.

x1ddos reviewed 1 year ago
@@ -262,0 +272,4 @@
function isValidNote(evt) {
if (dropDuplicate) {
const similarEvents = [
...textNoteList.filter(({content}) => content === evt.content),
x1ddos commented 1 year ago
Owner

maybe i'm missing something but looks like `textNoteList` contains all events received since the browser window was opened, which can easily be older than 1h because `textNoteList.push(evt)` happens in `handleTextNote` every time, and so this filter here:

textNoteList.filter(({content}) => content === evt.content)

loops over all text notes regardless of the timestamp; only afterwards is it filtered by `({created_at}) => ((created_at + 3600) > evt.created_at)`, but this last filter is run after looping over all `textNoteList` items.

or is it not how it works?
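
To illustrate the concern: a single pass that checks the cheap timestamp condition before the content comparison would at least skip old notes early, though it is still O(N) over `textNoteList` and doesn't fix the unbounded growth:

```js
// combined predicate: old notes are rejected before the content comparison runs
const similarRecent = textNoteList.filter(
  ({content, created_at}) => (created_at + 3600) > evt.created_at && content === evt.content,
);
```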

offbyn commented 1 year ago
Poster
Owner

`textNoteList.push(evt)` only happens conditionally: when the event has some event tags it goes to `handleReply` instead. at the time I thought it was better to have 2 different arrays, so they don't get too big and it is always clear whether it is a reply or not.

https://git.qcode.ch/nostr/nostrweb/src/commit/89c54ac08fb67cb9f0040812fd63a3a1917e79dc/src/main.js#L267

but now thinking about indexDB, I guess it should all just be in 1 table in indexDB, wdyt?

x1ddos commented 1 year ago
Owner

re: indexeddb. yeah, i guess 2 additional columns, some `in_reply_to` and `root`? then it sounds like it could work with a single table.

one other benefit of indexeddb imho is it works in web workers - related to push notifications in #25.

of course, indexeddb doesn't seem as easy as localstorage, judging by an example in https://github.com/mdn/dom-examples/blob/main/to-do-notifications/scripts/todo.js (from https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API)
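
For reference, a minimal single-table sketch under the assumptions above (database, store and index names are made up here, not taken from the codebase):

```js
// one `events` store keyed by the event id; in_reply_to and root become indexes
const req = indexedDB.open('nostrweb', 1);

req.onupgradeneeded = () => {
  const db = req.result;
  const store = db.createObjectStore('events', {keyPath: 'id'});
  store.createIndex('created_at', 'created_at');
  store.createIndex('in_reply_to', 'in_reply_to');
  store.createIndex('root', 'root');
};

req.onerror = () => console.error('indexeddb open failed', req.error);
```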

x1ddos commented 1 year ago
Owner

but even with indexeddb... suppose you store all events in an `events` table. then how would it work to count duplicate messages? i imagine something like this:

select count(*) from events where content = <some-content>

not sure if it'll actually help because:

  • content can be quite large
  • you wouldn't want to index on the content column, which means the sql query above will be slow, maybe even slower than looping over the `textNoteList` array

wouldn't it make more sense to compare some hash(content)?
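
Assuming the `events` store from the sketch above also declares a `contentHash` index (a made-up field, following the hash idea), the count can run against the short hash instead of the full content:

```js
// in onupgradeneeded: store.createIndex('contentHash', 'contentHash');
// db is an already opened IDBDatabase
function countByContentHash(db, contentHash) {
  return new Promise((resolve, reject) => {
    const req = db.transaction('events', 'readonly')
      .objectStore('events')
      .index('contentHash')
      .count(contentHash);
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}
```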

offbyn commented 1 year ago
Poster
Owner

makes sense, since nostr-tools is already using noble hashes anyway

is sha256 overkill?

import { sha256 } from '@noble/hashes/sha256';
console.log(sha256('abc'));
├─┬ nostr-tools@0.24.1
│ ├── @noble/hashes@0.5.

https://www.npmjs.com/package/@noble/hashes

I'll also look into cleaning up whitespace before hashing.
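
A possible shape for that, assuming whitespace is simply collapsed before hashing (the exact normalization is still open):

```js
import {sha256} from '@noble/hashes/sha256';
import {bytesToHex} from '@noble/hashes/utils';

// collapse whitespace so "gm" and " gm  " count as the same content,
// then hash to a compact hex string usable as a key or index value
const contentHash = (content) =>
  bytesToHex(sha256(content.trim().replace(/\s+/g, ' ')));
```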

x1ddos commented 1 year ago
Owner

here's an alternative. haven't measured but i think it would be the fastest:

const dups = {};

// hash() is a placeholder here, e.g. the sha256-based content hash discussed above
function countEventDup(content) {
  const key = hash(content);
  let elem = dups[key];
  if (!elem) {
    // lastseen is recorded so stale counters could later be expired or decayed
    dups[key] = {n: 1, lastseen: Date.now()};
    return 1;
  }
  elem.n += 1;
  return elem.n;
}

function isValidNote(evt) {
  return countEventDup(evt.content) < 5;
}

however, even that doesn't work. i mean, it'll produce too many false negatives. imagine 10 people replying "hi" within a minute. first 5 pass, the rest won't be shown even though they should. i still think #51 is a better way.

x1ddos commented 1 year ago
Owner

> imagine 10 people replying "hi" within a minute. first 5 pass, the rest won't be shown even though they should

actually, they don't have to be replies. just random people saying "hi" or "gm", on different relays. why should they suffer?

offbyn commented 1 year ago
Poster
Owner

> 10 people replying "hi" within a minute. first 5 pass, the rest won't be shown even though they should

true, then it will drop the new ones; 5 is arbitrary of course.

also, after reloading, the newest 5 will render and not show the first 5.

> i still think #51 is a better way.

they are very different, it's not mutually exclusive. or do you think conceptually it's better to have no options in the client itself?

x1ddos commented 1 year ago
Owner

> also, after reloading, the newest 5 will render and not show the first 5.

i guess i'm hoping one day i won't have to reload the page :)

> they are very different, it's not mutually exclusive. or do you think conceptually it's better to have no options in the client itself?

but the same motivation, right?

i'm just thinking. imagine i'm connected to a relay and follow alice. waiting for her to say "hi". but there are 100 other people saying "hi" all the time. so, i'll never see the "hi" from alice.

with that scenario, i think there should be more in the algorithm. for example, count duplicates only from pubkeys not in the contacts list, so only the public feed (rough sketch below). otherwise, the risk is high that "important" messages are swallowed.
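
A rough sketch of that exemption, purely illustrative; `contacts` stands for whatever the client keeps as the follow list, and `countEventDup` is the counter sketched earlier:

```js
// never drop notes from pubkeys in the contact list; only the public feed
// is subject to the duplicate counter
function isValidNote(evt) {
  if (contacts.includes(evt.pubkey)) {
    return true;
  }
  return countEventDup(evt.content) < 5;
}
```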

either way, let's add this. even temporary, maybe it'll help a bit.

offbyn commented 1 year ago
Poster
Owner

> i guess i'm hoping one day i won't have to reload the page :)

indeed I should keep that more in mind; it shouldn't just explode after some days, currently memory probably just grows forever.

indexdb could drop old events maybe 🤔 (`array.shift()`).
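
For the in-memory side, pruning could be as simple as the following (the cutoff is arbitrary):

```js
// drop events older than some cutoff so the in-memory arrays don't grow forever;
// nostr created_at is a unix timestamp in seconds
function pruneOld(list, maxAgeSeconds = 24 * 3600) {
  const cutoff = Math.floor(Date.now() / 1000) - maxAgeSeconds;
  return list.filter(({created_at}) => created_at > cutoff);
}
```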

x1ddos commented 1 year ago
Owner

> drop old events

definitely. well, from the page memory. then i'd imagine in the future, i can scroll back and forth and the client requests older/newer events and re-render.
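
Whenever that lands, paging maps naturally onto NIP-01 filters; a sketch of asking for the next page of older notes (how the filter is actually sent depends on the existing relay code):

```js
// build a filter for the next page of older text notes
function olderNotesFilter(oldestShownCreatedAt) {
  return {
    kinds: [1],                        // text notes
    until: oldestShownCreatedAt - 1,   // strictly older than what is rendered
    limit: 50,                         // page size, arbitrary
  };
}
```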

offbyn commented 1 year ago
Poster
Owner

> in the future, i can scroll back and forth and the client requests older/newer events and re-render.

this is the way 👍


All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
ci/woodpecker/pr/woodpecker Pipeline was successful
This pull request has changes conflicting with the target branch.
  • src/cards.css
  • src/form.css
  • src/index.html
  • src/main.css
  • src/main.js
  • src/tabs.css