Galaxy Database Correctness

athulin · Oct 21, 2024

I've posted a more detailed question in the Spansh thread concerning the Spansh database, but I suspect it may be a general thing. I see much the same issue with the Canonn Signals web site -- but I can't decide if it is a separate issue or not: it might rely on Spansh data.

Here's an observation and a question.

I visit Pueliae MJ-H d10-1, and from direct observation (FSS) I learn that AB 1a has 1 bio and 3 geo signals, and that there are no other bio signals in the system.

When I check that with some databases (see above), I see that the data updates almost immediately.

But I also find that the AB 1b entry says '5 bio signals', which is not matched by current game information. According to the Spansh entry this data also appears to be much older.

I've seen this before, but I've never been entirely certain of the details that involve my own actions. But it seems like old information that has been changed (i.e. game info is no longer the same as the data stored in the database) doesn't get updated. Some seems to change: the AB 1a entry is said to have been updated (Spansh) shortly after my FSS, but the AB 1b does not appear to have changed.
Something like this -- possibly the same thing -- has been noted for Bark Mounds, which have seen some extensive changes in the past.

Question: If this is already known, how much data is affected? I suppose I'm asking if there are any known error rates. That is, if I tried to count the number of known Electricae Radialem in some part of the galaxy, any result would need to be stated as 'X bodies ± Y bodies', where Y is the estimated number of changes that didn't lead to a database modification (or something along those lines).

Factabulous · Oct 21, 2024

Things move around and appear / disappear when there is a major change to the game. The out-of-game records won't change until someone updates them.

Ian Doncaster · Oct 21, 2024

In this case, as well, it's possible that some clients send information about bio+geo signals, and some don't.

So if one client sends "5 bio signals" and another client sends "nothing", that might not mean that the planet no longer has 5 bio signals, and it's safer to keep the old data rather than flip back and forth depending on who scanned it last.
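The "keep the old data" policy could be sketched roughly as below. This is only an illustration, not how Spansh or any other receiver actually works: `merge_bio_count` is a hypothetical helper, and it assumes a report with no signal information can be told apart (as `None`) from an explicit count, which may not hold for every client.

```python
from typing import Optional


def merge_bio_count(stored: Optional[int], reported: Optional[int]) -> Optional[int]:
    """Return the bio-signal count to keep after a new report arrives.

    `reported` is None when the client sent no signal information at all.
    That is treated as "unknown", not as "zero signals" (an assumption).
    """
    if reported is None:
        # Absence of data is not evidence of absence: keep what we had.
        return stored
    # An explicit count (including 0) replaces the stored value.
    return reported
```

With this policy, `merge_bio_count(5, None)` keeps 5, which is exactly why a stale "5 bio signals" entry can survive later scans by clients that send nothing about signals.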

athulin · Oct 21, 2024

Factabulous said:
Things move around and appear / disappear when there is a major change to the game. The out-of-game records won't change until someone updates them.

So ... is there a log of those somewhere? That at least would make it possible (sort of) to say that there may have been at least N changes since this record was entered, and assume that N changes implies this or that probability of incorrect data.

athulin · Oct 21, 2024

Ian Doncaster said:
In this case, as well, it's possible that some clients send information about bio+geo signals, and some don't.

So if one client sends "5 bio signals" and another client sends "nothing", that might not mean that the planet no longer has 5 bio signals, and it's safer to keep the old data rather than flip back and forth depending on who scanned it last.

OK, I probably need to read up on the EDDN protocol. In log entries, it seems that a Detailed scan record that does not have a corresponding SAASignalsFound would imply that DSS didn't identify any signals. But if a client sends DSS records but doesn't send SAASignalsFound even when it is present ... that would be a problem. The only workaround would be a client-side analyzer that eats log files, checks the corresponding database entries, and perhaps screams when it finds a discrepancy. OK, that's more complex.

Thanks for that note. I'll need to do some code diving as well.
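A very rough sketch of such a client-side analyzer is below. The event and field names (`Scan`, `FSSBodySignals`, `SAASignalsFound`, `SystemAddress`, `BodyID`, `Signals`, `Type_Localised`, `Count`) are how I understand the journal format and should be checked against the journal documentation. The `database` argument is a hypothetical dict standing in for whatever lookup a real tool would do, and treating "scanned but no signal event" as "no signals" is an assumption, not a confirmed rule.

```python
import json

SIGNAL_EVENTS = ("FSSBodySignals", "SAASignalsFound")


def observe(journal_lines):
    """Map (SystemAddress, BodyID) -> {signal type: count} for every body seen.

    Bodies that have a Scan event but no signal event get an empty dict,
    i.e. they are assumed to have no signals.
    """
    scanned = set()
    signals = {}
    for line in journal_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or partially written lines
        key = (entry.get("SystemAddress"), entry.get("BodyID"))
        event = entry.get("event")
        if event == "Scan":
            scanned.add(key)
        elif event in SIGNAL_EVENTS:
            counts = signals.setdefault(key, {})
            for sig in entry.get("Signals", []):
                name = sig.get("Type_Localised", sig.get("Type"))
                counts[name] = sig.get("Count", 0)
    return {key: signals.get(key, {}) for key in scanned | set(signals)}


def discrepancies(observed, database):
    """Yield (key, observed counts, stored counts) wherever the two differ."""
    for key, counts in observed.items():
        stored = database.get(key)
        if stored is not None and stored != counts:
            yield key, counts, stored
```

Fed a journal where one body has a signal event and a sibling body only has a Scan, plus a database that still lists 5 bio signals for the sibling, `discrepancies` would flag the sibling and nothing else.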

Ian Doncaster · Oct 21, 2024

Oh, a further complexity: there's no need to upload journals "live". While most EDDN receivers discard stale data - if someone uploads market data from three weeks ago, most apps won't care - that's less normal for body scan data where it (probably!) won't have changed much, and explorers getting into third party tools later and wanting to upload their old records is more common. So the order events are passed through EDDN won't necessarily be anything like the order they happened in, and different receiving apps might have different strategies for dealing with that.
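One possible strategy for the ordering problem, as a minimal sketch: compare the observation timestamps carried in the events rather than the order they arrive in. `apply_if_newer` and the record layout are hypothetical, and the timestamp format is assumed to be the journal's `YYYY-MM-DDTHH:MM:SSZ` form. Individual receiving apps may well do something different.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    # Assumed journal timestamp format, e.g. "2024-10-21T10:00:00Z".
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def apply_if_newer(record: dict, event: dict) -> dict:
    """Return whichever of `record` and `event` was observed most recently.

    `record` is the currently stored entry (empty dict if none);
    `event` is the newly arrived one, which may describe an older observation.
    """
    if record and parse_ts(event["timestamp"]) <= parse_ts(record["timestamp"]):
        return record  # late upload of an old journal: keep the stored entry
    return event
```

This only helps if every stored record keeps the time of observation, not just the time of upload.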

marx · Oct 21, 2024

athulin said:
Question: If this is already known, how much data is affected? I suppose I'm asking if there are any known error rates. That is, if I tried to count the number of known Electricae Radialem in some part of the galaxy, any result would need to be stated as 'X bodies ± Y bodies', where Y is the estimated number of changes that didn't lead to a database modification (or something along those lines).

It depends on what your data source is - in general, I'd recommend EDAstro, because it's the most comprehensive, and automatically checks for a number of errors - and also what species exactly you are looking at. As it has already been mentioned, some legacy Horizons stuff like Bark Mounds have seen changes in Odyssey (for example, many BMs disappeared because in the new spawning system they got a new requirement for volcanism), others haven't. As far as I know, none of the Odyssey (thin atmo) bios have had any changes where they disappeared from bodies they were previously at.
Then there are of course faulty uploads from clients sometimes, and so on.

The "Pueliae MJ-H d10-1 AB 1 b" entry having 5 bio signals uploaded sometime sounds like it might have had five bio POIs to drop down at in Horizons, back when the game only generated those. Thankfully, that's gone now. Since it's a non-atmo planet, at that location it couldn't have five bio signals today.

If you want to look at observational error, then you could count the N1 bodies where your target(s) are known to be, and also count the N2 bodies which should be suitable candidates but the presence of bios there hasn't been confirmed. There you have the two ends then. Of course, the difficulty lies in determining the candidates, since with what's available from the journals, we can't predict with (near) certainty exactly what bios a body is going to have, just what it could have - but sometimes, they might not have all of them. And of course, if you are looking inside specific galactic regions, there's also some uncertainty at the borders, and so on.

Ian Doncaster · Oct 21, 2024

marx said:
As far as I know, none of the Odyssey (thin atmo) bios have had any changes where they disappeared from bodies they were previously at.

One very minor exception I can think of - at least some planets started out in Odyssey with multiple bio types, some of which would give a Coloured Snake error if you tried to sample them. Those then got removed in a very early hotfix patch, but there's presumably still a few planets in the data which were mapped in late May 2021 and never since which might have a higher-than-observed count now.

marx · Oct 21, 2024

Hm, I don't remember that one, but if it was due to multiples when there should have been only one, that should be easy enough to check for in the data.

Ian Doncaster · Oct 21, 2024

marx said:
Hm, I don't remember that one, but if it was due to multiples when there should have been only one, that should be easy enough to check for in the data.

No, different types. The only one I remember seeing personally was at Eol Prou RS-T d3-686 B6a, which started out with Bacterium (scannable) and something else (not another Bacterium, definitely) which crashed the game ... and then it got patched and now it just has the Bacterium.

athulin · Oct 22, 2024

marx said:
The "Pueliae MJ-H d10-1 AB 1 b" entry having 5 bio signals uploaded sometime sounds like it might have had five bio POIs to drop down at in Horizons, back when the game only generated those. Thankfully, that's gone now. Since it's a non-atmo planet, at that location it couldn't have five bio signals today.

And so a database that does report 5 bio signals may be correct as of the time of the original observation, but it is no longer correct as of the date of any report produced (part of the reason may very well be differences in ED release between the original and current observations). The reason for any errors may be interesting for an error reduction program, but any results produced from that data should preferably come with an error estimate.

marx said:
If you want to look at observational error, then you could count the N1 bodies where your target(s) are known to be, and also count the N2 bodies which should be suitable candidates but the presence of bios there hasn't been confirmed. There you have the two ends then.

That seems to be something else. It may be useful in some context, but it doesn't reflect the error measure I'm looking for, which focuses on a database, or a subset of a database, and incorporates all the sources of error in the chain that you and others have mentioned, and probably more.

It seems one way to go is to compare current logs (which may themselves contain errors) with database content, ensuring some reasonable minimum number of points of comparison. That probably means sampling some area (galaxy, region, sector, whatever), and probably weighting the sample by the actual number of observations (the more observations, the greater the probability of error; unobserved systems do not contribute to this error).
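To put a number on that comparison, one option (a sketch, not an established method for this data) is to treat each compared body as a match or a mismatch and compute a confidence interval for the mismatch rate. The Wilson score interval is used here because it behaves sensibly for small samples and rates near zero. `wilson_interval` is a hypothetical helper name, and the result is only meaningful if the sampled bodies are representative of the database subset in question.

```python
import math


def wilson_interval(mismatches: int, sample_size: int, z: float = 1.96):
    """Wilson score interval for the true mismatch rate.

    z = 1.96 corresponds to roughly 95% confidence.
    Returns (lower, upper) as fractions between 0 and 1.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = mismatches / sample_size
    denom = 1 + z * z / sample_size
    centre = (p + z * z / (2 * sample_size)) / denom
    half = z * math.sqrt(
        p * (1 - p) / sample_size + z * z / (4 * sample_size ** 2)
    ) / denom
    return max(0.0, centre - half), min(1.0, centre + half)
```

If 5 of 100 re-checked bodies disagree with the database, the interval comes out at very roughly 2% to 11%. Scaling a database count of X bodies by those two rates gives a crude range for Y. Note that this only estimates stored entries that are no longer true, not bios that are missing from the database.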