Release EDDN - Elite Dangerous Data Network. Trading tools sharing info in an unified way.

jamesremuscat · Dec 19, 2014

An excellent post with a lot of good points; please bear with me as I address them one by one

kfsone said:
One of the problems I experienced with EMDN was the commodity-per-message. It's just not a good pattern. Among the problems is the simple matter of disambiguating a gap in data from loss vs omission vs absence.

Code:

+ Chemicals
Explosives 177 190 ? 21561H 2014-12-17 13:16:47
Mineral Oil 247 0 ? - 2014-12-17 13:16:47

Was Hydrogen:
- not updated?
- unavailable?
- lost data?

There are perf considerations behind having messages as update batches rather than single commodities, too.

Good points, though others have expressed a preference for the single-commodity approach. I'm definitely erring towards just allowing an array of commodities in a single message.

kfsone said:
But the really big problem that EDDN is going to experience is that there is no central authority for Systems, Stations, etc, so the network will rapidly fill up with data from overlapping but not equal source sets. Especially with divergent tools involved, you'll be getting "THINGY PORT", "Thingy Port" and "thingyport".

Nail head, hammer. That's a problem for every single third-party tool for Elite: Dangerous right now. It's also a problem that I'm loath to try and solve in EDDN - however horrible using strings as keys might be, frankly the thought of creating and maintaining a shared mapping from numeric ID to commodity type is terrifying! If FD released a static data dump that contained such a mapping, I would jump at the chance to use it. For now, I don't know that there's much any of us can practically do; CMDRKNac suggests some validation against "known good" lists from TCG, but that of course has the risk of being incomplete or inaccurate...
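Short of a shared ID mapping, a consumer can at least collapse the "THINGY PORT" / "Thingy Port" / "thingyport" variants by comparing a normalised key instead of the raw string. A minimal sketch in Python; name_key(), canonicalise() and the KNOWN_STATIONS list are hypothetical names for illustration, and the key deliberately ignores anything that is not an ASCII letter or digit, which may be too aggressive for some names.

```python
import re


def name_key(name: str) -> str:
    """Collapse case, spacing and punctuation so spelling variants compare equal."""
    return re.sub(r"[^0-9a-z]", "", name.casefold())


def canonicalise(name, known_good):
    """Return the known-good spelling for a name, or None if nothing matches."""
    lookup = {name_key(good): good for good in known_good}
    return lookup.get(name_key(name))


# Hypothetical reference list; a real one would come from a "known good" dump.
KNOWN_STATIONS = ["Thingy Port", "Abraham Lincoln"]

for raw in ("THINGY PORT", "Thingy Port", "thingyport", "Thingy Prot"):
    print(raw, "->", canonicalise(raw, KNOWN_STATIONS))
```

This only groups variants of spacing and case; genuine misspellings such as the last entry still fall through and need a fuzzy match or manual moderation.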

kfsone said:
You can also anticipate getting really awful time data. You'll be receiving random timezones, random accuracy (expect future timestamps and ancient timestamps on a regular basis), and the occasional human error in the timestamp field.

Oh, yes. Already people are not specifying timezones without normalising to UTC, so we're seeing a wide range of interpretations of what "now" is... That's part of the reason the gateway adds a gatewayTimestamp - so there's at least one reliable timestamp within the message.
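On the consumer side, the gatewayTimestamp can serve as the reference clock for sanity-checking the uploader's own timestamp. A minimal sketch, assuming timestamps arrive as ISO 8601 strings and that a timestamp without a timezone is meant as UTC (both assumptions); normalise_timestamp() and its max_age / max_skew limits are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def normalise_timestamp(raw, gateway_time,
                        max_age=timedelta(days=1),
                        max_skew=timedelta(minutes=5)):
    """Return the timestamp converted to UTC, or None if it is unusable.

    gateway_time is a timezone-aware datetime taken from the gateway.
    """
    try:
        parsed = datetime.fromisoformat(raw)
    except (TypeError, ValueError):
        return None  # not a parseable timestamp at all
    if parsed.tzinfo is None:
        # Assumption: a naive timestamp was meant as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    parsed = parsed.astimezone(timezone.utc)
    if parsed > gateway_time + max_skew:
        return None  # claims to come from the future
    if parsed < gateway_time - max_age:
        return None  # too old to be useful
    return parsed


gateway = datetime(2014, 12, 19, 16, 0, 0, tzinfo=timezone.utc)
print(normalise_timestamp("2014-12-19T16:47:00+01:00", gateway))
print(normalise_timestamp("2015-06-01T00:00:00+00:00", gateway))
```

Rejected timestamps could also be replaced by the gateway time rather than dropped; which is better depends on how much the consumer trusts the rest of the message.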

kfsone said:
You've also chosen fairly wordy json. I did some preliminary work creating a shareable json format for TD (http://kfs.org/td/source -> misc/prices-json-expr.py), which has a dictionary of the item table first (so I can use a locally-scoped set of IDs for items that the recipient can easily translate once, saving a ton of text processing work). But I trimmed hundreds of kb of a relatively small prices file by using terse names for the fields and by judicious use of arrays.

It's not just bandwidth I'm concerned about there; as the system grows and more users begin submitting prices, the cost of processing all those json fields goes up too. It's easy to think of this as a distributed system, but it's not distributed processing - each endpoint is doing a complete workload.

The current format was designed to be largely compatible with the short-lived predecessor of EDDN, to try and minimise the amount of work existing client implementations would need to do to get started. I'm totally not averse to working towards a v2 of the format, along with schemas for other messages as the demand arises. I'm not too worried about bandwidth, since everything from the gateway onwards is compressed.

kfsone said:
With regards to author IDs ... You could just provide a token mechanism; user goes to the website, provides their email address, you send that email address a token, store email + token to a database, and when a submission is sent with that token, you put the database ID (i.e. neither of the other two fields) into the outgoing zmq messages.

Managing such a database isn't something I want to do, plus that would make any future replication/distribution far more hassle than it needs to be. The only use-case I'm envisaging here is for clients that want to impose selective filtering on the messages they receive (e.g. to filter out any from a source they perceive as untrustworthy) - though I'm not sure that in practice anyone would actually bother. It sounds like an application-specific uploaderID plus an IP-based hash would cover CMDRKNac's case of a central server posting messages from multiple users, as well as the case of individual uploaders.
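To make the uploaderID-plus-IP-hash idea concrete, here is a minimal sketch of how a gateway might derive an opaque source token without keeping a user database. The function name, the salt and the 16-character truncation are all hypothetical choices, not something EDDN is known to do.

```python
import hashlib


def source_fingerprint(uploader_id: str, ip_address: str, salt: bytes) -> str:
    """Derive a stable, opaque token from the uploader ID and the sender's IP.

    The salt (hypothetical) keeps consumers from brute-forcing the IP back out.
    """
    material = salt + ip_address.encode("utf-8") + b"\x00" + uploader_id.encode("utf-8")
    return hashlib.sha256(material).hexdigest()[:16]


SALT = b"example-gateway-secret"  # hypothetical; would be private to the gateway

# A central server posting for several users: same IP, different uploader IDs.
print(source_fingerprint("alice", "192.0.2.10", SALT))
print(source_fingerprint("bob", "192.0.2.10", SALT))
```

Consumers can then filter on the token alone; two users behind one server stay distinguishable, and the raw IP never leaves the gateway.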

kfsone said:
Have you considered using ZeroMQ to receive the messages?

I hadn't. Is that something that would be useful?

CMDRKNac · Dec 19, 2014

jamesremuscat said:
CMDRKNac suggests some validation against "known good" lists from TCG, but that of course has the risk of being incomplete or inaccurate...

While we have found some inaccuracies, these have been due to mismatches between the different dumps and are minor problems. The current system list is a direct dump from FD, and they have said they won't provide more (for now at least). Manually entered systems pass a strict verification process, so they are well verified too.

The systems in TGC are all the systems that 'have an economy', so largely all relevant systems are already in there. At least on the 'systems' front this is more or less covered (do we have a better option right now anyway?). Now, on station data... that's a different problem.

But anyway, this problem is not specific to EDDN; it affects every application and source, because they all rely on manual user input (even OCR scans are still manual to a good degree). The only way to solve it is through administration and moderation. The question then is who carries the weight and burden of validating data. As I understand it (correct me if I'm wrong), EDDN is mainly supposed to be a data relay. Some validation is already included in the form of schemas, and some more can be done (checking against local dictionaries before relaying data, etc.), but in the end it is going to be the consumer of the data who decides what to do with it (including further exhaustive validation against its own sources).

With the current data sources we have to be realistic about what we can achieve regarding the consistency and reliability of the information sent.

CMDRKNac · Dec 19, 2014

Ok, after some work and adding a couple of things, my website is completely hooked up to EDDN and compatible with the latest version of EliteOCR (includes system files), meaning that it updates automatically from data uploaded to the network (as long as it passes some validation) and posts valid data (I added the TZ info to the timestamp, although it was already UTC). I just observed that some users of EliteOCRReader are terrible at adding proper system names; ofc that data will be useless and ignored. Next up IMO should be adding validation to itemName field and only accept properly written commodities.

If you are curious about what people have been uploading lately, you can check here: http://www.elitedangerouscentral.com/systems and search by date (as long as it passes the bare minimum validation, that is, complying with the schema and matching existing 'systemNames').

jamesremuscat · Dec 20, 2014

CMDRKNac said:
Next up IMO should be adding validation to itemName field and only accept properly written commodities.

Is there a canonical list of all commodities (preferably with categories) somewhere?

CMDRKNac · Dec 20, 2014

Maybe this works:

https://bitbucket.org/kfsone/traded...47f8492bb1580bd250f1b/data/Item.csv?at=master

kfsone · Dec 20, 2014

jamesremuscat said:
Nail head, hammer.

Hugely relieved to hear you say that

jamesremuscat said:
That's a problem for every single third-party tool for Elite: Dangerous right now. It's also a problem that I'm loath to try and solve in EDDN - however horrible using strings as keys might be, frankly the thought of creating and maintaining a shared mapping from numeric ID to commodity type is terrifying! If FD released a static data dump that contained such a mapping, I would jump at the chance to use it. For now, I don't know that there's much any of us can practically do; CMDRKNac suggests some validation against "known good" lists from TCG, but that of course has the risk of being incomplete or inaccurate...

I'd gotten half-way to implementing a system that produced a base64-encoding of star positions to solve the problem at least as far as stars went, along with a best-match algorithm to match stations, but then in the very next patch they introduced two stars with the same x,y,z, and while we've only discovered two so far, I'm fairly sure there are more.

jamesremuscat said:
Oh, yes. Already people are not specifying timezones without normalising to UTC, so we're seeing a wide range of interpretations of what "now" is... That's part of the reason the gateway adds a gatewayTimestamp - so there's at least one reliable timestamp within the message.

PC clocks are just generally awful; companies like Amazon and Google have their own atomic clocks for good reasons; and those are industrial grade systems. The average home PC can keep time about as well as Putin can run an economy.

jamesremuscat said:
The current format was designed to be largely compatible with the short-lived predecessor of EDDN, to try and minimise the amount of work existing client implementations would need to do to get started. I'm totally not averse to working towards a v2 of the format

I'd leveled the same complaint at them

jamesremuscat said:
along with schemas for other messages as the demand arises. I'm not too worried about bandwidth, since everything from the gateway onwards is compressed.

But you might become worried about the CPU to achieve the compression as time goes on, and, as I say, it's going to affect end-users when the flow picks up.

jamesremuscat said:
Managing such a database isn't something I want to do,

Hehe, BTDT

jamesremuscat said:
I hadn't. Is that something that would be useful?

It might streamline building both sides of a service for some people since they won't have to write one half as an HTTP service and one as a ZMQ service.

-Oliver

- - - - - Additional Content Posted / Auto Merge - - - - -

CMDRKNac said:
Maybe this works:

https://bitbucket.org/kfsone/traded...47f8492bb1580bd250f1b/data/Item.csv?at=master

I don't provide any guarantees about the order or spelling and I don't have fixed unique IDs, which is going to be an ongoing problem.

CMDRKNac · Dec 20, 2014

I've a question: is there a more elegant way than using (zmq.RCVTIMEO, 62000) to close the socket after X time? Due to host limitations I have to refresh the script periodically through a cron job.

I'm using it like this:

Code:

import json
import zlib

import zmq

context = zmq.Context()
subscriber = context.socket(zmq.SUB)
subscriber.setsockopt(zmq.SUBSCRIBE, b"")
subscriber.setsockopt(zmq.RCVTIMEO, 62000)
subscriber.connect("tcp://eddn-relay.elite-markets.net:9500")

while True:
    live_market_json = zlib.decompress(subscriber.recv())
    live_market_data = json.loads(live_market_json)

...

After 62 seconds it will close the connection but will raise an exception. It works in practice because when that happens another script instance has taken over and is listening, but it is not very elegant. For the SUB type socket I haven't found another way to do it after reading a bit through the zeromq docs and searching on Google.

p.s: Yes I can use try/except/else (and that's what I'm doing) but the question remains.
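One way to avoid the exception is to track an overall deadline and wait with poll(), which returns 0 on timeout instead of raising. Note that zmq.RCVTIMEO bounds each individual recv() call, so it only fires after 62 seconds without any message, not after 62 seconds of run time. A minimal sketch assuming pyzmq; decode(), listen() and the handle callback are illustrative names, not part of EDDN or pyzmq.

```python
import json
import time
import zlib


def decode(raw: bytes) -> dict:
    """Inflate and parse one message as received from the relay."""
    return json.loads(zlib.decompress(raw))


def listen(endpoint: str, run_seconds: float, handle) -> None:
    """Receive messages for at most run_seconds, then return normally."""
    import zmq  # pyzmq; imported here so decode() can be used without it

    context = zmq.Context()
    subscriber = context.socket(zmq.SUB)
    subscriber.setsockopt(zmq.SUBSCRIBE, b"")
    subscriber.connect(endpoint)
    deadline = time.monotonic() + run_seconds
    try:
        while True:
            remaining_ms = int((deadline - time.monotonic()) * 1000)
            if remaining_ms <= 0:
                break  # time is up: leave the loop without any exception
            # poll() waits up to remaining_ms and returns 0 if nothing arrived
            if subscriber.poll(remaining_ms):
                handle(decode(subscriber.recv()))
    finally:
        subscriber.close(linger=0)
        context.term()


# Example call (needs network access to the relay):
# listen("tcp://eddn-relay.elite-markets.net:9500", 62, print)
```

The cron job can then start a fresh instance every minute, and each instance ends by falling out of the loop rather than by catching a timeout error.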

Andargor · Dec 20, 2014

If anyone wants to play with a node.js listener for EDDN, here's the minimal example:

Code:

var zmq = require('zmq');
var zlib = require('zlib');

var zmqsock = zmq.socket('sub');
zmqsock.subscribe("");
zmqsock.on('message', function(message) {
	zlib.unzip(message, function(err, buffer) {
		if (!err) {
			//Do something with the message
			console.log(buffer.toString());
		} else {
			console.error("ERROR: " + err);
		}
	}); 
});

zmqsock.connect('tcp://eddn-gateway.elite-markets.net:9500');

kfsone · Dec 20, 2014

CMDRKNac said:
I've a question: is there a more elegant way than using (zmq.RCVTIMEO, 62000) to close the socket after X time? Due to host limitations I have to refresh the script periodically through a cron job.

I'm using it like this:

Code:

subscriber.setsockopt(zmq.SUBSCRIBE, b"")
subscriber.setsockopt(zmq.RCVTIMEO, 62000)
subscriber.connect("tcp://eddn-relay.elite-markets.net:9500")

while True:
    live_market_json = zlib.decompress(subscriber.recv())
    live_market_data = json.loads(live_market_json)

...

After 62 seconds it will close the connection but will raise an exception. It works in practice because when that happens another script instance has taken over and is listening, but it is not very elegant. For the SUB type socket I haven't found another way to do it after reading a bit through the zeromq docs and searching on Google.

p.s: Yes I can use try/except/else (and that's what I'm doing) but the question remains.

ZMQ supports multiple connections per "socket", so you add a local listen port that your client pings on startup to let previous listeners know it's time to quit:

Code:

subscriber.setsockopt(zmq.SUBSCRIBE, b"")
subscriber.setsockopt(zmq.RCVTIMEO, 62000)
subscriber.connect("tcp://eddn-relay.elite-markets.net:9500")
# Using 127.1.0.1 rather than 127.0.0.1 helps avoid conflicts with stuff using localhost.
subscriber.connect("tcp://127.1.0.1:19500")

while True:
    msg = subscriber.recv()
    if msg == b"DISCONNECT":
        print("Received DISCONNECT")
        break
    # Process msg itself; calling recv() again here would skip a message.
    live_market_json = zlib.decompress(msg)
    live_market_data = json.loads(live_market_json)

You can then use this to send yourself a note to disconnect when your new client launches.

Code:

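import time
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.PUB)
sock.bind("tcp://127.1.0.1:19500")
# A PUB socket drops anything sent before a subscriber has joined, so give
# the old listener a moment to reconnect first (0.5 s is a guess).
time.sleep(0.5)
sock.send(b"DISCONNECT")
# Give the message time to go out before the script exits.
time.sleep(0.1)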

kfsone · Dec 21, 2014

For a whole-station update, I'm considering the following for TD:

Code:

{
  'cmdr': <commander name> or 'unknown',
  'src': <name of the tool that generated the json>,
  'sys': <system description>,
  'stn': <station description>,
  'm': <last modified timestamp>,
  'items': <item list>
}

system description :-
{
  'name': <name of system>,
  'pos': [ x, y, z ],
},

stn description :-
{
  'name': <name of station>,
  'ls': <ls from nav/star>,
}

item list :-
{
  <item name>: <item data>
}

item data :-
{
   optional: 'm': <modified timestamp if not lastModified>,
   optional: 'b': <price or [ price, units, level ]>,
   optional: 's': <price or [ price, units, level ]>
}

An item entry can provide its own timestamp ('m'); otherwise the update's "m" (modified) field is the default timestamp for item updates.

Item entries can also have 'b' (BUY) and 's' (SELL) values, which are either an integer (price) or an array of [ price, demand units, demand level ].

TD can keep track of supply. It uses '-1' to indicate "don't know" and 0 to indicate "unavailable". For level, 1 = Low, 2 = Med, 3 = High, as per EMDN.

I also make the assumption that if level is 0, so are units and price. So, valid values are:

/0,0,0 | \d+,-1,-1 | \d+,0,1 | \d+,\d+,[123] / where \d+ evaluates to a non-zero integer.

The third pattern there - \d+,0,1, is for items that have a BUY price but "0L" as the demand.
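The four accepted patterns can be checked directly on the numbers rather than with a regex. A minimal sketch of that rule set as described above; is_valid_triple() is a hypothetical helper name, not part of TD.

```python
def is_valid_triple(price: int, units: int, level: int) -> bool:
    """Check a [price, units, level] triple against the four accepted patterns."""
    if (price, units, level) == (0, 0, 0):
        return True  # 0,0,0: item unavailable
    if price <= 0:
        return False  # every other pattern needs a non-zero price
    if (units, level) == (-1, -1):
        return True  # price known, units and level unknown
    if (units, level) == (0, 1):
        return True  # priced, but shown as "0L"
    return units > 0 and level in (1, 2, 3)


for triple in ([0, 0, 0], [123, -1, -1], [1332, 0, 1], [130, 1102, 2], [130, 0, 2]):
    print(triple, is_valid_triple(*triple))
```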

Code:

"Algae": { 'b': 123, 's': [ 130, 1102, 2 ] }

Station is buying algae for 123 credits, demand/level unknown; selling for 130 cr and has 1102 units at MED. Timestamp is the default.

Code:

"Fish": { 's': [ 300, 1, 1 ] },
"Coffee": { 'b': [ 1332, 0, 1 ], 's': [ 1360, 10, 1 ] },

Station is not buying fish but sells them for 300 cr; there's only one available, at low. The station buys coffee but is currently not buying any units (demand = 0); it's selling coffee for 1360 and has ten available.
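A receiver has to undo the two shorthands: a bare integer standing in for a full triple, and a missing per-item 'm' falling back to the update's 'm'. A minimal sketch of that expansion; expand_side() and expand_items() are hypothetical names, and treating an omitted 'b' or 's' as None is an assumption about how a consumer might represent it.

```python
UNKNOWN = -1  # "don't know" marker, as described above


def expand_side(value):
    """Turn a 'b' or 's' value into a (price, units, level) tuple."""
    if value is None:
        return None  # side omitted from the entry
    if isinstance(value, int):
        return (value, UNKNOWN, UNKNOWN)  # bare price: units and level unknown
    price, units, level = value
    return (price, units, level)


def expand_items(update):
    """Flatten an update into one row per item, applying the default timestamp."""
    default_time = update["m"]
    rows = []
    for name, data in update["items"].items():
        rows.append({
            "item": name,
            "buy": expand_side(data.get("b")),
            "sell": expand_side(data.get("s")),
            "modified": data.get("m", default_time),
        })
    return rows


update = {
    "m": "2014-12-19 16:47:00",
    "items": {
        "Algae": {"b": 123, "s": [130, 1102, 2]},
        "Coffee": {"b": [1537, 58099, 2], "m": "2014-12-19 16:46:00"},
    },
}
for row in expand_items(update):
    print(row)
```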

Here's a full example:

Code:

{
  "cmdr": "kfsone",
  "src": "td/price-json",
  "sys": {
    "name": "Sol",
    "pos": [
      0.0,
      0.0,
      0.0
    ]
  },
  "stn": {
    "name": "Abraham Lincoln",
    "ls": 0
  },
  "m": "2014-12-19 16:47:00",
  "items": {
    "Coffee": {
      "b": [
        1537,
        58099,
        2
      ],
      "m": "2014-12-19 16:46:00"
    },
    "Reactive Armour": {
      "b": [
        2407,
        20610,
        3
      ]
    },
    "Progenitor Cells": {
      "b": [
        7313,
        30373,
        2
      ]
    },
    "Animal Meat": {
      "b": [
        1537,
        139436,
        2
      ],
      "m": "2014-12-19 16:46:00"
    },
    "Fruit And Vegetables": {
      "b": [
        429,
        209633,
        2
      ],
      "m": "2014-12-19 16:46:00"
    },
    "Biowaste": {
      "b": 11,
      "s": [
        14,
        34779,
        3
      ]
    },
    "Palladium": {
      "b": [
        14007,
        133513,
        2
      ]
    },
    "Consumer Technology": {
      "b": [
        7100,
        56084,
        2
      ],
      "m": "2014-12-19 16:46:00"
    },
    "Liquor": {
      "b": [
        808,
        46359,
        2
      ],
      "m": "2014-12-19 16:46:00"
    },
    "Clothing": {
      "b": [
        443,
        544225,
        2
      ],
      "m": "2014-12-19 16:46:00"
    },
    "Fish": {
      "b": [
        533,
        438253,
        2
      ],
      "m": "2014-12-19 16:46:00"
    },
    "Hydrogen Fuel": {
      "b": 159,
      "m": "2014-12-19 16:46:00"
    },
    "Wine": {
      "b": [
        354,
        535859,
        2
      ]
    },
    "Domestic Appliances": {
      "b": [
        691,
        213807,
        2
      ],
      "m": "2014-12-19 16:46:00"
    },
    "Beer": {
      "b": [
        256,
        497350,
        2
      ],
      "m": "2014-12-19 16:46:00"
    },
    "Tea": {
      "b": [
        1729,
        131770,
        2
      ],
      "m": "2014-12-19 16:46:00"
    },
    "Water Purifiers": {
      "b": [
        388,
        119966,
        2
      ]
    },
    "Silver": {
      "b": [
        5306,
        1396917,
        2
      ]
    },
    "Power Generators": {
      "b": [
        644,
        68089,
        2
      ]
    },
    "Food Cartridges": {
      "b": [
        234,
        75804,
        2
      ],
      "m": "2014-12-19 16:46:00"
    },
    "Synthetic Meat": {
      "b": [
        370,
        148491,
        2
      ],
      "m": "2014-12-19 16:46:00"
    },
    "Grain": {
      "b": [
        304,
        1049364,
        2
      ],
      "m": "2014-12-19 16:46:00"
    },
    "Platinum": {
      "b": [
        19418,
        104630,
        2
      ]
    },
    "Basic Medicines": {
      "b": [
        404,
        49893,
        2
      ]
    },
    "Gold": {
      "b": [
        9831,
        135829,
        2
      ]
    },
    "Non-Lethal Weapons": {
      "b": [
        2069,
        12129,
        2
      ]
    }
  }
}

Snake Man · Dec 21, 2014

Next problem will be OCR users who have foreign language selected, ie:

Code:

 'itemName': 'Unterhaltungselektronik'}}

CMDRKNac · Dec 21, 2014

Snake Man said:
Next problem will be OCR users who have foreign language selected, ie:

Code:

'itemName': 'Unterhaltungselektronik'}}

Lol yes, it's already happening: http://www.elitedangerouscentral.com/systems?q=Luhman 16&type=system

We need that itemName validation and some sheet to translate itemnames to English.
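Such a sheet could be as simple as a lookup applied before validation. A minimal sketch; CANONICAL_ITEMS is a hypothetical subset of the commodity list, and the single German entry is an assumed translation used only for illustration.

```python
# Hypothetical subset of the canonical (English) commodity names.
CANONICAL_ITEMS = {"Consumer Technology", "Coffee", "Fish"}

# Hypothetical translation sheet; this mapping is an assumption, not verified.
TRANSLATIONS = {"unterhaltungselektronik": "Consumer Technology"}


def resolve_item_name(raw: str):
    """Return the canonical English name, or None if the name is not recognised."""
    by_key = {item.casefold(): item for item in CANONICAL_ITEMS}
    key = raw.strip().casefold()
    if key in by_key:
        return by_key[key]  # already English, possibly with odd casing
    return TRANSLATIONS.get(key)  # known translation, else None


for raw in ("Unterhaltungselektronik", "coffee", "Cofee"):
    print(raw, "->", resolve_item_name(raw))
```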

Andargor · Dec 21, 2014

kfsone said:
It might streamline building both sides of a service for some people since they won't have to write one half as an HTTP service and one as a ZMQ service.

I concur with this. How about providing both POST and ZMQ submit capability? That way a developer could choose which method is easiest, since they already would have to implement ZMQ.

Andargor · Dec 21, 2014

kfsone said:
For a whole-station update, I'm considering the following for TD:

May I suggest adding optional fields such as:

- Star type
- Station type and economy (outpost vs station, wealthy agricultural, etc.)
- Factions
- Rare commodities available
- Presence of black market
- Services (outfitting, etc.)

Snake Man · Dec 22, 2014

All of the above is static data, with the exception of factions (dunno if a faction can be "killed off" from a system?). How many times does this static data need to be sent out, then? I think it will be an excessive flood if all players broadcast the same data over and over again to the network.

Unless I'm totally misunderstanding what you mean.

I mean market data changes almost every few minutes, that we need to update and pump constantly to the network, but star and station types, heh no.

Andargor · Dec 22, 2014

Snake Man said:
I mean market data changes almost every few minutes, that we need to update and pump constantly to the network, but star and station types, heh no.

You are probably right, however I'm exploring right now and the systems are being generated. So I was thinking of having star information fed to the EDDN. Also FD have indicated they will put more stations in remote systems depending on player activity, so at best the lists are semi-static.

In any case the schema approach is flexible enough to allow additional info such as this.

Andargor · Dec 23, 2014

Quick question about the schema, are arbitrary fields supported? Meaning, only required fields are checked and the others passed through?

Doing some testing, I saw the need for some additional fields:

- Message Hash: I sent duplicate messages to the feed. The gateway timestamp was different, but the payload was the same. Of course the consumer could do some checking, including generating his own hash to determine payload uniqueness, but this would be more efficient to do at the producer end. The "message" field could be hashed (SHA2, FNV-1, whatever), and a "hash" field could contain the result for quick and easy dupe detection.

- Media: What type of media produced the message (manual, OCR, other...)

- Confidence: If it's OCR, perhaps a confidence field? Certainly not required, but could help consumers determine the quality of the message.

My 0.02 CR

wolverine2710 · Dec 24, 2014

Andargor said:
Quick question about the schema, are arbitrary fields supported? Meaning, only required fields are checked and the others passed through?

Doing some testing, I saw the need for some additional fields:

- Message Hash: I sent duplicate messages to the feed. The gateway timestamp was different, but the payload was the same. Of course the consumer could do some checking, including generating his own hash to determine payload uniqueness, but this would be more efficient to do at the producer end. The "message" field could be hashed (SHA2, FNV-1, whatever), and a "hash" field could contain the result for quick and easy dupe detection.

- Media: What type of media produced the message (manual, OCR, other...)

- Confidence: If it's OCR, perhaps a confidence field? Certainly not required, but could help consumers determine the quality of the message.

My 0.02 CR

You have some valid points. Xmas stress and can only reply after Xmas - after this. No internet where I am then and no smartphone. So Saturday for a good reply.... But remember EDDN is a relayer, what goes in, goes out, with some basic checks. Perhaps some more basic stuff can be added but that depends on James atm. Just like EMDR (EVE), the responsibility for the data (sending, checking, deciding validity) lies largely with the sender and receiver. At least that is how I see it. In hindsight perhaps a more appropriate name would have been EDDR instead of EDDN. EDDR as in Elite Dangerous Data Relay.
Note: GREAT that you are experimenting with EDDN ;-)

Andargor · Dec 24, 2014

wolverine2710 said:
But remember EDDN is a relayer, what goes in, goes out, with some basic checks.

I realize that; it doesn't change the mission of EDDN, and I'm not asking for more checks from the infrastructure. Just that the producer can make consuming messages easier by simply adding a couple of fields.

And for your stress:

[Image: "Your stress level is too damn high" meme]

jamesremuscat · Dec 26, 2014

Andargor said:
Quick question about the schema, are arbitrary fields supported? Meaning, only required fields are checked and the others passed through?

That's how it works currently, though I don't want that to be the case long-term.
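The difference between the current pass-through behaviour and a stricter schema can be sketched without any schema library. The field names in REQUIRED_FIELDS are a hypothetical stand-in for the real schema's required list, and check_message() is an illustrative helper, not gateway code.

```python
# Hypothetical stand-in for the schema's list of required fields.
REQUIRED_FIELDS = {"systemName", "itemName", "timestamp"}


def check_message(message: dict, strict: bool = False) -> list:
    """Return a list of problems; an empty list means the message is accepted.

    With strict=False extra fields are passed through untouched; with
    strict=True they are reported as errors.
    """
    present = set(message)
    errors = [f"missing required field: {name}"
              for name in sorted(REQUIRED_FIELDS - present)]
    if strict:
        errors += [f"unexpected field: {name}"
                   for name in sorted(present - REQUIRED_FIELDS)]
    return errors


msg = {"systemName": "Sol", "itemName": "Gold",
       "timestamp": "2014-12-19T16:47:00+00:00", "confidence": 0.9}
print(check_message(msg))
print(check_message(msg, strict=True))
```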

Andargor said:
Doing some testing, I saw the need for some additional fields:

- Message Hash: I sent duplicate messages to the feed. The gateway timestamp was different, but the payload was the same. Of course the consumer could do some checking, including generating his own hash to determine payload uniqueness, but this would be more efficient to do at the producer end. The "message" field could be hashed (SHA2, FNV-1, whatever), and a "hash" field could contain the result for quick and easy dupe detection.

My first thought was "this would make most sense implemented as the gateway appending a hash for the contents of the 'message' property"... then I decided that computing hashes for each and every message isn't something I'd like to commit the gateway to doing

Note that duplicate messages do have non-zero value: if you get the same data from two different people, it's less likely to be faked. (In practice, as I might have said already, I'm not sure anyone will bother going to that level of effort to counter poisoned data.)
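If the gateway does not hash, a consumer can do both jobs itself: hash the 'message' part to spot duplicates, and count how many distinct uploaders sent the same payload as a rough measure of corroboration. A minimal sketch; the envelope layout (a 'header' holding uploaderID and gatewayTimestamp next to 'message') and the buyPrice field are assumptions for illustration.

```python
import hashlib
import json
from collections import defaultdict


def message_hash(message: dict) -> str:
    """Hash a payload in a canonical form so that key order does not matter."""
    canonical = json.dumps(message, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class Corroboration:
    """Track which uploaders have sent each distinct payload."""

    def __init__(self):
        self._sources = defaultdict(set)

    def record(self, envelope: dict) -> int:
        """Record one envelope; return how many distinct uploaders sent this payload."""
        digest = message_hash(envelope["message"])
        self._sources[digest].add(envelope["header"]["uploaderID"])
        return len(self._sources[digest])


seen = Corroboration()
first = {"header": {"uploaderID": "alice", "gatewayTimestamp": "t1"},
         "message": {"itemName": "Gold", "buyPrice": 9831}}
resend = {"header": {"uploaderID": "alice", "gatewayTimestamp": "t2"},
          "message": {"buyPrice": 9831, "itemName": "Gold"}}
second = {"header": {"uploaderID": "bob", "gatewayTimestamp": "t3"},
          "message": {"itemName": "Gold", "buyPrice": 9831}}
print(seen.record(first), seen.record(resend), seen.record(second))
```

A count that stays at 1 means a plain resend from the same source; a count above 1 means independent agreement, which is the case where a duplicate adds value.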

Andargor said:
- Media: What type of media produced the message (manual, OCR, other...)

To some extent, the softwareName field could serve this purpose, but I get your point.

Andargor said:
- Confidence: If it's OCR, perhaps a confidence field? Certainly not required, but could help consumers determine the quality of the message.

It sounds like you'd like a place for application-specific data to be dumped, and read by those consumers who understand it... Hmm, I'll think on it

Andargor said:
My 0.02 CR

Many thanks for them

- - - - - Additional Content Posted / Auto Merge - - - - -

Someone's running 389 clients from a single IP address, 80.XX.XX.X62. If that's you, could you PM me and let me know why you need so many?
