Love this idea. Out of curiosity, and to check my understanding: if EDDN is an aggregator/firehose (distributed data collated and thrown at any client listening), is anyone currently providing buffering of this data, e.g. the last 24 hours?
For example, EDDN receives 30k messages over an hour and dutifully forwards them on. Sadly, one of the subscribers was down for that period and so missed out on the new data. This is based on my limited understanding of how ZeroMQ handles pub-sub, so if I'm wrong, please say!
A number of messaging systems have the ability to 'go back in time', or to work from an index, so that you can say 'OK, the last message I reliably processed was 25, let's start pulling from 26' - Amazon Kinesis, for example. Rather than trying to bloat EDDN's mission, I was wondering if anyone had layered this sort of thing on top as yet? Would greater reliability of messages be of interest to tool developers, or is occasional loss tolerable?
Edit: There are of course alternatives to a single source of storage - the various subscribers could all repeat the same data back over time, and thus you'd have crowd-sourced storage of anything you, as a single subscriber, missed. But that approach would rely on being able to spot stale data (some form of reliable timestamp, for example).
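To make the idea concrete, here's a rough sketch (my own, not anything EDDN itself provides) of what such a buffering layer could look like in Python: keep a rolling 24-hour window of relayed messages and let a subscriber that was offline ask for everything newer than the last timestamp it handled. The function names and retention period are purely illustrative.

    # Hypothetical buffering layer on top of the EDDN firehose (not part of EDDN).
    import collections
    import time

    RETENTION = 24 * 60 * 60  # seconds of history to keep

    _buffer = collections.deque()  # (received_at, raw_message) pairs, oldest first

    def record(raw_message):
        """Call this for every message received from the EDDN relay."""
        now = time.time()
        _buffer.append((now, raw_message))
        # Drop anything older than the retention window.
        while _buffer and _buffer[0][0] < now - RETENTION:
            _buffer.popleft()

    def replay_since(timestamp):
        """Return every buffered message newer than the given timestamp."""
        return [msg for received_at, msg in _buffer if received_at > timestamp]

A small HTTP endpoint in front of replay_since() would give roughly the Kinesis-style "start pulling from 26" behaviour described above, without touching EDDN itself.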
My first thought was "this would make most sense implemented as the gateway appending a hash of the contents of the 'message' property"... then I decided that computing hashes for each and every message isn't something I'd like to commit the gateway to doing.
Note that duplicate messages do have non-zero value: if you get the same data from two different people, it's less likely to be faked. (In practice, as I might have said already, I'm not sure anyone will bother going to that level of effort to counter poisoned data.)
Name     Speed on 64-bit     Speed on 32-bit
XXH64    13.8 GB/s           1.9 GB/s
XXH32    6.8 GB/s            6.0 GB/s
I can see value in the usage of hashes.
I do believe it should be generated at the producer end (i.e. what is POSTing the data), rather than at the gateway or consumer apps.
WDYT?
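For illustration, producer-side hashing could be as simple as the sketch below. This is my own example, not part of any EDDN schema - the "messageHash" field name and the use of SHA-256 are assumptions; the xxHash variants benchmarked above would be a cheaper non-cryptographic option via the third-party xxhash package.

    # Hypothetical producer-side hash, computed before POSTing to the EDDN gateway.
    import hashlib
    import json

    def add_message_hash(envelope):
        # Serialise the 'message' property deterministically so every party
        # computing the hash gets the same bytes, regardless of key ordering.
        canonical = json.dumps(envelope["message"], sort_keys=True, separators=(",", ":"))
        envelope["messageHash"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        return envelope

The canonical json.dumps is the important bit: if everyone hashes the same 'message' contents the same way, duplicates (and corroborating submissions) can be spotted at any point in the chain.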
EDDN is intentionally kept simple in both mission and implementation to allow this kind of development by others. Market data is being used as the current use case because it is dynamic. However, there is a need for static or semi-static data being stored somewhere, such as system/station data (see EDSC/TGC or Biteketkergetek's stuff).
And someone will definitely start archiving, especially if there's to be some kind of stock-market ticker, for example. If you are willing, you could tap into the feed and automate the archiving to GitHub or some other online service.
Why does it have to be this complicated... my Python is a bit rusty and I currently can't get pyzmq running. Four fricking pages of "easy" install instructions and I have no clue what to do.
All I want to do is fetch some data (so that I can generate better data).
Fantastic! So, here's my plan: create a Heroku app that will subscribe to the firehose and update local XML files. Then periodically (say every 10 minutes) push changed files to a GitHub repo. I want to also get the list of star coordinates from EDSC and put that info in the same directory structure. Other data sources can be added later.
This gives app creators:
1) a one-stop-shop for ED data (sources can be merged, because XML)
2) snapshot, changes, and historical data (because Git)
3) option to write online or offline apps (e.g. pull when online only)
Should be interesting!
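For the "push every 10 minutes" part, the core loop could be roughly the following - the paths, branch and commit message are placeholders, and it assumes the firehose listener writes the XML files into a local clone of the GitHub repo:

    # Rough sketch of the periodic push step; not the actual tool, just an illustration.
    import subprocess
    import time

    REPO_DIR = "/app/data-repo"   # hypothetical local clone of the GitHub repository
    PUSH_INTERVAL = 10 * 60       # ten minutes, as described above

    def push_changes():
        # Stage anything the listener has written or updated since the last cycle.
        subprocess.check_call(["git", "add", "-A"], cwd=REPO_DIR)
        # Commit and push only if there is actually something staged.
        if subprocess.call(["git", "diff", "--cached", "--quiet"], cwd=REPO_DIR) != 0:
            subprocess.check_call(["git", "commit", "-m", "Automated data update"], cwd=REPO_DIR)
            subprocess.check_call(["git", "push", "origin", "master"], cwd=REPO_DIR)

    while True:
        push_changes()
        time.sleep(PUSH_INTERVAL)

Letting Git do the change detection keeps the listener itself simple: it only has to write files, and the push loop picks up whatever changed.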
My recommendation is to use pip.
I'm very pleased to see the mission of EDDN starting to take off. I'm very glad that commanders see the value of it and are taking the time/effort to enhance EDDN.
Your approach is a new, refreshing and exciting one.
There are some things to consider, especially GitHub's "Working with large files" page. The short version: you need to break your data up into files of at most 50 MB, and the total size of a repository should stay under 10 GB. From that page: GitHub warns you when you try to add a file larger than 50 MB, and pushes containing files larger than 100 MB are rejected, for a few reasons:
In many cases, committing large files is unintentional and causes unneeded repository bloat. Every time someone clones a repository with a large file, they'll have to fetch that file, adding excess time to their download.
In addition, if a repository is 10 GB in size, Git's architecture requires another 10 GB of extra free space available at all times. This allows Git to move the files around in its normal course of operations. Unfortunately, this also means that we must be much less flexible with how we store these repositories.
I don't have the knowledge to determine how feasible your git solution will be, perhaps others do.
At the moment I don't have numbers on bandwidth usage or the number of uploads to EDDN. The latter will be known once I have an ELK stack set up - Kibana will show me everything I want to know and more. There are bandwidth monitoring tools for Linux, and perhaps our hosting service (Vivio Technologies), which provides us with free hosting (20 Mb/s line), has tools for it.
Please keep me/us in the loop about your progress with your tool. Like I said it IS very much appreciated.
So, here's the progress: I have been listening to the ZeroMQ subscription all day, but haven't received a single message. Not sure what I'm doing wrong, or if there are actually no messages today.
For the EDSC data, I tried invoking the API but only got Error 500 (no idea why; I followed the example). So I downloaded the elite.json file that another user prepared and chopped it up into one file per system, with a directory structure based on system name and quadrant (to avoid having too many files in one directory). Once I get messages from EDDN I should be able to create market files in this structure, and then push that as a repo.
As for running out of space, the easy solution would be to roll over to a new repo once the limit is reached, and keep the old one as an archive for historical charting if needed. For use cases needing the latest snapshot and polling for changes, that works just fine.
Once I can get proper data from both the EDDN ZeroMQ feed and the EDSC API, it should be fairly little work to actually get it done, as all the real heavy lifting is done by Git and GitHub. So that's where I'm at right now.
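In case it helps anyone doing the same, the chopping step could look roughly like the sketch below. Note this is a guess at the elite.json layout (a list of system objects with at least a "name"), and the "quadrant" field is hypothetical - derive it from the coordinates however suits your directory scheme.

    # Illustrative only: assumes elite.json is a list of system objects with a "name" field.
    import json
    import os

    def system_path(root, quadrant, name):
        # One file per system, grouped by quadrant and then by the first letter of the
        # name, so no single directory ends up with too many files.
        return os.path.join(root, quadrant, name[0].upper(), name + ".json")

    with open("elite.json") as fh:
        systems = json.load(fh)

    for system in systems:
        quadrant = system.get("quadrant", "unknown")  # hypothetical field; derive from coordinates
        path = system_path("data", quadrant, system["name"])
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as out:
            json.dump(system, out, indent=2)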
I started EliteOCRReader and later RegulatedNoise (same author and program, but now with better number OCR-ing) and, as I write this, data is coming in. You can also use the EDDN\src\eddn\Client.py program. If you were using your own firehose listener, it looks as if something is wrong with it. A good way to monitor what is coming in and going out is Wireshark (just set a few filters to keep things clean).
TGC, aka the EDSC API, is sometimes down; please ask for its status in the crowd-sourcing coordinates project thread (see my sig or profile). A good way to test it is to use the jsFiddle examples on the API page.
What is that elite.json file you are talking about and where can I find it?
Thanks for the update and good luck with the tool.
Alright, got it working now. I had missed the actual "subscribe" command. So now I'm getting EliteOCR messages and I can decompress them properly. Next: create a market.xml file in the mentioned directory structure, have the program push it to GitHub, and... that should be it. Deploy to Heroku, profit.
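For anyone else tripping over the same thing, the missing piece was the SUBSCRIBE socket option - without it a ZeroMQ SUB socket silently filters out every message. A minimal listener looks roughly like this (the relay address is a placeholder; use whatever the current EDDN relay address is):

    # Minimal EDDN subscriber sketch: subscribe to everything, decompress, print.
    import json
    import zlib

    import zmq

    RELAY = "tcp://eddn-relay.elite-markets.net:9500"  # placeholder; check the current relay address

    context = zmq.Context()
    socket = context.socket(zmq.SUB)
    socket.setsockopt(zmq.SUBSCRIBE, b"")  # the easy-to-miss step: empty prefix = receive everything
    socket.connect(RELAY)

    while True:
        compressed = socket.recv()
        payload = json.loads(zlib.decompress(compressed).decode("utf-8"))
        print(payload)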
Nice. Looking forward to the end result. Are you going to provide info about how to get the older data from GitHub, what to do with it, etc.?