I did some testing over the last few days on the updates to tradedangerous.db from listings.csv, since I still think that using exceptions as decision logic is flawed. I used .NET, since that is what I know best, and set up two tests: the first inserts and falls back to an update if an exception is thrown, but the second does it a very different way. I imported the station and item IDs as a single combined value
Code:
select printf("%010d", station_id) || printf("%010d", item_id) as HashKey from StationItem
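For anyone wondering how this key-loading step might look in Python: a rough sketch below, using the standard sqlite3 module and a made-up three-column StationItem table. Python tuples hash natively, so a set of (station_id, item_id) tuples can stand in for the zero-padded string keys I needed in .NET.

Code:
```python
import sqlite3

# Hypothetical in-memory database standing in for tradedangerous.db.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE StationItem (station_id INTEGER, item_id INTEGER, price INTEGER)"
)
conn.executemany(
    "INSERT INTO StationItem VALUES (?, ?, ?)",
    [(1, 10, 100), (1, 11, 200), (2, 10, 150)],
)

# Load every existing (station_id, item_id) pair into a set. There is no
# need to pack the two IDs into one printf-style string as the SQL above
# does for .NET: a tuple is already a usable hash key in Python.
existing = {
    (sid, iid)
    for sid, iid in conn.execute("SELECT station_id, item_id FROM StationItem")
}

print((1, 10) in existing)  # True: this pair is already in the table
print((3, 10) in existing)  # False: this row would need an insert
```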
and read the values into a hash table. Then I read the listings.csv file, creating a hash key from the station and item IDs in each row: if the key already existed in the hash table, the row was added to a list for updates; if it did not, the row was added to a second list for inserts and the hash key was added to the hash table. After the entire file had been split in this manner, I set the insert command string on the command object, called the Prepare method, and then ran through the insert list, binding the data for each item in the list to the parameters and executing the command. The update list was iterated the same way, the difference being that this time the prepared statement was an UPDATE.
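The split-then-batch flow described above can be sketched in Python as follows. This is an illustration under assumed table and column names (StationItem with a price column), not the actual TradeDangerous schema; sqlite3's executemany reuses one compiled statement for the whole batch, which plays the same role as the prepared command in my .NET test.

Code:
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE StationItem ("
    " station_id INTEGER, item_id INTEGER, price INTEGER,"
    " PRIMARY KEY (station_id, item_id))"
)
conn.executemany(
    "INSERT INTO StationItem VALUES (?, ?, ?)", [(1, 10, 100), (2, 10, 150)]
)

# Keys already present in the database.
existing = {
    (s, i) for s, i in conn.execute("SELECT station_id, item_id FROM StationItem")
}

# Hypothetical rows parsed from listings.csv: (station_id, item_id, price).
listings = [(1, 10, 110), (1, 11, 200), (3, 10, 90)]

updates, inserts = [], []
for station_id, item_id, price in listings:
    key = (station_id, item_id)
    if key in existing:
        # Parameter order must match the SET/WHERE clauses below.
        updates.append((price, station_id, item_id))
    else:
        inserts.append((station_id, item_id, price))
        existing.add(key)  # so a later duplicate row becomes an update

# One prepared statement per batch, each executed over its whole list.
conn.executemany("INSERT INTO StationItem VALUES (?, ?, ?)", inserts)
conn.executemany(
    "UPDATE StationItem SET price = ? WHERE station_id = ? AND item_id = ?",
    updates,
)
conn.commit()
```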
I ran the two tests a number of times, around 20 or so, and averaged the results. The second, non-exception method took around 43% of the time of the insert-with-exception-then-update method (747 ms vs 1738 ms).
I also did a quick test on the inserts by batching them in the form INSERT INTO StationItem VALUES (....),(....),(....) for the 12,527 inserts and then issuing one execute. The execute-many form was faster, but the time difference was not really significant (1.4 s vs 3.6 s).
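For reference, the multi-row VALUES variant looks like this in Python, again against a made-up StationItem table. One caveat worth noting: SQLite caps the number of bound parameters per statement (999 by default in older builds), so a file of this size would have to be chunked.

Code:
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE StationItem (station_id INTEGER, item_id INTEGER, price INTEGER)"
)

rows = [(1, 10, 100), (1, 11, 200), (2, 10, 150)]

# Build one INSERT ... VALUES (?,?,?),(?,?,?),... statement and run a
# single execute, instead of executemany with a one-row statement.
placeholders = ",".join(["(?,?,?)"] * len(rows))
flat = [value for row in rows for value in row]
conn.execute(f"INSERT INTO StationItem VALUES {placeholders}", flat)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM StationItem").fetchone()[0])  # 3
```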
The listings file contained about 3.4 million rows of data. Timings did not include the common elements (connecting to the database, reading in the data), just the method of inserting or updating the data from the listings file.
The increase in speed was mainly due, as eyeonus has already noted, to not using the horribly slow (in CPU terms) exception-driven data access. Using a single prepared SQL command per batch and dealing with all the inserts and updates in separate batches also cut down the execution time. The downside is that there is a hash table held in memory instead.
Keeping the hash table at this point would also aid the listener, since the same method of deciding whether to insert or update could be used there instead of the exception method.
Still, the method works, does not use exceptions as decision logic, and is significantly faster, although I don't know whether the savings would translate to Python. Python does not have a direct equivalent of the .NET Hashtable, although a dictionary or set, which are hash-based, may work instead.