[ase-users] Database issues

Wed Nov 28 09:16:37 CET 2018

Hi,

I have attached small a script that on my computer is not able to add a new key_value_pair, it ends with an Sqlite3OperationalError. I can also confirm that if I split up the data across multiple tables and not put them as a serialized JSON string, you can easily add new key_value_pairs. So if one had the option to store part all data as external tables and not mix them in on each row as a JSON serialized string I think our databases would perform much better. The concrete use case is actually a cluster expansion model. For each DFT structure you then would have a key_value_pair with the so-called correlation functions. So easily a few thousands of those. More generally you have a set of descriptors that describe you data (in this case DFT calculations) and you would feed that into some machine learning algorithm.

The script I have attached highlights the problem. Let's say you have 2000 descriptors stored as a key-value-pair. Suddenly, you realise that some of your DFT calculations sufferend from heavy distortions during relaxation and you would like to feed you machine learning algorithm with only a subset of the calculations (the ones that did not experience heavy distortion). In that case you would like to add a new key_value_pair that tags some calculations as invalid. On the command line maybe:

ase db somedatabase.db id=231 -k valid=False. (if one wants to exclude that particular structure).

In the example I have attached, it is not possible because you get Sqlite3OperationalError. As mentioned before, if you don't have that many key_value_pairs (i.e. stored you descriptors in a separate table) everything works fine.

My original thought, was to not allow queries on the things in the external tables, but keep that as it is (i.e. things you want to use in queries should be key_value_pairs as before), but when you load data one also include the things that are stored in the external tables. Technically there is nothing preventing queries on the external tables, but I did not implement it bacause I wanted to just extend the database functionality in ASE and not touch any of the existing implementations.

I have made a class that inherits from SQLite3Database, and implements the features mentioned above. Is it useful to open Work In Progress merge request and then one can judge whether this is something that is interesting for ASE?

To use (in the way we have implemented it):
from ase.db import connect
db = connect("somedb.db", type="flexsqlite")

So, if you just do as in the past, it just the same as the regular SQLite3Database.
Now, if you want to create a new table when writing/updating you use a special key-word tables.

atoms = Atoms()
db.write(atoms, tables={"descriptors": {"desc1": 0.11, "desc2": -1.0, "desc3": 1.0}, "table2": {"info1": 0.2, "info2": 0.3}})

This would create two new tables: "descriptors", and "table2".

So you basically provide a dictionary for each table. Everything else stays the same.
The benefit is that, if you have a lot of data attached to a row, you avoid to either making huge binary blobs (i.e. use the data field) or huge JSON strings by using key_value_pairs. In both cases updates and modification of the database is cumbersome.

As you suggested, another possiblity is to detect the datatype of the value-field in each dict and then allow any type and not just float and str. In our case float and string is not a limitation. The limitation is that we get Sqlite3OperationalError when we add new key value pairs. And a working solution seems to be to not have so many key_value_pairs, but rather separating data in more tables.

Cheers,
David Kleiven

________________________________
From: Jens Jørgen Mortensen <jjmo at dtu.dk>
Sent: Tuesday, November 27, 2018 12:27 PM
To: David Kleiven; ase-users at listserv.fysik.dtu.dk
Subject: Re: [ase-users] Database issues

On 11/23/18 2:47 PM, David Kleiven via ase-users wrote:

Dear ASE users,

we are working on a project that involves many DFT calculations and we use the database functionality to keep track of everything. Our calculations have a lot of external data for machine learning purposes. Currently, the only way we could find to "attach" such info to each structure is via the key_value_pair. To our knowledge, the key_value_pairs are duplicated in the database 1) They are stored as a serialized JSON string on each row and 2) Distributed in number_key_values and text_key_values. Hence, when you have for instance 5000 key_value_pairs it becomes cumbersome to maintain this duplication. Moreover, we have experienced cases where appending more key_value_pairs leads to errors and a corrupted database. We tested various solution and one way we found is to allow user defined tables (i.e. users can create tables with the same schema as number_key_values). Hence, big chunks of static data can be placed in those tables. One avoids duplication and appending dynamic key_value_pairs is no longer a problem. When you read back data from the database, all data from these external tables are automatically added to the AtomsRow object as if they were regular key_value_pairs.

The syntax for storing a separate table would be

db.write(atoms, tables={"some_table_name": dict_with_data}, ... regular key value pairs...)

and the same syntax for read. All external tables would be added to AtomsRow as if they where key_value_pairs.

Is this solution (separating out big chunks of data in separate tables) interesting to include in ASE via a class that inherits from SQLite3Database and provide this extra functionality in addition to everything that is supported by SQLite3Database? (Note: that in our case using the data field is not a good option as this is essentially a big binary chunk and it is therefore not so easy to 1) manually look at the data via external tools and 2) not so easy to update)

That's an interesting idea. If I understand correctly, you would like to be able to do key=value where value can be (almost) anything and not just float or str as it must be now.  Is that correct?  Maybe you can explain a bit more what your use case is and how using this new feature would look?

Jens Jørgen

PS: Can you create a simple example script that demonstrates the error you mentioned?

Cheers,

David Kleiven

_______________________________________________
ase-users mailing list
ase-users at listserv.fysik.dtu.dk<mailto:ase-users at listserv.fysik.dtu.dk>
https://listserv.fysik.dtu.dk/mailman/listinfo/ase-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listserv.fysik.dtu.dk/pipermail/ase-users/attachments/20181128/1919b9c1/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: error_on_insert_new_kvp.py
Type: text/x-python
Size: 674 bytes
Desc: error_on_insert_new_kvp.py
URL: <http://listserv.fysik.dtu.dk/pipermail/ase-users/attachments/20181128/1919b9c1/attachment-0001.py>