These are the requirements:

1) This is the only table schema we need.

colA text,
colB timestamp,
colC text,
colD uuid,
colE text,
colF text,
colG text,
colH text,
colI text,
colJ int,
colK text,
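For the Java/Node side, the column list above could be modeled as a row type (a sketch in TypeScript; the `SuperTableRow` name and the type mappings text→string, timestamp→Date, uuid→string, int→number are assumptions):

```typescript
// Row shape matching the schema above. The type mappings are assumed:
// text -> string, timestamp -> Date, uuid -> string, int -> number.
interface SuperTableRow {
  colA: string;
  colB: Date;    // timestamp
  colC: string;
  colD: string;  // uuid
  colE: string;
  colF: string;
  colG: string;
  colH: string;
  colI: string;
  colJ: number;  // int
  colK: string;
}
```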

3) Any text field has no more than chars.

4) We insert the data for the last 24 hrs at midnight, so not having data for the previous day is acceptable; each day we'll insert up to 100k rows.
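The nightly load could be sketched like this on the Java/Node side (TypeScript here; `chunk` is a hypothetical helper and the batch size is an assumption): split the day's up-to-100k rows into fixed-size batches so each insert stays small.

```typescript
// Split an array of rows into fixed-size batches for the nightly insert.
// A batch size like 1000 is an assumption, not part of the requirements.
function chunk<T>(rows: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < rows.length; i += size) {
    batches.push(rows.slice(i, i + size));
  }
  return batches;
}
```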

5) SUPER_TABLE_1 won’t have more than 10 million rows; when it reaches that limit, we automatically clone the table, add a suffix, and start inserting there (e.g. SUPER_TABLE_2, SUPER_TABLE_3).
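The rollover rule can be sketched as a small helper that picks the target table for the next insert (`targetTable` is a hypothetical name; how the row count is obtained is left out):

```typescript
// Pick the table to insert into, rolling over to the next numeric
// suffix once the current table reaches the row limit (10M here).
function targetTable(
  current: string,
  rowCount: number,
  limit = 10_000_000
): string {
  if (rowCount < limit) return current;
  const m = current.match(/^(.*_)(\d+)$/);
  if (!m) throw new Error(`unexpected table name: ${current}`);
  return `${m[1]}${Number(m[2]) + 1}`;
}
```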

6) We should be able to query between any 2 dates, but never across years.
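One reading of "not between years" is that both endpoints must fall in the same calendar year; under that assumption a range check could look like this (`sameYearRange` is a hypothetical helper):

```typescript
// Validate a query range: any two dates are allowed as long as they
// are ordered and fall within the same calendar year (an assumption
// about what "not between years" means).
function sameYearRange(from: Date, to: Date): boolean {
  return from <= to && from.getUTCFullYear() === to.getUTCFullYear();
}
```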

7) The only query we need to do is:

8) Additional filtering will be done using Java or Node (which will be in the same network).

9) We transform the result into a CSV and send it to the client.
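Steps 8 and 9 together (exact-value filtering in the app layer, then CSV output) could be sketched like this; `filterRows` and `toCsv` are hypothetical names, and rows are treated as plain string maps for simplicity:

```typescript
// A row is treated as a plain map of column name -> string value.
type Row = Record<string, string>;

// Keep only rows whose columns exactly match every criterion.
function filterRows(rows: Row[], criteria: Partial<Row>): Row[] {
  return rows.filter((r) =>
    Object.entries(criteria).every(([k, v]) => r[k] === v)
  );
}

// Serialize rows to CSV, quoting values that contain commas,
// quotes, or newlines (per the common RFC 4180 convention).
function toCsv(rows: Row[], columns: string[]): string {
  const escape = (s: string) =>
    /[",\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
  const lines = [columns.join(",")];
  for (const r of rows) {
    lines.push(columns.map((c) => escape(r[c] ?? "")).join(","));
  }
  return lines.join("\n");
}
```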

What Big Data solution would you use? Some technologies we have in mind are Cassandra, HBase, Couchbase, and MySQL/MariaDB.

We’ll be using AWS and have a budget of $5k a month.

Nice to have:

1) It would be nice to filter by any column, but filtering as in step 4 is completely fine.

2) We could change step 5: having all the data in a single table could be fine, or maybe even creating multiple tables for SUPER_TABLE_1, but every time we query we need to get all columns (always query between 2 dates in order to keep it).


We get 100k rows of data every day, and date is the primary key. We should be able to filter by 8 possible columns (all strings, one is a date) using exact values (though filtering in Java/Node is acceptable), then pass that data as a CSV (JSON is fine but may be too big) to another service in the same network (using POST).

We’ll use AWS with a $5k budget limit; we can’t use other cloud services (Firebase, for example).
