| 120million records - scaling up/out advice? [message #2049] |
Wed, 10 October 2007 15:46  |
timrobertson100 Messages: 8 Registered: October 2007 |
Junior Member |
|
|
Hi
I am involved in an effort to make the worlds biodiversity information (plants, animals etc) freely and publically available. Part of this effort involves a central index of the information available on the web collated into a single searchable database, which runs on mysql (http://data.gbif.org). Being an index the data does not partition easily due to the combinations of filters we offer, and we are beginning to hit performance issues. The filters can be seen on the "occurrences" section of the portal.
Our DB is currently running on myisam, single server (2 proc and 5gig ram) with a DB of 90gig size. The core tables are
- occurrence_record 120million rows, 32gig index size
- taxon_concept 20 million rows and 5gig index size
Most queries involve a join on these 2 tables.
The problem is that we anticipate doubling the data size in the short term and within 3-5 years 10x larger, and thus we are looking to scale.
Are there any good papers on scaling strategies? Is our only option really to move to clustering? We can't partition easily as whatever strategy for splitting the data means some queries will touch all data shards.
Anyone claim to be a large DB specialist and want to be involved as a mentor on an open source project to help mobilize biodiversity data? We could do with help on scaling and tuning...
Many thanks for any thoughts
Tim
|
|
|
|
|
| Re: 120million records - scaling up/out advice? [message #2057 is a reply to message #2049 ] |
Thu, 11 October 2007 12:47   |
|
I think you can try to use Sphinx search engine for your search tasks. It supports clustering (when you'd like to use it, it will be there) out of the box and in many cases it shows better performance on large datasets comparing to mysql.
Alexey Kovyrin, MySQL Performance Expert
MySQL Performance Blog
MySQL Consulting
|
|
|
|
| Re: 120million records - scaling up/out advice? [message #2059 is a reply to message #2049 ] |
Thu, 11 October 2007 18:01   |
sterin Messages: 324 Registered: March 2007 Location: Sweden |
Senior Member |
|
|
OK I guess that it looked like a good start until you hit the roof.
So basically you for one server have:
4x15k disks RAID5 for data 128MB cache (serveRaid 6i)
6x10k disks RAID5 for index 256MB cache (default DS400)
(since it shares the DS400 with another server as I understand it)
Some more questions:
1.
Just to make sure, is the bottleneck CPU or IO?
2.
The performance problems you have are they read only (since I got that feeling from your post)? Or are they writes to?
3.
If there are writes too, what is the ratio between reads/writes?
4.
How many concurrent queries are we talking about at peek or is each query too slow even if you run it when nobody else is using the server?
Some assumptions made by me, please correct me if I'm wrong:
A1.
You are running a 64 bit OS.
A2.
The bottleneck is IO since that is what it usually is on large DB's.
Some remarks (although some of these recommendations might change due to your answers on the questions above):
R1.
Get _MUCH_ more RAM than 5GB.
RAM is used as cache to avoid disk reads.
Since disk reads are the most expensive thing there is for a DB server you want to avoid it at all costs.
With your amount of data I would suggest at least 32GB RAM, preferably even more.
R2.
Consider RAID 10 instead of RAID5 due to faster random reads and writes (but it depends on the read/write pattern your DB is producing) and even more disks to distribute the load over.
As for scaling up/out I usually think it is best to keep it simple and run with one (or two servers where one is for backup reason) pretty far to avoid a lot of extra work.
To be able to scale out you need the DB to be pretty static and not many writes.
Because each server needs to perform all writes and you also have a delay in distributing the writes to the other servers.
So if you have a very write heavy or time critical DB then scaling out is basically not an option.
In your case it sounds like it is pretty static DB and then scaling out is an option but yet my opinion is that you save yourself a lot of work by keeping the nr of servers down.
[Updated on: Thu, 11 October 2007 18:13]
|
|
|
| Re: 120million records - scaling up/out advice? [message #2062 is a reply to message #2049 ] |
Fri, 12 October 2007 03:20  |
timrobertson100 Messages: 8 Registered: October 2007 |
Junior Member |
|
|
Hi Sterin
I will reply by email with full details of what we are doing.
Many thanks!
Tim
|
|
|