After working through the diagnostic data to reach this point, the next task was obvious: measure what was causing the I/O. Unfortunately, this was infeasible on the version of GNU/Linux the customer was using. We could have made an educated guess with some work, but we wanted to explore other options first. As a proxy, we could have measured how much I/O was coming from MySQL, but again, in this version of MySQL that wasn't really feasible due to lack of instrumentation.
Instead, we opted to try to observe MySQL's I/O, based on what we know about how
it uses the disk. In general, MySQL writes only data, logs, sort files, and temporary
tables to disk. We eliminated data and logs from consideration, based on the status
counters and other information we discussed earlier. Now, suppose MySQL were to suddenly write a bunch of data to disk in temporary tables or sort files. How could we
observe this? Two easy ways are to watch the amount of free space on the disk, or to
look at the server's open filehandles with the lsof command. We did both, and the
results were convincing enough to satisfy us. Here's what df -h showed every second
during the same incident we've been studying:
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 58G 20G 36G 36% /
/dev/sda3 58G 20G 36G 36% /
/dev/sda3 58G 19G 36G 35% /
/dev/sda3 58G 19G 36G 35% /
/dev/sda3 58G 19G 36G 35% /
/dev/sda3 58G 19G 36G 35% /
/dev/sda3 58G 18G 37G 33% /
/dev/sda3 58G 18G 37G 33% /
/dev/sda3 58G 18G 37G 33% /
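Sampling free space like this needs nothing more than a shell loop. A minimal sketch (the one-second interval and root mount point match the output above; the five-sample count is just for illustration):

```shell
# Print free-space stats for the root filesystem once per second.
# Five samples here for illustration; during an incident you would
# let this run until the burst of activity has passed.
for i in 1 2 3 4 5; do
    df -h /
    sleep 1
done
```

Watching the Used and Avail columns change from sample to sample reveals bursts of temporary-file activity even when per-process I/O accounting isn't available.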
And here's the data from lsof, which for some reason we gathered only once every five seconds. We're simply summing the sizes of all of the files mysqld has open in /tmp, and printing out the total for each timestamped sample in the file:
$ awk '
    /mysqld.*tmp/ {
        total += $7;
    }
    /^Sun Mar 28/ && total {
        printf "%s %7.2f MB\n", $4, total/1024/1024;
        total = 0;
    }' lsof.txt
18:34:38 1655.21 MB
18:34:43 1.88 MB
18:34:48 1.88 MB
18:34:53 1.88 MB
18:34:58 1.88 MB
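To make the awk logic concrete, here's a self-contained run against a tiny synthetic lsof.txt. The PIDs, file sizes, and filenames are invented, and we assume each sample's timestamp line appears after its lsof output, which is what the script's print-on-timestamp logic implies:

```shell
# Build a fake one-sample file: two 1 MB temporary files open by mysqld,
# followed by the sample's timestamp line. All values are invented.
cat > lsof.txt <<'EOF'
mysqld 1234 mysql 5u REG 8,3 1048576 99 /tmp/#sql_1234_0.MYD
mysqld 1234 mysql 6u REG 8,3 1048576 98 /tmp/#sql_1234_0.MYI
Sun Mar 28 18:34:38 EDT
EOF

# Same summation as in the text: column 7 of lsof output is the
# file size in bytes, and column 4 of the date line is the time.
awk '
    /mysqld.*tmp/ { total += $7 }
    /^Sun Mar 28/ && total {
        printf "%s %7.2f MB\n", $4, total/1024/1024
        total = 0
    }' lsof.txt
# Two 1048576-byte files sum to 2.00 MB for the 18:34:38 sample
```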
Based on this data, it looks like MySQL is writing about 1.5 GB of data to temporary
tables in the beginning phases of the incident, and this matches what we found in the
SHOW PROCESSLIST states (“Copying to tmp table”). The evidence points to a storm of
19. Someone call the 1-800 hotline!