I've been having the same problem for 2 weeks now. If anyone has any
ideas, I'd love to hear them. We are using both SQL and Windows
Authentication. I was running a Profiler Trace at the time, and am
going through it now but have not seen anything yet.
Thanks in advance.
About once a week, at no fixed time (but so far, between 8am and
11am), my SQL Server 2000 on Windows 2000 (8.00.679) will stop
responding. New connections at first take forever to connect, and
then fail with an error message about PreLoginHandshake(). We have to kill
the service (see below) to stop it, but it comes back without
problems.
The very first time this happened, I got buf latch errors in the SQL
Error Log, but not this time. (However, the first time we waited for
about 5 minutes after it stopped responding, and then the buf latch
error message showed up. This time we restarted the services before 5
minutes had passed.)
The SQL log shows nothing abnormal. The SQL Agent log shows nothing
abnormal. The Event Log shows nothing abnormal. Stopping the service
through the SQL Service Manager doesn't work - we have to go into the
Services control panel, stop the Agent then stop the service, and when
the service says that it cannot be killed, we then run the "kill"
command on it (from the resource kit, I believe) and it stops
immediately. Once this occurs, we can use the SQL Service manager to
start everything up successfully, with no problems in the logs.
Note: we have not rebooted the server yet, just
stopped/killed/restarted the services. We plan on rebooting this
weekend.
Timeline:
10:20 - an openquery job runs successfully - as far as I know,
everything is okay at this point.
10:22 - for some reason, a transaction log backup job does not run
(ran successfully at 10:02, runs every 20 minutes.). That time
doesn't show up at all in the Job History for that job, nor is there a
log file, nor was a report txt file generated.
10:29 - openquery job 1 fails to run. Connections are sluggish, but
open connections can run queries.
10:30 - openquery job 2 fails to run. (not the same query as 1 or 3)
Connections are sluggish, but open connections can run queries.
10:30-10:40 - Enterprise Manager cannot connect - stuck during
connection. This is the case on multiple machines, as well as on the
server. My Enterprise Manager doesn't respond, and I cannot start a
new instance. "Select getdate()" can take several seconds to run, and
I get a "Lost connection" error.
10:40 - openquery job 3 fails to run. Profiler shows my openquery job
was the last thing run - no further profile messages for the next 2-3
minutes.
10:43 - We start shutting down the server.

What do the system resources look like when the box starts to drag its
heels? Is the CPU pegged really high? Has it been a long time since the
last reboot? Is SQL using pretty much all the memory?
I had a rare memory fragmentation issue about a year ago on one of the
earlier versions of SQL 2000 (certain pre-SP3). It occurred after adding
more than 3GB of RAM to the box and turning on /3GB in boot.ini. Our backups
(tlog & full) were taking ages to complete and the CPU would go nuts. I had
a Microsoft Premier case going for a couple of months trying to sort it out. It
turned out to be a memory fragmentation issue that was most apparent during
backups when the backup process would check for the largest contiguous chunk
of memory to use for a particular part of the backup operation, so even tiny
backups (< 2MB) were taking 30 seconds to complete (when you multiply that
by 100+ databases and repeat that hourly then that becomes some serious
time).
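
To put rough numbers on that, here is a minimal back-of-the-envelope sketch in Python. It assumes the backups run one after another and takes "100+" as exactly 100, so the result is a lower bound; the function name is just for illustration.

```python
# Rough cost of slow backups, using the figures quoted in the post above.
SECONDS_PER_BACKUP = 30   # observed time for even a tiny (< 2MB) backup
DATABASE_COUNT = 100      # "100+ databases", so treat this as a lower bound


def backup_seconds_per_cycle(databases, seconds_each):
    """Total wall-clock seconds, ASSUMING backups run strictly one at a time."""
    return databases * seconds_each


total = backup_seconds_per_cycle(DATABASE_COUNT, SECONDS_PER_BACKUP)
print(f"{total:.0f} s = {total / 60:.0f} min of every hour")
# prints "3000 s = 50 min of every hour"
```

In other words, under those assumptions an hourly backup cycle would eat most of the hour.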
Anyway, for our problem, which to me sounds vaguely similar to your problem,
a (pre-SP3) hotfix sorted us out (after troubleshooting the issue by turning
on a bunch of trace flags at PSS's request). It may not be the same
thing...<shrug>. You need to figure out what's going on with the physical
resources on your box (CPU, memory, NIC, disk I/O, etc.). It may be the
case that SP3a (which you should think about installing some time soon
anyway - remember the SQL Slammer worm?!?!?) fixes your issue...maybe.
HTH
--
Cheers,
Mike

"Mike Hodgson" <mwh_junk@.hotmail.com> wrote in message news:<OsWUqF7pEHA.132@.TK2MSFTNGP14.phx.gbl>...
> What do the system resources look like when the box starts to drag its
> heels? Is the CPU pegged really high? Has it been a long time since the
> last reboot? Is SQL using pretty much all the memory?
CPU is at about 1%. We've got it set up so that there's about 300MB
free on the system, and the last reboot occurred less than a week
before this problem surfaced.
However, the bit about the large database backups is intriguing. I'll
have to check that. Thanks.

bourgon@.gmail.com (Michael Bourgon) wrote in message news:<558b578d.0410040510.5139efb2@.posting.google.com>...
> CPU is at about 1%. We've got it set up so that there's about 300MB
> free on the system, and the last reboot occurred less than a week
> before this problem surfaced.
> However, the bit about the large database backups is intriguing. I'll
> have to check that. Thanks.
As a follow-up, we believe the backups were the cause. We saw some potential
memory problems with the process associated with the NIC, so we
changed the database backups to stay on the local system instead of
backing up to a different computer. This seems to have stabilized it.
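
One early warning sign in the timeline at the top of this thread was a transaction log backup, scheduled every 20 minutes, that silently skipped its 10:22 run. Below is a minimal Python sketch of spotting that kind of gap from a list of run times. It assumes the timestamps have already been exported from the job history somehow; the function name, the one-minute tolerance, the date, and the sample times other than 10:02 and 10:22 are all hypothetical.

```python
from datetime import datetime, timedelta


def find_missed_runs(run_times, interval, tolerance=timedelta(minutes=1)):
    """Return the expected run times that fall in gaps between actual runs.

    run_times: datetimes of runs that DID happen (any order).
    interval:  the schedule spacing, e.g. timedelta(minutes=20).
    tolerance: slack allowed before a run counts as missed (an assumption).
    """
    missed = []
    ordered = sorted(run_times)
    for prev, nxt in zip(ordered, ordered[1:]):
        expected = prev + interval
        # Walk forward one interval at a time until we reach the next real run.
        while nxt - expected > tolerance:
            missed.append(expected)
            expected += interval
    return missed


# Hypothetical sample: only the 10:02 success and the missing 10:22 run
# come from the thread; the date and the other times are made up.
day = datetime(2004, 1, 1)
runs = [day.replace(hour=9, minute=42),
        day.replace(hour=10, minute=2),
        day.replace(hour=10, minute=42)]

for t in find_missed_runs(runs, timedelta(minutes=20)):
    print("missed run expected at", t.strftime("%H:%M"))
# prints "missed run expected at 10:22"
```

Note that this only finds gaps between two runs that did happen. To catch a job that stopped and never came back, you would need to append the current time to the list as a final checkpoint.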