When a Monthly Server Restart Becomes a Four-Plant Production Stoppage
We had a problem that brought the kind of attention IT never wants.
About once a month, our SQL Server would slow to a crawl and need to be restarted. The cause was an ODBC connection to our ERP system that wouldn’t terminate properly after running queries through a linked server.
Most of the time, it worked fine. But when it hung, the entire SQL Server ground to a halt. Manufacturing reports wouldn’t load. Real-time data feeds would stall. And because this server supported operations across four plants, that meant production stoppages at all four locations.
Fifteen minutes or more of downtime. Dozens of people unable to work. Phone calls from plant managers. Executives asking questions.
For something that only happened once a month, it created an outsized amount of pain—and visibility.
The Two-Server Solution
I couldn’t fix the ODBC driver itself (we tried). So instead, I changed the architecture to isolate the problem.
Here’s what I did:
- Created a dedicated intermediary server to handle the ODBC connection to the ERP
- Linked that server to our main SQL Server using a standard SQL Server linked server connection
- Moved all ERP ODBC queries to run on the intermediary server
Now when the ODBC connection hangs, it only affects the intermediary server. I can restart that server without touching the main production SQL Server.
The four plants keep running. No one even knows there’s an issue. And IT stops getting those calls.
The Silent Failure Problem
The isolation strategy solved the visibility problem, but it created a new challenge.
The ODBC connection has an annoying quirk: when it times out, it doesn’t throw an error. It just logs a warning message and continues.
SQL Agent jobs treat warnings as success. So our automated data warehouse processes would “complete successfully” even though they’d actually failed to pull data from the ERP. The job would stop partway through, but I wouldn’t know about it until someone noticed missing data later—or until the next night’s run failed because yesterday’s data never loaded.
For a data warehouse, this is a serious problem. The ETL processes have dependencies. If step 3 doesn’t run because step 2 silently failed, you’re building on an incomplete foundation. And you might not notice for days.
Making Failures Visible
The solution was to add intelligence to the SQL Agent job itself.
After each step that queries the ODBC connection, I added a new step that:
- Queries the SQL Agent job history for the previous step
- Searches the step message for the timeout warning text
- Forces the job to fail if the warning is present
Here’s the core logic:
-- Check if the previous step encountered a timeout warning
DECLARE @StepMessage NVARCHAR(MAX);
SELECT TOP 1 @StepMessage = message
FROM msdb.dbo.sysjobhistory
WHERE job_id = CONVERT(UNIQUEIDENTIFIER, $(ESCAPE_NONE(JOBID)))
AND step_id = (SELECT step_id FROM msdb.dbo.sysjobsteps
WHERE job_id = CONVERT(UNIQUEIDENTIFIER, $(ESCAPE_NONE(JOBID)))
AND step_name = 'Pull Data from ERP')
ORDER BY run_date DESC, run_time DESC;
-- Check for timeout warning in the message
IF @StepMessage LIKE '%timeout%' OR @StepMessage LIKE '%connection failed%'
BEGIN
RAISERROR('ODBC connection timeout detected in previous step', 16, 1);
END
Now when the ODBC connection times out, the job actually fails. I get an alert. I restart the intermediary server, and the next scheduled run picks up where it left off.
More importantly, the failure is visible and immediate. I don’t find out three days later that we’ve been building reports on incomplete data.
The Business Impact
Before these changes, we had monthly incidents where the main SQL Server needed restarting during business hours. Each incident meant 15+ minutes of downtime across four manufacturing plants. That’s the kind of disruption that gets escalated to senior leadership.
After implementing the two-server architecture and timeout detection:
- Zero production SQL Server restarts due to ODBC issues in the past year
- The intermediary server restarts occasionally, but it’s invisible to users
- Data warehouse jobs fail loudly instead of silently, so I can address issues before they compound
- IT stops being the reason production stopped
That last point matters more than you might think. When you’re the department that caused four plants to go down, it doesn’t matter that it was only once a month. You’re the problem. When it stops happening, people forget you exist—which is exactly what you want.
Lessons Learned
1. Isolation is a valid strategy.
Sometimes you can’t fix the root cause, but you can contain the damage. The intermediary server adds a layer of complexity, but it prevents one flaky connection from taking down critical infrastructure.
2. Warnings aren’t always harmless.
Just because something doesn’t throw an error doesn’t mean it succeeded. In automated systems, you need to actively monitor for “soft failures” that masquerade as success.
3. SQL Agent job history is underutilized.
Most people use it for troubleshooting after problems occur. But you can query it programmatically during job execution to make your jobs self-aware and responsive to their own behavior.
4. Visibility matters as much as reliability.
A system that fails loudly is often better than one that fails silently. The intermediary server doesn’t prevent ODBC timeouts—it just makes them visible and manageable instead of catastrophic.
Not Every Problem Needs a Perfect Solution
I didn’t fix the ODBC driver. It still times out occasionally. But by accepting that reality and designing around it, we built something more resilient than what we had before.
The system isn’t perfect. But it’s predictable, manageable, and most importantly—it doesn’t stop production.
Sometimes that’s not just good enough. It’s actually the better outcome.