Establishing a Root Cause Failure Analysis Program at a Pharmaceutical Facility
By Anil Agrawalla, CMRP, Life Cycle Engineering
Your clean steam generator system stops and alarms. Production halts; operations quickly calls maintenance. Maintenance jumps into action and determines that the bearing seized in the feed water pump. A bearing order is expedited through procurement, maintenance efficiently makes the repair the next morning, and the production team runs tests before putting the system back into operation. Corrective actions are created to ensure that the bearing is stocked in the MRO storeroom, and to double the frequency of the pump’s preventive maintenance. The senior leadership team is satisfied with the response and corrective actions, and praises the team for limiting the production delay to just 24 hours.
Does this scenario sound familiar? A critical piece of equipment fails and the facility scrambles to get the system back to operation. Corrective actions are implemented to reduce future failures, but are done without asking why the equipment failure occurred. Resources aren’t allocated to identify the root cause on a significant failure like this, yet resources are dedicated to other issues that don’t seem as important to the business. This gap creates the opportunity to implement a Root Cause Failure Analysis (RCFA) program.
Implementing an RCFA program at a pharmaceutical manufacturing site can be a large undertaking, especially for failures that don’t affect product quality. Site resources and priority are given to investigations for quality-related issues due to the strict nature of FDA and regulatory requirements for issues that can impact a patient’s health. Non-quality-related investigations can significantly impact business but often don’t get the same scrutiny as quality investigations.
Including non-quality-related failures in a well-planned RCFA program can bring substantial value to the company. Three factors are key to program success when implementing an RCFA program in a pharmaceutical environment:
- Having a well-defined process that provides a strong foundation / framework
- Aligning the RCFA program with the existing quality investigation process to help ease the introduction to the site and reduce the resources required for implementation
- Communicating the program value to senior leadership and the rest of the site to achieve site-wide adoption and ensure the program persists
A six-step RCFA process will provide the needed framework. Alignment with the existing quality investigations occurs within each step.
Step 1 - Notification
The notification step is a pre-defined set of triggers that initiate an RCFA investigation whenever a failure event results in significant loss to the facility. As a general note, an RCFA program can investigate any type of significant event. For this program, the scope is limited to equipment-related failures. Although the initial failure symptoms experienced are equipment-related, these failures may end up having root causes that are not directly attributed to the equipment.
The triggers used to initiate an RCFA investigation should encompass all aspects of the business including major safety incidents, environmental violations, product quality issues, high maintenance costs, and production downtime events. The triggers should look for single, significant failure events as well as smaller, repetitive failures that occur frequently. Though a single, smaller failure would normally not be investigated, the chronic and repetitive nature can lead to a significant cumulative cost to the company. The type of trigger and its trigger level need to be customized for each company. The triggers should be set so that the cost of the root cause investigation is less than the business cost experienced by the company.
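As a minimal sketch of how single-event triggers could be encoded, the Python below checks one failure event against a set of thresholds and applies the cost test described above. The field names and the trigger levels (8 hours of downtime, 10,000 in repair cost) are hypothetical placeholders, not values from this article; each site would set its own.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailureEvent:
    """One equipment failure event (hypothetical fields for illustration)."""
    asset: str
    downtime_hours: float
    repair_cost: float
    safety_incident: bool = False
    environmental_violation: bool = False


# Assumed trigger levels; these must be customized for each company.
DOWNTIME_TRIGGER_HOURS = 8.0
REPAIR_COST_TRIGGER = 10_000.0


def triggered_reasons(event: FailureEvent) -> List[str]:
    """Return every trigger this single event meets (empty list = no RCFA)."""
    reasons = []
    if event.safety_incident:
        reasons.append("major safety incident")
    if event.environmental_violation:
        reasons.append("environmental violation")
    if event.downtime_hours >= DOWNTIME_TRIGGER_HOURS:
        reasons.append("production downtime")
    if event.repair_cost >= REPAIR_COST_TRIGGER:
        reasons.append("high maintenance cost")
    return reasons


def investigation_is_worthwhile(business_cost: float,
                                estimated_investigation_cost: float) -> bool:
    """The investigation should cost less than the failure cost the business."""
    return estimated_investigation_cost < business_cost


# Example: the feed water pump failure from the opening scenario,
# with an illustrative repair cost.
event = FailureEvent("feed water pump", downtime_hours=24, repair_cost=3_500)
reasons = triggered_reasons(event)
```

Here the 24-hour outage meets the assumed downtime trigger even though the repair itself was inexpensive, which is the kind of event the opening scenario describes.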
The Quality, Safety, and Environmental departments may already have their own investigation triggers for failure events that have a quality or safety effect. Since the process, tools used, and goals of those investigations are similar to ones used in an RCFA program, a separate analysis is generally not needed. A maintenance or reliability employee trained in the site’s RCFA program should join the existing investigation process to help with identifying and mitigating the root causes. High maintenance repair costs and production downtime can have a significant financial impact, so both of these can be triggers for an investigation. Triggers for chronic events can include repetitive causes that are trended in the CMMS’s failure coding, or excessive stoppages noticed on the production floor during production runs.
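A chronic-event trigger based on failure coding could look something like the sketch below. It assumes the CMMS can export work orders as (asset, failure code, date) records; that export format, the 90-day window, and the threshold of three repeats are illustrative assumptions rather than recommendations from this article.

```python
from collections import Counter
from datetime import date, timedelta


def chronic_failures(work_orders, as_of, window_days=90, min_count=3):
    """Find (asset, failure code) pairs that repeat within a rolling window.

    work_orders: iterable of (asset, failure_code, failed_on) tuples,
    a hypothetical extract from the CMMS failure coding.
    """
    cutoff = as_of - timedelta(days=window_days)
    counts = Counter(
        (asset, code)
        for asset, code, failed_on in work_orders
        if cutoff <= failed_on <= as_of
    )
    # Keep only the pairs that reach the assumed chronic threshold.
    return {key: n for key, n in counts.items() if n >= min_count}


# Illustrative data only.
work_orders = [
    ("P-101", "BEARING", date(2024, 1, 5)),
    ("P-101", "BEARING", date(2024, 2, 10)),
    ("P-101", "BEARING", date(2024, 3, 1)),
    ("P-101", "SEAL", date(2024, 2, 20)),
    ("F-200", "JAM", date(2023, 6, 1)),
]
chronic = chronic_failures(work_orders, as_of=date(2024, 3, 15))
```

With this sample data, only the repeated bearing failures on the one pump reach the threshold; the single seal failure and the older jam do not.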
A process map of the Notification step, shown in the figure below, displays an example set of triggers. As the figure shows, there is a clear definition and level for each trigger, which reduces the ambiguity of when an RCFA should be triggered. The success factor for this first step of the RCFA program is to ensure that the rest of the facility is aware of the RCFA program and its triggers. All employees, from operators to directors, are responsible for identifying equipment failures that trigger an RCFA investigation and notifying the RCFA process owner.
Step 2 - Clarification
Clarification entails the gathering of information necessary to analyze the failure events. Once an RCFA is triggered, an RCFA facilitator should be quickly assigned to gather any evidence or data related to the triggered event before the evidence disappears. The evidence can include physical parts, data logs, manufacturing batch records, and interviews with employees involved in the event. It is critical for the RCFA facilitator to be skilled in investigation techniques, particularly those involving interviewing personnel. The facilitator should be able to gather factual event information without alienating the interviewees. These techniques are often taught through the Quality department for quality investigations, and can be utilized to train the RCFA facilitator.
During this step, the RCFA facilitator is also responsible for identifying cross-functional team members needed for the investigation. The cross-functional team, drawn from varying areas of expertise and departments, is led by the RCFA facilitator. This team should review the failure event and determine the information and evidence required for the investigation. The facilitator should log the team member responsible for collecting each piece of evidence and its due date. Finally, the RCFA team should quantify the business cost of the failure event being analyzed. Assigning the failure event a monetary value helps explain the effect of the failure to the senior leadership team and others at the company. A sample list of cost factors is given in the table below, and should be used to calculate the cost of the failure. In addition to the cost of the failure, the facilitator should also log the costs and time spent during the investigation. These figures will be used to show the total cost of the investigation, and will be compared to the benefit achieved from mitigating future failures.
Types of Cost | |
Maintenance Labor for repair | Expediting Costs for Parts/Material |
Material/Parts used for Repair | Lost Production Capacity |
Outside Vendor/Repair Costs | Cost of Environmental Violations |
Lost Production Labor due to standby | Cost of Injuries / Loss Time of Work |
Lost Production Material/Product Costs | Product Quality Defects |
Additional Overtime needed | Other |
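One simple way to total the cost of a failure event from the categories in the table is sketched below. The category names are taken from the table; the function name and the dollar amounts in the example are illustrative assumptions only.

```python
# Cost categories copied from the table above.
COST_CATEGORIES = (
    "Maintenance Labor for repair",
    "Expediting Costs for Parts/Material",
    "Material/Parts used for Repair",
    "Lost Production Capacity",
    "Outside Vendor/Repair Costs",
    "Cost of Environmental Violations",
    "Lost Production Labor due to standby",
    "Cost of Injuries / Loss Time of Work",
    "Lost Production Material/Product Costs",
    "Product Quality Defects",
    "Additional Overtime needed",
    "Other",
)


def total_failure_cost(costs):
    """Sum the logged costs, rejecting categories that are not in the table."""
    unknown = set(costs) - set(COST_CATEGORIES)
    if unknown:
        raise ValueError(f"Unknown cost categories: {sorted(unknown)}")
    return sum(costs.values())


# Hypothetical figures for a pump bearing failure.
pump_failure = {
    "Maintenance Labor for repair": 1200.0,
    "Expediting Costs for Parts/Material": 800.0,
    "Material/Parts used for Repair": 450.0,
    "Lost Production Capacity": 60000.0,
}
failure_cost = total_failure_cost(pump_failure)  # 62450.0
```

Rejecting unrecognized categories keeps every investigation reporting against the same list, which makes the costs comparable from one event to the next.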
Step 3 - Analysis
In the analysis step the team uses their problem-solving tools to identify the root causes. Complex failure events call for a more rigorous RCFA method. A fishbone analysis may be used for a simple failure event, whereas a fault tree analysis would be more appropriate for a complex failure event. These problem-solving tools used in root cause investigations are similar to those used for quality investigations. Using common problem-solving tools will reduce the resources needed to train team members, since training for quality investigations at a pharmaceutical facility is usually well-documented and already implemented. In addition, senior leaders will be able to easily understand the analysis results due to their previous exposure to these problem-solving tools.
A key aspect of the analysis is that all assumptions and hypotheses should be proven to be true or false by actual data. A root cause cannot be considered actionable unless it has been validated. Each hypothesis and assumption should be logged, and an action should be identified to help prove its validity. If no method exists to prove its validity, that assumption or hypothesis needs to be clearly identified on the problem-solving tool.
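The logging rule above can be sketched as a small data structure. This is only one possible shape for such a log; the class, field, and status names are hypothetical and would normally follow whatever the site's existing quality investigation forms use.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    OPEN = "open"                      # logged, not yet tested against data
    PROVEN_TRUE = "proven true"
    PROVEN_FALSE = "proven false"
    UNVERIFIABLE = "cannot be verified"


@dataclass
class Hypothesis:
    statement: str
    validation_action: Optional[str]   # None when no method exists to test it
    status: Status = Status.OPEN


def is_actionable(h: Hypothesis) -> bool:
    """Only a hypothesis validated by actual data can drive corrective actions."""
    return h.status is Status.PROVEN_TRUE


def must_be_flagged(h: Hypothesis) -> bool:
    """Hypotheses with no way to prove them must be marked on the analysis tool."""
    return h.validation_action is None or h.status is Status.UNVERIFIABLE


# Illustrative entries based on the seized bearing example.
log = [
    Hypothesis("Bearing was over-greased", "Inspect failed bearing",
               Status.PROVEN_TRUE),
    Hypothesis("Grease type was incorrect", "Send grease sample to lab"),
    Hypothesis("Operator skipped a step last year", None),
]
actionable = [h.statement for h in log if is_actionable(h)]
flagged = [h.statement for h in log if must_be_flagged(h)]
```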
The final output from this step is the identification of all the root causes that led to the failure event. Depending on the complexity of the event, there could be a single root cause, or there could be multiple contributing root causes that combined to cause the failure event. These causes can be classified as three different types: physical root, human root, and latent root.
Physical roots are contained in the physical evidence gathered after failure, and are related to the physical failure mechanism rather than being the true root cause. An example is a pump failure caused by a seized roller bearing that was over-greased. While the seized bearing was the physical cause of the failure, it doesn’t address the human behavior that led to the over-greasing.
The human root refers to inappropriate human intervention that led to the failure. In the seized bearing example, the human root cause could have been “Not following the SOP for greasing bearings.” This human oversight doesn’t tell us why the SOP wasn’t followed. Usually, there is a systemic issue causing the SOP to not be followed, which leads into the third root type.
Latent roots ask why the human action was allowed or wasn’t properly detected or prevented. In the same example, the employee may not have followed the SOP because he wasn’t adequately trained on the SOP, or the SOP task wasn’t specific on how to properly grease the bearing. While any of the three types of root causes can be addressed for correction, the latent root is the most effective root to develop corrective actions for preventing future occurrences. It addresses the systemic issue, rather than looking at a specific employee’s error or the physical effects of a failure.
Step 4 - Corrective Action Evaluation
In this step, the team determines corrective actions for each of the root causes identified during the analysis, and determines their financial feasibility. A cost-benefit analysis will help determine which corrective actions, once implemented, will provide the most value to the company. Quantifying the value of these corrective actions will help convey the importance of their implementation to the senior leadership team, and increase their acceptance of the RCFA program. The table below shows examples of costs and benefits to consider when evaluating each corrective action.
Types of Corrective Action Implementation Costs | Types of Corrective Action Implementation Benefits |
Maintenance labor | Reduced production downtime |
Maintenance material/parts | Energy reduction |
Lost production due to scheduled downtime | Reduced environmental fines |
Engineering costs | Reduced capacity losses |
Outside vendor costs | Reduced overtime labor premiums |
Training costs | Reduced quality losses |
Procedure development costs | Capital avoidance |
Policy change costs | Reduced part usage |
Additional cost to implement on other assets | Reduction in PMs / SOPs |
 | Additional benefit from implementing on other assets |
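A cost-benefit comparison of candidate corrective actions could be sketched as follows. The benefit-to-cost ratio and simple payback period used here are common conventions chosen for illustration, not a method prescribed by this article, and the actions and amounts are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CorrectiveAction:
    name: str
    implementation_cost: float   # sum of the cost types in the table
    annual_benefit: float        # sum of the benefit types in the table

    def benefit_cost_ratio(self) -> float:
        if self.implementation_cost <= 0:
            raise ValueError("implementation_cost must be positive")
        return self.annual_benefit / self.implementation_cost

    def payback_years(self) -> float:
        if self.annual_benefit <= 0:
            raise ValueError("annual_benefit must be positive")
        return self.implementation_cost / self.annual_benefit


def rank_by_value(actions: List[CorrectiveAction]) -> List[CorrectiveAction]:
    """Order actions from highest to lowest benefit-to-cost ratio."""
    return sorted(actions, key=lambda a: a.benefit_cost_ratio(), reverse=True)


# Illustrative options for the over-greased bearing example.
options = [
    CorrectiveAction("Stock spare bearing", 2_000, 6_000),
    CorrectiveAction("Install automatic lubricator", 15_000, 30_000),
    CorrectiveAction("Revise greasing SOP and retrain", 5_000, 40_000),
]
ranked = [a.name for a in rank_by_value(options)]
```

In this made-up comparison the action aimed at the latent root (the SOP and training) ranks first, ahead of the actions that only address the physical symptom.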
Once corrective actions have been selected, a metric should be identified for each one to measure the effectiveness of its results and to quantify its benefit. Each metric should directly reflect the corrective action’s ability to reduce the effect originating from the root cause.
Step 5 - Verification
The verification step ensures that the corrective actions are implemented promptly, and validates that the corrective actions are as effective as expected. Prior to implementation, the corrective actions and analysis work should be presented to a governing body that can critique and review the analysis. This governing body should consist of senior leadership, technical managers, and any other pertinent employees who are familiar with the RCFA investigation. The goal is to validate the analysis and identified corrective actions, and to acquire appropriate funding for implementing the corrective actions. Once approved, the RCFA facilitator should track the corrective actions assigned to team members and provide them with a realistic due date. The facilitator will ensure that the corrective actions are complete and have delivered the expected return on investment. The team should also review whether the corrective actions can be applied to similar systems within the facility.
Step 6 - Documentation
The last step of the RCFA process, documentation, includes the final reporting of the results, storage of documents, and communicating results of the RCFA program. Each investigation should conclude with a final report. Aligning the report style with an existing report style from quality investigations will help the company speak a common investigation language. These reports should be made available in a central digital repository that is accessible and searchable by all site employees. If possible, the CMMS can be used to link the investigation, the analysis, and the corrective actions. Along with a final report, the RCFA facilitator needs to communicate any success stories with the rest of the site. Marketing these results will help display the importance and value of the effective RCFA program. Ideally, the repository will also allow for trending and analysis of root cause failure categories to help identify systemic issues.
Part of the documentation step is measuring your program and looking for opportunities to improve. Key Performance Indicators (KPIs) are used to share the status and value of the RCFA program. The KPIs should be used to measure and trend behaviors within the RCFA program, indicate the current snapshot of the program, and show the program’s overall value. An example of a behavioral KPI is the Percentage of Trigger Notifications Received within 1 Day of the Failure Event. A low percentage shows that the rest of the site isn’t quickly notifying the RCFA team, which means that there is low visibility, low importance, and/or low level of acceptance of the RCFA program. Another behavioral KPI example is the Percentage of RCFA Investigations Started within 2 Days of Notification. A failure event’s data and information disappear very quickly, so it is important that RCFA investigations start promptly to ensure the integrity of the evidence. Examples of KPIs that show a snapshot of the program status are the Percentage of RCFA Investigations Open & Completed, and the Percentage of Corrective Actions that are In Progress, Completed, or Denied. A final metric, used to show the value of the RCFA program, is the Return on Investment of the RCFA Program. This ROI metric compares all the costs incurred, including the cost of the investigation itself and the implementation cost of the corrective actions, to the realized cost savings and/or cost avoidance of the corrective actions that were implemented.
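The behavioral KPIs and the ROI metric described above could be calculated along the lines of the sketch below. The function names are hypothetical, and the ROI formula, (savings minus total cost) divided by total cost, is the conventional definition assumed here; the article itself only says that costs are compared to realized savings.

```python
from datetime import date


def pct_within(date_pairs, max_days):
    """Percentage of (earlier, later) date pairs no more than max_days apart.

    Use (failure date, notification date) with max_days=1, or
    (notification date, investigation start date) with max_days=2.
    """
    if not date_pairs:
        return 0.0  # no events logged yet
    on_time = sum(1 for start, end in date_pairs
                  if (end - start).days <= max_days)
    return 100.0 * on_time / len(date_pairs)


def program_roi(realized_savings, investigation_cost, implementation_cost):
    """Assumed ROI convention: net benefit divided by total cost incurred."""
    total_cost = investigation_cost + implementation_cost
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    return (realized_savings - total_cost) / total_cost


# Illustrative data: failure date and the date the RCFA team was notified.
notifications = [
    (date(2024, 3, 1), date(2024, 3, 1)),
    (date(2024, 3, 4), date(2024, 3, 5)),
    (date(2024, 3, 10), date(2024, 3, 13)),
    (date(2024, 3, 20), date(2024, 3, 21)),
]
notified_on_time = pct_within(notifications, max_days=1)   # 75.0
roi = program_roi(realized_savings=120_000,
                  investigation_cost=10_000,
                  implementation_cost=30_000)              # 2.0
```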
Conclusion
Setting up a process to investigate non-quality issues at a pharmaceutical site can be a challenge due to the priority given to FDA and regulatory compliance events. However, the company can reduce risk, avoid costly failures, and increase their manufacturing output by implementing an RCFA program to investigate failure events that aren’t related to quality. Focusing on three factors – process, alignment and communication – will ensure a successful implementation of an RCFA program at your pharmaceutical facility. A well-defined process provides the strong foundation the program needs to become a permanent solution for reducing future risks. Aligning the RCFA program with the quality investigation process helps gain efficiencies in tools utilized, training, and resources required. This alignment also helps ease the introduction to the site and increase management’s acceptance of the program due to their familiarity with the existing problem-solving concept. Communicating the value of the RCFA program is the final key factor in a successful implementation. Highlighting success stories, sharing key performance indicators, and communicating the program’s value in a simple dollars and cents valuation will assure site-wide awareness and sustain the adoption of the RCFA program.
Anil Agrawalla is a Senior Reliability Engineering Subject Matter Expert with Life Cycle Engineering (LCE). Prior to joining LCE, Anil spent more than 10 years as a reliability leader in a variety of industries, including pharmaceuticals and oil and gas. A leader dedicated to helping teams achieve their goals, Anil maintains a strong focus on sustaining gains and delivering value. You can reach Anil at aagrawalla@LCE.com.
© Life Cycle Engineering, Inc.