Customer Calls Reliability 360: Addressing Future Challenges @ the Speed of Light
By Philimar Menard
The Misconception of Reliability and Availability
After a 10-year reliability journey that took me across the appliance, power generation, aerospace, component placement, electronics design, wireless, and finally cable industries, I would like to share my astonishing discovery with friends and foes alike. My focus will be on the cable industry, where my recent study revealed some very interesting things.
My dear friends, I wish I could tell you the result was positive or at least somewhat encouraging. For starters, there seems to be a HUGE misunderstanding of reliability versus availability. This confusion has become the Achilles’ heel of many industries, including the cable industry. Studies so far show the cable industry spends tens of billions of dollars yearly to maintain an availability of 99.98% or less. The question is why. I would like to explore this uncharted territory with you through a study I conducted over nearly a year in this industry.
Let’s start with a graph I call the “P. Menard Comparative plot” of the reliability and availability of cable’s hybrid fiber-coax (HFC) network. Readers familiar with HFC network operation know this is the most vital sector for a Multiple System Operator (MSO).
It is the pipe that aggregates the downstream and upstream services transmitted back and forth between the customers and the head-ends (HE). This is truly the workhorse of the cable industry, and by far the most strenuous and costly part to operate. Without further delay, let’s take a look at the two curves below:
[Figure: P. Menard Comparative plot of availability (blue) and reliability (red) for the HFC network]
The graph above illustrates the underlying cost of operation of the HFC plant. In this N+6 configuration analysis, serving between 750 and 1000 potential customers, one can see the huge difference between the system’s availability and its reliability. This simulation was based purely on field data collected on the component reliability of the network elements (amplifiers, LEs, taps, coaxial cable, optical fiber, optical electronics, etc.) to validate the performance of the HFC network. The blue line depicts the availability performance, and the red line shows the overall system reliability curve. To say the two curves are diverging is an understatement. To everyone’s surprise, we end up with a system reliability of around 60% in the first year and around 10% by the fifth year, while the availability remains constant at around 98.96%. What does 60% first-year and 10% fifth-year reliability mean? To explain this, I would like to use a little example. Suppose you are a maintenance manager running an HFC plant with 100 nodes. You should expect about 40 of those nodes to fail in the first year and about 90 to have failed by the end of the fifth year of operation (each failure requiring a truck roll).
A system with a 90% failure rate within its first 5 years of operation is a BAD SYSTEM. To my amazement, many involved in the development and maintenance of HFC networks find this quite typical. The term used is “the cost of doing business.” The focus is definitely on the availability of the network, not its reliability. HFC technicians are extremely proficient at swapping equipment and restoring service, primarily because they have to do so regularly; it is an accepted evil. These truck rolls are extremely costly and keep key resources in fire-fighting mode, so innovation and continuous improvement suffer. The bottom line is that network operation costs are escalating, increasing 10-20% yearly. This high cost of ownership is expensive for the cable operator and for the customers; no wonder my cable bill is SO HIGH.
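The arithmetic behind the 100-node example can be sketched in a few lines of Python. The function names are hypothetical, and the constant-failure-rate (exponential) model in the second half is an assumption used only as a sanity check; it is not the simulation model used in the study.

```python
import math


def expected_failed_nodes(nodes: int, reliability: float) -> float:
    """Expected number of nodes that have failed at least once."""
    return nodes * (1.0 - reliability)


def implied_failure_rate(reliability: float, years: float) -> float:
    """Failure rate per year implied by R(t) = exp(-rate * t).

    Assumes a constant failure rate, for illustration only.
    """
    return -math.log(reliability) / years


nodes = 100
year1 = expected_failed_nodes(nodes, 0.60)  # first-year reliability of 60%
year5 = expected_failed_nodes(nodes, 0.10)  # fifth-year reliability of 10%
print(f"Expected failed nodes by year 1: {year1:.0f}")
print(f"Expected failed nodes by year 5: {year5:.0f}")

# Sanity check: extrapolate the first-year figure out to year 5.
rate = implied_failure_rate(0.60, 1.0)
reliability_year5 = math.exp(-rate * 5.0)
print(f"Extrapolated 5-year reliability: {reliability_year5:.1%}")
```

Under the constant-rate assumption, the 60% first-year figure extrapolates to a five-year reliability of roughly 8%, which is in the same range as the 10% read from the curve.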
Without getting too technical, I would like to draw the reader’s attention to the specific difference between system reliability and system availability.
First of all, let’s define Reliability.
Consideration of reliability is the backbone of any good business strategy and prospect for growth. Reliability is the probability that an item will perform its intended functions without failure for a specified interval under stated conditions.
There are several key words in the above definition that require clarification. The first is the word “probability”. Probability is a ratio, and in reliability it is the number of successes divided by the number of attempts. This means it is a numeric value, derived from a calculation. It is not based on opinion, speculation, “seat of the pants”, rule of thumb, etc. The latter are only guesses or hopes, and business cannot be based on such elusive measures.
The second important term is “perform its intended functions”. This suggests that the functions an item, element or system needs to perform have been identified and agreed upon. Many times, an item is deemed unreliable even though it performs the functions that have been identified for it. The problem is that not all the functions it needs to perform were identified. For example, let’s say a part has a certain mass that dampens the vibration of an assembly. A decision is made to reduce the mass of the item in order to reduce its cost. If the function of dampening vibration is not identified, then the change may go through: the item by itself performs its intended function, but the assembly may fail due to increased vibration.
The third important term in the definition of reliability is “without failure”. This implies a failure has been defined. In some circumstances, this may be self-evident (smoked board, won’t turn on, etc.). In others, a certain amount of degraded performance over time may be acceptable. In the case of an amplifier or line extender (LE), the gain may degrade over time, but as long as the customer does not experience picture degradation, color issues, etc., it may not be considered a failure. Defining the threshold, that is, the amount of degradation or drift that is acceptable, is sometimes difficult but is very important.
The
fourth item is “for a specified interval”.
Again, this is not an ambiguous statement like “a long time,” but is a
specific number and should be in units of measure relevant to the part. A specified interval of five years does not
mean much to a part. The specified
interval must be translated into relevant terms applicable to the part, element
or system such as: hours of activation, hours of operation, number of cycles,
etc., for it to be meaningful.
The last term to be considered is “under stated conditions.” This means that the environment the item, element or system operates in must be completely defined. Temperatures, temperature cycles, pressures, pressure cycles, corrosives, contaminants (e.g. household cleansers), maintenance items, and vibration must all be defined for an item to be robust in all operating conditions. This is the requirement most misunderstood by 99.99% of so-called reliability engineers. The engineer needs to use the worst-case temperatures, both cold and hot, as the gauge for temperature stress factors, and must also account for elevation, vibration, dust, etc. Through my study, I found that 99% of the vendors omit the temperature factor from their reliability calculation by setting the pi-T factor to 1: BAD PRACTICE. In addition, an amplifier designed for, let’s say, Georgia’s temperatures will not operate properly in places like Arizona or Nevada and other extreme climates.
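To see why fixing the temperature factor at 1 matters, here is a minimal sketch of a temperature multiplier based on the Arrhenius relationship, which is a common way to model this kind of factor. The activation energy (0.7 eV), the reference temperature, and the function name are assumptions chosen for illustration; they are not values from the study or from any particular vendor’s prediction method.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_factor(temp_c: float, ref_temp_c: float,
                     activation_energy_ev: float) -> float:
    """Failure-rate multiplier at temp_c relative to ref_temp_c.

    Returns 1.0 at the reference temperature, more than 1.0 when hotter,
    and less than 1.0 when cooler (Arrhenius model).
    """
    temp_k = temp_c + 273.15
    ref_k = ref_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / ref_k - 1.0 / temp_k)
    return math.exp(exponent)


# Hypothetical case: a rating referenced to 40 C, evaluated for hotter enclosures.
ASSUMED_ACTIVATION_ENERGY_EV = 0.7  # assumed, device dependent
for temp_c in (40.0, 50.0, 60.0):
    factor = arrhenius_factor(temp_c, 40.0, ASSUMED_ACTIVATION_ENERGY_EV)
    print(f"{temp_c:.0f} C -> failure-rate multiplier {factor:.2f}")
```

With these assumed values, a 20-degree rise multiplies the predicted failure rate several times over, which is exactly the effect that disappears when the factor is left at 1.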
When developing a technical requirement for an item, element or system, all five terms in the definition of reliability must be addressed. The reliability paragraph in a specification file should:
1. Call out a probability; for example, 0.95.
2. Define all functions of the item, element or system. (It could refer to a different paragraph in the specification file where the functional requirements are already stated.)
3. Define what a failure is and what is not; for example, failure to operate when commanded to, or greater than 20% change in resistance.
4. Define the specified interval or mission duration; for example, 1,000 hours energized or 900,000 cycles (note: you must then adequately define a cycle).
5. Define the stated conditions; for example, 50 degrees Celsius energized and 25 degrees Celsius when not energized.
All of the terms above are necessary for a thorough reliability specification requirement.
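The five-item checklist above can be captured as a small record with a completeness check. This is only a sketch: the class name, field names, and the example function list are hypothetical, and the sample values are the ones quoted in the list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReliabilityRequirement:
    """One reliability paragraph of a specification (hypothetical layout)."""
    probability: float        # item 1, e.g. 0.95
    functions: List[str]      # item 2, every function the item must perform
    failure_definition: str   # item 3, what counts as a failure
    interval: str             # item 4, mission duration in relevant units
    conditions: str           # item 5, the stated operating conditions

    def missing_elements(self) -> List[str]:
        """Return the names of checklist items that are absent or invalid."""
        missing = []
        if not 0.0 < self.probability <= 1.0:
            missing.append("probability")
        if not self.functions:
            missing.append("functions")
        if not self.failure_definition.strip():
            missing.append("failure_definition")
        if not self.interval.strip():
            missing.append("interval")
        if not self.conditions.strip():
            missing.append("conditions")
        return missing


spec = ReliabilityRequirement(
    probability=0.95,
    functions=["operate when commanded"],  # illustrative entry
    failure_definition="failure to operate when commanded, "
                       "or greater than 20% change in resistance",
    interval="1,000 hours energized",
    conditions="50 degrees Celsius energized, 25 degrees Celsius when not energized",
)
print(spec.missing_elements())
```

A requirement that leaves any of the five items blank shows up in the returned list, which makes an incomplete specification easy to catch before it goes to a vendor.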
To illustrate the process stated above, I would like to review the performance of two very well-known IP switches. The first graph is for an IP switch considered best in its class.
[Figure: performance of the first IP switch over 5 years of operation]
This graph illustrates the performance of a very well designed and reliable IP switch. In layman’s terms, this device is so reliable that it will experience nearly zero failures over 5 years of operation. Although this device has superior reliability, the sales team for this product finds it difficult to sell to the cable industry. Why? The answer is simple: the people in charge make their decisions based on immediate rewards (lowest cost and vendor relations) without taking into account the COST OF OWNERSHIP and COST OF OPERATION.
Now, let’s analyze a similar IP switching product line that offers a better initial price but exhibits a COST OF OWNERSHIP and COST OF OPERATION nearly 4 times higher than the first.
[Figure: performance of the second IP switch product line over 5 years of operation]
It is clear that this product line is not comparable to the first product analyzed. If we look at the performance at year 5 for this product, it is evident that its survival rate is about 38%, compared to the nearly 98% survival rate of the first product. However, to my amazement, the less reliable product is the leading IP switching choice for decision makers looking to make an immediate impact on their short-term strategy without regard to long-term business needs. This type of mentality needs to shift quickly in order to hold original equipment manufacturers (OEMs) accountable for their poor performance. Consumers must demand quality and reliability.
Now, let’s turn our examination to Availability by answering these 2 questions:
1) What is Availability?
2) Where does the responsibility for Availability lie?
What is Availability?
My definition of availability is as follows: Availability is the probability that an item, element or system is good and ready to go when needed. For example, I expect my car to start under all conditions (hot or cold, wet or dry) and to take me back and forth to my routine destinations day in and day out. When I take it to the mechanic for 2 to 3 hours of routine maintenance (oil and tire changes, fluid checks, etc.), my car is not available (so we have to shave a little off the 99.999% availability metric set forth by the manufacturers). Although my car is not available during routine maintenance, this is not a reliability hit, since routine maintenance is scheduled downtime called for by the manufacturer. On the other hand, if I have a transmission problem, both availability and reliability take a hit: it is a failure, not planned maintenance, and my car is unavailable to me during the failure and its repair. Some complex systems that are prone to failure include multiple schemes of redundancy (standby, dual processing, etc.), but this can be expensive. To examine systems with multiple schemes of redundancy, a technique called Markov modeling can be used to compute the overall system availability.
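As a minimal sketch of what such a Markov model looks like, consider the smallest redundant case: two identical units running in parallel, with a single repair crew. The states are the number of failed units (0, 1 or 2), and the system is up unless both have failed. The function names and the failure and repair figures below are hypothetical, chosen only to show the mechanics.

```python
def single_unit_availability(failure_rate: float, repair_rate: float) -> float:
    """Steady-state availability of one repairable unit."""
    return repair_rate / (failure_rate + repair_rate)


def parallel_pair_availability(failure_rate: float, repair_rate: float) -> float:
    """Steady-state availability of two parallel units with one repair crew.

    Markov chain on the number of failed units:
      0 -> 1 at rate 2 * failure_rate (either unit can fail)
      1 -> 2 at rate failure_rate     (the surviving unit fails)
      1 -> 0 and 2 -> 1 at rate repair_rate (one repair at a time)
    """
    # Unnormalized steady-state weights from the balance equations.
    w0 = 1.0
    w1 = w0 * (2.0 * failure_rate) / repair_rate
    w2 = w1 * failure_rate / repair_rate
    total = w0 + w1 + w2
    # The system is available in states 0 and 1.
    return (w0 + w1) / total


# Hypothetical unit: fails once per 10,000 hours, takes 4 hours to repair.
failure_rate = 1.0 / 10_000.0
repair_rate = 1.0 / 4.0
print(f"Single unit:   {single_unit_availability(failure_rate, repair_rate):.6%}")
print(f"Parallel pair: {parallel_pair_availability(failure_rate, repair_rate):.6%}")
```

Real networks need many more states (standby switching, common-cause failures, multiple crews), but the approach is the same: list the states, write the transition rates, and solve for the fraction of time spent in the states where service is up.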
In its simplest form, availability is a function of the mean time between failures (MTBF) and the mean time to repair (abbreviated MTBR in this article; more commonly written MTTR), as illustrated in the simple equation below:
A(t) = MTBF / (MTBF + MTBR)
Note: For A(t) to be large, the time between failures (MTBF) must be long and the time to repair (MTBR) must be short, so that the repair term adds as little as possible to the denominator.
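A short worked example makes the availability figures quoted in this article more tangible by converting them into downtime per year. The function names and the MTBF and repair-time values are hypothetical; only the formula comes from the text above.

```python
HOURS_PER_YEAR = 8760.0  # 365 days * 24 hours


def availability(mtbf_hours: float, repair_hours: float) -> float:
    """A = MTBF / (MTBF + mean repair time)."""
    return mtbf_hours / (mtbf_hours + repair_hours)


def annual_downtime_hours(avail: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Hypothetical element: fails every 10,000 hours on average, 4 hours to restore.
a = availability(10_000.0, 4.0)
print(f"Availability: {a:.4%}, downtime: {annual_downtime_hours(a):.2f} h/year")

# Availability targets mentioned in the article.
for target in (0.9998, 0.99999):
    minutes = annual_downtime_hours(target) * 60.0
    print(f"{target:.3%} -> {minutes:.1f} minutes of downtime per year")
```

An availability of 99.98% corresponds to about 1.75 hours of downtime per year, while 99.999% allows only about 5.3 minutes. Note that neither figure says how many separate failures, and therefore truck rolls, produced that downtime.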
Where does the responsibility for Availability lie?
Availability is a shared metric that needs careful attention by all parties involved (vendors, engineering, network operations, and field installation). In a great organization that values continuous improvement, here’s how the process goes: The engineering team works with the vendors to translate their system requirements into a technical specification. Upon clear agreement, the vendor develops and delivers robust and reliable products that meet the targeted MTBF (use my guideline above for defining reliability requirements). The operations team works with engineering all through the design, testing and validation phases to identify failure characteristics, troubleshooting guides, and corrective actions in order to minimize the overall system downtime, thus shortening the MTBR (the mean time to repair) to support the customers’ and contractual targets (Availability = 99.999%). The field team coordinates the builds with the operations and engineering teams per the vendor’s recommendations to reduce infant mortality and strenuous operation.
As you can see, availability depends on reliability; it requires careful input from all parties and clearly defined contractual agreements up front to reduce strenuous operation and to build long-term value for both the customer and the business. Without this clear focus, system operation runs day in and day out in fire-fighting mode, which is costly and time consuming. Fire fighting ties up key resources in non-productive activities and hurts the company’s bottom line.
In closing:
We are all witnesses to the downfall of the big three automobile industry moguls and the greatest financial tsunami of all time due to a lack of due diligence and risk assessment. Risk calculations are not easy, but they are necessary. The skills required are not something you can get over the counter, from a book, or from some clever tool. This is an extremely disciplined skill that requires years of crafting by connecting the dots and staying abreast of this fast-changing world. For me, a reliability engineer is an engineer on steroids, meaning that you have to be able to reverse engineer a particular design while finding ways to improve it based on other core principles [material properties, design for six sigma (DFSS), design for reliability (DFR), LEAN, etc.].
Stale ideology, rigid and outdated guidelines, bureaucracy, the good old boy system, and yesteryear’s glory days will not propel your company forward. I believe we are at a crossroads in America where everyone has to play their part. Toyota handily destroyed 3 pillars of American industry by sticking to a long-term strategy and focusing on core fundamentals. If I’m not mistaken, they consider long term to be 100 years. Today’s American companies are focusing quarter to quarter and have very vague visions and missions. If making money is your mission, I can pretty much guarantee your company will not be around very long.
Philimar Menard is the Chief Technology Officer of Q&R Consulting Firm Inc., which specializes in design for six sigma (DFSS), design for reliability (DFR), continuous improvement, and Lean 6Sigma (LSS). He helps companies propel themselves into the next generation by offering sound solutions to availability and reliability problems that guarantee continuity of operation and engineering designs done right the first time.
Contact him at philimar.menard@qrcfi.com.