Software Reliability and Content Relevance as Systems' Potential Reliability
Yuri Arkhipkin
aryur@yandex.ru
Abstract. This paper argues for a unified quantitative approach to software reliability and content relevance validated by the systems' potential reliability law.
Presented are estimations of the minimum needed software test coverage to assure the achievement of Enough and Six Sigma software reliability at minimum and maximum fault flow throughout the test process. These estimations are based on the evaluation results yielded by the Agile Software Reliability Monitoring Model (ASRMM), approaching the reliability problem in software engineering. This model integrates qualitative subject matter data and quantitative software reliability metrics. The ASRMM is viewed as a foundation of the Testerbot software reliability engineering that provides continuous reliability evaluations throughout the software development process, thus accounting and tracing quantitative software reliability requirements from customer to product.
Also presented are estimations of the minimum needed content semantic coverage to assure the achievement of Any Sigma content relevance for a given example content. These estimations are based on the evaluation results yielded by the Content Relevance Quantification Model (CRQM), approaching the relevance problem in content engineering. This model integrates qualitative subject matter data and quantitative content relevance metrics, providing continuous relevance evaluations throughout the content engineering process, thus making it possible to trace content relevance requirements from customer to product. The CRQM is viewed as a foundation of the Sequantic content relevance engineering.
1. Introduction
Much research has been done on models approaching software reliability quantification. The results seem to be of poor satisfaction despite the increasing number of these models. The lack of explicit evaluations of software elements' failure probability may be considered one of the main problems in software reliability quantification. Software elements' failure data is yielded while testing (executing) software for a vast field of subject applications. This data may be considered ad hoc data, much depending on the developer's skill and software testing skill in particular. Software testing in general may be considered a trial failure process of sensitizing software elements (sites) to define whether the yielded results are true or faulty.
Much research has also been done on models approaching content relevance quantification. The results seem to be of not enough satisfaction despite the increasing number of these models. Content's grammar variety may be considered one of the main problems in content relevance quantification. Content irrelevance (failure) data is yielded due to query occurrence (generation) through content searching, thus providing data for query terms' refinements and/or search engine improvements. So this data generally may be considered ad hoc data, much depending on the search engine developer's skill and query generation (testing) skill in particular. Content searching in general may be considered a trial failure process of sensitizing content terms to define whether this content is relevant to the terms of the query or not.
This paper proposes to break through the quantification problems of software reliability and content relevance engineering by approaching any digital content as a trial failure system regardless of its grammar.
Chapter 2 briefly introduces some mathematics of the systems' potential reliability law proved by B. S. Fleishman [1]. This law validates the systems' failure intensity quantitative ranges depending on the known potential operating elements' number as a part of the total system elements' number and their mean operating probability.
Chapter 3 presents the quantitative approach to software reliability engineering validated by the systems' potential reliability law. A software site's operating probability is considered to be equal to the site's occurrence or potential occurrence probability. This probability may be evaluated at any cycle of the software development. In general the developer needs no external statistical data to monitor the achieved quantitative reliability level of the software project.
Chapter 4 presents the quantitative approach to content relevance engineering validated by the systems' potential reliability law. A content element's operating (sense-sensitizing) probability is considered to be equal to the term's content frequency. This frequency may be evaluated at any cycle of the content development. In general the developer needs no external statistical data to index the quantitative relevance level of the content project.
The pragmatics of the presented approach may be defined by its validity and verifiability, which may be explicitly quantified for the vast field of subject applications in systems engineering.
2. Potential Reliability of a Trial Failure System
At every time moment the system's elements belong to either the operating or the failure state. Moreover, the conversion from the operating to the failure state occurs instantly, while reverse conversions are impossible.
It is natural in general to consider a system as an operating one at the given moment if there exist at least some operating elements comprising a previously stated minimal part of the total system elements' number. Many uncontrollable causes influencing elements' failures make it possible to consider the failures' occurrence as random events.
Let the system AR at a given moment t consist of n elements {a1,…,av,…,an} with arbitrary interactions. Any element is associated with two mutually exclusive events A1v and A0v. The event A1v is associated with the operating element av and the event A0v is associated with its failure. Let the probabilities of the events A1v and A0v be equal to pv and 1-pv correspondingly.
Consider the set Rn of all possible 2^n states r = (i1,…,iv,…,in) of the system AR. This set depicts the operating and failure states of the system AR (iv = 1 if av is in the state A1v and iv = 0 if av is in the state A0v).
Let us divide the set Rn into two parts, E1 and E0 = Rn\E1. The set E1 is the operating set of the system AR and E0 is the failure set of the system AR. Consider by definition that the system AR operates at the given moment only if r ∈ E1.
It is considered that the system's state r is a sequence of independent trials with outcome probabilities pv = P(iv=1), 1-pv = P(iv=0) (v=1,2,…,n) of every v-th trial. Then the probability Pv of the system AR operating at the given moment may be defined [1] as:

Pv = P(r ∈ E1) = Σ_{r∈E1} Π_{v=1..n} pv^iv · (1-pv)^(1-iv).    (2.1)
Consider systems comprised of n elements whose operating set E1 consists of states, each including more than s operating elements. So the set E1 includes all system states r = (i1,…,iv,…,in) for which sum(iv) > s (v=1,2,…,n). Such systems are named symmetric systems of s-th degree, and formula (2.1) becomes:

Pv = Σ_{sum(iv)>s} Π_{v=1..n} pv^iv · (1-pv)^(1-iv).    (2.2)
To define the possibility of operating for the symmetric system of s-th degree, it is necessary to study the asymptotic behavior of (2.2) as n → ∞.
Restricting the study to operating systems with a large but constant elements' number n, these systems are said to have instant operating probability Pv(t) at the given moment t. The probability R(t) that the system AR will operate until some moment t (inclusive) depends on whether the system operates at all moments τ (τ=1,…,t). The independent trials schema with operating probability Pv(τ) provides the reliability R(t) defined as [1]:

R(t) = Π_{τ=1..t} Pv(τ)    (2.3)

and further

1 - Σ_{τ=1..t} [1-Pv(τ)] ≤ R(t) ≤ exp(- Σ_{τ=1..t} [1-Pv(τ)]).    (2.4)
Taking into consideration that Pv(t) → 1 as n(t) increases, different extreme values of R(t) are possible as t increases. To refine this point, consider the ideal system AR with the following postulated features [1]:
1. Operating capability. The system is capable of operating at any time moment t. However, if it fails at a given time moment, then nothing can bring it back into the operating state.
2. Unlimited extension. If the system is operating at a given time moment t, then at the next moment t+1 it may be enhanced by any number of elements. One time unit is a conditional one for a given system.
3. Physical restriction of reaction time. The system becomes aware of its state only at the next time moment t+1.
4. Math restrictions. A symmetric system with independent success and failure trial results at every given time moment t is under consideration.
The reliability R(t) limit of the symmetric system of s-th degree with n = const elements that are pairwise independent and uniformly distributed is defined by the equation

R(t) = exp(-λ·t),    (2.5)

where λ is the system failure intensity measured in faults per element, evaluated according to the systems' potential reliability law (see math proof in [1]) as follows:

-ln[1-exp(-kL·n) + O(ln n)] ≤ λ ≤ -ln[1-exp(-kU·n)],    (2.6)

where

kL = c·ln(c/pM) + (1-c)·ln((1-c)/(1-pM));    (2.7)
kU = c·ln(c/pS) + (1-c)·ln((1-c)/(1-pS));    (2.8)
c = s/n;
pM = pL/(1+pL-pU);
pL ≤ pv ≤ pU < 0.5;
pL = min(pv) (v=1,2,…,n);
pU = max(pv) (v=1,2,…,n);
pS = (1/n)·Σ_{v=1..n} pv.    (2.9)
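As a worked illustration, the bounds (2.6)–(2.9) can be computed directly. The following Python sketch is illustrative only: the element probabilities, the degree s, and the choice of a zero O(ln n) constant are assumptions of this example, not values from the text.

```python
import math

def k_coeff(c, p):
    # Coefficient of (2.7)/(2.8); requires 0 < c < 1 and 0 < p < 0.5.
    return c * math.log(c / p) + (1 - c) * math.log((1 - c) / (1 - p))

def intensity(k, n, big_o=0.0):
    # -ln[1 - exp(-k*n) + O(ln n)] from (2.6); O(ln n) is taken as a constant.
    x = big_o - math.exp(-k * n)
    return math.inf if x <= -1.0 else -math.log1p(x)

def intensity_bounds(pv, s):
    # Failure intensity range (2.6) for a symmetric system of s-th degree.
    n = len(pv)
    c = s / n                          # c = s/n
    pL, pU = min(pv), max(pv)
    pM = pL / (1 + pL - pU)            # pM = pL/(1 + pL - pU)
    pS = sum(pv) / n                   # mean operating probability (2.9)
    return intensity(k_coeff(c, pM), n), intensity(k_coeff(c, pS), n)

# Illustrative system: 1000 elements with operating probabilities in [0.01, 0.05].
pv = [0.01 + 0.04 * v / 999 for v in range(1000)]
lam_min, lam_max = intensity_bounds(pv, s=100)   # c = 0.1, well above pS = 0.03
lam_min2, lam_max2 = intensity_bounds(pv, s=30)  # c = 0.03 = pS: kU vanishes
print(lam_min, lam_max, lam_max2)
```

The run shows the behavior discussed below: when the operating fraction c clearly exceeds the mean operating probability pS, both intensity bounds are vanishingly small, while at c = pS the coefficient kU is zero and the upper-bound intensity diverges.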
3. Software Reliability Engineering
The postulated features above fit a systems approach to the software development process, including testing and debugging in particular. Applying the systems' potential reliability law to quantify software reliability thus appears to be a fruitful approach.
Software reliability is understood as the probability of failure-free software operation in a defined environment for a specified period of time. In general a failure is a deviation of operation results from customer requirements. The deviation is defined by the correspondence between an algorithm's specification and its software implementation. The quality of the algorithm's subject matter specification influences software reliability throughout the development process and the lifecycle of the software product. To achieve continuous improvement of the software engineering process, reliability requirements must be defined in an integrated manner for prediction, evaluation, validation, verification, and certification at the specification, coding, testing, maintenance, and correction cycles. Thus reliability monitoring needs to be implemented as an online automated process throughout the software lifecycle.
At the beginning we know nothing in general about the algorithm to be implemented except some ideas concerning input data and results. Consider an algorithm's specification based on a formal definition of input data. The usefulness of a software reliability model depends on the method of defining the input data to be tested for exhaustive fault detection. Input data may be considered as a set of requests (sites), so their total number and variety are sufficient for reliability evaluation. Sites are viewed as structural and/or functional software elements including subject matter data, input variables, decisions, restrictions, memory structures, and the like. All sites are potentially fault inherent and may cause a failure. The software input data set may be viewed as a site set. A software site is somewhat like a software path.
Let the site set structure define the total sites' number n(t) at different lifecycle times t and the occurrence probability pv(t) (v=1,2,…,n(t)) of the v-th site to be processed, which defines the failure probability 1-pv(t) while processing this site. Failure probability is greater for sites that occur more rarely, because it is generally more difficult to sensitize faults by such exotic sites while testing.
It is most improbable that all n(t) sites will yield specified results, because it is impossible to implement software without faults. But even if all n(t) sites were to yield specified results, this fact would be undetectable, because in this case all n(t) sites would need to be tested. Practically this is impossible because of the great values of n(t) for almost any software product. Not all sites are to be processed even throughout the software lifecycle. To yield specified results at the required reliability level, in practice, it is enough for a software product to have only s(t) assured fault-free sites. The number s(t) of potentially faulty (sensitive) sites, sensitized throughout testing by time t, defines the software test coverage as c(t) = s(t)/n(t) (0<c(t)<1) and is a known parameter of software reliability models.
It is natural in general to consider software to be operating at a given time moment t if there exists at least some fixed, previously stated, minimal part c(t) of operating (fault-free) sites of the total sites' number n(t).
The features (see Chapter 2) provide insight into the test process in general, including fault correction and regression testing procedures, thus refining the software lifecycle process according to the postulated features above.
Any software site may sensitize fault(s) and is a potentially fault inherent one.
Consider software as a system comprised of n(t) sites (elements); the operating set of this system consists of states, each including more than s(t) operating sites. Consider the sites pairwise independent and uniformly distributed over all possible sites' number n(t) = const. Software reliability R(t) is evaluated by the known equation (2.5) as

R(t) = exp(-λ(t)·t),    (3.1)

where λ(t) is the software failure intensity measured in faults per site, evaluated (see formulas (2.6),…,(2.9) and math proof in [1]) as follows:

-ln[1-exp(-kL(t)·n(t)) + O(ln n(t))] ≤ λ(t) ≤ -ln[1-exp(-kU(t)·n(t))],    (3.2)

where

kL(t) = c(t)·ln(c(t)/pM(t)) + (1-c(t))·ln((1-c(t))/(1-pM(t)));    (3.3)
kU(t) = c(t)·ln(c(t)/pS(t)) + (1-c(t))·ln((1-c(t))/(1-pS(t)));    (3.4)
c(t) = s(t)/n(t), (0 ≤ c(t) ≤ 1);
pM(t) = pL(t)/(1+pL(t)-pU(t));
pL(t) ≤ pv(t) ≤ pU(t) < 0.5;
pL(t) = min(pv(t)) (v=1,2,…,n(t));
pU(t) = max(pv(t)) (v=1,2,…,n(t));
pS(t) = (1/n(t))·Σ_{v=1..n(t)} pv(t).    (3.5)

We consider O(ln n(t)) to have some constant value that must be defined at large values of n(t); thus the minimum and maximum failure intensity according to (3.2), (3.3), and (3.4), correspondingly, are as follows:

λmin(t) = -ln[1-exp(-kL(t)·n(t)) + O(ln n(t))],    (3.6)
λmax(t) = -ln[1-exp(-kU(t)·n(t))].    (3.7)
The Agile Software Reliability Monitoring Model (ASRMM) considers software as a system comprised of n(t) sites. The operating set of this system consists of states, each including more than s(t) operating sites. The software test process is defined in general according to the postulated features (Chapter 2), including fault correction and regression testing, thus refining the software lifecycle process with regard to achieving and supporting reliability requirements according to the math interrelations (3.1), (3.4), (3.7) of the total sites' number n(t), test coverage c(t), mean site's occurrence probability pS(t), and failure intensity λ(t). Figure 1 displays these interrelations. Time flow 0≤t≤1 in the ASRMM is a math one, and the t-unit is based on every site rv (v=1,2,…,n(t)) sensitized while testing.
To define the initial mean occurrence probability pS(t), imagine a software site rv (v=1,…,n(t)) as a set of W(t) pairwise independent parameters xi (i=1,…,W(t)). Each parameter may accept value(s) xij (j=1,…,Ji(t)) with sij(t) values of each. These parameters define semantically sufficient parameter values that are sensitive with regard to yielding specified results, or, otherwise, values that may sensitize faults in the software under test. These values are supposed to yield semantically typical results. The software sites' set structure may be viewed as a software sensitive sites' semantics matrix (SSSSM).
Semantics of the parameter values is defined by the subject matter and the particularities of the algorithm's implementation. There may be defined a number Ji(t) of semantic types having sij(t) (si(t) ≥ sij(t) ≥ 1) values of each type. So each parameter xi has si(t) = Σ_{j=1..Ji(t)} sij(t) values; thus the total sites' number is n(t) = Π_{i=1..W(t)} si(t) and the number of sensitive sites is s0(t) = Π_{i=1..W(t)} Ji(t) for a given project. Then at the specification cycle t=0, according to (3.5), we have pS(0) = 1/s0(0). Let us name pS(0) the initial semantic mean of a software project under development. Values denoted by t time (0 ≤ t ≤ 1; one unit of t time is ts = 1/n(t)) vary throughout the development process due to refinements brought by customers, programmers, testers, developers, users, and the like. These refinements lead to changes of the total sites' number n(t) and the sensitive sites' number s0(t) because of detected faults, thus changing the semantic mean pS(t) value.
While testing, the test coverage value c(t) increases, so the difference c(t) - pS(0) < 0 tends to 0, and when (see Figure 1)

c(t) - pS(0) = 0,    (3.8)

according to (3.4) we have kU(t) = 0, which is the starting point of failure intensity (3.7) decrease and software reliability (3.1) growth (see Figure 1). When the sensitive sites' number s0(0) is not defined, then according to (3.5) and (3.8) we have s0(0)/n(0) = 1/s0(0) and thus s0(0) = sqrt(n(0)).
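The default s0(0) = sqrt(n(0)) can be checked directly: it is exactly the choice that makes the initial coverage target equal the semantic mean, so that kU from (3.4) vanishes. A small sketch (the value of n(0) here is illustrative):

```python
import math

n0 = 10**6                    # illustrative total sites' number n(0)
s0 = math.isqrt(n0)           # default sensitive sites' number: s0(0) = sqrt(n(0)) = 1000
c = s0 / n0                   # initial coverage target c(0) = s0(0)/n(0)
pS = 1 / s0                   # initial semantic mean pS(0) = 1/s0(0), per (3.5)
assert c == pS                # condition (3.8): coverage equals semantic mean

# At this point kU from (3.4) vanishes, marking the start of reliability growth:
kU = c * math.log(c / pS) + (1 - c) * math.log((1 - c) / (1 - pS))
print(kU)
```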
Here is a general description of the software reliability monitoring algorithm based on the ASRMM.
The software site set structure is defined during specification as an SSSSM. Semantics of the site's parameter values is defined by the software subject matter and the algorithm's implementation particularities. The SSSSM is viewed as a database of the ASRMM engine (TESTERBOT) for generating tests, refining the semantic mean and semantic shift of input data flow, thus providing monitoring of the software engineering process.
During software concept definition and input data specification at t=0, before testing, we refine s0(0) sensitive sites selected from the total sites' number n(0). Sites are ranged according to their occurrence probabilities pv(0) (v=1,2,…,n(0)), thus defining the initial semantic mean pS(0) of input data flow.
Potential reliability metrics, evaluated according to (3.7), if compared with the required one λrq(1), define the appropriate test coverage value to be achieved to meet the requirements. The difference λrq(1) > λ(0) means that it is necessary to define an additional number of sensitive sites for testing to assure the achievement of the required reliability. The additional sensitive sites are defined and may be selected by the TESTERBOT either in exhaustive or extreme (random or semantic extrapolation) mode of the test process.
The test process must assure the achievement of the required test coverage value. While testing, fault(s) are detected and corrected, thus changing the sensitive sites' number s0(t+ts) := s0(t)+1 for each fault and in general changing the number of values si(t) ≥ sij(t) ≥ 1 of the sites' parameters. The sensitive sites' parameter values are either predefined or randomly selected according to semantic ranges. If the total sites' number n(t) changes, then the t time unit ts is recalculated. If no fault is detected, the number of tested sites must be changed, s(t+ts) := s(t)+1 for each tested site, thus increasing the test coverage c(t).
The value of t time is calculated as t := t + ts after each tested fault-free site.
Changes are calculated for n(t), s(t), s0(t), si(t), Ji(t), W(t), thus refining the semantic shift pS(0) - pS(t). If the total sites' number n(t) is changed, then the time is to be refined as t := t + ts = t + 1/n(t) only after each tested faulty site has been corrected and retested.
The reliability metric λ(t<1), being continuously evaluated as (3.7) throughout the testing process and compared with the required one, provides a possibility of making decisions on the testing process. The achieved reliability level, being continuously monitored, depicts changes either in subject matter requirements or software implementation particularities, thus providing possibilities for optimization throughout the engineering process of software product development.
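The monitoring algorithm above can be sketched as a small state machine. This is an illustrative reconstruction, not the TESTERBOT implementation; the class name and the choice of pS(t) = 1/s0(t), extending the initial semantic mean of (3.5) to later cycles, are assumptions of this sketch.

```python
import math

class ReliabilityMonitor:
    """Illustrative ASRMM-style tracker of coverage c(t) and failure intensity (3.7)."""

    def __init__(self, n, s0):
        self.n = n        # total sites' number n(t)
        self.s0 = s0      # sensitive sites' number s0(t)
        self.s = 0        # tested fault-free sites s(t)
        self.t = 0.0      # math time, one unit per tested site: ts = 1/n(t)

    def record_test(self, fault_detected):
        if fault_detected:
            self.s0 += 1  # corrected fault adds a sensitive site: s0 := s0 + 1
        else:
            self.s += 1   # fault-free site increases test coverage: s := s + 1
        self.t += 1 / self.n

    def failure_intensity(self):
        # lambda_max(t) from (3.4) and (3.7), with semantic mean taken as 1/s0(t).
        c, p = self.s / self.n, 1 / self.s0
        if c <= p:
            return math.inf      # coverage below semantic mean: no reliability growth yet
        k = c * math.log(c / p) + (1 - c) * math.log((1 - c) / (1 - p))
        e = math.exp(-k * self.n)
        return math.inf if e >= 1.0 else -math.log1p(-e)

# Simulated test run: a fault detected at every 50th tested site.
mon = ReliabilityMonitor(n=10_000, s0=100)
for i in range(200):
    mon.record_test(fault_detected=(i % 50 == 0))
print(mon.s, mon.s0, mon.failure_intensity())
```

Once the coverage passes the semantic mean, the evaluated intensity drops sharply, mirroring the threshold behavior of (3.8).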
The ASRMM (see also [2]) as a foundation of the Testerbot engineering yields validated and verifiable reliability evaluations, integrating qualitative subject matter data and quantitative software reliability metrics. Testerbot engineering provides continuous reliability evaluations throughout the software engineering process, thus accounting and tracing quantitative software reliability requirements from customer to product.
Consider the customer's four sigma reliability requirements for a software project with a total of a trillion (10^12) software sites of input data set elements. At the beginning of the concept definition, the TESTERBOT preliminary evaluations yield the minimum needed software test coverage as 0.0000010032, or 1,003,200 tests, to assure the achievement of the required four sigma software reliability (0.00621 faults per site). To improve the reliability up to six sigma (2.0·10^-9 faults per site), the extra needed schedule-budget spending on testing and, if needed, on reengineering must be increased 1.003 times. The minimum failure intensity for the given n(t) is λmin(t) = 1/n(t), thus defining Enough sigma test coverage estimations to achieve the maximum possible reliability for the given software project. To achieve maximum (Enough sigma) reliability, the engineering efforts must be 1.0042-fold. It seems that Six Sigma reliability may not be enough for large-scale, semantically sophisticated, and critical software projects.
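These preliminary estimates can be reproduced numerically from (3.4) and (3.7) by solving for the coverage c that attains a target failure intensity. The following is a hedged reconstruction, not TESTERBOT output: the bisection solver is this sketch's own, and pS(0) = 1/sqrt(n(0)) = 10^-6 follows from the default of (3.8).

```python
import math

def k_u(c, p):
    # kU(t) from (3.4)
    return c * math.log(c / p) + (1 - c) * math.log((1 - c) / (1 - p))

def failure_intensity(c, p, n):
    # lambda_max(t) from (3.7): -ln[1 - exp(-kU * n)]
    e = math.exp(-k_u(c, p) * n)
    return math.inf if e >= 1.0 else -math.log1p(-e)

def coverage_for(lam_target, p, n):
    # Smallest coverage c > pS reaching the target intensity (intensity falls as c grows).
    lo, hi = p * (1 + 1e-6), p * 1.1
    for _ in range(200):
        mid = (lo + hi) / 2
        if failure_intensity(mid, p, n) > lam_target:
            lo = mid
        else:
            hi = mid
    return hi

n = 10**12                          # a trillion software sites
pS = 1 / math.isqrt(n)              # initial semantic mean: 1/sqrt(n) = 1e-6
c4 = coverage_for(0.00621, pS, n)   # four sigma: 0.00621 faults per site
c6 = coverage_for(2.0e-9, pS, n)    # six sigma: 2.0e-9 faults per site
cE = coverage_for(1 / n, pS, n)     # "Enough sigma": lambda_min = 1/n
print(round(c4 * n), c6 / c4, cE / c4)
```

The run yields roughly 1.003 million four sigma tests, a roughly 1.003-fold effort increase for six sigma, and a roughly 1.0042-fold increase for Enough sigma, in line with the figures quoted above.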
During further software input data set structure refinement and specification at t=0, before testing, the TESTERBOT refines the initial number s0(0) of sensitive sites selected from the total sites' number n(0). Sites are ranged by the TESTERBOT according to their occurrence probabilities, thus refining the initial semantic mean pS(0) of input data flow. So the TESTERBOT evaluations yield the refined minimum needed software test coverage as 0.0000010016, or 1,001,600 tests, to assure the achievement of the required four sigma software reliability. To improve the reliability up to six sigma, the extra needed schedule-budget spending on testing and, if needed, on reengineering must be increased 1.00156 times. To achieve Enough sigma reliability, the engineering efforts must be 1.00211-fold.
The TESTERBOT application provides 0.37% schedule-budget savings, or a 3,720-test decrease, in achieving Enough sigma reliability requirements for the given project.
The ASRMM results featuring the Testerbot optimization capabilities (see Figure 2) give an idea of schedule-budget-reliability estimations of the needed test coverage changes due to refinements of the total software sites' number. These refinements are contributed to the project by the customer, developers, testers, and the like throughout the software engineering process.
Testerbot engineering assures the achievement of the required software test coverage value, thus defining the most accurate semantic mean compliance with the semantic target. Any detected and corrected fault decreases the semantic shift. While testing, fault(s) are detected and corrected, thus changing the initial sensitive sites' number s0(0) and in general changing the total sites' number n(0), thus refining the needed test coverage c(t) to achieve the required failure intensity λ(t) for any sigma software reliability. Fault flow may vary from minimum, when no faults are detected throughout the test process, to maximum, when every tested site sensitizes fault(s). The ASRMM estimations of the minimum needed test coverage at minimum and maximum fault flow are shown in Figure 2. These online reliability evaluations provide a flexible decision-making Testerbot tool for software product lifecycle management.
The Testerbot software reliability engineering:
depicts the real software test process either in a distributed or a concurrent environment and may be viewed as an online TESTERBOT software development tool (PLM solution) to be applied for any test phase and strategy, including extreme (agile) and exhaustive ones;
needs no empiric fault data and may be viewed as the TESTERBOT tool institutionalizing all CMMI-SW/SE Maturity levels;
provides protective features, giving impact in safety programming and the data security approach;
may be viewed as a test bed and/or a cradle for many existing or upcoming software reliability models.
Each product, released under the TESTERBOT tool, may be equipped with a reliability e-certificate needed for acquisition, remedial, and continual improvement processes, thus suggesting the quantifiable improvement approach to system development standards.
4. Content Relevance Engineering
Any content sensitizes sense or results while being processed by a brain or a computer. Content search aims to provide access to the results of content processing. These results must be of enough quality to rely on, so the content search must be reliable enough to provide relevant content for processing.
Reliability of content search may be viewed as the probability of relevant content detection or coverage by applying a query. Irrelevance may be considered as a deviation of content coverage results from customer requirements. The deviation may be defined as the compliance of query and content specifications with the search engine implementation. Quality of the subject matter query specification influences content search relevance.
In general we know nothing about the content to be searched except some ideas concerning queries and content search results.
The Content Relevance Quantification Model (CRQM) considers any content as a set of n queries with a queries' variety number, or sensitive queries' number, s ≤ n. Any content query qv (v=1,2,…,s) may sensitize sense and is a potentially sense inherent one. An input query, applied for sense detection (sensitization), may discover or recover content with some relevance to the search results.
Queries are viewed as structural and/or functional content (document, collection, corpus) elements, including subject matter data. Let the query set structure define the queries' variety number s and an occurrence probability pv (v=1,2,…,s) of the v-th query, which defines the probability of sensitizing sense in response to this query.
It is most improbable that an input query includes all s queries of the variety number specifying the content under search, because of the great values of s for almost any content. To yield specified search results at the required relevance level, in practice, it is enough for a content to have only sv sensitive queries. The number sv of potentially sensitive queries, sensitized throughout the content search, defines the content's semantic coverage as cv = sv/n and is a known parameter of search models.
It is natural to consider any content under search to be relevant enough if there exists at least some fixed, previously stated number sv of sensitive queries as a minimal part cv of the total content queries' number n.
Consider any content as a system comprised of n queries; the sensitive set of this system consists of states, each including more than sv sensitive queries. Consider the queries pairwise independent and uniformly distributed over all possible queries' number n = const. Content relevance R, or the probability of sensitizing sense, is quantified according to (2.5) by the equation R = exp(-λ), or else according to (2.6),…,(2.9) as

1-exp(-kL·n) + O(ln n) ≥ R ≥ 1-exp(-kU·n),    (4.1)

where λ is the content's irrelevance intensity, defined by (2.6),…,(2.9) and measured as irrelevance (the probability of not sensitizing sense) per content query.
The content query set structure is defined while indexing as a content query semantics matrix (CQSM). Semantics of the query's parameter values is defined by the content subject matter and the search engine's algorithm implementation particularities. The CQSM may be considered as a semantic quantifier's (SEQUANTIC tool) database for queries' indexing (quantification), refining the content's semantic mean pS, semantic coverage cv, and semantic shift cv - pS (Figure 1), thus providing content relevance quantification. Content search assures the achievement of the required semantic coverage value.
Consider a content structured as a total queries' number n with semantic mean pS = 1/n. The most relevant query for any content is the input query yielding Rv = 1, λv = 0. The most irrelevant input query, Rv → 0, in general is defined at cv = pS = pv and λv → ∞, so that the semantic coverage is equal to the semantic mean and occurrence probability of the query qv. The query discovers a content with the higher relevance, the greater the inequality cv < pS. The query recovers a content with the higher relevance, the greater the inequality cv > pS (Figure 1).
Content relevance R, being continuously evaluated throughout the search process and compared with the required one, provides a possibility for making decisions on searching. The achieved relevance level depicts changes either in subject matter requirements (input queries) or search engine implementation particularities, thus providing possibilities for optimization throughout the content search engineering process.
The SEQUANTIC tool as a content product lifecycle management solution may be viewed as a test bed and/or a cradle for many existing or upcoming content relevance models and search engine implementations.
Each content product, released under the SEQUANTIC tool, may be equipped with a relevance e-certificate needed for acquisition, remedial, and continual improvement processes, thus suggesting the quantifiable improvement approach to content development standards.
The CRQM-based Sequantic engineering yields validated and verifiable relevance estimations, integrating qualitative subject matter queries and quantitative relevance metrics of content. Sequantic engineering provides continuous relevance evaluations throughout the content engineering process, thus accounting and tracing quantitative content relevance requirements from customer to the content product.
The CRQM improves latent semantic indexing, especially for unknown and (or) heterogeneous collections, by increasing the relevance, precision, and recall of content search, including full-text search. The CRQM may be used for data exploration and data integration tasks (due to its potential accuracy in quantifying content semantics), to solve heterogeneity problems, and to provide varied levels of querying services, which facilitates knowledge discovery at different levels of granularity.
The content above, as an example, may be structured as a total of n = 203 one-term queries with a variety of s = 23 queries and semantic mean ps = 1/s = 0.043 (see Figure 3). According to the (4.1), (2.7), …, (2.9) quantification, the most relevant recovery, λ1 = 1.66533E-15, is performed by the "content" one-term query, and the most irrelevant recovery, λ4 = 0.836127403, is performed by the "sensitize" query at the 0.433385606 relevance level. The query "term" may be considered a four-sigma (λ ≤ 0.00621) relevant discovery query for the content above; "term" discovers the content example at the 0.99659028 relevance level.
Consider the query "relevant deviation results" as an example. Its semantic coverage (see Figure 3) is evaluated as c = c3 + c19 + c16 = 0.064 + 0.010 + 0.035 = 0.109, and this query's recovery relevance according to (4.1), (2.7), …, (2.9) is not less than 0.999526985. Another example of content quantification is shown in [3].
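The coverage arithmetic for a multi-term query can be reproduced in a few lines. A minimal sketch, using the per-term semantic coverage values cv taken directly from Figure 3 and summing them for the composite query, as in the example above:

```python
# Semantic coverage values c_v from Figure 3 for the terms of the
# example query "relevant deviation results"
# (rows v = 3 "relevant", v = 19 "deviation", v = 16 "results").
coverage = {"relevant": 0.064, "deviation": 0.010, "results": 0.035}

# The composite query's semantic coverage is the sum of its terms' coverages.
c_query = round(sum(coverage.values()), 3)
print(c_query)  # 0.109

# Semantic mean over the variety of s = 23 one-term queries.
p_s = round(1 / 23, 3)
print(p_s)  # 0.043

# c_query > p_s, so "relevant deviation results" is a recovery query.
print(c_query > p_s)  # True
```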
The presented content relevance quantification approach may be applied to human-grammar content once the known language ambiguities are resolved.
5. Conclusion
Many research and development directions may be initiated by the presented quantitative system engineering approach. This approach implies any affordable quantitative accuracy in the refinement of system elements, suitable for both the customer and the developer.
References
1. Fleishman, B.S. Theory Elements of Complex Systems' Potential Effectiveness. Soviet Radio, Moscow, 1971, 224 pp. (in Russian).
2. Arkhipkin, Y. Enough Sigma Software Test Coverage Approach. http://aryur.narod.ru/
3. Arkhipkin, Y. Content Relevance Quantification Model. http://sequantic.narod.ru/
Figure 1. General interrelation of the test/semantic coverage c(t) (c), software/content semantic mean ps(t) (ps), and reliability/relevance metrics R(t) (R), λ(t) (λ).
Figure 2. Testerbot reliability estimations yielded by the ASRMM for exhaustive testing

| Total software sites' variety number n(0) | Sigma software reliability requirements | Minimal test coverage at min fault flow (tests #) | Minimal test coverage at max fault flow (tests #) | Schedule-budget savings, % (tests # decrease) |
|---|---|---|---|---|
| 1,024 | 4σ | 0.049 (51) | 0.04101562 (42) | 17.65 (9) |
| | 6σ | 0.071 (73) | 0.051757813 (54) | 35.65 (19) |
| | Enough sigma | 0.0537 (55) | 0.04296875 (44) | 20.00 (11) |
| 89,362 | 4σ | 0.00397 (355) | 0.00367046 (328) | 7.60 (27) |
| | 6σ | 0.004644032 (415) | 0.003994987 (358) | 13.73 (57) |
| | Enough sigma | 0.00430831 (385) | 0.00383832 (344) | 10.65 (41) |
| 17,343,286 | 4σ | 0.000252028 (4,371) | 0.000246147 (4,269) | 2.33 (102) |
| | 6σ | 0.000264079 (4,580) | 0.000252144 (4,375) | 4.47 (205) |
| | Enough sigma | 0.00026206 (4,545) | 0.00025104 (4,354) | 4.20 (191) |
| 1,000,000,000 | 4σ | 0.000032182 (32,182) | 0.0000319 (31,900) | 0.87 (282) |
| | 6σ | 0.000032745 (32,745) | 0.000032189 (32,189) | 1.70 (556) |
| | Enough sigma | 0.00003277 (32,770) | 0.00003219 (32,190) | 1.77 (580) |
| 102,500,700,000 | 4σ | 0.0000031411 (321,965) | 0.0000031322 (321,054) | 0.28 (911) |
| | 6σ | 0.0000031584 (323,739) | 0.000003140 (321,856) | 0.59 (1,883) |
| | Enough sigma | 0.00000316286 (324,196) | 0.00000314316 (322,176) | 0.62 (2,020) |
| 1,000,000,000,000 | 4σ | 0.0000010032 (1,003,200) | 0.0000010016 (1,001,600) | 0.16 (1,600) |
| | 6σ | 0.0000010063 (1,006,300) | 0.00000100317 (1,003,170) | 0.31 (3,130) |
| | Enough sigma | 0.00000100744 (1,007,440) | 0.00000100372 (1,003,720) | 0.37 (3,720) |
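The savings column of Figure 2 can be recomputed from the two test counts. A minimal sketch, assuming savings % = 100 × (tests at min fault flow − tests at max fault flow) / tests at min fault flow, which reproduces e.g. the 89,362 / 6σ row:

```python
def schedule_budget_savings(tests_min_flow: int, tests_max_flow: int) -> float:
    """Schedule-budget savings, %, from the decrease in the number of
    tests between the min and max fault flow coverage estimations."""
    decrease = tests_min_flow - tests_max_flow
    return round(100.0 * decrease / tests_min_flow, 2)

# Figure 2, n(0) = 89,362, 6-sigma row: 415 vs 358 tests.
print(schedule_budget_savings(415, 358))  # 13.73 (a 57-test decrease)
```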
Figure 3. The CRQM relevance quantification example for the Chapter 4 content

| v | Query set structure qv | Occurrences nv | Occurrence probability pv = nv/n | Semantic coverage cv | Irrelevance intensity λv | Relevance Rv |
|---|---|---|---|---|---|---|
| 1 | content | 41 | 0.202 | 0.202 | 1.66533E-15 | ~1.0 |
| 2 | search | 22 | 0.108 | 0.108 | 0.000579228 | 0.99942094 |
| 3 | relevant | 13 | 0.064 | 0.064 | 0.488453851 | 0.613574339 |
| 4 | sensitize | 12 | 0.059 | 0.059 | 0.836127403 | 0.433385606 |
| 5 | coverage | 5 | 0.025 | 0.025 | 0.497992614 | 0.607749424 |
| 6 | query | 32 | 0.158 | 0.158 | 2.37465E-09 | 0.999999998 |
| 7 | reliable | 4 | 0.020 | 0.020 | 0.221295993 | 0.801479413 |
| 8 | discover | 3 | 0.015 | 0.015 | 0.080497264 | 0.922657428 |
| 9 | recover | 2 | 0.010 | 0.010 | 0.021461497 | 0.978767162 |
| 10 | irrelevance | 3 | 0.015 | 0.015 | 0.080497264 | 0.922657428 |
| 11 | sense | 6 | 0.030 | 0.030 | 0.990193344 | 0.371504856 |
| 12 | semantics | 8 | 0.040 | 0.040 | 3.796293118 | 0.022453852 |
| 13 | engine | 8 | 0.040 | 0.040 | 3.796293118 | 0.022453852 |
| 14 | quantify | 7 | 0.035 | 0.035 | 1.865576521 | 0.154806935 |
| 15 | processing | 3 | 0.015 | 0.015 | 0.080497264 | 0.922657428 |
| 16 | results | 7 | 0.035 | 0.035 | 1.865576521 | 0.154806935 |
| 17 | probability | 4 | 0.020 | 0.020 | 0.221295993 | 0.801479413 |
| 18 | requirements | 4 | 0.020 | 0.020 | 0.221295993 | 0.801479413 |
| 19 | deviation | 2 | 0.010 | 0.010 | 0.021461497 | 0.978767162 |
| 20 | implementation | 4 | 0.020 | 0.020 | 0.221295993 | 0.801479413 |
| 21 | specification | 4 | 0.020 | 0.020 | 0.221295993 | 0.801479413 |
| 22 | collection | 2 | 0.010 | 0.010 | 0.021461497 | 0.978767162 |
| 23 | term | 1 | 0.005 | 0.005 | 0.003415546 | 0.99659028 |
Copyright © 2007 by Y. Arkhipkin