Software Reliability and Content Relevance as Systems' Potential Reliability

 

Yuri Arkhipkin

aryur@yandex.ru

 

Abstract. This paper argues for a unified quantitative approach to software reliability and content relevance, validated by the systems' potential reliability law.

Presented are estimations of the minimum needed software test coverage to assure the achievement of Enough and Six Sigma software reliability at minimum and maximum fault flow throughout the test process. These estimations are based on the evaluation results yielded by the Agile Software Reliability Monitoring Model (ASRMM), which approaches the reliability problem in software engineering. This model integrates qualitative subject matter data and quantitative software reliability metrics. The ASRMM is viewed as a foundation of the Testerbot software reliability engineering that provides continuous reliability evaluations throughout the software development process, thus accounting for and tracing quantitative software reliability requirements from customer to product.

Also presented are estimations of the minimum needed content semantic coverage to assure the achievement of Any Sigma content relevance for a given example content. These estimations are based on the evaluation results yielded by the Content Relevance Quantification Model (CRQM), which approaches the relevance problem in content engineering. This model integrates qualitative subject matter data and quantitative content relevance metrics, providing continuous relevance evaluations throughout the content engineering process and thus making it possible to trace content relevance requirements from customer to product. The CRQM is viewed as a foundation of the Sequantic content relevance engineering.

 

 

 

1. Introduction

 

Much research has been done on models approaching software reliability quantification. The results seem to be of poor satisfaction despite the increasing number of these models. The lack of explicit evaluations of software elements' failure probability may be considered one of the main problems in software reliability quantification. Software elements' failure data is yielded while testing (executing) software for a vast field of subject applications. This data may be considered ad hoc data, much depending on the developer's skill and on software testing skill in particular. Software testing in general may be considered a trial failure process of sensitizing software elements (sites) to define whether the yielded results are true or faulty.

Much research has also been done on models approaching content relevance quantification. The results seem unsatisfactory despite the increasing number of these models. Content's grammar variety may be considered one of the main problems in content relevance quantification. Content irrelevance (failure) data is yielded due to query occurrence (generation) through content searching, thus providing data for refinements of the query's terms and (or) improvements of the search engine. So this data generally may be considered ad hoc data, much depending on the search engine developer's skill and on query generation (testing) skill in particular. Content searching in general may be considered a trial failure process of sensitizing content terms to define whether this content is relevant to the terms of the query or not.

This paper proposes to break through the quantification problems of software reliability and content relevance engineering by approaching any digital content as a trial failure system regardless of its grammar.

Chapter 2 briefly introduces some mathematics of the systems' potential reliability law proved by B. S. Fleishman [1]. This law validates the quantitative ranges of a system's failure intensity depending on the known number of potentially operating elements as a part of the total number of system elements and on their mean operating probability.

Chapter 3 presents the quantitative approach to software reliability engineering validated by the systems' potential reliability law. A software site's operating probability is considered equal to the site's occurrence or potential occurrence probability. This probability may be evaluated at any cycle of the software development. In general, the developer needs no external statistical data to monitor the achieved quantitative reliability level of the software project.

Chapter 4 presents the quantitative approach to content relevance engineering validated by the systems' potential reliability law. A content element's operating (sense sensitizing) probability is considered equal to the term's content frequency. This frequency may be evaluated at any cycle of the content development. In general, the developer needs no external statistical data to index the quantitative relevance level of the content project.

The pragmatics of the presented approach may be defined by its validity and verifiability, which may be explicitly quantified for a vast field of subject applications in systems engineering.

 

 

 

2. Potential Reliability of a Trial Failure System

 

At every time moment the system's elements belong to either the operating or the failure state. Moreover, the transition from the operating to the failure state occurs instantly, while the reverse transition is impossible.

It is natural in general to consider a system operating at the given moment if there exist at least some operating elements comprising a beforehand-stated minimal part of the total number of system elements. The many uncontrollable causes influencing elements' failures make it possible to consider the occurrence of failures as random events.

Let the system AR at a given moment t consist of n elements {a1, …, av, …, an} with arbitrary interactions. Any element is associated with two mutually exclusive events A1v and A0v. The event A1v is associated with the operating element av and the event A0v with its failure. Let the probabilities of the events A1v and A0v be equal to pv and 1 - pv, respectively.

Consider the set Rn of all possible 2^n states r = (i1, …, iv, …, in) of the system AR. This set depicts the operating and failure states of the system AR (iv = 1 if av is in the state A1v and iv = 0 if av is in the state A0v).

Let us divide the set Rn into two parts E1 and E0 = Rn \ E1. The set E1 is the operating set of the system AR and E0 is the failure set of the system AR. By definition, the system AR operates at the given moment only if r ∈ E1.

The system's state r is considered a sequence of independent trials with outcome probabilities pv = P(iv = 1), 1 - pv = P(iv = 0) (v = 1, 2, …, n) for every v-th trial. Then the probability Pv of the system AR operating at the given moment may be defined [1] as:

Pv = P(r ∈ E1) = Σ_{r ∈ E1} ∏_{v=1}^{n} pv^{iv} · (1 - pv)^{1-iv}.                                   (2.1)

Consider the systems comprised of n elements whose operating set E1 consists of states, each including more than s operating elements. So the set E1 includes all system states r = (i1, …, iv, …, in) for which Σ iv > s (v = 1, 2, …, n). Such systems are called symmetric systems of s-th degree, and formula (2.1) appears as:

Pv = Σ_{Σ iv > s} ∏_{v=1}^{n} pv^{iv} · (1 - pv)^{1-iv}.                                             (2.2)

To define the operating possibility of the symmetric system of s-th degree, it is necessary to study the asymptotic behavior of (2.2) at n → ∞.
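Before turning to the asymptotics, (2.2) can be evaluated exactly for a small system. The following is a minimal, hypothetical sketch in Python (the element probabilities are assumed, not taken from the paper) that computes the Poisson-binomial tail P(Σ iv > s) by dynamic programming.

# A minimal, hypothetical sketch: exact operating probability (2.2) of a
# symmetric system of s-th degree, i.e. P(more than s of the n independent
# elements operate). Element probabilities p_v are assumed, not from the paper.

def operating_probability(p, s):
    """Poisson-binomial tail: P(sum(i_v) > s) for independent elements
    with operating probabilities p = [p_1, ..., p_n]."""
    n = len(p)
    dist = [1.0] + [0.0] * n                 # dist[k] = P(exactly k elements operate)
    for pv in p:
        for k in range(n, 0, -1):            # update in place, high k first
            dist[k] = dist[k] * (1 - pv) + dist[k - 1] * pv
        dist[0] *= (1 - pv)
    return sum(dist[s + 1:])                 # states r with more than s operating elements

if __name__ == "__main__":
    p = [0.3, 0.25, 0.4, 0.35, 0.2]          # assumed p_v values, all below 0.5
    print(operating_probability(p, s=1))     # probability that more than one element operates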

Restricting the study to operating systems with a large but constant number of elements n, it is said that these systems have the instant operating probability Pv(t) at the given moment t. The probability R(t) that the system AR will operate until some moment t (inclusive) depends on whether the system operates at all moments τ (τ = 1, …, t). The sequence-of-independent-trials schema with operating probability Pv(τ) provides the reliability R(t) defined as [1]:

R(t) = ∏_{τ=1}^{t} Pv(τ)                                                                             (2.3)

and further

1 - Σ_{τ=1}^{t} [1 - Pv(τ)]  ≤  R(t)  ≤  exp( - Σ_{τ=1}^{t} [1 - Pv(τ)] ).                           (2.4)
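As a quick numerical illustration of (2.3) and (2.4), the sketch below (Python, with assumed instant operating probabilities) computes R(t) together with its lower and upper bounds.

# Sketch: reliability (2.3) of the trial sequence and its bounds (2.4), for
# assumed instant operating probabilities Pv(tau), tau = 1..t (values illustrative).

import math

def reliability_and_bounds(Pv):
    """Pv is a list of instant operating probabilities Pv(1), ..., Pv(t)."""
    R = math.prod(Pv)                        # equation (2.3)
    q = sum(1.0 - x for x in Pv)             # cumulative failure mass
    return 1.0 - q, R, math.exp(-q)          # lower bound, R(t), upper bound of (2.4)

if __name__ == "__main__":
    print(reliability_and_bounds([0.999, 0.998, 0.9995, 0.997]))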

Taking into consideration that Pv(t) → 1 as n(t) increases, different extreme values of R(t) are possible as t increases. To refine this point, consider the ideal system AR with the postulated features as follows [1]:

1. Operating capability. The system is capable of operating at any time moment t. However, if it fails at a given time moment, then nothing can bring it back into the operating state.

2. Unlimited extension. If the system is operating at a given time moment t, then at the next moment t+1 it may be extended by any number of elements. One time unit is a conditional one for a given system.

3. Physical restriction of reaction time. The system becomes aware of its state only at the next time moment t+1.

4. Math restrictions. The symmetric system with independent success and failure trial results at every given time moment t is under consideration.


The reliability R(t) limit of the symmetric system of s-th degree with n = const elements that are pairwise independent and uniformly distributed is defined by the equation

R(t) = exp(-λ·t),                                                                                    (2.5)

where λ is a system failure intensity measured in faults per element and is evaluated according to the systems' potential reliability law (see the math proof in [1]) as follows:

 

-ln[1 - exp(-kL·n) + O(ln n)]  ≤  λ  ≤  -ln[1 - exp(-kU·n)],                                         (2.6)

 

where

 

kL = c·ln(c/pM) + (1 - c)·ln((1 - c)/(1 - pM));                                                      (2.7)

 

kU = c·ln(c/pS) + (1 - c)·ln((1 - c)/(1 - pS));                                                      (2.8)

 

c = s/n;

 

pM=pL/(1+pL-pU);

 

pL ≤ pv ≤ pU < 0.5;

 

pL = min(pv)  (v = 1, 2, …, n);

 

pU = max(pv)  (v = 1, 2, …, n);

 

pS = (1/n) · Σ_{v=1}^{n} pv.                                                                         (2.9)

The postulated features above fit a systems approach to the software development process, including testing and debugging in particular. Application of the systems' potential reliability law to quantify software reliability seems to be a fruitful enough approach.
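To make the law concrete, here is a minimal sketch of the failure-intensity range (2.6)-(2.9) in Python. The inputs n, c, pL, pU, pS and the constant taken for the O(ln n) term are assumed illustrative values, not figures from the paper.

# A minimal sketch of the failure-intensity range (2.6)-(2.9). All inputs are
# assumed illustrative values.

import math

def k_exponent(c, p):
    """The exponent of (2.7)/(2.8): c*ln(c/p) + (1-c)*ln((1-c)/(1-p))."""
    return c * math.log(c / p) + (1 - c) * math.log((1 - c) / (1 - p))

def failure_intensity_range(n, c, pL, pU, pS, o_ln_n=0.0):
    pM = pL / (1 + pL - pU)                                   # definition of pM
    kL = k_exponent(c, pM)                                    # (2.7)
    kU = k_exponent(c, pS)                                    # (2.8)
    lam_lower = -math.log(1 - math.exp(-kL * n) + o_ln_n)     # left side of (2.6)
    lam_upper = -math.log(1 - math.exp(-kU * n))              # right side of (2.6)
    return lam_lower, lam_upper

if __name__ == "__main__":
    print(failure_intensity_range(n=1024, c=0.05, pL=0.01, pU=0.05, pS=0.03))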

 

 

 

3. Enough Sigma Software Test Coverage Approach

 

Software reliability is understood as the probability of failure-free software operation in a defined environment for a specified period of time. In general, a failure is a deviation of operation results from customer requirements. The deviation is defined by the correspondence between the algorithm's specification and its software implementation. The quality of the algorithm's subject matter specification influences software reliability throughout the development process and lifecycle of the software product. To achieve continuous improvement of the software engineering process, reliability requirements must be defined in an integrated manner for prediction, evaluation, validation, verification, and certification at the specification, coding, testing, maintenance, and correction cycles. Thus reliability monitoring needs to be implemented as an online automated process throughout the software lifecycle.

At the beginning we generally know nothing about the algorithm to be implemented except some ideas concerning input data and results. Consider the algorithm's specification based on a formal definition of input data. The usefulness of a software reliability model depends on the method of defining the input data to be tested for exhaustive fault detection. Input data may be considered a set of requests (sites), so their total number and variety are sufficient for reliability evaluation. Sites are viewed as structural and (or) functional software elements including subject matter data, input variables, decisions, restrictions, memory structures, and the like. All sites are potentially fault inherent and may cause a failure. The software input data set may be viewed as a site set. A software site is somewhat like a software path.

Let the site set structure define the total sites' number n(t) at different times t of the lifecycle and the occurrence probability pv(t) (v = 1, 2, …, n(t)) of the v-th site to be processed, which defines the failure probability 1 - pv(t) while processing this site. The failure probability is greater for sites that occur more rarely, because it is in general more difficult to sensitize faults by such exotic sites while testing.

It is mostly improbable that all n(t) sites will yield specified results, because it is impossible to implement software without faults. But even if all n(t) sites were to yield specified results, this fact would be undetectable, because in this case all n(t) sites would need to be tested. Practically this is impossible because of the great values of n(t) for almost any software product. Not all sites are processed even throughout the software lifecycle. To yield specified results at the required reliability level, in practice it is enough for a software product to have only s(t) assured fault-free sites. The number s(t) of potentially faulty (sensitive) sites, sensitized throughout testing by time t, defines the software test coverage as c(t) = s(t)/n(t) (0 < c(t) < 1) and is a known parameter of software reliability models.

It is natural in general to consider software operating at a given time moment t if there exists at least some fixed, beforehand-stated minimal part c(t) of operating (fault-free) sites of the total sites' number n(t).

The features (see Chapter 2) provide insight into the test process in general, including fault correction and regression testing procedures, thus refining the software lifecycle process according to the postulated features above.

Any software site may sensitize fault(s) and is a potentially fault inherent one.

Consider software as a system comprised of n(t) sites (elements) whose operating set consists of states, each including more than s(t) operating sites. Consider the sites pairwise independent and uniformly distributed over the total sites' number n(t) = const. Software reliability R(t) is evaluated by the known equation (2.5) as

 

R(t) = exp(-λ(t)·t),                                                                                 (3.1)

 

where λ(t) is a software failure intensity measured in faults per site and is evaluated (see formulas (2.6), …, (2.9) and the math proof in [1]) as follows:

 

-ln[1 - exp(-kL(t)·n(t)) + O(ln n(t))]  ≤  λ(t)  ≤  -ln[1 - exp(-kU(t)·n(t))],                       (3.2)

 

where

 

kL(t) = c(t)·ln(c(t)/pM(t)) + (1 - c(t))·ln((1 - c(t))/(1 - pM(t)));                                 (3.3)

 

kU(t) = c(t)·ln(c(t)/pS(t)) + (1 - c(t))·ln((1 - c(t))/(1 - pS(t)));                                 (3.4)

 

c(t) = s(t)/n(t)  (0 ≤ c(t) ≤ 1);

 

pM(t)=pL(t)/(1+pL(t)-pU(t));

 

pL(t) ≤ pv(t) ≤ pU(t) < 0.5;

 

pL(t) = min(pv(t))  (v = 1, 2, …, n(t));

 

pU(t) = max(pv(t))  (v = 1, 2, …, n(t));

 

pS(t) = (1/n(t)) · Σ_{v=1}^{n(t)} pv(t).                                                             (3.5)

We consider O(ln n(t)) to have some constant value that must be defined at large values of n(t); thus the minimum and maximum failure intensities according to (3.2), (3.3), and (3.4), respectively, are as follows:

 

λmin(t) = -ln[1 - exp(-kL(t)·n(t)) + O(ln n(t))]                                                     (3.6)

 

λmax(t) = -ln[1 - exp(-kU(t)·n(t))].                                                                 (3.7)

 

The Agile Software Reliability Monitoring Model (ASRMM) considers software as a system comprised of n(t) sites. The operating set of this system consists of states, each including more than s(t) operating sites. The software test process is defined in general according to the postulated features (Chapter 2), including fault correction and regression testing, thus refining the software lifecycle process as for achieving and supporting reliability requirements according to the math interrelations (3.1), (3.4), (3.7) of the total sites' number n(t), the test coverage c(t), the mean site's occurrence probability pS(t), and the failure intensity λ(t). Figure 1 displays these interrelations. The time flow 0 ≤ t ≤ 1 in the ASRMM is a mathematical one, and the t-unit is based on every site rv (v = 1, 2, …, n(t)) sensitized while testing.

To define the initial mean occurrence probability pS(t), imagine a software site rv (v = 1, …, n(t)) as a set of W(t) pairwise independent parameters xi (i = 1, …, W(t)). Each parameter may accept value(s) xij (j = 1, …, Ji(t)) with sij(t) values of each. These parameters define semantically sufficient parameter values that are sensitive as for yielding specified results or, otherwise, values that may sensitize faults in the software under test. These values are supposed to yield semantically typical results. The software sites' set structure may be viewed as a software sensitive sites' semantics matrix (SSSSM).

Semantics of the parameter values is defined by the subject matter and the particularities of the algorithm's implementation. For each parameter there may be defined a number Ji(t) of semantic types, having sij(t) (si(t) ≥ sij(t) ≥ 1) values of each type. So each parameter xi has si(t) = Σ_{j=1}^{Ji(t)} sij(t) values; thus the total sites' number is n(t) = ∏_{i=1}^{W(t)} si(t), and the number of sensitive sites is s0(t) = ∏_{i=1}^{W(t)} Ji(t) for a given project. Then at the specification cycle t = 0, according to (3.5), we have pS(0) = 1/s0(0). Let us name pS(0) the initial semantic mean of a software project under development. Values, denoted by t (0 ≤ t ≤ 1) time (one unit of t time is ts = 1/n(t)), vary throughout the development process due to refinements brought by customers, programmers, testers, developers, users, and the like. These refinements lead to changes of the total sites' number n(t) and the sensitive sites' number s0(t) because of detected faults, thus changing the semantic mean pS(t) value.
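The bookkeeping just described is straightforward to mechanize. Below is a minimal sketch in Python that derives n(0), s0(0), and the initial semantic mean pS(0) from an assumed, purely hypothetical SSSSM-like structure (a list of per-parameter semantic-type value counts).

# A minimal sketch deriving n(0), s0(0) and pS(0) from an assumed, hypothetical
# SSSSM-like structure: for each parameter x_i, a list of its semantic types'
# value counts s_ij.

from math import prod

def initial_semantics(sssm):
    s_i = [sum(types) for types in sssm]     # values per parameter: s_i = sum_j s_ij
    J_i = [len(types) for types in sssm]     # semantic types per parameter
    n0 = prod(s_i)                           # total sites' number n(0)
    s0 = prod(J_i)                           # sensitive sites' number s0(0)
    return n0, s0, 1.0 / s0                  # n(0), s0(0), initial semantic mean pS(0)

if __name__ == "__main__":
    # assumed example: x_1 has 3 semantic types with 4, 2 and 10 values,
    # x_2 has 2 semantic types with 5 and 1 values
    print(initial_semantics([[4, 2, 10], [5, 1]]))   # -> (96, 6, 0.1666...)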

While testing, the test coverage value c(t) increases, so the difference c(t) - pS(0) < 0 tends to 0, and when (see Figure 1)

c(t) - pS(0) = 0,                                                                                    (3.8)

according to (3.4) we have kU(t) = 0, which is the starting point of the failure intensity (3.7) decrease and of the software reliability (3.1) growth (see Figure 1). When the sensitive sites' number s0(0) is not defined, then according to (3.5) and (3.8) we have s0(0)/n(0) = 1/s0(0) and thus s0(0) = sqrt(n(0)).

Here is a general description of the software reliability monitoring algorithm based on the ASRMM.

The software site set structure is defined during specification as an SSSSM. Semantics of the site's parameter values is defined by the software subject matter and the particularities of the algorithm's implementation. The SSSSM is viewed as a database of the ASRMM engine (TESTERBOT) for generating tests and refining the semantic mean and semantic shift of the input data flow, thus providing monitoring of the software engineering process.

During software concept definition and input data specification at t = 0, before testing, we refine the s0(0) sensitive sites selected from the total sites' number n(0). Sites are ranked according to their occurrence probabilities pv(0) (v = 1, 2, …, n(0)), thus defining the initial semantic mean pS(0) of the input data flow.

Potential reliability metrics, evaluated according to (3.7) and compared with the required one λrq(1), define the appropriate test coverage value to be achieved to meet the requirements. The difference λrq(1) > λ(0) means that it is necessary to define the additional number of sensitive sites for testing to assure the achievement of the required reliability. The additional sensitive sites are defined and may be selected by the TESTERBOT in either exhaustive or extreme (random or semantic extrapolation) mode of the test process.

The test process must assure the achievement of the required test coverage value. While testing, fault(s) are detected and corrected, thus changing the sensitive sites' number s0(t + ts) := s0(t) + 1 for each fault and, in general, changing the number of values si(t) ≥ sij(t) ≥ 1 of the sites' parameters. The sensitive sites' parameter values are either predefined or randomly selected according to semantic ranges. If the total sites' number n(t) changes, then the t time unit ts is recalculated. If no fault is detected, the number of tested sites is changed s(t + ts) := s(t) + 1 for each tested site, thus increasing the test coverage c(t).

The value of t time is calculated as t := t + ts after each tested fault-free site.

Changes are calculated for n(t), s(t), s0(t), si(t), Ji(t), W(t), thus refining the semantic shift pS(0) - pS(t). If the total sites' number n(t) is changed, then the time is refined as t := t + ts = t + 1/n(t) only after each tested faulty site has been corrected and retested.

The reliability metric λ(t < 1), continuously evaluated as (3.7) throughout the testing process and compared with the required one, provides a possibility of making decisions on the testing process. The achieved reliability level, being continuously monitored, depicts changes either in subject matter requirements or in software implementation particularities, thus providing possibilities for optimization throughout the engineering process of software product development.
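The monitoring procedure above can be sketched as a simple loop. The following Python fragment is a simplified, hypothetical rendering (not the actual TESTERBOT implementation): it assumes time advances one ts per tested site, that every detected fault is corrected immediately, and that only the upper bound (3.7) is tracked against the required failure intensity.

# A simplified, hypothetical monitoring loop (not the actual TESTERBOT).

import math, random

def k_upper(c, pS):
    return c * math.log(c / pS) + (1 - c) * math.log((1 - c) / (1 - pS))   # (3.4)

def monitor(n, s0, outcomes, lambda_required):
    """outcomes: iterable of booleans, True = the tested site sensitized a fault."""
    s_tested, t, ts = 0, 0.0, 1.0 / n
    for faulty in outcomes:
        if faulty:
            s0 += 1                          # a corrected fault enlarges s0(t)
        s_tested += 1                        # the site is fault free after (re)testing
        t += ts
        c, pS = s_tested / n, 1.0 / s0       # coverage c(t) and semantic mean pS(t)
        if c <= pS:
            continue                         # before the crossover point (3.8)
        lam_max = -math.log(1 - math.exp(-k_upper(c, pS) * n))   # (3.7)
        if lam_max <= lambda_required:
            return t, c, lam_max             # requirement met: testing may stop
    return t, s_tested / n, None             # requirement not yet met

if __name__ == "__main__":
    random.seed(1)                           # assumed toy project, 2 % fault rate
    outcomes = (random.random() < 0.02 for _ in range(5000))
    print(monitor(10_000, 100, outcomes, lambda_required=0.00621))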

The ASRMM (see also [2]), as a foundation of Testerbot engineering, yields validated and verifiable reliability evaluations, integrating qualitative subject matter data and quantitative software reliability metrics. Testerbot engineering provides continuous reliability evaluations throughout the software engineering process, thus accounting for and tracing quantitative software reliability requirements from customer to product.

Consider a customer's four sigma reliability requirement for a software project with a total number of a trillion (10^12) software sites of input data set elements. At the beginning of the concept definition, the TESTERBOT preliminary evaluations yield the minimum needed software test coverage as 0.0000010032, or 1,003,200 tests, to assure the achievement of the required four sigma software reliability (0.00621 faults per site). To improve the reliability up to six sigma (2.0·10^-9 faults per site), the extra needed schedule-budget spending on testing and, if needed, on reengineering must be increased 1.003 times. The minimum failure intensity for the given n(t) is λmin(t) = 1/n(t), thus defining the Enough sigma test coverage estimations to achieve the maximum possible reliability for the given software project. To achieve the maximum (Enough sigma) reliability, the engineering efforts must be 1.0042-fold. It seems that Six Sigma reliability may not be enough for large-scale, semantically sophisticated, and critical software projects.

During further refinement and specification of the software input data set structure at t = 0, before testing, the TESTERBOT refines the initial number s0(0) of sensitive sites selected from the total sites' number n(0). Sites are ranked by the TESTERBOT according to their occurrence probabilities, thus refining the initial semantic mean pS(0) of the input data flow. So the TESTERBOT evaluations yield the refined minimum needed software test coverage as 0.0000010016, or 1,001,600 tests, to assure the achievement of the required four sigma software reliability. To improve the reliability up to six sigma, the extra needed schedule-budget spending on testing and, if needed, on reengineering must be increased 1.00156 times. To achieve Enough sigma reliability, the engineering efforts must be 1.00211-fold.

The TESTERBOT application provides 0.37 % schedule-budget savings, or a decrease of 3720 tests, to achieve the Enough sigma reliability requirements for the given project.
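Coverage sizing of this kind can be reproduced approximately with a short search. The sketch below (Python) assumes the default s0(0) = sqrt(n(0)) from (3.8) and minimum fault flow, and looks for the smallest number of tests whose coverage drives λmax in (3.7) down to the sigma target; for the assumed n(0) = 10^9 it should land near the corresponding Figure 2 row, though exact figures depend on rounding and on the treatment of the O(ln n) term.

# A coverage-sizing sketch under stated assumptions: s0(0) = sqrt(n(0)),
# minimum fault flow (pS stays at 1/s0(0)), and a plain linear search for the
# smallest tests' number s whose coverage c = s/n meets the lambda target.

import math

def lam_max(c, pS, n):
    kU = c * math.log(c / pS) + (1 - c) * math.log((1 - c) / (1 - pS))   # (3.4)
    return -math.log(-math.expm1(-kU * n))   # (3.7), written via expm1 for stability

def minimal_tests(n, lambda_required):
    s0 = round(math.sqrt(n))                 # default sensitive sites' number s0(0)
    pS = 1.0 / s0
    s = s0 + 1                               # coverage must first exceed the semantic mean
    while lam_max(s / n, pS, n) > lambda_required:
        s += 1
    return s, s / n                          # tests' number and test coverage c

if __name__ == "__main__":
    n = 1_000_000_000                        # assumed project size
    for name, target in (("4 sigma", 0.00621), ("6 sigma", 2.0e-9), ("Enough sigma", 1.0 / n)):
        print(name, minimal_tests(n, target))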

The ASRMM results featuring the Testerbot optimization capabilities (see Figure 2) give an idea of the schedule-budget-reliability estimations of the needed test coverage changes due to refinements of the total software sites' number. These refinements are contributed to the project by the customer, developers, testers, and the like throughout the software engineering process.

Testerbot engineering assures the achievement of the required software test coverage value, thus defining the most accurate semantic mean compliance with the semantic target. Any detected and corrected fault decreases the semantic shift. While testing, fault(s) are detected and corrected, thus changing the initial sensitive sites' number s0(0) and, in general, the total sites' number n(0), and thus refining the needed test coverage c(t) to achieve the required failure intensity λ(t) for any sigma software reliability. The fault flow may vary from the minimum, when no faults are detected throughout the test process, to the maximum, when every tested site sensitizes fault(s). The ASRMM estimations of the minimum needed test coverage at minimum and maximum fault flow are shown in Figure 2. These online reliability evaluations provide a flexible decision-making Testerbot tool for software product lifecycle management.

The Testerbot software reliability engineering:

depicts the real software test process in either a distributed or a concurrent environment and may be viewed as an online TESTERBOT software development tool (PLM solution) to be applied for any test phase and strategy, including extreme (agile) and exhaustive ones;

needs no empirical fault data and may be viewed as the TESTERBOT tool institutionalizing all CMMI-SW/SE maturity levels;

provides protective features, contributing to safety programming and the data security approach;

may be viewed as a test bed or (and) a cradle for many existing or forthcoming software reliability models.

Each product released under the TESTERBOT tool may be equipped with a reliability e-certificate needed for acquisition, remedial, and continual improvement processes, thus suggesting a quantifiable improvement approach to system development standards.

 

 

 

4. Content Relevance Quantification Approach

 

Any content sensitizes sense or results while being processed by brain or computer. Content search aims to provide access to the results of content processing. These results are to be of enough quality to rely on, so the content search must be reliable enough to provide relevant content for processing.

Reliability of content search may be viewed as the probability of relevant content detection or coverage by applying a query. Irrelevance may be considered a deviation of content coverage results from customer requirements. The deviation may be defined as the compliance of query and content specifications with the search engine implementation. The quality of the subject matter query specification influences content search relevance.

In general we know nothing about the content to be searched except some ideas concerning queries and content search results.

The Content Relevance Quantification Model (CRQM) considers any content as a set of n queries with a queries' variety number, or sensitive queries' number, s ≤ n. Any content query qv (v = 1, 2, …, s) may sensitize sense and is a potentially sense inherent one. An input query, applied for sense detection (sensitization), may discover or recover content with some relevance to the search results.

Queries are viewed as structural and (or) functional content (document, collection, corpus) elements including subject matter data. Let the query set structure define the queries' variety number s and an occurrence probability pv (v = 1, 2, …, s) of the v-th query, which defines a probability of sensitizing sense in response to this query.

It is mostly improbable that an input query includes all s queries of the variety specifying the content under search, because of the great values of s for almost any content. To yield specified search results at the required relevance level, in practice it is enough for a content to have only sv sensitive queries. The number sv of potentially sensitive queries, sensitized throughout the content search, defines the content's semantic coverage as cv = sv/n and is a known parameter of search models.

It is natural to consider any content under search relevant enough if there exists at least some fixed, beforehand-stated number sv of sensitive queries as a minimal part cv of the total content queries' number n.

Consider any content as a system comprised of n queries whose sensitive set consists of states, each including more than sv sensitive queries. Consider the queries pairwise independent and uniformly distributed over the total queries' number n = const. Content relevance R, or the probability of sensitizing sense, is quantified according to (2.5) by the equation R = exp(-λ), or else according to (2.6), …, (2.9) as


1 - exp(-kL·n) + O(ln n)  ≥  R  ≥  1 - exp(-kU·n),                                                   (4.1)

 

where λ is the content's irrelevance intensity, defined by (2.6), …, (2.9) and measured as irrelevance (the probability of not sensitizing sense) per content query.

The content query set structure is defined while indexing as a content query semantics matrix (CQSM). Semantics of the query's parameter values is defined by the content subject matter and the particularities of the search engine's algorithm implementation. The CQSM may be considered a semantic quantifier's (SEQUANTIC tool) database for queries' indexing (quantification), refining the content's semantic mean pS, semantic coverage cv, and semantic shift cv - pS (Figure 1), thus providing content relevance quantification. The content search assures the achievement of the required semantic coverage value.

Consider a content structured as a total queries' number n with semantic mean pS = 1/n. The most relevant query for any content is the input query yielding Rv = 1, λv = 0. The most irrelevant input query, Rv → 0, in general is defined at cv = pS = pv and λv → ∞, so that the semantic coverage is equal to the semantic mean and to the occurrence probability of the query qv. A query discovers content with higher relevance the greater the inequality cv < pS. A query recovers content with higher relevance the greater the inequality cv > pS (Figure 1).

Content relevance R, being continuously evaluated throughout the search process and compared with the required one, provides a possibility for making decisions on searching. The achieved relevance level depicts changes either in subject matter requirements (input queries) or in search engine implementation particularities, thus providing possibilities for optimization throughout the content search engineering process.

The SEQUANTIC tool, as a content product lifecycle management solution, may be viewed as a test bed or (and) a cradle for many existing or forthcoming content relevance models and search engine implementations.

Each content product released under the SEQUANTIC tool may be equipped with a relevance e-certificate needed for acquisition, remedial, and continual improvement processes, thus suggesting a quantifiable improvement approach to content development standards.

The CRQM-based Sequantic engineering yields validated and verifiable relevance estimations, integrating qualitative subject matter queries and quantitative relevance metrics of content. Sequantic engineering provides continuous relevance evaluations throughout the content engineering process, thus accounting for and tracing quantitative content relevance requirements from customer to the content product.

The CRQM improves latent semantic indexing, especially for unknown and (or) heterogeneous collections, by increasing the relevance, precision, and recall of content search, including full text search. The CRQM may be used for data exploration and data integration tasks (due to its potential accuracy in quantifying the content's semantics), to solve heterogeneity problems, and to provide varied levels of querying services, which facilitates knowledge discovery at different levels of granularity.

The content above, as an example, may be structured as a one-term total queries' number n = 203 and a variety queries' number s = 23 with semantic mean pS = 1/s = 0.043 (see Figure 3). According to the (4.1), (2.7), …, (2.9) quantification, the most relevant recovery, λ1 = 1.66533E-15, may be performed by the "content" one-term query, and the most irrelevant recovery, λ4 = 0.836127403, may be performed by the "sensitize" query at the 0.433385606 relevance level. The query "term" may be considered a four sigma (λ ≤ 0.00621) relevant discovery query for the content above. The query "term" is a 0.99659028 relevance discovery query for the content example.

Consider the query "relevant deviation results" as an example. Its semantic coverage (see Figure 3) is evaluated as c = c3 + c19 + c16 = 0.064 + 0.010 + 0.035 = 0.109, and this query's recovery relevance according to (4.1), (2.7), …, (2.9) is not less than 0.999526985. Another example of content quantification is shown in [3].
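A sketch of this per-query quantification is given below in Python, under stated assumptions: one-term queries indexed by occurrence counts, semantic mean pS = 1/s, the coverage of a multi-term query taken as the sum of its terms' coverages, and λ taken from the upper bound of (4.1)/(2.8). The tiny term index is hypothetical, and exact figures depend on modelling details such as the treatment of the O(ln n) term, so results may differ slightly from Figure 3.

# A per-query quantification sketch under stated assumptions; the term index is hypothetical.

import math
from collections import Counter

def irrelevance(c, ps, n):
    kU = c * math.log(c / ps) + (1 - c) * math.log((1 - c) / (1 - ps))   # (2.8)
    return -math.log(-math.expm1(-kU * n))   # lambda = -ln(1 - exp(-kU*n))

def quantify(term_counts, query_terms):
    n = sum(term_counts.values())            # total one-term queries' number
    s = len(term_counts)                     # queries' variety number
    ps = 1.0 / s                             # semantic mean
    c = sum(term_counts[t] / n for t in query_terms)   # semantic coverage of the query
    lam = irrelevance(c, ps, n)
    return c, lam, math.exp(-lam)            # coverage, irrelevance intensity, relevance

if __name__ == "__main__":
    counts = Counter({"content": 41, "search": 22, "relevant": 13,
                      "deviation": 2, "results": 7, "term": 1})
    print(quantify(counts, ["relevant", "deviation", "results"]))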

The presented content relevance quantification approach may be applied to human-grammar content after resolving the known language ambiguities.

 

 

 

5. Conclusion

 

Many research and development directions may be initiated by the presented quantitative system engineering approach. This approach implies any affordable quantitative accuracy in the refinement of system elements suitable for the customer and the developer.

 

 

 

References

 

1. Флейшман, Б.С. Элементы теории потенциальной эффективности сложных систем. «Советское радио», Москва, 1971, 224 с.

    (Fleishman, B.S. Elements of the Theory of Potential Effectiveness of Complex Systems. Soviet Radio, Moscow, 1971, 224 pp., in Russian.)

 

2. Arkhipkin, Y. Enough Sigma Software Test Coverage Approach. http://aryur.narod.ru/

 

3. Arkhipkin, Y. Content Relevance Quantification Model. http://sequantic.narod.ru/

 

 

 

 

Figure 1. General interrelation of the test/semantic coverage c(t) (c), the software/content semantic mean pS(t) (pS), and the reliability/relevance metrics R(t) (R) and λ(t) (λ).

 

 

 

 

 

Figure 2. Testerbot reliability estimations yielded by the ASRMM for exhaustive testing

| Total software sites' variety number n(0) | Sigma software reliability requirement | Minimal test coverage at min fault flow (tests' #) | Minimal test coverage at max fault flow (tests' #) | Schedule-budget savings, % (tests' # decrease) |
| 1,024 | 4σ | 0.049 (51) | 0.04101562 (42) | 17.65 (9) |
| 1,024 | 6σ | 0.071 (73) | 0.051757813 (54) | 35.65 (19) |
| 1,024 | Enough sigma | 0.0537 (55) | 0.04296875 (44) | 20.00 (11) |
| 89,362 | 4σ | 0.00397 (355) | 0.00367046 (328) | 7.60 (27) |
| 89,362 | 6σ | 0.004644032 (415) | 0.003994987 (358) | 13.73 (57) |
| 89,362 | Enough sigma | 0.00430831 (385) | 0.00383832 (344) | 10.65 (41) |
| 17,343,286 | 4σ | 0.000252028 (4371) | 0.000246147 (4269) | 2.33 (102) |
| 17,343,286 | 6σ | 0.000264079 (4580) | 0.000252144 (4375) | 4.47 (205) |
| 17,343,286 | Enough sigma | 0.00026206 (4545) | 0.00025104 (4354) | 4.20 (191) |
| 1,000,000,000 | 4σ | 0.000032182 (32,182) | 0.0000319 (31,900) | 0.87 (282) |
| 1,000,000,000 | 6σ | 0.000032745 (32,745) | 0.000032189 (32,189) | 1.70 (556) |
| 1,000,000,000 | Enough sigma | 0.00003277 (32,770) | 0.00003219 (32,190) | 1.77 (580) |
| 102,500,700,000 | 4σ | 0.0000031411 (321,965) | 0.0000031322 (321,054) | 0.28 (911) |
| 102,500,700,000 | 6σ | 0.0000031584 (323,739) | 0.000003140 (321,856) | 0.59 (1883) |
| 102,500,700,000 | Enough sigma | 0.00000316286 (324,196) | 0.00000314316 (322,176) | 0.62 (2020) |
| 1,000,000,000,000 | 4σ | 0.0000010032 (1,003,200) | 0.0000010016 (1,001,600) | 0.16 (1600) |
| 1,000,000,000,000 | 6σ | 0.0000010063 (1,006,300) | 0.00000100317 (1,003,170) | 0.31 (3130) |
| 1,000,000,000,000 | Enough sigma | 0.00000100744 (1,007,440) | 0.00000100372 (1,003,720) | 0.37 (3720) |

 

 

 

Figure 3. The CRQM relevance quantification example for the Chapter 4 content

| v | Query set structure qv | Occurrence # nv | Occurrence probability pv = nv/n | Semantic coverage cv | Irrelevance intensity λv | Relevance Rv |
| 1 | content | 41 | 0.202 | 0.202 | 1.66533E-15 | ~1.0 |
| 2 | search | 22 | 0.108 | 0.108 | 0.000579228 | 0.99942094 |
| 3 | relevant | 13 | 0.064 | 0.064 | 0.488453851 | 0.613574339 |
| 4 | sensitize | 12 | 0.059 | 0.059 | 0.836127403 | 0.433385606 |
| 5 | coverage | 5 | 0.025 | 0.025 | 0.497992614 | 0.607749424 |
| 6 | query | 32 | 0.158 | 0.158 | 2.37465E-09 | 0.999999998 |
| 7 | reliable | 4 | 0.020 | 0.020 | 0.221295993 | 0.801479413 |
| 8 | discover | 3 | 0.015 | 0.015 | 0.080497264 | 0.922657428 |
| 9 | recover | 2 | 0.010 | 0.010 | 0.021461497 | 0.978767162 |
| 10 | irrelevance | 3 | 0.015 | 0.015 | 0.080497264 | 0.922657428 |
| 11 | sense | 6 | 0.030 | 0.030 | 0.990193344 | 0.371504856 |
| 12 | semantics | 8 | 0.040 | 0.040 | 3.796293118 | 0.022453852 |
| 13 | engine | 8 | 0.040 | 0.040 | 3.796293118 | 0.022453852 |
| 14 | quantify | 7 | 0.035 | 0.035 | 1.865576521 | 0.154806935 |
| 15 | processing | 3 | 0.015 | 0.015 | 0.080497264 | 0.922657428 |
| 16 | results | 7 | 0.035 | 0.035 | 1.865576521 | 0.154806935 |
| 17 | probability | 4 | 0.020 | 0.020 | 0.221295993 | 0.801479413 |
| 18 | requirements | 4 | 0.020 | 0.020 | 0.221295993 | 0.801479413 |
| 19 | deviation | 2 | 0.010 | 0.010 | 0.021461497 | 0.978767162 |
| 20 | implementation | 4 | 0.020 | 0.020 | 0.221295993 | 0.801479413 |
| 21 | specification | 4 | 0.020 | 0.020 | 0.221295993 | 0.801479413 |
| 22 | collection | 2 | 0.010 | 0.010 | 0.021461497 | 0.978767162 |
| 23 | term | 1 | 0.005 | 0.005 | 0.003415546 | 0.99659028 |

 

Copyright © 2007 by Y. Arkhipkin


