SOC News: Article

One Article Review


Source	ProofPoint
Identifier	8430359
Publication date	2023-12-28 14:18:07 (viewed: 2023-12-28 17:08:38)
Title	Designing a Cost-Efficient, Petabyte-Scale Mutable Full Text Index
Text	Engineering Insights is an ongoing blog series that gives a behind-the-scenes look into the technical challenges, lessons and advances that help our customers protect people and defend data every day. Each post is a firsthand account by one of our engineers about the process that led up to a Proofpoint innovation.

At Proofpoint, running a cost-effective, full-text search engine for compliance use cases is an imperative. Proofpoint customers expect to be able to find documents in multi-petabyte archives for legal and compliance reasons. They also need to index and perform searches quickly to meet these use cases. However, creating full-text search indexes with Proofpoint Enterprise Archive can be costly, so we devote considerable effort toward keeping those costs down. In this blog post, we explore some of the ways we do that while still supporting our customers' requirements.

Separating mutable and immutable data

One of the most important and easiest ways to reduce costs is to separate mutable and immutable data. This approach doesn't fit every use case, but it fits the Proofpoint Enterprise Archive well. For archiving use cases, and especially for SEC 17a-4 compliance, data that is indexed can't be modified. That includes data such as the text in message bodies and attachments.

The Proofpoint Enterprise Archive also has features that require data to be stored and mutated alongside a message, in accordance with U.S. Securities and Exchange Commission (SEC) compliance. (For example, which folders a message belongs to, and which legal matters a message pertains to.)

To summarize, we have large immutable indexes and small mutable indexes. By separating data into mutable and immutable categories, we can index these datasets separately, and we can use different infrastructure and provisioning rules to manage each of them. Using different infrastructure allows us to optimize the cost of each independently.

[Figure: Comparing the relative sizes of mutable and immutable indexes.]
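As an illustration of the split described above, here is a minimal Python sketch that divides one archived message into an immutable record (body and attachment text) and a mutable record (folder membership and legal matters). The class names, field names and the input dictionary layout are all hypothetical; the post does not describe Proofpoint's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field, FrozenInstanceError


@dataclass(frozen=True)
class ImmutableRecord:
    """Indexed once and never changed (hypothetical schema)."""
    message_id: str
    body_text: str
    attachment_text: tuple[str, ...]


@dataclass
class MutableRecord:
    """Small companion record that may change over time (hypothetical schema)."""
    message_id: str
    folders: set[str] = field(default_factory=set)
    legal_matters: set[str] = field(default_factory=set)


def split_message(message: dict) -> tuple[ImmutableRecord, MutableRecord]:
    """Split one message into the parts destined for each kind of index."""
    immutable = ImmutableRecord(
        message_id=message["id"],
        body_text=message.get("body", ""),
        attachment_text=tuple(message.get("attachments", ())),
    )
    mutable = MutableRecord(
        message_id=message["id"],
        folders=set(message.get("folders", ())),
        legal_matters=set(message.get("legal_matters", ())),
    )
    return immutable, mutable
```

Both records carry the same `message_id`, so they can be matched up again at query time, while each one can be sent to infrastructure sized for its own workload.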
Immutable index capacity planning and cost

Normally, full-text search indexes must be provisioned to handle the load of initial write operations, any subsequent update operations and read operations. By indexing immutable data separately, we no longer need to provision enough capacity to handle the subsequent update operations. This requires fewer IO operations overall. To reduce IO needs further, the initial index population is managed carefully with explicit IO reservation. Sometimes, this means adding more capacity (nodes/servers/VMs) so that existing infrastructure is not overloaded by the additional IO.

When you mutate indexes, it is typically best practice to leave an abundance of free disk space to support the index merge operations that occur on update. In some cases, this can be as much as 50% free disk space. With immutable indexes, you don't need so much spare capacity, and that helps to reduce costs.

In summary, the following design choices help keep costs down: reduced IO needs, because documents do not mutate; reduced disk space requirements, because free space for mutation isn't needed; and careful IO planning during the initial population, which lowers IO requirements.

Mutable index capacity planning and cost

Meanwhile, mutable indexes benefit from standard practices. They can't be provisioned with the same reduced capacity as immutable indexes. However, given that they're a fraction of the size, it's a good trade-off.

[Figure: Comparing the relative free disk space of mutable and immutable indexes.]

Optimized join with custom partitioning and routing

In a distributed database, join operations can be expensive. We often have tens to hundreds of billions of documents for the archiving use case. When both sides of the join operation have large cardinality, it's impractical to use a generalized approach to join the mutable and immutable data.

To make this high-cardinality join practical, we partition the data in the same way for both the mutable and immutable data.
As a result, we end up with a one-t
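The idea of partitioning both datasets the same way can be sketched in Python as follows. This is only an illustration under stated assumptions: the hash-based `route` function, the partition count and all names are hypothetical, and the excerpt does not say how Proofpoint actually partitions or routes its data. The point is that when both sides use the same routing function, matching records always land in the same partition, so each partition can be joined locally.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical value, for illustration only


def route(message_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message ID to a partition with a stable hash (assumed scheme)."""
    digest = hashlib.sha256(message_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def partition(records, num_partitions: int = NUM_PARTITIONS):
    """Group (message_id, payload) pairs by partition number."""
    parts = defaultdict(dict)
    for message_id, payload in records:
        parts[route(message_id, num_partitions)][message_id] = payload
    return parts


def local_join(immutable_part: dict, mutable_part: dict) -> dict:
    """Join two partitions that are known to share the same routing."""
    shared = immutable_part.keys() & mutable_part.keys()
    return {mid: (immutable_part[mid], mutable_part[mid]) for mid in shared}


def co_partitioned_join(immutable_records, mutable_records,
                        num_partitions: int = NUM_PARTITIONS) -> dict:
    """Join partition by partition; no data crosses partition boundaries."""
    imm = partition(immutable_records, num_partitions)
    mut = partition(mutable_records, num_partitions)
    joined = {}
    for p in range(num_partitions):
        joined.update(local_join(imm.get(p, {}), mut.get(p, {})))
    return joined
```

In a real cluster each `local_join` would run on the node that holds that partition, so the loop over partitions could run in parallel.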
Rating	★★★
Sent	Yes
Tags	Cloud Technical

The article does not appear to have been picked up after its publication.

The article does not appear to be a follow-up to an earlier one.