Building your own Semantic Search Implementation

Featured

Incorporating semantic search on your website can significantly enhance the discoverability of content. Semantic search is a feature available in Optimizely’s ‘Content Graph’, as well as in dedicated search providers such as Algolia, Coveo, and Hawksearch.

I wanted to investigate what other options are available, primarily to gain knowledge but also to see if it was possible to develop a solution and have control over the ML models being used.

So far, I have barely scratched the surface, but I have learnt a lot, and I wanted to share this knowledge via this blog post. I have also created a demo application, which is available on GitHub (https://github.com/andrewmarkham/Machine-Learning).

Technology Used

Large Language Models (LLMs) and Vector Databases are the foundational elements for building Semantic / Neural search solutions.

Python is the dominant language for developing AI / ML applications, but Java and JavaScript also have good support. I used JavaScript for the demo application.

Large Language Models

Large Language Models are used to process natural language and are trained to perform different tasks, such as Sentence Similarity. The model transforms the input text into a vector embedding, which is stored in a vector database for querying.

A vector embedding is a representation of source data (text, image, etc.) as a multidimensional numerical array. The number of dimensions and representation of the data are specific to the model, so you cannot mix and match models when indexing and querying.
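As a rough illustration, an embedding is just an array of numbers, and similarity between two embeddings is typically measured with cosine similarity. The sketch below uses made-up three-dimensional vectors; the model used later in this post produces 384-dimensional ones.

// Illustrative values only: real embeddings have hundreds of dimensions.
const queryEmbedding = [0.12, -0.58, 0.80];
const passageEmbedding = [0.10, -0.55, 0.83];

// Cosine similarity: a value close to 1.0 means the vectors point in the same direction (a strong match).
function cosineSimilarity(a: number[], b: number[]): number {
    const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
    const magnitudeA = Math.sqrt(a.reduce((sum, v) => sum + v * v, 0));
    const magnitudeB = Math.sqrt(b.reduce((sum, v) => sum + v * v, 0));
    return dot / (magnitudeA * magnitudeB);
}

console.log(cosineSimilarity(queryEmbedding, passageEmbedding)); // ~0.999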

I am using a model from Huggingface; this website is a fantastic resource for accessing different models and datasets.

Vector Database

A vector database stores the data (text is transformed into embeddings and persisted). To perform a query (semantic search), you transform the search query into an embedding and then perform a query with the embedding against the database.

There are many options when sourcing a vector database. MongoDB and Elasticsearch now include this functionality, or, if you are looking for dedicated services, Pinecone and Milvus are two options. In my demo application, I use Postgres with the ‘pgvector’ extension enabled; this adds vector support and can run within a Docker container.

Demo Application

I have created an application that demonstrates two functions:

  1. How to create embeddings from text input and store them in a database
  2. How to search for content semantically.

Create the database

docker-compose.yml

services:
  db:
    hostname: db
    image: ankane/pgvector
    ports:
      - 5432:5432
    restart: always
    environment:
      - POSTGRES_DB=vectordb
      - POSTGRES_USER=testuser
      - POSTGRES_PASSWORD=testpwd
      - POSTGRES_HOST_AUTH_METHOD=trust
    volumes:
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql

Note: the ankane/pgvector image already has the pgvector extension installed.

Create the docker-compose.yml file shown above, then start the instance using the command docker-compose up -d. This will start a new container and initialise the database using the init.sql script below.

/* This file is used to initialise the vector extension in the database */
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS Articles (
    id SERIAL PRIMARY KEY,
    embedding vector(384), /* 384 is the dimension of the embedding model */
    text text,
    created_at timestamptz DEFAULT now()
);

Indexing and searching

The demo application includes a Node.js service (search-server) that 1) indexes text and 2) searches for content. Both these functions use the Hugging Face inference library (see: https://github.com/huggingface/huggingface.js).

Indexing

app.post('/', async (req: Request, res: Response) => {
    const { text, id } = req.body;

    const hfInference = new HfInference(HUGGINGFACE);

    const embeddings = await hfInference.featureExtraction({
        model: "intfloat/e5-small-v2",
        inputs: `passage: ${text}`
    }) as [];

    await addRecordToTable(text, embeddings);

    // Send a response
    res.status(200).json({ message: 'Data received successfully' });
});

The code above uses the Hugging Face API to produce the embedding. The ‘model‘ parameter tells Hugging Face which ML model to use, and the ‘inputs‘ parameter is the text used to create the embeddings.

Note: the ‘passage‘ prefix is a directive the model requires.

async function addRecordToTable(text, embeddings: []) {

    // Get a PostgreSQL connection pool
    var pool = getPool();

    const client = await pool.connect();

    var s = JSON.stringify(embeddings);

    try {
        await client.query('BEGIN');
        await client.query('INSERT INTO Articles (text, embedding) VALUES ($1, $2)', [text, s]);
        await client.query('COMMIT');
        console.log('Record added successfully!');
    } catch (error) {
        await client.query('ROLLBACK');
        console.error('Error adding record:', error);
    } finally {
        client.release();
    }
};

The code above adds the text and associated embeddings to the database.

Searching

app.get('/search', async (req: Request, res: Response) => {
    const { text } = req.query;

    if (typeof text === "string") {
        const hfInference = new HfInference(HUGGINGFACE)
        const embeddings = await hfInference.featureExtraction({
            model: "intfloat/e5-small-v2",
            inputs: `query: ${text}`
        }) as [];

        var results = await query(embeddings);

        res.status(200).json(results.rows);
    }
    else {
        console.log("text is not a string");
        res.status(500).json({ message: 'text is not a string' });
    }
});

Searching is very straightforward. The submitted search phrase is used to create an embedding, which is then used to query the database.

Note: the prefix ‘query‘ is another directive the ML model requires.

async function query(embeddings: []) {
    // Get a PostgreSQL connection pool
    var pool = getPool();

    const client = await pool.connect();
    var s = JSON.stringify(embeddings);
    try {
        await client.query('BEGIN');

        const res = await client.query('SELECT text, 1 - (embedding <=> $1) AS cosine_similarity FROM Articles ORDER BY cosine_similarity desc LIMIT 5', [s]);

        await client.query('COMMIT');

        return res;
    } catch (error) {
        await client.query('ROLLBACK');
        console.error('Error executing query:', error);
    } finally {
        client.release();
    }
}

The method above contains the SQL query that is run against the database and returns records similar to the search phrase.

You may have noticed the ‘<=>‘ operator; this means ‘cosine distance’ and is a calculation used to determine similarity. In this instance, the closer the result is to 1.0, the better the match.

Other options are:

Operator  Description
+         element-wise addition
-         element-wise subtraction
*         element-wise multiplication
<->       Euclidean distance
<#>       negative inner product
<=>       cosine distance

pgvector operators; read more in the pgvector documentation.

Note: Some ML models will only work with specific operators
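For example, ranking by Euclidean distance instead of cosine similarity only changes the operator and the sort direction: a distance of 0 is a perfect match, so you order ascending. A minimal sketch, reusing the ‘getPool‘ helper from the demo application:

async function queryByEuclideanDistance(embeddings: []) {
    const pool = getPool();
    const client = await pool.connect();

    try {
        // '<->' returns the Euclidean (L2) distance, so the smallest value is the best match.
        return await client.query(
            'SELECT text, embedding <-> $1 AS distance FROM Articles ORDER BY distance ASC LIMIT 5',
            [JSON.stringify(embeddings)]);
    } finally {
        client.release();
    }
}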

It is beyond the scope of this blog to discuss the concepts covered in greater detail. I have attached links to other websites at the end of the blog if you want more detailed information.

Demo

The video below demonstrates the demo application’s various semantic search capabilities.

Semantic Search Demo

The demo site, test data, and a Postman collection for indexing the test data are all available in the GitHub repository: https://github.com/andrewmarkham/Machine-Learning

Conclusion

I have always considered semantic search or anything AI-related to be limited to using a third-party service, but as I hope this blog post demonstrates, this is not the case.

This doesn’t mean I advocate building your own semantic search solution over using a recognised search provider; rather, you may want to augment the existing search or deliver other use cases such as classification or image/voice search.

This blog post delivers the concepts at a very high level. I am starting to learn more about this subject area and how I can use it to build my own solutions. A wealth of information is available on the internet to gain more knowledge and learn how to leverage these tools yourself.

What, no OpenAI?

I can’t write a blog post about AI and not mention OpenAI.

You can use OpenAI to create embeddings, but I chose not to, as I wanted to demonstrate alternatives. The same is true for vector databases. Many other options are available, and Pinecone seems to be one of the leaders.
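For comparison, a minimal sketch of what the OpenAI route might look like, assuming the official ‘openai‘ npm package; the model name is one possible choice:

import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function createEmbedding(text: string): Promise<number[]> {
    // text-embedding-3-small returns 1536-dimensional vectors by default,
    // so the vector(384) column used earlier would need a matching dimension.
    const response = await openai.embeddings.create({
        model: 'text-embedding-3-small',
        input: text
    });

    return response.data[0].embedding;
}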

Useful links

Introducing Jhoose Security Module V2.0

Featured

Version V2.0 of the Jhoose Security module has been released and is available via the Optimizely nuget feed.

This update not only squashes several bugs, it also introduces several new features to help secure your website.

Removed support for CMS 11

I have taken this opportunity to remove support for CMS11 as I felt that it was sensible to simplify the solution and focus on targeting the latest version of the CMS.

The module now only supports .NET 6, 7 and 8.

User interface to manage Security Response Headers

In previous versions of the module it was possible to control the Security Response headers only through configuration, but I have now introduced a new optional user interface to manage these headers. This gives extra control, allowing administrators to easily change the settings post-deployment.

Enable User Interface

services.AddJhooseSecurity(_configuration, (o) =>
{
    o.UseHeadersUI = true;
});
User Interface for Security Headers

Adding the user interface is entirely optional, but if you enable it you will need to review the configuration, as the existing settings are not transferred over.

Authentication Policy Overrides

By default any user with the CMSAdmins security role can access the module, but it is possible to change this to an alternate role if required.

services.AddJhooseSecurity(_configuration,
configurePolicy: (p) =>
{
p.RequireRole("CspAdmin");
});

API Access

The security headers can be accessed via a REST API; this is useful if you are using Optimizely to manage the content but not the presentation.

Access to the REST API is secured by authentication keys; each consumer must include a valid key in the request header. Authentication keys are managed within the module.

Webhooks

Consumers can register a webhook, which will then be called whenever a change is made to either the Content Security Policy or the Security Headers.

It is recommended that consumers take this approach, as caching the headers and only refreshing them when changes occur will help performance.
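As an illustration of that recommendation, a consumer could expose a small endpoint, register it as the webhook target, and simply drop its cached copy of the headers whenever it is called. A minimal sketch for a Node.js consumer; the endpoint path and cache variable are hypothetical, and the webhook payload is ignored here:

import express from 'express';

const app = express();

// Hypothetical in-memory cache of the headers fetched from the Jhoose REST API.
let cachedHeaders: unknown | null = null;

// Registered as the webhook target; called whenever the CSP or security headers change.
app.post('/webhooks/jhoose-security', express.json(), (_req, res) => {
    cachedHeaders = null; // invalidate, so the next request re-fetches via the REST API
    res.sendStatus(200);
});

app.listen(3001);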

Example

POST /api/jhoose/headers HTTP/1.1
Accept: application/json
Content-Type: application/json
X-API-Key: ...

{ "nonce": "1234567890" }

Note: Nonce

Each request should include a different nonce value. If you are following the recommendations and caching the response, then you should also change the nonce value within the cached response.

Conclusion

More information can be found in my GitHub repo. Any suggestions or comments are greatly appreciated.

Exploring the Optimizely GraphQL API to deliver a high-performing static site – Part 1.

Featured

In this series of articles, I will demonstrate how to deliver a site using Optimizely, Next.js and Vercel. In this first instalment, I am focusing on creating a simple POC that reproduces a simplified version of the AlloyTech website, statically generated using Next.js and hosted on Vercel.

Solution Architecture

Optimizely is used to manage the content, with any content updates synced to the Content Graph.

The presentation layer is developed using Next.js, a React framework. Next.js can generate a static site at build time, handle server-side rendering, or use a combination of both.

Vercel is a hosting platform with a global edge network, so content is cached around the world.

Step 1 – Managing the content

The first step is straightforward, and once complete you will end up with an instance of AlloyTech with its content synced to the Content Graph.

Install AlloyTech

Install, build and then run the demo site using the commands below. The first time you run the solution you will be prompted to create an admin account; once this is completed you will have a site you can use for testing.

dotnet new epi-alloy-mvc
dotnet build
dotnet run

Install Content Graph

Follow the steps below to add and configure Content Graph in the AlloyTech demo site. You will need to contact Optimizely to gain access to an AppKey.

dotnet add package Optimizely.ContentGraph.Cms

appsettings.json
  "Optimizely": {
    "ContentGraph": {
      "GatewayAddress": "https://cg.optimizely.com",
      "AppKey": "",
      "Secret": "",
      "SingleKey": "",
      "AllowSendingLog": "true"
    }
  }

Once installed and configured you can then sync your content to the content graph using the scheduled job ‘Content Graph content synchronization job’. Content will also get synced when published.

Step 2 – Develop the site with Next.js

Creating a new site with Next.js is very simple, but take a look at https://nextjs.org/docs/getting-started for more detailed instructions.

npx create-next-app@latest --typescript
npm run dev

The site can now be accessed by typing ‘http://localhost:3000/’ in your browser.

Next.js doesn’t include any GraphQL libraries, so I added the Apollo Client package for this.

npm install @apollo/client graphql

Recreating the Homepage

I want to produce a simplified version of the AlloyTech home page. I will render the primary navigation along with the blocks from the main content area; this is analogous to the approach in the C# version of the page.

The page has no knowledge of where the data comes from; this is handled in a separate step. It just uses the object passed in via the ‘props’.

type PageProps = {
  page: any;
  navigation: any;
};

function Home(props: PageProps) {
  const { page, navigation } = props;
  return (
    <>
      <Head>
        <title>{page.MetaTitle}</title>
        <meta name="description" content={page.MetaDescription} />
        <meta name="viewport" content="width=device-width, initial-scale=1" />
        <link rel="icon" href="/favicon.ico" />
      </Head>

      <MainNavigation navigation={navigation}/>
      
      <main className={styles.main}>
        <ContentAreaRenderer items={page.MainContentArea} />
      </main>
    </>
  )
}

Each Next.js page can include a ‘getStaticProps‘ function which is used during the build process to return the props used in the render. This is where we query the Content Graph to get the data for the home page (and navigation).

Note: ‘getStaticProps’ is only used for static site generation; a different method is used for server-side rendering.
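For reference, the server-side equivalent is ‘getServerSideProps’, which runs on every request instead of at build time. A minimal sketch, not used in this POC, with the Content Graph query omitted:

import { GetServerSideProps } from 'next';

export const getServerSideProps: GetServerSideProps = async (context) => {
  // Same idea as getStaticProps, but executed for every incoming request,
  // so the Content Graph would be queried here on each page view.
  return {
    props: {
      page: {},       // placeholder: query the Content Graph here
      navigation: {}  // placeholder
    }
  };
};

The static version used for the home page looks like this: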

export const getStaticProps: GetStaticProps = async (context) => {

  const httpLink = new HttpLink({ uri: process.env.GRAPHQL_HOST });

  const client = new ApolloClient({
    link: httpLink,
    cache: new InMemoryCache(),
    ssrMode: true
  });
 
  var { data } = await client.query({
    query: StartPageQuery
  })

  var startPage = data.StartPage.items[0];

  var { data } = await client.query({
    query: NavigationQuery
  })
  
  var navigation = data.StartPage.items[0];

  console.log(navigation)

  return {
    props: {
      page: startPage,
      navigation: navigation
    },
  }
}

GraphQL query to get the home page.

import { gql } from '@apollo/client';

const StartPageQuery = gql`
query MyQuery {
  StartPage(locale: en) {
    items {
      Name
      TeaserText
      RouteSegment
      MetaTitle
      MetaKeywords
      MetaDescription
      MainContentArea {
        DisplayOption
        Tag
        ContentLink {
          Id
          Expanded {
            Name
            ContentType
            ... on JumbotronBlock {
              Name
              Heading
              Image {
                Url
              }
              ButtonText
              ContentType
              SubHeading
            }
            ... on TeaserBlock {
              _score
              Name
              Image {
                Url
              }
              Heading
              Text
            }
          }
        }
      }
    }
  }
}`
export default StartPageQuery

Product Pages

The home page is a simple example, but what happens when you have lots of content that uses the same template? In AlloyTech there are three product pages accessed as child pages of the home page.

Routing

The naming convention of ‘[product-slug].tsx‘ signifies that the page is a dynamic route. The name within the square brackets ‘[]‘ is not important.

Next.js goes into more detail here: https://nextjs.org/docs/routing/introduction.
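As a rough sketch (my own simplified markup, not the actual AlloyTech template), the component exported from ‘[product-slug].tsx‘ might render the fields returned by the product page query shown later, reusing the ‘MainNavigation‘ component from the home page:

type ProductPageProps = {
  page: any;
  navigation: any;
};

function ProductPage({ page, navigation }: ProductPageProps) {
  return (
    <>
      <MainNavigation navigation={navigation} />
      <main>
        <h1>{page.Name}</h1>
        {/* MainBody is rich text from the CMS, so it is rendered as raw HTML */}
        <div dangerouslySetInnerHTML={{ __html: page.MainBody }} />
      </main>
    </>
  );
}

export default ProductPage;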

Generating the Routes

Much like the ‘getStaticProps‘ function, Next.js has an approach for generating the routes, ‘getStaticPaths‘. This is also called at build time.

export const getStaticPaths: GetStaticPaths = async () => {
    const httpLink = new HttpLink({ uri: process.env.GRAPHQL_HOST });

    const client = new ApolloClient({
      link: httpLink,
      cache: new InMemoryCache(),
      ssrMode: true
    });
   
    var { data } = await client.query({
      query: gql`query ProductPagesQuery {
        ProductPage(locale: en) {
          items {
            Name
            RouteSegment
          }
        }
      }`
    })
    var pages = data.ProductPage.items;

    const paths = pages.map((page: any) => ({
      params: { slug: page.RouteSegment}, locale: 'en',
    }));
  
    return { paths, fallback: false };
  };

Generating the page

‘getStaticPaths’ is responsible for building all the routes; each route is then used to generate a single page, with the route data being passed to ‘getStaticProps‘.

export const getStaticProps: GetStaticProps = async ({params}) => {

  if (!params || !params.slug) {
    return { props: {} };
  }

  const httpLink = new HttpLink({ uri: process.env.GRAPHQL_HOST });

  const client = new ApolloClient({
    link: httpLink,
    cache: new InMemoryCache(),
    ssrMode: true
  });
 
  var { data } = await client.query({
    query: ProductPageQuery,
    variables: {
      segment: params.slug
    }
  })

  var page = data.ProductPage.items[0];

  var { data } = await client.query({
    query: NavigationQuery
  })
  
  var navigation = data.StartPage.items[0];
  return {
    props: {
      page: page,
      navigation: navigation
    },
  }
}

The following GraphQL query gets the specific page matching the route

import { gql } from '@apollo/client';

const ProductPageQuery = gql`
query ProductPageQuery($segment: String) {
  ProductPage(locale: en, where: {RouteSegment: {eq: $segment}}) {
    items {
      Name
      MetaTitle
      MetaKeywords
      MetaDescription
      MainBody
      TeaserText
      RelativePath
      PageImage {
        Url
      }
      RouteSegment
    }
  }
}
`
export default ProductPageQuery

Content Areas / Blocks

For this POC I created my own Content Area Renderer, as this is an Optimizely concept which requires custom development within your Next.js site.

The approach is very simple: the content area renderer iterates over each item and uses a factory to determine the component to render. The factory also returns the display option, allowing blocks to be rendered at different sizes.

function ContentAreaRenderer(props :any) {

    let items :any[] = props.items;

    var factory = new componentFactory()

    return(
        <div className={styles.container}>

        {items?.map(i => {

            const ContentAreaItem = factory.resolve(i);
            const Component = ContentAreaItem.Component;
            
            if (Component != null)
                return (
                <div className={ContentAreaItem.ItemClasses} key={i.ContentLink.Id}>
                    <Component item={i}  />
                </div>)
            else
                return null
        })}

        </div>
    )
}

The ‘componentFactory‘ gets the correct component to render, and also gets the correct display option.

class ContentAreaItem {
    ItemClasses: string;
    Component: any;

    constructor () {
        this.ItemClasses = "fullwidth"
    }
}
interface Dictionary<T> {
    [Key: string]: T;
}

class componentFactory {
  
    components: Dictionary<any> = {};

    constructor(){
        this.components["JumbotronBlock"] = JumbotronBlock;
        this.components["TeaserBlock"] = TeaserBlock;
    } 

    getType(item: any) : string {
        var contentTypes = item.ContentLink.Expanded.ContentType;
        return contentTypes[contentTypes.length - 1]; 
    }

    getDisplayOption(item: any) : string {
        return item.DisplayOption === "" ? "fullwidth" : item.DisplayOption; 
    }

    resolve(item: any): ContentAreaItem {
        var contentType: string = this.getType(item);

        var i = new ContentAreaItem();

        i.Component = this.components[contentType];
        i.ItemClasses = this.getDisplayOption(item);

        return i;
    }
}

Step 3 – Hosting the site with Vercel

Vercel is a platform for sites built using frontend frameworks; when your site is hosted with Vercel you automatically gain performance benefits due to edge caching.

Deployment

Deploying your site with Vercel is extremely straightforward.

Create a new project, connect it to the GitHub repo, and configure the source location and the build pipeline. Every time the branch is updated, the code will be built and the site automatically deployed.

This approach has some real benefits:

  1. Previewing all changes is simple. Each push to the repository will trigger a build and generate a unique URL that can be shared.
  2. It is possible to promote a previous version to ‘Production’, meaning rolling back is as simple as clicking a button.

Step 4 – Handling Content Changes

Static sites may deliver blistering performance, but they create a challenge when content is modified: the changes are not reflected until the site is regenerated.

There are several strategies you can adopt to help solve this problem.

  1. Per Request Revalidation – It is possible to regenerate a page when a request comes in, throttled so that a set number of seconds must have elapsed before the page can be regenerated again (see the sketch below).
  2. On-Demand Revalidation – You can expose an API endpoint that, when called, will regenerate a specific resource.

The problem with Per Request Revalidation is that we start moving from purely static generation back towards dynamic generation.
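With Per Request Revalidation (Next.js calls this Incremental Static Regeneration), the only change needed is to return a ‘revalidate’ window from ‘getStaticProps‘. A minimal sketch; the 60-second window and the ‘loadPagePropsFromContentGraph‘ helper are illustrative only:

import { GetStaticProps } from 'next';

export const getStaticProps: GetStaticProps = async () => {
  // Hypothetical helper wrapping the Apollo queries shown earlier.
  const props = await loadPagePropsFromContentGraph();

  return {
    props,
    // The page is served statically, but can be regenerated in the background
    // at most once every 60 seconds when new requests arrive.
    revalidate: 60,
  };
};

On-Demand Revalidation, shown below, instead exposes an API route that regenerates a specific page only when it is explicitly asked to.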

import type { NextApiRequest, NextApiResponse } from 'next'

type ErrorData = {
  message: string
}

type SuccessData = {
    revalidated: boolean,
    message: string
  }

export default async function handler(
    req: NextApiRequest,
    res: NextApiResponse<ErrorData | SuccessData>
) {
    if (req.query.secret !== process.env.REVALIDATE_TOKEN) {
      return res.status(401).json({ message: 'Invalid token' })
    }
  
    const { revalidatePath } = req.body;

    try {
      await res.revalidate(revalidatePath)
      return res.json({ message: revalidatePath, revalidated: true })
    } catch (err) {
      return res.status(500).send({ message: 'Error revalidating :' + revalidatePath })
    }
  }

In the example above I have exposed an API endpoint. The request body contains the path of the resource that needs to be invalidated. The ‘revalidate‘ function then triggers regeneration of the page.

public void Configure(IApplicationBuilder app, IWebHostEnvironment env, IContentEvents contentEvents)
{
    contentEvents.PublishedContent += ContentEvents_PublishedContent;
}

private void ContentEvents_PublishedContent(object sender, ContentEventArgs e)
{
    if (e.Content is IRoutable routableContent)
    {
        var url = UrlResolver.Current.GetUrl(e.ContentLink);

        Task.Run(() =>
        {
            var request = new RevalidateRequest { RevalidatePath = url };

            // wait 10 seconds to allow the published content to sync to the Content Graph
            Task.Delay(10000).Wait();

            var r = client.PostJsonAsync<RevalidateRequest>("/api/revalidate/?secret=...", request);

            Task.WaitAll(new[] { r });
        });
    }
}

The C# code above demonstrates how the Optimizely website triggers the revalidation in the static site. I have built in a 10-second delay to allow time for the content to be synced to the Content Graph.

Closing Thoughts

Whilst you are unlikely to model your content exactly as you would in a normal Optimizely website, this POC does demonstrate the core concepts of using Optimizely as a headless CMS.

Performance Benefits

Whilst not the most scientific of comparisons, the two Lighthouse reports below demonstrate the performance improvements you can gain when moving to a statically generated approach.

Next Article

In the next article I will be looking into using Optimizely in a more headless mode and will also demonstrate other features such as searching, content listing etc.

Examples

You can access the POC source code at my GitHub Account, and the static site at https://graph-ql-three.vercel.app/.

Rolling your site out to China

Featured

A nighttime image of the Bund in Shanghai.
Photo by Wolfram K on Pexels.com

If the website you are responsible for needs to be accessed within China there are some obstacles you need to overcome. If no action is taken then users within China will face intermittent performance issues, ranging from slow speeds to complete timeouts. Many of the tools and features we expect to use will no longer work. Google Analytics will not work, and videos hosted with YouTube or Vimeo will also fail.


Planning

These days websites depend on lots of features delivered by external 3rd parties and we generally take them for granted. But within mainland China you will find that a lot of these will probably fail, creating a disjointed experience for your visitors.

You should audit your site to identify all these external services and check whether they work in China. If they don’t you will need to decide how they will be replaced, or potentially removed.

Google Analytics / Tag manager

Baidu Tongji is the Chinese equivalent of Google Analytics. Not all the features are free, so there will be a cost to access the full suite of functions.

YouTube / Vimeo

There are multiple options when you need to stream videos within China; Youku and Tencent are two options.

Google Maps

Baidu offers a mapping service, and apparently, Bing Maps work in China also.

Google Fonts / CDN Resources

If your site uses fonts directly from Google, then you will need to change this approach. The same goes for other resources you may load via a CDN, i.e., jQuery.

In all these cases you should now serve the resources locally.

Other considerations

The areas above are the obvious ones, and you should be thorough in your audit to make sure that you have identified everything.

The majority (if not all) of the Chinese services previously mentioned do not cater for non-native speakers. You will need a team within China, or at the very least fluent speakers, to help you manage these tools.

Content

The site content is another factor which needs to be considered; it is the representation of your brand.

  • Review external links, there is no point in linking to sites the visitor cannot access.
  • Use Chinese social media. Twitter, Facebook etc have no presence in China.
  • Review content to make sure it doesn’t contain anything sensitive that may fall foul of the Chinese censors.
  • Don’t rely on automated translation, at the very least get a native speaker to review the site content.

Hosting

Hosting your site within China is the best option, but not possible when the site is hosted by Optimizely DXP.

So, what are the options?

  • Try and locate the origin servers as close as possible to China. This will help with dynamic content.
  • Use a China CDN.  Optimizely have a partnership with Cloudflare China, which caches static content within China. (You will need to speak to your Account Manager to get this added)
  • Consider caching more content, for example, some pages could be cached as they change infrequently.

If you self-host or use a different CMS then the options above are still valid. You will just need to contact Cloudflare directly.

Hostname

You will need to register your hostname in China, and it will have a .cn suffix.

ICP (Internet Content Provider)

This is required if you host your site in China or the site is served by Chinese data centres, and it must be shown in the footer of every page on your website.

There are two types (Filing and License): the Filing is used by non-commercial sites, whereas the License is required by sites that have retail activity. People generally refer to both as just the License.

To start the process with either Optimizely or Cloudflare you require a valid ICP (License), and neither company will help you obtain this. I would recommend engaging a specialist agency; they can advise on the current rules and guide you through the application process.

What happens if I already have an ICP filing/license?

Even if the site already has an ICP filing/license, you are probably moving hosting provider.

If this is the case, you will still need to get the License/Filing modified to reflect the changes.

Testing

You should test the site before you go live. It is possible to set up a test site using the ICP number and the China CDN; this will allow you to:

  • Understand the performance; you may need to add caching to your dynamic content.
  • Tackle any issues with access to the site; from experience, we needed to amend the WAF rules as some users had trouble accessing the site.

Useful Links

https://world.optimizely.com/blogs/john-hakansson/dates/2021/4/how-to-address-dxp-web-performance-in-china-with-and-without-cdn/

https://www.cloudflare.com/en-gb/china-network/

https://manage.whois.com/kb/answer/1609

https://www.goclickchina.com/blog/icp-license-required-to-operate-website-in-china/

https://nhglobalpartners.com/what-is-icp-license-how-to-get-one/#Servers_and_Domains

GETA SEO Sitemaps – SitemapIndex Generation

This excellent module is used on a large number of Optimizely sites; I think I use it on just about every site I have delivered, but today I found a feature that I didn’t know existed.

Perhaps I am late to the party, but I had a requirement for a client who was having problems with their sitemap. The catalog contained well over 50,000 products, exceeding the limit for a single sitemap file, so we needed to generate multiple sitemap files.

Whilst I could reference these files in the robots.txt file, I really wanted to generate a single sitemapindex.xml file and just reference this.

I fired up dotPeek to have a look around and work out the best way of implementing this requirement myself, and had various ideas until I stumbled upon the ‘GetaSitemapIndexController’. It turns out the functionality already exists.

Configure your sitemaps

In the example below I have created multiple sitemaps, and then for the commerce ones I have also specified the node id. This represents the category I want to generate within the sitemap.

Example sitemap configuration

Generated index file

The sitemap index file will automatically be returned in response to the request for sitemapindex.xml.


<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>https://www.xxxxxxxx.com/sitemap.xml</loc>
    </sitemap>
    <sitemap>
        <loc>https://www.xxxxxxxx.com/commerce-cat1-sitemap.xml</loc>
    </sitemap>
    <sitemap>
        <loc>https://www.xxxxxxxx.com/commerce-cat2-sitemap.xml</loc>
    </sitemap>
    <sitemap>
        <loc>https://www.xxxxxxxx.com/commerce-cat3-sitemap.xml</loc>
    </sitemap>
</sitemapindex>

Conclusion

Whilst this may not be a common requirement for all sites, it is really useful for larger eCommerce sites. Hopefully, someone finds this post useful.

Creating a cross platform package – Part 2

Featured

Introduction

In the previous post, I covered the steps required to migrate your project to the new format. In this post, I am going to move to the next stage and cover how you adapt the solution to target multiple frameworks.

Project Changes

The first step is to modify the project and specify which frameworks you want to target. This is a simple change, just modify the <TargetFramework> element to be <TargetFrameworks> and then specify the frameworks you wish to target.

<TargetFrameworks>net471;net5.0;net6.0</TargetFrameworks>

After the project has been modified you will also need to update the package references, ensuring they target the correct framework. This is straightforward: simply add a condition to the parent <ItemGroup>.

<ItemGroup Condition="'$(TargetFramework)' == 'net5.0'">
    <PackageReference Include="EPiServer.CMS.UI.Core" Version="[12.0.3,13)" />
    <PackageReference Include="EPiServer.Framework.AspNetCore" Version="[12.0.3,13)" /> 
    <PackageReference Include="EPiServer.Framework" Version="[12.0.3,13)" />
	
    <PackageReference Include="Microsoft.AspNetCore.Http" Version="2.0" />
    <PackageReference Include="Microsoft.AspNetCore.Http.Abstractions" Version="2.0" />
    <PackageReference Include="Microsoft.Extensions.DependencyInjection" Version="5.0" />
</ItemGroup>

There may be other sections of your project file that also require these condition clauses.

Code Changes

After changing the project to target multiple frameworks you will get compilation errors. You will need to fix these by creating different implementations of your code and wrapping each implementation in a preprocessor directive to indicate which framework the code is targeting.

https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/preprocessor-directives

#if NET461
 // Code specific for .net framework 4.6.1
#elif NET5_0
 // Code specific for .net 5.0
#elif NET5_0_OR_GREATER
 // Code specific for .net 5.0 (or greater)
#else
 // Code for anything else
#endif

You may just need to alter a couple of lines within a class, or in some cases you will need to deliver a completely different approach. A good example would be Middleware replacing a .Net Framework HTTPModule.

Wrapping it all up

Everyone’s journey whilst converting their module will differ. The type of module, whether it has a UI, etc., will determine the complexity.

Whilst you are modifying the code base, I would strongly recommend:

  1. Keep to the ‘DRY’ principle and refactor your code when necessary so that you are not repeating sections of code.
  2. If you have an interface that uses WebForms then it is probably better to replace this with an interface that works for all the different frameworks rather than trying to maintain two different interfaces.

I hope this post helps you migrate your project.

Creating a cross platform package – Part 1

With Optimizely’s transition to .NET 5 last year, developers of add-on packages will need to follow suit.

The complexity of delivering a package that supports both frameworks will vary depending on the type of package you are trying to migrate. For example, if you have delivered an admin module based on WebForms, you will need to rewrite it so that it is accessed via the main navigation; in this case, it is probably best to use the rewritten interface in the .NET Framework version as well. In short, you will need to really consider how you refactor your module to support both environments.

This multi-part blog post will take you through the process, with part 1 focusing on converting your existing project to use the new project format and part 2 focusing on how to modify the code and project to support multiple targets.

Migrate to the new project format

The easiest way to do this is probably to create a new project and then bring your code across.

Set the correct framework version

<TargetFramework>net5.0</TargetFramework>

When the project is created it will target either net5.0 or net6.0. This needs to be changed to match the framework version of the original project, i.e. net471.

<TargetFramework>net471</TargetFramework>

Move nuget package references

The nuget package references are no longer managed in the ‘packages.config’ file; they are now part of the project file.

It is straightforward to migrate the references across: ‘packages‘ becomes ‘ItemGroup‘ and ‘package‘ becomes ‘PackageReference‘.

<ItemGroup>
    <PackageReference Include="EPiServer.Framework" Version="[11.1.0,12)" />
    <PackageReference Include="EPiServer.Framework.AspNet" Version="[11.1.0,12)" />
    <PackageReference Include="EPiServer.CMS.UI.Core" Version="[11.1.0,12)" />
</ItemGroup>

Remove nuspec file

Again, this information is now included in the project file. For the most part, transitioning to the new format is relatively straightforward and the approaches are similar, but some areas (such as adding files created during the build) have to be done differently.

Metadata

This is the information about the nuget package.

<?xml version="1.0" encoding="utf-8"?> 
<package xmlns="http://schemas.microsoft.com/packaging/2010/07/nuspec.xsd">    				
   <metadata> 
        <!-- Required elements--> 
	<id></id> 
	<version></version> 
	<description></description> 
	<authors></authors> 
	<!-- Optional elements --> 
	<!-- ... --> 
    </metadata> 
    <!-- Optional 'files' node --> 
</package>

changes to

<Project Sdk="Microsoft.NET.Sdk.Razor">
  <PropertyGroup>
    <!-- Required elements--> 
    <PackageId></PackageId>
    <Version></Version>
    <Authors></Authors>
    <Description></Description>
    <!-- Optional elements -->
    <RepositoryUrl></RepositoryUrl>
    <Title></Title>
    <Tags></Tags>
    <ReleaseNotes></ReleaseNotes>
  </PropertyGroup>
</Project>

Content Files

You may need to include additional files in, or remove files from, the nuget package. This was handled in the nuspec file with the <files> and <contentFiles> nodes.

<files>
    <file src="bin\Debug\*.dll" target="lib" exclude="*.txt" />
</files>

<contentFiles>
     <!-- Include everything in the scripts folder except exe files -->
     <files include="cs/net45/scripts/*" exclude="**/*.exe"  
            buildAction="None" copyToOutput="true" />
</contentFiles>

changes to

<ItemGroup>
  <Content Remove="src\**" />
  <Content Remove="node_modules\**" />
  <Content Remove="*.json" /> 

  <Content Include="deploy\**" Exclude="src\**\*">
    <Pack>true</Pack>
    <PackagePath>content</PackagePath>
    <PackageCopyToOutput>true</PackageCopyToOutput>
  </Content>
</ItemGroup>

NOTE: If you want to include files that are created during the build process (e.g. the output of a separate front-end build), then you need to take a different approach.

You will need to create a targets file and reference it in your project. The targets file should have the same name as the project being built.

<Content Include="build\net461\<project-name>.targets" PackagePath="build\net461\<project-name>.targets" />

<project-name>.targets

<?xml version="1.0" encoding="utf-8"?>
<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003" ToolsVersion="4.0">
    <ItemGroup>
        <SourceScripts Include="$(MSBuildThisFileDirectory)..\..\contentFiles\any\any\modules\_protected\**\*"/>
    </ItemGroup>

    <Target Name="CopyFiles" BeforeTargets="Build">
        <Copy
            SourceFiles="@(SourceScripts)"
            DestinationFolder="$(MSBuildProjectDirectory)\modules\_protected\%(RecursiveDir)"
        />
    </Target>
</Project>

Additional project settings

<GeneratePackageOnBuild>true</GeneratePackageOnBuild>

When set to true, the nuget package will be created automatically when the project is built.

<AddRazorSupportForMvc>true</AddRazorSupportForMvc>

This is required when the project includes Razor files.

<RestoreSources>
  https://api.nuget.org/v3/index.json;
  https://nuget.optimizely.com/feed/packages.svc;
</RestoreSources>

This can be used to set the location of the package sources; these can be either external or on the local file system.

Build and Test

Build the project and resolve any issues you encounter; these should be minor.

Once the package is generated you should test to ensure that it contains the correct content.

Wrapping things up

At this point, you should have a solution that builds your project and creates a nuget package but still targets a single framework.

In the next part, I will cover how to convert to target multiple frameworks.

Optimizely Data Platform Visitor Groups

The Optimizely Data Platform (ODP) builds a picture of a customer, their interactions, and their behavior in comparison to other customers on a site.

This module exposes these insights in the form of visitor groups which can then be used to personalise content.

Features

There are currently five different visitor groups available. These are accessed via the ‘Data Platform’ group.

Real-Time Segments

Real-Time segments are new and are different from the ‘Calculated’ segments that are currently available on the platform.

Real-Time segments are based on the last 30 days of data, whereas ‘Calculated’ segments are based on all the stored customer data and are recalculated at regular intervals, making them more suited to reporting and journey orchestration.


Note: You need to contact Optimizely to get Real-Time Segments enabled on your instance and there is currently no interface to create them.

Note 2: ‘Calculated’ Segments are not available via this visitor group criterion. 

Engagement Rank

This metric allows you to build personalisation based on how engaged the customer/visitor is with your site/brand.  This is biased toward more recent visits rather than historical visits.


This metric is calculated every 24 hrs.

Order Likelihood

As the name suggests, this criterion returns the likelihood that the customer will place an order.  

The possible values are:

  • Unlikely
  • Likely
  • Very Likely
  • Extremely Likely

This metric is calculated every 24 hrs.

Winback zone

Returns the ‘Winback Zone’ for the current customer. This can be used to identify when a customer is altering their normal interaction patterns with the site; for example, when they are disengaging.

The options are:

  • Churned Customers
  • Winback Customers
  • Engaged Customers

This metric is calculated every 24hrs.

Observation

This criterion can be used to build personalisation around three different customer order metrics.

  • Total Revenue
  • Order Count
  • Average Order Revenue


This metric is calculated every 24hrs.


Installation

Install the package directly from the Optimizely Nuget repository.

dotnet add package UNRVLD.ODP.VisitorGroups
Install-Package UNRVLD.ODP.VisitorGroups

Configuration (.NET 5.0)

Startup.cs

// Adds the registration for visitor groups
services.AddODPVisitorGroups();

appsettings.json (all settings are optional, apart from the PrivateApiKey)

{
   "EPiServer": {
      //Other config
      "OdpVisitorGroupOptions": {
         "OdpCookieName": "vuid",
         "CacheTimeoutSeconds": 10,
         "EndPoint": "https://api.zaius.com/v3/graphql",
         "PrivateApiKey": "key-lives-here"
       }
   }
}

Configuration (.Net Framework)

web.config (all settings are optional, apart from the PrivateApiKey)

  <appSettings>
    <add key="episerver:setoption:UNRVLD.ODP.OdpVisitorGroupOptions.OdpCookieName, UNRVLD.ODP.VisitorGroups" value="vuid" />
    <add key="episerver:setoption:UNRVLD.ODP.OdpVisitorGroupOptions.CacheTimeoutSeconds, UNRVLD.ODP.VisitorGroups" value="1" />
    <add key="episerver:setoption:UNRVLD.ODP.OdpVisitorGroupOptions.EndPoint, UNRVLD.ODP.VisitorGroups" value="https://api.zaius.com/v3/graphql" />
    <add key="episerver:setoption:UNRVLD.ODP.OdpVisitorGroupOptions.PrivateApiKey, UNRVLD.ODP.VisitorGroups" value="key-lives-here" />
  </appSettings>

Credits

I cannot take all the credit for this module; it was co-developed with David Knipe. Thanks for all the help.

Jhoose Security – Updated to support Episerver 11

I have updated the Jhoose Security module to support any Episerver 11 site; the only dependency is .NET Framework 4.7.1.

Installation

Install the package directly from the Optimizely Nuget repository. This will install the admin interface along with the middleware to add the CSP header to the response.

Github: https://github.com/andrewmarkham/contentsecuritypolicy

dotnet add package Jhoose.Security.Admin --version 1.2.2.148

Install-Package Jhoose.Security.Admin -Version 1.2.2.148

Configuration

The installation process will add the following nodes to the web.config file within your solution.

<configSections>
	<sectionGroup name="JhooseSecurity" type="Jhoose.Security.Configuration.JhooseSecurityOptionsConfigurationSectionGroup, Jhoose.Security">
		<section name="Headers" type="Jhoose.Security.Configuration.HeadersSection, Jhoose.Security" />
		<section name="Options" type="Jhoose.Security.Configuration.OptionsSection, Jhoose.Security" />
	</sectionGroup>
</configSections>

Register the module with the .Net pipeline

<system.webServer>
	<modules runAllManagedModulesForAllRequests="true">
		<add name="JhooseSecurityModule" type="Jhoose.Security.HttpModules.JhooseSecurityModule, Jhoose.Security" />
	</modules>
</system.webServer>   

Configuration options for the module

<JhooseSecurity>
	<Options httpsRedirect="true">
		<Exclusions>
			<add path="/episerver" />
		</Exclusions>
	</Options>
	<Headers>
		<StrictTransportSecurityHeader enabled="true" maxAge="31536000" />
		<XFrameOptionsHeader enabled="true" mode="Deny|SameOrigin|AllowFrom" domain=""/>
		<XContentTypeOptionsHeader enabled="true" />
		<XPermittedCrossDomainPoliciesHeader enabled="true" mode="None|MasterOnly|ByContentType|All"/>
		<ReferrerPolicyHeader enabled="true" mode="NoReferrer|NoReferrerWhenDownGrade|Origin|OriginWhenCrossOrigin|SameOrigin|StrictOrigin|StrictOriginWhenCrossOrigin|UnsafeUrl"/>
		<CrossOriginEmbedderPolicyHeader enabled="true" mode ="UnSafeNone|RequireCorp"/>
		<CrossOriginOpenerPolicyHeader  enabled="true" mode="UnSafeNone|SameOriginAllowPopups|SameOrigin"/>
		<CrossOriginResourcePolicyHeader enabled="true" mode="SameSite|SameOrigin|CrossOrigin" />
	</Headers>
</JhooseSecurity>

Exclusions: Any request which starts with a path specified in this property will not include the CSP header. 

httpsRedirect: This attribute controls whether all requests should be upgraded to HTTPS.

Nonce HTML helper

It is possible to get a nonce added to your inline <script> and <style> tags.

@using Jhoose.Security.Core.HtmlHelpers;
<script @Html.AddNonce() src="/assets/js/jquery.min.js"></script>

Response Headers

The response headers can be controlled within the web.config

Server Header and X-Powered-By Header

These aren’t removed, the reasons being:

  1. When hosting within Optimizely DXP, the CDN will obfuscate the server value anyway.
  2. The header cannot be removed programmatically.
IIS 10
<!-- web.config -->
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <system.webServer>
        <security>
            <requestFiltering removeServerHeader="true" />
        </security>

        <httpProtocol>
            <customHeaders>
                <clear />
                <remove name="X-Powered-By" />
            </customHeaders>
        </httpProtocol>
    </system.webServer>
</configuration>

Jhoose Security – Update to include recommended security headers.

I have updated the module to automatically output the OWASP recommended security headers.

Example response headers

These headers are automatically added to the response but can be configured as required, or even disabled.

Code Configuration

services.AddJhooseSecurity(_configuration, (securityOptions) => {

    // define the XFrame Options mode
    securityOptions.XFrameOptions.Mode = XFrameOptionsEnum.SameOrigin;

    // disable HSTS
    securityOptions.StrictTransportSecurity.Enabled = false;
});

Configuration via appSettings

"JhooseSecurity": {
      "ExclusionPaths": [
        "/episerver"
      ],
      "HttpsRedirection": true,
      "StrictTransportSecurity": {
        "MaxAge": 31536000,
        "IncludeSubDomains": true
      },
      "XFrameOptions": {
        "Enabled": false,
        "Mode": 0,
        "Domain": ""
      },
      "XPermittedCrossDomainPolicies": {
        "Mode": 0
      },
      "ReferrerPolicy": {
        "Mode": 0
      },
      "CrossOriginEmbedderPolicy": {
        "Mode": 1
      },
      "CrossOriginOpenerPolicy": {
        "Mode": 2
      },
      "CrossOriginResourcePolicy": {
        "Mode": 1
      }
    }

Managing the server header

The security module doesn’t remove the ‘Server’ header; this may seem strange, but the approach differs depending on how you are hosting your site. I have included some examples below.

Another consideration: if you are hosting your solution with Optimizely DXP, then the CDN will automatically remove the header.

Kestrel

return Host.CreateDefaultBuilder(args)
  .ConfigureCmsDefaults()
  .ConfigureWebHostDefaults(webBuilder =>
{
   webBuilder.ConfigureKestrel(o => o.AddServerHeader = false);
   webBuilder.UseStartup<Startup>();
});

IIS

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <system.webServer>
        <security>
            <requestFiltering removeServerHeader="true" />
        </security>
    </system.webServer>
</configuration>

Installation

dotnet add package Jhoose.Security.Admin  --version 1.1.1.89