  • [Dataset] OTTO – Multi-Objective Recommender System
    Data miner/Kaggle Notetaking 2022. 11. 30. 16:05

    - Dataset introduction

     

    (Link: https://www.kaggle.com/competitions/otto-recommender-system/data?select=test.jsonl)

     


     

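    • The modeling goal for this dataset is to predict e-commerce clicks, cart additions, and orders. We therefore need to build a multi-objective recommender system that draws on the earlier event logs of each session.
    • The train data contains complete e-commerce sessions.
    • At test time, the aids (article id / product id) must be predicted for each session, separately for each event type (e.g., one prediction row per session number and event type, paired with its aid labels).
    session_type predicted labels (aids)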
    12899779_clicks 129004 126836 118524
    12899779_carts 129004 126836 118524
    12899779_orders 129004 126836 118524
    12899780_clicks 129004 126836 118524
    12899780_carts 129004 126836 118524
    • There are three event types in the product logs: clicks, carts, and orders.
    • Every event in a session carries a timestamp (ts).
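The row layout in the table above can be sketched in a few lines of Python. This is only an illustration: `to_submission_rows` and the shape of the `predictions` dict are hypothetical names chosen for this example, and the aids are the ones shown in the table.

```python
# Hypothetical helper: format predictions as (session_type, labels) rows.
EVENT_TYPES = ("clicks", "carts", "orders")


def to_submission_rows(predictions, max_labels=20):
    """predictions maps session id -> {event type -> list of predicted aids}."""
    rows = []
    for session, by_type in predictions.items():
        for event_type in EVENT_TYPES:
            # Keep at most `max_labels` aids per row (the limit stated in this post).
            aids = by_type.get(event_type, [])[:max_labels]
            labels = " ".join(str(aid) for aid in aids)
            rows.append((f"{session}_{event_type}", labels))
    return rows


# Same aids for every event type, mirroring the sample rows above.
example = {12899779: {t: [129004, 126836, 118524] for t in EVENT_TYPES}}
rows = to_submission_rows(example)
for session_type, labels in rows:
    print(session_type, labels)
```

Each session therefore contributes three rows, one per event type, whether or not the predicted aids differ between types.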

     

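    - Evaluation

     

    Submissions are scored with a weighted Recall score.

    Recall for orders carries the largest weight in the score, so correctly predicting the products that are actually ordered is the most important task.

    Up to 20 aid labels can be predicted for each session and event type.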
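To make the weighting concrete, here is a minimal sketch of a weighted Recall@20. The weights (0.10 for clicks, 0.30 for carts, 0.60 for orders) and the `min(20, ...)` denominator are my reading of the competition's evaluation page rather than something stated in this post, so please verify them there. The function names and the toy data are made up for the example.

```python
# Assumption: weights as described on the competition's evaluation page.
WEIGHTS = {"clicks": 0.10, "carts": 0.30, "orders": 0.60}


def recall_at_k(predicted, actual, k=20):
    """Recall for one event type.

    predicted / actual: dict mapping session id -> list of aids.
    """
    hits = 0
    total = 0
    for session, truth in actual.items():
        truth_set = set(truth)
        top_k = set(predicted.get(session, [])[:k])
        hits += len(top_k & truth_set)
        # A session can contribute at most k hits to the denominator.
        total += min(k, len(truth_set))
    return hits / total if total else 0.0


def weighted_recall(predicted_by_type, actual_by_type):
    """Combine the per-type recalls with the assumed weights."""
    return sum(
        weight * recall_at_k(predicted_by_type.get(t, {}), actual_by_type.get(t, {}))
        for t, weight in WEIGHTS.items()
    )


# Toy data (made up): one session, perfect on clicks and orders, half of the carts.
predicted = {"clicks": {1: [10, 11]}, "carts": {1: [10, 12]}, "orders": {1: [10]}}
actual = {"clicks": {1: [11]}, "carts": {1: [10, 13]}, "orders": {1: [10]}}
score = weighted_recall(predicted, actual)
print(round(score, 4))
```

In the toy data the missed cart item costs far less than a missed order would, which is the practical meaning of the weighting.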

     

     

    Multi objective CF

    0. Data Overview

    In [1]:
    import json
    import pandas as pd
    from pathlib import Path
    import os
    import random
    import numpy as np
    from datetime import timedelta
    
    In [25]:
    DATA_PATH = Path('/Users/soyoon-yoon/Kaggle mining/Multi_CF')
    TRAIN_PATH = DATA_PATH/'train.jsonl'
    TEST_PATH = DATA_PATH/'test.jsonl'
    
    In [8]:
    sample_size = 10000
    chunks = pd.read_json(TRAIN_PATH, lines=True, chunksize = sample_size)
    
    In [9]:
    with open(TRAIN_PATH, 'r') as f:
        print(f"We have {len(f.readlines()):,} lines in the training data")
    
    We have 2,963,606 lines in the training data
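For files of this size, `f.readlines()` builds a list of every line just to count them. A lighter alternative, sketched below with only the standard library, counts lines lazily and parses the first few records with `json.loads`. The helper names and the tiny stand-in file are assumptions for the demo. The aids and timestamps are copied from the test outputs in this post, but the session ids are made up.

```python
import json
import tempfile
from itertools import islice
from pathlib import Path


def count_lines(path):
    """Count lines without holding the whole file in memory."""
    with open(path, "r") as f:
        return sum(1 for _ in f)


def read_sessions(path, limit):
    """Parse only the first `limit` JSON lines into dicts."""
    with open(path, "r") as f:
        return [json.loads(line) for line in islice(f, limit)]


# Stand-in records in the same shape as train.jsonl / test.jsonl.
sample_lines = [
    {"session": 0, "events": [{"aid": 59625, "ts": 1661724000278, "type": "clicks"}]},
    {"session": 1, "events": [{"aid": 141736, "ts": 1661724000559, "type": "clicks"},
                              {"aid": 199008, "ts": 1661724022851, "type": "clicks"}]},
]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_path.write_text("\n".join(json.dumps(row) for row in sample_lines) + "\n")
    n_lines = count_lines(demo_path)
    sessions = read_sessions(demo_path, limit=1)

print(f"We have {n_lines:,} lines in the demo file")
print(sessions[0]["events"])
```

The same two helpers should work unchanged on `TRAIN_PATH` and `TEST_PATH`.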
    
    In [12]:
    for c in chunks:
        sample_train_df = c
        print(c)
        break
    
           session                                             events
    10000    10000  [{'aid': 1033792, 'ts': 1659305201724, 'type':...
    10001    10001  [{'aid': 476264, 'ts': 1659305201745, 'type': ...
    10002    10002  [{'aid': 1754433, 'ts': 1659305201774, 'type':...
    10003    10003  [{'aid': 1536959, 'ts': 1659305201875, 'type':...
    10004    10004  [{'aid': 287161, 'ts': 1659305201899, 'type': ...
    ...        ...                                                ...
    19995    19995  [{'aid': 1481519, 'ts': 1659305842045, 'type':...
    19996    19996  [{'aid': 1109584, 'ts': 1659305842183, 'type':...
    19997    19997  [{'aid': 1647277, 'ts': 1659305842315, 'type':...
    19998    19998  [{'aid': 753948, 'ts': 1659305842328, 'type': ...
    19999    19999  [{'aid': 1690380, 'ts': 1659305842502, 'type':...
    
    [10000 rows x 2 columns]
    
    In [13]:
    sample_train_df
    
    Out[13]:
           session                                             events
    10000    10000  [{'aid': 1033792, 'ts': 1659305201724, 'type':...
    10001    10001  [{'aid': 476264, 'ts': 1659305201745, 'type': ...
    10002    10002  [{'aid': 1754433, 'ts': 1659305201774, 'type':...
    10003    10003  [{'aid': 1536959, 'ts': 1659305201875, 'type':...
    10004    10004  [{'aid': 287161, 'ts': 1659305201899, 'type': ...
    ...        ...                                                ...
    19995    19995  [{'aid': 1481519, 'ts': 1659305842045, 'type':...
    19996    19996  [{'aid': 1109584, 'ts': 1659305842183, 'type':...
    19997    19997  [{'aid': 1647277, 'ts': 1659305842315, 'type':...
    19998    19998  [{'aid': 753948, 'ts': 1659305842328, 'type': ...
    19999    19999  [{'aid': 1690380, 'ts': 1659305842502, 'type':...

    10000 rows × 2 columns

    In [45]:
    # First ten events of the row labelled 10543
    sample_train_df.loc[10543, 'events'][:10]
    
    Out[45]:
    [{'aid': 602784, 'ts': 1659305228407, 'type': 'clicks'},
     {'aid': 602784, 'ts': 1659305254095, 'type': 'carts'},
     {'aid': 1456023, 'ts': 1659305269630, 'type': 'clicks'},
     {'aid': 1456023, 'ts': 1659305313768, 'type': 'carts'},
     {'aid': 1466811, 'ts': 1659305321905, 'type': 'clicks'},
     {'aid': 1456023, 'ts': 1659305346764, 'type': 'clicks'},
     {'aid': 602784, 'ts': 1659305500999, 'type': 'orders'},
     {'aid': 1456023, 'ts': 1659305500999, 'type': 'orders'},
     {'aid': 884043, 'ts': 1659305559962, 'type': 'clicks'},
     {'aid': 1343414, 'ts': 1659305600039, 'type': 'clicks'}]
    In [22]:
    sample_train_df.set_index('session', drop=True, inplace=True)
    sample_train_df.head()
    
    Out[22]:
                                                        events
    session
    10000    [{'aid': 1033792, 'ts': 1659305201724, 'type':...
    10001    [{'aid': 476264, 'ts': 1659305201745, 'type': ...
    10002    [{'aid': 1754433, 'ts': 1659305201774, 'type':...
    10003    [{'aid': 1536959, 'ts': 1659305201875, 'type':...
    10004    [{'aid': 287161, 'ts': 1659305201899, 'type': ...
    In [37]:
    # After set_index, each row holds only the 'events' column,
    # so .item() returns that session's list of events (row position 100 of the sample)
    example_session = sample_train_df.iloc[100].item()
    
    # Timestamps are in milliseconds since 00:00:00 UTC on 1 January 1970
    time_elapsed = example_session[-1]["ts"] - example_session[0]["ts"]
    
    print(f'The example session elapsed: {timedelta(milliseconds=time_elapsed)} \n')
    
    The example session elapsed: 24 days, 0:33:11.682000 
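Since `ts` is a millisecond Unix timestamp, it can also be turned into a readable UTC datetime. The sketch below uses only the standard library and two timestamps copied from Out[45] (the first click and the first order of session 10543). `ts_to_datetime` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ts_to_datetime(ts_ms):
    """Convert a millisecond Unix timestamp to a timezone-aware UTC datetime."""
    # Adding an integer number of milliseconds avoids float rounding.
    return EPOCH + timedelta(milliseconds=ts_ms)


# Copied from Out[45]: first click and first order in session 10543.
first_click_ts = 1659305228407
first_order_ts = 1659305500999

first_click_at = ts_to_datetime(first_click_ts)
time_to_order = timedelta(milliseconds=first_order_ts - first_click_ts)

print(first_click_at.isoformat())
print(f"Time from first click to first order: {time_to_order}")
```

For a whole DataFrame column, `pd.to_datetime(series, unit='ms')` does the same conversion in a vectorised way.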
    
    
    In [38]:
    # Count the frequency of each action type within the example session
    action_counts = {}
    for action in example_session:
        action_counts[action['type']] = action_counts.get(action['type'], 0) + 1
    print(f'The example session contains the following frequency of actions: {action_counts}')
    
    The example session contains the following frequency of actions: {'clicks': 116, 'carts': 6}
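The same tally can be written with `collections.Counter`, which also extends naturally from one session to many. As a small, self-contained illustration, the events below are the ten shown in Out[45] with `ts` dropped for brevity. `count_event_types` is a name invented for this sketch.

```python
from collections import Counter

# First ten events of session 10543, copied from Out[45] (ts omitted).
events_10543 = [
    {"aid": 602784, "type": "clicks"},
    {"aid": 602784, "type": "carts"},
    {"aid": 1456023, "type": "clicks"},
    {"aid": 1456023, "type": "carts"},
    {"aid": 1466811, "type": "clicks"},
    {"aid": 1456023, "type": "clicks"},
    {"aid": 602784, "type": "orders"},
    {"aid": 1456023, "type": "orders"},
    {"aid": 884043, "type": "clicks"},
    {"aid": 1343414, "type": "clicks"},
]


def count_event_types(sessions):
    """Tally event types over an iterable of sessions (each a list of events)."""
    counts = Counter()
    for session_events in sessions:
        counts.update(event["type"] for event in session_events)
    return counts


counts = count_event_types([events_10543])
print(dict(counts))
```

Passing `sample_train_df['events']` instead of the one-element list would give the totals for the whole sample.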
    
    In [28]:
    with open(TEST_PATH, 'r') as f:
        print(f"We have {len(f.readlines()):,} lines in the test data")
    
    We have 1,671,803 lines in the test data
    
    In [27]:
    sample_size = 150
    
    chunks = pd.read_json(TEST_PATH, lines=True, chunksize = sample_size)
    
    for c in chunks:
        sample_test_df = c
        break
    
    In [31]:
    sample_test_df.loc[0, 'events']
    #sample_train_df.loc[10000, :]['events']
    
    Out[31]:
    [{'aid': 59625, 'ts': 1661724000278, 'type': 'clicks'}]
    In [43]:
    sample_test_df.loc[2, 'events']
    
    Out[43]:
    [{'aid': 141736, 'ts': 1661724000559, 'type': 'clicks'},
     {'aid': 199008, 'ts': 1661724022851, 'type': 'clicks'},
     {'aid': 57315, 'ts': 1661724170835, 'type': 'clicks'},
     {'aid': 194067, 'ts': 1661724246188, 'type': 'clicks'},
     {'aid': 199008, 'ts': 1661780623778, 'type': 'clicks'},
     {'aid': 199008, 'ts': 1661781274081, 'type': 'clicks'},
     {'aid': 199008, 'ts': 1661781409993, 'type': 'carts'},
     {'aid': 199008, 'ts': 1661804151788, 'type': 'clicks'},
     {'aid': 199008, 'ts': 1662060028567, 'type': 'clicks'},
     {'aid': 199008, 'ts': 1662060064706, 'type': 'clicks'},
     {'aid': 918667, 'ts': 1662060160406, 'type': 'clicks'}]
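In this test session the same aid (199008) appears many times. When a session's history is used to suggest candidates, one simple option is to list the distinct aids, most recent first. The sketch below is purely illustrative and is not the method used in this notebook. `recent_unique_aids` is a hypothetical helper, and the events are copied from Out[43].

```python
# Events of test session 2, copied from Out[43].
events = [
    {"aid": 141736, "ts": 1661724000559, "type": "clicks"},
    {"aid": 199008, "ts": 1661724022851, "type": "clicks"},
    {"aid": 57315, "ts": 1661724170835, "type": "clicks"},
    {"aid": 194067, "ts": 1661724246188, "type": "clicks"},
    {"aid": 199008, "ts": 1661780623778, "type": "clicks"},
    {"aid": 199008, "ts": 1661781274081, "type": "clicks"},
    {"aid": 199008, "ts": 1661781409993, "type": "carts"},
    {"aid": 199008, "ts": 1661804151788, "type": "clicks"},
    {"aid": 199008, "ts": 1662060028567, "type": "clicks"},
    {"aid": 199008, "ts": 1662060064706, "type": "clicks"},
    {"aid": 918667, "ts": 1662060160406, "type": "clicks"},
]


def recent_unique_aids(session_events, k=20):
    """Return up to k distinct aids, most recently seen first."""
    ordered = sorted(session_events, key=lambda e: e["ts"])
    newest_first = [e["aid"] for e in reversed(ordered)]
    # dict.fromkeys keeps the first occurrence of each aid, preserving order.
    return list(dict.fromkeys(newest_first))[:k]


candidates = recent_unique_aids(events)
print(candidates)
```

The default `k=20` mirrors the limit of 20 labels per prediction row mentioned earlier in the post.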