Upload folder using huggingface_hub

Browse files

Files changed (14) hide show

.DS_Store +0 -0
.gitattributes +1 -0
1_Pooling/config.json +10 -0
README.md +554 -0
config.json +49 -0
config_sentence_transformers.json +10 -0
configuration.py +145 -0
model.safetensors +3 -0
modeling.py +1418 -0
modules.json +20 -0
sentence_bert_config.json +4 -0
special_tokens_map.json +51 -0
tokenizer.json +3 -0
tokenizer_config.json +55 -0

.DS_Store ADDED Viewed

Binary file (6.15 kB). View file

.gitattributes CHANGED Viewed

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text

1_Pooling/config.json ADDED Viewed

	@@ -0,0 +1,10 @@

+{
+  "word_embedding_dimension": 768,
+  "pooling_mode_cls_token": true,
+  "pooling_mode_mean_tokens": false,
+  "pooling_mode_max_tokens": false,
+  "pooling_mode_mean_sqrt_len_tokens": false,
+  "pooling_mode_weightedmean_tokens": false,
+  "pooling_mode_lasttoken": false,
+  "include_prompt": true
+}

README.md ADDED Viewed

	@@ -0,0 +1,554 @@

+---
+tags:
+- sentence-transformers
+- sentence-similarity
+- feature-extraction
+- generated_from_trainer
+- dataset_size:5964
+- loss:MultipleNegativesRankingLoss
+base_model: Alibaba-NLP/gte-multilingual-base
+widget:
+- source_sentence: công nghệ blockchain
+  sentences:
+  - '<p>🟢 <strong>OLabs</strong> là công ty thành viên trực thuộc tập đoàn <strong>OpenLive</strong>.
+    <strong>OLabs</strong> là sự sáp nhập 100% của <strong>Công ty Công Nghệ Genex</strong>.
+    Bao gồm nguồn lực về nhân sự, khách hàng, sản phẩm sẽ trực thuộc quản lý của <strong>OLabs</strong>.</p><p>🟢
+    <strong>Tầm nhìn</strong>: Trở thành <strong>Top3</strong> Công ty Công Nghệ về
+    tư vấn, phát triển, triển khai hệ sinh thái, platform chuyển đổi số cho <strong>Doanh
+    Nghiệp</strong> ứng dụng những công nghệ lõi như <strong>Web3, AI, Bigdata</strong>,…</p><p>🟢
+    <strong>Sứ mệnh</strong>: Cung cấp giải pháp phần mềm cho tập đoàn <strong>OpenLive</strong>
+    nhằm đưa <strong>OpenLive</strong> trở thành 1 tập đoàn có những ứng dụng tiên
+    tiến và tiêu biểu nhằm phục vụ mục tiêu hơn <strong>500.000 Doanh nghiệp</strong>
+    vừa và nhỏ tại <strong>Việt Nam</strong> cũng như <strong>Đông Nam Á</strong>.
+    Hướng tới việc làm tăng giá trị Thương Hiệu, gia tăng khách hàng mới, tăng doanh
+    thu và lợi nhuận cho Doanh Nghiệp.</p><p>🟢 <strong>Sản phẩm công nghệ</strong>:
+    Chúng tôi hiện nay có 2 sản phẩm mục tiêu cho tập đoàn <strong>OpenLive</strong></p><p>#️⃣
+    <strong>Loyalty Branding Platform</strong>: <strong>Platform</strong> tiên phong
+    tại Việt Nam ứng dụng công nghệ lõi <strong>Blockchain</strong>, <strong>BCC</strong>,
+    <strong>Web3</strong> vào giải pháp <strong>Loyalty</strong> truyền thống. Giải
+    quyết những nhược điểm hiện tại của hệ thống <strong>Loyalty</strong> truyến thống
+    và tạo ra môi trường trao đổi giá trị chéo giữa các Doanh nghiệp trong hệ sinh
+    thái thông qua giải pháp <strong>Loyalty Point</strong>.</p><p>#️⃣ <strong>Wifi
+    Branding Platform</strong>: là nền tảng hiển thị quảng cáo khi người dùng truy
+    cập vào mạng wifi, biến Wifi thông thường trở thành 1 kênh quảng cáo digital,
+    tương tác khách hàng thời gian thực, tăng nhận diện thương hiệu, giúp trải nghiệm
+    khách hàng thông minh hơn.</p><p>🟢 <strong>Thông tin</strong>:</p><p>&nbsp;➡️
+    <strong>Website:</strong> <a href="http://www.olabs.net/" rel="noopener noreferrer"
+    target="_blank">www.olabs.net</a></p><p>&nbsp;➡️ <strong>Hotline:</strong> 088.608.1166</p>'
+  - <p>🔸<strong>Bizfone </strong>là một công ty hàng đầu trong lĩnh vực công nghệ
+    thông tin, chuyên cung cấp các giải pháp và dịch vụ công nghệ tiên tiến nhằm đáp
+    ứng nhu cầu đa dạng của các doanh nghiệp. Với sự cam kết mang đến chất lượng vượt
+    trội và sự hài lòng cho khách hàng, <strong>Bizfone </strong>không ngừng đầu tư
+    vào nghiên cứu và phát triển các sản phẩm công nghệ mới nhất.</p>
+  - <p>🔸Chúng tôi là công ty chuyên về thương mại thực phẩm hữu cơ hoạt động thông
+    qua các công ty đăng ký ở <strong>Ba Lan</strong> và <strong>Việt Nam</strong>.</p><p>Chúng
+    tôi bán hàng sản xuất theo yêu cầu của chúng tôi dành cho thị trường Việt Nam
+    bởi các nhà sản xuất đã được chứng minh, đáp ứng cả tiêu chí kiểm dịch thực vật
+    cao nhất của <em>Châu Âu</em> và cũng có giấy chứng nhận chính phủ phù hợp. Trong
+    đề nghị của chúng tôi, bạn sẽ tìm thấy những sản phẩm được chọn lọc có chất lượng
+    cao nhất từ khắp nơi <em>Liên minh châu Âu</em>.</p><p>🔸Đồng thời chúng tôi
+    hợp tác với các nhà sản xuất Việt Nam để tạo ra những sản phẩm chất lượng cao
+    nhất từ các nhà máy nhập khẩu các loại thảo mộc, trái cây và rau quả của họ vào
+    thị trường <em>châu Âu</em>.</p><p>🔸Ưu đãi của chúng tôi không ngừng mở rộng,
+    vì vậy chúng tôi mời bạn ghé thăm chúng tôi thường xuyên qua trang web của chúng
+    tôi. Nếu bạn quan tâm đến sản phẩm, không bao gồm trực tiếp trong ưu đãi của chúng
+    tôi, vui lòng liên hệ với chúng tôi qua email, điện thoại hoặc Facebook. Chúng
+    tôi ở đây vì bạn có sẵn 24 giờ một ngày.</p>
+- source_sentence: lối sống xanh
+  sentences:
+  - <p>🔸Là doanh nghiệp cung cấp hàng tiêu dùng nhanh sinh học của <strong>Đức</strong>,
+    chất l��ợng <strong>Đức </strong>nhưng giá <strong>Việt Nam</strong>.</p>
+  - <p>🔸<strong>CÔNG TY CỔ PHẦN SẢN XUẤT THẢO DƯỢC DVTM - VNV</strong> là doanh nghiệp
+    tiên phong trong lĩnh vực sản xuất và kinh doanh các sản phẩm chăm sóc sức khỏe
+    và làm đẹp từ thảo dược tự nhiên. Với tầm nhìn mang đến cho người tiêu dùng những
+    sản phẩm chất lượng cao, an toàn và thân thiện với môi trường, <strong>VNV </strong>đã
+    xây dựng giá trị cốt lõi dựa trên Sức khỏe và Làm đẹp, kết hợp với việc sử dụng
+    nguồn nguyên liệu Sạch và Xanh từ thiên nhiên.</p><p>🔸<strong>VNV </strong>cam
+    kết mang đến những sản phẩm tốt nhất cho cộng đồng, từ các sản phẩm chăm sóc da,
+    tóc, cho đến các dòng thực phẩm chức năng giúp tăng cường sức khỏe và hỗ trợ điều
+    trị bệnh. Mỗi sản phẩm của <strong>VNV </strong>đều trải qua quy trình kiểm định
+    nghiêm ngặt, đáp ứng các tiêu chuẩn về an toàn và hiệu quả.</p><p>🔸Với tôn chỉ
+    "Sạch từ nguyên liệu, xanh từ quy trình sản xuất", <strong>CÔNG TY CỔ PHẦN SẢN
+    XUẤT THẢO DƯỢC DVTM - VNV</strong> không ngừng cải tiến và phát triển, luôn đồng
+    hành cùng sức khỏe và vẻ đẹp của người tiêu dùng Việt Nam.</p>
+  - <p>🔸Sứ mệnh trao gửi hạnh phúc tới người Việt bằng những thực phẩm tốt lành để
+    chăm sóc sức khoẻ thể chất và phát triển nguồn năng lượng tươi mới cho cuộc sống
+    đủ đầy.</p><p>🔸Nỗ lực mỗi ngày với khẩu hiệu <em>Deliver Happiness</em> bằng sự
+    tận tâm để mang tới bạn và gia đình từ chính những sản phẩm tốt cho sức khoẻ,
+    được tuyển chọn khắt khe cùng với sự chăm sóc ân cần từ người bạn đồng hành <strong>Deli
+    Group</strong>.</p><p>🔸Đặt ra quy trình kiểm soát chặt chẽ từ trồng trọt, thu
+    hoạch, sản xuất, bảo quản và vận chuyển. Đảm bảo sản phẩm được gửi tới bạn và
+    gia đình luôn là sản phẩm chất lượng và tươi ngon.</p><p><strong>CAM KẾT:</strong></p><p>🔸Cung
+    cấp tới tạo ra những sản phẩm chất lượng cao, an toàn cho sức khỏe với mức giá
+    hợp lý nhất.</p><p>🔸Mang lại niềm vui qua từng sản phẩm và dịch vụ khách hàng,
+    góp phần lan tỏa hạnh phúc đến mọi người.</p>
+- source_sentence: nghỉ ngơi giải tỏa căng thẳng
+  sentences:
+  - <p><strong>Bodhi Spa - Gội Đầu Dưỡng Sinh, Massage Thái Y Học</strong></p><p>➡️
+    <strong>Bodhi Spa</strong> là địa chỉ chuyên cung cấp các dịch vụ gội đầu dưỡng
+    sinh trị liệu chuyên sâu, đồng thời còn cung cấp đào tạo cho các học viên với
+    sự hỗ trợ của các thiết bị hiện đại giúp họ dễ dàng học và luyện tập.</p><p>➡️
+    Tại đây, <strong>Bodhi Spa</strong> sử dụng các nguyên liệu tự nhiên lành tính
+    để mang đến cho khách hàng cảm giác thư giãn và giải tỏa mệt mỏi.</p><p>➡️ Đội
+    ngũ nhân viên chuyên nghiệp của <strong>Bodhi Spa</strong> có tay nghề cao và
+    tinh thần học hỏi, giúp khách hàng luôn cảm thấy hài lòng sau mỗi lần trải nghiệm
+    các dịch vụ như Gội Đầu Dưỡng Sinh hay Massage Thái Y Học.</p><p><br></p><p><strong>BODHI
+    SPA - MASSAGE HỖ TRỢ TRỊ LIỆU</strong></p><p>➡️ <strong>Địa chỉ:</strong> Số 10,
+    Đường Bà Triệu, Khóm 3, Phường 3, Thành phố Bạc Liêu.</p><p>➡️ <strong>Hotline:</strong>
+    0898403345 - 0931089414.</p>
+  - '<p><strong>CÔNG TY TNHH THƯƠNG MẠI DỊCH VỤ &amp; SẢN XUẤT MIYAKO MIYAKO SERVICES
+    TRADING PRODUCTION COMPANY LIMITED</strong></p><p>🔸<strong>Miyako </strong>là
+    một thương hiệu khăn giấy cao cấp với công nghệ sản xuất cải tiến mới.</p><p>🔸Thương
+    hiệu chuyên cung cấp các loại khăn giấy chất lượng cao, được làm từ nguyên liệu
+    tự nhiên và an toàn cho sức khỏe. Khăn giấy Miyako được sản xuất từ 100% bột giấy
+    nguyên chất, không chứa hóa chất tẩy trắng. Khăn giấy có độ mềm mại, mịn màng,
+    thấm hút tốt và không gây kích ứng da.</p><p>🔸<strong>Miyako </strong>luôn cho
+    ra đa dạng các loại khăn giấy, đáp ứng nhu cầu sử dụng của mọi đối tượng khách
+    hàng. Các sản phẩm của <strong>Miyako </strong>bao gồm:</p><ul><li>Khăn giấy ăn:
+    Khăn giấy ăn Miyako được sử dụng phổ biến trong các gia đình, nhà hàng, khách
+    sạn. Khăn giấy có nhiều kích thước khác nhau, phù hợp với nhu cầu sử dụng.</li><li>Khăn
+    giấy lau tay: Khăn giấy lau tay Miyako được làm từ chất liệu mềm mại, giúp làm
+    sạch tay hiệu quả mà không gây khô da.</li></ul><p>🔸<strong>Miyako </strong>là
+    thương hiệu khăn giấy cao cấp được người tiêu dùng Việt Nam tin tưởng lựa chọn.
+    Thương hiệu không ngừng đổi mới và phát triển, mang đến cho khách hàng những sản
+    phẩm chất lượng cao, đáp ứng nhu cầu sử dụng ngày càng cao của người dùng Việt.</p><p>🔸Một
+    số ưu điểm của khăn giấy Miyako:</p><ul><li>Được làm từ nguyên liệu tự nhiên,
+    an toàn cho sức khỏe</li><li>Độ mềm mại, mịn màng, thấm hút tốt</li><li>Không
+    gây kích ứng da</li><li>Đa dạng các mẫu mã khăn giấy, đáp ứng nhu cầu sử dụng
+    của mọi đối tượng khách hàng</li><li>Hướng tới mục tiêu thương hiệu là người bạn
+    của mọi gia đình</li><li>Luôn nêu cao giữ gìn môi trường sống xanh</li></ul>'
+  - <p>🔸<strong>Spa Tiểu Tịnh</strong>, tọa lạc tại thị trấn <em>Đồng Nai</em>, là
+    một điểm đến lý tưởng cho những ai tìm kiếm sự thư giãn, chăm sóc sắc đẹp và sức
+    khỏe giữa không gian yên bình, tĩnh lặng. Với không gian trang nhã, gần gũi với
+    thiên nhiên, <strong>Tiểu Tịnh</strong> mang đến cho khách hàng những giây phút
+    thư thái, giải tỏa căng thẳng sau những ngày làm việc mệt mỏi.</p><p>🔸<strong>Spa
+    Tiểu Tịnh</strong> nổi bật với các dịch vụ chăm sóc sắc đẹp, sức khỏe và thư giãn
+    toàn diện, bao gồm massage thư giãn, trị liệu da, chăm sóc cơ thể bằng tinh dầu
+    tự nhiên, cùng các liệu trình làm đẹp sử dụng sản phẩm thiên nhiên an toàn và
+    hiệu quả. Đặc biệt, spa chú trọng đến việc kết hợp các phương pháp chăm sóc truyền
+    thống với công nghệ hiện đại, giúp tối ưu hóa hiệu quả điều trị và mang lại trải
+    nghiệm tuyệt vời cho khách hàng.</p><p>🔸Với đội ngũ nhân viên chuyên nghiệp, tận
+    tâm và nhiều năm kinh nghiệm, <strong>Spa Tiểu Tịnh </strong>cam kết mang đến
+    cho khách hàng những liệu trình phù hợp, giúp cải thiện sức khỏe, sắc đẹp và tinh
+    thần. Đến với<strong> Tiểu Tịnh</strong>, bạn không chỉ tận hưởng dịch vụ chất
+    lượng mà còn cảm nhận được sự thư giãn sâu sắc, khôi phục năng lượng và tìm lại
+    sự cân bằng trong cuộc sống.</p><p>🔸<strong>Spa Tiểu Tịnh</strong> là lựa chọn
+    tuyệt vời cho những ai muốn tìm một không gian tĩnh lặng, thư giãn giữa lòng thị
+    trấn <em>Đồng Nai</em>.</p>
+- source_sentence: bác sĩ nha khoa giỏi
+  sentences:
+  - <p>🟢 <strong>Ura Coffee </strong>với không gian rộng rãi, view từ lầu 2 nhìn xuống
+    2 mặt tiền đường Ung Văn Khiêm nhộn nhịp</p><p>🟢 Với tông màu mạnh pha lẫn chút
+    cổ điển mang lại cho quý khách những cảm giác kích thích ngay từ cái nhìn đầu
+    tiên, là nơi phù hợp để học tập, làm việc cũng như hội họp đội nhóm.</p>
+  - <p>🔸<strong>Thẩm mỹ viện Dr Bình</strong> với sự dẫn dắt bởi <em>Bác sĩ Chuyên
+    khoa 2</em> <strong>Phạm Văn Bình</strong>, người có <strong>25 </strong>năm kinh
+    nghiệm trong ngành y. Với chuyên môn sâu rộng, <strong>bác sĩ Bình</strong> không
+    chỉ xuất sắc trong các lĩnh vực phẫu thuật thẩm mỹ, nam khoa, phẫu thuật ngoại
+    khoa, phẫu thuật chuyển giới, mà còn nổi trội trong việc xử lý những ca thẩm mỹ
+    lỗi và địa chỉ thực hiện tốt những ca phẫu thuật cao cấp về thẩm mỹ.</p><p>🔸Ngoài
+    ra, <strong>bác sĩ Bình</strong> còn là một chuyên gia về lĩnh vực tế bào gốc,
+    đại diện phát triển cho thương hiệu tế bào gốc <em>Nhật Bản</em> trên toàn cầu,
+    đem đến những phương pháp hiện đại và tiên tiến nhất cho lĩnh vực làm đẹp, trẻ
+    hóa, hỗ trợ các bệnh nhân điều trị các bệnh mãn tính và ung thư tại Việt Nam và
+    thế giới. Không chỉ dừng lại ở việc khám chữa, <strong>bác sĩ Bình </strong>còn
+    là một nhà đào tạo uy tín, tích cực phát triển cộng đồng ngành y thông qua việc
+    chia sẻ kiến thức và kinh nghiệm của mình.</p><p>🔸Chúng tôi cam kết mang đến cho
+    khách hàng những dịch vụ thẩm mỹ cao cấp, an toàn, hiệu quả, đáp ứng mọi nhu cầu
+    và kỳ vọng. Hãy đến với chúng tôi để trải nghiệm sự thay đ���i hoàn hảo, quý phái
+    và được đảm bảo bằng sự uy tín của <strong>Dr Bình </strong>qua từng ca phẫu thuật.</p>
+  - <p>🔸<strong>NHA KHOA SMILE HUNTER - LÀ NHA KHOA CHUYÊN VỀ RĂNG SỨ - IMPLANT.</strong></p><p>🔸Nụ
+    cười đẹp và hàm răng chắc khỏe là tài sản và món quà quý giá của cuộc sống. Thấu
+    hiểu chân lý này, <strong>Nha khoa Smile Hunter</strong> luôn đặt lợi ích sức
+    khỏe răng miệng của quý khách lên hàng đầu. Đội ngũ bác sỹ giỏi, giàu kinh nghiệm,
+    nhiệt tình sẽ giúp bạn có hàm răng khỏe đẹp, nụ cười sáng ngời với chi phí ít
+    nhất trong thời gian ngắn nhất.</p>
+- source_sentence: đồ decor ngoài trời
+  sentences:
+  - <p><strong>🔸Công ty Kim Vinh</strong> chuyên cung cấp các sản phẩm trang trí nội,
+    ngoại thất chất lượng cao cho mọi không gian sống. Với đa dạng mẫu mã, từ đồ trang
+    trí phòng khách, phòng ngủ, đến các vật dụng ngoài trời như sân vườn, ban công,
+    <strong>Kim Vinh</strong> cam kết mang đến những sản phẩm không chỉ đẹp mắt mà
+    còn bền vững theo thời gian.</p><p>🔸Tất cả các mặt hàng tại công ty đều có sẵn,
+    giúp khách hàng dễ dàng lựa chọn và mua sắm nhanh chóng. Đặc biệt, <strong>Kim
+    Vinh</strong> luôn chú trọng đến dịch vụ giao hàng nhanh chóng, đảm bảo sự hài
+    lòng tuyệt đối cho mọi khách hàng. Chúng tôi tự hào là đối tác tin cậy của nhiều
+    gia đình, doanh nghiệp trong việc làm đẹp không gian sống và làm việc.</p>
+  - '<p>🔸<strong>CÔNG TY CỔ PHẦN THƯƠNG MẠI DƯỢC PHÚ QUÝ</strong></p><p>☎ <strong>Liên
+    hệ </strong>(Lê Thị Cát Thuỷ): 0968 063 134</p><p>☎ <strong>Liên hệ </strong>(Nguyễn
+    Thị Thoa): 0963 226 698</p>'
+  - <p>🔸<strong>Orchard Home Forest</strong> là một khu nghỉ dưỡng sinh thái tọa lạc
+    tại vùng ngoại ô, nơi hòa quyện giữa vẻ đẹp tự nhiên và không gian yên bình, lý
+    tưởng cho những ai tìm kiếm sự thư giãn và tái tạo năng lượng. Với diện tích rộng
+    lớn bao phủ bởi rừng cây xanh mát và những khu vườn trái cây trĩu quả, nơi đây
+    mang đến cho du khách một trải nghiệm gần gũi với thiên nhiên.</p><p>🔸<strong>Orchard
+    Home Forest</strong> không chỉ nổi bật với cảnh quan tươi đẹp, mà còn cung cấp
+    các dịch vụ nghỉ dưỡng cao cấp, như biệt thự nghỉ dưỡng, nhà gỗ tiện nghi, các
+    hoạt động ngoài trời như đi bộ, đạp xe và tham gia vào các chương trình chăm sóc
+    sức khỏe, yoga. Đặc biệt, khu nghỉ dưỡng này chú trọng phát triển bền vững và
+    bảo vệ môi trường, với các biện pháp giảm thiểu tác động đến hệ sinh thái địa
+    phương.</p><p>🔸Những ai ghé thăm <strong>Orchard Home Forest </strong>sẽ cảm nhận
+    được sự thư thái, tĩnh lặng, và làn gió mới cho tâm hồn, giúp tái tạo năng lượng
+    sau những ngày làm việc căng thẳng. Đây là điểm đến lý tưởng cho những ai yêu
+    thích sự kết hợp giữa nghỉ dưỡng và khám phá thiên nhiên.</p>
+pipeline_tag: sentence-similarity
+library_name: sentence-transformers
+---
+# SentenceTransformer based on Alibaba-NLP/gte-multilingual-base
+This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+## Model Details
+### Model Description
+- **Model Type:** Sentence Transformer
+- **Base model:** [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) <!-- at revision 9bbca17d9273fd0d03d5725c7a4b0f6b45142062 -->
+- **Maximum Sequence Length:** 512 tokens
+- **Output Dimensionality:** 768 dimensions
+- **Similarity Function:** Cosine Similarity
+<!-- - **Training Dataset:** Unknown -->
+<!-- - **Language:** Unknown -->
+<!-- - **License:** Unknown -->
+### Model Sources
+- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+### Full Model Architecture
+```
+SentenceTransformer(
+  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: NewModel
+  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+  (2): Normalize()
+)
+```
+## Usage
+### Direct Usage (Sentence Transformers)
+First install the Sentence Transformers library:
+```bash
+pip install -U sentence-transformers
+```
+Then you can load this model and run inference.
+```python
+from sentence_transformers import SentenceTransformer
+# Download from the 🤗 Hub
+model = SentenceTransformer("sentence_transformers_model_id")
+# Run inference
+sentences = [
+    'đồ decor ngoài trời',
+    '<p><strong>🔸Công ty Kim Vinh</strong> chuyên cung cấp các sản phẩm trang trí nội, ngoại thất chất lượng cao cho mọi không gian sống. Với đa dạng mẫu mã, từ đồ trang trí phòng khách, phòng ngủ, đến các vật dụng ngoài trời như sân vườn, ban công, <strong>Kim Vinh</strong> cam kết mang đến những sản phẩm không chỉ đẹp mắt mà còn bền vững theo thời gian.</p><p>🔸Tất cả các mặt hàng tại công ty đều có sẵn, giúp khách hàng dễ dàng lựa chọn và mua sắm nhanh chóng. Đặc biệt, <strong>Kim Vinh</strong> luôn chú trọng đến dịch vụ giao hàng nhanh chóng, đảm bảo sự hài lòng tuyệt đối cho mọi khách hàng. Chúng tôi tự hào là đối tác tin cậy của nhiều gia đình, doanh nghiệp trong việc làm đẹp không gian sống và làm việc.</p>',
+    '<p>🔸<strong>Orchard Home Forest</strong> là một khu nghỉ dưỡng sinh thái tọa lạc tại vùng ngoại ô, nơi hòa quyện giữa vẻ đẹp tự nhiên và không gian yên bình, lý tưởng cho những ai tìm kiếm sự thư giãn và tái tạo năng lượng. Với diện tích rộng lớn bao phủ bởi rừng cây xanh mát và những khu vườn trái cây trĩu quả, nơi đây mang đến cho du khách một trải nghiệm gần gũi với thiên nhiên.</p><p>🔸<strong>Orchard Home Forest</strong> không chỉ nổi bật với cảnh quan tươi đẹp, mà còn cung cấp các dịch vụ nghỉ dưỡng cao cấp, như biệt thự nghỉ dưỡng, nhà gỗ tiện nghi, các hoạt động ngoài trời như đi bộ, đạp xe và tham gia vào các chương trình chăm sóc sức khỏe, yoga. Đặc biệt, khu nghỉ dưỡng này chú trọng phát triển bền vững và bảo vệ môi trường, với các biện pháp giảm thiểu tác động đến hệ sinh thái địa phương.</p><p>🔸Những ai ghé thăm <strong>Orchard Home Forest </strong>sẽ cảm nhận được sự thư thái, tĩnh lặng, và làn gió mới cho tâm hồn, giúp tái tạo năng lượng sau những ngày làm việc căng thẳng. Đây là điểm đến lý tưởng cho những ai yêu thích sự kết hợp giữa nghỉ dưỡng và khám phá thiên nhiên.</p>',
+]
+embeddings = model.encode(sentences)
+print(embeddings.shape)
+# [3, 768]
+# Get the similarity scores for the embeddings
+similarities = model.similarity(embeddings, embeddings)
+print(similarities.shape)
+# [3, 3]
+```
+<!--
+### Direct Usage (Transformers)
+<details><summary>Click to see the direct usage in Transformers</summary>
+</details>
+-->
+<!--
+### Downstream Usage (Sentence Transformers)
+You can finetune this model on your own dataset.
+<details><summary>Click to expand</summary>
+</details>
+-->
+<!--
+### Out-of-Scope Use
+*List how the model may foreseeably be misused and address what users ought not to do with the model.*
+-->
+<!--
+## Bias, Risks and Limitations
+*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+-->
+<!--
+### Recommendations
+*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+-->
+## Training Details
+### Training Dataset
+#### Unnamed Dataset
+* Size: 5,964 training samples
+* Columns: <code>query</code>, <code>description</code>, and <code>negative</code>
+* Approximate statistics based on the first 1000 samples:
+  |         | query                                                                            | description                                                                          | negative                                                                             |
+  |:--------|:---------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|
+  | type    | string                                                                           | string                                                                               | string                                                                               |
+  | details | <ul><li>min: 4 tokens</li><li>mean: 7.01 tokens</li><li>max: 13 tokens</li></ul> | <ul><li>min: 14 tokens</li><li>mean: 254.77 tokens</li><li>max: 512 tokens</li></ul> | <ul><li>min: 15 tokens</li><li>mean: 289.38 tokens</li><li>max: 512 tokens</li></ul> |
+* Samples:
+  | query                           | description                                                                                                                                                                                                                                                                                                     | negative                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
+  |:--------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+  | <code>mật ong probiotics</code> | <code><p>🔸<strong>Sinh Kế Việt – VIJC</strong> tự hào là đơn vị cung cấp các giải pháp hỗ trợ sức khoẻ cộng đồng từ mật ong Probiotics tự nhiên kết hợp công nghệ hiện đại.</p><p>🔸Thông qua đó chúng tôi hỗ trợ phát triển sinh kế, khởi nghiệp bền vững cho bà con nông dân từ tài nguyên bản địa.</p></code> | <code><p>🔸<strong>Công ty CP Công Nghệ Sinh Học Dương Gia K&amp;T</strong> là đơn vị chuyên sản xuất và cung cấp mật ong lượng tử.</p><p>🔸Sản phẩm Mật Ong Lượng tử – EZ là sản phẩm hữu cơ, nguồn nguyên liệu sản xuất được tuyển chọn từ nhà nuôi phù hợp tiêu chuẩn, quy trình thiết kế chuyên biệt và được lấy hoàn toàn 100% là mật ong nuôi tại tỉnh Đắk Lắk, với giống ong Ý có tên khoa học là Apis Mellifera Ligustica. Sau đó được kích hoạt lượng tử giúp cho mật ong sạch và hấp thụ thêm năng lượng. Không bổ sung bất cứ thành phần hoạt chất nào và hoàn toàn không sử dụng chất bảo quản.</p><p>🔸Quy trình xử lý mật ong bằng phương pháp kích hoạt lượng tử là quy trình duy nhất tại Việt Nam được nghiên cứu và thực hiện bởi tác giả <strong>Dương Minh Tâm</strong> – Tài năng Khoa học và Công nghệ Việt Nam.</p></code>                                                                                                                                                                                                           |
+  | <code>mật ong probiotics</code> | <code><p>🔸<strong>Sinh Kế Việt – VIJC</strong> tự hào là đơn vị cung cấp các giải pháp hỗ trợ sức khoẻ cộng đồng từ mật ong Probiotics tự nhiên kết hợp công nghệ hiện đại.</p><p>🔸Thông qua đó chúng tôi hỗ trợ phát triển sinh kế, khởi nghiệp bền vững cho bà con nông dân từ tài nguyên bản địa.</p></code> | <code><p>🔸<strong>Công ty TNHH TV-TM-DV GĐT</strong>, với trụ sở tại<em> Đồng Tháp</em>, là đơn vị tiên phong trong lĩnh vực thương mại các sản phẩm bổ sung và chăm sóc sức khỏe dành cho mẹ và bé. Với sứ mệnh mang đến những sản phẩm an toàn, chất lượng cao, <strong>GĐT </strong>cam kết cung cấp giải pháp tối ưu nhằm nâng cao sức khỏe cộng đồng, đặc biệt là cho các gia đình Việt.</p><p>🔸Sản phẩm nổi bật của chúng tôi – sirô bổ sung chất xơ <strong>Bukhoe </strong>– đã chiếm được lòng tin của hàng ngàn bà mẹ nhờ hiệu quả trong việc hỗ trợ giảm táo bón và tăng cường hệ vi khuẩn có lợi cho đường ruột. <strong>Bukhoe </strong>được làm từ chất xơ nhập khẩu từ <strong>Mỹ</strong>, sản xuất trong nhà máy đạt tiêu chuẩn <strong>GMP</strong>, đảm bảo chất lượng và an toàn tuyệt đối. Sản phẩm có hương vị thơm ngon, dễ uống và tiện lợi cho cả gia đình sử dụng hàng ngày.</p><p>🔸Điểm đặc biệt của <strong>Bukhoe </strong>là thành phần sirô chiết xuất từ đường bắp, an toàn cho người tiểu đường và trẻ em có ...</code> |
+  | <code>mật ong probiotics</code> | <code><p>🔸<strong>Sinh Kế Việt – VIJC</strong> tự hào là đơn vị cung cấp các giải pháp hỗ trợ sức khoẻ cộng đồng từ mật ong Probiotics tự nhiên kết hợp công nghệ hiện đại.</p><p>🔸Thông qua đó chúng tôi hỗ trợ phát triển sinh kế, khởi nghiệp bền vững cho bà con nông dân từ tài nguyên bản địa.</p></code> | <code><p>🔸<strong>BIO-HOPE</strong> là thương hiệu hàng đầu về sản xuất đông trùng hạ thảo tại Việt Nam. Sở Hữu nhà máy hiện đại được chuyển giao công nghệ sản xuất từ <em>Nhật Bản</em>, <strong>BIO-HOPE</strong> tự hào có hơn <strong>10</strong> năm kinh nghiệm phát triển các sản phẩm chăm sóc sức khỏe cao cấp. Các sản phẩm chủ lực bao gồm đông trùng hạ thảo, đông trùng yến, cao đông trùng linh chi, mật ong đông trùng hạ thảo,...</p><p>🔸Sản phẩm của <strong>BIO-HOPE</strong> không chỉ tăng cường sức đề kháng, giúp ngủ ngon mà còn hỗ trợ phục hồi và duy trì các chức năng quan trọng của cơ thể như gan, thận, phổi và tim mạch. Cam kết của <strong>BIO-HOPE</strong> là mang đến cho khách hàng các giải pháp sức khỏe vượt trội, đáp ứng nhu cầu chăm sóc toàn diện.</p><p>Giải Thưởng Đạt Được:</p><p>🔸"<em>Top 10 Thương Hiệu Phát Triển Quốc Gia 2022</em>" và "<em>Top 10 Thương Hiệu Vàng,</em> <em>Chất Lượng Quốc Tế 2022</em>" là những danh hiệu danh giá mà <strong>BIO-HOPE</strong> đã đạt được, khẳng ...</code> |
+* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
+  ```json
+  {
+      "scale": 20.0,
+      "similarity_fct": "cos_sim"
+  }
+  ```
+### Training Hyperparameters
+#### Non-Default Hyperparameters
+- `per_device_train_batch_size`: 16
+- `per_device_eval_batch_size`: 16
+- `learning_rate`: 2e-05
+- `weight_decay`: 0.01
+- `num_train_epochs`: 1
+- `warmup_steps`: 100
+- `fp16`: True
+- `dataloader_num_workers`: 4
+#### All Hyperparameters
+<details><summary>Click to expand</summary>
+- `overwrite_output_dir`: False
+- `do_predict`: False
+- `eval_strategy`: no
+- `prediction_loss_only`: True
+- `per_device_train_batch_size`: 16
+- `per_device_eval_batch_size`: 16
+- `per_gpu_train_batch_size`: None
+- `per_gpu_eval_batch_size`: None
+- `gradient_accumulation_steps`: 1
+- `eval_accumulation_steps`: None
+- `torch_empty_cache_steps`: None
+- `learning_rate`: 2e-05
+- `weight_decay`: 0.01
+- `adam_beta1`: 0.9
+- `adam_beta2`: 0.999
+- `adam_epsilon`: 1e-08
+- `max_grad_norm`: 1.0
+- `num_train_epochs`: 1
+- `max_steps`: -1
+- `lr_scheduler_type`: linear
+- `lr_scheduler_kwargs`: {}
+- `warmup_ratio`: 0.0
+- `warmup_steps`: 100
+- `log_level`: passive
+- `log_level_replica`: warning
+- `log_on_each_node`: True
+- `logging_nan_inf_filter`: True
+- `save_safetensors`: True
+- `save_on_each_node`: False
+- `save_only_model`: False
+- `restore_callback_states_from_checkpoint`: False
+- `no_cuda`: False
+- `use_cpu`: False
+- `use_mps_device`: False
+- `seed`: 42
+- `data_seed`: None
+- `jit_mode_eval`: False
+- `use_ipex`: False
+- `bf16`: False
+- `fp16`: True
+- `fp16_opt_level`: O1
+- `half_precision_backend`: auto
+- `bf16_full_eval`: False
+- `fp16_full_eval`: False
+- `tf32`: None
+- `local_rank`: 0
+- `ddp_backend`: None
+- `tpu_num_cores`: None
+- `tpu_metrics_debug`: False
+- `debug`: []
+- `dataloader_drop_last`: False
+- `dataloader_num_workers`: 4
+- `dataloader_prefetch_factor`: None
+- `past_index`: -1
+- `disable_tqdm`: False
+- `remove_unused_columns`: True
+- `label_names`: None
+- `load_best_model_at_end`: False
+- `ignore_data_skip`: False
+- `fsdp`: []
+- `fsdp_min_num_params`: 0
+- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+- `fsdp_transformer_layer_cls_to_wrap`: None
+- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+- `deepspeed`: None
+- `label_smoothing_factor`: 0.0
+- `optim`: adamw_torch
+- `optim_args`: None
+- `adafactor`: False
+- `group_by_length`: False
+- `length_column_name`: length
+- `ddp_find_unused_parameters`: None
+- `ddp_bucket_cap_mb`: None
+- `ddp_broadcast_buffers`: False
+- `dataloader_pin_memory`: True
+- `dataloader_persistent_workers`: False
+- `skip_memory_metrics`: True
+- `use_legacy_prediction_loop`: False
+- `push_to_hub`: False
+- `resume_from_checkpoint`: None
+- `hub_model_id`: None
+- `hub_strategy`: every_save
+- `hub_private_repo`: None
+- `hub_always_push`: False
+- `hub_revision`: None
+- `gradient_checkpointing`: False
+- `gradient_checkpointing_kwargs`: None
+- `include_inputs_for_metrics`: False
+- `include_for_metrics`: []
+- `eval_do_concat_batches`: True
+- `fp16_backend`: auto
+- `push_to_hub_model_id`: None
+- `push_to_hub_organization`: None
+- `mp_parameters`:
+- `auto_find_batch_size`: False
+- `full_determinism`: False
+- `torchdynamo`: None
+- `ray_scope`: last
+- `ddp_timeout`: 1800
+- `torch_compile`: False
+- `torch_compile_backend`: None
+- `torch_compile_mode`: None
+- `include_tokens_per_second`: False
+- `include_num_input_tokens_seen`: False
+- `neftune_noise_alpha`: None
+- `optim_target_modules`: None
+- `batch_eval_metrics`: False
+- `eval_on_start`: False
+- `use_liger_kernel`: False
+- `liger_kernel_config`: None
+- `eval_use_gather_object`: False
+- `average_tokens_across_devices`: False
+- `prompts`: None
+- `batch_sampler`: batch_sampler
+- `multi_dataset_batch_sampler`: proportional
+</details>
+### Training Logs
+| Epoch  | Step | Training Loss |
+|:------:|:----:|:-------------:|
+| 0.0268 | 10   | 1.9618        |
+| 0.0536 | 20   | 1.8811        |
+| 0.0804 | 30   | 1.8416        |
+| 0.1072 | 40   | 1.7155        |
+| 0.1340 | 50   | 1.5214        |
+| 0.1609 | 60   | 1.6549        |
+| 0.1877 | 70   | 1.5042        |
+| 0.2145 | 80   | 1.3187        |
+| 0.2413 | 90   | 1.3651        |
+| 0.2681 | 100  | 1.1366        |
+| 0.2949 | 110  | 1.2463        |
+| 0.3217 | 120  | 1.1371        |
+| 0.3485 | 130  | 1.1857        |
+| 0.3753 | 140  | 1.1055        |
+| 0.4021 | 150  | 1.1215        |
+| 0.4290 | 160  | 1.206         |
+| 0.4558 | 170  | 1.0617        |
+| 0.4826 | 180  | 1.169         |
+| 0.5094 | 190  | 1.1628        |
+| 0.5362 | 200  | 1.1601        |
+| 0.5630 | 210  | 1.1011        |
+| 0.5898 | 220  | 0.8756        |
+| 0.6166 | 230  | 1.1203        |
+| 0.6434 | 240  | 0.8605        |
+| 0.6702 | 250  | 0.9246        |
+| 0.6971 | 260  | 0.9078        |
+| 0.7239 | 270  | 1.1245        |
+| 0.7507 | 280  | 0.8946        |
+| 0.7775 | 290  | 0.8846        |
+| 0.8043 | 300  | 0.9328        |
+| 0.8311 | 310  | 1.0018        |
+| 0.8579 | 320  | 0.9624        |
+| 0.8847 | 330  | 0.8516        |
+| 0.9115 | 340  | 0.836         |
+| 0.9383 | 350  | 0.7674        |
+| 0.9651 | 360  | 0.937         |
+| 0.9920 | 370  | 0.9097        |
+### Framework Versions
+- Python: 3.11.13
+- Sentence Transformers: 4.1.0
+- Transformers: 4.53.0
+- PyTorch: 2.6.0+cu124
+- Accelerate: 1.8.1
+- Datasets: 2.14.4
+- Tokenizers: 0.21.2
+## Citation
+### BibTeX
+#### Sentence Transformers
+```bibtex
+@inproceedings{reimers-2019-sentence-bert,
+    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+    author = "Reimers, Nils and Gurevych, Iryna",
+    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+    month = "11",
+    year = "2019",
+    publisher = "Association for Computational Linguistics",
+    url = "https://arxiv.org/abs/1908.10084",
+}
+```
+#### MultipleNegativesRankingLoss
+```bibtex
+@misc{henderson2017efficient,
+    title={Efficient Natural Language Response Suggestion for Smart Reply},
+    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+    year={2017},
+    eprint={1705.00652},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+```
+<!--
+## Glossary
+*Clearly define terms in order to be accessible across audiences.*
+-->
+<!--
+## Model Card Authors
+*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+-->
+<!--
+## Model Card Contact
+*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+-->

config.json ADDED Viewed

	@@ -0,0 +1,49 @@

+{
+  "architectures": [
+    "NewModel"
+  ],
+  "attention_probs_dropout_prob": 0.0,
+  "auto_map": {
+    "AutoConfig": "configuration.NewConfig",
+    "AutoModel": "modeling.NewModel",
+    "AutoModelForMaskedLM": "Alibaba-NLP/new-impl--modeling.NewForMaskedLM",
+    "AutoModelForMultipleChoice": "Alibaba-NLP/new-impl--modeling.NewForMultipleChoice",
+    "AutoModelForQuestionAnswering": "Alibaba-NLP/new-impl--modeling.NewForQuestionAnswering",
+    "AutoModelForSequenceClassification": "Alibaba-NLP/new-impl--modeling.NewForSequenceClassification",
+    "AutoModelForTokenClassification": "Alibaba-NLP/new-impl--modeling.NewForTokenClassification"
+  },
+  "classifier_dropout": 0.0,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "id2label": {
+    "0": "LABEL_0"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "label2id": {
+    "LABEL_0": 0
+  },
+  "layer_norm_eps": 1e-12,
+  "layer_norm_type": "layer_norm",
+  "logn_attention_clip1": false,
+  "logn_attention_scale": false,
+  "max_position_embeddings": 8192,
+  "model_type": "new",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pack_qkv": true,
+  "pad_token_id": 1,
+  "position_embedding_type": "rope",
+  "rope_scaling": {
+    "factor": 8.0,
+    "type": "ntk"
+  },
+  "rope_theta": 20000,
+  "torch_dtype": "float32",
+  "transformers_version": "4.53.0",
+  "type_vocab_size": 1,
+  "unpad_inputs": false,
+  "use_memory_efficient_attention": false,
+  "vocab_size": 250048
+}

config_sentence_transformers.json ADDED Viewed

	@@ -0,0 +1,10 @@

+{
+  "__version__": {
+    "sentence_transformers": "4.1.0",
+    "transformers": "4.53.0",
+    "pytorch": "2.6.0+cu124"
+  },
+  "prompts": {},
+  "default_prompt_name": null,
+  "similarity_fn_name": "cosine"
+}

configuration.py ADDED Viewed

	@@ -0,0 +1,145 @@

+# coding=utf-8
+# Copyright 2024 The GTE Team Authors and Alibaba Group.
+# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" NEW model configuration"""
+from transformers.configuration_utils import PretrainedConfig
+from transformers.utils import logging
+logger = logging.get_logger(__name__)
+class NewConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`NewModel`] or a [`TFNewModel`]. It is used to
+    instantiate a NEW model according to the specified arguments, defining the model architecture. Instantiating a
+    configuration with the defaults will yield a similar configuration to that of the NEW
+    [izhx/new-base-en](https://huggingface.co/izhx/new-base-en) architecture.
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+    Args:
+        vocab_size (`int`, *optional*, defaults to 30522):
+            Vocabulary size of the NEW model. Defines the number of different tokens that can be represented by the
+            `inputs_ids` passed when calling [`NewModel`] or [`TFNewModel`].
+        hidden_size (`int`, *optional*, defaults to 768):
+            Dimensionality of the encoder layers and the pooler layer.
+        num_hidden_layers (`int`, *optional*, defaults to 12):
+            Number of hidden layers in the Transformer encoder.
+        num_attention_heads (`int`, *optional*, defaults to 12):
+            Number of attention heads for each attention layer in the Transformer encoder.
+        intermediate_size (`int`, *optional*, defaults to 3072):
+            Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
+        hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
+            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
+            `"relu"`, `"silu"` and `"gelu_new"` are supported.
+        hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
+            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
+        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
+            The dropout ratio for the attention probabilities.
+        max_position_embeddings (`int`, *optional*, defaults to 512):
+            The maximum sequence length that this model might ever be used with. Typically set this to something large
+            just in case (e.g., 512 or 1024 or 2048).
+        type_vocab_size (`int`, *optional*, defaults to 2):
+            The vocabulary size of the `token_type_ids` passed when calling [`NewModel`] or [`TFNewModel`].
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        layer_norm_eps (`float`, *optional*, defaults to 1e-12):
+            The epsilon used by the layer normalization layers.
+        position_embedding_type (`str`, *optional*, defaults to `"rope"`):
+            Type of position embedding. Choose one of `"absolute"`, `"rope"`.
+        rope_theta (`float`, *optional*, defaults to 10000.0):
+            The base period of the RoPE embeddings.
+        rope_scaling (`Dict`, *optional*):
+            Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
+            strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
+            `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
+            `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
+            these scaling strategies behave:
+            https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
+            experimental feature, subject to breaking API changes in future versions.
+        classifier_dropout (`float`, *optional*):
+            The dropout ratio for the classification head.
+    Examples:
+    ```python
+    >>> from transformers import NewConfig, NewModel
+    >>> # Initializing a NEW izhx/new-base-en style configuration
+    >>> configuration = NewConfig()
+    >>> # Initializing a model (with random weights) from the izhx/new-base-en style configuration
+    >>> model = NewModel(configuration)
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+    model_type = "new"
+    def __init__(
+        self,
+        vocab_size=30528,
+        hidden_size=768,
+        num_hidden_layers=12,
+        num_attention_heads=12,
+        intermediate_size=3072,
+        hidden_act="gelu",
+        hidden_dropout_prob=0.1,
+        attention_probs_dropout_prob=0.0,
+        max_position_embeddings=2048,
+        type_vocab_size=1,
+        initializer_range=0.02,
+        layer_norm_type='layer_norm',
+        layer_norm_eps=1e-12,
+        # pad_token_id=0,
+        position_embedding_type="rope",
+        rope_theta=10000.0,
+        rope_scaling=None,
+        classifier_dropout=None,
+        pack_qkv=True,
+        unpad_inputs=False,
+        use_memory_efficient_attention=False,
+        logn_attention_scale=False,
+        logn_attention_clip1=False,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+        self.vocab_size = vocab_size
+        self.hidden_size = hidden_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.hidden_act = hidden_act
+        self.intermediate_size = intermediate_size
+        self.hidden_dropout_prob = hidden_dropout_prob
+        self.attention_probs_dropout_prob = attention_probs_dropout_prob
+        self.max_position_embeddings = max_position_embeddings
+        self.type_vocab_size = type_vocab_size
+        self.initializer_range = initializer_range
+        self.layer_norm_type = layer_norm_type
+        self.layer_norm_eps = layer_norm_eps
+        self.position_embedding_type = position_embedding_type
+        self.rope_theta = rope_theta
+        self.rope_scaling = rope_scaling
+        self.classifier_dropout = classifier_dropout
+        self.pack_qkv = pack_qkv
+        self.unpad_inputs = unpad_inputs
+        self.use_memory_efficient_attention = use_memory_efficient_attention
+        self.logn_attention_scale = logn_attention_scale
+        self.logn_attention_clip1 = logn_attention_clip1

model.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:16894efe99fb3745a03bbaa9dfb72a0ce0a8668ef5e1216fd3b40951668441cc
+size 1221487872

modeling.py ADDED Viewed

	@@ -0,0 +1,1418 @@

+# coding=utf-8
+# Copyright 2024 The GTE Team Authors and Alibaba Group.
+# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""PyTorch NEW model."""
+import math
+from dataclasses import dataclass
+from typing import List, Optional, Tuple, Union
+import torch
+import torch.utils.checkpoint
+from torch import nn
+from transformers.activations import ACT2FN
+from transformers.modeling_outputs import (
+    BaseModelOutput,
+    BaseModelOutputWithPooling,
+    MaskedLMOutput,
+    MultipleChoiceModelOutput,
+    QuestionAnsweringModelOutput,
+    SequenceClassifierOutput,
+    ModelOutput,
+)
+from transformers.modeling_utils import PreTrainedModel
+from transformers.utils import logging
+try:
+    import xformers.ops as xops
+except ImportError as e:
+    xops = None
+from .configuration import NewConfig
+logger = logging.get_logger(__name__)
+# Adapted from https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/bert_padding.py
+# Which was adapted from https://github.com/mlcommons/training_results_v1.1/blob/main/NVIDIA/benchmarks/bert/implementations/pytorch/padding.py
+class IndexFirstAxis(torch.autograd.Function):
+    @staticmethod
+    def forward(ctx, input, indices):
+        ctx.save_for_backward(indices)
+        assert input.ndim >= 2
+        ctx.first_axis_dim, other_shape = input.shape[0], input.shape[1:]
+        second_dim = other_shape.numel()
+        # TD [2022-03-04] For some reason torch.gather is a bit faster than indexing.
+        # return input[indices]
+        # return torch.gather(
+        #     rearrange(input, "b ... -> b (...)"), 0, repeat(indices, "z -> z d", d=second_dim)
+        # ).reshape(-1, *other_shape)
+        return torch.gather(
+            input.view(ctx.first_axis_dim, second_dim),
+            0,
+            indices.unsqueeze(-1).expand(indices.size(0), second_dim)
+        ).reshape(-1, *other_shape)
+    @staticmethod
+    def backward(ctx, grad_output):
+        (indices,) = ctx.saved_tensors
+        assert grad_output.ndim >= 2
+        other_shape = grad_output.shape[1:]
+        # grad_output = rearrange(grad_output, "b ... -> b (...)")
+        grad_output = grad_output.view(grad_output.size(0), other_shape.numel())
+        grad_input = torch.zeros(
+            [ctx.first_axis_dim, grad_output.shape[1]],
+            device=grad_output.device,
+            dtype=grad_output.dtype,
+        )
+        # TD [2022-03-04] For some reason torch.scatter is a bit faster than indexing.
+        # grad_input[indices] = grad_output
+        # grad_input.scatter_(0, repeat(indices, "z -> z d", d=grad_output.shape[1]), grad_output)
+        grad_input.scatter_(
+            0, indices.unsqueeze(-1).expand(indices.size(0), grad_output.size(1)), grad_output
+        )
+        return grad_input.reshape(ctx.first_axis_dim, *other_shape), None
+index_first_axis = IndexFirstAxis.apply
+def unpad_input(hidden_states, attention_mask=None, indices=None):
+    """
+    Arguments:
+        hidden_states: (batch, seqlen, ...)
+        attention_mask: (batch, seqlen), bool / int, 1 means valid and 0 means not valid.
+        indices: (total_nnz), the indices of non-masked tokens from the flattened input sequence.
+    Return:
+        hidden_states: (total_nnz, ...), where total_nnz = number of tokens in selected in attention_mask.
+    """
+    if indices is None:
+        assert attention_mask is not None
+        indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
+    # TD [2022-03-04] We don't want to index with a bool mask, because Pytorch will expand the
+    # bool mask, then call nonzero to get the indices, then index with those. The indices is @dim
+    # times larger than it needs to be, wasting memory. It's faster and more memory-efficient to
+    # index with integer indices. Moreover, torch's index is a bit slower than it needs to be,
+    # so we write custom forward and backward to make it a bit faster.
+    hidden_states = hidden_states.view(-1, *hidden_states.shape[2:])
+    return index_first_axis(hidden_states, indices)
+class IndexPutFirstAxis(torch.autograd.Function):
+    @staticmethod
+    def forward(
+        ctx,
+        values: torch.Tensor,
+        indices: torch.Tensor,
+        first_axis_dim
+    ) -> torch.Tensor:
+        ctx.save_for_backward(indices)
+        assert indices.ndim == 1
+        assert values.ndim >= 2
+        output = torch.zeros(
+            first_axis_dim, *values.shape[1:], device=values.device, dtype=values.dtype
+        )
+        output[indices] = values
+        return output
+    @staticmethod
+    def backward(ctx, grad_output: torch.Tensor) -> Tuple[torch.Tensor, None, None]:
+        indices, = ctx.saved_tensors
+        grad_values = grad_output[indices]
+        return grad_values, None, None
+index_put_first_axis = IndexPutFirstAxis.apply
+def pad_input(inputs: torch.Tensor, indices: torch.Tensor, batch: int, seqlen: int) -> torch.Tensor:
+    """Add padding to sequences.
+    Arguments:
+        inputs: (total_nnz, ...), where total_nnz = number of tokens in selected in attention_mask.
+        indices: (total_nnz), `indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()`
+        batch: int batch_size
+        seqlen: int max sequence length
+    Returns:
+        inputs: (batch, seqlen, ...)
+    """
+    output = index_put_first_axis(inputs, indices, batch * seqlen)
+    return output.view(batch, seqlen, *inputs.shape[1:])
+def rotate_half(x):
+    """Rotates half the hidden dims of the input."""
+    x1 = x[..., : x.shape[-1] // 2]
+    x2 = x[..., x.shape[-1] // 2 :]
+    return torch.cat((-x2, x1), dim=-1)
+def apply_rotary_pos_emb(q, k, cos, sin):
+    """Applies Rotary Position Embedding to the query and key tensors.
+    Args:
+        q (`torch.Tensor`): The query tensor.
+        k (`torch.Tensor`): The key tensor.
+        cos (`torch.Tensor`): The cosine part of the rotary embedding.
+        sin (`torch.Tensor`): The sine part of the rotary embedding.
+    Returns:
+        `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
+    """
+    cos, sin = cos.to(q.dtype), sin.to(q.dtype)
+    q_embed = (q * cos) + (rotate_half(q) * sin)
+    k_embed = (k * cos) + (rotate_half(k) * sin)
+    return q_embed, k_embed
+class RotaryEmbedding(torch.nn.Module):
+    def __init__(self, dim, max_position_embeddings=512, base=10000.0, device=None):
+        super().__init__()
+        self.dim = dim
+        self.max_position_embeddings = max_position_embeddings
+        self.base = base
+        inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
+        self.register_buffer("inv_freq", inv_freq, persistent=False)
+        # Build here to make `torch.jit.trace` work.
+        self._set_cos_sin_cache(
+            seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
+        )
+    def _set_cos_sin_cache(self, seq_len, device, dtype):
+        self.max_seq_len_cached = seq_len
+        t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
+        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
+        # Different from paper, but it uses a different permutation in order to obtain the same calculation
+        emb = torch.cat((freqs, freqs), dim=-1)
+        self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
+        self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
+    def forward(self, x, seq_len=None):
+        # x: [bs, num_attention_heads, seq_len, head_size]
+        if seq_len > self.max_seq_len_cached:
+            self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
+        return (
+            self.cos_cached[:seq_len, ...].to(dtype=x.dtype),
+            self.sin_cached[:seq_len, ...].to(dtype=x.dtype),
+        )
+class NTKScalingRotaryEmbedding(RotaryEmbedding):
+    """RotaryEmbedding extended with fixed and mixed NTK scaling. https://kexue.fm/archives/9706 """
+    def __init__(self, dim, max_position_embeddings=512, base=10000, device=None, scaling_factor=1.0, mixed_b=None):
+        self.scaling_factor = scaling_factor
+        self.mixed_b = mixed_b
+        super().__init__(dim, max_position_embeddings, base, device)
+        max_position_embeddings = max_position_embeddings * self.scaling_factor
+        self._set_cos_sin_cache(max_position_embeddings, self.inv_freq.device, torch.get_default_dtype())
+    def _set_cos_sin_cache(self, seq_len, device, dtype):
+        self.max_seq_len_cached = seq_len
+        if seq_len > self.max_position_embeddings:
+            base = self.base * (self.scaling_factor if self.mixed_b is None else 1)
+            inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
+            if self.mixed_b is None:
+                inv_freq = inv_freq / self.scaling_factor ** (2 / self.dim)  # (6)
+            else:
+                a = torch.tensor(self.scaling_factor).log() / (self.dim / 2) ** self.mixed_b  # (13)
+                lambda_1_m = (a * torch.arange(1, self.dim // 2 + 1).float().to(device) ** self.mixed_b).exp()  # (12)
+                inv_freq = inv_freq / lambda_1_m  # (10)
+            self.register_buffer("inv_freq", inv_freq, persistent=False)
+        t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
+        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
+        # Different from paper, but it uses a different permutation in order to obtain the same calculation
+        emb = torch.cat((freqs, freqs), dim=-1)
+        self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
+        self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
+class RMSNorm(nn.Module):
+    def __init__(self, hidden_size, eps=1e-6):
+        """
+        RMSNorm is equivalent to T5LayerNorm
+        """
+        super().__init__()
+        self.weight = nn.Parameter(torch.ones(hidden_size))
+        self.variance_epsilon = eps
+    def forward(self, hidden_states):
+        input_dtype = hidden_states.dtype
+        hidden_states = hidden_states.to(torch.float32)
+        variance = hidden_states.pow(2).mean(-1, keepdim=True)
+        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+        return self.weight * hidden_states.to(input_dtype)
+LAYER_NORM = {
+    'layer_norm': nn.LayerNorm,
+    'rms_norm': RMSNorm
+}
+class NewEmbeddings(nn.Module):
+    """
+    Embedding and Unpadding.
+    """
+    def __init__(self, config: NewConfig):
+        super().__init__()
+        self.padding_idx = config.pad_token_id
+        self.word_embeddings = nn.Embedding(
+            config.vocab_size, config.hidden_size, padding_idx=self.padding_idx
+        )
+        self.position_embedding_type = config.position_embedding_type
+        if self.position_embedding_type == 'absolute':
+            self.position_embeddings = nn.Embedding(
+                config.max_position_embeddings, config.hidden_size, padding_idx=self.padding_idx
+            )
+        elif self.position_embedding_type == 'rope':
+            self._init_rope(config)
+        else:
+            raise ValueError
+        self.type_vocab_size = config.type_vocab_size
+        if self.type_vocab_size > 0:
+            self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
+        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
+        # any TensorFlow checkpoint file
+        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+        self.dropout = nn.Dropout(config.hidden_dropout_prob)
+        # position_ids is contiguous in memory and excluded when serialized
+        self.register_buffer(
+            "position_ids", torch.arange(config.max_position_embeddings), persistent=False
+        )
+    def _init_rope(self, config):
+        kwargs = dict(
+            dim=int(config.hidden_size / config.num_attention_heads),
+            max_position_embeddings=config.max_position_embeddings,
+            base=config.rope_theta
+        )
+        if config.rope_scaling is None:
+            self.rotary_emb = RotaryEmbedding(**kwargs)
+        else:
+            kwargs.update(scaling_factor=config.rope_scaling["factor"])
+            scaling_type = config.rope_scaling["type"]
+            if scaling_type == 'ntk':
+                kwargs.update(mixed_b=config.rope_scaling.get('mixed_b', None))
+                self.rotary_emb = NTKScalingRotaryEmbedding(**kwargs)
+            # elif scaling_type == "linear":
+            #     self.rotary_emb = LinearScalingRotaryEmbedding(**kwargs)
+            # elif scaling_type == "dynamic":
+            #     self.rotary_emb = DynamicNTKScalingRotaryEmbedding(**kwargs)
+            else:
+                raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
+    def forward(
+        self,
+        unpad_inputs: bool,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        length: Optional[List[int]] = None,
+        token_type_ids: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+    ) -> Tuple[torch.Tensor, torch.Tensor, Optional[Tuple], Optional[List[int]]]:
+        """
+        """
+        if inputs_embeds is None:
+            device, input_shape = input_ids.device, input_ids.shape
+        else:
+            device, input_shape = inputs_embeds.device, inputs_embeds.shape[:2]
+        batch_size, seq_length = input_shape
+        # Set attention_mask if it's None
+        if attention_mask is None:
+            attention_mask = torch.ones(input_shape, device=device)
+            if length is not None:
+                for i, l in enumerate(length):
+                    attention_mask[i, l:] = 0
+        # Set attention_mask_bool for unpadding
+        if unpad_inputs:
+            attention_mask_bool = attention_mask.bool()
+            if length is None:
+                length = attention_mask.sum(-1).tolist()
+        # Get word embeddings
+        if inputs_embeds is None:
+            if unpad_inputs:
+                input_ids = input_ids[attention_mask_bool].unsqueeze(0)
+            inputs_embeds = self.word_embeddings(input_ids)
+        else:
+            if unpad_inputs:
+                inputs_embeds = inputs_embeds[attention_mask_bool].unsqueeze(0)
+        embeddings = inputs_embeds
+        # Set and unpad position_ids
+        if position_ids is None:
+            if seq_length > self.position_ids.size(0):
+                self.register_buffer(
+                    "position_ids", torch.arange(seq_length, device=embeddings.device), persistent=False
+                )
+            if unpad_inputs:
+                # [1, cumsum_seq_len]
+                position_ids = torch.cat([self.position_ids[:l] for l in length]).unsqueeze(0)
+            else:
+                # [bs, seq_len]
+                position_ids = self.position_ids[:seq_length].expand(batch_size, -1)
+        elif unpad_inputs:
+            position_ids = position_ids[attention_mask_bool].unsqueeze(0)  # [1, cumsum_seq_len]
+        # Compute rotary embedding
+        if self.position_embedding_type == 'rope':
+            rope_cos, rope_sin = self.rotary_emb(inputs_embeds, seq_len=seq_length)
+            rope_cos = rope_cos[position_ids].unsqueeze(2)  # [bs, seq_len, 1, dim]
+            rope_sin = rope_sin[position_ids].unsqueeze(2)  # [bs, seq_len, 1, dim]
+            rope_embeds = rope_cos, rope_sin
+        else:
+            rope_embeds = None
+        if self.type_vocab_size > 0:
+            if token_type_ids is None:
+                token_type_ids = position_ids.mul(0)
+            else:
+                if self.type_vocab_size < 2:
+                    token_type_ids.mul_(0)
+                if unpad_inputs:
+                    token_type_ids = token_type_ids[attention_mask_bool].unsqueeze(0)
+            token_type_embeddings = self.token_type_embeddings(token_type_ids)
+            embeddings = embeddings + token_type_embeddings
+        # BERT position
+        if self.position_embedding_type == "absolute":
+            position_embeddings = self.position_embeddings(position_ids)
+            embeddings = embeddings + position_embeddings
+        embeddings = self.LayerNorm(embeddings)
+        embeddings = self.dropout(embeddings)
+        return embeddings, attention_mask, rope_embeds, length
+class NewAttention(nn.Module):
+    def __init__(self, config: NewConfig, pack_qkv=None, use_memory_efficient_attention=None):
+        super().__init__()
+        self.config = config
+        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
+            raise ValueError(
+                f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
+                f"heads ({config.num_attention_heads})"
+            )
+        self.hidden_size = config.hidden_size
+        self.num_attention_heads = config.num_attention_heads
+        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
+        self.all_head_size = self.num_attention_heads * self.attention_head_size
+        if pack_qkv is None:
+            pack_qkv = config.pack_qkv
+        self.pack_qkv = pack_qkv
+        if self.pack_qkv:
+            self.qkv_proj = nn.Linear(config.hidden_size, self.all_head_size * 3, bias=True)
+        else:
+            self.q_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
+            self.k_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
+            self.v_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
+        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
+        self.o_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=True)
+        if use_memory_efficient_attention is None:
+            use_memory_efficient_attention = self.config.use_memory_efficient_attention
+        self.use_memory_efficient_attention = use_memory_efficient_attention
+        self.memory_efficient_attention = None if xops is None else xops.memory_efficient_attention
+        if self.use_memory_efficient_attention:
+            assert self.memory_efficient_attention is not None, 'please install xformers'
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_bias: torch.FloatTensor,
+        rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
+        padding_inputs: Optional[Tuple] = None,  # indices, batch, seqlen
+        attention_scale: Optional[torch.FloatTensor] = None,
+        head_mask: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = False,
+        qkv_inputs: Optional[Tuple] = None,  # For RetroMAE
+    ) -> Tuple[torch.Tensor, ...]:
+        shape_hd = (self.num_attention_heads, self.attention_head_size)
+        # qkv
+        if self.pack_qkv and qkv_inputs is None:
+            qkv_pack = self.qkv_proj(hidden_states).split(self.all_head_size, dim=-1)
+        else:
+            if qkv_inputs is None:
+                qkv_inputs = (hidden_states, hidden_states, hidden_states)
+            qkv_pack = [
+                getattr(self, n + '_proj')(s) for s, n in zip(qkv_inputs, 'qkv')
+            ]
+        query_states, key_states, value_states = [t.view(t.shape[:-1] + shape_hd) for t in qkv_pack]
+        if self.config.position_embedding_type == 'rope':
+            query_states, key_states = apply_rotary_pos_emb(query_states, key_states, *rope_embeds)
+        dtype = query_states.dtype
+        if self.config.logn_attention_scale and attention_scale is not None:
+            # https://kexue.fm/archives/8823
+            query_states = query_states * attention_scale.to(dtype)
+        if padding_inputs is not None:
+            query_states = pad_input(query_states.squeeze(), *padding_inputs)
+            key_states = pad_input(key_states.squeeze(), *padding_inputs)
+            value_states = pad_input(value_states.squeeze(), *padding_inputs)
+        if self.use_memory_efficient_attention:
+            assert self.memory_efficient_attention is not None, "xformers is not loaded"
+            assert output_attentions is False, "memory_efficient_attention do not output attentions"
+            assert head_mask is None, "Not support yet"
+            attention_probs = None
+            if torch.is_tensor(attention_bias):
+                attention_bias = attention_bias.to(dtype)
+            context_layer = self.memory_efficient_attention(
+                query_states,
+                key_states,
+                value_states,
+                attn_bias=attention_bias,
+                p=self.dropout.p
+            )
+        else:
+            if output_attentions and isinstance(self, NewSdpaAttention):
+                raise RuntimeError("SDPA do not output attentions")
+            context_layer, attention_probs = self._attention(
+                query_states, key_states, value_states, attention_bias, head_mask
+            )
+        if padding_inputs is not None:
+            context_layer = unpad_input(context_layer, indices=padding_inputs[0])
+        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
+        context_layer = context_layer.view(new_context_layer_shape)
+        # output proj
+        attn_output = self.o_proj(context_layer)
+        # add attentions if we output them
+        outputs = (attn_output, attention_probs) if output_attentions else (attn_output,)
+        return outputs
+    def _attention(self, query_states, key_states, value_states, attention_bias, head_mask):
+        """
+        Args:
+            q/k/v: (B, L, n_head, head_dim),
+        Returns:
+            attn_output: (B L, n_head, head_dim)
+        """
+        query_states = query_states.transpose(1, 2)
+        key_states = key_states.transpose(1, 2)
+        value_states = value_states.transpose(1, 2)
+        # Take the dot product between "query" and "key" to get the raw attention scores.
+        attention_scores = torch.matmul(query_states, key_states.transpose(-1, -2))
+        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
+        if attention_bias is not None:
+            # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
+            attention_scores = attention_scores + attention_bias
+        # Normalize the attention scores to probabilities.
+        attention_probs = nn.functional.softmax(attention_scores, dim=-1)
+        # This is actually dropping out entire tokens to attend to, which might
+        # seem a bit unusual, but is taken from the original Transformer paper.
+        if self.dropout.p > 0:
+            attention_probs = self.dropout(attention_probs)
+        # Mask heads if we want to
+        if head_mask is not None:
+            attention_probs = attention_probs * head_mask
+        context_layer = torch.matmul(attention_probs, value_states)
+        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
+        return context_layer, attention_probs
+class NewSdpaAttention(NewAttention):
+    """
+    New attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
+    `NewAttention` as the weights of the module stays untouched. The only changes are on the forward pass to adapt to
+    SDPA API.
+    """
+    def __init__(self, config: NewConfig, **kwargs):
+        super().__init__(config, **kwargs)
+        # torch.backends.cuda.enable_mem_efficient_sdp(False)
+        # logger.warning(
+        #     "Disable memory efficient attention kernel for `NewSdpaAttention`, you can set "
+        #     "`use_memory_efficient_attention=True` if it expected to use."
+        # )
+    def _attention(self, query_states, key_states, value_states, attention_bias, head_mask):
+        attn_output = torch.nn.functional.scaled_dot_product_attention(
+            query_states.transpose(1, 2),
+            key_states.transpose(1, 2),
+            value_states.transpose(1, 2),
+            attn_mask=attention_bias,
+            dropout_p=self.dropout.p if self.training else 0.0,
+        )
+        attn_output = attn_output.permute(0, 2, 1, 3).contiguous()
+        return attn_output, None
+NEW_ATTENTION_CLASSES = {
+    "eager": NewAttention,
+    # "flash_attention_2": ,  # TODO
+    "sdpa": NewSdpaAttention,
+}
+class NewGatedMLP(nn.Module):
+    """
+    GLU Variants Improve Transformer.
+    """
+    def __init__(self, config: NewConfig):
+        super().__init__()
+        self.intermediate_size = config.intermediate_size
+        self.up_gate_proj = nn.Linear(config.hidden_size, self.intermediate_size * 2, bias=False)
+        self.down_proj = nn.Linear(self.intermediate_size, config.hidden_size, bias=True)
+        self.act_fn = ACT2FN[config.hidden_act]
+        if config.hidden_dropout_prob > 0:
+            self.hidden_dropout = nn.Dropout(config.hidden_dropout_prob)
+        else:
+            self.hidden_dropout = None
+    def forward(self, hidden_states):
+        up_gate = self.up_gate_proj(hidden_states)
+        up_states, gate = torch.split(up_gate, self.intermediate_size, dim=-1)
+        gate = self.act_fn(gate)
+        gated_states = gate * up_states
+        if self.hidden_dropout is not None:
+            gated_states = self.hidden_dropout(gated_states)
+        down_states = self.down_proj(gated_states)
+        return down_states
+class NewLayer(nn.Module):
+    def __init__(
+        self,
+        config: NewConfig,
+        pack_qkv=None,
+        use_memory_efficient_attention=None,
+        attn_implementation=None
+    ):
+        super().__init__()
+        if attn_implementation is None:
+            attn_implementation = config._attn_implementation
+        if use_memory_efficient_attention is None:
+            use_memory_efficient_attention = config.use_memory_efficient_attention
+        if use_memory_efficient_attention:
+            if attn_implementation != 'eager':
+                logger.warning_once(f"Override {attn_implementation=} to 'eager' as {use_memory_efficient_attention=}")
+                attn_implementation = 'eager'  # Since it will be SDPA by default for torch>=2.1.1
+        self.attention = NEW_ATTENTION_CLASSES[attn_implementation](
+            config, pack_qkv=pack_qkv, use_memory_efficient_attention=use_memory_efficient_attention
+        )
+        self.mlp = NewGatedMLP(config)
+        ln_class = LAYER_NORM[config.layer_norm_type]
+        self.attn_ln = ln_class(config.hidden_size, eps=config.layer_norm_eps)
+        self.mlp_ln = ln_class(config.hidden_size, eps=config.layer_norm_eps)
+        if config.hidden_dropout_prob > 0:
+            self.hidden_dropout = nn.Dropout(config.hidden_dropout_prob)
+        else:
+            self.hidden_dropout = None
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_bias: torch.FloatTensor,
+        rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
+        padding_inputs: Optional[Tuple] = None,  # indices, batch, seqlen
+        attention_scale: Optional[torch.FloatTensor] = None,
+        subset_indices: Optional[torch.LongTensor] = None,
+        head_mask: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = False,
+        qkv_inputs: Optional[Tuple] = None,  # For RetroMAE
+    ) -> Tuple[torch.Tensor, ...]:
+        # Multi head self attention
+        residual = hidden_states if qkv_inputs is None else qkv_inputs[0]
+        attention_outputs = self.attention(
+            hidden_states,
+            attention_bias,
+            rope_embeds,
+            padding_inputs,
+            attention_scale,
+            head_mask,
+            output_attentions=output_attentions,
+            qkv_inputs=qkv_inputs,
+        )
+        hidden_states = attention_outputs[0]
+        if self.hidden_dropout is not None:
+            hidden_states = self.hidden_dropout(hidden_states)
+        hidden_states = residual + hidden_states
+        # In pretraining, after the attention of last layer, we only need the masked tokens.
+        if subset_indices is not None:
+            hidden_states = hidden_states[subset_indices]
+        hidden_states = self.attn_ln(hidden_states)
+        # Fully Connected
+        residual = hidden_states
+        hidden_states = self.mlp(hidden_states)
+        if self.hidden_dropout is not None:
+            hidden_states = self.hidden_dropout(hidden_states)
+        hidden_states = residual + hidden_states
+        hidden_states = self.mlp_ln(hidden_states)
+        # add self attentions if we output attention weights
+        outputs = (hidden_states,) + attention_outputs[1:]
+        return outputs
+class NewEncoder(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+        self.layer = nn.ModuleList([NewLayer(config) for _ in range(config.num_hidden_layers)])
+        self.gradient_checkpointing = False
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_bias: Optional[torch.FloatTensor] = None,
+        rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
+        padding_inputs: Optional[Tuple] = None,  # indices, batch, seqlen
+        attention_scale: Optional[torch.FloatTensor] = None,
+        subset_indices: Optional[torch.LongTensor] = None,
+        head_mask: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = False,
+        output_hidden_states: Optional[bool] = False,
+        return_dict: Optional[bool] = True,
+    ) -> Union[Tuple[torch.Tensor], BaseModelOutput]:
+        all_hidden_states = () if output_hidden_states else None
+        all_self_attentions = () if output_attentions else None
+        for i, layer_module in enumerate(self.layer):
+            if output_hidden_states:
+                all_hidden_states = all_hidden_states + (hidden_states,)
+            if i >= len(self.layer) - 1:
+                layer_subset_indices = subset_indices
+            else:
+                layer_subset_indices = None
+            layer_head_mask = head_mask[i] if head_mask is not None else None
+            if self.gradient_checkpointing and self.training:
+                layer_outputs = self._gradient_checkpointing_func(
+                    layer_module.__call__,
+                    hidden_states,
+                    attention_bias,
+                    rope_embeds,
+                    padding_inputs,
+                    attention_scale,
+                    layer_subset_indices,
+                    layer_head_mask,
+                )
+            else:
+                layer_outputs = layer_module(
+                    hidden_states,
+                    attention_bias,
+                    rope_embeds,
+                    padding_inputs,
+                    attention_scale,
+                    layer_subset_indices,
+                    layer_head_mask,
+                    output_attentions,
+                )
+            hidden_states = layer_outputs[0]
+            if output_attentions:
+                all_self_attentions = all_self_attentions + (layer_outputs[1],)
+        if output_hidden_states:
+            all_hidden_states = all_hidden_states + (hidden_states,)
+        if not return_dict:
+            return tuple(
+                v
+                for v in [
+                    hidden_states,
+                    all_hidden_states,
+                    all_self_attentions,
+                ]
+                if v is not None
+            )
+        return BaseModelOutput(
+            last_hidden_state=hidden_states,
+            hidden_states=all_hidden_states,
+            attentions=all_self_attentions,
+        )
+# Copied from transformers.models.bert.modeling_bert.BertPooler with Bert->New
+class NewPooler(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+        self.activation = nn.Tanh()
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        # We "pool" the model by simply taking the hidden state corresponding
+        # to the first token.
+        first_token_tensor = hidden_states[:, 0]
+        pooled_output = self.dense(first_token_tensor)
+        pooled_output = self.activation(pooled_output)
+        return pooled_output
+class NewPreTrainedModel(PreTrainedModel):
+    """
+    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
+    models.
+    """
+    config_class = NewConfig
+    base_model_prefix = "new"
+    supports_gradient_checkpointing = True
+    _supports_sdpa = True
+    def _init_weights(self, module):
+        """Initialize the weights"""
+        if isinstance(module, nn.Linear):
+            # Slightly different from the TF version which uses truncated_normal for initialization
+            # cf https://github.com/pytorch/pytorch/pull/5617
+            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+            if module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, nn.Embedding):
+            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+            if module.padding_idx is not None:
+                module.weight.data[module.padding_idx].zero_()
+        elif isinstance(module, nn.LayerNorm):
+            module.bias.data.zero_()
+            module.weight.data.fill_(1.0)
+class NewModel(NewPreTrainedModel):
+    """
+    The bare New Model transformer outputting raw hidden-states without any specific head on top.
+    """
+    def __init__(self, config: NewConfig, add_pooling_layer=False):
+        super().__init__(config)
+        self.config = config
+        self.embeddings = NewEmbeddings(config)
+        self.encoder = NewEncoder(config)
+        self.pooler = NewPooler(config) if add_pooling_layer else None
+        # Initialize weights and apply final processing
+        self.post_init()
+    def get_input_embeddings(self):
+        return self.embeddings.word_embeddings
+    def set_input_embeddings(self, value):
+        self.embeddings.word_embeddings = value
+    def forward(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        length: Optional[List[int]] = None,
+        subset_indices: Optional[torch.LongTensor] = None,
+        token_type_ids: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        unpad_inputs: Optional[bool] = None,
+    ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPooling]:
+        r"""
+        length  (`list` of length `batch_size`, *optional*):
+            If is `None`, return padded `last_hidden_state`.
+        subset_indices  ():
+            pass
+        unpad_inputs  (`bool`, *optional*):
+            pass
+        """
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        unpad_inputs = unpad_inputs if unpad_inputs is not None else self.config.unpad_inputs
+        output_padded = length is None
+        if input_ids is not None and inputs_embeds is not None:
+            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
+        elif input_ids is not None:
+            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
+            input_shape = input_ids.size()
+        elif inputs_embeds is not None:
+            input_shape = inputs_embeds.size()[:-1]
+        else:
+            raise ValueError("You have to specify either input_ids or inputs_embeds")
+        # TODO: not used
+        # # Prepare head mask if needed
+        # # 1.0 in head_mask indicate we keep the head
+        # # attention_probs has shape bsz x n_heads x N x N
+        # # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
+        # # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
+        # head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
+        # Get embeddings, may unpad them
+        (embedding_output, attention_mask, rope_embeds, length) = self.embeddings(
+            unpad_inputs,
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            length=length,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            inputs_embeds=inputs_embeds
+        )
+        batch_size, seq_length = input_shape
+        if unpad_inputs and self.config.use_memory_efficient_attention:
+            attention_bias = xops.fmha.attn_bias.BlockDiagonalMask.from_seqlens(length)
+        else:
+            # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
+            # ourselves in which case we just need to make it broadcastable to all heads.
+            attention_bias = self.get_extended_attention_mask(attention_mask, input_shape)
+            if self.config.use_memory_efficient_attention:
+                # Invalid shape for attention bias: torch.Size([48, 1, 1, 512]) (expected (48, 12, 512, 512))
+                attention_bias = attention_bias.expand(-1, self.config.num_attention_heads, seq_length, -1)
+        padding_inputs = None
+        if unpad_inputs and (output_padded or not self.config.use_memory_efficient_attention):
+            indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
+            if not self.config.use_memory_efficient_attention:
+                padding_inputs = (indices, *input_shape)
+        attention_scale = None
+        if self.config.logn_attention_scale:
+            logger.warning_once("TODO: logn_attention_scale")
+        #     # attention scale log_512(input_len)
+        #     attention_scale = attention_mask.sum(1).log() / torch.tensor(self.config.max_position_embeddings).log()
+        #     # inference-time logn scale need clip 1
+        #     if self.config.logn_attention_clip1:
+        #         attention_scale.clip_(1)
+        #     attention_scale = attention_scale[:, None, None, None]
+        # else:
+        #     attention_scale = None
+        encoder_outputs = self.encoder(
+            embedding_output,
+            attention_bias=attention_bias,
+            rope_embeds=rope_embeds,
+            padding_inputs=padding_inputs,
+            attention_scale=attention_scale,
+            subset_indices=subset_indices,
+            head_mask=head_mask,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+        sequence_output = encoder_outputs[0]
+        if unpad_inputs and output_padded:
+            sequence_output = pad_input(
+                sequence_output.squeeze(), indices, batch_size, seq_length
+            )
+        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
+        if not return_dict:
+            return (sequence_output, pooled_output) + encoder_outputs[1:]
+        return BaseModelOutputWithPooling(
+            last_hidden_state=sequence_output,
+            pooler_output=pooled_output,
+            hidden_states=encoder_outputs.hidden_states,
+            attentions=encoder_outputs.attentions,
+        )
+class NewLMPredictionHead(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+        self.transform_act_fn = ACT2FN[config.hidden_act]
+        self.norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+        # The output weights are the same as the input embeddings, but there is
+        # an output-only bias for each token.
+        self.decoder = nn.Linear(config.hidden_size, config.vocab_size)
+    def forward(self, hidden_states):
+        hidden_states = self.dense(hidden_states)
+        hidden_states = self.transform_act_fn(hidden_states)
+        hidden_states = self.norm(hidden_states)
+        hidden_states = self.decoder(hidden_states)
+        return hidden_states
+class NewForMaskedLM(NewPreTrainedModel):
+    _tied_weights_keys = ["lm_head.decoder.bias", "lm_head.decoder.weight"]
+    def __init__(self, config: NewConfig):
+        super().__init__(config)
+        self.new = NewModel(config, add_pooling_layer=False)
+        self.lm_head = NewLMPredictionHead(config)
+        self.loss_fct = nn.CrossEntropyLoss()
+        # Initialize weights and apply final processing
+        self.post_init()
+    def get_output_embeddings(self):
+        return self.lm_head.decoder
+    def set_output_embeddings(self, new_embeddings):
+        self.lm_head.decoder = new_embeddings
+    def forward(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        token_type_ids: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        labels: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        unpad_inputs: Optional[bool] = None,
+    ) -> Union[Tuple[torch.Tensor], MaskedLMOutput]:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
+            config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are ignored (masked), the
+            loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        if labels is None or not self.new.config.unpad_inputs:
+            length = None
+            subset_indices = None
+        else:
+            length = attention_mask.sum(-1).tolist()
+            labels = labels[attention_mask.bool()].unsqueeze(0)
+            subset_indices = labels > -100
+        outputs = self.new(
+            input_ids,
+            attention_mask=attention_mask,
+            length=length,
+            subset_indices=subset_indices,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            unpad_inputs=unpad_inputs,
+        )
+        sequence_output = outputs[0]
+        prediction_scores = self.lm_head(sequence_output)
+        masked_lm_loss = None
+        if labels is not None:
+            if subset_indices is None:
+                mask = attention_mask.bool()
+                prediction_scores = prediction_scores[mask]
+                labels = labels[mask]
+            else:
+                labels = labels[subset_indices]
+            masked_lm_loss = self.loss_fct(prediction_scores, labels)
+        if not return_dict:
+            output = (prediction_scores,) + outputs[2:]
+            return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output
+        return MaskedLMOutput(
+            loss=masked_lm_loss,
+            logits=prediction_scores,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+class NewForSequenceClassification(NewPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+        self.config = config
+        self.new = NewModel(config, add_pooling_layer=True)
+        classifier_dropout = (
+            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
+        )
+        self.dropout = nn.Dropout(classifier_dropout)
+        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
+        # Initialize weights and apply final processing
+        self.post_init()
+    def forward(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        token_type_ids: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        labels: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        unpad_inputs: Optional[bool] = None,
+    ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
+            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
+            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        outputs = self.new(
+            input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            unpad_inputs=unpad_inputs,
+        )
+        pooled_output = outputs[1]
+        pooled_output = self.dropout(pooled_output)
+        logits = self.classifier(pooled_output)
+        loss = None
+        if labels is not None:
+            if self.config.problem_type is None:
+                if self.num_labels == 1:
+                    self.config.problem_type = "regression"
+                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
+                    self.config.problem_type = "single_label_classification"
+                else:
+                    self.config.problem_type = "multi_label_classification"
+            if self.config.problem_type == "regression":
+                loss_fct = nn.MSELoss()
+                if self.num_labels == 1:
+                    loss = loss_fct(logits.squeeze(), labels.squeeze())
+                else:
+                    loss = loss_fct(logits, labels)
+            elif self.config.problem_type == "single_label_classification":
+                loss_fct = nn.CrossEntropyLoss()
+                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+            elif self.config.problem_type == "multi_label_classification":
+                loss_fct = nn.BCEWithLogitsLoss()
+                loss = loss_fct(logits, labels)
+        if not return_dict:
+            output = (logits,) + outputs[2:]
+            return ((loss,) + output) if loss is not None else output
+        return SequenceClassifierOutput(
+            loss=loss,
+            logits=logits,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+class NewForMultipleChoice(NewPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.new = NewModel(config, add_pooling_layer=True)
+        classifier_dropout = (
+            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
+        )
+        self.dropout = nn.Dropout(classifier_dropout)
+        self.classifier = nn.Linear(config.hidden_size, 1)
+        # Initialize weights and apply final processing
+        self.post_init()
+    def forward(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        token_type_ids: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        labels: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        unpad_inputs: Optional[bool] = None,
+    ) -> Union[Tuple[torch.Tensor], MultipleChoiceModelOutput]:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for computing the multiple choice classification loss. Indices should be in `[0, ...,
+            num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors. (See
+            `input_ids` above)
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
+        input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None
+        attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None
+        token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None
+        position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None
+        inputs_embeds = (
+            inputs_embeds.view(-1, inputs_embeds.size(-2), inputs_embeds.size(-1))
+            if inputs_embeds is not None
+            else None
+        )
+        outputs = self.new(
+            input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            unpad_inputs=unpad_inputs,
+        )
+        pooled_output = outputs[1]
+        pooled_output = self.dropout(pooled_output)
+        logits = self.classifier(pooled_output)
+        reshaped_logits = logits.view(-1, num_choices)
+        loss = None
+        if labels is not None:
+            loss_fct = nn.CrossEntropyLoss()
+            loss = loss_fct(reshaped_logits, labels)
+        if not return_dict:
+            output = (reshaped_logits,) + outputs[2:]
+            return ((loss,) + output) if loss is not None else output
+        return MultipleChoiceModelOutput(
+            loss=loss,
+            logits=reshaped_logits,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+@dataclass
+class NewTokenClassifierOutput(ModelOutput):
+    loss: Optional[torch.FloatTensor] = None
+    logits: torch.FloatTensor = None
+    last_hidden_state: torch.FloatTensor = None
+    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
+    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
+class NewForTokenClassification(NewPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+        self.new = NewModel(config, add_pooling_layer=False)
+        classifier_dropout = (
+            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
+        )
+        self.dropout = nn.Dropout(classifier_dropout)
+        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
+        # Initialize weights and apply final processing
+        self.post_init()
+    def forward(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        token_type_ids: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        labels: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        unpad_inputs: Optional[bool] = None,
+    ) -> Union[Tuple[torch.Tensor], NewTokenClassifierOutput]:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`.
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        outputs = self.new(
+            input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            unpad_inputs=unpad_inputs,
+        )
+        sequence_output = outputs[0]
+        sequence_output = self.dropout(sequence_output)
+        logits = self.classifier(sequence_output)
+        loss = None
+        if labels is not None:
+            loss_fct = nn.CrossEntropyLoss()
+            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+        if not return_dict:
+            output = (logits,) + outputs[2:]
+            return ((loss,) + output) if loss is not None else output
+        return NewTokenClassifierOutput(
+            loss=loss,
+            logits=logits,
+            last_hidden_state=sequence_output,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+class NewForQuestionAnswering(NewPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+        self.new = NewModel(config, add_pooling_layer=False)
+        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)
+        # Initialize weights and apply final processing
+        self.post_init()
+    def forward(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        token_type_ids: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        start_positions: Optional[torch.Tensor] = None,
+        end_positions: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        unpad_inputs: Optional[bool] = None,
+    ) -> Union[Tuple[torch.Tensor], QuestionAnsweringModelOutput]:
+        r"""
+        start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for position (index) of the start of the labelled span for computing the token classification loss.
+            Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
+            are not taken into account for computing the loss.
+        end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for position (index) of the end of the labelled span for computing the token classification loss.
+            Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
+            are not taken into account for computing the loss.
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        outputs = self.new(
+            input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            unpad_inputs=unpad_inputs,
+        )
+        sequence_output = outputs[0]
+        logits = self.qa_outputs(sequence_output)
+        start_logits, end_logits = logits.split(1, dim=-1)
+        start_logits = start_logits.squeeze(-1).contiguous()
+        end_logits = end_logits.squeeze(-1).contiguous()
+        total_loss = None
+        if start_positions is not None and end_positions is not None:
+            # If we are on multi-GPU, split add a dimension
+            if len(start_positions.size()) > 1:
+                start_positions = start_positions.squeeze(-1)
+            if len(end_positions.size()) > 1:
+                end_positions = end_positions.squeeze(-1)
+            # sometimes the start/end positions are outside our model inputs, we ignore these terms
+            ignored_index = start_logits.size(1)
+            start_positions = start_positions.clamp(0, ignored_index)
+            end_positions = end_positions.clamp(0, ignored_index)
+            loss_fct = nn.CrossEntropyLoss(ignore_index=ignored_index)
+            start_loss = loss_fct(start_logits, start_positions)
+            end_loss = loss_fct(end_logits, end_positions)
+            total_loss = (start_loss + end_loss) / 2
+        if not return_dict:
+            output = (start_logits, end_logits) + outputs[2:]
+            return ((total_loss,) + output) if total_loss is not None else output
+        return QuestionAnsweringModelOutput(
+            loss=total_loss,
+            start_logits=start_logits,
+            end_logits=end_logits,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )

modules.json ADDED Viewed

	@@ -0,0 +1,20 @@

+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Pooling",
+    "type": "sentence_transformers.models.Pooling"
+  },
+  {
+    "idx": 2,
+    "name": "2",
+    "path": "2_Normalize",
+    "type": "sentence_transformers.models.Normalize"
+  }
+]

sentence_bert_config.json ADDED Viewed

	@@ -0,0 +1,4 @@

+{
+  "max_seq_length": 512,
+  "do_lower_case": false
+}

special_tokens_map.json ADDED Viewed

	@@ -0,0 +1,51 @@

+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "cls_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<pad>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:883b037111086fd4dfebbbc9b7cee11e1517b5e0c0514879478661440f137085
+size 17082987

tokenizer_config.json ADDED Viewed

	@@ -0,0 +1,55 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "250001": {
+      "content": "<mask>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "<s>",
+  "eos_token": "</s>",
+  "extra_special_tokens": {},
+  "mask_token": "<mask>",
+  "model_max_length": 8192,
+  "pad_token": "<pad>",
+  "sep_token": "</s>",
+  "tokenizer_class": "XLMRobertaTokenizerFast",
+  "unk_token": "<unk>"
+}