New Publications of Linguistic Data Consortium

-  LDC and Oxford University Receive Digging into Data Challenge Grant  -
-  LDC to Close for Winter Break  -
-  Planned Maintenance - Jan 10, 2010  -
New Publications:
LDC2009T29
-  ACL Anthology Reference Corpus  -

LDC2009T30
-  Arabic Gigaword Fourth Edition  -
--------------------------------------------------------------------------------
LDC and Oxford University Receive Digging into Data Challenge Grant
LDC and its research partner, Oxford University, form one of eight international research teams awarded the first Digging into Data Challenge grants, which support projects that promote innovative humanities and social science research using large-scale data analysis. Four leading research agencies sponsor the international competition: the Joint Information Systems Committee (JISC) from the United Kingdom; the National Endowment for the Humanities and the National Science Foundation (NSF) from the United States; and the Social Sciences and Humanities Research Council from Canada.

LDC and Oxford University (with the participation of the British Library) have been funded by NSF and JISC, respectively, for a project entitled “Mining a Year of Speech,” which will focus on creating tools to enable rapid and flexible access to more than 9,000 hours of spoken audio files. Those files contain a wide variety of speech drawn from some of the leading British and American spoken word corpora, allowing for new kinds of linguistic analysis.

Further information about the Digging into Data Challenge can be found on the project website.

LDC to Close for Winter Break

LDC will be closed from Friday, December 25, 2009 through Friday, January 1, 2010 in accordance with the University of Pennsylvania Winter Break Policy.  Our offices will reopen on Monday, January 4, 2010.  Requests received for membership renewals and corpora will be processed at that time.

Best wishes for a happy and safe holiday season!

Planned Maintenance - Jan 10, 2010

Please take note:

As a result of planned electrical maintenance, LDC's website, including the Intranet and catalog, will not be accessible on Sunday, January 10, 2010 from 12 AM EST to approximately 4 AM EST.  We apologize for any inconvenience this will cause.

New Publications

(1)  ACL Anthology Reference Corpus is a digital archive of 10,921 research papers in computational linguistics sponsored by the Association for Computational Linguistics (ACL). Also available from the ACL, this release contains most of the papers that appear up to February 2007 in the web-based ACL Anthology, a dynamic repository that currently hosts over 16,500 articles drawn from a range of conferences and workshops as well as past issues of the Computational Linguistics journal. The ACL Anthology Reference Corpus is designed to be a standard, real-world digital collection testbed for experiments in bibliographic and bibliometric research.

The ACL is the international scientific and professional society for scholars working on problems involving natural language and computation. Membership includes the ACL quarterly journal, Computational Linguistics, reduced registration at most ACL-sponsored conferences, discounts on ACL-sponsored publications and participation in ACL Special Interest Groups. Since 1988, Computational Linguistics has been the primary forum for research on computational linguistics and natural language processing.

The material in the ACL Anthology Reference Corpus was scanned at 600dpi grayscale for archival storage, down-sampled to 300dpi black-and-white, assembled into articles and stored in the PDF Image with Hidden Text format. Author and title metadata was extracted from the OCRed text and used to build HTML index pages. Older materials, such as conference proceedings from the 1960s and early volumes of Computational Linguistics, were manually digitized from microfiche slides.

ACL Anthology Reference Corpus includes:

10,921 PDF files in the pdf/anthology-PDF tree
13,551 files with metadata described in the metadata/anthology-XML tree
84,542 pages in the PDF files

ACL Anthology Reference Corpus is distributed on four DVD-ROMs.

2009 Subscription Members will automatically receive two copies of this corpus.  2009 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for US$75.  ACL Anthology Reference Corpus is made available for research-only use under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 license.

(2)  Arabic Gigaword Fourth Edition is a comprehensive archive of Arabic newswire text that has been acquired over several years at LDC. Arabic Gigaword Fourth Edition includes all of the content of Arabic Gigaword Third Edition (LDC2007T40) as well as newly-collected data. In addition, three new sources have been added in the fourth edition: Al-Ahram, Asharq Al-Awsat and Al-Quds Al-Arabi.

Nine distinct international sources of Arabic newswire are represented here:

Al-Ahram (ahr_arb)
Asharq Al-Awsat (aaw_arb)
Agence France Presse (afp_arb)
Assabah (asb_arb)
Al Hayat (hyt_arb)
An Nahar (nhr_arb)
Al-Quds Al-Arabi (qds_arb)
Ummah Press (umh_arb)
Xinhua News Agency (xin_arb)

The seven-character codes shown above serve as both the names of the directories where the data files are found and the prefix that begins every file name. Each code consists of a three-character source ID and the three-character language code ("arb"), separated by an underscore ("_").
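
The naming scheme can be illustrated with a minimal Python sketch. The source IDs and names are copied from the list above; the helper name split_prefix and the sample file name are hypothetical, since this announcement does not say what follows the seven-character prefix in a real file name.

```python
# Source IDs and names as listed in this announcement.
SOURCES = {
    "ahr": "Al-Ahram",
    "aaw": "Asharq Al-Awsat",
    "afp": "Agence France Presse",
    "asb": "Assabah",
    "hyt": "Al Hayat",
    "nhr": "An Nahar",
    "qds": "Al-Quds Al-Arabi",
    "umh": "Ummah Press",
    "xin": "Xinhua News Agency",
}
LANGUAGE_CODE = "arb"


def split_prefix(file_name):
    """Return (source_id, language_code) taken from the first seven characters.

    Hypothetical helper: it only checks the prefix layout described above
    (three characters, an underscore, three characters).
    """
    prefix = file_name[:7]
    source_id, separator, language = prefix.partition("_")
    if separator != "_" or len(source_id) != 3 or len(language) != 3:
        raise ValueError(f"unexpected prefix layout: {prefix!r}")
    if source_id not in SOURCES or language != LANGUAGE_CODE:
        raise ValueError(f"unknown source or language: {prefix!r}")
    return source_id, language


# "afp_arb_EXAMPLE" is a made-up name used only to exercise the prefix check.
source_id, language = split_prefix("afp_arb_EXAMPLE")
print(SOURCES[source_id], language)
```

Because the same code names the directory and prefixes the file name, one lookup table is enough to map either back to its news service.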

These news services all use Modern Standard Arabic (MSA), so there should be a fairly limited scope for orthographic and lexical variation due to regional Arabic dialects.

New in the Fourth Edition

New Sources
      This release marks the first edition of Arabic Gigaword to include content from Al-Ahram, Asharq Al-Awsat and Al-Quds Al-Arabi, covering the period from November 2006 through December 2008.

New Data for Existing Sources
      This release contains all data collected by LDC from January 2007 through December 2008, except for Ummah Press, for which data from January 2005 through December 2008 is included.

The table below shows data quantity by source under the following categories: data source (Source); the number of files per source (#Files); compressed file size (Gzip-MB); uncompressed file size (Totl-MB); the number of space-separated word tokens in the text (K-words); and the number of documents per source (#DOCs).
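
As a rough illustration of the K-words category, the Python sketch below counts space-separated tokens in a string. It rests on two assumptions that this announcement does not confirm: that "K" means thousands, and that splitting on any run of whitespace is close enough to the counting LDC actually used. The function names are hypothetical.

```python
def count_word_tokens(text):
    """Count tokens separated by whitespace (spaces, tabs, newlines)."""
    return len(text.split())


def to_k_words(token_count):
    """Express a token count in thousands (assumed meaning of K-words)."""
    return token_count / 1000


sample = "These news services all use Modern Standard Arabic"
tokens = count_word_tokens(sample)
print(tokens, to_k_words(tokens))
```

A count of this kind treats punctuation attached to a word as part of the token, so it gives only a coarse measure of corpus size.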

Arabic Gigaword Fourth Edition is distributed on one DVD-ROM.

2009 Subscription Members will automatically receive two copies of this corpus.  2009 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$5000.
--------------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc@ldc.upenn.edu
Philadelphia, PA 19104 USA                   http://www.ldc.upenn.edu
